Allele-Specific Gene Expression Is Widespread Across the Genome and Biological Processes

Allele-specific gene expression (ASGE) appears to be an important factor in human phenotypic variability and, as a consequence, in the development of complex traits and diseases. To study ASGE across the human genome, we coupled genotyping with an analysis of ASGE, screening 11,500 SNPs with the Mapping 10 K Array to identify differential allelic expression. Of the 5,133 SNPs that were suitable for analysis (heterozygous in our sample and expressed in peripheral blood mononuclear cells), 2,934 (57%) had differential allelic expression. Such SNPs were equally distributed along human chromosomes and biological processes. We validated the presence or absence of ASGE in 18 out of 20 randomly selected SNPs (90%) by real-time PCR in 48 human subjects. In addition, we observed that SNPs close to, but not included in, segmental duplications had increased levels of ASGE. Finally, we found that transcripts of unknown function or non-coding RNAs also display ASGE: from a total of 2,308 intronic SNPs, 1,510 (65%) underwent differential allelic expression. In summary, ASGE is a widespread mechanism in the human genome whose regulation seems to be far more complex than expected.


Introduction
Allele-specific gene expression (ASGE), or allelic imbalance, appears to be an important factor in human phenotypic variability and, as a consequence, in the development of common diseases [1]. Traditionally, ASGE has been associated with the phenomena of X-chromosome inactivation and genomic imprinting [2]. However, several recent studies have emphasized the extent to which gene expression varies within and between populations [3,4,5,6], and it is now clear that ASGE is relatively common among non-imprinted autosomal genes [7,8,9]. Furthermore, certain genes display allelic variation in gene expression that is transmitted by Mendelian inheritance, and this variation may be linked to common human disorders [8,10,11].
Variation in gene expression may result from changes in the sequence of regulatory elements, such as single nucleotide polymorphisms (SNPs), and recent surveys indicate that this phenomenon is widespread throughout the genome and across tissues [3,12,13,14]. Such changes may explain 25 to 35% of the interindividual differences in allelic gene expression [15,16]. Hence, identification and characterization of ASGE will help us to appreciate the extent of functionally important regulatory variation. In turn, this will enable us to focus on candidate haplotypes whose allelic differences in expression may provide an important link between individual genetic variation and complex traits or common diseases.
To study ASGE across the human genome, we coupled genotyping with an analysis of allele-specific gene expression, screening 11,560 SNPs with the Mapping 10 K Array (Affymetrix) to identify differential allelic expression. We found that ASGE is very common in the human genome and that it is widespread in many biological processes. We validated our findings using a new cohort of 48 subjects. In addition, we observed that SNPs close to, but not included in, segmental duplications had increased levels of ASGE, and we assessed the effect of copy number variation (CNV) using a 44 K Agilent probe array in the same individuals. Finally, we found that transcripts of non-coding RNA (ncRNA) also display allelic imbalance.

Allele-specific expression screening
Because we screened for ASGE in peripheral blood mononuclear cells (PBMCs), SNPs suitable for analysis had to meet the following criteria: (1) at least one individual was heterozygous for the SNP; and (2) the transcript containing the SNP was expressed in PBMCs. We screened 20 individuals and found at least one heterozygous individual for 10,837 of the 11,560 SNPs in the 10 K Mapping Array. Of these SNPs, 5,133 corresponded to transcripts expressed in PBMCs (see Methods). Thus, these 5,133 SNPs constitute the set analyzed in this study. The allelic expression ratio was considered significant if it was greater than 1.37 or smaller than 0.81, the limits of the 99% confidence interval obtained from gDNA (see Methods). We found that 2,934 out of 5,133 (57%) SNPs were subject to allelic imbalance: 2,235 SNPs (76%) had a ratio between the expression intensities of the two alleles lower than twofold, 476 SNPs (16%) displayed a ratio between two- and threefold, and for 223 SNPs (8%) the ratio was greater than a threefold excess for at least one individual. In contrast, 2,199 SNPs (43%) did not display significantly different levels of expression between the two alleles. A complete list of all the SNPs studied and their characteristics can be found in Supplementary Table S1.
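The classification described above can be sketched in a few lines of Python. This is only an illustration: the thresholds 0.81 and 1.37 come from the text, whereas the function names and the choice to express imbalance as a symmetric fold excess (so that a ratio of 0.5 counts as twofold) are assumptions made for the example.

```python
# Limits of the 99% interval of the gDNA allele ratio, as reported in the text.
LOWER, UPPER = 0.81, 1.37


def has_asge(ratio):
    """True if an allele 1 / allele 2 expression ratio lies outside the interval."""
    return ratio < LOWER or ratio > UPPER


def fold_excess(ratio):
    """Imbalance as a fold excess >= 1, whichever allele is higher (assumption)."""
    return ratio if ratio >= 1.0 else 1.0 / ratio


def classify(ratio):
    """Assign a ratio to one of the categories used in the text."""
    if not has_asge(ratio):
        return "balanced"
    fold = fold_excess(ratio)
    if fold < 2.0:
        return "below twofold"
    if fold <= 3.0:
        return "two- to threefold"
    return "above threefold"
```

Applied to every heterozygous SNP, counting the labels returned by `classify` would give the kind of breakdown reported above.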
As predicted, in females the percentage of SNPs with differentially expressed alleles on the X chromosome was significantly higher (Chi-square test, p < 0.001) than on autosomes, since genes subject to X-chromosome inactivation are expected to display skewed allelic expression [17]. In addition, we were able to identify five known imprinted genes that met the criteria established for the analysis of allelic expression: KCNQ1, MEG3, PPP1R9A, SLC22A3 and SLC22A23. We confirmed ASGE for the first four (Table 1).
To validate the results of the Mapping 10 K Array experiments, we performed allele-specific quantitative PCR for 20 randomly selected SNPs in 48 new subjects. The real-time quantitative PCR results validated the screening, since we confirmed allele-specific or non-allele-specific expression for 18 of these 20 SNPs (90%; Table 2), suggesting a low false discovery rate. These results validate our experimental method as well as our sample handling and processing.

Allele-specific expression is widespread across the human genome and in different biological processes
We mapped the SNPs that displayed ASGE to chromosomes in order to look for regions in the human genome with a higher density of such SNPs (Table 3 and Figure 1). When the SNP distribution in each chromosome was analyzed, we found that an average of 57% of the SNPs per chromosome displayed ASGE, the same percentage as for the overall genome, with no chromosome deviating significantly from this percentage. This is further evidence that ASGE is widespread across the human genome. Furthermore, the ''SNP proximity'' test (see Methods) was used to search for clusters of SNPs with differentially expressed alleles. As a result, we found a total of 133 clusters dispersed throughout the genome, with a median of 4 SNPs per cluster (range 1-36). The location, length and p-value of each cluster can be found in Supplementary Table S2. The subset of 5,133 SNPs studied corresponded to a total of 1,632 known genes, 1,195 of which (73%) displayed allelic imbalance in at least one of their SNPs. To assess whether ASGE is more influential in any given biological process, we compared the distribution of genes that did or did not display differential allelic expression in the Gene Ontology (GO) database (www.geneontology.org). The comparison between the distributions of genes among different biological processes (GO terms present in levels 3 to 9) did not reveal any significant differences (Supplementary Table S3). Thus, the genes subject to differential allelic expression appear to participate in a wide range of biological processes.

Allele-specific gene expression in ncRNA
We also focused our analysis on the recently described transcripts of ncRNA, previously named transcripts of unknown function, which introduce more complex strategies for transcriptional regulation than previously anticipated [18,19]. Eukaryotic genes contain clearly identifiable open reading frames (ORFs) that direct the translation of functional proteins. However, not all RNA transcripts (other than tRNA, rRNA or snRNA) are translated into polypeptides. Many non-translatable mRNA-like RNA transcripts have been found in the cell. They are polyadenylated and spliced, and they lack long ORFs [20]. In this work, ncRNA are defined as non-coding polyadenylated RNAs that are transcribed but for which there is no functional information. Like eukaryotic messenger RNA, ncRNA contain poly-A tails and thus they are represented among cDNAs synthesised from mature RNAs using an oligo(dT) primer (see Methods). With this strategy, we measure allele-specific expression only of exonic SNPs, because other RNA molecules, such as immature RNAs that contain introns, are not represented. Surprisingly, in the subset of the 5,133 heterozygous SNPs expressed in PBMCs, a total of 2,311 (45%) and 2,455 (48%) SNPs were annotated as intronic and intergenic, respectively (Table 4). Of the intronic SNPs, 1,511 (65%) underwent differential allelic expression, as did 1,190 (48%) of the intergenic SNPs. This result is consistent with the high levels of unannotated transcription detected [18,19], and it also shows that, like known genes, ncRNA display ASGE. We validated ASGE in intronic and intergenic SNPs by real-time quantitative PCR in 18

Allele-specific expression is dependent on regulatory effects associated to segmental duplications
A potential cause of ASGE might be the considerable variation in gene copy number in the human genome [21]. If individuals carry different numbers of copies of a given duplicon on homologous chromosomes, that is, if they are heterozygous for structural variants and these structural variants contain genes, certain alleles may appear to be differentially expressed. This differential expression may arise simply because the alleles are present in different copy numbers on the two chromosomes, and not through any regulatory effect. To test this hypothesis, we examined whether the location of a SNP within known structural variants (SDs or CNVs) affects the probability that it is differentially expressed. As a first test, we used the locations of SDs that can be found in public databases (see Methods). Only 106 of the SNPs in our study mapped within known segmental duplication regions, and a test showed that SNPs presenting ASGE are not more likely to be located inside SDs than SNPs without ASGE (Table 5), although the lack of statistical significance may be an effect of the small sample size, as discussed below.
Another potential cause of ASGE is heterozygosity among cis-regulatory elements. In particular, SDs may contain cis-regulatory elements that affect the expression of nearby genes. If this were a frequent phenomenon, allelic variation in gene expression should be more frequent in single-copy regions located in the vicinity of SDs than in single-copy regions far away from duplicons. Using the same dataset as above, we observed that SNPs with ASGE are more frequent near SDs. This enrichment in SNPs with ASGE is especially strong in the 10 Kb windows around SD regions (Table 5). The effect decreases in more distant (non-overlapping) windows.
As a second series of tests, we computed the average ratios of allele expression instead of the proportion of SNPs presenting ASGE. Results are presented in Table 6. Interestingly, although the proportions of SNPs with and without ASGE located inside SDs were not significantly different, SNPs inside SDs have on average a higher ratio of allele expression. This effect, again, decreases with distance to the SD. This means that the closer a SNP is to a segmental duplication, the stronger the degree of allele-specific gene expression, probably because of regulatory effects attributable to SDs. The maximum effect is observed for SNPs located within SDs, which are probably present in different copy numbers on different chromosomes.
We then used information about known Copy Number Variants (CNV; dbCNV database, see Methods). Similar analyses gave consistent, although slightly different, results. In particular, there are more SNPs with ASGE inside CNVs (Table 7), but these SNPs do not present a higher ratio of allele expression (Table 8). Unlike SNPs near SDs, SNPs within 10 Kb of a CNV do not show any significant effect, probably because of the small sample size, since the effect is stronger in the 100 Kb window.
Because there is high inter-individual variability in CNV and SD content [22], it is possible that some genome regions that contain CNVs or SDs in public databases are in fact single-copy in the individuals included in our study. To try to overcome this problem, we studied CNVs in the tested individuals. We used the Agilent 44 K array to create the indCNV dataset, in which individual CNV patterns can be associated with the corresponding individual ASGE patterns (see Methods). In this analysis, we recovered the same trend that we reported above. In Table 9, we see that there are more SNPs with ASGE close to the CNVs detected in the studied individuals than far away from these CNVs. However, this effect is not significant. Since the number of CNVs detected in each individual is much lower than the total number of SDs and CNVs present in databases, we suggest that the sample size, and thus the statistical power, in the vicinity of CNVs is very small.

Discussion
In this study, we have used high-throughput screening of 11,500 SNPs to detect ASGE across the human genome. Our study indicates that allelic variation in gene expression is widespread across the human genome and in different biological processes, including systems of transcriptional regulation. We found that 57% of the human SNPs studied here undergo allelic imbalance and that these SNPs are distributed proportionally among chromosomes. Likewise, among different biological processes we did not find any difference between the distribution of genes that displayed differential allelic expression and that of genes that did not. ASGE is also present in several tissues [21], indicating that this is a common mechanism of genomic regulation for many pathways and cell types. ASGE is also implicated in a variety of disease states [23].
Thus, an interesting question that arises is how this ASGE is controlled. A potential cause is the variation in copy number within the human genome [21]. To test this possibility, we examined whether the probability of a SNP undergoing differential allelic expression changes if the SNP is located within SDs or CNVs described in public databases. We found that only a small percentage of the SNPs displaying differential allelic expression were included in these structural variants. However, the proportion of SNPs with differentially expressed alleles was higher among SNPs close to SDs, or within CNVs, suggesting that regulatory elements may lie within these genomic duplications. Moreover, SNPs inside SDs have a higher ratio of ASGE, suggesting regulatory effects linked to SDs. Therefore, the relationship between structural variation and ASGE, detected both as variation in intensity and as presence or absence of allelic expression, seems to be rather complex. We also studied the contribution of individual CNVs to allelic imbalance, but the results of the analysis were not conclusive, probably due to a lack of statistical power. Overall, the results from the analysis of known SDs and CNVs are largely consistent. Still, it may be surprising that the association with ASGE is not identical for both databases. One must take into account that the two datasets have different properties. SDs are computationally defined from the Reference Human Genome Assembly, whereas CNVs are structural variants detected by experimental hybridizations. This means that the two datasets are similar, but certainly not identical. In addition, it must be considered that the individuals in this study are likely to present a set of SDs and CNVs that only partially overlaps with those present in public databases.
Consequently, in the tests conducted with information from public databases, we may have mislabeled some SNPs, because a large number of structural variations described in dbSD or dbCNV may not be present in the 20 individuals studied here. Finally, we completed our study by focusing on the implications of ASGE among ncRNA [18,19]. Thus, our current understanding of the repertoire of transcripts produced from the human genome is still evolving, further demonstrating the complexity of the transcriptome. Indeed, the organization and structure of the genome has potentially important implications for the regulation of transcription and the possible interpretation of the naturally occurring genetic variation in humans [24]. We found that ncRNA display ASGE similar to that of known genes revealing even more complexity in the system that regulates transcription.
In summary, ASGE is widespread across the human genome and it participates in all biological processes, especially in the regulation of gene expression in the immune system. If ASGE has important implications in the genotype to phenotype relations and in the regulation of complex interlaced transcriptional patterns, its identification and characterization will provide a better understanding of the complexities of transcription regulation. Furthermore, such knowledge should allow us to focus on haplotypes with allelic differences in expression that may be linked to complex traits and common diseases.

Subjects
A total of 68 healthy Caucasian individuals were recruited to this study. All of them were of Southern-European origin, which minimizes differences in population structure [25,26]. Twenty of them, 12 male and 8 female, were used for the array screening assays and the rest for the real-time quantitative PCR validation. The study was approved by the local ethical committee (IRB) and all subjects provided written informed consent.

RNA and DNA purification and cDNA synthesis
Total RNA was extracted from peripheral blood mononuclear cells (PBMCs). PBMCs were isolated from heparinized blood by density gradient centrifugation using Ficoll-Paque (Pharmacia Biotech). PBMCs were immediately submerged in the RNAlater RNA Stabilization Reagent (Qiagen) to preserve their gene expression patterns and total RNA was isolated using the RNeasy Mini Kit (Qiagen). During RNA purification, DNA was removed with a DNase treatment using the RNase-Free DNase Set (Qiagen). Genomic DNA (gDNA) was isolated from granulocytes obtained after density gradient centrifugation using the QIAamp DNA Mini Kit (Qiagen). Synthesis of cDNA for the array screening assays was performed on 2 µg of total RNA using a T7oligo dT12-18 primer (Amersham Pharmacia) and it was purified using phenol:chloroform:isoamyl alcohol and NH4Ac precipitation. The cDNA pellet was resuspended in 5 µl of reduced EDTA TE buffer (10 mM Tris HCl, pH 8.0, 0.1 mM EDTA, pH 8.0). Synthesis of cDNA for the real-time quantitative PCR validation was performed using the High-Capacity cDNA Archive Kit (Applied Biosystems).

Mapping 10 K Array experiments
Genotyping and allele-specific gene expression were assessed using GeneChip Mapping 10 K Arrays (Affymetrix) according to the manufacturer's instructions, using either 250 ng of gDNA or 250 ng of cDNA as the starting material. Allele calling was performed with the GeneChip DNA Analysis Software 2.0 (Affymetrix).

Table 9. Details of the chi-square tables and P-values for the indCNV database.

Allele expression data
A computational analysis of allele-specific gene expression was carried out as previously described [9]. Briefly, to be included in our analysis, each SNP had to meet the following criteria: (1) at least one of the twenty individuals had to be heterozygous for the SNP; and (2) the transcript containing the SNP had to be expressed in PBMCs. For the heterozygous SNPs, the intensity values for each probe were extracted from the CEL files generated. The value for each probe pair was calculated by subtracting the mismatch (MM) intensity from the perfect match (PM) intensity. A t test was used to calculate a p-value for the presence of signal for each allele of each SNP (an intensity greater than zero indicated that expression was detected). The manufacturer defines a mini-block as a group of four probes that includes a PM and an MM probe for allele 1, and a PM and an MM probe for allele 2. The Mapping 10 K Array contains ten mini-blocks per SNP, five of which correspond to the forward strand and the other five to the reverse strand. A signal was considered present if at least one allele gave a signal (p < 0.01, t test) on either strand. If a signal was only present on one strand, the allele fraction (the ratio of expression of the two alleles) was calculated only with the mini-blocks of the corresponding strand. Thus, we quantified the ratio of expression of the two alleles for the heterozygous SNPs present in transcripts expressed in PBMCs. To obtain a statistical measure, the 99% confidence interval for the allele ratio of gDNA (equivalent to equal expression of the two alleles) was calculated using all the heterozygous SNPs in the 20 individuals. The resulting interval ranged from 0.81 to 1.37. A SNP was considered to show differential allelic gene expression if the ratio of allele 1 to allele 2 fell outside this confidence interval.
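The probe-level arithmetic described above can be illustrated with a short Python sketch. It is not the published pipeline, which is described in reference [9]: the function names are hypothetical, the t statistic shown is a plain one-sample statistic against zero, and taking the ratio of the mean PM-MM values of the two alleles is an assumption about how the mini-blocks are summarized.

```python
from math import sqrt
from statistics import mean, stdev


def probe_pair_values(pm, mm):
    """Perfect-match minus mismatch intensity for each probe pair."""
    return [p - m for p, m in zip(pm, mm)]


def t_statistic(values):
    """One-sample t statistic for the hypothesis that the mean signal is zero."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))


def allele_ratio(allele1_values, allele2_values):
    """Ratio of mean background-corrected signal, allele 1 over allele 2.

    Assumption: mini-blocks are summarized by a simple mean per allele.
    """
    return mean(allele1_values) / mean(allele2_values)
```

For a SNP with signal on one strand only, the same functions would be applied to the five mini-blocks of that strand instead of all ten.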

The distribution of allele-specific expression across the genome
DNA-Chip Analyzer 2004 software [27] was used to map SNPs to chromosomes and to look for clusters of SNPs with differentially expressed alleles. To assess the significance of ''SNP proximity'', p-values were calculated for all the stretches containing ≤20 SNPs with differential allelic expression. Stretches of differentially expressed SNPs were considered significant when the p-value was < 0.00005.
[A paragraph has been removed from here]

Data about Segmental Duplication and Copy Number Variations obtained from public databases
Copy Number Variation (CNV) and Segmental Duplication (SD) data were obtained from publicly available databases and were divided into two categories. The first one, which we called ''dbCNV'', was obtained from http://projects.tcag.ca/variation/, and the second one, called ''dbSD'', was downloaded from http://eichlerlab.gs.washington.edu/database.html. For dbCNV, only the studies based on large samples obtained from the general population were included, to avoid biasing our database towards rare or disease-related variants. All coordinates were from build35 (hg17). For each database independently, we first filtered out duplicates and then concatenated overlapping segments in order to form a list of unique, mutually exclusive coordinate pairs representing regions with SDs and/or CNVs. After these changes, the CNV database contained 3,272 regions, distributed all over the genome. The size of the CNVs in this database ranged from 1,032 bp to 7,486,165 bp, with a mean size of 193,588 bp and a standard deviation of 394,728 bp. After the same modifications, the SD database contained 8,096 different duplications, with sizes ranging from 999 bp to 875,877 bp, a mean size of 15,990 bp and a standard deviation of 44,422 bp.
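The deduplication and concatenation step can be sketched as a standard interval merge. The snippet below is a minimal illustration, assuming each segment is a hypothetical `(chromosome, start, end)` tuple with closed coordinates, so that segments that merely touch are also joined; the original scripts may have handled boundaries differently.

```python
def merge_regions(regions):
    """Collapse duplicate and overlapping (chrom, start, end) segments."""
    merged = []
    # set() drops exact duplicates; sorting groups segments by chromosome and start.
    for chrom, start, end in sorted(set(regions)):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

The output is a list of non-overlapping regions per chromosome, which is the form needed to assign each SNP to a single nearest region.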

CNV detection in the samples
To detect CNVs in the samples, two technologies were used. One was the Affymetrix GeneChip® Human Mapping 10 K Array, covering 10,136 SNPs, which had been used for the rest of the analysis as explained above. The other was the Agilent G4410B array, a commercially available 60-mer oligonucleotide microarray for CGH, with probes located in coding and noncoding sequences at an average spatial resolution of 35 kb, of which 44,887 probes were analyzed. For this second array, we hybridized the samples following the manufacturer's protocol (v2), in dye-swap experiments against a reference pool of the same gender. Each reference sample consisted of a pool of 50 normal individuals of the same gender. In brief, 1000 ng of DNA was digested with 5 units of Alu I and Afa I (GE Healthcare) for 2 hours at 37 °C. After inactivating the enzymes for 20 minutes at 65 °C, the DNA was labeled using the Bioprime arrayCGH Labeling kit (Invitrogen). 20 µl of 2.5× Random Primer solution was added and incubated for 5 min at 95 °C followed by 5 min on ice. Then 5 µl of 10× dNTP mix was added, as well as 3 µl of 1 mM dUTP-Cy3 or dUTP-Cy5 (GE Healthcare) and 40 U of Klenow fragment (Invitrogen). The reaction was incubated for 2 hours at 37 °C and was cleaned up using Microcon YM-30 (Millipore). 1.5 µl of the labeled DNA was used to check the incorporation of fluorescent nucleotides using a Nanodrop instrument. Then the test sample and reference were mixed together with 50 µl of 10× Blocking Agent (Agilent), 50 µg of human Cot-1 (Roche) and 250 µl of 2× Hybridization buffer (Agilent). A denaturation step was performed for 3 min at 95 °C, followed by an incubation of 30 min at 37 °C before hybridization. Arrays were hybridized for 40 at 65 °C in a hybridization oven rotating at 10 rpm.
Arrays were washed for 5 min in oligo aCGH wash buffer 1 (Agilent) at RT, 1 min in oligo aCGH wash buffer 2 (Agilent) at 37 °C, 30 sec in acetonitrile (Sigma) at RT and 30 sec in stabilizing and drying solution (Agilent) at RT to prevent ozone degradation. All washes were performed with agitation using a magnetic stirrer.
Arrays were scanned using an Agilent G2565BA MicroArray Scanner System (Agilent Inc., Palo Alto, CA) and the acquired images were analyzed using GenePix Pro 6.0 software (Axon, Molecular Devices) with the irregular feature finding option. Extracted raw data were filtered and Loess normalized using Bacanal (Lozano et al., unpublished), an in-house web server implementation of the Limma package developed within the Bioconductor project in the R statistical programming environment.

CNV-detection algorithm
The data were analyzed with the R software [28]. Data obtained from the two technologies (10 K and 44 K arrays) were analyzed separately. In both cases, the standard deviation of the mean log2 values for autosomes was calculated for all the individuals and the distribution of these values was plotted. Individuals for which the standard deviation was below 0.18 and for which the distribution of the log2 values was symmetrical with respect to 0 were selected for a first analysis. In this first analysis, we wrote an R script that looks for two consecutive clones, or 3 out of 4 consecutive clones, with log2 ratios above a multiple of the standard deviation of the whole individual. Then, the script checks for consistency with the dye-swap data and only keeps the regions that are also called in the dye-swap experiment.
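The calling rule of this first analysis can be sketched as follows. The original script was written in R and is not reproduced here; this Python version is only an illustration. The multiple of the standard deviation (`k`) is not given in the text and is a hypothetical parameter, and the sketch assumes that a gain in the forward hybridization appears with the opposite sign in the dye-swap hybridization.

```python
from statistics import pstdev


def call_probes(flags):
    """Return indices of flagged probes that satisfy either calling rule."""
    called = set()
    n = len(flags)
    # Rule 1: two consecutive flagged probes.
    for i in range(n - 1):
        if flags[i] and flags[i + 1]:
            called.update((i, i + 1))
    # Rule 2: at least 3 flagged probes within any 4 consecutive probes.
    for i in range(n - 3):
        window = range(i, i + 4)
        if sum(flags[j] for j in window) >= 3:
            called.update(j for j in window if flags[j])
    return called


def call_direction(log2_ratios, k, sign):
    """Call gains (sign=+1) or losses (sign=-1) beyond k standard deviations."""
    threshold = k * pstdev(log2_ratios)  # SD of the whole individual
    flags = [sign * x > threshold for x in log2_ratios]
    return call_probes(flags)


def consistent_gains(forward, swapped, k=2.0):
    """Keep gains that reappear as losses in the dye-swap hybridization."""
    return call_direction(forward, k, +1) & call_direction(swapped, k, -1)
```

Losses would be handled symmetrically by exchanging the two signs in `consistent_gains`.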
This approach is very sensitive to the quality of the data and is not suited to cases where the data are dispersed or show local trends. This is why only the data of 18 individuals obtained with the 44 K array were analyzed with the first method. R functions designed to deal with these problems were applied to the part of the dataset that did not match the criteria for the first analysis. The second analysis was applied to all the individuals (including those that were analyzed in the first analysis), and was done as follows. First, the normalized log ratios from each dye-swap were averaged before the analysis. Then, a denoising step was applied using the method described in Hsu et al. [29], using the Haar wavelet family and the SURE estimator for thresholding, with J0 (the level up to which the wavelet coefficients are subject to thresholding) equal to 4. The wavelet decomposition and reconstruction functions were from the WAVESLIM package. Finally, the Circular Binary Segmentation algorithm described in Olshen et al. [30] was used for clone calling. For this purpose, the functions CNA, smooth.CNA and segment, available in the DNAcopy package, were used with default parameter values. Data from the two analyses were combined. The second analysis method is, overall, more conservative, but it can rescue some regions that would not pass the very strict threshold based on the standard deviation. Consequently, the overlap between the regions called with the two methods is large, but not complete. This is why the clones that were called by only one method were manually checked in the data file for validity of signal. At the end of the BAC call process, regions of more than 1 Mb were removed from the list. The final list consisted of 294 calls (an average of 14.7 per individual). The data obtained through this process are individual CNV coordinates and we subsequently refer to them as ''indCNV''.

Allele-specific gene expression analysis
The most characteristic GO term for each cluster was assigned using FatiGO [31]. The list of imprinted genes was obtained from the Genomic Imprinting Database (http://www.geneimprint.com/).

Real-time quantitative PCR validation
Quantitative real-time PCR analysis was performed with a DNA Engine Opticon2 (MJ Research). Primer sequences and target-specific fluorescently labeled TaqMan probes used for both genotyping and allele-specific gene expression were purchased from Applied Biosystems (TaqMan SNP Genotyping Assays). PCR reactions were prepared following the manufacturer's protocol. Genotype calls were acquired with Opticon Monitor 2.01 software (MJ Research) and allele-specific gene expression was measured as described previously [9]. In short, 48 individuals were genotyped for each SNP by real-time PCR. We selected eight heterozygous individuals for each SNP to validate allelic expression. Allele-specific gene expression was measured in these 8 individuals by real-time PCR. A standard curve (linear regression line) was generated for each SNP by mixing gDNAs from two homozygous individuals, one for each genotype, at ratios of 8:1, 4:1, 2:1, 1:1, 1:2, 1:4 and 1:8. To check that standard curves were generated with truly homozygous individuals, four of them were sequenced for three SNPs, confirming the homozygosity. Each sample was run in triplicate, and cycle threshold (c(t)) values were obtained with Opticon Monitor 2.01 software. Using this information, we subtracted the baseline signal as the lowest fluorescent signal measured, and we set the c(t) line to a standard deviation of 0.1. The log of the FAM mean c(t)/VIC mean c(t) values was plotted against the log of the gDNA ratio. The linear graphs obtained (correlation coefficients > 0.9) were used to calculate the corresponding allele-specific gene expression. The 99% confidence interval for the allele-specific gene expression (equivalent to equal expression) was generated from heterozygous DNAs. SNPs with allele-specific gene expression outside the corresponding confidence interval were considered significant.
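As an illustration of how such a standard curve can be used, the sketch below fits a least-squares line to the log signal ratio against the log2 of the known gDNA mixing ratio, and then inverts the line to estimate the allele ratio of a cDNA sample. The function names are hypothetical, and the use of base-2 logarithms for the mixing ratio is an assumption; the text only states that logs were plotted.

```python
from math import log2


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_curve(mix_ratios, log_signal_ratios):
    """Regress the log FAM/VIC signal on log2 of the known gDNA mixing ratio."""
    return fit_line([log2(r) for r in mix_ratios], log_signal_ratios)


def estimate_allele_ratio(log_signal_ratio, slope, intercept):
    """Invert the standard curve to recover the allele ratio of a sample."""
    return 2 ** ((log_signal_ratio - intercept) / slope)
```

With the seven mixing ratios listed above as `mix_ratios`, the fitted slope and intercept are reused for every cDNA sample measured for that SNP.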

Statistical tests
On all the datasets (dbCNV, dbSD and indCNV), we tested the association between SNPs with significant ASGE and proximity to a SD/CNV by means of Chi-square tests. For this purpose, we wrote a series of PHP scripts to check whether each SNP was located within a SD/CNV, between 1 bp and 10 Kb upstream or downstream of a SD/CNV, between 10 Kb and 100 Kb upstream or downstream of a SD/CNV, or between 100 Kb and 1 Mb upstream or downstream of a SD/CNV. When a SNP could belong to two different SD/CNVs, we always considered only the shortest distance. Therefore, no SNP could belong to two SDs or two CNVs, or be included in two different distance categories for the same SD or CNV.
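The assignment of a SNP to a distance window can be sketched as follows. The original scripts were written in PHP; this Python version is only an illustration, with hypothetical names, and it assumes closed coordinates and that each upper window limit is inclusive.

```python
# Upper distance limit (bp) and label of each window, nearest first.
WINDOWS = [
    (0, "inside"),
    (10_000, "1 bp-10 Kb"),
    (100_000, "10 Kb-100 Kb"),
    (1_000_000, "100 Kb-1 Mb"),
]


def distance_to_region(pos, start, end):
    """Distance in bp from a SNP to a region; 0 if the SNP lies inside it."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end


def window_for_snp(pos, regions):
    """Label a SNP using only the shortest distance to any region.

    regions: (start, end) pairs on the same chromosome as the SNP.
    """
    if not regions:
        return "beyond 1 Mb"
    shortest = min(distance_to_region(pos, s, e) for s, e in regions)
    for limit, label in WINDOWS:
        if shortest <= limit:
            return label
    return "beyond 1 Mb"
```

Because only the shortest distance is used, each SNP receives exactly one label, which keeps the cells of the contingency tables mutually exclusive.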
For data obtained from public databases, we crossed the positions of the dbSD and dbCNV regions with those of the 5,133 SNPs that were heterozygous in at least one individual. For the individual data, instead of using the 5,133 SNPs, we used only those that were heterozygous in the individual being tested, and instead of using public databases, we used the results of the 10 K and 44 K hybridization data. That is, we crossed the indCNV information for a given individual with that individual's own heterozygous SNPs. In this analysis, we therefore generated 20 tables for each of the distance windows considered. Because very few indCNVs were detected in the individual analysis, each contingency table had a small sample size. To overcome this problem, we performed single Chi-square tests on synthetic tables built for each distance window, in which the number in each cell represented the cumulative count over the 20 individuals for the corresponding category, previously obtained in the 20 individual tables. In the analysis of indCNV, in order to overcome problems of sample size, we also considered a distance window that included all SNPs from inside a CNV up to 100 Kb upstream and downstream of it.
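The pooling of the 20 individual tables into one synthetic table, followed by a single Chi-square test, can be sketched as below. This is an assumption-laden illustration: each table is taken to be a 2x2 layout (ASGE or not, inside or outside the window), no continuity correction is applied, and the p-value uses the identity that the upper tail of a Chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def pool_tables(tables):
    """Sum the cell counts of per-individual 2x2 tables [[a, b], [c, d]]."""
    pooled = [[0, 0], [0, 0]]
    for table in tables:
        for i in range(2):
            for j in range(2):
                pooled[i][j] += table[i][j]
    return pooled


def chi_square_2x2(table):
    """Pearson Chi-square statistic and p-value (1 df) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value
```

One pooled table is built per distance window, so the number of tests equals the number of windows rather than 20 times that number.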
In addition to Chi-square tests, we also performed permutation tests for each distance window, to assess whether the ratio of allele expression was the same inside and outside the distance considered.
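A permutation test of this kind can be sketched as follows. The test statistic is not specified in the text, so the absolute difference between the mean allele expression ratios of the two groups is an assumption, as are the function name, the number of permutations and the fixed random seed.

```python
import random


def permutation_test(inside, outside, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in mean allele ratio."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(inside) - mean(outside))
    pooled = list(inside) + list(outside)
    k = len(inside)
    hits = 0
    for _ in range(n_perm):
        # Reassign the window labels at random, keeping group sizes fixed.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Here `inside` would hold the ratios of SNPs within a given distance window and `outside` the ratios of all remaining SNPs.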