Intronic CNVs cause gene expression variation in human populations

Introns can be extraordinarily large and they account for the majority of the DNA sequence in human genes. However, little is known about their population patterns of structural variation and their functional implication. By combining the most extensive maps of CNVs in human populations, we have found that intronic losses are the most frequent copy number variants (CNVs) in protein-coding genes in human, with more than 12,986 intronic deletions, affecting 4,147 genes (including 1,154 essential genes and 1,638 disease-related genes). These intronic length variation results in dozens of genes showing extreme population variability in size, with 40 genes with 10 or more different sizes and up to 150 allelic sizes. Intronic losses are frequent in evolutionarily ancient genes that are highly conserved at the protein sequence level. This result contrasts with losses overlapping exons, which are observed less often than expected by chance and almost exclusively affect primate-specific genes. An integrated analysis of CNVs and RNA-seq data showed that intronic loss can be associated with significant differences in gene expression levels in the population (CNV-eQTLs). These intronic CNV-eQTLs regions are enriched for intronic enhancers and can be associated with expression differences of other genes showing long distance intron-promoter 3D interactions. Our data suggests that intronic structural variation of protein-coding genes can exert an important role in maintaining the variability of gene expression and splicing in human populations. Author Summary Most human genes have introns that have to be removed after a gene is transcribed from DNA to RNA because they not encode information to translate RNA into proteins. As mutations in introns do not affect protein sequences, they are usually ignored when looking for normal or pathogenic genomic variation. However, introns comprise about half of the human non-coding genome and they can have important regulatory roles. We show that deletions of intronic regions appear more frequent than previously expected in the healthy population, with a significant proportion of genes with evolutionary ancient and essential functions carrying them. This finding was very surprising, as ancient genes tend to have high conservation of their coding sequence. However, we show that deletions of their non-coding intronic sequence can produce considerable changes in their locus length. We found that a significant number of these intronic deletions are associated with under- or over-expression of the affected genes or distant genes interacting in 3D. Our data suggests that the frequent gene length variation in protein-coding genes resulting from intronic CNVs might influence their regulation in different individuals.

Most human genes have introns that have to be removed after a gene is transcribed from DNA to RNA because they not encode information to translate RNA into proteins. As mutations in introns do not affect protein sequences, they are usually ignored when looking for normal or pathogenic genomic variation. However, introns comprise about half of the human non-coding genome and they can have important regulatory roles. We show that deletions of intronic regions appear more frequent than previously expected in the healthy population, with a significant proportion of genes with evolutionary ancient and essential functions carrying them. This finding was very surprising, as ancient genes tend to have high conservation of their coding sequence. However, we show that deletions of their non-coding intronic sequence can produce considerable changes in their locus length. We found that a significant number of these intronic deletions are associated with under-or over-expression of the affected genes or distant genes interacting in 3D. Our data suggests that the frequent gene length variation in protein-coding genes resulting from intronic CNVs might influence their regulation in different individuals.
Most eukaryotic protein coding genes contain introns that are removed from the messenger RNA during the process of splicing. In humans, up to 35% of the sequenced genome corresponds to intronic sequence, while exons cover around the 2.8% of the genome (based on the genome version and gene set used for this study). Human introns can have very different lengths, contrarily to exons. This difference in intron length leads to substantial differences in size among human genes, which cause differences in the time taken to transcribe a gene from seconds to over 24 hours [1]. Indeed, intron size is highly conserved in genes associated with developmental patterning [2], suggesting that genes that require a precise time coordination of their transcription are reliant on a consistent transcript length. It has been suggested that selection could be acting to reduce the costs of transcription by keeping short introns in highly expressed genes [3], which are enriched in housekeeping essential functions [4]. Genes transcribed early in development [5][6][7] and genes involved in rapid biological responses [8] also conserve intron-poor structures. Interestingly, Keane and Seoighe [9] recently found that intron lengths of some genes tend to coevolve (their relative sizes co-vary across species) possibly because a precise temporal regulation of the expression of these genes is required. In fact, these genes tend to be coexpressed or participating in the same protein complexe [9].
It is well known that introns contribute to the control of gene expression by their inclusion of regulatory regions and non-coding functional RNA genes or directly by their length [10][11][12].
Despite the importance of introns in regulating transcription levels, transcription timing and splicing, little attention has been payed to their potential role in human population variability studies. A recent analysis of the literature has revealed a substantial amount of pathogenic variants located "deep" within introns (more than 100 bp from exon-intron boundaries) which suggests that the sequence analysis of full introns may help to identify causal mutations for many undiagnosed clinical cases [13]. Given that direct associations between intronic mutations and certain diseases have been reported [13][14][15][16], we need to characterise the normal genetic variability in introns so we can better distinguish normal from pathogenic variations.
R e s u l t s

Deletions are enriched in purely intronic regions
We studied the effect of structural intronic variants on protein coding gene loci in healthy humans using five copy number variant (CNV) maps of high resolution [17][18][19][20][21]. Most of these CNVs were detected using whole genome sequencing (WGS) data, which allows to determine the exact genomic boundaries of these variants. CNVs may have neutral, advantageous or deleterious consequences [22] and can be classified in gains (regions that are found duplicated when compared with expected number from the reference genome, which is 2 for autosomes), losses (homozygously or heterozygously deleted regions) and gain/loss CNVs (regions that are found duplicated in some individuals -or alleles -and deleted in others). Each of the maps in our study was derived from a different number of individuals, from different populations and using different techniques and algorithms for CNV detection (Fig S1 and Table S1). Due to these differences, each dataset provided us with a different set of CNVs (Fig S1), which we analysed independently, excluding sex chromosomes and private variants.
CNVs affect genes in different ways depending on the degree of overlap with them. Some CNVs cover entire genes (from now on whole gene CNVs), other CNVs overlap with part of the coding sequence but not the whole gene (exonic CNVs) and other CNVs are found within purely intronic regions (intronic CNVs, not overlapping with any exon from any annotated isoform, Fig 1A). The latter group is the most common, with 63% of all CNVs falling within intronic regions, but remains the least studied. More than the 95% of these 12,986 intronic CNVs are losses (12,334) or gain/loss CNVs (652). The prevalence of losses in introns is in stark contrast with whole gene CNVs (1,412), which tend to be exclusively gains (55% of the cases) or gain/loss CNVs (25% of the cases) (Fig 1B-D).
Surprisingly, purely intronic losses are not only the most prevalent form of CNV, but also they are observed more often than expected by chance (Table S2). We compared the observed values with expected distributions calculated using permutations in local and global background models (see Methods and Fig S2). We find significantly more deletions (4.14-9.3%) falling in introns than expected in 3 out of the 5 maps (4.14% in Sudmant-Nature, P = 0.0002, global permutation test). For the sake of clarity, the P-values in the main text correspond to the results in Sudmant-Nature's map [20] using the global background model In contrast with intronic deletions, there are 51.2% fewer coding deletions (overlapping with exons) than would be expected by chance (P < 1e-04, Sudmant Nature, global permutation test). These patterns are consistent using the two different background models (Fig S2 and Table S2) and are independent of intron size (Fig S3).
The enrichment of deletions in introns might seem contradictory to what was originally reported by the 1000 Genomes (1KG) Project [20], as they stated that introns had less CNVs than expected by chance. However, we would like to note that they did not separate purely intronic from intron-exon overlapping deletions, while we are talking about strictly intronic deletions (see Methods for details). Indeed, if we group all purely intronic and intron-exon overlapping deletions together, we also observe a significant depletion ( Table S2, Fig S2).

Highly variable sizes in highly conserved protein-coding genes
The percentage of each intron that can be lost in the population due to CNV losses is highly variable, from 0.03% to 96.8% (51 bp to 293 kb), representing a loss of the 0.01% to 77.5%  [23], which is based on the amount of genetic variation of each gene at an exome level, only 0.169% genes in the human genome are more intolerant to variation in their coding sequence than CSMD1. In summary, intolerance to variation in the coding sequence seems to be compatible with extreme variation in the intronic sequence. These losses might affect their regulation without affecting their protein structure.
A total of 1,638 OMIM genes carry intronic deletions in the population. Diseases associated to SLC1A1 (OMIM: 133550) include Dicarboxylic Aminoaciduria and susceptibility to Schizophrenia, while LINGO2 (OMIM: 609793) has variants associated with essential tremor and Parkinson disease and also has an intronic SNP associated with body mass [24]. CSMD1 has been associated to diseases such as Benign Adult Familial Myoclonic Epilepsy (Malacards [25], MCID: BNG079) and Smallpox (MCID: SML019). Interestingly, rare intronic deletions in this gene have been recently reported to be associated to both male and female infertility [26]. To better understand possible epistatic effects between protein-coding and intronic mutations, it will be useful to incorporate information about gene length variation in future studies of these disease genes.

Intronic deletions are frequent in evolutionary ancient and essential genes
Structural variants in the germline DNA constitute an important source of genetic variability that serves as the substrate for evolution. Therefore, dating the evolutionary age of genes allows the study of structural variants that were fixed millions of years ago. Whole gene CNVs are known to differentially affect genes depending of their evolutionary age, mainly involving evolutionary young genes [27]. Genes of younger ages are generally cell-type specific, while ancient genes tend to be more conserved, ubiquitously expressed and enriched in cellular essential functions. Intrigued to see many cases where intronic CNVs were affecting highly conserved protein-coding loci, we compared the distribution of coding (including exonic and whole gene) and intronic deletions across different gene ages (Fig 3).
We observed that most ancient genes are depleted of deletions that affect their coding regions, while primate-specific genes are enriched with coding CNVs (Fig 3A). This pattern was also observed when CNV gains were included (Fig S4), meaning that the coding region of recent genes have a higher tendency to be lost or disrupted. The generation of random background models revealed that ancient genes (present in the Sarcopterygii ancestor) were significantly depleted of coding region losses (both exonic and whole gene, P < 1e-04, global permutation test), while these were enriched in young genes (from Hominoidea to Homo sapiens, P < 1e-04, global permutation test; see Fig 3A and Table S2).
In contrast with coding deletions, we found that the proportion of ancient genes with intronic deletions was higher than that of young genes (Fig S5), and also higher than expected by chance in Sudmant-Nature's map (P = 2e-04, global permutation test, Fig 3B, Table S2).
This finding was confirmed with additional analyses considering only genes with big introns (larger than 1,500 bp, see Table S2) and by calculating the enrichment within big introns independently from genes ( Fig S6 and Table S2). In summary, the frequency of coding deletions in ancient genes is lower, but their frequency of intronic deletions is higher than the one observed for young genes (Fig 3C and Fig S7).
We would expect that essential genes, which tend to be ancient [28], could be an exception to the enrichment of deletions. Essential genes have on average shorter introns than the rest of the genes [29,30] and relative to the genes of the same evolutionary age (Fig S8A and B). Up to 1,154 essential genes carry intronic deletions if we take into account all five CNV maps. In Sudmant Nature, 907 essential genes have intronic deletions, a higher number than expected by chance (P = 0.034, global permutation test, Table S2).
We investigated if intron variability in genes was associated with any biological function.
Genes with more or less intronic deletions than expected by chance (Table S3, see Methods) were not associated to any particular function using DAVID [31]. Nevertheless, genes with less intronic deletions than expected show more protein-protein interactions among them than expected by chance (P = 2.43e-10, calculated with STRING [32]). These results are compatible with previous evolutionary studies that showed high levels of conservation of intron length in genes associated with development protein complexes in mammals [2], presumably to facilitate a more precise temporal regulation of expression [9].
Population stratification of CNVs has previously been suggested to be indicative of loci under adaptive selection [20,21]. We identified 352 highly stratified variants (HSVs, maximum Vst>0.2, see Methods) from Sudmant-Nature's map overlapping with proteincoding genes: 282 are intronic, 53 exonic and 17 whole gene. We classified deleted regions according to the age of the genes and the type of gene structure affected and calculated the percentage of each group that is highly stratified (Fig 3D). Interestingly, the contribution of intronic HSVs is higher for younger genes, a pattern coherent with the expected higher functional impact of HSVs in older genes. Remarkably, the percentage of intronic HSVs is similar or higher than that of whole-gene and exonic HSVs in all age groups (and always higher than partial exonic deletions). These signatures of positive selection in purely-intronic CNVs suggest that a fraction of them might contribute to human adaptation.

Intronic deletions are associated with gene expression variability in the population
Multiallelic CNVs affecting whole genes have been shown to correlate with gene expression: generally, the higher the number of copies of a gene, the higher its expression levels [18,20].
We hypothesized that intronic size variation may also impact the expression of the affected genes (without affecting the actual number of copies of the gene). Therefore, we looked into the possible effect of intronic hemizygous deletions on gene expression variation at the population level, comparing the effects with hemizygous deletions in coding (whole gene and exonic) and intergenic non-coding deletions (Fig 4). We used available RNA-seq data from Geuvadis [33] that was derived from lymphoblastoid cell lines for 445 individuals for whom we have the matching CNV data from the 1KG Project [20].
In order to look for differences in gene expression we selected variants for which we had at least 2 hemizygous individuals (individuals with copy number = 1) and at least 2 wild-type individuals (copy number = 2) and we compared the expression levels among these two groups to identify deleted CNV regions associated with expression quantitative trait loci (eQTL, Fig 4F). We will refer to the deleted regions associated with expression changes as DEL-eQTLs, and the genes associated with as eGenes. For comparative purposes, we first looked at the effect of hemizygous deletions in coding regions (whole gene and exonic DEL-eQTLs). We found that 7 eGenes out of 50 genes with whole gene deletion CNVs resulted in significant downregulation of gene expression in lymphoblastoid cell lines (14%, a higher number eGenes than expected by chance, P < 5e-4, permutation test, Fig 4F). In addition, we found 35 eGenes out of 437 genes with partial exonic deletions that were differentially expressed (8%, a number higher than expected by chance, P < 5e-4, permutation test, Fig   4F). The majority of these eGenes (32/35) where down-regulated in the individuals carrying the deletion.
Although intronic deletions do not affect the coding sequence of genes, we observed significant differences in gene expression in 53 eGenes out of the 1,505 genes with intronic deletions, a number of intronic-eGenes that is also higher than expected by chance (P < 51e-4, permutation test) (Fig 4F). Given the higher abundance of intronic deletions in the population, the absolute number of intronic-eGenes (53 genes) was similar to the total of coding-eGenes (39 genes, Fig 4F and Table S4). Of the intronic-eGenes, 62% were downregulated and the other 38% upregulated, suggesting that intronic deletions might result both in enhancing or repressing gene expression (while coding losses mostly associate to gene down-regulation). Regulatory regions are known to be preferentially located in first introns [34]. From all 56 intronic eDeletions that are associated to changes of gene expression in our study, 17 (30.4%) are found within first introns. However, this percentage is not significantly higher than in non eDeletions (26%, P = 0.54, Fisher's test). Finally, we identified that four of the intronic cis-eDeletions in lymphoblastoid cells are HSVs, suggesting adaptive potential of these expression differences. These intronic HSVs are located in four ancient genes (Sarcopterygii or older): EXOC2, SKAP2, PTGR1 and PHYHD1 ( Fig S9). EXOC2 is an essential gene encoding one of the proteins of the exocyst complex and is among the top 5% most conserved genes in human (RVIS = 3.34).
Since intron length can impact the inclusion of alternative exons [35], we hypothesised that there might be genes with differentially expressed transcripts (eTranscripts) in any gene containing an intronic deletion. In addition to the 53 intronic-eGenes, we found 217 intronic-eTranscripts in a total of 185 genes (this is more than expected by chance, P = 0.018, permutation test, Fig 4F and Table S5). These results suggest that deletions within introns may cause the inclusion or exclusion of exons and thus influencing the relative proportion of alternative transcripts in many genes.
Changes in GC content as the result of intronic deletions might also contribute to these splicing differences, as in genes with long introns, the recognition of introns and exons by splicing machinery is based on their differential GC content [36,37] and the lower GC content in introns facilitates their recognition. We found that, in general, the deleted sequences have a significantly higher GC content to that of the introns where they are located (P = 1.8e-28, paired Student's t-test), and the loss of these sequences causes a significant decrease of the overall GC content of the introns (P = 2.23e-16, paired Student's t-test) ( Fig   S10 and Fig S11). This drop of GC content is more pronounced in introns with deletions originated through transposable element insertion (TEI, P = 2.01e-9, paired Student's t-test).
The 84% of TEI deletions overlap almost completely with Alu elements (Fig S12), which are known to be GC rich. The GC drops happening in introns with deletions associated to nonallelic homologous repair (NAHR) are less significant (P = 0.0063), while the difference is not significant in deletions caused by non-homologous repair (NH) (P = 0.7676). The drop in intronic GC content associated to most TEI and many NAHR deletions would increase the difference of GC content between introns and their flanking exons, what could facilitate exon definition during splicing and might contribute to the observed differential expression of some transcripts. It has been recently shown that human enhancers are associated to high GC [38] and that Alu elements can act as enhancers [39], suggesting that deletions could not only alter splicing but also influence regulatory features located within introns.

Deletion of intronic regions and changes in expression in trans
Introns in human are particularly enriched in regulatory regions and frequently interact with gene promoters of other genes via chromatin looping (Fig 4D). Therefore, deletions in introns that show long-range interactions with promoters of other genes could potentially affect their expression (trans effects). We used promoter-capture Hi-C published data for B-lymphocytes  Table S5).
We analysed the age of different types of eGenes and observed that whole-gene and exonic eGenes are enriched in young age classes (Fig 5A). This pattern is very different in intronic and intergenic eGenes: intronic cis-eGenes are enriched in old ages, while intronic trans-eGenes and intergenic-eGenes do not seem to be associated with gene age. If we compare the RVIS of the different types of eGenes, we find that whole gene and exonic eGenes are actually among the most tolerant genes to point mutations in their coding sequence (Fig 5B).
In contrast, we found that a significant proportion of intronic cis-eGenes with low RVIS percentiles, indicating that protein-coding genes that are intolerant to point mutations at the protein level can have intronic deletions associated to gene expression changes. Strikingly, trans-eGenes show the lowest RVIS percentiles, indicating that intronic variation might impact the gene expression of interacting genes that are quite intolerant to coding mutations (Fig 5B).

Intronic deletions can alter enhancer sequences or their location
To further study the potential impact of intronic deletions in regulatory regions, we analyzed the co-occurrence of these events with enhancers. In eGenes or eTranscripts, 15 intronic DEL-eQTLs overlap with enhancers (an overlap that is higher than expected by chance, P = 0.023, odds ratio = 2.04, Fisher's test). These 15 deletions represent the 24% of the tested intronic deletions overlapping with enhancers in this cell type. We need to consider that many intronic deletions that were not investigated because they fall within genes that are expressed in other cell types. Based on our observations in lymphoblastoid cells, we estimate that there might be 105 additional eDels of the 422 that overlap with enhancers. Regarding the deletions not overlapping with enhancers, we found that the distance between the DEL-eQTL and the closest enhancer is shorter than the distance of the deletions not associated with expression changes (P = 9.2e-04, Student's t-test). These results suggest that intronic DEL-eQTLs could also be affecting interactions between promoter and intronic enhancers without directly disrupting the enhancer sequence.
Motivated by these findings, we investigated if there is a global tendency (independently of gene expression) for intronic deletions to affect or not affect enhancers. First, we observed that enhancers are enriched in introns (P < 1e-04, global permutation test) agreeing with previous findings in plants [41,42]. Strikingly, we find that intronic deletions and intronic enhancers co-occur in the same intron more often than expected by chance (P < 2.2e-16, Fisher's test), possibly because most intronic deletions and intronic regulatory features are found in very long introns. However, by randomly relocating each intronic deletion within the same intron, we observed that the direct overlap of the deletions with enhancers is significantly lower than expected (P = 0.0304, global permutation test,, Table S6). A possible functional interpretation of these results is that it might be some degree on plasticity on the distance between intronic enhancer and promoters, but many intronic enhancers might be essential and cannot be lost. Interestingly, as we saw above, the loss of non-essential intronic enhancers can be associated to changes of gene expression.
Intronic CNVs constitute the most abundant form of CNV in protein-coding genes (Fig 1) and might have a previously unsuspected role in human evolution and disease. This variation in intronic length in healthy human populations implies that the actual size of many genes is different among individuals and, therefore, it might change in populations over time.
However, little attention has been given to this variability even if gene length has been shown to be important in many genes.
We have shown that that intronic deletions are occurring more often than expected by chance in three different CNV maps (using two different background models). Other studies have previously reported that CNVs are impoverished [20,43] or neither impoverished nor overrepresented within introns [44]. To explain this apparent controversy, we have to carefully review the different definitions of "intronic CNVs". Here, we looked at deleted regions located completely within constitutive intronic regions (excluding intronic regions that contain alternative exons). Mu et al. [44] showed that purely intronic CNVs in general are not either enriched or impoverished in their dataset, but they observed that the subset of events associated to NAHR are found more often than expected by chance using the Pilot Our results suggest that copy number variation is shaping gene evolution in different ways depending on the age of genes, duplicating or deleting young genes and contributing to finetuning the regulation in both young and old genes (Fig 5C). Although we expect stronger functional effects for CNVs affecting the coding sequence, we have shown that the purely intronic CNVs also show signatures of positive selection. Indeed, the proportion of intronic CNVs with adaptive potential is similar or higher than for coding CNVs. Popadin et al.
showed that primate-specific genes in human are enriched in single nucleotide variants correlated with gene expression (cis-eQTLs) with their associated SNPs tending to be closer to the TSS than in older genes [45]. These data highlight the need of dissecting the different types of genetic variation in order to understand the complex relationships between SNPs, CNVs, gene expression and gene age. While point mutations near the TSS [45] and coding CNVs seem to have a higher effect in young genes, intronic CNVs are frequently associated with gene expression variation in genes of any age.
Previously published studies on the effect of genetic variants on gene expression have proven the effect of CNVs on expression variability [46][47][48]. Chiang and co-workers identified 789 SVs associated to changes in gene expression, most of them (88.3%) not overlapping with exons from the eGene [48]. DeBoever and co-workers observed that a large proportion of common CNVs associated with gene expression levels is located in intergenic regulatory regions [47]. However, research on the subject has been mostly restricted to SVs found within 1Mb from the gene and previous works did not analysed intronic regions in detail. In contrast, we relied on Hi-C data to define deletions affecting regions in 3D contact with a gene. In this way, we do not require the CNV to be located within any particular distance to the TSS position of a gene. We tested intergenic eCNVs that can be located at any distance from 864bp to 82Mb from the nearest gene.
The differences that we observe in gene expression could be the result of intronic CNVs affecting the rate of transcription, the splicing process, the stability of RNA or a combination of them. For example, intronic deletions interfering with splicing recognition might trigger the nonsense-mediated decay (NMD) pathway that would degrade the transcript. Recently, it has been shown that the balance of unspliced and spliced mRNA (RNA velocity) is a cell type-specific signature that can be used to predict the future increases or decreases of gene expression in single cells [49]. As the amount of unspliced transcript detected will depend on the length of introns -which can be highly variable in some genes -we would expect that the RNA velocity of intron-varying genes will be also varying in human populations.
Despite the clear trends shown, our results are likely to underestimate the extent of the impact of intron loses in gene expression. On one hand, we only investigated the effect on gene expression in lymphoblastoid cell lines. On the other hand, the regulatory data currently available is also limited. The interaction maps change in different cell types [40,50] and many enhancers are tissue-specific [51]. Therefore, the loss of intronic sequence could affect the expression of genes in other cell types. In addition, the 3D contacts involving frequently deleted regions in the population will be underrepresented in the interaction map used in our study, as they are less likely to be present in the assayed samples. The availability of CNV, personal gene expression and genome interactomes from multiple tissues will allow to evaluate more accurately what is the impact of coding and non-coding deletions in the whole organism.

Supplemental Data
Supplemental Data include twelve figures and eight tables, including a list of all CNVs used in this study and a list of genes overlapping these CNVs.

Gene structures
Autosomal gene structures and sequences were retrieved from Ensembl [52] (http://www.ensembl.org; version 75) and principal isoforms were determined according to the APPRIS database [53], Ensembl version 74. In order to avoid duplicate identification of introns, intronic regions were defined as regions within introns that aren't coding in any transcript of any gene. When analyzing real introns, in order to avoid duplicate identification of introns, the principal isoform with a higher exonic content was taken. The complete list of genes affected by different types of CNVs is available in Table S8. Genomic sequences were obtained from the primary GRCh37/hg19 assembly, and were used for calculating the GC content of introns and intronic CNVs.

Dating gene and intron ages
An age was assigned to all duplicated genes as described before [27]. In the case of singletons gene ages were assigned from the last common ancestor to all the genes in their family according to the gene trees retrieved from ENSEMBL. Singleton's ages can be noisy for genes suffering important alterations as gene fusion/fission events or divergence shifts. As a consequence, these ages should not be interpreted as the age of the oldest region of the gene, but as a restrictive definition of gene age considering a similar gene structure and gene product.
The ages (from ancient to recent) and number of genes per age are as follows: FungiMetazoa: For other analyses, we only grouped the 16 ages in three, "ancient" (collapsing groups from FungiMetazoa to Sarcopterygii), "middle" (from Tetrapoda to Eutheria) and "young" genes (Primates).
Intronic regions were assigned the evolutionary age of the gene they belonged to. In the cases when an intron could be assigned to more than one gene, the most recent age was assigned to them.

Statistical assessment of genome-wide distribution of CNVs
To estimate statistical significance of our results we performed permutation tests. In order to compare the number of overlaps of CNVs with genic functional elements we compared our observed values to a background model. A global background was obtained by relocating all the CNVs in the whole genome 10,000 times, avoiding low-mappability regions in R package "BSgenome.Hsapiens.UCSC.hg19.masked"). Genome coordinates and low mappability regions were downloaded using RegioneR package [60]. A local background was obtained by segmenting the genome in 278 windows of at least 10Mb and randomly shuffling the CNVs within their original window 10,000 times, also avoiding low-mappability regions. P-values were computed using a function derived from the permTest function from package RegioneR version 1.6.2 [60]. Code is available in https://github.com/orgs/IntronicCNVs.
We compared the location of the CNVs in our datasets and compared with their distribution in the random models in order to calculate enrichments or depletions depending on the intron size and gene age and essentiality.

Regulatory features
We downloaded a genome-wide set of regions that are likely to be involved in gene regulation from the Ensembl Regulatory Build [61], assembled from IHEC epigenomic data [62]. We checked if introns are enriched in these regulatory features (promoters, enhancers, promoter flanking regions or insulators) by comparing to a random background model generated by relocating 10,000 times all regulatory features in the genome. P-values are the fraction of random values superior or inferior to the observed values.
In order to check for the significance of the overlaps between intronic deletions and regulatory features we relocated 10,000 times each intronic deletion within their host intronic region, avoiding overlaps with exons. Then, we compared the observed and the expected overlap with regulatory features. Introns that overlapped with low-mappability regions were previously removed.

Alu elements
Alu element genomic coordinates were extracted from the RepeatMasker tracks from UCSC, build GRCh37.

CNV mechanisms
The analysis intronic deletions generated through different mechanisms was done using the dataset from Abyzov's [20] study.

Gene expression analysis
We used available RNA-seq data at Geuvadis [33] that was derived from lymphoblastoid cell lines for 445 individuals who were sequenced by the 1KG Project and for whom we have the intronic deletions in the largest CNV map [20]. We focused our analyses on the 763 genes that have only one intronic deletion in the population with at least two individuals affected in the Geuvadis dataset. For each of these genes we classified the PEER normalized gene expression levels [63] in two groups: 1) gene expression of individuals homozygous for the reference genotype and 2) gene expression of individuals with one allele with the deletion and the other with the reference genotype. We then performed Student's t-tests to compare the expression of the two different genotypes. We corrected for multiple testing with p.adjust R function (Benjamini-Hochberg method). In addition, in order to see if the number of significant differentially expressed genes is higher than expected by chance, for each intronic deletion, we the shuffled 10,000 times the genotypes of the individuals and performed t-tests with the expression of the random groups of wild-type and heterozygous individuals. For example, if a deletion is found in heterozygosis in 50 individuals and the rest are wild-type, we will test if there is differential expression when comparing the expression of 50 randomly selected individuals versus the rest. By repeating this shuffling 10,000 times for every tested deletion we can calculate the expected percentages of significantly differentially expressed genes."

Observed vs expected intronic deletion content score
The number and size of expected intronic deletions per gene was calculated in two different ways: 1) relocating 10,000 times all deletions in the whole genome (except for low mappability regions) and 2) relocating 1,000 all intronic deletions within the intronic regions.
In both cases, a score was generated to determine what genes have more or less intronic

Functional enrichment analysis
Functional enrichment analysis of the genes with a lower scores and higher scores was performed with DAVID [31] and STRING [32]. Enrichment of essential genes in our datasets was performed with a Fisher test using our list of essential genes (see the "Essential genes" section in Materials and Methods) .

Population stratification
For the study of population stratification of deletions, Vst statistics were extracted from Sudmant Nature [20].        Percentage of genes from each gene evolutionary age that contain intronic deletions in (A) or deletions overlapping with exons (including partial and whole gene CNVs) in (B). The gray line represents the expected value, calculated as the median of the genes in the 10,000 random permutations. Significance is marked with asterisks: * for P<0.05. ** for P<0.005.*** for P<0.0005 and their color represents enrichment (red) or impoverishment (black).

Figure S6. Differential effect of intronic deletions in big introns
Ratios of observed versus expected number of deletions within in introns bigger than 1.5kb from different evolutionary ages. Expected values are calculated 10,000 random permutations using a global background model. Asterisks show an enrichment when above the box, a depletion when below the box: * for P<0.05, ** for P<0.005 and *** for P<0.0005.