Genome-wide characterization of copy number variants in epilepsy patients

Epilepsy will affect nearly 3% of people at some point during their lifetime. Previous copy number variants (CNVs) studies of epilepsy have used array-based technology and were restricted to the detection of large or exonic events. In contrast, whole-genome sequencing (WGS) has the potential to more comprehensively profile CNVs but existing analytic methods suffer from limited sensitivity and specificity. We show that this is in part due to the non-uniformity of read coverage, even after intra-sample normalization. To improve on this, we developed PopSV, an algorithm that uses multiple samples to control for technical variation and enables the robust detection of CNVs. Using WGS and PopSV, we performed a comprehensive characterization of CNVs in 198 individuals affected with epilepsy and 301 controls. We found an enrichment of rare exonic events in epilepsy patients, especially in genes with predicted loss-of-function intolerance. Notably, this genome-wide survey also revealed an enrichment of rare non-coding CNVs near previously known epilepsy genes. This enrichment was strongest for non-coding CNVs located within 100 Kbp of an epilepsy gene and in regions associated with changes in the gene expression, such as expression QTLs or DNase I hypersensitive sites. Finally, we report on 21 potentially damaging events that could be associated with known or new candidate epilepsy genes. Our results suggest that comprehensive profiling of CNVs could help explain a larger fraction of epilepsy cases.


Introduction
Structural variants (SVs) are defined as genetic mutations affecting more than 50 base pairs and 20 encompass several types of rearrangements: deletion, duplication, novel insertion, inversion and translocation. Deletions and duplications, which affect DNA copy number, are collectively known as copy number variants (CNVs). SVs arise from a broad range of mechanisms and show a heterogeneous distribution of location and size across the genome 1,2,3 . Numerous diseases are caused by SVs with a demonstrated detrimental effect 4,5 . While cytogenetic approaches and array-based tech- 25 nologies have been used to identify large SVs, whole-genome sequencing (WGS) has the potential to uncover the full range of SVs both in terms of type and size 6,7 . SV detection methods that use read-pair and split read information 8 can detect deletions and duplications but most CNV-focused approaches look for an increased or decreased read coverage, the expected consequence of a dupli-cation or a deletion. Coverage-based methods exist to analyze single samples 9 , pairs of samples 10 30 or multiple samples 11, 12,13 but the presence of technical bias in WGS remains an important challenge. Indeed, various features of sequencing experiments, such as mappability 14,15 , GC content 16 or replication timing 17 , have a negative impact on the uniformity of the read coverage 18 .
Epilepsy is a common neurological disorder characterized by recurrent and unprovoked seizures.
It is estimated that up to 3% of the population will suffer from a form of epilepsy at some point during 35 their lifetime. Although the disease presents a strong genetic component that can be as high as 95%, typical "monogenic" epilepsy is rare, accounting for only a fraction of cases 19,20 . For the last thirty years, many teams have focused on finding genetic susceptibility factors associated with the disease.
Genome-wide association studies have had only moderate success and revealed only a few marginal associations 21,22 . A meta-analysis combining multiple epilepsy cohorts did find positive associations 40 with the disease, the strongest in SCN1A, a gene already associated with the genetic mechanism of the disease 23 . Thanks to array-based technologies, surveys of large CNVs (>50 Kbp) have shown the importance of large and de novo CNVs as well as identified a few associations with specific genes 24,25,26,27,28 . Rare genic CNVs were typically found in around 10% of epilepsy patients 25,29,28 and CNVs larger than 1 Mbp were significantly enriched in patients compared to controls 30,29,31 . 45 Unfortunately, small CNVs and other types of SVs could not be efficiently or consistently detected using these technologies, hence much remains to be done.
To more comprehensively characterize the role of CNVs in epilepsy, we performed whole-genome sequencing of epileptic patients from the Canadian Epilepsy Network (CENet), the largest WGS study on epilepsy to date. In the present study, we assessed the frequency of CNVs in epileptic 50 individuals using 198 unrelated patients and 301 healthy individuals. Using this data, we showed that technical variation in WGS remains problematic for CNV detection despite state-of-the-art intra-sample normalization. To correct for this and to maximize the potential of the CENet cohorts, we developed a population-based CNV detection algorithm called PopSV. Our method uses information across samples to avoid systematic biases and to more precisely detect regions with 55 abnormal coverage. Using two public WGS cohorts 32 33 , and additional orthogonal validation, we showed that PopSV outperforms other analytical methods both in terms of specificity and sensitivity, especially for small CNVs. Using this tool, we built a comprehensive catalog of CNVs in the CENet epilepsy patients and studied the properties of these potentially damaging structural events across the genome.

Information.
Testing for technical biases in WGS To investigate the bias in read depth (RD), we fragmented the genome in non-overlapping bins of 5 Kbp and counted the number of properly mapped reads. In each sample, we corrected for GC bias and removed bins with extremely low or high coverage (see Supplementary Information). Then, read counts across all samples were combined and quantile-80 normalized. Using simulations and permutations, we constructed two control RD datasets with no region-specific or sample-specific bias. We computed the mean and standard deviation of the coverage in each bin across samples. Next, to investigate experiment-specific bias, we retrieved which sample had the highest coverage in each bin. Then we computed, for each sample, the proportion 4 of the genome where it had the highest coverage. The same analysis was performed monitoring the 85 lowest coverage.
PopSV The main idea behind PopSV is to assess whether the coverage observed in a given location of the genome diverges significantly from the coverage observed in a set of reference samples. The genome is first segmented into bins and the number of reads with proper mapping in each bin was counted for each sample. In a typical design, the genome is segmented in non-overlapping 90 consecutive windows of equal size, but custom designs could also be used. With PopSV, we propose a new normalization procedure which we call targeted normalization that retrieves, for each bin, other genomic regions with similar profile across the reference samples and uses these bins to normalize read coverage (see Supplementary Information). Our targeted normalization was compared to global approaches that adjust for the median coverage, or quantile-based approaches. After normalization, 95 the value observed in each bin is compared with the profiles observed in the reference samples and a Z-score is calculated (Figure 1b). False Discovery Rate (FDR) is estimated based on these Z-score distributions and a bin is marked as abnormal based on a user-defined FDR threshold. Consecutive or nearby abnormal bins are merged and considered as one variant.
Validation and benchmark of PopSV We compared PopSV to FREEC 10 , CNVnator 9 and 100 cn.MOPS 11 , three popular RD methods that can be applied to WGS datasets, to identify CNVs using a publicly available dataset from a Twin study 32 . We also ran LUMPY 8 which uses an orthogonal mapping signal: the insert size, orientation and split mapping of paired reads. For LUMPY, all the CNVs (deletions and duplications) and intra-chromosomal translocations (labeled as 'BND' in Lumpy's output) larger than 300 bp were kept for the upcoming analysis. First, we 105 compared the frequency at which a region is affected by a CNV using the calls from the different methods. To investigate the presence of systematic calls in each method, we compute how many of the calls in a typical sample are called at different frequencies in the cohort. For example, on average, how many calls in one sample are called in more than 90% of the samples. Then, for each method, the samples were clustered using the CNV calls and hierarchical clustering with different linkage 110 criteria (see Supplementary Information). The Rand index estimated the concordance between the clustering and the known pedigree. Next, we measured the number of CNVs identified in each twin that were also found in the matching twin. We removed calls present in more than 50% of the samples to ensure that systematic errors were not biasing our replication estimates. Hence, a replicated call is most likely true as it is present in a minority of samples but consistently in the 115 twin pair. The number and the proportion of replicated calls are used as a surrogate to explore sensitivity and specificity, respectively.
The approach described previously comparing pairs of twins was also applied in a renal cell carcinoma dataset 33 , on pairs of normal/tumor samples. In this case, a replicated call is found in the normal sample and in the paired tumor sample. Finally, we compared calls using small bins 120 (500 bp) and calls using larger bins (5 Kbp). This comparison explores the quality of the calls, the size of detectable events and the resolution for different bin size. First, we counted how many small bin calls supported any large bin call. We then looked at the proportion of small bin calls of different size that were also found in the large bin calls.
Validation by Taqman RT-PCR We first selected CNV calls in epilepsy patients that spanned 125 at least 2 consecutive bins. We kept CNVs overlapping a Taqman probe and prioritized exonic CNVs.
A second batch of CNVs, containing small non-coding CNVs, was also sent for validation. Hundreds of non-coding CNVs spanning only one bin were randomly selected. When possible the breakpoints were manually fine-tuned from manual inspection of a base-pair level coverage representation or using IGV; the breakpoints remained unchanged when they couldn't be refined. Finally, we kept 130 regions overlapping a Taqman probe.
Probes were selected using the assay search tool on the Thermofisher website. All probes were tested for patients and controls that were called in PopSV as well as an additional 10 control individuals to ensure the validity of the probe. For each CNV, one assay was chosen in the middle of the genomic region of interest and located in an exon when possible. All reactions with Taq- CNVs enrichment in exonic regions CNV were called using PopSV and 5 Kbp bins, using 150 all the samples as reference. For each cohort, we retrieved the CNV catalog by merging CNV that are recurrent in multiple samples. Hence, the CNV catalog represents all the different CNVs found in each cohort. To control for the population size, we sub-sampled 150 samples in each cohort.
For each sub-sampling and each cohort, control regions were selected to fit the size distribution of the CNV catalog and the overlap with centromeres, telomeres and assembly gaps. The fold-155 enrichment represents how much more/less of the CNVs overlap an exon compared to the control regions. The same was performed for exons from genes that are likely loss-of-function intolerant 35 (see Supplementary Information). The significance was assessed by computing the fold-enrichment after permuting a thousand times the cohort labels. The same approach was used with rare CNVs, i.e. being present in less than 1% of PopSV calls in the Twins and renal cancer datasets, and in 160 four public datasets (see Supplementary Information).
To test for a difference in deletion/duplication ratio among rare CNVs, we compared the numbers of rare deletions and duplications in the epilepsy patients and controls using a χ 2 test. The same test was performed after downsampling the controls to the sample size of the epilepsy cohort.
In each cohort, we then retrieved the CNV catalog of rare exonic CNVs. We evaluated the 165 proportion of the CNVs in the catalog that are private (i.e. seen in only one sample), or seen in 2 samples or more, 3 samples or more, etc. The control cohort was down-sampled a thousand times to the same sample size as the epilepsy cohort to provide a confidence interval and empirical Pvalue (see Supplementary Information). We performed the same analysis after removing the top 20 samples with the most non-private rare exonic CNVs ( Figure S10). The analysis was also replicated 170 using French-Canadian individuals only.
CNV enrichment in and near epilepsy genes We used the list of genes associated with epilepsy from the EpilepsyGene resource 36 which consists of 154 genes strongly associated with epilepsy. For a particular set of CNV we count how many of the genes hit are known epilepsy genes. To control for the gene size of epilepsy genes and CNV-hit genes, we randomly selected genes 175 with sizes similar to the genes hit by CNVs and evaluated how many of these were epilepsy genes.
After ten thousand samplings, we computed an empirical P-value (see Supplementary Information).
Using this sampling approach we tested different sets of CNVs: deletion or duplications of different frequency in the epilepsy cohort, control individuals and samples from the twin study.
To investigate rare non-coding CNVs close to known epilepsy genes, we counted how many 180 patients have such a CNV at different distance threshold. We compared this cumulative distribution to the control cohort, after down-sampling it to the sample size of the epilepsy cohort. We performed the same analysis on rare non-coding CNVs that overlap an eQTL associated with the epilepsy genes 37 or a DNase I hypersensitive site associated with the promoter of epilepsy genes 38 . A Kolmogorov-Smirnov test was used to test the difference in distribution. We computed the odds 185 ratio of having such a CNV for different distance thresholds between epilepsy patients and controls.
Putatively pathogenic CNVs Exonic CNVs larger than 10 Kbp and found in less than 1% of the 301 controls were first selected. We further retained either CNVs overlapping the exon of a known epilepsy-associated gene 36 or deletions overlapping the exon of a loss-of-function intolerant gene 35 , or CNVs present in two or more of our epilepsy patients. All the putatively pathogenic CNVs were validated by Taqman RT-PCR. 8

Results
Technical bias in read coverage We sequenced the genome of 198 unrelated individuals affected with epilepsy and 301 unrelated healthy controls. We compared the coverage between samples and across the genome to look for technical biases (e.g. GC correction, mappability filtering). In contrast 195 to simulated datasets, we found that the inter-sample mean coverage in each bin varied between genomic regions even after stringent corrections and filters (Figure 1a). Supporting this observation, the bin coverage variance across samples was also lower than expected and varied between regions ( Figure S1a). We also observed experiment-specific biases. In particular, some samples consistently had the highest, or the lowest, coverage across large portions of the genome ( Figure S1b). These 200 observations were not unique to our data and could also be observed in two public WGS datasets ( Figure S2). This fluctuation of coverage has implications for CNV detection approaches that assume a uniform distribution 10,9,39 after standard bias correction and will lead to false positives.
CNV detection with PopSV To better control for technical bias, we developed PopSV, a new SV detection method. PopSV uses read depth across the samples to normalize coverage and detect 205 change in DNA copy number (Figure 1b). The normalization step here is critical since most approaches will fail to give acceptable normalized coverage scores ( Figure S1b). Moreover, with global median/variance adjustment or quantile normalization, the remaining subtle experimental variation impairs the abnormal coverage test ( Figure S3a). The targeted normalization used by PopSV was found to have better statistical properties ( Figure S3b). In order to assess the performance of our 210 tool, CNVs calls were obtained from several algorithms on a twin cohort 8,9,10,11 . We found that PopSV performed as well or better in different aspects. First, for several algorithms, a large proportion of the detected events in a typical sample were also identified in almost all samples (60% of the calls found in >95% of the samples, Figure S4). PopSV's calls were better distributed across the frequency spectrum, hence more informative as we expect the relative frequency of disease-related 215 variants to be rare. In addition, the pedigree structure was more accurately recovered when the CNVs were used to cluster the individuals in the Twins cohort ( Figure S5). The agreement with the pedigree was computed by the Rand index after clustering the individuals with three hierar- Next, coverage of the sample is normalized (2) and each bin is tested by computing a Z-score (3), estimating p-values (4) and identifying abnormal regions (5). c) Number and proportion of calls from a twin that was replicated in the other twin.
chical clustering approaches (see Supplementary Information). Looking at the replication between twins, PopSV detected more replicated CNVs compared to other methods, while maintaining similar 220 replication rate (Figure 1c). Similar results were also obtained in a different dataset ( Figure S6).
Finally, we repeated the twin analysis using 500 bp bins and observed high consistency with the 5 Kbp calls ( Figure S7). These results suggest that PopSV can accurately detect 75% of events that are as large as half the bin size used. was not higher in epilepsy patients, the proportion of deletion was significantly higher compared to controls (χ 2 test: P-value 10 −7 ). Next, we selected 151 CNVs and further validated them using a Taqman CNV assay and RT-PCR. To explore PopSV's performance across different CNV profiles, we selected variants of different type, size and frequency. We found that the calls were concordant 240 in 90.7% of the cases (Table S2 and S3). As expected, the estimated false positive rate was slightly higher for rare or smaller variants (12.1% for rare CNVs; 15.1% for CNV <20 Kbp).

CNV enrichment in exonic regions
To assess the role of CNVs in the pathogenic mechanism of epilepsy, we evaluated the prevalence of exonic CNVs in our epileptic cohort compared with healthy controls. First, focusing on CNVs larger than 50 Kbp, we observed the increased exonic burden  CNV in the CNV catalog of the epilepsy cohort was annotated with its maximum frequency in five SV databases. c) Enrichment in exonic sequence for all CNVs and rare CNVs. d) Proportion of rare exonic CNVs (y-axis) seen in X or more individuals (x-axis). The ribbon shows the 5%-95% confidence interval. described previously 30,29,31 (Figure S9). We observed fewer large CNVs overlapping exonic sequence than expected by chance but the fold-enrichment was higher in epileptic patients than in controls.
When considering all CNVs however, there was no significant difference between epileptic patients and controls (Figure 2c). The number of CNVs overlapping exonic sequences of genes intolerant to loss-of-function mutations 35 was even lower. Interestingly, the coding regions of those genes were 250 significantly more affected by CNVs in epileptic patients compared with controls (permutation Pvalue < 0.001). Because they are more likely pathogenic and of greater interest, we performed the same analysis using rare CNVs only. We observed a clear and significant enrichment of rare exonic CNVs in epileptic patients (permutation P-value < 0.001). The exonic enrichment for intolerant genes was also stronger (Figure 2c). In both cohorts, most of the rare exonic CNVs were private, 255 i.e. present in only one individual. However, we observed that rare exonic CNVs were less likely private in the epileptic patients (permutation P-value < 0.001, Figure 2d). We also replicated this result using only individuals with a similar population background (French-Canadians, Figure S10).
Overall we concluded that rare CNVs were not only enriched in exons but also affected exons more recurrently in the epilepsy cohort as compared to controls.

260
CNV enrichment in and near epilepsy genes We then sought to evaluate if there was an excess of CNVs disrupting epilepsy-related genes or nearby functional regions. We first retrieved genes whose exons were hit by rare deletions or duplications and evaluated how many were known epilepsy genes based on a list of 154 genes previously associated with epilepsy 36 (Figure 3a). Because epilepsy genes tend to be large, we controlled for the gene size when testing for enrichment ( Figure   265 S11a). In the epilepsy cohort only, we noted a clear enrichment for epilepsy genes hit by rare deletions ( Figure S11b). Moreover, the enrichment became stronger for rare CNVs. For instance, the exons of 921 genes were disrupted in the epilepsy cohort when considering deletions completely absent from the public databases, 17 of which were epilepsy genes (P-value 0.015, Figure 3b). In addition, we observed significantly more epilepsy patients with a rare non-coding CNV close to an shows the cumulative number of individuals (y-axis) with a rare non-coding CNV located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. We used CNVs overlapping regions functionally associated with the epilepsy gene (eQTL or promoter-associated DNase site).
14 rare non-coding CNVs overlapping these functional regions, the enrichment in epileptic patients was greatly strengthened and clearly present up to 100 Kbp from an epilepsy gene (Kolmogorov-

275
Smirnov test: P-value 9×10 −5 , Figure 3c). Comparing epilepsy patients and controls, the odds ratio of having such a CNV at a distance of 100 Kbp or less was 1.33 and gradually increased the closer to the exon (2.9 for CNVs at 5 Kbp or less, Figure S13). These non-coding CNVs were rare even in the epileptic cohort, but collectively represented an important fraction of affected patients. While 20 patients (10.1%) had exonic CNVs in epilepsy genes that were not seen in any control or public 280 database, this number rose to 57 patients (28.8%) when counting non-coding CNVs in functional regions located at less than 100 Kbp of an epilepsy gene. These non-coding CNVs were never seen in the controls nor public databases and overlap with annotated enhancer of epilepsy genes.
Although their functional impact remains putative, we believe these CNVs to be of high-interest for the identification of disease causing genes. Among these CNVs of high-interest, a duplication 285 of a regulatory region 5 Kbp downstream of CSNK1E was detected and validated in two different patients but absent from our controls and the public databases ( Figure S14a). Another example is a short deletion of an extremely conserved region downstream of FAM63B, detected in one patient and overlapping expression QTLs for this epilepsy gene ( Figure S14b).
Putatively pathogenic CNVs Next, we used an array of criteria to select the rare CNVs (less 290 than 1% in 301 controls) with the highest disruptive potential in the epilepsy cohort. Priority was given to CNVs in genes already known to be associated with epilepsy. For CNVs in other genes, we also prioritize recurrent variants and deletions in genes highly intolerant to loss-of-function mutations. In total, we identified 21 such putative pathogenic CNVs (Tables 1-2 and Table S4).
Out of these, 8 directly affected a gene previously associated with epilepsy 36 (Table 1). In particular, 295 we identified a deletion resulting in the loss of more than half of the DEPDC5 gene in a patient affected with partial epilepsy. A number of point mutations have previously been reported in this gene for the same condition 44,45 . We also identified two deletions and one duplication in CHD2 gene (see Figure 4). The first deletion is large and affects a major portion of the gene while the second is a small 4.6 Kbp deletion of exon 13, the last exon of CHD2's second isoform (Figure 300 S15  Four of the 21 putative pathogenic CNVs were found in more than one individual (see Table 2 for precise numbers). To assess their global prevalence we tested them in an additional cohort of 325 epileptic patients and 380 ethnically matched controls ( Table 2). Two regions were replicated: the first region in chromosome 2 consists of duplication of the genes TTC27, LTPB1 and BIRC6. In total, 4 patients carried this duplication and it was not reported in any of the two sets of controls.

310
The second region was found on chromosome 16 and encompasses several genes. Two deletions were found in epileptic patients for this region and 1 epileptic individual and 1 control were also carrier of a duplication in the same region. Finally, the remaining putative pathogenic CNVs were also associated with a number of genes (see Table S4). However, as we lack additional evidence for  those specific CNV regions, we propose that these genes should be assessed in independent epilepsy 315 cohorts. Of note, one patient had a rare 170 Kbp deletion encompassing three exons of the PTPRD gene which is predicted to be highly intolerant to loss-of-function mutations (pLI=1) 35 . Rare deletions in this gene were previously found in four independent individuals with attention-deficit hyperactivity disorder 47 and associated with intellectual disability 48 . De novo deletions were also found in an individual with autism 49 and more recently in a patient with epileptic encephalopathy 27 .

Discussion
Although several tools exist for the detection of CNVs using WGS data, we found that none of them  We also observed a clear enrichment of non-coding CNVs in the neighborhood of previously implicated genes. When focusing on CNVs seen only in the epilepsy cohort and around epilepsy genes, 10.1% of epilepsy patients have an exonic CNVs and our results shows that up to 28.8% of patients harbor non-coding CNVs of high-interest in the proximity of epilepsy genes. These non-335 coding variants are present in the epilepsy cohort only and located in annotated regulatory regions associated to known epilepsy genes. Although it is challenging to directly test their functional impact, their frequency and location suggest a putative importance in the genetic mechanism of epilepsy and should be further investigated in the future.
Finally, to better understand the impact of these findings on an individual scale, we selected 340 CNVs with the highest pathogenic potential within our patients. These CNVs highlighted known but also potentially new epilepsy genes. Using a second cohort, we were also able to identify two chromosomal regions that were recurrently disrupted by CNVs. These findings highlight the benefits of having a comprehensive survey of CNVs when trying to understand the genetic causes of a disease. The raw sequence data will be deposited at EGA prior to publication.