Global characterization of copy number variants in epilepsy patients from whole genome sequencing

  • Jean Monlong ,

    Contributed equally to this work with: Jean Monlong, Simon L. Girard

    Roles Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Human Genetics, McGill University, Montréal, Canada, Canadian Center for Computational Genomics, Montréal, Canada

  • Simon L. Girard ,

    Contributed equally to this work with: Jean Monlong, Simon L. Girard

    Roles Conceptualization, Formal analysis, Investigation, Project administration, Writing – original draft, Writing – review & editing

    Affiliations Department of Human Genetics, McGill University, Montréal, Canada, Département des sciences fondamentales, Université du Québec à Chicoutimi, Chicoutimi, Canada, Centre de Recherche du Centre Hospitalier de l’Université de Montréal, Montréal, Canada

  • Caroline Meloche,

    Roles Investigation, Methodology, Validation

    Affiliation Centre de Recherche du Centre Hospitalier de l’Université de Montréal, Montréal, Canada

  • Maxime Cadieux-Dion,

    Roles Validation

    Affiliations Centre de Recherche du Centre Hospitalier de l’Université de Montréal, Montréal, Canada, Center for Pediatric Genomic Medicine, Children’s Mercy Hospital, Kansas City, Missouri, United States of America

  • Danielle M. Andrade,

    Roles Data curation, Resources

    Affiliation Epilepsy Genetics Program, Division of Neurology, Toronto Western Hospital, University of Toronto, Toronto, Canada

  • Ron G. Lafreniere,

    Roles Data curation, Resources

    Affiliation Centre de Recherche du Centre Hospitalier de l’Université de Montréal, Montréal, Canada

  • Micheline Gravel,

    Roles Data curation, Resources

    Affiliation Centre de Recherche du Centre Hospitalier de l’Université de Montréal, Montréal, Canada

  • Dan Spiegelman,

    Roles Formal analysis

    Affiliation Montreal Neurological Institute, McGill University, Montréal, Canada

  • Alexandre Dionne-Laporte,

    Roles Formal analysis

    Affiliation Montreal Neurological Institute, McGill University, Montréal, Canada

  • Cyrus Boelman,

    Roles Data curation, Resources

    Affiliation Division of Neurology, BC Children’s Hospital, Vancouver, Canada

  • Fadi F. Hamdan,

    Roles Data curation, Resources

    Affiliation CHU Sainte-Justine Research Center, Montréal, Canada

  • Jacques L. Michaud,

    Roles Data curation, Resources

    Affiliation CHU Sainte-Justine Research Center, Montréal, Canada

  • Guy Rouleau,

    Roles Data curation, Resources

    Affiliation Montreal Neurological Institute, McGill University, Montréal, Canada

  • Berge A. Minassian,

    Roles Data curation, Resources

    Affiliation Division of Neurology, The Hospital for Sick Children, Toronto, Canada

  • Guillaume Bourque ,

    Roles Conceptualization, Funding acquisition, Investigation, Supervision, Writing – original draft, Writing – review & editing

    guil.bourque@mcgill.ca (GB); patrick.cossette@umontreal.ca (PC)

    Affiliations Department of Human Genetics, McGill University, Montréal, Canada, Canadian Center for Computational Genomics, Montréal, Canada, McGill University and Génome Québec Innovation Center, Montréal, Canada

  • Patrick Cossette

    Roles Conceptualization, Funding acquisition, Investigation, Writing – review & editing

    guil.bourque@mcgill.ca (GB); patrick.cossette@umontreal.ca (PC)

    Affiliation Centre de Recherche du Centre Hospitalier de l’Université de Montréal, Montréal, Canada

Abstract

Epilepsy will affect nearly 3% of people at some point during their lifetime. Previous copy number variant (CNV) studies of epilepsy have used array-based technology and were restricted to the detection of large or exonic events. In contrast, whole-genome sequencing (WGS) has the potential to more comprehensively profile CNVs, but existing analytic methods suffer from limited accuracy. We show that this is in part due to the non-uniformity of read coverage, even after intra-sample normalization. To improve on this, we developed PopSV, an algorithm that uses multiple samples to control for technical variation and enables the robust detection of CNVs. Using WGS and PopSV, we performed a comprehensive characterization of CNVs in 198 individuals affected with epilepsy and 301 controls. For both large and small variants, we found an enrichment of rare exonic events in epilepsy patients, especially in genes with predicted loss-of-function intolerance. Notably, this genome-wide survey also revealed an enrichment of rare non-coding CNVs near previously known epilepsy genes. This enrichment was strongest for non-coding CNVs located within 100 Kbp of an epilepsy gene and in regions associated with changes in gene expression, such as expression QTLs or DNase I hypersensitive sites. Finally, we report on 21 potentially damaging events that could be associated with known or new candidate epilepsy genes. Our results suggest that comprehensive sequence-based profiling of CNVs could help explain a larger fraction of epilepsy cases.

Author summary

Epilepsy is a common neurological disorder affecting around 3% of the population. In some cases, epilepsy is caused by brain trauma or other brain anomalies but there are often no clear causes. Genetic factors have been associated with epilepsy in the past such as rare genetic variations found by linkage studies as well as common genetic variations found by genome-wide association studies and large copy-number variants. We sequenced the genome of ∼200 epilepsy patients and ∼300 healthy controls and compared the distribution of deletion (loss of a copy) and duplication (additional copy) of genomic regions. Thanks to the sequencing technology and a new method that takes advantage of the large sample size, we could compare the distribution of small copy-number variants between epilepsy patients and controls. Overall, we found that small variants are also associated with epilepsy. Indeed, the genome of epilepsy patients had more exonic copy-number variants, especially when rare or affecting genes with predicted loss-of-function intolerance. Focusing on regions around genes that have been previously associated with epilepsy, we also found more non-coding variants in epilepsy patients, especially deletions or variants in regulatory regions. Finally, we provide a list of 21 regions in which we found likely pathogenic variants.

Introduction

Structural variants (SVs) are defined as genetic mutations affecting more than 50 base pairs and encompass several types of rearrangements: deletion, duplication, novel insertion, inversion and translocation. Deletions and duplications, which affect DNA copy number, are collectively known as copy number variants (CNVs). SVs arise from a broad range of mechanisms and show a heterogeneous distribution of location and size across the genome [1–3]. Numerous diseases are caused by SVs with a demonstrated detrimental effect [4, 5]. While cytogenetic approaches and array-based technologies have been used to identify large SVs, whole-genome sequencing (WGS) has the potential to uncover the full range of SVs both in terms of type and size [6, 7]. SV detection methods that use read-pair and split read information [8] can detect deletions and duplications but most CNV-focused approaches look for an increased or decreased read coverage, the expected consequence of a duplication or a deletion. Coverage-based methods exist to analyze single samples [9], pairs of samples [10] or multiple samples [11–13] but the presence of technical bias in WGS remains an important challenge. Indeed, various features of sequencing experiments, such as mappability [14, 15], GC content [16], replication timing [17], DNA quality and library preparation [18], have a negative impact on the uniformity of the read coverage [19].

Epilepsy is a common neurological disorder characterized by recurrent and unprovoked seizures. It is estimated that up to 3% of the population will suffer from a form of epilepsy at some point during their lifetime. Although the disease presents a strong genetic component that can be as high as 95%, typical “monogenic” epilepsy is rare, accounting for only a fraction of cases [20, 21]. Genetic factors have been associated with epilepsy in the past, such as rare genetic variations found by linkage studies as well as common genetic variations found by genome-wide association studies [22, 23]. For example, a meta-analysis combining multiple epilepsy cohorts found positive associations with the disease [24], the strongest in SCN1A, a gene already associated with the genetic mechanism of the disease via linkage studies and subsequent sequencing [25] or more recently as harboring de novo variants [26]. Thanks to array-based technologies, surveys of large CNVs (>50 Kbp) first associated CNVs in genomic hotspots such as 15q11.2 and 16p13.11 with generalized epilepsy [27, 28]. Other studies have further shown the importance of large and de novo CNVs as well as identified a few associations with specific genes [29–34]. Rare genic CNVs were typically found in around 10% of epilepsy patients [30, 34, 35] and CNVs larger than 1 Mbp were significantly enriched in patients compared to controls [33, 35–37]. Unfortunately, small CNVs and other types of SVs could not be efficiently or consistently detected using these technologies, hence much remains to be done.

To more comprehensively characterize the role of CNVs in epilepsy, we performed whole-genome sequencing of epileptic patients from the Canadian Epilepsy Network (CENet), the largest WGS study on epilepsy to date. In the present study, we assessed the frequency of CNVs in epileptic individuals using 198 unrelated patients and 301 healthy individuals. Using this data, we showed that technical variation in WGS remains problematic for CNV detection despite state-of-the-art intra-sample normalization. To correct for this and to maximize the potential of the CENet cohorts, we developed a population-based CNV detection algorithm called PopSV. Our method uses information across samples to avoid systematic biases and to more precisely detect regions with abnormal coverage. Using two public WGS datasets [38, 39], and additional orthogonal validation, we showed that PopSV outperforms other analytical methods both in terms of specificity and sensitivity, especially for small CNVs. Using this tool, we built a comprehensive catalog of CNVs in the CENet epilepsy patients and studied the properties of these potentially damaging structural events across the genome.

Results

Technical bias in read coverage

We sequenced the genomes of 198 unrelated individuals affected with epilepsy and 301 unrelated healthy controls. Because CNV detection relies on read coverage, we first investigated the presence of technical bias and the value of standard corrections and filters (e.g. GC correction, mappability filtering). The genome was fragmented into 5 Kbp bins and we counted the number of uniquely mapped reads in each bin. In contrast to simulated datasets, we found that the inter-sample mean coverage in each bin varied between genomic regions even after stringent corrections and filters (Fig 1a). Supporting this observation, the bin coverage variance across samples was also lower than expected and varied between regions (S1 Fig). We also observed experiment-specific biases. In particular, some samples consistently had the highest, or the lowest, coverage across large portions of the genome (S1 Fig). These observations were not unique to our data: they could also be observed in two public WGS datasets, and persisted even after correcting the GC bias and mappability using the more elaborate model from the QDNAseq pipeline [40] (S2 Fig). Our results across multiple samples suggest that existing GC bias and mappability corrections [40] cannot completely correct the technical variation in read coverage. This fluctuation of coverage has implications for CNV detection approaches that assume a uniform distribution [9, 10, 41] after standard bias correction and will lead to false positives.
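The comparison between observed and shuffled bin coverage can be illustrated with a minimal sketch. It is not the analysis code used in the study: the function names, the simulated coverage matrix and the size of the shared regional bias are all hypothetical, and only the Python standard library is used. The idea is that if coverage were uniform after correction, the spread of the per-bin inter-sample mean should match the spread obtained after shuffling bins within each sample.

```python
import random
import statistics

def bin_means(coverage):
    """Mean coverage of each bin across samples (coverage: one list of bins per sample)."""
    return [statistics.fmean(column) for column in zip(*coverage)]

def shuffled_bin_means(coverage, rng):
    """Null distribution: permute bins within each sample, which removes
    any region-specific bias shared across samples."""
    shuffled = [rng.sample(row, len(row)) for row in coverage]
    return bin_means(shuffled)

# Simulated data (assumption): every sample shares the same regional bias.
rng = random.Random(0)
n_samples, n_bins = 30, 500
bias = [rng.gauss(0.0, 0.1) for _ in range(n_bins)]
coverage = [[1.0 + b + rng.gauss(0.0, 0.1) for b in bias] for _ in range(n_samples)]

observed_sd = statistics.pstdev(bin_means(coverage))
null_sd = statistics.pstdev(shuffled_bin_means(coverage, rng))
print(f"observed spread: {observed_sd:.3f}, shuffled spread: {null_sd:.3f}")
```

With a shared regional bias, the observed spread is clearly larger than the shuffled one, which is the qualitative pattern described for Fig 1a.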

Fig 1. PopSV approach.

a) Technical bias across the genome remains after stringent correction and filtering. The distribution of the bin inter-sample mean coverage in the epilepsy cohort (red) is compared to null distributions (blue: bins shuffled, green: simulated normal distribution). b) PopSV approach. First the genome is fragmented and reads mapping in each bin are counted for each sample and GC corrected (1). Next, coverage of the sample is normalized (2) and each bin is tested by computing a Z-score (3), estimating p-values (4) and identifying abnormal regions (5). c) Number and proportion of calls from a twin that were replicated in the other monozygotic twin.

https://doi.org/10.1371/journal.pgen.1007285.g001

CNV detection with PopSV

To better control for technical bias, we developed PopSV, a new SV detection method. PopSV uses read depth across the samples to normalize coverage and detect change in DNA copy number (Fig 1b). The normalization step here is critical since most approaches will fail to give acceptable normalized coverage scores (S1 Fig). Moreover, with global median/variance adjustment or quantile normalization, the remaining subtle experimental variation impairs the abnormal coverage test (S3 Fig). The targeted normalization used by PopSV was found to have better statistical properties (S3 Fig). In order to assess the performance of our tool, we compared it to several algorithms [8–11] using a dataset that included monozygotic twins and also performed experimental validation of different types of predicted CNVs in the epilepsy cohort (see below). We found that PopSV performed as well or better in different aspects. First, for several algorithms, a large proportion of the detected events in a typical sample were also identified in almost all samples (60% of the calls found in >95% of the samples, S4 Fig). PopSV’s calls were better distributed across the frequency spectrum, hence more informative as we expect disease-related variants to be rare. In addition, the pedigree structure was more accurately recovered when the CNVs were used to cluster the individuals in the Twins dataset (S5 Fig). The agreement with the pedigree was computed by the Rand index after clustering the individuals with three hierarchical clustering approaches (see S1 Text). Looking at the replication between 10 pairs of monozygotic twins, PopSV detected more replicated CNVs compared to other methods, while maintaining similar replication rates (Fig 1c). The CNV calls were further filtered with gradually more stringent significance thresholds and PopSV remained superior in terms of number of replicated calls (S6 Fig). When investigating the overlap of calls between different methods, we noticed that PopSV was better at recovering calls from CNVnator [9], FREEC [10], cn.MOPS [11] or LUMPY [8], especially if found by two or more methods (S7 Fig). For example, around 92% of the CNVs called by other methods were also found by PopSV when focusing on calls found in at least two methods. Similar results were also obtained in a cancer dataset where we looked for replicated germline CNVs in the paired tumor (S8 Fig). Finally, we repeated the twin analysis using 500 bp bins and observed high consistency with the 5 Kbp calls (S9 Fig). These results suggest that PopSV can accurately detect around 75% of events that are as large as half the bin size used (see S1 Text).
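The core idea of testing each bin against a set of reference samples can be sketched as follows. This is a simplified illustration, not PopSV's implementation: it assumes the coverage has already been normalized, it replaces PopSV's targeted normalization and P-value estimation with a plain Z-score and a standard normal null, and the function names and toy counts are hypothetical.

```python
import math
import statistics

def bin_z_scores(test_cov, ref_cov):
    """Z-score of each bin in the tested sample relative to the reference samples.

    test_cov: normalized coverage per bin for the tested sample.
    ref_cov: one list of normalized bin coverages per reference sample.
    """
    z_scores = []
    for value, ref_values in zip(test_cov, zip(*ref_cov)):
        mean = statistics.fmean(ref_values)
        sd = statistics.stdev(ref_values)
        z_scores.append((value - mean) / sd)
    return z_scores

def two_sided_p(z):
    """Two-sided P-value under a standard normal null (a simplifying assumption)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy example: five reference samples, three bins. The third bin has low
# coverage in every sample (a systematic bias), the second bin is halved
# in the tested sample only (as for a heterozygous deletion).
reference = [
    [100, 98, 50],
    [102, 101, 52],
    [99, 103, 48],
    [101, 99, 51],
    [98, 99, 49],
]
tested = [100, 51, 50]
z = bin_z_scores(tested, reference)
p = [two_sided_p(score) for score in z]
```

In this toy example the third bin is not flagged even though its coverage is half that of its neighbors, because the reference samples show the same drop; only the second bin stands out. A single-sample method that assumes uniform coverage could flag both.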

CNVs in the CENet cohorts and experimental validation

Having demonstrated the quality of the PopSV calls, we applied our tool to the epilepsy and control cohorts. The epilepsy cohort comprises 198 individuals diagnosed with either generalized (n = 160), focal (n = 32) or unclassified (n = 6) epilepsy. CNVs ranged from 5 Kbp to 3.2 Mbp with an average size of 9.98 Kbp. We observed an average of 870 CNVs per individual accounting for 8.7 Mbp of variant calls (Fig 2a). This represents around 9 times more variants, of considerably smaller size, than in typical array-based studies [42, 43], such as the previous epilepsy surveys [30, 31, 34, 35], although a similar size distribution was previously obtained using denser arrays [4] that were never applied to epilepsy (S10 Fig). Next, we annotated each variant using four public SV databases [13, 44–46] as well as an internal database of the germline calls from PopSV in the two public datasets used earlier (see S1 Text). For each CNV, we derived the maximum frequency across these databases and defined as rare any region consistently annotated in less than 1% of the individuals (Fig 2b). In total, we identified 12,480 regions with rare CNVs in the epilepsy cohort including: 8,022 (64.3%) with heterozygous deletions, 21 (0.2%) with homozygous deletions and 4,850 (38.9%) with duplications. Although the overall amount of rare CNVs was not higher in epilepsy patients, the proportion of deletions was significantly higher compared to controls (χ² test: P-value 10⁻⁷). Next, we selected 151 CNVs and further validated them using a Taqman CNV assay and Real-Time PCR. To explore PopSV’s performance across different CNV profiles, we selected variants of different types, sizes and frequencies. We found that the calls were concordant in 90.7% of the cases (Table 1 and S2 Table). As expected, the estimated false positive rate was slightly higher for rare or smaller variants (12.1% for rare CNVs; 15.1% for CNVs <20 Kbp). Furthermore, we noted that calls supported by both PopSV and LUMPY (when available) had a similar validation rate as calls found by PopSV only (86.2% and 87.5% respectively).
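The frequency annotation can be sketched as below. This is only an illustration of the "maximum frequency across databases, rare if below 1%" rule stated above; the tuple layout, the half-open coordinates, the any-overlap criterion and the database names are assumptions, and the exact overlap rules used in the study are described in S1 Text.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) regions overlap (half-open coordinates)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def max_frequency(cnv, databases):
    """Highest frequency among database entries overlapping the CNV.

    databases: dict mapping a database name to a list of
    (chrom, start, end, frequency) tuples. Returns 0.0 if nothing overlaps.
    """
    frequencies = [
        entry[3]
        for entries in databases.values()
        for entry in entries
        if overlaps(cnv, entry)
    ]
    return max(frequencies, default=0.0)

def is_rare(cnv, databases, threshold=0.01):
    """A CNV is called rare if its maximum annotated frequency is below 1%."""
    return max_frequency(cnv, databases) < threshold

# Hypothetical catalogs, for illustration only.
databases = {
    "db_a": [("chr1", 1000, 6000, 0.20)],
    "db_b": [("chr1", 4000, 9000, 0.005), ("chr2", 100, 200, 0.5)],
}
common_cnv = ("chr1", 5000, 7000)
rare_cnv = ("chr1", 7000, 8000)
novel_cnv = ("chr3", 1, 10)
```

Taking the maximum rather than the mean is the conservative choice: a variant that is frequent in any one catalog is not treated as rare.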

Fig 2. CNVs in the epilepsy and control cohorts.

a) Regions with a CNV in each epilepsy patient. b) Each CNV in the CNV catalog of the epilepsy and control cohorts was annotated with its maximum frequency in five CNV databases. c) Enrichment in exonic sequence for all CNVs (left) and rare CNVs (right), larger than 50 Kbp (top) or smaller than 50 Kbp (bottom). The fold-enrichment (y-axis) represents how many CNVs overlap coding sequences compared to control regions randomly distributed in the genome.

https://doi.org/10.1371/journal.pgen.1007285.g002

CNV enrichment in exonic regions

To assess the role of CNVs in the pathogenic mechanism of epilepsy, we evaluated the prevalence of exonic CNVs in our epileptic cohort compared with healthy controls. First, focusing on CNVs larger than 50 Kbp, we found no difference between epileptic patients and controls (Fig 2c). As expected, we observed fewer CNVs overlapping exonic sequence than expected by chance but similar levels for both groups. The number of CNVs overlapping exonic sequences of genes intolerant to loss-of-function mutations [47] was even lower. Interestingly, the coding regions of those genes were significantly more affected by CNVs in epileptic patients compared with controls (permutation P-value<0.001, Fig 2c and S11 Fig). Because they are more likely pathogenic and of greater interest, we performed the same analysis using rare CNVs only. Here, we observed the increased exonic burden described previously for large rare CNVs [35–37]. In contrast to previous studies, we could also detect and compare small CNVs (<50 Kbp) in epileptic patients and healthy controls. We found enrichment patterns similar to those for large CNVs (Fig 2c and S11 Fig), suggesting that small rare exonic CNVs are also associated with epilepsy. As for large CNVs, there was no significant difference between epileptic patients and controls when considering all small CNVs and all genes, but the exonic enrichment was significant for genes with predicted loss-of-function intolerance and for rare variants (permutation P-value<0.001, Fig 2c and S11 Fig). In both cohorts, most of the rare exonic CNVs were private, i.e. present in only one individual. However, we observed that rare exonic CNVs were less likely to be private in the epileptic patients (permutation P-value<0.001, S12 Fig). We replicated this result using only individuals with a similar population background (French-Canadians, S12 Fig). Overall we concluded that rare CNVs were not only enriched in exons but also affected exons more recurrently in the epilepsy cohort as compared to controls.
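A label-permutation test of the kind referred to above can be sketched as follows. The per-individual statistic, the one-sided comparison and the function name are assumptions made for illustration; the permutation scheme actually used in the study is described in S1 Text.

```python
import random
import statistics

def permutation_p_value(case_values, control_values, n_perm=1000, seed=0):
    """One-sided permutation P-value for mean(cases) > mean(controls).

    Each value is a per-individual statistic, for example the number of
    rare exonic CNVs. Case/control labels are shuffled n_perm times.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(case_values) - statistics.fmean(control_values)
    pooled = list(case_values) + list(control_values)
    n_case = len(case_values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_case]) - statistics.fmean(pooled[n_case:])
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself and avoid P-values of zero.
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): a strong difference and no difference at all.
p_strong = permutation_p_value([1] * 20, [0] * 20)
p_none = permutation_p_value([0, 1] * 10, [0, 1] * 10)
```

With 1,000 permutations the smallest attainable value is 1/1001, which is why results of this kind are reported as a bound such as P-value<0.001.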

CNV enrichment in and near epilepsy genes

We then sought to evaluate if there was an excess of CNVs disrupting epilepsy-related genes or nearby functional regions. We first retrieved genes whose exons were hit by rare deletions or duplications and evaluated how many were known epilepsy genes based on a list of 154 genes previously associated with epilepsy [48] (Fig 3a). Because epilepsy genes tend to be large, we controlled for the gene size when testing for enrichment (S13 Fig). In the epilepsy cohort only, we noted a clear enrichment for epilepsy genes hit by rare deletions (S13 Fig). Moreover, the enrichment became stronger for rare CNVs. For instance, the exons of 921 genes were disrupted in the epilepsy cohort when considering deletions completely absent from the public and internal databases, 17 of which were epilepsy genes (P-value 0.015, Fig 3b). In addition, we observed significantly more epilepsy patients with a rare non-coding CNV close to an epilepsy gene compared to control individuals (S14 Fig). Interestingly, this enrichment was stronger for non-coding deletions (S14 Fig). We further explored the distribution of rare non-coding deletions by testing each epilepsy gene for a difference in mutation load between patients and controls. The GABRD gene had the strongest and only nominally significant association with four non-coding deletions among the 198 epileptic patients and none in the 301 controls. GABRD encodes the delta subunit of the gamma-aminobutyric acid A receptor and has been associated with juvenile myoclonic epilepsy [49]. In our cohort, two of the four patients with a rare non-coding deletion close to GABRD had been diagnosed with this syndrome, including one patient with a 2.7 Kbp deletion located only 3 Kbp upstream of GABRD’s transcription start site (S15 Fig). Although none survived multiple testing correction, we noted that the strongest associations were all in the direction of a higher mutation load in the epilepsy cohort rather than in the control cohort.
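Controlling for gene size when testing for enrichment can be sketched with a size-matched resampling scheme. This is a hedged illustration, not the procedure used in the study: the log2 size bins, sampling with replacement, the function names and the toy gene set are all assumptions.

```python
import math
import random
from collections import defaultdict

def size_matched_null(gene_sizes, hit_genes, epilepsy_genes, n_draws=1000, seed=0):
    """Counts of epilepsy genes in random gene sets matched on size to hit_genes.

    gene_sizes: dict mapping gene name to length in bp.
    For each hit gene, one gene is drawn (with replacement) from the same
    log2 size bin, so large genes are replaced by other large genes.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for name, size in gene_sizes.items():
        by_bin[int(math.log2(size))].append(name)
    counts = []
    for _ in range(n_draws):
        draw = [rng.choice(by_bin[int(math.log2(gene_sizes[g]))]) for g in hit_genes]
        counts.append(sum(g in epilepsy_genes for g in draw))
    return counts

def empirical_p(observed, null_counts):
    """Fraction of null draws with at least as many epilepsy genes as observed."""
    return (1 + sum(c >= observed for c in null_counts)) / (1 + len(null_counts))

# Toy gene set (hypothetical): 50 small genes and 5 large ones, 2 of which
# play the role of known epilepsy genes.
gene_sizes = {f"small{i}": 1_000 for i in range(50)}
gene_sizes.update({f"large{i}": 1_000_000 for i in range(5)})
epilepsy_genes = {"large0", "large1"}
hit_genes = ["large0", "large1", "small0"]
null_counts = size_matched_null(gene_sizes, hit_genes, epilepsy_genes)
p_value = empirical_p(2, null_counts)
```

Without size matching, the two large epilepsy genes in this toy set would look strongly enriched against the 55-gene background; matched on size, observing both is much less surprising.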

Fig 3. CNVs and epilepsy genes.

a) Number of rare CNVs in or close to exons of protein-coding genes (top) or epilepsy genes (bottom), in the epilepsy cohort. b) Number of epilepsy genes hit by exonic deletions in the epilepsy cohort and never seen in the public and internal databases (dotted line), compared to the expected distribution in all genes and size-matched genes (histograms). c) Rare non-coding CNVs in functional regions near epilepsy genes. The graph shows the cumulative number of individuals (y-axis) with a rare non-coding CNV located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. We used CNVs overlapping regions functionally associated with the epilepsy gene (eQTL or promoter-associated DNase site).

https://doi.org/10.1371/journal.pgen.1007285.g003

To get a better idea of the functional regions close to epilepsy genes, we retrieved their associated eQTLs in the GTEx database [50] and the DNase hypersensitivity sites associated with their promoter regions [51]. Notably, focusing on rare non-coding CNVs overlapping these functional regions, the enrichment in epileptic patients was greatly strengthened and clearly present up to 100 Kbp from an epilepsy gene (Kolmogorov-Smirnov test: P-value 9 × 10⁻⁵, Fig 3c). Comparing epilepsy patients and controls, the odds ratio of having such a CNV at a distance of 100 Kbp or less from an exon was 1.33 and gradually increased the closer to the exon (2.9 for CNVs at 5 Kbp or less, S16 Fig). These non-coding CNVs were rare even in the epileptic cohort, but collectively represented an important fraction of affected patients. While 20 patients (10.1%) had exonic CNVs in epilepsy genes that were not seen in any control or in the public and internal databases, this number rose to 57 patients (28.8%) when counting non-coding CNVs in functional regions located less than 100 Kbp from an epilepsy gene. These non-coding CNVs were never seen in the controls or the CNV databases and overlap annotated enhancers of epilepsy genes. Although their functional impact remains putative, we believe these CNVs to be of high interest for the identification of disease-causing genes. Among these CNVs of high interest, a duplication of a regulatory region 5 Kbp downstream of CSNK1E was detected and validated in two different patients but absent from our controls and the public and internal databases (S15 Fig). Another example is a short deletion of an extremely conserved region downstream of FAM63B, detected in one patient and overlapping expression QTLs for this epilepsy gene (S15 Fig).
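As a reminder of how these quantities are computed, the sketch below defines an odds ratio from a 2×2 table and recomputes the two patient fractions from the counts given in the text. The counts passed to `odds_ratio` are hypothetical, since the carrier counts behind the reported odds ratios (1.33 and 2.9) are not given here.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio for carrying a CNV in cases versus controls (2x2 table)."""
    return (case_with * control_without) / (case_without * control_with)

def percent(count, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)

# Hypothetical counts, for illustration only.
example_or = odds_ratio(case_with=20, case_without=80,
                        control_with=10, control_without=90)

# Fractions recomputed from the counts in the text (198 epilepsy patients).
exonic_fraction = percent(20, 198)
noncoding_fraction = percent(57, 198)
```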

Putatively pathogenic CNVs

Next, we used an array of criteria to select the rare CNVs (less than 1% in 301 controls) with the highest disruptive potential in the epilepsy cohort. Priority was given to exonic CNVs in genes already known to be associated with epilepsy. For CNVs in other genes, we also prioritized recurrent variants and deletions in genes highly intolerant to loss-of-function mutations. In total, we identified 21 such putative pathogenic CNVs (Tables 2 and 3 and S3 Table). Out of these, 8 directly affected a gene previously associated with epilepsy [48] (Table 2). In particular, we identified a deletion resulting in the loss of more than half of the DEPDC5 gene in a patient affected with partial epilepsy. A number of point mutations have previously been reported in this gene for the same condition [52, 53]. We also identified two deletions and one duplication in the CHD2 gene (see Fig 4). The first deletion is large and affects a major portion of the gene while the second is a small 4.6 Kbp deletion of exon 13, the last exon of CHD2’s second isoform (S17 Fig). No exon-disruptive CNVs were reported in any individuals from the control cohort. This gene was previously associated with patients suffering from photosensitive epilepsy [54]. Interestingly, all three patients carrying the CNVs in CHD2 have been diagnosed with eyelid myoclonia epilepsy with absence, the same diagnosis that was largely enriched in the Galizia et al. study. Other known epilepsy genes affected by deletions include LGI1 and the 15q13.3 region.

Fig 4. Exonic CNVs in CHD2 detected by PopSV.

The ‘CNV’ panel shows the exonic deletions (blue) and duplications (red) called by PopSV. The ‘Coverage’ panel shows the read depth signal in the affected individuals (colored points/lines) and the coverage distribution in the reference samples (boxplot and grey point).

https://doi.org/10.1371/journal.pgen.1007285.g004

Four of the 21 putative pathogenic CNVs were found in more than one individual (see Table 3 for precise numbers). To assess their global prevalence we tested them in an additional cohort of 325 epileptic patients and 380 ethnically matched controls (Table 3). Two regions were replicated: the first region in chromosome 2 consists of a duplication of the genes TTC27, LTBP1 and BIRC6. In total, 4 patients carried this duplication and it was not reported in either of the two sets of controls. The second region was found on chromosome 16 and encompasses several genes. Two deletions were found in epileptic patients for this region and 1 epileptic individual and 1 control were also carriers of a duplication in the same region. This region corresponds to a genomic hotspot whose deletions were previously associated with epilepsy [30] and other neurological disorders. Finally, the remaining putative pathogenic CNVs were also associated with a number of genes (S3 Table). However, as we lack additional evidence for those specific CNV regions, we propose that these genes should be assessed in independent epilepsy cohorts. Of note, one patient had a rare 170 Kbp deletion encompassing three exons of the PTPRD gene, which is predicted to be highly intolerant to loss-of-function mutations (pLI = 1) [47]. Rare deletions in this gene were previously found in four independent individuals with attention-deficit hyperactivity disorder [55] and associated with intellectual disability [56]. In addition, de novo deletions were found in an individual with autism [57] and more recently in a patient with epileptic encephalopathy [32]. A common intronic variant in PTPRD was also associated with remission of seizures after treatment in a clinical cohort of epilepsy patients [58].

Discussion

Although several tools exist for the detection of CNVs using WGS data, we found that none of them could efficiently account for technical biases, thus resulting in limited sensitivity. To improve on this, we developed a new tool, PopSV, which we demonstrated was able to accurately detect CNVs, including rare and small events.

A key aspect of our approach is the use of a set of reference samples to identify abnormal read coverage. In this context, the choice and number of reference samples will have an effect on the analysis. Results from running PopSV using different reference cohort sizes suggest that CNV calls are consistent across runs but that a higher number of reference samples increases the sensitivity and robustness of the CNV detection (S18 Fig). Based on these results, we recommend PopSV when 20 samples or more can be used as reference. In a given study, all samples can be used as reference, or a subset of a few hundred if the total sample size is extremely large. Although variants with frequency around 50% might not be detected, PopSV excels at detecting less frequent variants, smaller variants or variants in challenging regions such as repeat-rich regions. In a case/control design, the control samples could be used as reference in order to maximize the detection of case-specific variants. In the current study we used both epilepsy patients and controls as reference in order to be able to directly compare the observed CNV distributions. Finally, in a cancer project with paired normal and tumor samples, only normal samples should be used as reference such that PopSV can detect somatic CNVs of any frequency.

To maximize performance, the same library preparation, sequencing and data pre-processing should be employed on all the samples. To identify potential batch effects, a principal component analysis of read coverage was implemented as part of the PopSV package and is recommended to assess the homogeneity of the reference samples. The read length and aligner can lead to drastic changes in the read coverage and should be consistent across the cohort when analyzed with PopSV. This is particularly important in repeat-rich regions. Although the different datasets were produced by different sequencing and pre-processing protocols and showed varying degrees of technical bias (Fig 1a, S1 and S2 Figs), the performance of PopSV was comparable when benchmarking the methods in the two public datasets and experimentally validating calls in the CENet cohort.

PopSV’s approach does not require a uniform read coverage and integrates the coverage variation separately in each studied region. For these reasons, it would be straightforward to analyze targeted sequencing data, such as exome-sequencing. PopSV could also be extended for the detection of other types of SVs such as balanced SVs. To do this, instead of counting properly mapped reads, the method could be modified to test for an excess of discordant reads. Finally, additional modules could be added to PopSV to help characterize the detected variants. For instance, instead of computing a copy-number estimate from the average coverage in the reference, an HMM approach including all samples could provide a better genotyping strategy. Similar to other approaches [9, 16], an additional step in the pipeline could explore the effect of the bin size on the variation in read coverage across the population and suggest an optimal bin size.

As in previous array-based studies [35–37], we observed an enrichment of large rare exonic CNVs in patients compared to controls. However, thanks to the resolution of WGS and PopSV, we found that the global distribution of small CNVs (<50 Kbp) in 198 unrelated epilepsy patients was also skewed towards rare exonic CNVs. In addition, genes disrupted by rare deletions in patients were enriched for previously known epilepsy genes. These observations support the association of small CNVs with epilepsy and could not have been detected in previous array-based studies.

We also observed a clear enrichment of non-coding CNVs in the neighborhood of previously implicated genes. When focusing on CNVs seen only in the epilepsy cohort and around epilepsy genes, 10.1% of epilepsy patients have an exonic CNV and our results show that up to 28.8% of patients harbor non-coding CNVs of high interest in the proximity of epilepsy genes. These non-coding variants are present in the epilepsy cohort only and located in annotated regulatory regions associated with known epilepsy genes. Although it is challenging to directly test their functional impact, their frequency and location suggest a putative importance in the genetic mechanism of epilepsy, and they should be further investigated in the future.

Finally, to better understand the impact of these findings on an individual scale, we selected CNVs with the highest pathogenic potential within our patients. These CNVs highlighted known but also potentially new epilepsy genes. Using a second epilepsy cohort, we were also able to identify two chromosomal regions that were recurrently disrupted by CNVs. These findings highlight the benefits of having a comprehensive survey of CNVs when trying to understand the genetic causes of a disease.

Materials and methods

Ethics statement

This study was approved by the Research Ethics Board at the Sick Kids Hospital (REB number 1000033784) and the ethics committee at the Centre Hospitalier Universitaire de Montréal (project number 2003-1394,ND02.058-BSP(CA)). Before their inclusion in this study, patients or parents (when needed) had to give written informed consent.

Epilepsy patients and sequencing

Patients were recruited through two main recruitment sites at the Centre Hospitalier Universitaire de Montréal (CHUM) and the Sick Kids Hospital in Toronto as part of the Canadian Epilepsy Network (CENet). The main cohort of this study consisted of 198 unrelated patients with various types of epilepsy; 85 males and 113 females. The mean age at onset of the disease for our cohort was 9.2 (±6.7) years. S1 Table presents a detailed description of the clinical features for the various individuals recruited in this study. 301 unrelated healthy parents of other probands from CENet were also included in this study and used as a control cohort. DNA was extracted exclusively from blood.

Libraries were generated using the TruSeq DNA PCR-Free Library Preparation Kit (Illumina) and paired-end reads of size 125 bp were sequenced on a HiSeq 2500 to an average coverage of 37.6x ± 5.6x. Reads were aligned to reference Homo_sapiens b37 with BWA [59]. Finally, Picard was used to merge, realign and mark duplicate reads. Raw sequence data has been deposited in the European Genome-phenome Archive, under the accession code EGAS00001002825. For more details, see S1 Text.

Public WGS datasets

Two high-coverage public datasets were used to benchmark PopSV against existing methods.

A Twin study provided WGS data for 45 individuals, including 10 monozygotic twin quartets from the Quebec Study of Newborn Twins [38]. All patients gave informed consent in written form to participate in the Quebec Study of Newborn Twins. Ethics boards from the Centre de Recherche du CHUM, from the Université Laval and from the Montreal Neurological Institute approved this study. DNA was extracted from blood and sequencing was done on an Illumina HiSeq 2500 (paired-end mode, fragment length 300 bp). The reads were aligned using a modified version of the Burrows-Wheeler Aligner [59] (bwa version 0.6.2-r126-tpx with threading enabled). The options were ‘bwa aln -t 12 -q 5’ and ‘bwa sampe -t 12’. Aligned reads are available on the European Nucleotide Archive under ENA PRJEB8308. The 45 samples had an average sequencing depth of 40x (minimum 34x / maximum 57x).

A cancer dataset from a study of renal cell carcinoma [39] was also used. 95 pairs of normal/tumor tissues were sequenced using GAIIx and HiSeq2000 instruments. Paired-end reads of size 100 bp totaled an average sequencing depth of 54x (minimum 26x / maximum 164x). Reads were trimmed with FASTX-Toolkit and mapped per lane with BWA [59] backtrack to the GRCh37 reference genome. Picard was used to adjust pairs coordinates, flag duplicates and merge lanes. Finally, realignment was done with GATK. Raw sequence data has been deposited in the European Genome-phenome Archive, under the accession code EGAS00001000083. More details can be found in Scelo et al. [39].

Testing for technical biases in WGS

To investigate the bias in read depth (RD), we fragmented the genome into non-overlapping bins of 5 Kbp and counted the number of properly mapped reads. In each sample, we corrected for GC bias and removed bins with extremely low or high coverage (see S1 Text). Then, read counts across all samples were combined and quantile-normalized. Using simulations and permutations, we constructed two control RD datasets with no region-specific or sample-specific bias. We computed the mean and standard deviation of the coverage in each bin across samples. Next, to investigate experiment-specific bias, we retrieved which sample had the highest coverage in each bin. Then we computed, for each sample, the proportion of the genome where it had the highest coverage. The same analysis was performed monitoring the lowest coverage. This analysis was performed separately on the CENet dataset, the Twin dataset and the normal samples from the cancer dataset. On the Twin dataset, the same analysis was also run after correcting the read coverage following the QDNAseq pipeline [40] (see S1 Text).
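
The sample-specific bias check described above can be illustrated with a minimal Python sketch. It assumes the coverage has already been GC-corrected and quantile-normalized; the function name `extreme_sample_proportions` and the toy values are hypothetical and are not taken from the PopSV code.

```python
from collections import Counter

def extreme_sample_proportions(coverage, highest=True):
    """Proportion of bins in which each sample has the most extreme coverage.

    coverage: dict mapping sample name -> list of normalized bin coverages
    (all lists have the same length). Ties go to the first sample listed.
    """
    samples = list(coverage)
    n_bins = len(coverage[samples[0]])
    pick = max if highest else min
    counts = Counter()
    for b in range(n_bins):
        winner = pick(samples, key=lambda s: coverage[s][b])
        counts[winner] += 1
    return {s: counts[s] / n_bins for s in samples}

# Toy data (made up): 3 samples, 4 bins.
toy = {
    "s1": [10, 12, 11, 30],
    "s2": [11, 11, 12, 10],
    "s3": [9, 13, 10, 12],
}
highest = extreme_sample_proportions(toy, highest=True)
lowest = extreme_sample_proportions(toy, highest=False)
```

In the absence of sample-specific bias, each of N samples would be expected to be the most extreme in roughly 1/N of the bins, which is the baseline drawn as a dotted line in S1 and S2 Figs.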

PopSV

The main idea behind PopSV is to assess whether the coverage observed in a given location of the genome diverges significantly from the coverage observed in a set of reference samples. PopSV was implemented in an R package (see Data and code availability). The genome is first segmented into bins and the number of reads with proper mapping in each bin is counted for each sample. In a typical design, the genome is segmented in non-overlapping consecutive windows of equal size, but custom designs could also be used. With PopSV, we propose a new normalization procedure which we call targeted normalization that retrieves, for each bin, other genomic regions with similar profile across the reference samples and uses these bins to normalize read coverage (see S1 Text). Our targeted normalization was compared to global approaches that adjust for the median coverage, or quantile-based approaches. After normalization, the value observed in each bin is compared with the profiles observed in the reference samples and a Z-score is calculated (Fig 1b). False Discovery Rate (FDR) is estimated based on these Z-score distributions and a bin is marked as abnormal based on a user-defined FDR threshold. Consecutive abnormal bins are merged and considered as one variant. In PopSV’s R package, circular binary segmentation [60] can also be used to merge bins into variant regions. Copy number was estimated by dividing the coverage in a region by the average coverage across the reference samples, multiplied by 2 (see S1 Text).
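
As a rough illustration of the testing, merging and copy-number steps, the following Python sketch works on already normalized coverage. It is a simplification under stated assumptions: a fixed threshold on |Z| stands in for the FDR-based threshold, consecutive abnormal bins are merged only when they point in the same direction, and all function names are hypothetical rather than part of the PopSV R package.

```python
import statistics

def z_scores(test_cov, ref_cov):
    """Z-score of the test sample in each bin.

    test_cov: list of normalized coverages, one per bin.
    ref_cov: list (one entry per bin) of lists of reference-sample coverages.
    """
    zs = []
    for x, refs in zip(test_cov, ref_cov):
        mu = statistics.mean(refs)
        sd = statistics.stdev(refs)
        zs.append((x - mu) / sd)
    return zs

def merge_abnormal(zs, threshold):
    """Merge consecutive abnormal bins of the same type into (start, end, type).

    Assumption: |Z| > threshold replaces the FDR-based cut-off of the paper.
    """
    calls, current = [], None
    for i, z in enumerate(zs):
        kind = "del" if z < -threshold else "dup" if z > threshold else None
        if kind and current and current[2] == kind and current[1] == i - 1:
            current[1] = i  # extend the open call
        else:
            if current:
                calls.append(tuple(current))
            current = [i, i, kind] if kind else None
    if current:
        calls.append(tuple(current))
    return calls

def copy_number(test_cov, ref_cov, start, end):
    """Coverage in the region divided by the average reference coverage, times 2."""
    region = sum(test_cov[start:end + 1])
    ref_mean = sum(statistics.mean(r) for r in ref_cov[start:end + 1])
    return 2 * region / ref_mean

# Toy data (made up): 5 bins, 5 reference samples with the same profile per bin.
ref = [[100, 102, 98, 101, 99] for _ in range(5)]
test = [100, 50, 52, 101, 150]
calls = merge_abnormal(z_scores(test, ref), threshold=3)
cn = [copy_number(test, ref, s, e) for s, e, _ in calls]
```

With these toy values, bins 1 and 2 are merged into one deletion with an estimated copy number close to 1, and bin 4 is called as a duplication with an estimate of 3.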

Validation and benchmark of PopSV

We compared PopSV to CNVnator [9], FREEC [10] and cn.MOPS [11], three popular RD methods that can be applied to WGS datasets. We also ran LUMPY [8] which uses an orthogonal mapping signal: the insert size, orientation and split mapping of paired reads. For LUMPY, all the CNVs (deletions and duplications) and intra-chromosomal translocations (labeled as ‘BND’ in Lumpy’s output) larger than 300 bp were kept for the upcoming analysis. These methods were run on the two publicly available datasets, using 5 Kbp bins for the RD methods.

First, we compared the frequency at which a region is affected by a CNV using the calls from the different methods. To investigate the presence of systematic calls in each method, we computed how many of the calls in a typical sample were called at different frequencies in the dataset, for example how many calls in one sample, on average, were called in more than 90% of the samples. In the Twin dataset, the samples were clustered using the CNV calls from each method. Different linkage criteria were used for the hierarchical clustering (see S1 Text). The Rand index estimated the concordance between the clustering and the known pedigree (family-level). Next, we measured the number of CNVs identified in each twin that were also found in their monozygotic twin. We removed calls present in more than 50% of the samples to ensure that systematic errors were not biasing our replication estimates. Hence, a replicated call is most likely true as it is present in a minority of samples but consistently in the twin pair. For CNVnator, LUMPY and PopSV, the eval1/eval2 columns, number of supporting reads and adjusted P-values (respectively) were used to gradually filter low-quality calls and explore their effect on the replication metrics. In addition to their replication, we annotated the calls when their region overlapped a call found by other methods in the same sample. For calls found by at least two methods, we computed the proportion of calls from a method found by each of the other methods.
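
The twin replication metric can be sketched in Python as follows. This is an illustrative simplification, not the authors' code: calls are plain `(chromosome, start, end)` tuples, any overlap of at least one base counts as a match, the CNV type is ignored, and the function names and toy calls are hypothetical.

```python
def overlaps(a, b):
    """True if two calls (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def call_frequency(call, calls_by_sample):
    """Fraction of samples carrying a call that overlaps `call`."""
    carriers = sum(
        any(overlaps(call, other) for other in calls)
        for calls in calls_by_sample.values()
    )
    return carriers / len(calls_by_sample)

def replication_rate(sample, twin, calls_by_sample, max_freq=0.5):
    """Number and proportion of calls in `sample` also found in `twin`.

    Calls present in more than `max_freq` of the samples are removed first,
    so that systematic calls do not inflate the replication estimate.
    """
    kept = [c for c in calls_by_sample[sample]
            if call_frequency(c, calls_by_sample) <= max_freq]
    if not kept:
        return 0, 0.0
    replicated = sum(
        any(overlaps(c, t) for t in calls_by_sample[twin]) for c in kept
    )
    return replicated, replicated / len(kept)

# Toy calls (made up): twinA/twinB are a twin pair, x and y are unrelated.
toy_calls = {
    "twinA": [("chr1", 1000, 6000), ("chr2", 0, 5000)],
    "twinB": [("chr1", 2000, 6000), ("chr2", 0, 5000)],
    "x": [("chr2", 0, 5000)],
    "y": [("chr2", 0, 5000), ("chr3", 0, 5000)],
}
twin_result = replication_rate("twinA", "twinB", toy_calls)
unrelated_result = replication_rate("y", "x", toy_calls)
```

Here the chr2 call is seen in every sample and is filtered out, so only the chr1 call contributes to the replication estimate of the twin pair.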

The approach described previously comparing pairs of twins was also applied in the cancer dataset, on pairs of normal/tumor samples. In this case, a replicated call is found in the normal sample and in the paired tumor sample. Finally, we compared calls using small bins (500 bp) and calls using larger bins (5 Kbp). This comparison explores the quality of the calls, the size of detectable events and the resolution for different bin sizes. First, we counted how many small bin calls supported any large bin call. We then looked at the proportion of small bin calls of different sizes that were also found in the large bin calls.

CNV detection in the CENet cohorts

CNVs were called with PopSV using 5 Kbp bins and all the samples from both the epilepsy and control cohorts as reference. We annotated the frequency of the CNVs using germline CNV calls from the Twin and cancer datasets (internal database) as well as four public CNV databases from the 1000 Genomes Project [13, 45], the Genome of Netherlands [44] and the Simons Genome Diversity Project [46]. CNVs were annotated with the maximum frequency in the databases. Hence, a rare CNV is defined as present in less than 1% of the samples in each of the five CNV databases.
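
The rarity rule amounts to thresholding the maximum frequency across databases. A minimal Python sketch, in which the database keys and frequencies are made-up placeholders:

```python
def max_database_frequency(freq_by_database):
    """Maximum carrier frequency of a CNV across all CNV databases."""
    return max(freq_by_database.values(), default=0.0)

def is_rare(freq_by_database, threshold=0.01):
    """A CNV is rare if it is below the threshold in every database."""
    return max_database_frequency(freq_by_database) < threshold

# Hypothetical annotations for two CNVs (database names are placeholders).
cnv_a = {"internal": 0.0, "db1": 0.004, "db2": 0.0, "db3": 0.002, "db4": 0.0}
cnv_b = {"internal": 0.0, "db1": 0.004, "db2": 0.0, "db3": 0.02, "db4": 0.0}
rare_a = is_rare(cnv_a)
rare_b = is_rare(cnv_b)
```

Taking the maximum is the conservative choice: a CNV that is frequent in any single database, such as `cnv_b` here, is not considered rare.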

To test for a difference in deletion/duplication ratio among rare CNVs, we compared the numbers of rare deletions and duplications in the epilepsy patients and controls using a χ2 test. The same test was performed after downsampling the controls to the sample size of the epilepsy cohort.
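
For a 2×2 table of rare deletion and duplication counts, the χ2 test can be sketched with the Python standard library alone. The sketch assumes no continuity correction (the text does not say whether one was applied) and the counts are made up for illustration.

```python
from math import erfc, sqrt

def chi2_2x2(table):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    table: ((a, b), (c, d)), e.g. rows = patients/controls and
    columns = rare deletions/rare duplications.
    Returns the statistic and the P-value.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Made-up counts: 60 deletions / 40 duplications vs 50 / 50.
stat, p_value = chi2_2x2(((60, 40), (50, 50)))
```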

Validation by Taqman RT-PCR

We first selected CNV calls in epilepsy patients that spanned at least 2 consecutive bins. We kept exonic CNVs of different sizes and overlapping a Taqman probe. A second batch of CNVs, containing small non-coding CNVs, was also sent for validation. Here, hundreds of non-coding CNVs spanning only one bin were randomly selected. When possible the breakpoints were manually fine-tuned from manual inspection of a base-pair level coverage representation or using IGV [61]; the breakpoints remained unchanged when they could not be refined. Finally, we kept regions overlapping a Taqman probe.

Probes were selected using the assay search tool on the Thermofisher website. All probes were tested for patients and controls that were called in PopSV as well as an additional 10 control individuals to ensure the validity of the probe. For each CNV, one assay was chosen in the middle of the genomic region of interest and located in an exon when possible. All reactions with TaqMan Copy Number Assays were performed in duplex using the FAM dye label based assay for the target of interest (Taqman copy number assay, Made to order, #4400291, Applied Biosystems by Life Technologies) and the VIC dye label based TaqMan Copy Number Reference Assay for RNase P (4403326, Life technologies). Amplification reactions (10μL), which were performed in quadruplicate, consisted of: 10 ng gDNA, 1X TaqMan Copy Number Assay, 1X TaqMan Copy Number Reference Assay, RNase P, 1X TaqMan Genotyping Master Mix (4371355, Life Technologies) or 1X SensiFAST Probe Lo-ROX Kit (BIO-84020, Froggabio). PCR was performed with an Applied Biosystems QuantStudio7 flex Real-Time PCR system using the standard curve settings and the default universal cycling conditions: 95 °C 10 minutes followed by 40 cycles: 95 °C 15 seconds, 60 °C 60 seconds. Data was analyzed with QuantStudio Real-Time PCR system software v1.2 (Applied Biosystems by Life Technologies) using autobaseline and manual Ct threshold of 0.2. Results export files were opened in CopyCaller Software v2.0 for sample copy number analysis by the relative quantitation method. The median ΔCt was used as the calibrator sample in the analysis settings.

CNV enrichment in exonic regions

For each cohort (epilepsy and control), we retrieved the CNV catalog by merging CNVs that are recurrent in multiple samples. Hence, the CNV catalog represents all the different CNVs found in each cohort. Because the epilepsy and control cohorts have different sample sizes, the CNV catalogs for each cohort were built using 150 randomly selected samples. For each sub-sampling and each cohort, control regions were selected to fit the size distribution of the CNV catalog and the overlap with centromeres, telomeres and assembly gaps (see S1 Text). The fold-enrichment represents how much more/less of the CNVs overlap an exon compared to the control regions. To robustly compare the two cohorts, we computed the median difference in fold-enrichment between the CNV catalogs from patients and controls across 100 sub-sampled catalogs. The cohort labels of the CNV catalogs were then permuted 10,000 times and the analysis repeated to derive a null distribution for the median difference in fold enrichment. A permuted P-value was computed from the observed difference and the null distribution.
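
A simplified Python sketch of the fold-enrichment and permutation steps is given below. It makes several assumptions that go beyond the text: the permutation shuffles per-catalog fold-enrichment values between cohorts rather than rebuilding catalogs, the P-value is one-sided with a +1 correction, and all names and numbers are hypothetical.

```python
import random
import statistics

def fold_enrichment(cnv_exonic, control_exonic):
    """Ratio of exon-overlapping proportions (CNVs vs matched control regions).

    Both arguments are lists of booleans, one per region.
    """
    cnv_prop = sum(cnv_exonic) / len(cnv_exonic)
    control_prop = sum(control_exonic) / len(control_exonic)
    return cnv_prop / control_prop

def median_difference(patient_fe, control_fe):
    """Difference between the median fold-enrichments of the two cohorts."""
    return statistics.median(patient_fe) - statistics.median(control_fe)

def permutation_null(patient_fe, control_fe, n_perm, rng):
    """Null distribution obtained by shuffling the cohort labels."""
    pooled = list(patient_fe) + list(control_fe)
    n = len(patient_fe)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null.append(median_difference(pooled[:n], pooled[n:]))
    return null

def empirical_pvalue(observed, null_values):
    """One-sided permuted P-value with a +1 correction (an assumption)."""
    extreme = sum(1 for v in null_values if v >= observed)
    return (extreme + 1) / (len(null_values) + 1)

# Made-up example: 30% of CNVs vs 20% of control regions overlap an exon.
fe = fold_enrichment([True] * 30 + [False] * 70, [True] * 20 + [False] * 80)

# Made-up fold-enrichments from 5 sub-sampled catalogs per cohort.
patients = [1.5, 1.6, 1.4, 1.7, 1.55]
controls = [1.0, 1.1, 0.9, 1.05, 1.2]
observed = median_difference(patients, controls)
null = permutation_null(patients, controls, n_perm=999, rng=random.Random(42))
p_value = empirical_pvalue(observed, null)
```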

Small (<50 Kbp) and large (>50 Kbp) CNVs were analyzed separately. Exons from genes predicted to be loss-of-function intolerant [47] (probability of loss-of-function intolerance > 0.9) were also analyzed separately. The same analysis was repeated using only rare CNVs, i.e. present in less than 1% of PopSV calls in the Twins and renal cancer datasets, and in four public datasets (see S1 Text).

In each cohort, we then retrieved the CNV catalog of rare exonic CNVs. We evaluated the proportion of the CNVs in the catalog that are private (i.e. seen in only one sample). The control cohort was down-sampled a thousand times to the same sample size as the epilepsy cohort to provide a confidence interval and empirical P-value (see S1 Text). We also visualized the proportion of CNVs in the catalog seen in 2 samples or more, 3 samples or more, etc. (S12 Fig). We performed the same analysis after removing the top 20 samples with the highest number of non-private rare exonic CNVs. The analysis was also repeated using French-Canadian individuals only.

CNV enrichment in and near epilepsy genes

We used the list of genes associated with epilepsy from the EpilepsyGene resource [48] which consists of 154 genes strongly associated with epilepsy. We tested different sets of CNVs: deletions or duplications in the epilepsy cohort, control individuals and samples from the twin study, and using different thresholds of maximum frequency. For each set of CNVs, we counted how many of the genes hit were known epilepsy genes. To control for the size of epilepsy genes and CNV-hit genes, we randomly selected genes with sizes similar to the genes hit by CNVs and evaluated how many were epilepsy genes. After sampling 10,000 gene sets, we computed an empirical P-value (see S1 Text).
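
The size-matched gene sampling can be approximated as follows in Python. This sketch matches genes through coarse size classes and samples with replacement, both of which are assumptions (the exact matching procedure is described in S1 Text, not here); gene names, sizes and function names are hypothetical.

```python
import random
from collections import defaultdict

def size_class(size, edges):
    """Index of the size class of a gene, given increasing class edges."""
    return sum(size >= e for e in edges)

def enrichment_pvalue(hit_genes, epilepsy_genes, gene_sizes, edges,
                      n_iter=1000, seed=1):
    """Empirical P-value for the number of epilepsy genes among CNV-hit genes.

    For each iteration, every hit gene is replaced by a random gene of the
    same size class (sampling with replacement, an assumption).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for gene, size in gene_sizes.items():
        by_class[size_class(size, edges)].append(gene)
    observed = sum(g in epilepsy_genes for g in hit_genes)
    as_extreme = 0
    for _ in range(n_iter):
        sampled = [rng.choice(by_class[size_class(gene_sizes[g], edges)])
                   for g in hit_genes]
        if sum(g in epilepsy_genes for g in sampled) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_iter + 1)

# Made-up gene sizes (bp); G3 plays the role of a known epilepsy gene.
sizes = {"G1": 1000, "G2": 1200, "G3": 50000,
         "G4": 60000, "G5": 900, "G6": 55000}
observed, p_value = enrichment_pvalue(
    hit_genes=["G3", "G1"], epilepsy_genes={"G3"},
    gene_sizes=sizes, edges=[10000],
)
```

In this toy example the large hit gene is one of three large genes, so about one random gene set in three contains the epilepsy gene and the P-value is unremarkable, as expected.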

To investigate rare non-coding CNVs close to known epilepsy genes, we counted how many patients have such a CNV at different thresholds of distance to the nearest exon. We compared this cumulative distribution to the control cohort, after down-sampling it to the same sample size as the epilepsy cohort. We performed the same analysis using deletions only. Each epilepsy gene was also tested for an excess of rare non-coding deletions in patients versus controls using a Fisher test. Next, we restricted our analysis to rare non-coding CNVs that overlap an eQTL associated with the epilepsy genes [50] or a DNase I hypersensitive site associated with the promoter of epilepsy genes [51]. A Kolmogorov-Smirnov test was used to test the difference in distribution. Finally, using different values for the maximum distance to the nearest epilepsy gene, we computed the odds ratio of having such a CNV between epilepsy patients and controls.
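
The per-gene Fisher test and the odds ratio can be written with the standard library only. The sketch below assumes a one-sided test for an excess in patients (the sidedness is not stated in the text) and uses made-up counts.

```python
from math import comb

def fisher_excess(a, b, c, d):
    """One-sided Fisher exact test for an excess of carriers among patients.

    Table layout: a = patient carriers, b = patient non-carriers,
                  c = control carriers, d = control non-carriers.
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    n_patients, n_controls = a + b, c + d
    n_carriers = a + c
    total = comb(n_patients + n_controls, n_carriers)
    upper = min(n_patients, n_carriers)
    tail = sum(comb(n_patients, x) * comb(n_controls, n_carriers - x)
               for x in range(a, upper + 1))
    return tail / total

def odds_ratio(a, b, c, d):
    """Odds ratio of carrying such a CNV, patients vs controls.

    Undefined (division by zero) when b or c is 0.
    """
    return (a * d) / (b * c)

# Made-up counts: 3 of 4 patients and 1 of 4 controls carry a deletion.
p_value = fisher_excess(3, 1, 1, 3)
ratio = odds_ratio(3, 1, 1, 3)
```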

Putatively pathogenic CNVs

Exonic CNVs larger than 10 Kbp and found in less than 1% of the 301 controls were first selected. We further retained either CNVs overlapping the exon of a known epilepsy-associated gene [48] or deletions overlapping the exon of a loss-of-function intolerant gene [47], or CNVs present in two or more of our epilepsy patients. All the putatively pathogenic CNVs were validated by Taqman RT-PCR.
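
The selection criteria above translate into a simple filter. In this Python sketch the `CNV` record and its field names are hypothetical; only the thresholds and the logic come from the text.

```python
from dataclasses import dataclass

@dataclass
class CNV:
    size_bp: int
    kind: str                       # "del" or "dup"
    exonic: bool
    control_freq: float             # proportion of the 301 controls carrying it
    hits_epilepsy_gene_exon: bool   # exon of a known epilepsy-associated gene
    hits_lof_intolerant_exon: bool  # exon of a loss-of-function intolerant gene
    n_patients: int                 # number of epilepsy patients carrying it

def is_putatively_pathogenic(cnv):
    """Exonic, larger than 10 Kbp, in less than 1% of controls, and at least
    one of the three additional criteria."""
    if not (cnv.exonic and cnv.size_bp > 10_000 and cnv.control_freq < 0.01):
        return False
    return (cnv.hits_epilepsy_gene_exon
            or (cnv.kind == "del" and cnv.hits_lof_intolerant_exon)
            or cnv.n_patients >= 2)

# Made-up examples.
dup_lof = CNV(25_000, "dup", True, 0.0, False, True, 1)
del_lof = CNV(25_000, "del", True, 0.0, False, True, 1)
small_del = CNV(5_000, "del", True, 0.0, True, True, 3)
recurrent_dup = CNV(40_000, "dup", True, 0.003, False, False, 2)
flags = [is_putatively_pathogenic(c)
         for c in (dup_lof, del_lof, small_del, recurrent_dup)]
```

Note that the loss-of-function criterion applies to deletions only, so the duplication `dup_lof` is not retained while the otherwise identical deletion is.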

Data and code availability

The PopSV R package and its documentation are available at http://jmonlong.github.io/PopSV/. Scripts are provided to run the pipeline on different high performance computing systems. The code used for the analysis and to produce figures and numbers is documented at http://github.com/jmonlong/epipopsv and archived in https://doi.org/10.5281/zenodo.1172181. Necessary data, including the CNV calls, was deposited at https://figshare.com/s/20dfdedcc4718e465185. Raw sequence data has been deposited in the European Genome-phenome Archive, under the accession code EGAS00001002825.

Supporting information

S1 Text. Supplementary text for the experiments and methods.

https://doi.org/10.1371/journal.pgen.1007285.s001

(PDF)

S1 Table. Clinical features of epileptic patients.

The Excel file contains the type of epilepsy, age of onset, sex, family history, pharmaco-resistance and potential intellectual disabilities.

https://doi.org/10.1371/journal.pgen.1007285.s002

(XLSX)

S2 Table. PopSV calls validated by RT-PCR.

The Excel file contains the location of each region, the CNV type, the number of carriers in the CENet cohorts, the maximum proportion of carriers in the CNV databases, Taqman probe ID and validation status.

https://doi.org/10.1371/journal.pgen.1007285.s003

(XLSX)

S1 Fig. Variation and bias in whole-genome sequencing experiments in the epilepsy cohort.

a) Distribution of the bin inter-sample standard deviation coverage (red) and null distribution (blue: bins shuffled, green: simulated normal distribution). b) Proportion of the genome in which a given sample (x-axis) has the highest (red) or lowest (blue) RD. In the absence of bias all samples should be the most extreme at the same frequency (dotted horizontal line).

https://doi.org/10.1371/journal.pgen.1007285.s005

(PDF)

S2 Fig. Variation and bias in whole-genome sequencing experiments in the normals from CageKid (a,d,g), the twin dataset (b,e,h) and the twin dataset after using QDNAseq [40] correction (c,f,i).

a-c) Distribution of the bin inter-sample standard deviation coverage (red) and null distribution (blue: bins shuffled, green: simulated normal distribution). d-f) Same for the bin inter-sample standard deviation coverage. g-i) Proportion of the genome in which a given sample (x-axis) has the highest (red) or lowest (blue) RD. In the absence of bias all samples should be the most extreme at the same frequency (dotted horizontal line).

https://doi.org/10.1371/journal.pgen.1007285.s006

(PDF)

S3 Fig. Comparison of different normalization approaches.

a) For each normalization approach, the sample with the least normal Z-score distribution is shown. b) After targeted normalization, a lower proportion of the genome looks problematic for the analysis. Fewer bins have non-normal bin counts (top-left), the sample ranks are more random suggesting less sample-specific bias (top-right), and Z-scores fit better a Normal distribution on average (bottom-left) and in the worst sample (bottom-right). The dotted line is computed from simulated bin counts.

https://doi.org/10.1371/journal.pgen.1007285.s007

(PDF)

S4 Fig. Frequency of calls in an average sample from the twin study.

The bars show the proportion of calls in an average sample (y-axis), grouped by the frequency of the call in the dataset (x-axis), for different methods.

https://doi.org/10.1371/journal.pgen.1007285.s008

(PDF)

S5 Fig. CNV clustering and twin pedigree.

The hierarchical cluster tree from the CNV calls is cut at different levels (x-axis), cluster groups are compared to the known pedigree using the Rand index (y-axis). Different clustering linkage criteria (point style) are used and the one showing the best Rand index is highlighted by the line.

https://doi.org/10.1371/journal.pgen.1007285.s009

(PDF)

S6 Fig. Replication in monozygotic twins for different significance thresholds.

Each point represents the number of replicated calls per sample (average across samples) and the proportion of replicated calls per sample. The vertical error bar shows the variation of the replication rate across the samples. The points and lines were computed by filtering calls at different significance levels (q-value for PopSV, number of supporting reads for LUMPY and eval1/eval2 for CNVnator, see S1 Text).

https://doi.org/10.1371/journal.pgen.1007285.s010

(PDF)

S7 Fig. Calls found by several methods.

Focusing on calls found by at least two methods, the heatmap shows the proportion of calls from one method (x-axis) that were also found by another (y-axis) on average per sample.

https://doi.org/10.1371/journal.pgen.1007285.s011

(PDF)

S8 Fig. Benchmark across paired normal/tumor in CageKid.

Number (a) and proportion (b) of germline calls replicated in the paired tumor in CageKid. c) Number and proportion of replicated calls when filtering calls at different significance levels. d) Focusing on calls found by at least two methods, the color shows the proportion of calls from one method (x-axis) that were also found by another (y-axis) on average per sample.

https://doi.org/10.1371/journal.pgen.1007285.s012

(PDF)

S9 Fig. Comparison of PopSV results using different bin sizes.

a) 5 Kbp calls of different sizes (x-axis) are split according to the proportion of the call supported by 500 bp calls. The Z-score of 500 bp bins in 5 Kbp calls is consistent with the call for deletion b) and duplication c) signal. 5 Kbp calls with lower significance (e.g. single-bin calls) are less supported by 500 bp calls (a) but their Z-scores are in the consistent direction (b,c) although not always significant enough to be called. d) Proportion of 500 bp calls of different sizes (x-axis) overlapping a 5 Kbp call.

https://doi.org/10.1371/journal.pgen.1007285.s013

(PDF)

S10 Fig. CNV size in our cohort and four array-based studies.

The bars show the average number of CNVs called in a sample in the different cohorts. Redon 2006 [42] and Itsara 2009 [43] are population studies using technology similar to previous epilepsy studies. Addis 2016 [34] is a recent study of large CNVs in absence epilepsy. Conrad 2010 [4] is a population study that used multiple arrays to increase its resolution.

https://doi.org/10.1371/journal.pgen.1007285.s014

(PDF)

S11 Fig. Exonic enrichment significance.

The grey violin plot represents the difference in fold-enrichment between patients and controls across 10,000 permutations where the patient/control labels had been shuffled. The red point represents the observed difference between patients and controls (Fig 2c).

https://doi.org/10.1371/journal.pgen.1007285.s015

(PDF)

S12 Fig. Rare exonic CNVs are less private in the epilepsy cohort.

Proportion of rare exonic CNVs (y-axis) seen in X or more individuals (x-axis). The ribbon shows the 5%–95% confidence interval. In b), only French-Canadian individuals were analyzed and we down-sampled the epilepsy cohort to match the sample size of the French-Canadian controls. In c), the top 20 samples with the most non-private rare exonic CNVs were removed.

https://doi.org/10.1371/journal.pgen.1007285.s016

(PDF)

S13 Fig. Enrichment in epilepsy genes.

a) Epilepsy genes (red) are genes known to be associated with epilepsy. The control genes (dotted blue) are random genes selected so that the size distribution is similar to the sizes of genes hit by CNVs (plain blue). b) In three different datasets (color), genes hit by rare deletion (top) or duplications (bottom) at different frequency thresholds (x-axis) were tested for enrichment in epilepsy genes (y-axis, point-size).

https://doi.org/10.1371/journal.pgen.1007285.s017

(PDF)

S14 Fig. Rare non-coding CNVs near epilepsy genes.

The graphs show the cumulative number of individuals (y-axis) with a rare non-coding variant located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. The controls were down-sampled to the sample size of the epilepsy cohort. The ribbon shows the 5%/95% confidence interval. In a), deletions and duplications were considered; in b), only deletions were used.

https://doi.org/10.1371/journal.pgen.1007285.s018

(PDF)

S15 Fig. Non-coding CNVs with putative pathogenicity.

a) 2.7 Kbp deletion in an epilepsy patient, never seen in controls or CNV databases. Three other epilepsy patients have a rare non-coding deletion located at less than 200 Kbp from the GABRD gene. b) 8.8 Kbp duplication in two epilepsy patients, never seen in controls or CNV databases and overlapping a regulatory region associated with CSNK1E. c) 6.5 Kbp deletion of an ultra-conserved region downstream of FAM63B. Two expression QTLs for this gene are highlighted with arrows.

https://doi.org/10.1371/journal.pgen.1007285.s019

(PDF)

S16 Fig. The enrichment in rare non-coding CNVs overlapping functional regions increases close to epilepsy genes.

The graph shows the log odds ratio of having a rare non-coding CNV located at X Kbp or less (x-axis) from the exonic sequence of a known epilepsy gene. The y-axis shows the log odds ratio between epilepsy patients and controls. The controls were down-sampled to the sample size of the epilepsy cohort. We used CNVs overlapping regions functionally associated with the epilepsy gene (eQTL or promoter-associated DNase site).

https://doi.org/10.1371/journal.pgen.1007285.s020

(PDF)

S17 Fig. Small deletion of exon 13 in CHD2.

Abnormal mapping of the read pairs highlighted in red support the deletion detected by PopSV using the read coverage. The deletion region is highlighted in orange.

https://doi.org/10.1371/journal.pgen.1007285.s021

(PDF)

S18 Fig. Reference cohort size and CNV detection quality.

PopSV was run on the Twins study using 10, 20, 30 or 45 samples as reference (color). In a), the y-axis shows how many calls from the down-sampled run were found in the original 45-refs run. The x-axis represents the FDR threshold (lower threshold being more stringent). b) Replication in monozygotic twins. For different cohort sizes and FDR thresholds, the number (x-axis) and proportion (y-axis) of calls replicated in the other monozygotic twin are shown. In both graphs, the lines represent the median per sample and the ribbon the minimum/maximum values.

https://doi.org/10.1371/journal.pgen.1007285.s022

(PDF)

S19 Fig. Targeted normalization.

The coverage across the reference samples (blue) in the bin to normalize is used to find supporting bins across the genome. These supporting bins only are used to compute the normalization factor. The same supporting bins will be used to normalize the bin count in a test sample (red).

https://doi.org/10.1371/journal.pgen.1007285.s023

(PDF)

Acknowledgments

Data analyses were enabled by compute and storage resources provided by Compute Canada and Calcul Québec. We would like to thank Pascale Marquis at the Canadian Centre for Computational Genomics for processing the raw sequencing data to genomic variant calls and for her active participation in various quality assessment steps. The Canadian Centre for Computational Genomics (C3G) is a Node of the Canadian Genomic Innovation Network and is supported by the Canadian Government through Genome Canada. We are grateful to the team of the Québec Study of Newborn Twins who provided the twin dataset and the Cagekid consortium who provided the renal cancer dataset. We would like to thank Sylvia Dobrzeniecka for sample handling and lab work. We are grateful to Dr. Ledia Brunga for her work on the epileptic cohort and to Brianna Goldenstein and Claudia Moreau for revising this manuscript. Finally, we would like to thank Simon Gravel, Mathieu Blanchette, Mathieu Bourgey, Toby Dylan Hocking and Claudia Moreau for helpful discussions.

References

  1. 1. Hall IM, Quinlan AR. Detection and Interpretation of Genomic Structural Variation in Mammals. In: Methods in Molecular Biology. vol. 838. Springer Science; 2012. p. 225–248.
  2. 2. Sharp AJ, Cheng Z, Eichler EE. Structural Variation of the Human Genome. Annual Review of Genomics and Human Genetics. 2006;7(1):407–442. pmid:16780417
  3. 3. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011;470(7332):59–65. pmid:21293372
  4. 4. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464(7289):704–712. pmid:19812545
  5. 5. Spielmann M, Klopocki E. CNVs of noncoding cis-regulatory elements in human disease. Current Opinion in Genetics & Development. 2013;23(3):249–256.
  6. 6. Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013;14(Suppl 11):S1.
  7. 7. Pirooznia M, Goes F, Zandi PP. Whole-genome CNV analysis: Advances in computational approaches. Frontiers in Genetics. 2015;6(MAR):1–9.
  8. 8. Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biology. 2014;15(6):R84. pmid:24970577
  9. 9. Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research. 2011;21(6):974–984. pmid:21324876
  10. 10. Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, et al. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics. 2011;27(2):268–269. pmid:21081509
  11. 11. Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, et al. Cn.MOPS: Mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Research. 2012;40(9):e69–e69. pmid:22302147
  12. 12. Glusman G, Severson A, Dhankani V, Robinson M, Farrah T, Mauldin DE, et al. Identification of copy number variants in whole-genome data using reference coverage profiles. Frontiers in Genetics. 2015;5(FEB):1–13.
  13. 13. Handsaker RE, Van Doren V, Berman JR, Genovese G, Kashin S, Boettger LM, et al. Large multiallelic copy number variations in humans. Nature Genetics. 2015;47(3):296–303. pmid:25621458
  14. 14. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics. 2011;13(1):36–46. pmid:22124482
  15. 15. Teo SM, Pawitan Y, Ku CS, Chia KS, Salim A. Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics. 2012;28(21):2711–2718. pmid:22942022
  16. 16. Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research. 2012;40(10):e72–e72. pmid:22323520
  17. 17. Koren A, Handsaker RE, Kamitaki N, Karlić R, Ghosh S, Polak P, et al. Genetic variation in human DNA replication timing. Cell. 2014;159(5):1015–1026. pmid:25416942
  18. 18. van Dijk EL, Jaszczyszyn Y, Thermes C. Library preparation methods for next-generation sequencing: tone down the bias. Experimental Cell Research. 2014;322(1):12–20. pmid:24440557
  19. 19. Cheung MS, Down TA, Latorre I, Ahringer J. Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Research. 2011;39(15):e103–e103. pmid:21646344
  20. 20. Berkovic SF, Howell RA, Hay DA, Hopper JL. Epilepsies in twins: Genetics of the major epilepsy syndromes. Annals of Neurology. 1998;43(4):435–445. pmid:9546323
  21. 21. Zara F, Bianchi A, Avanzini G, Di Donato S, Castellotti B, Patel PI, et al. Mapping of genes predisposing to idiopathic generalized epilepsy. Human Molecular Genetics. 1995;4(7):1201–7. pmid:8528209
  22. 22. Kasperavičiūtė D, Catarino CB, Heinzen EL, Depondt C, Cavalleri GL, Caboclo LO, et al. Common genetic variation and susceptibility to partial epilepsies: A genome-wide association study. Brain. 2010;133(7):2136–2147. pmid:20522523
  23. 23. Guo Y, Baum LW, Sham PC, Wong V, Ng PW, Lui CHT, et al. Two-stage genome-wide association study identifies variants in CAMSAP1L1 as susceptibility loci for epilepsy in Chinese. Human Molecular Genetics. 2012;21(5):1184–1189. pmid:22116939
  24. 24. International League Against Epilepsy Consortium on Complex Epilepsies. Genetic determinants of common epilepsies: a meta-analysis of genome-wide association studies. The Lancet Neurology. 2014;13(9):893–903. pmid:25087078
  25. 25. Escayg A, MacDonald BT, Meisler MH, Baulac S, Huberfeld G, An-Gourfinkel I, et al. Mutations of SCN1A, encoding a neuronal sodium channel, in two families with GEFS+2. Nature Genetics. 2000;24(4):343–5. pmid:10742094
  26. 26. Claes L, Ceulemans B, Audenaert D, Smets K, Löfgren A, Del-Favero J, et al. De novo SCN1A mutations are a major cause of severe myoclonic epilepsy of infancy. Human Mutation. 2003;21(6):615–21. pmid:12754708
  27. 27. Helbig I, Mefford HC, Sharp AJ, Guipponi M, Fichera M, Franke A, et al. 15q13.3 microdeletions increase risk of idiopathic generalized epilepsy. Nature Genetics. 2009;41(2):160–2. pmid:19136953
  28. 28. de Kovel CGF, Trucks H, Helbig I, Mefford HC, Baker C, Leu C, et al. Recurrent microdeletions at 15q11.2 and 16p13.11 predispose to idiopathic generalized epilepsies. Brain. 2010;133(Pt 1):23–32.
  29. 29. Biervert C. A Potassium Channel Mutation in Neonatal Human Epilepsy. Science. 1998;279(5349):403–406. pmid:9430594
  30. 30. Mefford HC, Muhle H, Ostertag P, von Spiczak S, Buysse K, Baker C, et al. Genome-Wide Copy Number Variation in Epilepsy: Novel Susceptibility Loci in Idiopathic Generalized and Focal Epilepsies. PLoS Genetics. 2010;6(5):e1000962. pmid:20502679
  31. 31. Helbig I, Swinkels MEM, Aten E, Caliebe A, van’t Slot R, Boor R, et al. Structural genomic variation in childhood epilepsies with complex phenotypes. European Journal of Human Genetics. 2014;22(7):896–901. pmid:24281369
  32. 32. Mefford H. Copy number variant analysis from exome data in 349 patients with epileptic encephalopathy. Annals of Neurology. 2015;78(2):323–328.
  33. 33. Lal D, Ruppert AK, Trucks H, Schulz H, de Kovel CG, Kasteleijn-Nolst Trenité D, et al. Burden Analysis of Rare Microdeletions Suggests a Strong Impact of Neurodevelopmental Genes in Genetic Generalised Epilepsies. PLOS Genetics. 2015;11(5):e1005226. pmid:25950944
  34. 34. Addis L, Rosch RE, Valentin A, Makoff A, Robinson R, Everett KV, et al. Analysis of rare copy number variation in absence epilepsies. Neurology Genetics. 2016;2(2):e56. pmid:27123475
  35. 35. Mefford HC, Yendle SC, Hsu C, Cook J, Geraghty E, McMahon JM, et al. Rare copy number variants are an important cause of epileptic encephalopathies. Annals of Neurology. 2011;70(6):974–985. pmid:22190369
  36. 36. Heinzen EL, Radtke RA, Urban TJ, Cavalleri GL, Depondt C, Need AC, et al. Rare deletions at 16p13.11 predispose to a diverse spectrum of sporadic epilepsy syndromes. The American Journal of Human Genetics. 2010;86(5):707–18. pmid:20398883
  37. 37. Striano P. Clinical Significance of Rare Copy Number Variations in Epilepsy. Archives of Neurology. 2012;69(3):322. pmid:22083797
  38. 38. Boivin M, Brendgen M, Dionne G, Dubois L, Pérusse D, Robaey P, et al. The Quebec Newborn Twin Study Into Adolescence: 15 Years Later. Twin Research and Human Genetics. 2013;16(01):64–69. pmid:23200437
  39. 39. Scelo G, Riazalhosseini Y, Greger L, Letourneau L, Gonzàlez-Porta M, Wozniak MB, et al. Variation in genomic landscape of clear cell renal cell carcinoma across Europe. Nature Communications. 2014;5(May):5135. pmid:25351205
  40. 40. Scheinin I, Sie D, Bengtsson H, van de Wiel MA, Olshen AB, van Thuijl HF, et al. DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. Genome Research. 2014;24(12):2022–32. pmid:25236618
  41. 41. Xi R, Hadjipanayis AG, Luquette LJ, Kim TM, Lee E, Zhang J, et al. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proceedings of the National Academy of Sciences. 2011;108(46):E1128–E1136.
  42. 42. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, et al. Global variation in copy number in the human genome. Nature. 2006;444(7118):444–454. pmid:17122850
  43. 43. Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, et al. Population Analysis of Large Copy Number Variants and Hotspots of Human Genetic Disease. The American Journal of Human Genetics. 2009;84(2):148–161. pmid:19166990
  44. 44. Francioli LC, Menelaou A, Pulit SL, van Dijk F, Palamara PF, Elbers CC, et al. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics. 2014;46(8):818–825.
  45. 45. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81. pmid:26432246
  46. 46. Sudmant PH, Mallick S, Nelson BJ, Hormozdiari F, Krumm N, Huddleston J, et al. Global diversity, population stratification, and selection of human copy-number variation. Science. 2015;349(6253):aab3761–aab3761. pmid:26249230
  47. 47. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–291. pmid:27535533
  48. 48. Ran X, Li J, Shao Q, Chen H, Lin Z, Sun ZS, et al. EpilepsyGene: A genetic resource for genes and mutations related to epilepsy. Nucleic Acids Research. 2015;43(D1):D893–D899. pmid:25324312
  49. 49. Delgado-Escueta AV, Koeleman BPC, Bailey JN, Medina MT, Durón RM. The quest for Juvenile Myoclonic Epilepsy genes. Epilepsy & Behavior. 2013;28:S52–S57.
  50. 50. Ardlie KG, Deluca DS, Segre AV, Sullivan TJ, Young TR, Gelfand ET, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science. 2015;348(6235):648–660.
  51. 51. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012;337(6099):1190–1195. pmid:22955828
  52. 52. Dibbens LM, de Vries B, Donatello S, Heron SE, Hodgson BL, Chintawar S, et al. Mutations in DEPDC5 cause familial focal epilepsy with variable foci. Nature Genetics. 2013;45(5):546–551. pmid:23542697
  53. 53. Ishida S, Picard F, Rudolf G, Noé E, Achaz G, Thomas P, et al. Mutations of DEPDC5 cause autosomal dominant focal epilepsies. Nature Genetics. 2013;45(5):552–555. pmid:23542701
  54. 54. Galizia EC, Myers CT, Leu C, de Kovel CGF, Afrikanova T, Cordero-Maldonado ML, et al. CHD2 variants are a risk factor for photosensitivity in epilepsy. Brain. 2015;138(5):1198–1208. pmid:25783594
  55. 55. Elia J, Gai X, Xie HM, Perin JC, Geiger E, Glessner JT, et al. Rare structural variants found in attention-deficit hyperactivity disorder are preferentially associated with neurodevelopmental genes. Molecular Psychiatry. 2010;15(6):637–646. pmid:19546859
  56. 56. Choucair N, Mignon-Ravix C, Cacciagli P, Abou Ghoch J, Fawaz A, Mégarbané A, et al. Evidence that homozygous PTPRD gene microdeletion causes trigonocephaly, hearing loss, and intellectual disability. Molecular Cytogenetics. 2015;8(1):39. pmid:26082802
  57. 57. Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, Regan R, et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature. 2010;466(7304):368–372. pmid:20531469
  58. 58. Speed D, Hoggart C, Petrovski S, Tachmazidou I, Coffey A, Jorgensen A, et al. A genome-wide association study and biological pathway analysis of epilepsy prognosis in a prospective cohort of newly treated epilepsy. Human Molecular Genetics. 2014;23(1):247–58. pmid:23962720
  59. 59. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589–595. pmid:20080505
  60. 60. Seshan V, Olshen A. DNAcopy: DNA copy number data analysis. R package version 1.50.1. 2017.
  61. 61. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics. 2013;14(2):178–92. pmid:22517427