Conceived and designed the experiments: JYHK CW HGB CPP JAV. Performed the experiments: JYHK NW. Analyzed the data: JYHK NW CW RP CPP JAV. Contributed reagents/materials/analysis tools: RP CG BBAdV. Wrote the paper: JYHK NW CW CPP JAV.
The authors have declared that no competing interests exist.
Copy number variants (CNVs) have recently been recognized as a common form of genomic variation in humans. Hundreds of CNVs can be detected in any individual genome using genomic microarrays or whole genome sequencing technology, but their phenotypic consequences are still poorly understood. Rare CNVs have been reported as a frequent cause of neurological disorders such as mental retardation (MR), schizophrenia and autism, prompting widespread implementation of CNV screening in diagnostics. In previous studies we have shown that, in contrast to benign CNVs, MR-associated CNVs are significantly enriched in genes whose mouse orthologues, when disrupted, result in a nervous system phenotype. In this study we developed and validated a novel computational method for differentiating between benign and MR-associated CNVs using structural and functional genomic features to annotate each CNV. In total 13 genomic features were included in the final version of a Naïve Bayesian Tree classifier, with LINE density and mouse knock-out phenotypes contributing most to the classifier's accuracy. After demonstrating that our method (called GECCO) perfectly classifies CNVs causing known MR-associated syndromes, we show that it achieves high accuracy (94%) and negative predictive value (99%) on a blinded test set of more than 1,200 CNVs from a large cohort of individuals with MR. These results indicate that this classification method will be of value for objectively prioritizing CNVs in clinical research and diagnostics.
Rare copy number variants (CNVs) are a frequent cause of neurological disorders such as mental retardation (MR). However, CNVs are also commonly identified in healthy individuals. It is therefore crucial for both diagnostic and research applications to be able to distinguish between disease-causing CNVs and “benign” CNVs occurring as normal genomic variation. Separating these two types can take advantage of significant differences in their genomic content. For example, benign CNVs are enriched in repetitive sequences. By contrast, CNVs associated with MR tend to have high densities of functional elements, including genes whose mouse orthologues, when knocked out, lead to specific nervous system abnormalities. We have developed a novel objective approach that is effective in distinguishing MR-associated CNVs from benign CNVs based on 13 genomic attributes. This method achieves high accuracy both in a cohort of CNVs known to cause MR and in a cohort of individuals with unexplained MR. The development of this technique promises to substantially improve the methodology for determining the pathogenicity of CNVs.
Improvements in microarray resolution and hybridization robustness have led to the widespread adoption of genomic microarray technologies in medical research and diagnostics. This technology is most effective at detecting genomic deletions and duplications larger than 1 kb, known as copy number variants (CNVs). Genomic microarrays are commonly used to identify rare but highly penetrant CNVs, typically a single CNV per patient, in patients suffering from neurological disorders such as autism
At present up to 5% of the human genome has been shown to vary in large scale copy number in numerous healthy controls
Although a large number of methods are available for the computational prioritization and classification of genomic data
We started by selecting genomic features
Genomic Feature | Structural | Functional | Categorical | Continuous | |
Type (Gain/Loss) | * | * | |||
Length | * | * | |||
# LINEs | * | * | |||
LINE density | * | * | |||
# SINEs | * | * | |||
SINE density | * | * | |||
# Segmental Duplications | * | * | |||
Segmental Duplication Density | * | * | |||
# Genes | * | * | |||
Gene Density | * | * | |||
* | * | ||||
* | * | ||||
* | * | ||||
KEGG Pathway (hsa01510) | * | * | |||
MGI Phenotype (MP:0003631) | * | * | |||
Gene expression | * | * |
Each feature is either categorical or a continuous numerical feature. Furthermore, each feature relates to either a structural genomic attribute or a functional genomic attribute.
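The relationship between the count features and their density counterparts in the table above can be sketched in a few lines. This is an illustration only: the `CNV` class, its field names and the per-megabase normalisation are assumptions made for the example, since the units used for the density features are not stated here.

```python
from dataclasses import dataclass


@dataclass
class CNV:
    """Hypothetical record holding raw annotations for one CNV."""
    cnv_type: str    # categorical: "gain" or "loss"
    length_bp: int   # continuous: length in base pairs
    n_lines: int     # number of LINE elements in the region
    n_sines: int     # number of SINE elements in the region
    n_seg_dups: int  # number of segmental duplications in the region
    n_genes: int     # number of genes in the region


def density(count: int, length_bp: int) -> float:
    """Elements per megabase (the unit is an assumption, not from the paper)."""
    if length_bp <= 0:
        raise ValueError("CNV length must be positive")
    return count * 1_000_000 / length_bp


def structural_and_gene_features(cnv: CNV) -> dict:
    """Build a feature dictionary pairing each count with its density."""
    features = {"type": cnv.cnv_type, "length": cnv.length_bp}
    counts = {
        "line": cnv.n_lines,
        "sine": cnv.n_sines,
        "seg_dup": cnv.n_seg_dups,
        "gene": cnv.n_genes,
    }
    for name, count in counts.items():
        features[f"n_{name}"] = count
        features[f"{name}_density"] = density(count, cnv.length_bp)
    return features


# Made-up example values, not data from the study.
example = CNV("loss", 2_000_000, 300, 500, 4, 20)
features = structural_and_gene_features(example)
```

Keeping both the raw count and the length-normalised density lets a classifier separate the effect of CNV size from the effect of element enrichment.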
We identified a total of 16 genomic features as suitable attributes for the classifier. These could be divided into (1) structural features, such as segmental duplication density, and (2) functional features, such as gene density
The relative frequencies of the two different classes of CNV in the training set are very different (they are ‘imbalanced’). MR-associated CNVs are identified in ∼10% of MR patients screened and, for these, in the large majority of cases MR is attributable to only a single CNV (see
This figure shows the relationship between the fraction of available benign CNVs used in the training set and the accuracy of the classifier (calculated over 1,000 independent test and training runs). Maximum accuracy is achieved with a similar number of MR-associated and benign CNVs in the training set (∼5% of the benign CNV instances available).
A consequence of using a balanced training set with equal numbers of MR-associated and benign CNVs is that not all available benign CNVs are used during training. In order to select the optimal training set we randomly re-sampled the training set over 10,000 iterations selecting 82 MR-associated CNVs and 82 benign CNVs, with the remaining benign CNVs being placed in the test set. A mean accuracy of 86% (±2.8%) was obtained from these iterations, which demonstrates that the classifier achieves a reasonable level of accuracy irrespective of which benign CNVs are selected for the training set. In addition, this analysis identified an optimal subset of CNVs for training which achieved a maximum accuracy of 95.7% and an area under the ROC curve of 0.98 when classifying the test set of CNVs. The resulting classifier using this optimal training set contains 5 tree nodes with univariate splits based on the CNV length, and on the segmental duplication, LINE, SINE and gene densities. The 6 leaves of the tree each contain a different Bayesian classifier based on all features used during training.
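The repeated balanced resampling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical caller-supplied function that trains a classifier on the training pairs and returns its accuracy on the test pairs, and placing the undrawn MR-associated CNVs in the test set alongside the undrawn benign CNVs is also an assumption.

```python
import random
import statistics


def balanced_split(mr_cnvs, benign_cnvs, n_per_class, rng):
    """Draw n_per_class CNVs from each class for training; hold out the rest."""

    def draw(items):
        chosen = set(rng.sample(range(len(items)), n_per_class))
        train = [x for i, x in enumerate(items) if i in chosen]
        held_out = [x for i, x in enumerate(items) if i not in chosen]
        return train, held_out

    mr_train, mr_test = draw(mr_cnvs)
    benign_train, benign_test = draw(benign_cnvs)
    # Label 1 = MR-associated, label 0 = benign.
    train = [(x, 1) for x in mr_train] + [(x, 0) for x in benign_train]
    test = [(x, 1) for x in mr_test] + [(x, 0) for x in benign_test]
    return train, test


def resample(mr_cnvs, benign_cnvs, n_per_class, n_iter, evaluate, seed=0):
    """Repeat the balanced split and track mean and best accuracy."""
    rng = random.Random(seed)
    accuracies = []
    best_accuracy, best_train = -1.0, None
    for _ in range(n_iter):
        train, test = balanced_split(mr_cnvs, benign_cnvs, n_per_class, rng)
        accuracy = evaluate(train, test)  # hypothetical train-and-score step
        accuracies.append(accuracy)
        if accuracy > best_accuracy:
            best_accuracy, best_train = accuracy, train
    return statistics.mean(accuracies), best_accuracy, best_train
```

Seeding the random number generator makes the selected training subset reproducible, which matters when one particular subset is later kept as the optimal training set.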
The optimal training set was obtained by training the classifier on all 16 available features. To quantify the contribution of each feature to the accuracy of the classifier we used a leave-one-out policy for each feature, retrained the classifier and then measured the percentage decline in classification accuracy (
Both structural and functional genomic features are evaluated for their impact on classification accuracy. This analysis is performed by measuring the decrease in accuracy of the classifier as each classification feature is removed individually. KEGG Pathway refers to the CNV region containing at least one gene implicated in a KEGG neurodegenerative pathway, and MGI Pheno refers to the CNV region containing at least one gene displaying a nervous system phenotype in a knockout mouse. Gene Expression refers to the stability of gene expression of genes present in the CNV. Removal of the LINE density from the classifier results in the largest decrease in accuracy (6%) whilst removing MGI knockout phenotypes results in a drop of 5% in accuracy. The number of SINE elements, the non-synonymous substitution rate (
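The leave-one-feature-out analysis described above can be illustrated with a short sketch. It assumes a hypothetical `evaluate` function that retrains the classifier on a given list of feature names and returns its accuracy, and it reports the absolute difference in accuracy, which may differ from how the percentage decline was computed in the study. The weights in the toy evaluator are invented for the example and are not study data.

```python
def leave_one_feature_out(features, evaluate):
    """Return {feature: drop in accuracy when that feature is removed}."""
    baseline = evaluate(list(features))
    drops = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        drops[feature] = baseline - evaluate(reduced)
    return drops


# Toy stand-in for "retrain and score": each feature adds a fixed,
# made-up contribution to a made-up base accuracy.
toy_weights = {"line_density": 0.06, "mgi_phenotype": 0.05, "n_sines": 0.0}


def toy_evaluate(feature_names):
    return 0.80 + sum(toy_weights[f] for f in feature_names)


drops = leave_one_feature_out(list(toy_weights), toy_evaluate)
most_important = max(drops, key=drops.get)
```

Features whose removal causes no measurable drop are candidates for exclusion from the final model.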
The Decipher database of known syndromes associated with genomic structural variants (
We performed a second more extensive study to validate the accuracy of the classifier using an independent set of 584 MR patients in which 1,203 CNVs (the set “MR diagnostics”) had been identified during routine diagnostics using Affymetrix 250k SNP microarrays. These CNVs were identified as being associated with MR (
The CNVs are ranked and their probability of belonging to the MR-associated CNV class is plotted. (A) 1,203 CNVs with known inheritance collected from routine diagnostics are classified with a sensitivity of 88% and a specificity of 94%. 1,085 of the 1,154 common inherited CNVs were correctly classified as benign (blue), and 43 of the 49 CNVs previously associated with MR were correctly classified as MR-associated (green). The remaining 6 CNVs previously associated with MR were classified as benign (red), and 69 CNVs previously interpreted as benign were classified as MR-associated (purple). (B) Similarly, 41 rare inherited CNVs of unknown clinical significance are classified: 27 were classified as MR-associated (MR distance >0.5, green) and 14 as benign (MR distance <0.5, blue).
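The performance figures quoted for the validation set follow directly from the counts given in the caption above. The sketch below recomputes them, treating the 49 CNVs previously associated with MR as positives and the 1,154 common inherited CNVs as negatives; the function and variable names are chosen for this example.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # MR-associated CNVs recovered
        "specificity": tn / (tn + fp),  # benign CNVs recovered
        "accuracy": (tp + tn) / total,
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Counts from the validation set: 43 of 49 MR-associated CNVs and
# 1,085 of 1,154 benign CNVs were correctly classified.
metrics = confusion_metrics(tp=43, fp=1154 - 1085, fn=49 - 43, tn=1085)
percentages = {name: round(100 * value) for name, value in metrics.items()}
```

Rounded to whole percentages, these counts give the sensitivity of 88%, specificity of 94%, accuracy of 94% and negative predictive value of 99% reported in the text.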
Classifier Output | Validation Set (Rare de novo vs. commonly inherited CNVs) | Application Set (CNVs of unknown clinical significance) | |||
 | MR CNVs (rare de novo) | Benign CNVs (commonly inherited) | Sub Total | Rare Inherited | Rare CNVs of unknown inheritance |
Classified as MR-associated | 43 | 69 | 112 | 27 | 46 |
Classified as benign | 6 | 1,085 | 1,091 | 14 | 7 |
Total | 49 | 1,154 | 1,203 | 41 | 53 |
The accuracy of the classifier developed was tested on an independent cohort of CNVs. Phase 1 contained the validation set of 1,203 CNVs known to be either rare
To further investigate the contribution of particular features to misclassification rates we calculated the mean values for each feature in the correctly and incorrectly classified CNV groups (
Finally we sought to use our classifier on two further CNV datasets with unknown clinical significance, termed candidate CNVs and rare inherited CNVs. We first selected a set of 53 rare CNVs identified in the clinic, not known to vary in copy number among the general population, for which inheritance could not be established due to the unavailability of one or both parents. Due to their unknown inheritance and rare status, we are unable to determine using current diagnostic procedures whether these CNVs are indeed causal. In total, 46 of these 53 CNVs were classified as MR-associated CNVs (
In this study we present a novel computational method to objectively identify clinically relevant CNVs using an NBTree classifier and 13 diverse genomic features. This is the first description of such a method applied to CNVs that can significantly improve interpretation of this important class of genomic variation. Our classification method has been validated on a set of 1,203 CNVs detected in 584 patients with MR, achieving a high accuracy (94%), with a sensitivity of 88% and a specificity of 94% (
Several other computational methods have previously been developed to predict whether disruption or disturbance of genomic elements has pathogenic consequences. These methods often focus on identifying disease genes or on predicting whether mutations or splicing events are pathogenic
The classifier incorporated specific knowledge about CNVs via 13 diverse structural and functional genomic features (including a number of different transposable element types). The proximity of these elements to CNVs has been reported previously and it has been hypothesized that they mediate the formation of recurrent CNVs
Despite the MGI mouse phenotype dataset being incomplete, this feature contributes greatly to the classifier's accuracy (5%). To date, gene knockout experiments with recorded ontology based phenotype information have been performed for approximately 5,000 of the possible 15,287 genes with mouse 1∶1 orthologues
Most of the CNVs we used to train the classifier were identified on low-resolution (BAC–based) microarray platforms. In contrast, the replication set contained CNVs collected solely from Affymetrix 250k SNP microarrays. Despite the different microarray technologies used, only a negligible decrease in classification accuracy (−1.7%) was observed between the training and the replication set. This indicates that the classifier is platform-independent and will not require retraining when used on data generated from comparable microarray platforms.
MR-associated CNVs discovered thus far are, in general, larger than benign CNVs
Although current clinical interpretation of CNVs focuses on large, rare and
This CNV classifier may also be informative of disorders other than mental retardation. This is of particular relevance because CNVs have recently been associated with other neurodevelopmental disorders such as autism and schizophrenia
In conclusion, we have developed a novel objective method to identify disease-associated CNVs which has overcome several limitations with current CNV interpretation methodology. Our NBTree classifier is able to distinguish between MR-associated CNVs and benign CNVs with high accuracy without the use of data from large control cohorts or parental samples. Results indicate that computational classification methods can be used for objectively prioritizing CNVs in clinical research and diagnostics. The tool for classifying CNVs, called GECCO (Genomic Classification of CNVs Objectively), as well as the Java source code, are readily available online. The benefits of such methods will increase with advancements in microarray technology, which already identifies many thousands of such structural variants per individual
In this study we investigate if rare
The CNVs used during the training and test phase (164 rare
The MR diagnostics and application datasets were collected through in-house routine diagnostics using Affymetrix 250k SNP microarrays (Affymetrix, Santa Clara, USA), and consisted of 584 samples containing 1,297 CNVs. Regions were excluded if they contained fewer than 5 microarray targets, were smaller than 10 kb in size, or were the result of a mosaic or complex chromosomal aberration. In total, the validation/application set contained 49 rare
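The exclusion criteria listed above (fewer than 5 microarray targets, smaller than 10 kb, or arising from a mosaic or complex chromosomal aberration) amount to a simple filter. The sketch below is illustrative: the `Region` class and its field names are hypothetical, and it assumes 1 kb = 1,000 bp.

```python
from dataclasses import dataclass

MIN_TARGETS = 5          # regions with fewer targets are excluded
MIN_LENGTH_BP = 10_000   # regions smaller than 10 kb are excluded


@dataclass
class Region:
    """Hypothetical record for one called region."""
    n_targets: int
    length_bp: int
    is_mosaic: bool = False
    is_complex: bool = False


def passes_filters(region: Region) -> bool:
    """Return True if the region survives all exclusion criteria."""
    return (
        region.n_targets >= MIN_TARGETS
        and region.length_bp >= MIN_LENGTH_BP
        and not region.is_mosaic
        and not region.is_complex
    )


# Made-up regions for illustration only.
regions = [
    Region(n_targets=12, length_bp=150_000),
    Region(n_targets=4, length_bp=150_000),                 # too few targets
    Region(n_targets=12, length_bp=9_999),                  # too small
    Region(n_targets=12, length_bp=150_000, is_mosaic=True),
]
kept = [r for r in regions if passes_filters(r)]
```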
Initially a training set was created by randomly selecting 82 of the 164 rare
In total, 20 different genomic features were investigated as potential classifier attributes. The variance inflation factor (VIF) was used to measure collinearity within the model across the repeat, gene and evolution measures (simple repeats, repeat masker, LINE, SINE, long terminal repeats, RNA gene elements, segmental duplications, ENSEMBL genes, mean non-synonymous substitution rate (
The functional genomic features consisted of the gene count, gene density and the variance in gene expression levels, the mean non-synonymous substitution rate (
Workflow used to develop the classifier. The classifier is able to distinguish between MR CNVs and benign CNVs based solely upon genomic features, without the use of inheritance information. Several classification methods are tested. A training set consisting of both MR and benign CNVs is selected and the genomic features are extracted. These data are used to train the classifier, which is then evaluated with a separate test set of CNVs. The process of training set selection is repeated until optimal performance is obtained. Subsequently, the classifier is validated on an independent set of MR and benign CNVs.
Classification Results of 32 MR syndromes from the DECIPHER database. The chromosome location, syndrome name, as well as the CNV length and type are given. The classification results are shown with the MR distance measure, showing the confidence of each classification decision.
Mean (and standard deviation) of each genomic feature used by the classifier during the validation and application studies. For each class of CNV the feature mean and (standard deviation) for the correctly and incorrectly classified CNVs are indicated.
The authors of this article wish to thank members of the Genomic Disorders group and M.S.G. Kwa for their discussions and input in the development of the algorithm and manuscript.