Family-Based Benchmarking of Copy Number Variation Detection Software

The analysis of structural variants, in particular of copy-number variations (CNVs), has proven valuable in unraveling the genetic basis of human diseases. Hence, a large number of algorithms have been developed for the detection of CNVs in SNP array signal intensity data. Using the European and African HapMap trio data, we undertook a comparative evaluation of six commonly used CNV detection software tools, namely Affymetrix Power Tools (APT), QuantiSNP, PennCNV, GLAD, R-gada and VEGA, and assessed their level of pair-wise prediction concordance. The tool-specific CNV prediction accuracy was assessed in silico by way of intra-familial validation. Software tools differed greatly in terms of the number and length of the CNVs predicted as well as the number of markers included in a CNV. All software tools predicted substantially more deletions than duplications. Intra-familial validation revealed consistently low levels of prediction accuracy as measured by the proportion of validated CNVs (34-60%). Moreover, up to 20% of apparent family-based validations were found to be due to chance alone. Software using Hidden Markov models (HMM) showed a trend to predict fewer CNVs than segmentation-based algorithms albeit with greater validity. PennCNV yielded the highest prediction accuracy (60.9%). Finally, the pairwise concordance of CNV prediction was found to vary widely with the software tools involved. We recommend HMM-based software, in particular PennCNV, rather than segmentation-based algorithms when validity is the primary concern of CNV detection. QuantiSNP may be used as an additional tool to detect sets of CNVs not detectable by the other tools. Our study also reemphasizes the need for laboratory-based validation, such as qPCR, of CNVs predicted in silico.


Introduction
The term 'copy number variation' (CNV) refers to the recurrence of moderately sized stretches of DNA (>1 kb) that exhibit inter-individual differences in the number of times they occur in a genome [1,2]. Scientific interest in human CNVs has been stirred partly by the fact that only a minor proportion of the heritability of common complex diseases is explained by disease-associated single-nucleotide polymorphisms (SNPs) [3][4][5]. Other forms of genetic variation, including CNVs, are therefore likely to play an important role in the etiology of these diseases [3]. Correspondingly, CNVs have been implicated in various common disorders, including Crohn disease [6], rheumatoid arthritis and diabetes [7], psoriasis [8], intellectual disability [9], obesity [10], myocardial infarction [11], schizophrenia [12] and autism [13]. While CNV detection in the past was based solely upon aCGH or SNP array signal intensity data [14], technological progress in DNA sequencing today allows direct detection of CNVs [15], for example, in exomes [16]. However, genome-wide SNP array data still form an important basis of CNV detection, not least due to their ample availability from past genome-wide association studies (GWAS). A re-assessment of phenotypic associations of CNVs in these large legacy sample collections is warranted and to be expected for the coming years.
A variety of software tools have been developed for the detection of CNVs in SNP array data. Depending upon the underlying mathematical model, these tools can be divided broadly into two classes, namely those implementing a Hidden Markov model (HMM) and those using a segmentation algorithm. In a nutshell, HMM-based approaches aim at predicting covert copy number (CN) states along a Markov chain whereas segmentation algorithms split chromosomes into segments and try to sensibly assign a CN state to each segment. The interpretation of the derived CN states also differs between algorithms because 'state' either refers to a nominal class or a numerical genotype. Thus, a copy number class merely indicates the type of variation, i.e. whether there is a gain or loss of genetic material, whereas a copy number genotype specifies the number of copies present in a diploid genome. All available HMM algorithms predict up to six different copy number genotypes whilst all segmentation algorithms predict copy number class as one of three different types.
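The HMM idea outlined above can be illustrated with a toy Viterbi decoder. This is a minimal sketch and not the model of any of the tools compared here: it assumes only three hidden states (loss, normal, gain) instead of up to six CN genotypes, Gaussian emissions with hypothetical means and standard deviation, and a single hypothetical probability `p_stay` of remaining in the same state between adjacent markers.

```python
import math

def viterbi_cn_states(lrr, means=(-0.6, 0.0, 0.4), sd=0.2, p_stay=0.99):
    """Most likely hidden state path (0 = loss, 1 = normal, 2 = gain)
    for a non-empty sequence of signal log ratios.
    All parameter defaults are hypothetical illustration values."""
    n_states = len(means)
    log_stay = math.log(p_stay)
    log_move = math.log((1.0 - p_stay) / (n_states - 1))

    def log_emit(x, s):
        # Gaussian log density; terms shared by all states are dropped
        z = (x - means[s]) / sd
        return -0.5 * z * z

    # uniform prior over the initial state
    score = [log_emit(lrr[0], s) - math.log(n_states) for s in range(n_states)]
    back = []
    for x in lrr[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            cands = [score[p] + (log_stay if p == s else log_move)
                     for p in range(n_states)]
            best = max(range(n_states), key=cands.__getitem__)
            pointers.append(best)
            new_score.append(cands[best] + log_emit(x, s))
        score = new_score
        back.append(pointers)

    # trace the best path backwards from the best final state
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these illustrative parameters, a run of three markers with strongly negative values is decoded as a loss, whereas a single outlying marker is absorbed into the surrounding normal state because two state changes cost more than one poorly fitting emission.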
Different approaches have been taken in the past to benchmark CNV detection software [17][18][19][20][21]. Using early Affymetrix 100K SNP data, Baross et al. (2007) [17] noted substantial false-positive prediction rates with software tools CNAG (Copy Number Analyzer for Gene-Chip) [22], dChip (DNA-Chip Analyzer) [23] and GLAD [24]. The same authors also reported a high variability of these tools in terms of the number of CNVs predicted. Winchester et al. (2009) [18] assessed the accuracy of CNV prediction for five other software tools, using data from the more recent Affymetrix Genome-Wide Human SNP Array 6.0 and Illumina 1M-Duo BeadChip chips. They compared their SNP-based results to those of previously published sequencing studies [1,25,26], but only in single HapMap samples. In any case, the Winchester et al. study revealed that a large number of predicted CNVs could not be confirmed by any previous publication (up to 80%, depending upon the software used), and that predictions differed greatly both between software tools and between confirmation studies. In the same vein, Zhang et al. (2011) [19] applied Birdsuite [27], Partek (Partek Inc, St. Louis, MO), HelixTree (Golden Helix, Inc) and PennCNV [28] to three different data sets and observed a positive correlation between the number of markers included in a CNV and the 'recovery rate', defined by the authors as the proportion of previously published, validated CNVs that were also detected in their own study. Interestingly, the recovery rate was found to be negatively correlated with CNV population frequency. The same study also revealed a low consistency of the CNVs predicted in eight samples previously analyzed by Kidd et al. (2008) [25]. A further study [20] reported substantial variability of the pairwise concordance of CNV predictions by PennCNV [28], Affymetrix Power Tools (APT) [29], Aroma.Affymetrix [30] and CRLMM (Corrected Robust Linear Model with Maximum Likelihood Distance) [31]. An in-depth assessment of PennCNV and CRLMM revealed a median concordance of 52% for deletions and of 48% for duplications. More deletions than duplications were predicted by both tools, and the empirical false-positive prediction rates were as high as 26% for CRLMM and 24% for PennCNV. Pinto et al. (2011) [21] analyzed six samples on 11 different microarrays and predicted CNVs using as many different software tools, including PennCNV and QuantiSNP. The data generated by each microarray platform were analyzed with one to five of these tools. The experiments were performed in triplicate for each sample, and the authors observed an inter-software concordance of < 50% and a reproducibility in replicate experiments of < 70%.
None of the above studies used family data for CNV validation but instead relied upon experimental validation of a very limited set of CNVs, DNA sequencing information, or a concordant prediction made by different algorithms. Moreover, none of the studies paid any attention to population differences in CNV prediction, despite previous reports that such differences do exist [1,32,33,34]. A general conclusion has been that more than one software tool should be used synergistically to increase specificity, and that CNVs should be validated experimentally by more reliable methods such as qPCR. However, although many of the currently available software tools were included in at least one of the studies, no systematic comparison has yet been undertaken of the main characteristics of CNVs predicted by a given algorithm, including the length, marker density and inter-marker distance.
We therefore assessed in detail the performance of six commonly used software tools for CNV detection in Affymetrix SNP array data. The tools of interest included HMM-based algorithms APT [29], QuantiSNP [35] and PennCNV [28] in addition to segmentation-based algorithms R-gada [36], GLAD [24] and VEGA [37]. The SNP genotyping of APT is based on the birdseed algorithm of the well-known Birdsuite software package and can be seen as an extension of the Birdsuite approach. The Birdsuite software package was therefore not included in our comparison. We used publicly available SNP array signal intensity data from the International HapMap project [38][39][40][41][42] for CNV detection and a trio design for validation. Our results may guide future choices of CNV software for particular applications and should also inform the interpretation of the results obtained.

Proband Data
We used signal intensity data of 60 trios (180 individuals) from the Affymetrix Sample Data Set, which were part of HapMap Phases 1 and 2 public releases 21a (released on 1st November 2007). Half of the trios were African (Yoruba in Ibadan, Nigeria, YRI) whereas the other half was of European ancestry (Utah Residents with Northern and Western European Ancestry, CEU). All samples had been genotyped with Affymetrix Genome-Wide Human SNP Array 6.0, which contains probes for 906,600 SNPs and an additional 945,826 CNV probes [43]. NetAffx annotation files (release 31, UCSC hg19) were used to map the markers on the chip to the human genome. The average genotyping call rate in the complete public release 21a data set (270 samples) was 99.83% (technical documentation) and the concordance with HapMap genotypes (release 21a) was 99.84%. Since there are no general CNV-specific quality control measures, we used all samples and applied software-specific default quality control if available (see below). We used the 60 unrelated offspring samples for CNV detection and the 120 parental samples for subsequent validation. Our analysis was confined to autosomes.

CNV definition
In contrast to previous publications [1,2], we defined as a CNV any stretch of DNA that either has additional copies (duplication, gain) or is lacking (deletion, loss) compared to a reference genome, not restricting the CNV predictions by their size. We used the NetAffx Annotation File (release 31), containing marker positions according to the UCSC genome assembly version hg19, to annotate the predicted CNVs. The two copy number (CN) classes of 'gain' and 'loss' were subdivided into CN states according to the number of chromosomes (1 or 2) affected by the respective gain or loss.

SNP array data
Depending upon technology, SNP arrays contain multiple probes for each of the two alleles (A and B) of a SNP. All probes specific to one allele are collectively called a 'channel'. The intensity of each of the two channel signals (denoted $R_A$ and $R_B$) reflects the amount of genetic material hybridized. The signal ratio allows inference, not only of the SNP genotype, but also of the relative amount of genomic material present at the target locus. Affymetrix Human SNP Array 6.0 contains six to eight probes per SNP, corresponding to three to four probes per channel. The array also contains a large number of probes for regions that may contain CNVs, but not SNPs, and these probes directly measure the total amount of genetic material present [44,45].
Inference of the copy number status at a given locus is made by comparing the sum of the observed channel signals, $R_{obs} = R_A + R_B$, to its expectation $R_{exp}$. The definition of $R_{exp}$ varies between CNV detection algorithms. However, all algorithms use the marker-specific Log-2 Raw Data Ratio, $LRR = \log_2 R_{obs} - \log_2 R_{exp}$, as the basic input to infer a CN state, although some also rely upon the B allele fraction (BAF). If a CNV is present, the BAF is notably different from 0 (genotype AA), 0.5 (AB) and 1 (BB). The BAF is derived from a transformation, $\theta_i$, of the sample-specific channel intensity ratios for the $i$th sample, calculated as $\theta_i = (2/\pi) \cdot \arctan(R_{A,i}/R_{B,i})$. For each marker, this yields three genotype-specific median $\theta$ values, taken over all samples, namely $\theta_{AA}$, $\theta_{AB}$ and $\theta_{BB}$. The BAF of an observed $\theta$ value is then calculated by linear interpolation between these genotype-specific medians.
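The quantities defined in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code of any particular tool: the function names are hypothetical, and the BAF is obtained by piecewise-linear interpolation that maps the cluster medians $\theta_{AA}$, $\theta_{AB}$ and $\theta_{BB}$ to 0, 0.5 and 1, with values outside the cluster range clamped to the nearer homozygous value. The interpolation is written so that it does not depend on whether $\theta_{AA}$ is smaller or larger than $\theta_{BB}$, but it assumes that $\theta_{AB}$ lies strictly between the two.

```python
import math

def log_r_ratio(r_a, r_b, r_exp):
    """LRR = log2(R_obs) - log2(R_exp), with R_obs = R_A + R_B."""
    return math.log2(r_a + r_b) - math.log2(r_exp)

def theta(r_a, r_b):
    """Angular transform (2/pi) * arctan(R_A / R_B), as given in the text.
    atan2 is used so that R_B = 0 does not raise a division error."""
    return 2.0 / math.pi * math.atan2(r_a, r_b)

def baf(t, t_aa, t_ab, t_bb):
    """Assumed piecewise-linear interpolation between genotype cluster medians:
    theta_AA -> 0, theta_AB -> 0.5, theta_BB -> 1."""
    if min(t_aa, t_ab) <= t <= max(t_aa, t_ab):
        return 0.5 * (t - t_aa) / (t_ab - t_aa)
    if min(t_ab, t_bb) <= t <= max(t_ab, t_bb):
        return 0.5 + 0.5 * (t - t_ab) / (t_bb - t_ab)
    # outside the range spanned by the clusters: clamp to the nearer end
    return 0.0 if abs(t - t_aa) < abs(t - t_bb) else 1.0
```

For example, a marker whose summed signal is half of its expectation has an LRR of -1, and an observed $\theta$ halfway between $\theta_{AA}$ and $\theta_{AB}$ maps to a BAF of 0.25.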

Software tools
We studied six commonly used software tools for CNV detection in Affymetrix SNP array data, namely APT [29], QuantiSNP [35] and PennCNV [28], implementing an HMM algorithm, and segmentation-based tools R-gada [36], GLAD [24], and VEGA [37]. All programs were run with their default options unless stated otherwise, which also includes default quality control measures by the respective software (see S1 File for the commands used). CNVs were defined separately for each sample, ignoring familial relationships.
APT. The Affymetrix Power Tools (APT) [29] equate $R_{exp}$ to the median sum of the sample-specific channel signals, taken over all markers, or use a pre-computed reference [46] to obtain marker-specific LRR values. A Hidden Markov model is then fitted to the sequence of LRR values along the genome to assign hidden copy number states. We used program apt-copynumber-workflow of the APT bundle (version 1.14.2) with default settings in the single-sample mode and option --text-output set to true. The pre-computed Copy Number Analysis HapMap Reference File (Release 31) was used as a reference and the NetAffx Annotation File (Release 31) was used for alignment (publicly available at the Affymetrix website http://www.affymetrix.com/support/index.affx).
PennCNV. CNV analysis of Affymetrix data with PennCNV [28] follows the Penn-Affy protocol (http://www.openbioinformatics.org/penncnv/) according to which LRR and BAF are inferred from canonical genotype clusters [47] by means of linear interpolation. The genotype clusters are generated from genotype calls that have been obtained with the APT software (see above). The sequences of LRR and BAF values are used in an HMM algorithm to infer the hidden copy number states. The PennCNV 2011Jun16 version was included in our study. First, apt-probeset-genotype and apt-probeset-summarize of APT (version 1.14.2) were used for genotype calling and allele-specific signal extraction, as laid down in the protocol. Second, canonical genotype clusters [47] were generated using generate_affy_geno_cluster.pl, which is part of the PennCNV-Affy tool. Clusters were then used to calculate LRR and BAF values via linear interpolation with normalize_affy_geno_cluster.pl. The sequence of LRR and BAF values was analyzed using detect_cnv.pl with default parameters.
QuantiSNP. QuantiSNP [35] relies on pre-computed LRR and BAF values (e.g. from PennCNV) that are subjected to its own HMM algorithm to infer hidden copy number states. QuantiSNP (version 2) was applied according to the instructions given on the QuantiSNP project webpage (https://sites.google.com/site/quantisnp/). Signal files created with PennCNV were used as input.
R-gada. The segmentation algorithm implemented in R-gada [36] uses LRR values pre-computed along the genome and tries to find discontinuities by sparse Bayesian learning. For the resulting segments, the average LRR of all markers falling into the segment is compared to the median LRR of the respective chromosome. Based upon the outcome, a CN class is assigned to the segment. R-gada (version 0.8-5) was run using LRR values calculated with APT program apt-copynumber-workflow.
GLAD. The GLAD software [24] was developed for the analysis of aCGH data. However, since the program uses signal intensities for segmentation, it can also be applied to SNP array data. GLAD uses pre-computed LRR values and applies the Adaptive Weights Smoothing algorithm to find discontinuities along the genome. CN classes are then assigned depending upon the difference between the segment-specific median LRR and the median LRR closest to zero. GLAD (version 2.20.0) was run using R 2.15 and APT-derived LRR values (see above).
VEGA. The segmentation algorithm implemented in the VEGA software [37] is based upon the Mumford and Shah model [37] and uses pre-computed LRR values as well. After segmentation, CN states are assigned to the resulting segments depending upon whether the mean LRR is smaller or larger than zero. VEGA (version 1.7.0) was run using R 2.15 and APT-derived LRR values (see above).
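The class assignment step shared by the segmentation-based tools can be sketched as a comparison of a segment's average LRR with a baseline. This is a simplified illustration, not the decision rule of R-gada, GLAD or VEGA as implemented: the function name, the `baseline` argument and the `tolerance` band are hypothetical. With the default values it reduces to the bare sign rule described for VEGA; passing the chromosome-wide median LRR as `baseline` mimics the comparison described for R-gada.

```python
from statistics import mean

def assign_cn_class(segment_lrr, baseline=0.0, tolerance=0.0):
    """Assign 'gain', 'loss' or 'normal' to a segment from its marker LRR values.
    baseline and tolerance are hypothetical knobs for illustration."""
    diff = mean(segment_lrr) - baseline
    if diff > tolerance:
        return "gain"
    if diff < -tolerance:
        return "loss"
    return "normal"
```

Without a tolerance band, every segment whose mean deviates even minimally from the baseline receives a non-normal class, which is one way in which a segmentation approach could produce many short calls.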

Standardization of output
While PennCNV, QuantiSNP, R-gada and VEGA report a list of segments and their respective CN state, APT and GLAD output a list of markers and their CN states. To allow comparison between tools, we converted all output to lists of segments, if not provided by the software itself. We also summarized CN genotypes into CN classes with three possible states per segment ('normal', 'gain' or 'loss'). Since some algorithms (e.g., PennCNV) do not support sex-chromosomal analyses by default, we considered only autosomal CNVs. All autosomal CNV predictions, including outliers < 1 kb, were retained for the benchmark.
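A conversion from per-marker output to segments, as described above, might look as follows. This is a minimal sketch with a hypothetical input layout (tuples of chromosome, position and copy number genotype, sorted by chromosome and position); it assumes that a genotype of two copies is 'normal' and that a segment consists of consecutive markers of the same class on the same chromosome.

```python
from itertools import groupby

def cn_class(copy_number):
    """Collapse a copy number genotype (copies in a diploid genome) into a class."""
    if copy_number < 2:
        return "loss"
    if copy_number > 2:
        return "gain"
    return "normal"

def markers_to_segments(markers):
    """markers: list of (chrom, position, copy_number), sorted by chrom and position.
    Returns (chrom, start, end, cn_class, n_markers) for each non-normal run."""
    segments = []
    key = lambda m: (m[0], cn_class(m[2]))
    for (chrom, cls), run in groupby(markers, key=key):
        run = list(run)
        if cls != "normal":
            segments.append((chrom, run[0][1], run[-1][1], cls, len(run)))
    return segments
```

Note that adjacent markers with zero copies and one copy are merged into one 'loss' segment, which reflects the summarization of CN genotypes into CN classes.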

CNV benchmarking
We evaluated the six software tools in terms of both the characteristics of the predicted CNVs (i.e. their number, length and type) and the validity of the predictions made. We also compared the marker density within those CNVs that were detected in the 60 unrelated children from trios. The presence or absence of a CNV in the parents was used to validate each prediction in the offspring, since the overwhelming majority (up to 99%) of all CNVs in a genome is inherited [44,48] and will also be present in one of the parental genomes. While de novo CNVs do exist, they play only a minor role. More specifically, a CNV detected in an offspring was considered validated if one or more segments of the same CN class (i.e. 'gain' or 'loss') that covered >90% of the offspring CNV were found in at least one parent. We also applied other thresholds for the required overlap. All analyses were repeated separately in the African (YRI) and European (CEU) samples to recognize possible population differences in terms of CNV detection. To assess the likelihood of a CNV being validated by chance alone following our family approach, we randomly reassigned parents to offspring and repeated this procedure ten times. This analysis was carried out twice, once drawing parents from a joint pool of CEU and YRI trios and once considering CEU and YRI trios separately.
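The validation criterion just described can be sketched as a coverage computation. The sketch below makes several assumptions that are not stated in the text: CNVs are represented as hypothetical tuples (chromosome, start, end, CN class) with half-open coordinates and end > start, and the coverage of an offspring CNV is computed from the union of all parental segments of the same class, separately for each parent.

```python
def covered_fraction(cnv, segments):
    """Fraction of cnv = (start, end) covered by the union of (start, end) segments."""
    start, end = cnv
    covered, cursor = 0, start
    for s, e in sorted(segments):
        s, e = max(s, cursor), min(e, end)   # clip to the not yet counted part
        if e > s:
            covered += e - s
            cursor = e
    return covered / (end - start)

def is_validated(child_cnv, father_cnvs, mother_cnvs, threshold=0.9):
    """True if same-class segments in at least one parent cover more than
    `threshold` of the offspring CNV."""
    chrom, start, end, cls = child_cnv
    for parent in (father_cnvs, mother_cnvs):
        same = [(s, e) for c, s, e, k in parent if c == chrom and k == cls]
        if covered_fraction((start, end), same) > threshold:
            return True
    return False
```

Because coverage is evaluated per parent, two partial overlaps contributed by different parents do not add up to a validation in this sketch; other thresholds can be explored through the `threshold` argument.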
The possible influence on CNV prediction of the underlying mathematical model was assessed by comparing the median of the outcome variables of interest for the three HMM-based programs (APT, PennCNV, QuantiSNP) to those for the three segmentation-based programs (GLAD, R-gada, VEGA). Again, to evaluate possible population differences, we repeated our analyses separately for the CEU and YRI samples. Finally, we investigated the inter-software concordance in terms of CNV prediction by considering the proportion of CNVs predicted by one tool that were also found by the other tool, using only validated CNV predictions in this approach. A CNV predicted by one tool was considered verified by the other if >90% of the CNV was assigned the same CN state by the second tool. We also considered other thresholds for the necessary overlap. Additionally, we generated a sample-specific set of CNVs concordantly called by at least three algorithms and compared each tool to this call set.
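One way to derive a call set of regions concordantly called by at least three tools is a sweep over interval boundaries. The following sketch is an assumption about how such a set could be built, not a description of the procedure actually used: the input layout is hypothetical (one sample, one chromosome, one CN class, half-open coordinates), and each tool's own overlapping calls are merged first so that a tool is counted at most once per position.

```python
def consensus_regions(calls_by_tool, min_tools=3):
    """calls_by_tool: {tool_name: [(start, end), ...]}.
    Returns regions covered by calls from at least `min_tools` different tools."""
    events = []
    for calls in calls_by_tool.values():
        merged = []
        for s, e in sorted(calls):
            if merged and s <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        for s, e in merged:
            events.append((s, 1))    # a tool starts covering at s
            events.append((e, -1))   # and stops covering at e
    # at equal positions, starts are processed before ends
    events.sort(key=lambda ev: (ev[0], -ev[1]))

    regions, depth, region_start = [], 0, None
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < min_tools <= depth:
            region_start = pos
        elif depth < min_tools <= previous:
            if pos > region_start:          # skip zero-length regions
                regions.append((region_start, pos))
            region_start = None
    return regions
```

Each tool's calls can then be compared with the returned regions in the same way as with the calls of another tool.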
All statistical analyses were performed with R 2.15.2. Outcome differences between software tools were tested for statistical significance using a pairwise Wilcoxon signed-rank test.

CNV prediction
The spectra of CNVs predicted by different programs varied widely, both in terms of their number and length and of the marker density within CNVs. In the offspring of the 60 European (CEU) and African (YRI) HapMap trios, the median CNV number ranged from 75 per sample, predicted by PennCNV, to 211 per sample for R-gada (Fig 1 and Table 1). Segmentation algorithms predicted significantly more CNVs than HMM algorithms (median: 182 vs. 98, Wilcoxon signed-rank test $p = 1.6 \times 10^{-11}$) and showed a (non-significant) trend towards a higher inter-software variability in CNV number (median absolute deviation 42.3 vs. 19.3, p = 0.12 from 10,000 permutations of class labels). All software except PennCNV predicted fewer CNVs in Europeans (CEU) than in Africans (YRI, p<0.05 for all tools; S1 Table).
The distribution of the median CNV length per sample was found to be skewed for all six tools, including some outlier samples with exceptionally long CNVs (Fig 2). In particular, R-gada yielded median CNV lengths of up to 1.9 Mb per sample and predicted CNVs comprising up to 126 Mb. The median of the sample-wise median lengths, taken over all CNVs predicted, was found to be similar for all tools except PennCNV, which showed a trend towards longer CNVs. In general, HMM-based tools tended to yield longer CNVs per sample (median length: 9.7 kb) than segmentation algorithms (7.4 kb, $p = 1.6 \times 10^{-11}$; Table 1). The cumulative CNV length per sample also differed greatly between tools, ranging from a median of 4.6 Mb (IQR: 3.7-5.7) for APT via 8.1 Mb (5.7-23.2) for QuantiSNP to 121.0 Mb (18.9-281.4) for R-gada.
The median cumulative CNV length per sample was consistently larger for Europeans than for Africans (p<0.05 for all tools; S1 Table).
The median number of markers included in a CNV was similar for the different software tools except for PennCNV which, on average, included three times as many markers in a CNV as the other tools. Consequently, PennCNV also exhibited the smallest median inter-marker distance per sample (Table 1). Notably, all six tools were characterized by a median inter-marker distance within CNVs that was well below the overall median of the Affymetrix Human SNP Array 6.0 (684 bp), which is consistent with a preferential prediction of CNVs in regions of increased marker density. Inter-marker distance within CNVs did not differ significantly between Europeans and Africans (S1 Table).
All six tools predicted many more deletions than duplications. The median deletions-to-duplications ratio (DDR) per sample ranged from 2.8 for GLAD to 5.5 for PennCNV (Table 1). HMM-based tools yielded higher DDR values than segmentation algorithms (4.3 vs. 3.6, $p = 6.9 \times 10^{-4}$; Table 1). No consistent differences in DDR value were noted between European and African samples (S1 Table).

In-silico validation of predicted CNVs using family information
In view of the observed discrepancies in CNV prediction between different tools, we sought to validate in silico the CNVs predicted for the children using the raw signal intensity data available for the parents. More specifically, we predicted CNVs in the two parents and regarded an offspring CNV as validated if it overlapped by at least 90% with a parental CNV of the same state (i.e. gain or loss), predicted by the same tool.
The proportion of CNVs that could be validated per sample differed greatly between tools, with a median percentage over samples that ranged from 41.1% with R-gada to 60.9% with PennCNV (Table 2). HMM-based algorithms yielded more validated CNVs than segmentation algorithms, both in general (55.9% vs. 41.4%) and at the level of the individual tool (Table 2). This trend was apparent for deletions and duplications alike (Table 2). Slightly more deletions than duplications were validated in the case of PennCNV, QuantiSNP and R-gada (median DDR>1) whereas slightly more duplications were validated for APT, GLAD and VEGA (median DDR<1). Nevertheless, the inter-quartile range of the DDR value among validated CNVs included unity for all six tools. The percentage of validation was largely independent of CNV size, as evidenced by highly similar validation rates across different bins of CNV size (S1 Fig). In order to assess the impact of false-negative validations that are due to false-negative CNV predictions in the parents, we additionally considered a CNV in the offspring to be validated if at least one out of the six tools predicted a CNV in a parent with an overlap of at least 90% ("extended validation"). Not surprisingly, total numbers and validation rates increased throughout by ~10-20%, but the difference in validation rates between the software tools remained largely unchanged (S4 Table). Features of CNVs with extended validation were very similar to those validated by only a single tool (S5 Table). We also sought external technical verification from the Database of Genomic Variants (DGV). Sequencing-based CNV data were available for four CEU samples (NA07048, NA10847, NA10851, NA12878) and two YRI samples (NA19129, NA19240). Unfortunately, complete data were not available for any of the 60 trios studied here. Sequencing-based verification yielded low validation rates, ranging from 26% for VEGA to 36% for PennCNV (S2 Fig). No population differences in validation efficacy were observed (S2 Table). For HMM-based tools, a median of 55.3% of the total genomic sequence included in offspring CNVs was also included in at least one parental CNV. Segmentation-based methods performed substantially worse in this respect. Their median proportion of validated CNV sequence per offspring sample was as low as 5.2% for R-gada (Table 2). However, this abnormality was due to a number of very large CNVs predicted by R-gada that could not be validated.

Features of validated CNVs
Validated CNVs differed from non-validated CNVs with respect to their total number, their median length, the median number of markers included in a CNV, and the average inter-marker distance. Validated CNVs tended to be longer and more densely covered with markers than non-validated CNVs (Table 3 [...] sample ranged from 42.5 (PennCNV) to 83.5 (QuantiSNP). Observed DDR values per sample were similar for validated and non-validated CNVs, with a median for validated CNVs ranging from 3.0 for QuantiSNP and VEGA to 6.2 for PennCNV (Table 3, S3 [...]

CNV validation by chance alone?
In some cases, family-based validation of a CNV may have occurred by chance alone, and may not have been due to the inheritance of the respective CNV by the offspring. To assess the likelihood of such "pseudo-validation", we repeatedly permuted the assignment of parents to offspring and analyzed the resulting trios as described above. We performed ten replications, each time evaluating the median proportion of validated CNVs. Unexpectedly, a substantial proportion of CNVs was indeed found to be pseudo-validated in these analyses, with a median, over replicates, of the median proportion per sample that ranged from 13.6% (R-gada) to 20.3% (APT) (Table 4). Pseudo-validation rates were higher when parents were permuted within the original population than across populations. The software-specific median validation proportion ranged from 16.8% to 29.2% in Europeans and from 14.3% to 24.1% in Africans (Table 4).
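The permutation scheme can be sketched as follows. This is a minimal illustration in which several details are assumptions: the validation rule is passed in as a hypothetical callable `validated_fraction`, parental pairs are kept together when they are reassigned, and a shuffle is redrawn until no child is matched with its true parents. The text above does not state whether the latter restriction was applied.

```python
import random
from statistics import median

def pseudo_validation_rate(children, parents, validated_fraction,
                           n_replicates=10, seed=1):
    """children: list of per-child CNV lists.
    parents: list of (father_cnvs, mother_cnvs) in the same order as children.
    validated_fraction(child_cnvs, father_cnvs, mother_cnvs) -> proportion in [0, 1].
    Returns the median, over replicates, of the median proportion per sample."""
    n = len(children)
    if n < 2:
        raise ValueError("at least two trios are needed to permute parents")
    rng = random.Random(seed)
    replicate_medians = []
    for _ in range(n_replicates):
        order = list(range(n))
        # assumption: redraw until every child receives parents from another trio
        while any(i == j for i, j in enumerate(order)):
            rng.shuffle(order)
        rates = [validated_fraction(children[i], *parents[order[i]])
                 for i in range(n)]
        replicate_medians.append(median(rates))
    return median(replicate_medians)
```

Calling the function once on the pooled trios and once on the CEU and YRI trios separately would correspond to the two variants of the analysis described above.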

Pairwise concordance between tools
A popular albeit heuristic approach to increase the specificity of prediction tools is to use different tools simultaneously. Entities predicted by two different algorithms are usually thought to deserve higher confidence than entities predicted by one algorithm only. In order to assess the concordance between pairs of CNV detection tools, we determined, for each of the 60 offspring individuals from the HapMap trios, the proportion of genomic sequence included in validated CNVs predicted by one tool ('predictor') that was also included in validated CNVs predicted by another tool ('verifier'). Note that this definition of concordance is not necessarily symmetric. Comparisons employed the median proportion (i) of concordant sequence per CNV per sample and (ii) of cumulated concordant CNV sequence per sample. A CNV was considered verified by another algorithm if the proportion of concordant sequence exceeded 90%. The level of concordance differed greatly between pairs of tools (S7 Table). Thus, the largest median proportion of concordant sequence per CNV per sample was observed for PennCNV as predictor and QuantiSNP as verifier (73.2%). Notably, QuantiSNP as predictor yielded a much lower level of concordance with PennCNV as verifier (40.1%). In general, GLAD as a predictor yielded the highest level of verification by any other tool (52.9-63.7%) whereas QuantiSNP was least verified (40.0-41.4%). The proportion of verified CNVs per sample showed a similar trend and ranged from 32.5% (R-gada as predictor, PennCNV as verifier) to 68.3% (PennCNV as predictor, QuantiSNP as verifier) (S6 Table). Use of verification thresholds other than 90% yielded similar results. The median of the proportion of verified cumulated CNV sequence ranged from 14.2% (QuantiSNP as predictor, R-gada as verifier) to 67.76% (VEGA as predictor, R-gada as verifier), but no general trend towards a single tool showing a consistently high level of verification by the other tools was apparent (S7 Table).
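The asymmetry of this concordance measure is easy to see in a small sketch of measure (ii), the cumulated concordant sequence. The code is an illustration under assumptions: intervals are hypothetical (start, end) pairs with half-open coordinates for one sample, one chromosome and one CN class, and the predictor is assumed to have at least one call.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

def concordant_fraction(predictor, verifier):
    """Share of the predictor's cumulated CNV sequence that also lies
    within CNVs of the verifier."""
    pred, ver = merge_intervals(predictor), merge_intervals(verifier)
    total = sum(e - s for s, e in pred)
    # after merging, intervals within each list are disjoint, so pairwise
    # intersections can simply be summed
    shared = sum(max(0, min(e1, e2) - max(s1, s2))
                 for s1, e1 in pred for s2, e2 in ver)
    return shared / total
```

If tool A calls one region of 100 bp and tool B calls half of that region plus a separate region of 800 bp, then half of A's sequence is confirmed by B, but only about 6% of B's sequence is confirmed by A.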

Discussion
Copy-number variation (CNV) has been implicated in the etiology of many complex diseases. While CNV detection is increasingly being based upon next-generation sequencing (NGS) data (see Zhao et al., 2013 [59], for a comprehensive review), NGS-based CNV detection is still faced with a number of issues, including the substantially lower data quality compared to SNP genotyping arrays and problems in detecting forms of structural variation other than deletions, likely contributing to the lack of benchmarking studies for NGS-based CNV detection. On the other hand, many studies still infer CNVs from genome-wide SNP array data. As a legacy of the era of genome-wide association studies, data from this platform are readily available for large sample collections but still not analyzed to their full potential. One reason might be the lack of comparative studies of CNV prediction software. In order to assess the reliability of such CNV detection, we evaluated six frequently used software tools, drawing upon parental information for CNV validation. In addition, we investigated the potential for population differences in CNV prediction. Finally, we assessed a common albeit heuristic approach to increase the specificity of CNV detection, namely reliance upon concordant predictions. One important limitation of family-based validation is the inability to detect unusual inheritance patterns of multi-allelic CNVs. Although these are believed to be rare, it is difficult to assess their impact on this study.
The six software tools studied showed major differences in terms of the number and length of CNVs predicted. This discrepancy should raise serious concerns about the general validity of the respective results. Indeed, our family-based validation study revealed a trend for Hidden Markov models (HMM) to predict fewer CNVs of consistently higher validity than segmentation-based software. In fact, HMM-based software PennCNV predicted the smallest number of CNVs, but achieved the highest level of validation of all tools considered. The six programs also differed in terms of the number and features of the validated CNVs, although these differences were similar to those seen for non-validated CNVs. The median number of CNVs per individual predicted by PennCNV in this study was nearly four times higher than the number reported by Wang et al. [28]. Similarly, we observed a longer CNV length in the predictions of PennCNV. When comparing these results, it has to be considered that different samples and, more importantly, different array technologies were used. The Illumina HumanHap550 BeadChip [60] used by Wang et al. uses half a million markers (median distance 3 kb), whereas the Affymetrix Human SNP Array 6.0 [44] uses two million markers (median distance 684 bp). This alone is likely to lead to major differences in software performance. Our study thus confirms previous reports of a generally low validity and high false-positive rate of array-based CNV detection, and a preferential prediction of deletions over duplications [17][18][19][20]. While the presented validation rates may serve as a proxy for specificity, the use of real-world data with unknown underlying CNV structure in our study prevents an assessment of sensitivity. In order to compensate for this limitation, we compared the software-specific predictions against a consensus call set consisting of all regions that were predicted to be CNVs by at least three different tools. Sequencing-based verification of six samples using DGV data yielded rather low rates similar to those reported by Pinto et al., 2011 [21]. The observed differences in length of the CNV predictions between algorithm classes are consistent with the observation that segmentation-based algorithms tend to fragment larger CNVs into smaller predictions. This, however, is unlikely to affect the cross comparison of software tools (see below), given that only the amount of covered sequence is crucial for verifying a CNV, not the continuity between segments.
The population origin of a sample played only a minor role in CNV prediction. Nevertheless, all software tools except PennCNV showed a trend towards the prediction of fewer and longer CNVs in Europeans than in Africans. This finding may be explicable by a higher level of overall genetic heterogeneity among Africans than non-Africans. Notably, however, the rate of CNV validation was virtually the same in both populations. Somewhat unexpectedly, the likelihood of a chance CNV validation was found to be high and was even increased when European and African trios were considered separately in the assignment of 'random' parents. This observation may point towards a population-specific distribution of CNVs, consistent with previous reports [1,32,33,34]. However, it should be noted that the efficacy of CNV detection hinges on the distribution of the markers used for prediction, which is likely to be population-specific by itself.
The pairwise concordance between tools was often high, but not necessarily symmetric. In particular, PennCNV was superior to all other tools with regard to the median proportion of both the number and concordant sequence of verified CNVs. This renders PennCNV the first choice for initial CNV prediction if specificity is most important. On the other hand, QuantiSNP had the second highest validation rate but showed low concordance with other tools, suggesting that PennCNV and QuantiSNP could be used jointly in order to detect different sets of CNVs.
A high false-positive rate, a high probability of chance validation and an insufficient level of concordance of CNV prediction between different algorithms, as observed in our study, have two important implications for CNV detection. First, CNVs require independent experimental validation, even if predicted concordantly by different algorithms, as has been suggested before by Winchester et al. (2009) [18]. Second, the marker distribution appears to be critical for the ability to predict CNVs reliably. For example, any determination of the breakpoints of a CNV may be difficult in genomic regions that are poorly covered by markers.
The above shortcomings notwithstanding, the high validation rate attained by HMM-based software still renders the respective tools a promising means of CNV detection if followed by validation by another method. We thus recommend use of HMM-based tools such as PennCNV and QuantiSNP, perhaps in combination, to achieve high specificity. Moreover, in view of the large collections of SNP array data that are available from past genome-wide association studies and the still numerous issues with NGS-based CNV detection, a systematic reanalysis of these data aiming at CNV detection seems a worthwhile effort.

Table 1. Sample-specific features of predicted CNVs. Given are the median and, in parentheses, the inter-quartile range per sample. DDR: ratio of deletions to duplications. doi:10.1371/journal.pone.0133465.t001

Table 3. Sample-specific features of validated CNVs. Given are the median and, in parentheses, the inter-quartile range per sample. DDR: ratio of deletions to duplications. doi:10.1371/journal.pone.0133465.t003

Table 4. Family-based CNV validation by chance alone?