Optimized detection of insertions/deletions (INDELs) in whole-exome sequencing data

Insertion and deletion (INDEL) mutations, the most common type of structural variance, are associated with several human diseases. The detection of INDELs through next-generation sequencing (NGS) is becoming more common due to the decrease in costs, the increase in efficiency, and sensitivity improvements demonstrated by the various sequencing platforms and analytical tools. However, there are still many errors associated with INDEL variant calling, and distinguishing INDELs from errors in NGS remains challenging. To evaluate INDEL calling from whole-exome sequencing (WES) data, we performed Sanger sequencing for all INDELs called from the several calling algorithm. We compared the performance of the four algorithms (i.e. GATK, SAMtools, Dindel, and Freebayes) for INDEL detection from the same sample. We examined the sensitivity and PPV of GATK (90.2 and 89.5%, respectively), SAMtools (75.3 and 94.4%, respectively), Dindel (90.1 and 88.6%, respectively), and Freebayes (80.1 and 94.4%, respectively). GATK had the highest sensitivity. Furthermore, we identified INDELs with high PPV (4 algorithms intersection: 98.7%, 3 algorithms intersection: 97.6%, and GATK and SAMtools intersection INDELs: 97.6%). We presented two key sources of difficulties in accurate INDEL detection: 1) the presence of repeat, and 2) heterozygous INDELs. Herein we could suggest the accessible algorithms that selectively reduce error rates and thereby facilitate INDEL detection. Our study may also serve as a basis for understanding the accuracy and completeness of INDEL detection.


Introduction
Recent advances in next-generation sequencing (NGS) technologies have rapidly altered the research and routine work of human geneticists. Specifically, whole-exome sequencing (WES) has been used to elucidate genetic variants underlying human diseases [1]. WES has proven to be a valuable method for the discovery of the genetic causes of rare and complex diseases due to its moderate costs, the amount of manageable data, and straightforward interpretation of results [2,3].
Several types of natural genetic variations are present in patient samples, including singlenucleotide polymorphisms (SNPs), short insertions or deletions (INDELs) ranging from 1 base (bp) to 10 kilobases (kb) in length, and larger structural variants ranging from 10 kb to a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 several megabases in length. INDELs is a common and functionally important type of sequence polymorphism [4]. This will provide an important resource for applications in medical sequencing, as INDELs have been implicated in a number of diseases [5].
By applying NGS on a large scale, WES is now possible at an individual level [6]. One of the most important aspects of genetics is to identify genetic variants in individuals [1]. INDELs can cause or contribute to human genetic diseases. For example, cystic fibrosis (CF, MIM #219700), neurofibromatosis (NF1, MIM #162200), Charcot-Marie-Tooth neuropathy type 2A (CMT2A, MIM #118210), glycogen storage disease 2 (GSD2, MIM #23200), Huntington disease (HD, MIM #143100), and Duchenne muscular dystrophy (DMD, MIM #310200) are caused by INDELs in the coding regions of DNA. Therefore, the results of INDEL calling from individual WES can be used to predict the future health of individuals and to develop customized medical treatments [7].
Large number of tools are available for short-read alignment and searching for variants (e.g. SNVs and INDELs). However, the accurate detection of INDELs is still difficult and remains a critical issue. False-positive (FP) and false-negative (FN) rates are critical, especially for genetic diagnosis and Mendelian disease studies. For the future of personalized medicine and genetic diagnosis, highly accurate variant calling remains one of the most important problems [8].
In this study, we used whole exome data from one human genome and analyzed four INDEL detection algorithms: Genome Analysis Toolkit (GATK), Sequence Alignment/Map tools (SAMtools), Dindel, and Freebayes. Here, we show algorithms for available and commonly used methods that detect INDELs and compared their performances using the actual validation data.

Materials and methods Subject
This study examined whole-exome data available from a previous study [9]. Informed consent was obtained from the participant, and the Institutional Review Board of the Korea National Institutes of Health (NIH) approved this study.

Whole-exome data analysis
Whole-exome libraries were generated from genomic DNA of one individual using the Seq-Cap EZ Human Exome Library v2.0 (Roche/NimbleGen, Madison, WI, USA) and sequenced using the Illumina HiSeq2000 system (Illumina, San Diego, CA, USA) with paired end reads of 101 bp according to the manufacturer's protocols. Raw reads in FASTQ format from WES were aligned to the reference genome hg19 using the Burrows-Wheeler Aligner (BWA; http:// bio-bwa.sourceforge.net/). Duplicates were removed with Picard (http://picard.Sourceforget. net).

Sanger sequencing analysis
INDELs found using the four algorithms were subsequently validated with Sanger sequencing. The Primer3 program (http://frodo.wi.mit.edu/primer3) was used to design primers for amplification of the INDELs identified via exome sequencing. Amplicons from blood genomic DNA were analyzed via gel electrophoresis and were sequenced using an ABI 3730 genetic analyzer (Applied Biosystems, Forster City, CA, USA) with forward and reverse primers.

Statistical analysis
Their effects on positive predictive value (PPV) and sensitivity were tested using Pearson's correlation tests. To assess the performance of the different algorithms, we defined several metrics. We defined a call as a true-positive (TP) when WES called a variant and Sanger sequencing detected a variant. A false-positive (FP) call was considered when WES called a variant but Sanger sequencing revealed a wild-type; PPV was calculated as TP/(TP+FP). We defined a false-negative (FN) when Sanger sequencing detected a variant, but the WES called this locus a reference; the sensitivity was calculated as TP/(TP + FN).

Performance of INDEL calling in WES
We provide an analysis pipeline for the detection of INDELs. The genomic pipeline is outlined in Fig 1. For INDEL detection, BAM files were merged so that INDEL calling was performed using four algorithms (i.e. GATK, SAMtools, Dindel, and Freebayes), and were analyzed. The identified INDELs were then annotated using ANNOVAR to include information such as what gene the variant was in and the consequence of the mutation. S1 Table lists all 840 INDELs identified from the human exome data using the four algorithms.

Validation of INDELs by Sanger sequencing
Sanger sequencing was used to evaluate INDEL calling by the four algorithms. The INDEL counts from the four algorithms and validation are presented in Table 1 We compared the distribution of INDEL sizes called by the four algorithms. All INDEL distributions based on size are shown in Fig 2B. We found that 800 (95%) of the INDELs were 1-10 bp in size. In fact, most INDELs called were 10 bp, which accounted for 95% (665) of calls by GATK, 96% (535) of calls by SAMtools and Dindel 96% (680), and 97% (575) of calls by Freebayes.

Discussion
In this study, we investigated the performance of tools available for the INDEL detection from WES data. We evaluated four publicly available algorithms that are well-known for calling   [14][15][16][17][18]. Previous evaluation by Neuman et al. was based on simulated data [14]. Notably, only random selected 215 INDELs were validated [15,16]. However, our study used actual  GATK is a collection of analysis tools for human data that was developed by the Broad Institute. GATK performs variant calling using HaplotyperCaller (HC) [10]. SAMtools is based on a Bayesian model for INDEL calling, which parses SAM and BAM files and includes BCFtools to call SNPs and short INDELs from a single alignment [11]. Dindel is a program developed by the Wellcome Trust Sanger Institute that uses a Bayesian approach for calling INDELs from NGS data [12]. Freebayes is a Bayesian genetic variant detector designed to find SNPs, INDELs, MNPs, and complex events smaller than the length of a short-read sequencing alignment [13].
In our actual validation data, a total of 629 true positive INDELs in GATK and 628 in Dindel were identified. GATK and Dindel had the least FNs and the highest number of TPs, showing sensitivity of 90.2% (GATK: 629 of 697) and 90.1% (Dindel: 628 of 697), respectively. We also examined the positive predictive value (PPV) for the two algorithms, and GATK had a higher PPV than Dindel (89.5 vs. 88.6%). On the other hand, SAMtools and Freebayes had the least FPs. By decreasing the false positive rate, the accuracy (PPV) of SAMtools and Freebayes improved to 94.4% (525 of 556) and 94.4% (528 of 591), but it reduce the power of true positive INDEL detection. The GATK and SAMtools intersection INDELs were much higher than those of the intersection for GATK and Dindel, Dindel and SAMtools, and GATK and Freebayes. Based on these results, GATK had the fewest FN calls, while SAMtools had the fewest FP calls. Thus, GATK had high sensitivity, while SAMtools had high accuracy. Collectively, GATK and SAMtools complement the strengths and weaknesses of the other algorithm to yield superior results.
We compared the distribution of INDEL size called by the four algorithms. Most INDELs called by the algorithms were 10 bp. The statistical tests showed that the distribution of INDEL size did not differ significantly among the algorithms. In other words, INDEL size is not a confounding factor that affects the performance of these calling algorithms.
To determine the error of INDEL call from WES data, INDELs were compared based on where they were repeats or heterozygous. The PPVs for heterozygous and repeat INDELs were 76.9 and 34.9%, respectively, while homozygous and non-repeat INDELs were validated 92.1 and 88.4%. For the heterozygous and repeat INDELs called by both GATK and SAMtools, 96.0 and 81.0%, were successfully validated.
GATK had the highest sensitivity of all the algorithms, while SAMtools had high PPV. Thus, we recommend that GATK and SAMtools be used in combination for the detection of INDELs. GATK and SAMtools show better performance in calling INDELs than Dindel and Freebayes. Additionally, two key sources of difficulties in accurate INDEL detection are the presence of repeats and heterozygous INDELs. Our study may also serve as a basis for understanding the accuracy and completeness of INDEL detection. We believe that our method is a useful tool for understanding human diseases through WES analysis.
Supporting information S1