Optimized detection of insertions/deletions (INDELs) in whole-exome sequencing data

Bo-Young Kim; Jung Hoon Park; Hye-Yeong Jo; Soo Kyung Koo; Mi-Hyun Park

doi:10.1371/journal.pone.0182272

Abstract

Insertion and deletion (INDEL) mutations, the most common type of structural variation, are associated with several human diseases. The detection of INDELs through next-generation sequencing (NGS) is becoming more common owing to decreasing costs and to improvements in the efficiency and sensitivity of sequencing platforms and analytical tools. However, INDEL variant calling is still error-prone, and distinguishing true INDELs from NGS errors remains challenging. To evaluate INDEL calling from whole-exome sequencing (WES) data, we performed Sanger sequencing for all INDELs called by several calling algorithms. We compared the performance of four algorithms (GATK, SAMtools, Dindel, and Freebayes) for INDEL detection in the same sample. The sensitivity and positive predictive value (PPV) were 90.2 and 89.5% for GATK, 75.3 and 94.4% for SAMtools, 90.1 and 88.6% for Dindel, and 80.1 and 94.4% for Freebayes. GATK had the highest sensitivity. Furthermore, we identified INDELs with high PPV (4 algorithms intersection: 98.7%, 3 algorithms intersection: 97.6%, and GATK and SAMtools intersection: 97.6%). We present two key sources of difficulty in accurate INDEL detection: 1) the presence of repeats, and 2) heterozygous INDELs. We suggest accessible algorithms that selectively reduce error rates and thereby facilitate INDEL detection. Our study may also serve as a basis for understanding the accuracy and completeness of INDEL detection.

Citation: Kim B-Y, Park JH, Jo H-Y, Koo SK, Park M-H (2017) Optimized detection of insertions/deletions (INDELs) in whole-exome sequencing data. PLoS ONE 12(8): e0182272. https://doi.org/10.1371/journal.pone.0182272

Editor: Obul Reddy Bandapalli, German Cancer Research Center (DKFZ), GERMANY

Received: February 23, 2017; Accepted: July 14, 2017; Published: August 9, 2017

Copyright: © 2017 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This study was supported by the Korea National Institute of Health intramural research grant 4800-4861-312-210. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The author employed by Macrogen Inc. received no specific funding for this work, and this company did not play a role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: Macrogen Inc. does not affect our adherence to PLOS ONE policies on sharing data and materials.

Introduction

Recent advances in next-generation sequencing (NGS) technologies have rapidly altered the research and routine work of human geneticists. Specifically, whole-exome sequencing (WES) has been used to elucidate genetic variants underlying human diseases [1]. WES has proven to be a valuable method for the discovery of the genetic causes of rare and complex diseases due to its moderate costs, the amount of manageable data, and straightforward interpretation of results [2, 3].

Several types of natural genetic variation are present in patient samples, including single-nucleotide polymorphisms (SNPs), short insertions or deletions (INDELs) ranging from 1 base pair (bp) to 10 kilobases (kb) in length, and larger structural variants ranging from 10 kb to several megabases in length. INDELs are a common and functionally important type of sequence polymorphism [4]. Characterizing INDELs will provide an important resource for applications in medical sequencing, as INDELs have been implicated in a number of diseases [5].

By applying NGS on a large scale, WES is now possible at an individual level [6]. One of the most important aspects of genetics is to identify genetic variants in individuals [1]. INDELs can cause or contribute to human genetic diseases. For example, cystic fibrosis (CF, MIM #219700), neurofibromatosis (NF1, MIM #162200), Charcot-Marie-Tooth neuropathy type 2A (CMT2A, MIM #118210), glycogen storage disease 2 (GSD2, MIM #232300), Huntington disease (HD, MIM #143100), and Duchenne muscular dystrophy (DMD, MIM #310200) are caused by INDELs in the coding regions of DNA. Therefore, the results of INDEL calling from individual WES can be used to predict the future health of individuals and to develop customized medical treatments [7].

A large number of tools are available for short-read alignment and variant detection (e.g. SNVs and INDELs). However, the accurate detection of INDELs is still difficult and remains a critical issue. False-positive (FP) and false-negative (FN) rates are critical, especially for genetic diagnosis and Mendelian disease studies. For the future of personalized medicine and genetic diagnosis, highly accurate variant calling remains one of the most important problems [8].

In this study, we used whole-exome data from one human genome and analyzed four INDEL detection algorithms: Genome Analysis Toolkit (GATK), Sequence Alignment/Map tools (SAMtools), Dindel, and Freebayes. We evaluate these available and commonly used methods for detecting INDELs and compare their performance using actual validation data.

Materials and methods

Subject

This study examined whole-exome data available from a previous study [9]. Informed consent was obtained from the participant, and the Institutional Review Board of the Korea National Institutes of Health (NIH) approved this study.

Whole-exome data analysis

Whole-exome libraries were generated from genomic DNA of one individual using the SeqCap EZ Human Exome Library v2.0 (Roche/NimbleGen, Madison, WI, USA) and sequenced using the Illumina HiSeq2000 system (Illumina, San Diego, CA, USA) with paired-end reads of 101 bp according to the manufacturer’s protocols. Raw reads in FASTQ format from WES were aligned to the reference genome hg19 using the Burrows-Wheeler Aligner (BWA; http://bio-bwa.sourceforge.net/). Duplicates were removed with Picard (http://picard.sourceforge.net).

WES data were analyzed using four INDEL calling algorithms, (1) GATK (http://www.broadinstitute.org/gatk/) [10], (2) SAMtools (http://samtools.sourceforge.net/) [11], (3) Dindel [12], and (4) Freebayes [13], following the guidelines provided in the user manuals. INDELs were called with each algorithm and the variants annotated using the ANNOVAR program (http://www.openbioinformatics.org/annovar/).

Sanger sequencing analysis

INDELs found using the four algorithms were subsequently validated with Sanger sequencing. The Primer3 program (http://frodo.wi.mit.edu/primer3) was used to design primers for amplification of the INDELs identified via exome sequencing. Amplicons from blood genomic DNA were analyzed via gel electrophoresis and were sequenced using an ABI 3730 genetic analyzer (Applied Biosystems, Foster City, CA, USA) with forward and reverse primers.

Statistical analysis

To assess the performance of the different algorithms, we defined several metrics. We defined a call as a true positive (TP) when WES called a variant and Sanger sequencing detected the variant. A call was considered a false positive (FP) when WES called a variant but Sanger sequencing revealed the wild-type sequence; the positive predictive value (PPV) was calculated as TP/(TP + FP). We defined a false negative (FN) as a locus where Sanger sequencing detected a variant but WES called the reference; the sensitivity was calculated as TP/(TP + FN). Effects on PPV and sensitivity were tested using Pearson’s correlation tests.
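The two metrics defined above can be written as a short calculation. The following is a minimal sketch rather than the authors' code; the function names and the example counts are hypothetical and chosen only for illustration.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no calls were made")
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no validated variants")
    return tp / (tp + fn)


# Hypothetical counts, for illustration only (not taken from the study).
example_tp, example_fp, example_fn = 90, 10, 30
example_ppv = ppv(example_tp, example_fp)            # 90 / 100
example_sens = sensitivity(example_tp, example_fn)   # 90 / 120
print(f"PPV = {example_ppv:.1%}, sensitivity = {example_sens:.1%}")
```

Note that PPV depends only on the calls an algorithm makes, whereas sensitivity depends on the full set of validated variants, so an algorithm can score high on one and low on the other.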

Results

Performance of INDEL calling in WES

We provide an analysis pipeline for the detection of INDELs, outlined in Fig 1. BAM files were merged, and INDELs were called using four algorithms (GATK, SAMtools, Dindel, and Freebayes). The identified INDELs were then annotated using ANNOVAR with information such as the gene in which the variant lies and the consequence of the mutation. S1 Table lists all 840 INDELs identified from the human exome data using the four algorithms.

Fig 1. INDEL calling workflow in WES.

INDELs were called using four algorithms: GATK, SAMtools, Dindel, and Freebayes. Analysis pipelines and workflow systems are shown.

https://doi.org/10.1371/journal.pone.0182272.g001

Validation of INDELs by Sanger sequencing

Sanger sequencing was used to evaluate INDEL calling by the four algorithms. The INDEL counts from the four algorithms and the validation results are presented in Table 1. The 840 INDELs were detected in coding regions and included 429 insertions (51%) and 411 deletions (49%). Fig 2A shows the number of INDELs called by each algorithm. GATK reported 703 INDELs and SAMtools identified 556, while Dindel and Freebayes detected 709 and 591 INDELs, respectively.

Table 1. INDELs called and validation in four algorithms.

https://doi.org/10.1371/journal.pone.0182272.t001

Fig 2. Number of INDELs called by the four algorithms.

(A) INDELs were called using four algorithms: GATK, SAMtools, Dindel, and Freebayes. (B) Histograms of insertion (right) and deletion (left) counts by INDEL size. Counts were adjusted within each algorithm to account for the fraction of polarizable calls. (C) Accuracy of detection of INDELs in the four algorithms.

https://doi.org/10.1371/journal.pone.0182272.g002

We compared the distribution of INDEL sizes called by the four algorithms. All INDEL distributions based on size are shown in Fig 2B. We found that 800 (95%) of the INDELs were 1–10 bp in size. Most INDELs called by each algorithm were ≤ 10 bp: 95% (665) of calls by GATK, 96% (535) of calls by SAMtools, 96% (680) of calls by Dindel, and 97% (575) of calls by Freebayes.
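A size summary of this kind can be derived from the reference and alternate alleles of each call. The sketch below assumes VCF-style `(REF, ALT)` allele strings and takes the INDEL size as the difference in allele length; the function names and the toy calls are hypothetical and are not the study's data.

```python
from collections import Counter


def indel_size(ref: str, alt: str) -> int:
    """Signed size: positive for insertions, negative for deletions."""
    return len(alt) - len(ref)


def size_summary(calls, cutoff=10):
    """Count insertions, deletions, and calls of at most `cutoff` bp."""
    summary = Counter()
    for ref, alt in calls:
        size = indel_size(ref, alt)
        if size == 0:
            continue  # not an INDEL (e.g. a substitution)
        summary["insertion" if size > 0 else "deletion"] += 1
        if abs(size) <= cutoff:
            summary["short"] += 1
    return summary


# Hypothetical (REF, ALT) pairs, for illustration only.
toy_calls = [
    ("A", "ATG"),            # 2 bp insertion
    ("CTT", "C"),            # 2 bp deletion
    ("G", "G" + "A" * 12),   # 12 bp insertion, above the cutoff
    ("T", "C"),              # substitution, skipped
]
toy_summary = size_summary(toy_calls)
print(dict(toy_summary))
```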

We also examined the overall performance of the four algorithms and computed the sensitivity and positive predictive value (PPV) for each algorithm. The numbers of FP and FN INDELs for each algorithm are shown in Table 2. The sensitivity values for GATK, SAMtools, Dindel, and Freebayes were 90.2, 75.3, 90.1, and 80.1%, respectively. The PPVs for GATK (89.5%), SAMtools (94.4%), Dindel (88.6%), and Freebayes (94.4%) were determined by Sanger sequencing (Fig 2C). GATK had the highest sensitivity (90.2%), and SAMtools and Freebayes had the highest PPV (94.4%).

Table 2. Validation of the four algorithms used for INDEL calling with WES and Sanger sequencing.

https://doi.org/10.1371/journal.pone.0182272.t002

Comparison of INDEL-calling algorithms

We compared the performance of the GATK, SAMtools, Dindel, and Freebayes algorithms for INDEL detection (Table 3). Fig 3 shows the concordance and PPVs of INDELs called by each algorithm and by each intersection. The concordance was determined for the intersection of the four algorithms (461, 54.9%), three algorithms (494, 59.9%), and GATK and SAMtools (502, 66.3%) (Fig 3A). In addition, the PPVs for the four-algorithm intersection, the three-algorithm intersection, and the GATK and SAMtools intersection (98.7, 97.6, and 97.6%, respectively) were higher than those for the intersections of GATK and Dindel, Dindel and SAMtools, and GATK and Freebayes (94.6, 95.8, and 97.1%, respectively) (Fig 3B).
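Concordance between callers amounts to a set intersection once each call is reduced to a comparable key. The sketch below assumes that every call can be keyed by `(chromosome, position, REF, ALT)` and that all callers represent the same INDEL identically (for example, after consistent left-alignment); the caller names, keys, and validation set are hypothetical.

```python
from itertools import combinations


def intersect_calls(call_sets, callers):
    """Return the variant keys reported by every caller in `callers`."""
    return set.intersection(*(call_sets[name] for name in callers))


def ppv_of(calls, validated):
    """Fraction of `calls` confirmed by the validation set (TP / (TP + FP))."""
    if not calls:
        return float("nan")
    return len(calls & validated) / len(calls)


# Hypothetical keys: (chromosome, position, REF, ALT). Illustration only.
v1 = ("chr1", 100, "A", "AT")
v2 = ("chr1", 250, "CTG", "C")
v3 = ("chr2", 75, "G", "GA")
v4 = ("chr3", 900, "TAA", "T")
toy_sets = {
    "caller_a": {v1, v2, v3},
    "caller_b": {v1, v2, v4},
    "caller_c": {v1, v3, v4},
}
toy_validated = {v1, v2}  # calls confirmed by the orthogonal assay

shared_all = intersect_calls(toy_sets, list(toy_sets))
pairwise_ppv = {
    pair: ppv_of(intersect_calls(toy_sets, pair), toy_validated)
    for pair in combinations(sorted(toy_sets), 2)
}
print(shared_all, pairwise_ppv)
```

In practice the same INDEL can be reported at different positions by different callers, so the choice of key and normalization step strongly affects the measured concordance.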

Table 3. Comparison of INDEL-calling algorithms.

https://doi.org/10.1371/journal.pone.0182272.t003

Fig 3. Performance versus detected INDELs and PPVs.

(A) Concordance of INDEL detection between the four algorithms: GATK, SAMtools, Dindel, and Freebayes. Venn diagram showing the numbers and percentages of shared INDELs from the four algorithms: 4 algorithm intersection INDELs, 3 algorithm intersection INDELs, 2 algorithm intersection INDELs, and algorithm-specific INDELs. (B) Validation rates and PPVs of the intersecting INDELs between algorithms. The sensitivity increases at higher intersecting algorithms.

https://doi.org/10.1371/journal.pone.0182272.g003

The size distributions of validated INDELs are shown in Fig 4. Among the INDELs that were not validated, there was a striking enrichment of heterozygous INDELs (39.9%), and this set contained 9.2-fold more repeat INDELs than the validated set (18.4% vs. 2%). The PPVs of heterozygous INDELs (76.9%), homozygous INDELs (92.1%), repeat INDELs (34.9%), and non-repeat INDELs (88.4%) were also calculated. We found that the validation rates of heterozygous and repeat INDELs in the GATK and SAMtools intersection increased to 96.0 and 81.0%, respectively.
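Stratified PPVs of this kind can be computed by grouping calls on an annotation and taking the validated fraction within each group. The following is a minimal sketch; the record layout (`zygosity`, `in_repeat`, `validated`) and the toy records are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict


def ppv_by_group(records, field):
    """PPV per value of `field`; each record has a boolean 'validated'."""
    confirmed = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec[field]] += 1
        confirmed[rec[field]] += rec["validated"]  # True counts as 1
    return {group: confirmed[group] / total[group] for group in total}


# Hypothetical records, for illustration only.
toy_records = [
    {"zygosity": "het", "in_repeat": True, "validated": False},
    {"zygosity": "het", "in_repeat": False, "validated": True},
    {"zygosity": "hom", "in_repeat": False, "validated": True},
    {"zygosity": "hom", "in_repeat": False, "validated": True},
    {"zygosity": "het", "in_repeat": True, "validated": True},
    {"zygosity": "hom", "in_repeat": True, "validated": False},
]
by_zygosity = ppv_by_group(toy_records, "zygosity")
by_repeat = ppv_by_group(toy_records, "in_repeat")
print(by_zygosity, by_repeat)
```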

Fig 4. Sources of INDEL detection error from WES.

(A) Number of validated INDELs by INDEL size. (B) Percentages of homozygous, heterozygous, repeat, and non-repeat INDELs in the validated and not validated sets. (C) PPVs for the two error sources, 1) heterozygous and 2) repeat INDELs, in all calls and in the GATK and SAMtools intersecting call set.

https://doi.org/10.1371/journal.pone.0182272.g004

Discussion

In this study, we investigated the performance of tools available for INDEL detection from WES data. We evaluated four publicly available algorithms that are well known for calling short INDELs. We provide an analysis pipeline in which INDEL calling is performed using four algorithms (GATK, SAMtools, Dindel, and Freebayes) to identify TP INDEL calls while reducing FP calls.

Many studies have reported the INDEL calling capabilities of available tools from NGS data [14–18]. A previous evaluation by Neuman et al. was based on simulated data [14]. In other studies, only 215 randomly selected INDELs were validated [15, 16]. In contrast, our study used actual validation data: we report 840 INDELs called by the four programs in one human genome, all of which were tested by Sanger sequencing.

GATK is a collection of analysis tools for human data that was developed by the Broad Institute. GATK performs variant calling using HaplotypeCaller (HC) [10]. SAMtools is based on a Bayesian model for INDEL calling, which parses SAM and BAM files and includes BCFtools to call SNPs and short INDELs from a single alignment [11]. Dindel is a program developed by the Wellcome Trust Sanger Institute that uses a Bayesian approach for calling INDELs from NGS data [12]. Freebayes is a Bayesian genetic variant detector designed to find SNPs, INDELs, MNPs, and complex events smaller than the length of a short-read sequencing alignment [13].

GATK’s model is derived from Dindel’s model, so GATK is expected to show similar performance to Dindel. Freebayes is a haplotype-based caller, similar to GATK; however, GATK contains additional algorithms for filtering on low mapping quality and for local realignment (http://software.broadinstitute.org/gatk/) [19]. SAMtools may improve the processing of INDELs through likelihood algorithms, such as the indel genotype likelihood model, genotype-free analysis, and physical phasing (http://samtools.sourceforge.net/) [19, 20].

In our validation data, a total of 629 true-positive INDELs were identified by GATK and 628 by Dindel. GATK and Dindel had the fewest FNs and the highest numbers of TPs, with sensitivities of 90.2% (GATK: 629 of 697) and 90.1% (Dindel: 628 of 697), respectively. We also examined the positive predictive value (PPV) of the two algorithms, and GATK had a higher PPV than Dindel (89.5 vs. 88.6%). On the other hand, SAMtools and Freebayes had the fewest FPs. With fewer false positives, the PPVs of SAMtools and Freebayes reached 94.4% (525 of 556) and 94.4% (528 of 591), but at the cost of reduced power to detect true-positive INDELs. The PPV of the GATK and SAMtools intersection was higher than those of the intersections of GATK and Dindel, Dindel and SAMtools, and GATK and Freebayes. Based on these results, GATK had the fewest FN calls, while SAMtools had the fewest FP calls. Thus, GATK had high sensitivity, while SAMtools had high PPV. Collectively, GATK and SAMtools complement each other's strengths and weaknesses to yield superior results.
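The percentages quoted for GATK, Dindel, and SAMtools follow from the counts given in the text. The check below is a sketch under two assumptions that the text does not state explicitly: that FP equals the number of calls minus TP, and that FN equals the 697 validated INDELs minus TP. The total call counts (703, 709, and 556) are those reported in the Results.

```python
# Counts quoted in the text: total calls and true positives per algorithm,
# and 697 Sanger-validated INDELs overall.
TOTAL_VALIDATED = 697
reported = {
    "GATK": {"calls": 703, "tp": 629},
    "Dindel": {"calls": 709, "tp": 628},
    "SAMtools": {"calls": 556, "tp": 525},
}

metrics = {}
for name, counts in reported.items():
    fp = counts["calls"] - counts["tp"]      # assumed: called but not validated
    fn = TOTAL_VALIDATED - counts["tp"]      # assumed: validated but not called
    metrics[name] = {
        "fp": fp,
        "fn": fn,
        "ppv": round(100 * counts["tp"] / counts["calls"], 1),
        "sensitivity": round(100 * counts["tp"] / TOTAL_VALIDATED, 1),
    }

for name, m in metrics.items():
    print(name, m)
```

Under these assumptions the computed values match the reported sensitivity and PPV for all three algorithms, which illustrates the trade-off described above: SAMtools makes fewer calls and gains PPV, while GATK and Dindel make more calls and gain sensitivity.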

We compared the distribution of INDEL size called by the four algorithms. Most INDELs called by the algorithms were ≤10 bp. The statistical tests showed that the distribution of INDEL size did not differ significantly among the algorithms. In other words, INDEL size is not a confounding factor that affects the performance of these calling algorithms.

To identify sources of error in INDEL calling from WES data, INDELs were compared according to whether they were repeat INDELs or heterozygous. The PPVs for heterozygous and repeat INDELs were 76.9 and 34.9%, respectively, while those for homozygous and non-repeat INDELs were 92.1 and 88.4%, respectively. Of the heterozygous and repeat INDELs called by both GATK and SAMtools, 96.0 and 81.0%, respectively, were successfully validated.

GATK had the highest sensitivity of all the algorithms, while SAMtools had high PPV. Thus, we recommend that GATK and SAMtools be used in combination for the detection of INDELs. GATK and SAMtools show better performance in calling INDELs than Dindel and Freebayes. Additionally, two key sources of difficulties in accurate INDEL detection are the presence of repeats and heterozygous INDELs. Our study may also serve as a basis for understanding the accuracy and completeness of INDEL detection. We believe that our method is a useful tool for understanding human diseases through WES analysis.

Supporting information

S1 Table. Summary of INDELs called by GATK, SAMtools, Dindel, and Freebayes.

https://doi.org/10.1371/journal.pone.0182272.s001

(XLSX)

Acknowledgments

This study was supported by the Korea National Institute of Health intramural research grant 4800-4861-312-210. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The author employed by Macrogen Inc. received no specific funding for this work, and this company did not play a role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet. 2010; 42: 30–5. pmid:19915526
2. Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med. 2012; 63: 35–61. pmid:22248320
3. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011; 12: 745–55. pmid:21946919
4. Mullaney JM, Mills RE, Pittard WS, Devine SE. Small insertions and deletions (INDELs) in human genomes. Hum Mol Genet. 2010; 19: R131–R6. pmid:20858594
5. Miki Y, Swensen J, Shattuck-Eidens D, Futreal PA, Harshman K, Tavtigian S et al. A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science. 1994; 266: 66–71. pmid:7545954
6. Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA. 2009; 106: 19096–101. pmid:19861545
7. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Human Genet. 2008; 9: 387–402.
8. Shigemizu D, Fujimoto A, Akiyama S, Abe T, Nakano K, Boroevich KA et al. A practical method to detect SNVs and INDELs from whole genome and exome sequencing data. Scientific Reports. 2013; 3: 2161. pmid:23831772
9. Choi BO, Koo SK, Park MH, Rhee H, Yang SJ, Choi KG et al. Exome sequencing is an efficient tool for genetic screening of Charcot-Marie-Tooth Disease. Hum Mutat. 2012; 33: 1610–1615. pmid:22730194
10. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011; 43: 491–8. pmid:21478889
11. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25: 2078–2079. pmid:19505943
12. Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: accurate INDEL calls from short-read data. Genome Res. 2011; 21: 961–973. pmid:20980555
13. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. Preprint at arXiv: 1207.3907v2 [q-bio.GN].
14. Neuman JA, Isakov O, Shomron N. Analysis of insertion-deletion from deep-sequencing data: software evaluation for optimal detection. Brief Bioinform. 2013; 14: 46–55. pmid:22707752
15. Hasan MS, Wu X, Zhang L. Performance evaluation of INDEL calling tools using real short-read data. Human Genomics. 2015; 19: 20.
16. Hwang S, Kim E, Lee I, Marcotte EM. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Scientific Reports. 2015; 5: 17875. pmid:26639839
17. Ghoneim DH, Myers JR, Tuttle E, Paciorkowski AR. Comparison of insertion/deletion calling algorithms on human next-generation sequencing data. BMC Res Notes. 2014; 7: 864. pmid:25435282
18. Fang H, Wu Y, Narzisi G, O’Rawe JA, Barron LT, Rosenbaum J et al. Reducing INDEL calling errors in whole genome and exome sequencing data. Genome Med. 2014; 6: 89. pmid:25426171
19. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetic parameter estimation from sequencing data. Bioinformatics. 2011; 27: 2987–93. pmid:21903627
20. Spencer DH, Tyagi M, Vallania F, Bredemeyer AJ, Pfeifer JD, Mitra RD et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn. 2014; 16: 75–88. pmid:24211364

