Fig 1.
Workflow for comparative performance evaluation of variant callers.
Schematic overview of the benchmarking methodology. Seven variant calling tools (FreeBayes, SAMtools, DeepVariant, GATK, Strelka2, VarScan2, and Octopus) were run on the same sequencing dataset using their default settings. Each tool produced a single Variant Call Format (VCF) file. These VCF files were then evaluated against a high-confidence gold standard variant set using RTG Tools’ vcfeval. The benchmarking produced two categories of results: performance metrics (precision, recall, and F1-score) and detailed variant statistics (e.g., Ti/Tv ratio, Het/Hom ratio, indel counts).
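As a minimal sketch of how the three performance metrics relate to one another, the snippet below computes precision, recall, and F1-score from true-positive, false-positive, and false-negative counts. The function name and the counts are hypothetical and chosen for illustration only. vcfeval reports these metrics itself, and its own counting of true positives may differ in detail from this simplified single-count version.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from TP, FP and FN counts.

    Simplifying assumption: one shared true-positive count is used
    for both precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not results from this study).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, a caller with high recall but low precision is pulled toward the lower of the two values.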
Fig 2.
Performance metrics for variant callers on chromosome 20.
Comparison of precision, recall, and F1-score across seven variant calling tools evaluated on the NA12878 chromosome 20 subset. All metrics were calculated using RTG Tools’ vcfeval against the Genome in a Bottle consortium gold standard reference.
Fig 3.
Performance metrics for variant callers on whole-genome sequencing (WGS) data.
Comparison of precision, recall, and F1-score across five variant calling tools (GATK, Strelka2, Octopus, FreeBayes, SAMtools) evaluated on the NA12878 whole-genome dataset. All metrics were computed with RTG Tools’ vcfeval against the Genome in a Bottle (GIAB) truth set.
Fig 4.
Stability of recall from chromosome 20 to whole-genome sequencing.
Comparison of recall for four variant callers (GATK, Octopus, FreeBayes, SAMtools) between the chromosome 20 subset and the whole-genome sequencing (WGS) dataset. Recall remained highly stable across data scales for most tools, with GATK, Octopus, and FreeBayes showing negligible change. SAMtools was the only caller to show a clear improvement, increasing from 0.955 on chr20 to 0.975 on WGS (+0.020). This consistency suggests that the sensitivity of these variant callers does not depend on the scale of the input data. Metrics were computed with RTG Tools’ vcfeval against the Genome in a Bottle (GIAB) truth set.
Fig 5.
Change in precision from chromosome 20 to whole-genome sequencing.
Comparison of precision for four variant callers (GATK, Octopus, FreeBayes, SAMtools) between the chromosome 20 subset and the whole-genome sequencing (WGS) dataset. All four tools showed an increase in precision on WGS data. GATK exhibited the largest improvement (from 0.662 to 0.777, +0.115), followed closely by FreeBayes. Octopus, which had the highest precision on chr20 (0.741), maintained the highest absolute precision on WGS (0.801) with a gain of +0.060. The precision shift suggests that a whole-genome context provides enhanced statistical power for filtering false positives. Metrics were computed with RTG Tools’ vcfeval against the Genome in a Bottle (GIAB) truth set.
Fig 6.
Change in F1-score from chromosome 20 to whole-genome sequencing.
Comparison of F1-scores for four variant callers (Octopus, GATK, SAMtools, FreeBayes) between the chromosome 20 subset and the whole-genome sequencing (WGS) dataset. All tools showed improved F1-scores on WGS data, driven primarily by gains in precision. GATK exhibited the largest gain (+0.078), improving from 0.791 to 0.869. FreeBayes and SAMtools showed similar improvements. Octopus, which started with the highest F1-score on chr20 (0.846), showed a moderate gain (+0.037) but achieved the highest final F1-score of 0.883 on WGS. Metrics were computed with RTG Tools’ vcfeval against the Genome in a Bottle (GIAB) truth set.
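The chr20-to-WGS changes discussed in the captions for Figs 4 to 6 can be checked with simple subtraction. The sketch below uses only the value pairs explicitly quoted in those captions. The dictionary layout and the helper name are assumptions made for illustration, and callers whose values are not quoted in the captions are left out.

```python
# (chr20, WGS) value pairs as quoted in the captions for Figs 4-6.
reported = {
    ("recall", "SAMtools"): (0.955, 0.975),
    ("precision", "GATK"): (0.662, 0.777),
    ("precision", "Octopus"): (0.741, 0.801),
    ("F1", "GATK"): (0.791, 0.869),
    ("F1", "Octopus"): (0.846, 0.883),
}


def delta(chr20: float, wgs: float) -> float:
    """Change from chr20 to WGS, rounded to three decimals like the source values."""
    return round(wgs - chr20, 3)


deltas = {key: delta(*pair) for key, pair in reported.items()}

for (metric, caller), change in deltas.items():
    print(f"{metric:<9} {caller:<9} {change:+.3f}")
```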
Table 1.
Computational efficiency of variant calling tools.
Table 2.
Characterization of variant callers.
Fig 7.
Variant caller selection guide based on precision and runtime performance.
This decision guide maps each variant caller along two operational dimensions, precision and computational runtime, to support the choice of a caller for a given use case.