ExScalibur: A High-Performance Cloud-Enabled Suite for Whole Exome Germline and Somatic Mutation Identification

Whole exome sequencing has facilitated the discovery of causal genetic variants associated with human diseases at deep coverage and low cost. In particular, the detection of somatic mutations from tumor/normal pairs has provided insights into the cancer genome. Although there is an abundance of publicly-available software for the detection of germline and somatic variants, concordance is generally limited among variant callers and alignment algorithms. Successful integration of variants detected by multiple methods requires in-depth knowledge of the software, access to high-performance computing resources, and advanced programming techniques. We present ExScalibur, a set of fully automated, highly scalable and modulated pipelines for whole exome data analysis. The suite integrates multiple alignment and variant calling algorithms for the accurate detection of germline and somatic mutations with close to 99% sensitivity and specificity. ExScalibur implements streamlined execution of analytical modules, real-time monitoring of pipeline progress, robust handling of errors and intuitive documentation that allows for increased reproducibility and sharing of results and workflows. It runs on local computers, high-performance computing clusters and cloud environments. In addition, we provide a data analysis report utility to facilitate visualization of the results that offers interactive exploration of quality control files, read alignment and variant calls, assisting downstream customization of potential disease-causing mutations. ExScalibur is open-source and is also available as a public image on Amazon cloud.


Introduction
Next Generation Sequencing (NGS) technologies are promptly becoming the most popular high-throughput strategy for drug discovery and biomedical research in the post-genome era. Whole Exome Sequencing (WES) is a powerful and cost-effective approach for the detection of Data Availability Statement: The NA12878 trio datasets used in this study are public available from the The Sequence Read Archive (SRA) database (accession numbers SRX079575, SRX079576, SRX079577). The AML datasets are available from The Cancer Genome Atlas (TCGA) for researchers who meet the criteria for access to the protected data. To submit an application, please follow the TCGA controlled-access data application process (URL: https://wiki.nci.nih.gov/display/TCGA/Application +Process). Once approved, researchers may use the TCGA sample IDs (provided in the supplementary single-nucleotide variants (SNVs) and small insertions/deletions (InDels) in exonic regions, which represent less than 2% of the human genome and are assumed to contain~85% of known disease-causing variants in Mendelian disorders [1]. Analysis of the sequencing data requires in-depth bioinformatics skills and tens to thousands of computer processors for mammalian-sized genomes, which generates difficulties for researchers who may not have the expertise or the access to high-performance computing (HPC) resources. Moreover, unlike microarrays, there is no standard protocol for analysis of WES data, which also depends on the biological questions of interest. Though many tools are available, great discrepancies were reported for short-read aligners and variant callers [2][3][4]. Despite the rapid decline of sequencing cost, it remains challenging and time consuming to analyze large amounts of sequencing data and synthesize useful biological insights.
To address these challenges, several NGS data analysis pipelines have been published that offer different functionalities and operate on various platforms [5][6][7][8][9][10][11]. Most pipelines implement only one aligner and/or variant caller, lacking the facility to compare and integrate results from different algorithms. Many either do not cover the entire analysis workflow from raw sequencing data to annotated variants, or are only able to detect germline (those inherited from parents) or somatic (those gained during development) mutations. While reports are often provided, few offer a portable dynamic interface for viewing both project-and samplelevel results. Moreover, setting up a pipeline usually requires complex installation and configuration, which may generate challenging tasks for most inexperienced users.
Our aim is to provide researchers the capacity to perform complex and computationallydemanding data analysis that simultaneously utilizes multiple alignment and variant detection algorithms with elastic access to resources on an as-needed basis. We present ExScalibur, a suite of highly scalable WES analysis pipelines for the detection of germline and somatic mutations, with the implementation of three aligners, six germline callers, and six somatic callers. It automates the full analysis workflow from raw sequencing reads to annotated variants and provides an interactive visualization of the results. Features include real-time progress monitoring, restarting of interrupted analyses, and seamless adaptation to different platforms. ExScalibur is an open-source project and is also available as a pre-configured environment on Amazon EC2, which greatly simplifies installation and management of complex analysis.

Pipeline Design
ExScalibur consists of germline (ExScalibur-GMD) and tumor/normal paired somatic mutation detection (ExScalibur-SMD) pipelines that analyze WES data generated on Illumina's high-throughput platform. A typical analysis workflow contains seven main modules: 1) quality control (QC), 2) preprocessing, 3) alignment, 4) alignment refinement, 5) variant calling and filtering, 6) annotation, and 7) project report generation (Fig 1). Both the germline and somatic pipelines implement three short read aligners and six variant callers. Any combination of aligner and caller can be specified by the user, allowing simultaneous launching of multiple callers and direct comparison of different variant detection results (S1 and S2 Tables). At the end of analysis, the pipelines automatically collect results into the archive directory, allowing for easy downloading of essential result files.
Quality control. At the beginning of a pipeline run, quality of raw sequencing reads is assessed for base quality, duplication level, nucleotide composition distribution, and GC bias. Users have the option to provide the pipeline with specific metrics for these QC categories. The QC statistics are subsequently parsed to determine whether the sequencing quality passes the chosen thresholds. Samples that pass all QC criteria will be carried on to the next analysis step. Highly modulated architecture of ExScalibur. The pipelines contain seven major analysis steps. First, the pipeline checks the quality of the sequencing reads, performs adapter trimming (for both SE and PE reads), and merges 3' overlapping PE reads (for PE reads only). Then the reads are aligned to the reference genome, filtered, duplicates removed, and the alignment refined. The pipelines calculate exon coverage and collect callable loci from the alignment. Afterwards, the pipelines detect, filter, and annotate variants for each aligner+caller combination. Finally, the pipelines archive the results, integrate metrics and all variants sets, and generate a project data analysis report for visualization in ExScaliburViz. At the pipeline completion, a runtime Preprocessing. Raw reads are processed to remove adapters (for both single-(SE) and paired-end (PE) reads by default) and to merge 3' overlapping mates (PE reads only), for the purpose of removing the artifacts of double-counting variants located in the overlapping regions.
Alignment. Processed reads are mapped to the reference genome using any combination of three short-read aligners including BWA-aln [12], BWA-mem [13], and Novoalign (Novocraft Inc., Malaysia). Unmapped reads and low-quality alignments are filtered out. Alignments from technical replicates (e.g., multiple runs/lanes) are merged and read duplicates are removed.
Alignment refinement. The alignment is further refined by local InDel realignment and base quality score recalibration (BQSR) following the GATK Best Practices [14]. Multiple alignment summary statistics are collected and exon coverage is calculated.
Variant filtering. Customized quality filters are applied to the raw calls to remove potential false positives (e.g., low coverage, low mapping quality, low variant quality, strong strand bias, strong read end bias, or those located within SNV clusters; S3 and S4 Tables). Somatic variants are further filtered by allele frequency (AF) in both tumor and normal samples (S4 Table). Additional flags can be added to germline variants to label those located within the ENCODE blacklist (BLK) [27] (https://sites.google.com/site/anshulkundaje/projects/ blacklists), low mappability regions (LMP) [28], or low complexity regions (LCR) [13], where alignment artifacts are more likely to occur.
Data analysis report generation. At the completion of a pipeline run, a comprehensive data analysis report is generated, which consists of various quality statistics and variant calls. ExScalibur aggregates variants and estimates the concordance of all aligner+caller combinations using a simple multiplicative score (N aligner x N caller ). To facilitate the exploration of the results, we provide ExScaliburViz, an R Shiny [35] web application for desktop viewing (S1-S4 Figs).

Pipeline Implementation
ExScalibur is implemented in BigDataScript (BDS), a platform-independent high-level programming language designed for pipeline development and management of large-scale data sets [36]. Utility scripts were written in Perl and Python to assist with customized pipeline functions. A project is initialized with a sample metadata table (S5 and S6 Tables) and a highly customizable pipeline configuration file. With the execution of one master script, the pipelines run from raw reads to annotated variants, and the real time progress is updated in log files. To report is generated to illustrate the timeline of analysis, with detailed description of the commands, inputs, outputs, and dependencies. Intermediate reports will be generated if the pipelines prematurely terminate due to software/hardware failure. doi:10.1371/journal.pone.0135800.g001 run ExScalibur on different platforms, the only requirement from the user is to specify a handful of platform-specific parameters (S7 Table), facilitating data sharing and reproducibility in the scientific community.
Highly modulated architecture. ExScalibur employs a flexible dependency structure, with multiple intermediate steps that are automated by BDS (Fig 1). For example, after ExScalibur-GMD completes a run with BWA-mem and GATK HaplotypeCaller, additional aligners and callers can be added without the need to repeat already-completed upstream modules. In addition, users have the option to generate a customized analysis workflow by including specific modules from the pipelines (S8 and S9 Tables).
Highly scalable analysis. ExScalibur can be easily scaled to analyze tens to thousands of samples simultaneously given sufficient computing power. With a small cluster on Amazon EC2 (5 nodes; 8 cores/node; 14.6GB RAM/node), analyses of human WES data on three germline samples (80x coverage) and two tumor/normal pairs (50x coverage) involving two aligners and two variant callers finished within 12 hours and 16 hours respectively, at a cost of less than 10 US dollars per sample. To demonstrate scalability, we simulated 100 exome samples from human chromosome 22 with 1 million 2x100bp PE reads per sample, and ran the pipelines using one aligner and one caller with 32 cores on Amazon EC2. The entire analysis workflow finished within 4 hours with over 5,000 tasks successfully executed.
Robust handling of errors. ExScalibur captures the abnormal exit status of a task and optionally launches job resubmission through BDS [36]. If ExScalibur detects a software/hardware failure, it will gracefully handle the termination of all tasks (e.g., deleting all dependency jobs and removing incomplete files) and report detailed information of those that failed. The analysis can be restarted from the interrupted breakpoint, taking advantage of the highly modulated dependency structure.
Intuitive pipeline documentation. A runtime report is generated by BDS at the completion of a run, where all the commands, input and output files, and dependencies are easily accessible. In addition, the report displays an overview of the timeline of each module in an interactive graph (S1 File). A YAML format of the report is also generated that can be used for collecting runtime stats, creating custom plots and quickly retrieving commands, dependency, runtime and exit status of each task.
Optimized parallelization procedure. ExScalibur implements a scatter-gather design [37] for variant calling, which splits the exome callable regions into a number of even-sized bins and merges the results. This design allows for the submission of hundreds to thousands of jobs on HPC clusters and cloud infrastructures, dramatically reducing the analysis time.
Availability and resources. ExScalibur is available under open-source license at http:// exscalibur.cri.uchicago.edu. The website hosts documentation and tutorials and provides access to an Amazon's Elastic Compute Cloud (EC2) image with pre-installed pipeline scripts and tools. With minimal installation requirements, users may instantiate the provided image with as many resources as needed. Nodes may be added or removed on the fly and ExScalibur can immediately make use of the available hardware. The cloud image is built on StarCluster [38] running Ubuntu operating system, allowing fast and easy provision of a cluster environment in the cloud.

Results
To evaluate the pipeline performance, we ran GMD and SMD analyses on simulated/benchmark data and 30 acute myeloid leukemia (AML) tumor/normal pairs [39]. In addition, we illustrated discrepancies between the two commonly-used somatic mutation detection approaches by using the GMD-derived subtraction method to identify somatic calls via contrasting genotypes of paired tumor/normal samples and comparing the results with those directly detected by somatic callers in the SMD pipeline.

ExScalibur-GMD Evaluation
Raw PE reads were assessed by FastQC [40] (see S15 Table for all tool versions) for quality, preprocessed by SeqPrep [41], and aligned to human reference genome (hg19) using BWA-mem and Novoalign. Alignments were filtered to remove duplicates, unmapped reads, and reads with mapping quality (MAPQ) less than 30. GATK was used to realign InDel regions and recalibrate base quality score. Variants were identified using four callers (GATK HaplotypeCaller, FreeBayes, SAMtools mpileup/bcftools and IVC), filtered for confident calls, normalized (vcflib [42] and vt normalize [43]), and annotated using ANNOVAR. Unless otherwise noted, variants labeled with "HQ" (high quality) refer to those that passed all filters.
For evaluation, we focused on SNVs and small InDels, which composed the majority of the variant calls. Variants generated by every aligner+caller combination ("Observed") were compared to a validation dataset ("Expected"). Sites detected in both the observed and expected sets were considered true positives (TPs), while sites detected as variant in the observed set but as invariant in the expected set were considered false positives (FPs). True negatives (TNs) refer to sites detected as invariant in both sets, while false negatives (FNs) refer to sites detected as invariant in the observed set but as variant in the expected set. To make the results comparable, we focused on candidate loci covered by at least six reads across all combinations.
NIST-GIAB benchmark data. The NIST-GIAB benchmark data were generated by the Genome in a Bottle Consortium that provides a list of high-confidence variant calls from the genome of NA12878 [44]. In this study, we identified variants from a trio including NA12878 (SRA accession ID SRX079575), NA12891 (SRX079576) and NA12892 (SRX079577) using ExScalibur-GMD pipeline with different tools.
We retrieved NA12878 variants from the multi-sample variant detection results and compared to the NIST-GIAB gold standard dataset for evaluation. We removed off-target calls and variants located within genome regions where no confident calls could be made [44]. Variants of low quality or called as multiallelic were also excluded. A total of 15,914,394 loci were included for evaluation. We retrieved variants detected by at least two aligners and two callers (S10 Table, 2aligners+2callers). We detected the highest sensitivity in the "2aligner+2caller" list across all combinations for the detection of SNVs (99.03%). Compared to single combinations that detected over 11,000 true SNVs (e.g. BWA-mem+GATKHaplotypeCaller), this approach greatly reduced the number of false positives by recruiting multiple callers (113 versus 54). Close to 90% sensitivity was observed for the InDels, which is among the best-performing combinations but lower than that of GATKHaplotypeCaller, possibly due to large differences in the performance of the other three callers. Considering both sensitivity and precision, our results suggested that GATK HaplotypeCaller showed the best performance among the four callers.
AML data. We obtained WES data of 30 AML tumor/normal pairs from the TCGA portal (released December 2014). We included only the normal samples for ExScalibur-GMD evaluation. We used Affymetrix human SNP array 6.0 genotype calls as the validation set, which were processed by Washington University at St. Louis and the Broad Institute's TCGA groups. We retrieved concordant calls detected by both groups, lifted genomic coordinates from hg18 to hg19, filtered for array sites covered by at least six reads in the AML exome data, and retrieved candidate loci consistent across all combinations within each sample. On average, 38,970 ± 14,800 loci were included in the evaluation. We caution that this analysis was restricted to a limited set of loci targeted by the SNV array, which tend to include only the common SNVs from public databases.
We detected higher than 99% precision in all combinations when averaging across all samples (Table 1). A larger discrepancy in sensitivity was observed between single aligner+caller combinations, with BWA-mem and GATK HaplotypeCaller having the best performance. Of note, variants detected by at least two aligners and two callers showed the highest sensitivity (98.23% ± 0.99%) with little tradeoff in precision (99.72% ± 0.56%) ( Table 1, 2aligners+2callers).

ExScalibur-SMD Evaluation
Reads were processed, aligned to the hg19 assembly using BWA-mem and Novoalign, and refined as described above. Somatic variants were called using six somatic callers (MuTect, Shimmer, SomaticSniper, Strelka, VarScan2, and Virmid), followed by caller-specific filtering to remove ambiguous and low-confidence calls. After filtering, we restricted our evaluation to include only somatic SNVs that were heterozygous in the tumor sample, homozygous reference in the matched normal sample, and had at least 8x coverage.
Simulation data. We implemented the virtual-tumor benchmarking approach [20] to generate one dataset for the estimation of specificity and another for sensitivity (Datasets 1 and 2, respectively). Briefly, Dataset 1 was generated by randomly assigning WES reads from NA12891 to a virtual tumor/normal pair. Any somatic variants detected in the virtual sample pair were considered as false positives and used to estimate specificity (defined as 1-FP/total number of exome sites). Dataset 2 was generated using alignments from NA12891 and a second individual, NA12878. First, we retrieved high-confidence variants from the 1000 Genomes database where NA12878 was a homozygous reference and NA12891 was a heterozygous reference. Then, we simulated a virtual NA12878-tumor sample by substituting NA12891 alleles into NA12878 at a frequency of 0.8 and a minimum coverage of 20x. Any somatic variants detected in this NA12878-tumor/NA12878 pair were considered as true positives and used to estimate sensitivity (defined as TP/total number of substituted loci). To make the results comparable within each dataset, we focused on candidate loci available across all combinations.
We detected close to 90% sensitivity for all aligners when averaging across all callers ( Table 2). A larger variation was observed between SMD callers, with Shimmer and Strelka having the lowest sensitivity and SomaticSniper and VarScan2 having the greatest. Specificity was above 99% across all aligner+caller combinations. However, VarScan2 and SomaticSniper detected a relatively greater number of false positives (S11 Table). After applying the default filters implemented in ExScalibur-SMD, the number of false positives detected by VarScan2 dropped down to a similar level as the other somatic callers, while the discrepancy in SomaticSniper persisted ( Table 2). Of note, the combination of at least two aligners and two somatic callers produced the highest sensitivity with no tradeoff in specificity ( Table 2, 2aligners +2callers).
AML data. We detected somatic variants in the AML WES data using ExScalibur-SMD and compared results with validated somatic mutations generated by the TCGA group, which were identified by SomaticSniper and validated by hybridization arrays (validated somatic mutations; VSMs). We lifted genomic coordinates from hg18 to hg19. Because the use of the VSMs in our evaluation is limited to a handful of loci included in the validation panel, any somatic variants detected by ExScalibur-SMD but are not present on the panel cannot be evaluated. With this caveat in mind, we calculated two evaluation metrics for each sample: (1) Recovery rate, defined as the ratio of the number of VSMs detected in our results over the total number of VSMs; (2) Novel call rate, defined as the ratio of the number of VSMs detected in our results over the total number of somatic SNVs detected.
We observed large discrepancies in the recovery rate between aligner+caller combinations, with an average of 35% to 83% of the VSMs detected before filtering (S12 Table, VSM Recovery Rate). Of note, applying the somatic variant filters dramatically reduced the number of false positives but with a tradeoff in the recovery rate. In particular, SomaticSniper showed the most drastic drop in VSM recovery rate (from 44% to 9%) after filtering, mostly due to low genotype quality in the tumor sample. The combination of two aligners and two callers produced high recovery rates similar to VarScan2 but resulted in better performance than all other single aligner+caller combinations (S12 Table, 2aligner+2caller). Interestingly, the majority of somatic variants detected by ExScalibur-SMD did not overlap with the VSMs, suggesting that the use of multiple callers may increase the sensitivity of somatic mutation detection.

Comparison of somatic calls between SMD and GMD pipelines
Germline variants are usually associated with an expected ploidy-dependent allele frequency. In a diploid genome, this frequency is expected to be close to 0% for homozygous reference alleles, 50% for heterozygous alleles, and 100% for homozygous alternative alleles. In contrast, somatic variants often have an unexpected spectrum of tumor allele frequencies and ploidy changes. Moreover, reliable detection of somatic mutations is often compromised by contamination from the normal tissue. To address these issues, modern SMD software requires paired tumor/normal samples and implements complex statistical models to handle unexpected frequencies correcting for contamination rate. An alternate approach involves the detection of variants in tumor and normal samples separately and then contrasting the tumor and normal genotypes (GMD-derived subtraction approach). In this case, somatic variants were identified as sites that carry homozygous or heterozygous alternative alleles in the tumor sample but carry homozygous reference in the matched normal sample.
To compare the two somatic mutation detection approaches (the paired tumor/normal SMD vs. GMD-derived subtraction), we analyzed the AML data using both pipelines. For SMD, the approaches were described in the previous section. For GMD, we retrieved SNVs from three callers (GATK HaplotypeCaller, FreeBayes, and SAMtools) and filtered for loci that were heterozygous genotype in tumor sample (AF > 0.20) and homozygous reference in matched normal (AF < 0.05). On average, an over 80% recovery rate was observed for GATK HaplotypeCaller and SAMtools (S13 Table). Of note, a higher recovery rate was observed (92%) in variants concordantly detected by at least 2 aligners and 2 callers, with a reduction of 90% in the number of false positivies.
Overall, the GMD-subtraction method detected more than double the number of somatic mutations compared to the SMD pipeline (S14 Table). Of these, 29.03% of the somatic mutations detected by SMD overlapped with 8.32% of the somatic mutations detected by GMD-subtraction. To further investigate the large discrepancy between GMD and SMD somatic calls, we randomly selected 50 variants detected by each pipeline and visually inspected them in Integrative Genomics Viewer (IGV) [45,46]. Our manual inspection suggested that many of the differences could be explained by discrepancies in allele frequencies, perhaps due to intrinsic differences in the variant caller algorithms between SMD and GMD. Though the GMD-subtraction method showed a higher recovery rate, we recommend using SMD for somatic variant calling due to its low FP rate and high sensitivity.

Discussion
We introduce ExScalibur, a set of highly scalable and configurable WES pipelines. The pipelines cover the complete workflow from raw reads to variant calling and annotation, allowing accurate detection of germline and somatic variants in the human genome. ExScalibur executes the requested analysis steps, allows for fine control over software parameters (with carefully chosen default parameters), manages data across all processes, and distributes computationally expensive tasks across HPC nodes. It is available for implementation across platforms, facilitating large-scale data analysis in individual laboratories as well as institutions that process samples routinely. We also provide a ready-to-use virtual image that can be easily deployed on Amazon EC2, allowing execution of complex sequencing analyses for researchers who may not have access to a HPC environment. In our experience, ExScalibur is the first WES analysis suite implemented in the BDS language, which is equipped with unique features to manage pipeline execution and robustness for the complex analysis of big data.
Our evaluation suggests that the combination of multiple aligners and callers often results in more confident variant detection in both GMD and SMD pipelines. While low concordance was observed between somatic variant callers, we recommend using more than one caller and retrieving concordant calls detected by at least two or more somatic callers for increased sensitivity and confidence.
ExScalibur is a set of open-source pipelines that assist researchers in quickly gaining biological insights into genomic aberrations identified through exome sequencing. We believe it will be highly useful to those who do not have access to large-scale hardware resources or necessary expertise to run the analyses. More importantly, our suite of tools will provide a new framework to implement and compare different aligners and variant callers. ExScalibur is under active development and maintained for long-term use. The pipelines are under heavy use in a biomedical research environment and have successfully identified causal mutations in rare Mendelian diseases and cancer.
Supporting Information S1 File. Pipeline runtime report of ExScalibur pipelines on sample data. Horizontal bars represent the progress of each module. Text on/next to each bar indicates sample/read group and software information. Runtime is shown as x-axis at the bottom of the panel. Task information and system settings not shown.   Table. Description of command-line parameters in ExScalibur-SMD pipeline. (XLSX) S10 Table. Evaluation of GMD germline variant detection in the benchmark dataset. (XLSX) S11 Table. Evaluation of SMD somatic SNV detection in the simulation datasets. All variants (before filtering) were included. (XLSX) S12 Table. Evaluation of SMD SNV detection in the AML dataset. Values represent averages and standard deviations across all 30 TCGA AML tumor/normal pairs. HQ: high quality. VSM: validated somatic mutations. See context for detail. (XLSX) S13 Table. Evaluation of GMD somatic SNV detection in the AML dataset. Values represent averages and standard deviations across 30 AML tumor/normal pairs. (XLSX) S14 Table. Comparison of somatic SNV detection between SMD in the AML dataset. Values represent averages and standard deviations across in 30 AML tumor/normal pairs. HQ: highquality variants. (XLSX) S15