The authors have declared that no competing interests exist.
Conceived and designed the experiments: KJK GHF ARM NPT JTD. Performed the experiments: KJK GHF ARM JTD. Contributed reagents/materials/analysis tools: KJK GHF ARM MS NPT JTD. Wrote the paper: KJK JTD.
The increasing public availability of personal complete genome sequencing data has ushered in an era of democratized genomics. However, read mapping and variant calling software is constantly improving, and individuals with personal genomic data may prefer to customize and update their variant calls. Here, we describe STORMSeq (Scalable Tools for Open-Source Read Mapping), a graphical interface cloud computing solution that does not require a parallel computing environment or extensive technical experience. This customizable and modular system performs read mapping, read cleaning, and variant calling and annotation. At present, processing a full exome sequence with STORMSeq costs approximately $2 and takes 5–10 hours, and processing a whole genome sequence costs approximately $30 and takes 3–8 days. We provide this open-access and open-source resource as a user-friendly interface in Amazon EC2.
Individuals are now empowered to obtain and explore their full personal genome and exome sequences owing to declining costs in genome sequencing, and direct-to-consumer genetic testing companies have begun to provide sequencing services: in 2011, 23andMe conducted a pilot exome sequencing program for $999, while at the time of this writing, DNADTC provides the service for $895. Software and algorithms for short read mapping and variant calling are an active area of development, and individuals may prefer to customize which software or parameters are used to process their raw genetic data. However, because these programs require significant computational resources, such a task is generally intractable without access to large-scale computing resources. Furthermore, running the required software pipeline demands either proficiency in command-line programming or expensive commercial software geared towards experts. These concerns can be ameliorated by the use of intuitive open-source software operating in a cloud-computing environment.
A number of solutions enabling researchers to process sequencing data using cloud computing are available. The majority of open-source, cloud-based tools for genomic data are command-line based and require substantial technical skills to use. Notable exceptions are Galaxy, Crossbow, and SIMPLEX. Galaxy aims to provide a reproducible environment for genome informatics accessible to non-technical investigators
Thus, we created STORMSeq (Scalable Tools for Open-source Read Mapping) to fill the need for a user-friendly processing pipeline for personal human whole genome and exome sequence data. STORMSeq utilizes the Amazon Web Services (
STORMSeq's cloud-based architecture is illustrated in
The user uploads short reads to Amazon S3 and starts a webserver on Amazon EC2, which controls the mapping and variant calling pipeline. Progress can be monitored on the webserver and results are uploaded to persistent storage on Amazon S3.
Read mapping software packages, including BWA
Read cleaning pipeline with GATK
Variant (SNP and indel) calling packages, GATK and Samtools
Annotation using VEP
The system backend is modular, and designed to be easily expandable by researchers wishing to add additional functionality or incorporate other software packages.
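As an illustration of what a modular backend of this kind can look like, the following is a minimal sketch in which each pipeline stage (mapping, cleaning, calling, annotation) has interchangeable tools registered under it. The registry, the function names, and the file-naming scheme are all hypothetical and are not taken from the STORMSeq source code; the stage functions are placeholders that only build output file names rather than invoking the real tools.

```python
from typing import Callable, Dict, List

# Hypothetical registry: stage name -> {tool name -> implementation}.
# Nothing here is taken from the STORMSeq code base.
Stage = Callable[[str], str]
REGISTRY: Dict[str, Dict[str, Stage]] = {}
STAGE_ORDER: List[str] = ["map", "clean", "call", "annotate"]


def register(stage: str, tool: str) -> Callable[[Stage], Stage]:
    """Decorator that registers a tool implementation under a stage."""
    def wrap(func: Stage) -> Stage:
        REGISTRY.setdefault(stage, {})[tool] = func
        return func
    return wrap


# Placeholder stages: each one only derives an output file name.
@register("map", "bwa")
def map_with_bwa(sample: str) -> str:
    return f"{sample}.bwa.bam"


@register("map", "snap")
def map_with_snap(sample: str) -> str:
    return f"{sample}.snap.bam"


@register("clean", "gatk")
def clean_with_gatk(path: str) -> str:
    return path.replace(".bam", ".clean.bam")


@register("call", "gatk")
def call_with_gatk(path: str) -> str:
    return path.replace(".bam", ".gatk.vcf")


@register("call", "samtools")
def call_with_samtools(path: str) -> str:
    return path.replace(".bam", ".samtools.vcf")


@register("annotate", "vep")
def annotate_with_vep(path: str) -> str:
    return path.replace(".vcf", ".vep.vcf")


def run_pipeline(sample: str, choices: Dict[str, str]) -> str:
    """Run the stages in order, using the tool chosen for each stage."""
    artifact = sample
    for stage in STAGE_ORDER:
        artifact = REGISTRY[stage][choices[stage]](artifact)
    return artifact


print(run_pipeline(
    "exome1",
    {"map": "bwa", "clean": "gatk", "call": "samtools", "annotate": "vep"},
))
```

Under this design, adding another mapper or variant caller amounts to registering one more function under the relevant stage, without touching the code that runs the pipeline.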
Once the user has set the relevant parameters (or accepted the default set provided) and clicked “GO,” the system starts a compute cluster on the Amazon Elastic Compute Cloud and runs the relevant software. The number of machines started depends on the number of files uploaded and on whether exome or genome analysis is selected. The software is free to use; the user pays only for compute time and storage on the Amazon servers. As of 11/1/13, spot instances cost $0.026 per hour for the (large) systems required for BWA and $0.14 per hour for the (quadruple extra-large) high-memory systems required for SNAP, and persistent storage of reads and variant call results costs $0.095 per GB-month. As the pipeline progresses, a progress bar is updated on the webserver and once the pipeline is finished, summary statistics, such as depth of coverage and other variant information, and visualizations using ggbio
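To make the pricing model concrete, the sketch below estimates the cost of a run from the spot rates quoted above. The rates are those stated in the text and will have changed since; the machine count, run time, and storage volume in the example call are made-up inputs chosen only for illustration, and the function is not part of STORMSeq.

```python
# Rates quoted in the text (spot pricing as of 11/1/13).
SPOT_RATE_LARGE = 0.026     # USD per hour, "large" instance used for BWA
SPOT_RATE_HIGH_MEM = 0.14   # USD per hour, high-memory instance used for SNAP
STORAGE_RATE = 0.095        # USD per GB-month of persistent storage


def estimate_cost(machines: int, hours: float, hourly_rate: float,
                  storage_gb: float = 0.0, storage_months: float = 0.0) -> float:
    """Estimate total cost in USD as compute time plus persistent storage."""
    compute = machines * hours * hourly_rate
    storage = storage_gb * storage_months * STORAGE_RATE
    return round(compute + storage, 2)


# Hypothetical inputs: 4 large machines for 10 hours, 20 GB kept for 1 month.
print(estimate_cost(4, 10, SPOT_RATE_LARGE, storage_gb=20, storage_months=1))
```

This simple model ignores data transfer and any other charges, so it should be read as a rough lower bound rather than a reproduction of the reported costs.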
STORMSeq provides basic visualization for summary statistics, such as (A) genome-wide SNP density and (B) size distribution of short indels.
We tested the STORMSeq system using two paired-end 100 bp read datasets: a personal genome sequence dataset with 1.1B reads (approximately 38X coverage), and a personal exome sequence dataset with 90M reads (approximately 45X coverage; available in STORMSeq's demo functionality). For the personal exome data, the pipeline cost approximately $2 using spot pricing and took 10 hours using BWA and 5 hours using SNAP (
Analysis Type | Exome | Exome | Genome | Genome |
Pipeline | SNAP | BWA | SNAP | BWA |
Cost (Spot) | $2.26 | $1.90 | $26.42 | $32.76 |
Cost (On-demand) | $19.68 | $8.16 | $254.20 | $129.12 |
Time | 5 h | 10 h | 176 h | 98 h |
Note that these costs are approximate and may depend on a number of factors related to the input files.
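The figures in the table can be compared directly with a few lines of arithmetic. The sketch below copies the table values and derives, for each run, the ratio of on-demand to spot cost and the run time in days; the variable and function names are illustrative only.

```python
# Values copied from the table: (spot USD, on-demand USD, hours).
RUNS = {
    ("exome", "SNAP"): (2.26, 19.68, 5),
    ("exome", "BWA"): (1.90, 8.16, 10),
    ("genome", "SNAP"): (26.42, 254.20, 176),
    ("genome", "BWA"): (32.76, 129.12, 98),
}


def summarize(runs):
    """Return the on-demand/spot cost ratio and run time in days per run."""
    summary = {}
    for key, (spot, on_demand, hours) in runs.items():
        summary[key] = {
            "on_demand_over_spot": round(on_demand / spot, 1),
            "days": round(hours / 24, 1),
        }
    return summary


for key, stats in summarize(RUNS).items():
    print(key, stats)
```

On these numbers, on-demand pricing is roughly 4 to 10 times more expensive than spot pricing, and the two genome runs take about 4.1 days (BWA) and 7.3 days (SNAP).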
We offer STORMSeq free for public use, where users pay only for compute time on the Amazon cloud. The source code for the STORMSeq software is available for download from
We would like to acknowledge the individuals who helped in the design of the system at the BioCurious hackathon in July 2012, in particular David Dehghan for his insights on cloud computing.