Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing

Background The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. Results We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. Conclusions Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggests that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.


A. Requirements for pipeline Input
To run the full CloVR-16S analysis track, at least two different input files have to be provided by the user: a sequence file in the FASTA format and a tab-delimited metadata file (.txt). Sequence data may consist of a single .fasta file that contains sequences from multiple samples, individually pyrotagged by sample-specific barcodes as commonly used in the 454 Amplicon Sequencing protocol (http://www.454.com/productssolutions/experimental-design-options/amplicon-sequencing.asp). No two FASTA headers within any submitted file may be identical. The metadata file provides sampleassociated information with the following formatting requirements, based on the Qiime mapping file. The following rules apply: 1. All entries are tab-delimited. 2. All entries in every column are defined (no empty fields). 3. The header line begins with the following fields: "#SampleID<tab>BarcodeSequence<tab> LinkerPrimerSequence". 4. The header line must end with the field "Description". 5. The BarcodeSequence and LinkerPrimerSequences fields have valid IUPAC DNA characters. 6. There are no duplicate header fields. 7. No header fields or corresponding entries contain invalid characters (only alphanumeric and underscore characters allowed). 8. There are no duplicates when the primer and barcodes are appended.

A.2. Metadata file requirements for runs on multiple sequences
Multiple fasta files can be provided so that each file comprises sequences from different samples. In this case, the metadata file must meet the following requirements: Pairwise comparisons: To utilize the Metastats statistical methodology, which detects differential abundances of taxa between two sample groups, the associated header field must end with "_p", (e.g. "Treatment_p", or "ph_p"). If a header with the "_p" ending exists, pairwise Metastats calculations will be carried out between all groups specified in the corresponding column (provided that a group contains at least three samples).

B. Sequence Processing and analysis with Mothur
The Mothur component of CloVR-16S follows in large parts the pyrosequencing 16S rRNA sequence analysis example on the Mothur wiki page (http://www.mothur.org/wiki/Costello_stool_analysis). Sequence pools are pre-processed (trimmed, sorted, filtered), aligned against a reference (the curated Silva 16S and 18S rRNA alignment [6]), further processed to remove redundancy and to filter the alignment, used to generate a distance matrix, clustered and assigned to OTUs. Based on the sample OTU classification, the within-sample community composition is analyzed using common richness and diversity estimators as well as collectors and rarefaction curves (!-diversity).

B.1. Sequence pre-processing
To check each read from the sequence pool for quality and to sort sequences based on the sample-specific barcodes, the "trim.seqs" program is used with the following parameters: • "minlength=100" (minimum sequence length) • "maxhomop=8" (maximum homopolymer length) • "maxambig=0" (maximum number of ambiguous base calls) • "flip=F" (do not use the reverse complement of the sequences -reverse complements are considered in the alignment step).
All length parameters refer to base pairs (bp). This step generates trimmed .fasta and .groups files, which are used in the downstream analysis.

B.2 Alignment
To speed up the downstream analysis and to facilitate the analysis of large data sets, identical sequences, which can constitute a significant fraction of the sequence data are removed, using the "unique.seqs" command. The non-redundant sequence dataset is then aligned against the Greengenes reference core imputed alignment, which is available from the Greengenes website (http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files), using the "align.seqs" command with the default parameters and "flip=t" (if the alignment of a sequence read falls below the default threshold [0.50], the reverse complement is tried). In order to keep only those sequence reads that produce alignments of a minimum length of 50 bp, the "screen.seqs" command is run with the "minlength=50" option. With the "filter.seqs" command in combination with the "vertical=T" option, any column, which only contains gaps, is removed from the alignment. Since the trimming of the alignment has created new duplicate sequences, identical sequences are removed again using the "unique.seqs" command.

B.3. Threshold for the maximum number of processed sequences
Since the following steps of the Mothur pipeline can be computationally very demanding, the threshold for the number of unique sequence reads from the alignment of the previous step is set to 50,000. If the number of sequences exceeds this threshold, the dataset is not further processed with Mothur but instead analyzed through the remaining CloVR-16S components.

B.4. Clustering and OTU assignment
In Mothur, sequence reads are assigned to OTUs based on uncorrected pairwise distances between all aligned sequences. With the "dist.seqs" command a column-! ! %! formatted distance matrix is generated, using the "cutoff=0.10" option, which limits the distance matrix to keep only sequence reads with a distance smaller than 0.10 (at least 90% similar). Using the default "furthest neighbor" option, the "cluster" command assigns sequence reads to OTUs based on the distance matrix generated in the previous step.

B.5. Alpha diversity analysis
To perform the alpha diversity analysis, the OTU clustering results are read into Mothur with the "read.otu" command and the "label=unique-0.03-0.05-0.10" option to output all OTU levels using 97%, 95% and 90% similarity thresholds, respectively. The command "collect.single" generates collectors curves that describe how comprehensively a microbial community has been assessed in the sample. This is done by calculating how community richness and diversity change as more individuals from the community are sampled. "collect.single" is performed with the "freq=5" option, which sets the frequency with which the richness and diversity are calculated to every 5 sequences. To generate intra-sample rarefaction curves, applying a re-sampling without replacement approach, the "rarefaction.single" command is used with the same "freq=5" option. Rarefaction curves provide a way of comparing the richness observed in different samples. The "summary.single" command produces a summary file of various richness and diversity estimators for each sample.

C. RDP classification of all sequence reads
The output of the pre-processing step (B.1. Mothur:trim.seqs), i.e. all sorted, trimmed and filtered sequence reads, are classified using the RDP classifier tool [2], as described on the Ribosomal Database Project website (http://rdp.cme.msu.edu/classifier/classifier.jsp). As output, a results file is created, which contains the taxonomic classification of each sequence read from all samples, including a confidence score (up to 1.0) assigned by the RDP classifier. In addition, tab-delimited table files (.tsv) are generated, which show the composition of each sample at different taxonomic levels, including phylum, class, order, family and genus. Assignments with confidence values below 0.8 (80%) are assigned as "unknown" for the generation of the .tsv files.

D. Sequence processing and analysis with Qiime
The Qiime component of CloVR-16S follows the Overview Tutorial on the Qiime website (http://qiime.sourceforge.net/tutorials/tutorial.html). It uses the same unprocessed sequence pools as the Mothur component as input and takes advantage of three Qiime workflow scripts to combine related steps of the analysis pipeline: "pick_otus_through_otu_table.py" is used for sequence clustering, alignment, classification and phylogenetic tree prediction; and "beta_diversity_through_3d_plots.py" is used to calculate "-diversity estimators and to generate 3D Principal Coordinate Analysis (PCoA) plots for the graphical representation of differences in microbial community compositions between samples (beta diversity).

D.1. Sequence pre-processing
To assign multiplex reads from sequence pools to specific samples using sequence barcodes, as well as to remove low-quality reads and to filter reads by length, the "split_libraries.py" script is used. This step removes all reads from the analysis that do not have the user-specified barcode sequence. The following options are used: "--minseq-length 100" (sets the minimum sequence length to 100 bp), "--barcode-type variable_length" (disables barcode corrections), and "--max-homopolymer 8" (sets the maximum homopolymer length to 8 bp).

D.2. Sequence clustering, alignment, classification and phylogenetic tree prediction
The "pick_otus_through_otu_table.py" workflow script calls the following Python scripts: 1) "pick_otus.py" is used to cluster reads from all samples into OTUs based on nucleotide sequence identity. The clustering program for this step is "Uclust" [9] and the nucleotide sequence identity threshold for all reads within an OTU is 97%. 2) "assign_taxonomy.py" uses the RDP classifier [2] with a confidence threshold of 0.8 to assign each OTU-representing read to a known taxon based on the pre-built database from the RDP classifier program. A .txt file is created by this script, which shows the most specific classification of each read above the confidence threshold, i.e. the resolution of the classification varies between reads, showing taxonomic lineages of different lengths. 3) "make_otu_table.py" generates an OTU table from the classification results, together with the information about the number of reads that each OTU represents, which specifies the OTU counts that each sample contains for each taxonomic assignment. 4) "align_seqs.py" uses the PyNAST tool [10] to align OTUrepresenting reads against the Greengenes reference alignment [7]. 5) "filter_alignment.py" uses the Greengenes Lane mask [8] to defines those positions from the alignment that will be ignored when building the phylogenetic tree. 6) "make_phylogeny.py" uses the "FastTree" program [11] to generate a phylogenetic tree in the Newick format.

D.3. Beta diversity sample analysis
The "beta_diversity_through_3d_plots.py" workflow script calls the following Python scripts: 1) "beta_diversity.py" takes the OTU table and phylogenetic tree as input to calculate beta diversity estimators, including phylogenetic distance as measured through weighted and unweighted UniFrac analysis [4], and to generate a distance matrix. 2) "principal_coordinates.py" maps the multidimensional variation between samples from the distance matrix on three principal coordinates. 3) "make_prefs_file.py" sets the parameters for the PCoA display based on the user-provided metadata information. 4) "make_3d_plots.py" generates 3D PCoA plots in the .html and .kin format, which can be opened with a web browser or the free KiNG Display Software, which is available from http://kinemage.biochem.duke.edu/software/king.php.

E. Additional beta diversity analysis using Metastats and the R statistical package
The output from the taxonomic classification of each sequence read from all samples by the RDP classification step (see section C) and of the RDP-based classification of the OTU-representing sequence reads from all samples by "Qiime:assign.taxonomy" is further analyzed and graphically represented using the "Metastats" program [5] and customized scripts in the R programming language.

E.1. Detection of differentially abundant features
The "Metastats" program uses count data from the taxonomic assignment of sequences with the RDP classifier to compare two groups containing at least three samples each in order to detect features with differential abundance in the two groups [5].

A. Requirements for Pipeline Input
To run the full CloVR-Metagenomics analysis track, two different inputs have to be provided by the user: a set of fasta-formatted sequence files and a tab-delimited metadata file in the .txt format. The metadata file provides sample-associated information with the following formatting requirements: Pairwise comparisons: To utilize the Metastats statistical methodology for the detection of taxonomic and functional assignments with differential abundance, the associated header field must end with "_p", (e.g. "Treatment_p", or "ph_p"). Otherwise Metastats will skip pairwise analysis.

B. Sequence clustering and artificial replicate removal with UCLUST
To reduce redundant database searches downstream, the UCLUST component of CloVR-Metagenomics first clusters all DNA sequences using a stringent 99% identity threshold. Similar to the procedure in [10], any non-representative sequence in a cluster that shares a prefix of length 8 with the representative (and whose length is within 10 bp of the representative's length) is determined to be an artificial 454 pyrosequencing replicate [11] and is removed from further analysis. Taxonomic and functional annotations made to representative members are later propagated to all non-replicate sequences.

C. Taxonomic assignment of DNA sequences
All representative DNA sequences from clusters are searched against the RefSeq database of finished prokaryotic genomes (by default) using BLASTN with the following options: "-e 1.0e-5" (e-value threshold), "-b 1" (number of alignments to show) and "-m 8" (tabular output). Each sequence is assigned to the taxonomy of the best-BLAST-hit.

D. Functional assignment of DNA sequences
All representative DNA sequences from UCLUST (section B) are searched against the COG database of orthologous gene groups (by default) using BLASTX with the same options as in section C ("-e 1.0e-5 -b 1-m 8"). Alternatively, the user may opt to employ ! ! !!! the KEGG genes, eggNOG or NCBI NR databases for functional annotation. Each sequence is assigned to the function of the best BLAST hit of the respective database.

E.1. Detection of differentially abundant features
The program Metastats uses count data from annotated sequences to compare two populations in order to detect differentially abundant features [3]. BLASTN results are processed to detect different taxonomic groups at multiple levels (phylum, class, order, family, genus), while BLASTX results are parsed for differentially abundant functional groups. Metastats produces a tab-delimited

E.3. Pie chart visualization
Custom R scripts are used to form pie charts displaying proportions of sequences assigned to specific functional and taxonomic levels for up to 12 samples. Outputs are in .pdf format. For more than 12 samples this function is not performed, as the visual comparison for the user would be cumbersome.

E.4. Stacked histogram visualization
Custom R scripts are used to form stacked histograms displaying proportions of sequences assigned to specific functional and taxonomic levels for up to 50 samples and 25 features. Graphical outputs are in .pdf format. For more than 50 samples or 25 features this function is not performed, as the visual comparison for the user would be difficult.

Software
Step

A. Assembly
During the assembly process the fragmentary whole-genome shotgun (WGS) sequence data typically produced in microbial single-genome projects are used to reconstruct long contiguous sequences or contigs. Scaffolds are composed of multiple contigs that are linked by paired-end reads. In the assembly output each scaffold will be represented by a single sequence (FASTA) in which stretches of "N"s indicate gaps and estimated gap lengths spanned by paired-end reads.

Raw sequence data file conversion
The Celera Assembler [2] uses .frg files as input, which are generated from 454 or Sanger raw sequences or from combinations of both data types with the sffToCA tool. Raw sequences from the 454 pyrosequencing platform are accepted either as single read or mated paired-end reads in the .sff format. Raw sequences from the Sanger platform are accepted as pairs of .fasta and .qual files with the same name prefix. As a default, sffToCA is run with the following parameters, which were optimized for data generated with the 454 GS FLX XLR Titanium platform: "-clear 454" sets the clear range for each sequence read "as is", using the clear range determined by the 454 sequencing machine; "-trim chop" erases sequences outside of the 454 clear ranges. If paired-end data are being used as input, a 454 linker has to be specified as either "-linker flx" or "linker titanium", depending on the 454 sequencing platform generation that is being used. In addition, an insert size range has to be selected, e.g. "8000 1000" where the first number specifies the average insert length (8 kbp) and the second number the standard deviation (1 kbp). More than one .sff file can be used as input for sffToCA, which produces a single .frg file that serves as input for the assembly step.

A.1.1.2. Assembly
For the assembly process, Celera Assembler is run with the "runCA" executive script, using default parameters. Alternatively, a "spec file" can be provided by the user to select alternative parameters for the assembly process. Spec file examples can be downloaded from the Celera Assembler documentation page (http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA_ Examples). The assembly step generates a number of different output files. First, assembly statistics are collected in a .txt summary report file. Second, .fsa (FASTA) files are generated as the direct assembly output, which serve as the input for the Gene Finding step. Third, a .asm file is generated, which can be used in combination with the .frg file from the sffToCA output to generate the .bnk file for visualization of the assembly results with the Hawkeye assembly viewer [13].

A.1.1.3. Celera assembly visualization
Using information from the raw sequence data (.frg) and the sequence assembly (.asm), a .bnk file is generated with the "toAmos" and "bank-transact" tools (http://sourceforge.net/apps/mediawiki/amos/index.php?title=Hawkeye). The .bnk file serves as the direct input for the Hawkeye assembly viewer [13], which itself is not part of the CloVR-Microbe pipeline and which is not installed on the CloVR virtual machine image.

A.1.2. Assembly with the Velvet Assembler
In the case of assembling Illumina-based WGS sequence data, the Velvet assembly program [3] can accept as input any combination of short or long read data (with or ! ! !'!

B.2.3 Glimmer3 gene prediction B.2.3.1. First iteration
The first iteration of gene finding is run with the ICM model from B.2.2 using the "glimmer3" program executed with the following parameters: "-o50 -g110 -t30 -z11 -l -X". "-o50" sets the maximum overlap between CDS to 50 nucleotides; "-g110" sets the minimum CDS length to 110 nucleotides; "-t30" sets the threshold score for CDS to 30; "-z11" determines the NCBI translation table code "11" to specify stop codons; "-l" sets the input sequence as linear, and "-X" allows CDS to extend off the end of the input FASTA sequence.

B.2.3.2 Generation of Position Weight Matrix
A Position Weight Matrix (PWM) is generated for regions upstream of start-sites using the output from B.2.3.1. First, the script "upstream-coords.awk" is run with parameters '25 0' to extract sequence regions upstream of CDS predicted in B.2.3.1. Next, the "ELPH" program is run with parameter 'LEN=6' to create a PWM from the region upstream of the CDS start sites. ELPH is a general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences (http://www.cbcb.umd.edu/software/ELPH/). The distribution of start codons is also generated from the output of the first iteration.

B.2.3.3 Second iteration
The second iteration of glimmer3 gene finding is run with the ICM from B.2.3.1 and the PVM from B.2.3.2 with parameters "-o50 -g110 -t30 -z11 -l -X -P <number-list>". <number-list> consists of three comma-delimited values that specify the probabilities of different start codons as determined from the output of B.2.3.2. The output includes a set of putative CDS described in a .txt summary report and coordinate files of all predicted genes in the BSML format.
C. Annotation

C.1. Translation of preliminary CDS into peptides
All predicted protein-coding genes from all scaffolds and/or contigs are translated into peptide sequences using a wrapper to the "transeq" program from the EMBOSS package [7] with the parameters "-table 11" to specify the bacterial translation table. The "-trim 1" option is also used to eliminate trailing "X" or "*" characters from the translation. A peptide FASTA file (.fsa) is provided as the output.

C.2. CDS homology searches (round 1)
Two types of homology searches are performed to generate the evidence, which is used to assign a functional annotation to each CDS: the translated CDS are compared against the UniRef100 non-redundant protein database from UniProt (http://www.uniprot.org/) using BLASTX, and against the two protein family databases Pfam and TIGRFAM [10], using HMMER [9].

C.2.1. Protein comparison and pairwise alignment
For this step, the Blast-Extend-Repraze (BER) tool (http://sourceforge.net/projects/ber/) employs a two-step process that starts with a BLASTX search followed by a modified Smith-Waterman alignment. The BER output is used to detect possible frameshifts and in-frame stop codons within the predicted CDS. ! ! !(!

C.2.1.2. Modified Smith-Waterman nucleotide sequence alignment
In order to identify frameshifts, in-frame stop codons and erroneous start codons the BER tool creates nucleotide sequence alignments of the CDS and the nucleotide sequence corresponding to the protein matches identified in C.2.1.1. For these sequence alignments the CDS are extended by 300 nucleotides upstream and downstream of the start and stop codon. Therefore, if there is a sequencing error or a natural mutation that has split one gene into two, the BER tool creates an alignment across those two fragments. BER is executed with the following parameters "-e 1e-5" (maximum e-value), -E 1e-5 (maximum p-value), "-n 150" (maximum number of hits, similar to the BLAST -v option)), "-N 0" (maximum number of hits per region).

C.3. CDS overlap analysis
The results from the tRNA, rRNA and CDS gene calls together with the evidence generated through the HMM homology searches are used to remove overlapping genes either between CDS or between CDS and tRNA or rRNA genes. Only overlaps of at least 60 nucleotides are considered for resolution. When a CDS that has no homology evidence (HMM matches passing cutoff or BER alignments) overlaps with another CDS that does contain evidence, the CDS without evidence is removed. If both (or neither) of the CDS' show homology evidence, they are left in place. When a CDS overlaps with an RNA prediction, the RNA prediction is given a higher priority and the CDS is removed.

C.4. CDS homology searches (round 2)
After the automatic curation of start sites, the newly changed gene models are retranslated. These new polypeptides are then run through another set of BLASTX, HMM and BER searches to update similarity evidence for functional annotation. In addition, each polypeptide is run against the NCBI COG database using BLASTP with parameters "-e 1e-5 -F 'F' -b 500 -v 500 -M BLOSUM62".

C.5. Functional annotation
An in-house script is used to assign functional names, gene symbols, EC numbers and GO terms to each CDS based on a ranked hierarchy of evidence sources generated through the homology searches. The script considers sequence homology to TIGRFAM and PFAM HMMs and matches to the UniRef100 protein database. BER evidence is used to identify searches against UniRef100 with at least 35% sequence identity over 80% gene length. A series of naming rules are applied based on the type of HMM match. Equivalog HMM hits are most preferred and names are used without modification. For other HMM types protein names are appended with 'family protein' or 'domain protein' based on the isology type of the HMM. If the protein matches a ! ! !)! hypothetical equivalog HMM the name will be 'conserved hypothetical protein'. The decision process that is used to determine functional annotations is identical to the one from the IGS Annotation Engine, which is described in detail in a publication elsewhere [1].

C.5. Creation of different output files
To allow users of the pipeline to edit, visualize, and publish the annotated sequences and to submit them to the public sequence databases, the pipeline output is stored in a number of different file formats. For each annotated sequence assembly or contig .gbk and .gff files are generated, which can be opened and edited with the Artemis sequence annotation tool [14]. A .txt feature file can be loaded into the Sequin tool and used for GenBank sequence submission (http://www.ncbi.nlm.nih.gov/Sequin/index.html).