AGeS: A Software System for Microbial Genome Sequence Annotation

Background: The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers who need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance.
Methodology: The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions.


Introduction
The cataloguing and analysis of microbial genomes sequenced using next-generation technologies opens new avenues for screening unknown microbes and analyzing their genetic diversity. For such applications, the analysis of sequenced genomes needs to be rapid, high-throughput, fully automated, integrated, and readily accessible to the intended users. To address this need, we have developed the Annotation of microbial Genome Sequences (AGeS) software system, which incorporates publicly available and in-house-developed bioinformatics programs and databases, many of which are parallelized for high-throughput performance.
AGeS performs gene and protein annotation for bacterial genomes using an integrated software pipeline. The input to AGeS is a multi-FASTA file containing contigs generated by high-throughput sequencing. AGeS analyzes these contigs and locates genomic regions that code for proteins, RNA, and other genomic elements by using a set of tools, such as Glimmer [1], RNAmmer [2], and TRF [3], through the Do-It-Yourself Annotation (DIYA) framework [4]. The identified protein-coding regions are then annotated using high-throughput protein function annotation methods implemented in the in-house-developed PIPA pipeline [5]. The output of an AGeS run consists of annotated sequences. These annotated sequences are visualized using GBrowse [6], which is fully integrated into the AGeS pipeline. The results can also be downloaded as a GenBank format file for further analysis.
Several features make AGeS a useful tool for scientists who need high-throughput annotation of their genomic sequences: fully automated annotation of completed and draft bacterial genomes performed by combining the DIYA framework with the PIPA protein function annotation pipeline; annotations compliant with the de facto standards, i.e., Minimum Information About a Genomic Sequence (MIGS) [7] for genomic sequences, and Gene Ontology [8] for protein function annotations; user-friendly visualization based on the familiar open-source genome browser GBrowse; and high-throughput annotation accomplished through the efficient use of high-performance computing.
Figure 1-1 shows the system architecture of AGeS. It comprises a Web application server (AGeS server) that provides an easy-to-use GUI accessible via a Web browser, an embedded relational database management system for storing sequences and other job-related data, and a high-throughput software pipeline for the annotation of input genomes. The AGeS server and annotation pipeline can be installed on either a standalone Linux computer or a Linux cluster using the step-by-step instructions provided in the next chapter "Installing AGeS." Once the software is installed, multiple users can access the AGeS GUI via the AGeS server using standard Web browsers. The AGeS GUI provides three main functions to the users: (i) sequence management for uploading and manipulating genomic sequences; (ii) job submission for running the annotation pipeline; and (iii) graphical visualization of the annotated sequences with GBrowse. The AGeS server uses a workflow manager module to guide the entire lifecycle of the user's job from the input sequence upload to the visualization of the annotated sequence.

AGeS Server
AGeS can be installed on a standalone Linux computer (with a single core or multiple cores) or on a Linux cluster. When run on a multi-core Linux computer or a Linux cluster, AGeS supports OpenMPI for parallel execution and PBS for batch submission. All software tools used by AGeS are installed during the AGeS installation process described in the next chapter "Installing AGeS."

Installing AGeS
The AGeS source code is available at http://www.bioanalysis.org/downloads/ages.tar.gz and is freely available under a BSD license. It can be installed using the step-by-step instructions detailed below. These instructions work unmodified on the Red Hat Enterprise Linux 5.5 and CentOS 5.5 distributions. If you are using another Linux distribution, use these instructions as a guide and adapt them to your distribution as necessary.
This installation guide assumes that both the GUI and the pipeline will be installed on the same host. If the GUI and the pipeline are installed on different hosts, Steps 1 and 2 are required on the host that is running the GUI, and Steps 1 and 3 are required for the host that is running the pipeline. If the pipeline is running on a cluster, follow Step 4 for configuration of the cluster.
Steps for installing AGeS:
1. Common requirements for AGeS GUI and pipeline
1.1. Download the AGeS tarball from http://www.bioanalysis.org/downloads/ages.tar.gz. Extracting the tarball creates the ages directory. Set the environment variable $AGES_HOME to the ages directory.
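As a minimal sketch of step 1.1, the shell lines below set $AGES_HOME once the tarball has been extracted. The extraction location ($HOME) and the commented-out wget and tar commands are assumptions for illustration; only the tarball URL and the ages directory name come from the instructions above.

```shell
# Assumption: the tarball is downloaded and extracted under $HOME, e.g.
#   wget http://www.bioanalysis.org/downloads/ages.tar.gz
#   tar -xzf ages.tar.gz -C "$HOME"
# Keep an existing AGES_HOME if one is already set; otherwise use $HOME/ages.
AGES_HOME="${AGES_HOME:-$HOME/ages}"
export AGES_HOME

# Warn (without aborting) if the directory is not there yet.
if [ ! -d "$AGES_HOME" ]; then
    echo "warning: $AGES_HOME does not exist; extract ages.tar.gz first" >&2
fi
```

Adding the two assignment lines to your shell startup file (for example ~/.bashrc) would make the setting persist across sessions.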
1.3. Install Java: JDK 1.6(+) is required to run the AGeS GUI. Download and install Java (JDK 1.6+) from the Oracle Web page (http://www.oracle.com/technetwork/indexes/downloads/index.html). Set up the $JAVA_HOME environment variable and point it to the JDK installation location.
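One way to confirm the Java requirement is a small version check such as the sketch below. The helper name java_version_ok is hypothetical and not part of AGeS; it only compares a version string against the 1.6 minimum stated above, and the guarded block at the end assumes $JAVA_HOME points at the JDK installation directory.

```shell
# Succeeds (exit status 0) when a Java version string is 1.6 or newer.
# Handles the legacy "1.x" scheme (1.6.0_45) and the newer one (11.0.2).
java_version_ok() {
    jv_major="${1%%.*}"
    if [ "$jv_major" = "1" ]; then
        jv_minor="${1#1.}"
        jv_minor="${jv_minor%%.*}"
        [ "$jv_minor" -ge 6 ] 2>/dev/null
    else
        [ "$jv_major" -ge 6 ] 2>/dev/null
    fi
}

# Only runs when JAVA_HOME is set and contains an executable java binary.
if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    jv_found=$("$JAVA_HOME/bin/java" -version 2>&1 \
        | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    if java_version_ok "$jv_found"; then
        echo "Java $jv_found is recent enough for the AGeS GUI"
    else
        echo "Java '$jv_found' does not satisfy the JDK 1.6+ requirement" >&2
    fi
fi
```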

Run the GUI:
Once all the required modules are installed, make sure that the required environment variables are set (PATH, JAVA_HOME, PERL_HOME, and PERL5LIB). Go to the ages root directory and run the script "rungui.sh." It will bring up a Jetty Web server that listens on port 9000. Open a Web browser and enter the URL "http://localhost:9000/ages" to access the Web GUI for AGeS. The Web server can also be accessed from another computer, provided that port 9000 is not blocked by the firewall. In this case, replace "localhost" in the URL with your hostname or IP address: http://<your_host_name or your host IP address>:9000/ages.
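The environment check described above can be scripted before launching the GUI. The sketch below is illustrative only: the function names check_ages_env and ages_gui_url are hypothetical helpers, while the variable list, port 9000, and the /ages path are taken from the text.

```shell
# Prints the name of each required variable that is unset or empty and
# returns non-zero if any of them is missing.
check_ages_env() {
    env_missing=0
    for env_name in PATH JAVA_HOME PERL_HOME PERL5LIB; do
        eval "env_value=\${$env_name:-}"
        if [ -z "$env_value" ]; then
            echo "missing: $env_name" >&2
            env_missing=1
        fi
    done
    return "$env_missing"
}

# Prints the GUI URL for a given host name or IP address (default: localhost).
ages_gui_url() {
    echo "http://${1:-localhost}:9000/ages"
}
```

A possible use is to run check_ages_env first and start rungui.sh from the ages root directory only if it succeeds, then open the address printed by ages_gui_url in a browser.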

2.4. You can run a demo annotation task through the GUI. The demo uses a pre-loaded GenBank file as the annotation output for demonstration purposes. To run a real annotation, you have to install and configure the annotation pipeline using Step 3.

Customize Database (Optional): The AGeS distribution package uses Apache Derby as its default database engine.
If you intend to deploy the application on a production server with a large user base, we recommend using a standalone database server, such as PostgreSQL. The AGeS package has been tested on PostgreSQL 8.3. Please follow the steps below to configure your system for PostgreSQL:
2.5.1. Create a database called "ages".

Steps for installing AGeS pipeline
The AGeS pipeline was tested using JDK 1.

fasta -p T -o T -t uniref50 -n uniref50
You can parallelize DIYA using mpiBLAST. Please refer to Step 4.1 for installing and setting up mpiBLAST for DIYA. If using mpiBLAST, make appropriate changes in the data formatting procedure.
Configuration of the DIYA components is controlled by the file $AGES_HOME/conf/diya_repeat.conf. To run the correct version of each component, update the paths in this file. If using mpiBLAST, make appropriate changes in the DIYA configuration file.
DIYA also uses the property file diya_default.properties located in the $AGES_HOME/lib directory to specify the location of temporary files. These properties are overridden by the properties in the annotation.properties file.
3.2. Install and configure PIPA: Download PIPA from http://www.bhsai.org/downloads/PIPA.tar.gz and install it in the $AGES_HOME/software/PIPA directory. Make sure you set the following environment variables:
export PIPA_HOME=$AGES_HOME/software/PIPA
export PIPA_WRAP=$AGES_HOME/software/PIPA/src/Profann/dsrc/pipa_submission_wrapper.pl
export MERGE_GBK_GFF=$AGES_HOME/software/PIPA/src/Profann/misc/mergePipaGff2Genbank.pl
export MPI_PIPA=$AGES_HOME/software/PIPA/src/Profann/App/MPIProfannBatch.pl
export IPRSCAN_HOME=$PIPA_HOME/software/iprscan
PIPA also uses the property file pipa_default.properties located in the $AGES_HOME/lib directory to specify the location of temporary files. These properties are overridden by the properties in the annotation.properties file. You can also override the tools to run within PIPA, the default parameters of these tools, and whether to run PIPA in MPI or SERIAL mode. To run PIPA in MPI mode, you will also have to make sure that the mpirun and jobmanager parameters are set in the $PIPA_HOME/config/pipa_user.config configuration file.
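After exporting the PIPA variables, it may help to confirm that each one points at something that exists on disk. The following is a minimal sketch; check_pipa_paths is a hypothetical helper, not a script shipped with AGeS or PIPA, and it checks only the five variables named in step 3.2.

```shell
# Reports every PIPA-related variable that is unset or whose path is absent.
# Returns 0 only when all five variables point at existing files/directories.
check_pipa_paths() {
    pipa_status=0
    for pipa_name in PIPA_HOME PIPA_WRAP MERGE_GBK_GFF MPI_PIPA IPRSCAN_HOME; do
        eval "pipa_value=\${$pipa_name:-}"
        if [ -z "$pipa_value" ]; then
            echo "unset: $pipa_name" >&2
            pipa_status=1
        elif [ ! -e "$pipa_value" ]; then
            echo "not found: $pipa_name=$pipa_value" >&2
            pipa_status=1
        fi
    done
    return "$pipa_status"
}
```

Running check_pipa_paths in the same shell session that exported the variables prints one line per problem and nothing when everything resolves.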
3.4. Running Annotation pipeline through GUI: The AGeS GUI interfaces with the pipeline through the *_env files. Make sure your default shell is bash. Make appropriate changes in the files anna_env, mpi_env, pipa_env, and diya_env.

Steps to Run AGeS on a Cluster (Optional)
The AGeS pipeline can be configured to run on clusters for high-throughput annotation. Parallelization is achieved through the Message Passing Interface (MPI): mpiBLAST for DIYA and MPI for PIPA. The pipeline can also be invoked through a PBS batch submission system. Therefore, there are three operation modes for AGeS: serial, MPI, and PBS.
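To illustrate how the three operation modes differ at the command line, the sketch below prints (but does not run) a launch command for each mode. It is an assumption-laden illustration: ages_launch_command and the script name run_annotation.sh are hypothetical, and the way AGeS itself builds these commands is not described here. Only the standard mpirun -np and qsub invocations are relied on.

```shell
# Prints the command line that would launch a pipeline script in a given mode.
# Usage: ages_launch_command MODE NPROCS SCRIPT
ages_launch_command() {
    case "$1" in
        serial) echo "$3" ;;                    # run the script directly
        mpi)    echo "mpirun -np $2 $3" ;;      # NPROCS parallel MPI processes
        pbs)    echo "qsub $3" ;;               # submit as a PBS batch job
        *)      echo "unknown mode: $1 (expected serial, mpi, or pbs)" >&2
                return 1 ;;
    esac
}

# Example: show what an 8-process MPI launch would look like.
ages_launch_command mpi 8 run_annotation.sh
```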

an AGeS Session
The screenshot in Figure 3-1 shows the entry page of the AGeS system. The user starts AGeS by clicking on the Enter button. Unique session IDs identify different users. If you have visited the site before, your session ID will be automatically appended to the URL and your previous data will be automatically loaded.

AGeS GUI Overview
Three main components of the AGeS system are accessed through the three tabs on the AGeS main page: Manage Sequences, Annotate, and View Annotations, as presented in Figure 4-1. The user can upload and store the sequences (contigs in multi-FASTA format) through the Manage Sequences tab. The next step is to annotate these contigs, i.e., to find genes, the locations of these genes in the contigs, the proteins encoded by the genes, and the functions of these proteins. These tasks are performed under the Annotate tab. Finally, the user can view the annotations using the View Annotations tab.
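Because the input must be in multi-FASTA format, a quick local check before uploading can catch malformed files. The helpers below are a hypothetical sketch, not part of AGeS: count_fasta_records counts header lines beginning with ">", and looks_like_fasta tests whether the first non-blank line is a header.

```shell
# Prints the number of records (header lines starting with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1" || true   # grep exits 1 on zero matches; still prints 0
}

# Succeeds when the first non-blank line of the file is a FASTA header.
looks_like_fasta() {
    fasta_first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
    case "$fasta_first" in
        ">"*) return 0 ;;
        *)    return 1 ;;
    esac
}
```

For a file of contigs, count_fasta_records should report the number of contigs you expect to see listed after upload.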

Sequence Management
There are two ways to load your sequence into AGeS: copy the contigs (in FASTA format) and paste them in the text box, or upload a FASTA file. The screenshot in Figure 4-2 shows how to copy/paste a sequence segment, and Figure 4-3 shows how to upload a sequence in FASTA format. The box on the right lists the sequences whose annotations are pending and/or those that have already been annotated; the box on the left lists the sequences yet to be annotated. The user should select one of these sequences and click the Annotate button.
For questions or problems related to AGeS, please contact: ages@bioanalysis.org.