dEMBF: A Comprehensive Database of Enzymes of Microalgal Biofuel Feedstock

Microalgae have attracted wide attention as one of the most versatile renewable feedstocks for biofuel production. To develop genetically engineered, high-lipid-yielding algal strains, a thorough understanding of the lipid biosynthetic pathway and its underpinning enzymes is essential. In this work, we systematically mined the genomes of fifteen diverse algal species belonging to Chlorophyta, Heterokontophyta, Rhodophyta, and Haptophyta to identify and annotate the putative enzymes of the lipid metabolic pathway. We have also developed a database, dEMBF (Database of Enzymes of Microalgal Biofuel Feedstock), which catalogues the complete list of identified enzymes along with their computed annotation details, including length, hydrophobicity, amino acid composition, subcellular location, gene ontology, KEGG pathway, orthologous group, Pfam domain, intron-exon organization, transmembrane topology, and secondary/tertiary structural data. Furthermore, to facilitate functional and evolutionary study of these enzymes, a collection of built-in applications for BLAST search, motif identification, and sequence and phylogenetic analysis has been integrated into the database. dEMBF is the first database that brings together all enzymes responsible for lipid synthesis from available algal genomes and provides an integrative platform for enzyme inquiry and analysis. This database will be useful for algal biofuel research. It can be accessed at http://bbprof.immt.res.in/embf.


Introduction
With the irreversible depletion of petroleum resources, renewable biofuels are a sustainable alternative for meeting global energy needs. Microalgae, as a rich source of lipids, especially triacylglycerols (TAGs), have emerged as a potential biofuel feedstock owing to several distinct advantages over starch-based or lignocellulosic plant species, such as higher photosynthetic efficiency and a higher biomass production rate. Moreover, microalgae can be grown on non-arable land using wastewater, so they do not compete with agricultural resources, and they mitigate CO2 emissions efficiently [1,2]. However, to make biofuel production from microalgae cost-competitive, the oil content of oleaginous algae needs to be significantly improved through genetic engineering [3,4]. It has been proposed that lipid biosynthesis can be increased by overexpressing the rate-limiting enzymes of the fatty acid biosynthesis pathway. Among these, acetyl-CoA carboxylase (ACCase), which catalyzes the first committed step of fatty acid synthesis, namely the conversion of acetyl-CoA to malonyl-CoA, plays a pivotal role [5,6]. In addition, overexpression of the acyltransferases that catalyze the main regulatory steps of TAG biosynthesis, widely known as the Kennedy pathway, has been identified as a potential approach to boost oil accumulation. For instance, overexpression of a type 2 diacylglycerol acyltransferase (DGAT) in the diatom Phaeodactylum tricornutum resulted in a 35% increase in TAG content [7]. In another study, co-overexpression of multiple Kennedy pathway genes, including glycerol-3-phosphate acyltransferase (GPAT), lysophosphatidyl acyltransferase (LPAT), phosphatidic acid phosphatase (PAP), diacylglycerol acyltransferase, glycerol-3-phosphate dehydrogenase (GPDH) and phospholipid:diacylglycerol acyltransferase (PDAT), in Chlorella minutissima resulted in a two-fold increase in TAG content [8].
Introduction of the diacylglycerol acyltransferase 2 gene from Brassica napus into Chlamydomonas reinhardtii has also resulted in enhanced lipid production [9]. Together, these studies indicate that understanding the regulation of microalgal lipid metabolism is essential for developing engineered microalgae with enhanced lipid production capabilities [10]. While algal sequence data from genome assembly projects are rapidly increasing, the annotations generated for the predicted sequences are usually limited and include only user-defined function predictions, with no detailed pathway, structure or genome-context information [11]. This limits our understanding of the overall lipid biosynthetic pathway in microalgae [12]. In contrast, the genes and enzymes involved in the plant lipid biosynthetic pathway have been characterized extensively [13], and a number of biomass-related enzyme databases are available to promote the development of transgenic biofuel crops [14][15][16][17]. Considering the importance of microalgal biofuel, the paucity of information on algal lipid biosynthesis and the lack of dedicated databases on the enzymes underpinning the process, the present study was performed to identify the enzymes responsible for lipid accumulation in fifteen sequenced microalgal species, using available homologous sequences from the model plant species Arabidopsis thaliana. A total of 289 such enzymes were identified. Functional annotation of the putative enzymes has also been improved by employing several bioinformatic tools to study metabolic pathways, ontology, subcellular location, secondary and tertiary structure, biophysical properties, cellular processes and protein family information. Furthermore, the resulting data are made publicly accessible through an open-access web-based database, dEMBF (database of Enzymes of Microalgal Biofuel Feedstock, http://bbprof.immt.res.in/embf).
dEMBF is the first integrative platform that provides a complete list of enzymes putatively involved in lipid biosynthesis in microalgae. This database provides a roadmap for experimental as well as computational studies aimed at identifying orthologous lipid synthesis enzymes in newly sequenced algal species, and it should facilitate further research and development towards sustainable and cost-effective biofuel production from microalgae.

Enzyme prediction
The genome databases were queried both by keyword and by BLASTp [18] sequence similarity search (E-value < 1e-5), using sequences of enzymes that are known to be involved in neutral lipid synthesis in Arabidopsis (Table 1). Subsequently, to remove false positives, the successful hits were mapped to UniProt IDs [19], Enzyme Commission (EC) numbers, Clusters of Orthologous Groups (KOG) using KOGnitor [20] and OrthoMCL [21], and Gene Ontology (GO) terms [22] using AmiGO [23]. In addition, Pfam [24] was employed to ensure that each candidate sequence shared the domain of the enzyme family to which it belongs. Finally, a complete set of 316 enzymes (289 from the studied algal species and 27 from Arabidopsis) was collected for further detailed analysis of functional annotations, as discussed below.
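The E-value filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the BLASTp hits were saved in the standard BLAST+ tabular layout (`-outfmt 6`), and the query and subject identifiers in the sample are made up.

```python
import csv
import io

# Column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text, max_evalue=1e-5):
    """Keep the best-scoring hit per (query, subject) pair with E-value < max_evalue."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] >= max_evalue:
            continue  # fails the E-value < 1e-5 cut-off
        key = (hit["qseqid"], hit["sseqid"])
        if key not in best or hit["bitscore"] > best[key]["bitscore"]:
            best[key] = hit
    return list(best.values())


# Hypothetical example rows; all identifiers and scores are illustrative only.
SAMPLE = (
    "AtDGAT1\talga_prot_001\t45.2\t480\t250\t6\t10\t489\t5\t470\t3e-98\t310\n"
    "AtDGAT1\talga_prot_002\t28.0\t90\t60\t3\t100\t189\t40\t125\t0.002\t38.5\n"
    "AtACC1\talga_prot_003\t52.1\t2100\t950\t20\t1\t2100\t15\t2090\t0.0\t2100\n"
)

hits = filter_hits(SAMPLE)
for h in hits:
    print(h["qseqid"], h["sseqid"], h["evalue"])
```

Hits that pass this filter would still need the UniProt, KOG, GO and Pfam checks described above before being accepted as candidate enzymes.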

Functional annotations
Physico-chemical properties. The total number of amino acids, molecular weight, isoelectric point (pI), percentage of acidic/basic amino acids, aliphatic index and GRAVY index were calculated using the ExPASy ProtParam server [25]. Hydropathy plots were generated using the BioEdit [26] software.
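Two of these properties have simple closed-form definitions, and the sketch below shows how they can be computed locally. It is an illustration rather than the ProtParam implementation: GRAVY is taken as the mean Kyte-Doolittle hydropathy value over all residues, and the aliphatic index as X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percent of each residue. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # A KeyError here signals a non-standard residue code.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Toy peptide used only to exercise the functions.
peptide = "AVIL"
print(round(gravy(peptide), 3), round(aliphatic_index(peptide), 1))
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value a hydrophilic one.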
Gene structure analysis. The exon-intron organization of the genes encoding the enzymes was determined with the GeneWise program [33] by comparing each predicted coding sequence with the corresponding genomic sequence.
Homology modeling of 3D structures. As no crystal structures of the predicted enzymes from microalgal species were found in the Protein Data Bank [34], we attempted to model their 3D structures using the MaxMod program [35]. Templates were selected from crystal structures having more than 30% sequence identity to the target. Ramachandran plots of the resulting models were generated using the Procheck [36] program.
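The 30% sequence identity cut-off for template selection can be illustrated with a short sketch. The source does not state how identity was computed, so this example assumes identity is measured over alignment columns where neither sequence has a gap. The function names, template labels and aligned sequences are all hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the denominator
        aligned += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


def select_templates(alignments, min_identity=30.0):
    """Return (template_id, identity) pairs above the cut-off, best first."""
    scored = [(tid, percent_identity(tgt, tpl))
              for tid, (tgt, tpl) in alignments.items()]
    passing = [(tid, ident) for tid, ident in scored if ident > min_identity]
    return sorted(passing, key=lambda item: item[1], reverse=True)


# Hypothetical pairwise alignments: template id -> (aligned target, aligned template).
candidates = {
    "template_A": ("MKTAYIAK", "MKTAYLAK"),
    "template_B": ("MKTAYIAK", "MKGGGGGG"),
}

selected = select_templates(candidates)
print(selected)
```

Other identity definitions (for example, dividing by the length of the shorter sequence) give different percentages, so the cut-off is only meaningful together with the definition used.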

Database architecture
The dEMBF database runs on an Apache server (v. 2.2.17). PHP v 5.3.5 was used for server-side scripting, while JavaScript, AJAX, XHTML and CSS were used for client-side scripting. Data are stored in a relational format using MySQL v 5.0.7 as the backend database, following basic normalization rules in order to reduce data redundancy and increase database efficiency.
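To illustrate what normalization means for data of this kind, the sketch below stores organisms and enzyme classes once each and lets every enzyme record refer to them by key. It is not the actual dEMBF schema: the table and column names are hypothetical, the rows are illustrative, and Python's built-in SQLite module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL backend.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical normalized layout: repeated organism and enzyme-class details
# live in their own tables instead of being copied into every enzyme row.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE enzyme_class (
    class_id  INTEGER PRIMARY KEY,
    symbol    TEXT NOT NULL UNIQUE,
    ec_number TEXT
);
CREATE TABLE enzyme (
    enzyme_id      INTEGER PRIMARY KEY,
    organism_id    INTEGER NOT NULL REFERENCES organism(organism_id),
    class_id       INTEGER NOT NULL REFERENCES enzyme_class(class_id),
    protein_length INTEGER
);
""")

conn.executemany(
    "INSERT INTO organism (organism_id, name) VALUES (?, ?)",
    [(1, "Phaeodactylum tricornutum"), (2, "Thalassiosira pseudonana")],
)
conn.executemany(
    "INSERT INTO enzyme_class (class_id, symbol, ec_number) VALUES (?, ?, ?)",
    [(1, "DGAT", "2.3.1.20"), (2, "ACCase", "6.4.1.2")],
)
# Illustrative enzyme rows; the lengths are made-up values.
conn.executemany(
    "INSERT INTO enzyme (organism_id, class_id, protein_length) VALUES (?, ?, ?)",
    [(1, 1, 400), (1, 2, 2000), (2, 1, 410)],
)

# Count enzymes per organism and class by joining the three tables.
rows = conn.execute("""
    SELECT o.name, c.symbol, COUNT(*)
    FROM enzyme AS e
    JOIN organism AS o ON o.organism_id = e.organism_id
    JOIN enzyme_class AS c ON c.class_id = e.class_id
    GROUP BY o.name, c.symbol
    ORDER BY o.name, c.symbol
""").fetchall()
for row in rows:
    print(row)
```

With this layout, correcting an organism name or an EC number requires changing a single row, which is the kind of redundancy reduction that normalization is meant to achieve.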

Annotation details of enzymes in dEMBF
The annotation detail page of dEMBF (Fig 1) displays multiple sequence and structural properties of the enzymes, which have either been extracted manually from public resources or computed using the bioinformatics tools described in the Methods section. Each sequence is annotated with its symbol, gene name, enzyme class, organism, taxonomic identifier (linked to the NCBI taxonomy browser) and organism lineage. The general information section furnishes information on chromosomal location, subcellular location, reaction, KEGG pathway, KEGG ortholog (KO) and KOG details. The page also contains other predicted protein features, such as gene ontology, physico-chemical properties, a schematic representation of conserved domains, secondary structure, transmembrane topology, modeled 3D structure, intron-exon organization, amino acid and nucleotide sequences in FASTA format, and cross-references to external protein and gene databases. In particular, the modeled 3D structure built by homology modeling, together with details of the template employed and the target-template alignment, allows users to study the structural conformation of the enzymes in detail. To facilitate dynamic visualization of the modeled 3D conformations, the JSmol applet (http://www.jmol.org) has been integrated. Furthermore, pre-generated Ramachandran plots for each modeled structure can be viewed using the "View Ramachandran plot" option.

Web-interface of dEMBF
The dEMBF database comprises six major web interfaces, namely "Home", "Browse", "Search", "Tools", "Organisms" and "Resources". A schematic overview of the dEMBF architecture is shown in Fig 2.
Home. The home page (Fig 3) contains a brief introduction to dEMBF and a site map outlining the database. Several utilities are also available on the home page to view and retrieve data from dEMBF. For instance, with the "Search database" option, users can search for an enzyme by name, symbol, UniProt ID, gene name, enzyme class, EC number, or organism name directly from the home page, using auto-complete text fields. Likewise, the "Metabolic Pathway Browser" lets users browse detailed information on any particular enzyme by clicking on the enzyme name, which has been manually mapped onto the lipid biosynthetic pathway. The "Database Summary" lists the total number of lipid biosynthetic enzymes in the various algal species currently present in dEMBF. Links to some of the important tools of dEMBF, such as "BLAST", "Compare" and "Phylogeny", are also provided.
Browse. A number of browsing options are provided in dEMBF to allow users to navigate by specific criteria, such as browsing by "All Entries" to retrieve all enzymes present in the database, or by "Enzyme Classification", "Organism", and "Enzyme Class" for specific enzymes of interest (Fig 4). On clicking the "Browse" option, the user is redirected to a page displaying all enzymes along with their respective accession ID, abbreviation, gene name, EC number, organism name and annotation details. The "Annotation details" option provides comprehensive sequence and structural properties of an enzyme (Fig 1), as discussed in the "Annotation details of enzymes in dEMBF" section of the results. In addition, the "Metabolic Pathway Browser" is a dynamic browsing interface in which each lipid biosynthesis enzyme is linked to its detailed information.
Fig 3. The "Metabolic Pathway Browser" is provided, where users can select any enzyme (yellow color) of the lipid biosynthetic pathway to retrieve its detailed information. The "Database Summary" lists the total number of lipid biosynthetic enzymes currently present in dEMBF. An easy-to-use search field with multiple search criteria, including enzyme name, symbol, UniProt ID, gene name, enzyme class, EC number, and organism name, is also provided. Links to some of the important tools of dEMBF, such as "Compare", "BLAST" and "Phylogeny", are available on the home page.
Search. The "Search" function permits users to perform a simple search and advanced search in the database (Fig 5). The "Simple Search" option provides search queries for the enzyme name, symbol, gene name, organism, enzyme class, EC number, KOG ID, Gene Ontology, Pfam ID, UniProt ID and gene ID. The "Advanced Search" allows users to combine multiple search criteria in order to locate specific enzymes of interest more precisely.
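Combining several criteria in an advanced search amounts to joining one condition per field with AND. The sketch below shows one way to do this safely with a parameterized query. It is an assumption-laden illustration, not the dEMBF code: dEMBF uses PHP and MySQL, whereas this example uses Python with an in-memory SQLite table, and the table name, column names and rows are hypothetical.

```python
import sqlite3

# Hypothetical whitelist of searchable columns, loosely based on the
# search criteria listed above.
SEARCHABLE = ("enzyme_name", "symbol", "organism", "ec_number",
              "pfam_id", "uniprot_id")


def build_query(criteria, table="enzyme"):
    """Build a parameterized SELECT that ANDs together all given criteria."""
    clauses = []
    params = []
    for field, value in criteria.items():
        if field not in SEARCHABLE:
            raise ValueError(f"unsupported search field: {field}")
        clauses.append(f"{field} = ?")  # value is bound, never interpolated
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    return f"SELECT * FROM {table} WHERE {where}", params


# Small illustrative table; the rows are examples, not dEMBF records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enzyme (symbol TEXT, organism TEXT, ec_number TEXT)")
conn.executemany(
    "INSERT INTO enzyme VALUES (?, ?, ?)",
    [("DGAT", "Phaeodactylum tricornutum", "2.3.1.20"),
     ("DGAT", "Thalassiosira pseudonana", "2.3.1.20"),
     ("ACCase", "Phaeodactylum tricornutum", "6.4.1.2")],
)

sql, params = build_query({"symbol": "DGAT",
                           "organism": "Phaeodactylum tricornutum"})
matches = conn.execute(sql, params).fetchall()
print(sql)
print(matches)
```

Restricting field names to a fixed whitelist and binding values as parameters keeps user input out of the SQL text itself.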
Tools. A number of web-based tools have been integrated into dEMBF to facilitate further analysis of the enzymes. A brief description of these tools is as follows:
BLAST. NCBI's standalone BLAST software was integrated as part of the dEMBF tools. Users can perform a BLAST search of a query sequence either against the entire dEMBF database or against each individual enzyme to identify homologous sequences (Fig 6a). A wide range of E-values is available to control search sensitivity. The BLAST results are displayed on the same page in a tabular format, sorted by percentage of identity, similarity, query coverage, bit score and E-value. This interface is particularly useful for annotating the function of an unknown sequence.
Compare. The "Compare" tool (Fig 6b) allows users to perform comparative analysis of enzymes within one or across multiple algal species. The user has to select at least two enzymes from the same or different organisms, along with the annotation features on which the comparison will be based. The results are displayed in a condensed tabular format.
Fig 6. Screenshots of dEMBF analysis tools displaying their query pages and resulting outputs. (A) "BLAST" tool, which allows users to perform similarity searches for protein or nucleotide sequences against NCBI, the dEMBF database or individual enzymes. A wide range of E-values is provided to control search sensitivity. BLAST results are sorted by percentage of identity, similarity, query coverage, bit score and E-value. (B) "Compare" tool to perform comparative analysis of enzymes within one or across multiple algal species. Users can select various clickable annotation features along with the enzyme names and corresponding organisms between which the comparison is to be carried out. (C) "Motif" tool to identify conserved motifs in query sequences using the integrated MEME program. (D) "MSA" tool to align two or more protein sequences with the MUSCLE program. (E) "Phylogeny" tool to construct a phylogenetic tree (Newick rooted tree or circular tree) using PhyML and jsPhyloSVG.
Motif. Users can predict conserved motifs using the MEME [37] program integrated in dEMBF (Fig 6c). On submission of protein sequences, the database redirects the query to MEME and, after the job is complete, the results are made available for download in various pre-defined formats.
MSA and Phylogeny. In addition to the above tools, both multiple sequence alignment (Fig 6d) and phylogenetic (Fig 6e) tools are provided in the "Tools" page of dEMBF. Alignment of two or more protein sequences is done with MUSCLE [38], and Newick trees are built using PhyML [39] and jsPhyloSVG [40], the latter being a Java-independent library for viewing phylogenetic tree files online.
Organisms. The "Organism" page (Fig 7) displays the list of sequenced genomes analyzed in dEMBF, which comprises fifteen microalgal species along with Arabidopsis as the reference.
Resources. Given below are brief descriptions of the various utilities available in the "Resources" page of dEMBF:
Data analysis: A statistical overview of the data present in dEMBF is provided (Fig 8a).
Publications: Recent research articles on the algal lipid biosynthesis pathway have been compiled along with hyperlinks to PubMed for user reference (Fig 8b).
Useful Links: External database links are provided to other bioinformatics resources such as algal genome project databases, Arabidopsis lipid gene database, metabolic pathway databases and protein databases (Fig 8c).
Downloads: All protein and nucleotide sequences present in dEMBF are available for download from the "Download" page of Resources (Fig 8d).

Discussion
After a thorough examination of the fifteen algal genomes, a total of 289 enzymes with putative roles in lipid synthesis were identified (S1 and S2 Tables). Sequence-structure information for these enzymes, together with the 27 well-characterized homologous enzymes from Arabidopsis used as the reference dataset in this study, is provided in the database. While previous studies have identified some key enzymes associated with the lipid metabolic pathway in a few algal species [41][42][43][44][45][46][47][48][49], the genomes of C. variabilis, M. pusilla, Micromonas sp., B. prasinos, T. pseudonana, P. tricornutum, E. siliculosus, A. anophagefferens and E. huxleyi have been mined for the first time in this study to collate the entire repertoire of enzymes responsible for lipid accumulation in microalgae. In addition to genome mining, we have assigned pathways, gene ontology terms and clusters of orthologous groups (S3 Table), subcellular location, secondary and tertiary structure, biophysical properties, cellular processes and protein family terms to each of the enzymes. Consequently, we have improved the existing functional annotation of all 289 enzymes, including 86 previously uncharacterized sequences for which a putative function in lipid biosynthesis has been determined (Fig 9). We observed that the analyzed algal genomes exhibit an overall comparable enzymatic makeup and that each encodes the major enzymes for lipid synthesis, similar to Arabidopsis (S2 Table and [42]). Furthermore, we found that the acyltransferases (60% of the total number of enzymes) are the most abundant enzyme class (Fig 10). The large number of enzymes belonging to this class is probably significant, considering that the three acyltransferases GPAT, LPAT and DGAT act sequentially to acylate the glycerol backbone and ultimately produce TAG. These enzymes play a vital role in determining the acyl composition of glycerolipids and the final TAG content [12].
In particular, a relatively large number of DGAT enzymes (80), followed by ACCase enzymes (39), was observed across all algal genomes. ACCase catalyzes the initial rate-limiting step of fatty acid biosynthesis by converting acetyl-CoA to malonyl-CoA, while DGAT drives the final step of TAG synthesis by acylating diacylglycerol to TAG [4,10]. The abundance of both enzymes reflects the high lipid accumulation capability of microalgae for biofuel production. To our knowledge, dEMBF is the first comprehensive database of enzymes responsible for lipid accumulation in fifteen diverse algal species whose genome sequences are available. This work could be useful for better understanding the fatty acid and TAG biosynthetic pathways in microalgae, and for facilitating the development of genetically engineered algal strains for sustainable and economically viable biofuel production.