Cyanobacterial KnowledgeBase (CKB), a Compendium of Cyanobacterial Genomes and Proteomes

Cyanobacterial KnowledgeBase (CKB) is a freely accessible database that contains genomic and proteomic information for 74 fully sequenced cyanobacterial genomes belonging to seven orders. The database also contains tools for sequence analysis. The species report and the gene report provide details about each species and each gene (including sequence features and gene ontology annotations), respectively. The database also includes cyanoBLAST, an advanced tool that facilitates comparative analysis among cyanobacterial genomes and the genomes of E. coli (a prokaryote) and Arabidopsis (a eukaryote). The database is developed and maintained by the Sub-Distributed Informatics Centre (sponsored by the Department of Biotechnology, Govt. of India) of the National Facility for Marine Cyanobacteria, a facility dedicated to marine cyanobacterial research. CKB is freely available at http://nfmc.res.in/ckb/index.html.


Introduction
Cyanobacteria comprise over 1,600 species with various morphologies and species-specific characteristics, such as cell movement, cell differentiation, and nitrogen fixation [1]. They are the only known oxygenic photosynthetic prokaryotes, inhabit a wide range of ecological habitats (e.g., extremely cold, extremely hot, marine, freshwater, and terrestrial), and form symbiotic associations with other living organisms. These primitive oxygenic Gram-negative bacteria are widely used as a model for studying the mechanism of carbon fixation, and they help evolutionary biologists understand the endosymbiotic theory, as they are considered the origin of chloroplasts. Since these ancient life forms play a major role in many biogeochemical cycles of the global ecological system, they serve as study material in diverse fields of life-science research [2].
Cyanobacteria are well known for forming toxic water blooms in freshwater, brackish, and coastal marine ecosystems, which are of major ecological and human health concern [3]. In recent times, however, these organisms have captured the attention of researchers worldwide because they are prolific producers of bioactive natural products as secondary metabolites, which are of great economic and medical value [4][5][6].
The National Facility for Marine Cyanobacteria (sponsored by the Department of Biotechnology, Govt. of India) is dedicated to cyanobacterial research, especially on marine cyanobacteria. One of the principal aims of the facility is to build a dedicated knowledge base for cyanobacteria. The increasing number of completely sequenced cyanobacterial genomes provides wide opportunities for understanding the metabolic organization of cyanobacterial species in diverse environments. Here we introduce the Cyanobacterial KnowledgeBase (CKB), a freely accessible, comprehensive database resource covering information pertaining to 74 completely sequenced cyanobacterial species. The database also includes an informative tool called cyanoBLAST, which supports comparative analysis between cyanobacterial genomes and the genomes of a prokaryote (E. coli) and a eukaryote (Arabidopsis).

Results and Discussion
Organisms
Seventy-four fully sequenced genomes from seven orders are currently included in the CKB database. These comprise 12 species of Chroococcales, 1 of Chroococcidiopsidales, 2 of Gloeobacteriales, 12 of Nostocales, 7 of Oscillatoriales, 2 of Pleurocapsales, and 38 of Synechococcales. The web user interface of CKB is shown in Fig 1, and the complete list of species in CKB is given in Table 1.
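As a quick consistency check, the per-order counts listed above can be tallied against the stated total of 74 genomes. The snippet below is only an illustrative sketch; the dictionary name is an assumption and is not part of CKB.

```python
# Species counts per order, copied from the text above.
# The variable name is illustrative and not part of CKB.
species_per_order = {
    "Chroococcales": 12,
    "Chroococcidiopsidales": 1,
    "Gloeobacteriales": 2,
    "Nostocales": 12,
    "Oscillatoriales": 7,
    "Pleurocapsales": 2,
    "Synechococcales": 38,
}

total = sum(species_per_order.values())
print(f"{len(species_per_order)} orders, {total} genomes")
```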

Tools
The database analysis portal provides access to the CKB BLAST tool, as well as tools for pattern and fuzzy searches, and restriction digestion.
The CKB BLAST tool can be used to compare nucleotide or protein sequences, to identify members of gene families, and to infer functional and evolutionary relationships between sequences.
Users are provided with several customized databases for similarity searches within the CKB BLAST analysis tool. These include a database of all cyanobacterial chromosomes and plasmids, and users can restrict their analysis to either chromosomes or plasmids. Furthermore, CKB provides databases that allow users to compare individual organisms, multiple organisms, and whole orders (Fig 2). As cyanobacteria are prokaryotic photosynthetic organisms, a model prokaryotic genome (E. coli) and a photosynthetic eukaryotic genome (Arabidopsis) are included to support comparative analysis.
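To illustrate how a search restricted to chromosomes or plasmids might be expressed with the standalone BLAST+ command line, the following minimal sketch builds the argument list for such a run. The database names (`ckb_all`, `ckb_chromosomes`, `ckb_plasmids`) and the helper function are hypothetical; the way CKB actually names its databases and invokes BLAST on the server is not described here.

```python
import shlex

# Hypothetical database names; CKB's real database files are not documented here.
DATABASES = {
    "all": "ckb_all",
    "chromosomes": "ckb_chromosomes",
    "plasmids": "ckb_plasmids",
}

PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_blast_command(program, query_fasta, scope="all", evalue=1e-5):
    """Return a BLAST+ argument list for a search restricted to one scope."""
    if program not in PROGRAMS:
        raise ValueError(f"unsupported program: {program}")
    if scope not in DATABASES:
        raise ValueError(f"unknown scope: {scope}")
    return [
        program,
        "-query", query_fasta,      # FASTA file holding the query sequence
        "-db", DATABASES[scope],    # pre-formatted BLAST database to search
        "-evalue", str(evalue),     # expectation value threshold
        "-outfmt", "6",             # tabular output
    ]


cmd = build_blast_command("blastp", "query.fasta", scope="plasmids")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` on a machine where BLAST+ is installed and the databases have been built with `makeblastdb`.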
In addition, pattern and fuzzy search tools are available to help in identifying the patterns present in different cyanobacterial genomes. Furthermore, the restriction digestion tool helps to identify restriction sites within the sequences.
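The idea behind pattern searching and restriction site identification can be sketched in a few lines. The example below is not the Sequence Manipulation Suite code used by CKB; it is a simplified illustration that translates IUPAC nucleotide codes into a regular expression and reports match positions on the forward strand only. The function names and the test sequence are assumptions made for this example.

```python
import re

# Recognition sequences (5'->3') of three widely used restriction enzymes.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pattern_to_regex(pattern):
    """Translate an IUPAC nucleotide pattern into a regex string."""
    return "".join(IUPAC[base] for base in pattern.upper())


def find_sites(sequence, pattern):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead keeps each match zero-width so overlapping sites are reported.
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]


seq = "ttgaattcaaggatccgaattc"  # made-up test sequence
for name, site in ENZYMES.items():
    print(name, find_sites(seq, site))
```

Because the three sites shown are palindromic, scanning one strand is sufficient for them; a non-palindromic pattern would also need to be searched on the reverse complement.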

Searching and browsing through the database
The Cyanobacterial KnowledgeBase consists of information related to 74 fully sequenced cyanobacterial species of seven orders, namely Chroococcales, Chroococcidiopsidales, Gloeobacteriales, Nostocales, Oscillatoriales, Pleurocapsales and Synechococcales. The browse option helps with orientation and navigation through the species under each order (Fig 3). The species report, reached from the "Browse by Order" option, provides brief information about the species, its taxonomy, morphological features, genome status, and genome details. The search tool can also be used to retrieve information related to specific genes, functions, or keywords. For example, a search for the keyword "Chaperone" returned 907 entries (Fig 4).

Proteome profiling
The complete gene set of each genome can be accessed under proteome profiling from the "Species reporter" tool. The table provides a complete gene list with PID, gene name (locus_tag), synonym, product name, strand, start and end positions, length, COG (Clusters of Orthologous Groups) ID, and GI (GenBank) accession number. Furthermore, the search tool within the table allows results to be searched and retrieved by a specific keyword.
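The structure of such a gene table, and a keyword filter over it, can be sketched as follows. The record type, its field names, and the two rows are placeholders invented for illustration; they are not taken from CKB, and the unit of the length column is not specified in the text.

```python
from dataclasses import dataclass, astuple


@dataclass
class GeneRow:
    """One row of a proteome table; field names mirror the columns listed above."""
    pid: str
    locus_tag: str
    synonym: str
    product: str
    strand: str
    start: int
    end: int
    length: int  # unit (nt or aa) is not stated in the text
    cog: str
    gi: str


def search(rows, keyword):
    """Return rows in which any column contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in astuple(row))
    ]


# Placeholder rows, invented for this example only.
rows = [
    GeneRow("P0001", "demo_0001", "-", "molecular chaperone (example)",
            "+", 100, 1899, 600, "COG0000", "0000001"),
    GeneRow("P0002", "demo_0002", "-", "hypothetical protein",
            "-", 2000, 2599, 200, "-", "0000002"),
]

for hit in search(rows, "Chaperone"):
    print(hit.locus_tag, hit.product)
```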

Gene report
Information related to each gene is displayed under five sections. The 'details section' provides brief information related to the gene and allows the user to navigate to the nearest genes on either side of the gene of interest (Fig 5). The 'sequence feature section' provides domain, repeat, motif, and binding site information in both graphical and tabular form (Fig 6). Protein and nucleotide sequences in FASTA format are provided in the bottom section, with direct links for BLAST analysis. The 'annotation section' displays the functions of the gene with gene ontology and UniProt keywords. The last two sections provide links to external databases and a list of homologous proteins, respectively.

Review of other related databases and web-resources
The rapid growth of genomic and proteomic data, driven by advances in high-throughput data generation, has created a need for enhanced data management to empower basic and applied research in cyanobacteria. Many web-based databases and community resources have been created specifically for cyanobacteria to facilitate systems biology analysis of these large datasets. Table 2 lists the databases with analytical tools summarized by Hernández-Prieto et al. [7], along with additional web resources and databases that are currently available.
The most comprehensive and widely used web-based database is CyanoBase [8], which contains sequenced and annotated genome sequences, along with gene annotations and information on mutations, for 39 species of cyanobacteria. It also includes tools such as BLAST for gene and genome similarity searches and KazusaMart, which can be used to convert identifiers from one format to another. CYORF is another community-annotated database; it provides the open reading frame (ORF) list for approximately 33 genomes along with data from KEGG and DBGET at GenomeNet, Pfam and Prosite motifs, predicted localization sites, protein 3D structures, and tools to search for similar sequences [9]. CyanoBIKE is an instance of BioBike that provides a web-based programmable knowledge base for genomic, metabolic and experimental data specifically for cyanobacteria. It offers a collection of datasets along with built-in analysis tools, the use of which requires some basic programming skills [10].
Apart from the above three general cyanobacterial databases, a few databases have been developed for a particular species or group of cyanobacteria, including Cyanorak [11], SynechoNET [12] and ProPortal [13]. These are dedicated resources with annotations for orthologous sequences of marine picocyanobacteria, protein-protein interaction data for Synechocystis, and information related to Prochlorococcus isolates, respectively.
Additionally, many specialized databases focus on a specific protein class or property exclusively for cyanobacterial species. These include cTFbase, a database of transcription factors [14]; CyanoPhyChe, which contains physico-chemical properties of cyanobacterial proteins [15]; CyanoClust, which includes homolog groups in cyanobacteria and plastids produced by the program Gclust [16]; CyanoEXpress, with curated genome-wide expression data [17]; and CyanoLyase, a database of phycobilin lyase sequences, motifs and functions [18]. Along with these online databases, CyanoNews [19], Cyanosite [20], CyanoData [21] and CyanoDB [22] are the major web resources extensively consulted by cyanobacteriologists. They provide basic information about cyanobacteria, current happenings in cyanobacterial research, the methods used in cyanobacteriology, bibliography archives, and listings of research groups involved in cyanobacterial research.
CKB, the database presented here, incorporates all 74 currently fully sequenced cyanobacterial genomes, together with customized tools for inclusive analysis of these genomes. The BLAST tool also helps in interpreting newly sequenced genomes by comparing them with previously annotated cyanobacterial and/or other model organism genomes. The flexibility of defining datasets by organism or order, or as whole genomes or plasmids, helps users narrow their searches and results to their specific needs. An additional significant feature is the inclusion of a model prokaryotic genome (E. coli) and a photosynthetic eukaryotic genome (Arabidopsis), which further assists comparative sequence analysis, thereby making CKB a unique and beneficial resource for cyanobacterial genome analysis.

Future Prospects
We plan to improve and update the content of CKB in the following respects. First, gene information will be enriched with experimentally proven results on biological functions, expression, and protein-protein interactions, manually curated from peer-reviewed literature. In addition, we intend to include or develop further tools to support the analysis of cyanobacterial genomes. Efforts will also be made to keep the database as user-friendly and efficient as possible, guided by feedback from users of the first version of CKB.

Conclusions
Here we present CKB as a knowledge database for cyanobacteriologists. CKB provides access to information related to fully sequenced genomes and can be used to analyze and retrieve that information. The CKB database website is freely accessible as a web application at: http://nfmc.res.in/ckb.

Data Collection and Organization
The complete genomes of 74 cyanobacteria were downloaded from the NCBI FTP site, and their accession numbers are listed in S1 File [23]. Sequence features, annotations, and external links were downloaded from the UniProt database in XML format for each gene [24]. All the data downloaded from the NCBI and UniProt databases were converted into CSV format and uploaded into a SQL database. The full schema of the database is included as S1 Fig. Complete data related to the sequences and annotations are stored in a MySQL database. The web interface was developed using PHP, the jQuery JavaScript library (v1.10), and Cascading Style Sheets (CSS). In addition, a simple HTML5 gene browser provided by Chase Miller [25] is incorporated into the gene report page. The BLAST 2.2.29+ tool was downloaded from the NCBI FTP site, and the pattern search, fuzzy search, and restriction digestion tools were obtained from the Sequence Manipulation Suite [26][27]. The web server and all components of the database are hosted on the NFMC portal, www.nfmc.res.in.
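The XML-to-CSV-to-SQL flow described above can be sketched as follows. This is an illustration under several assumptions rather than the pipeline actually used for CKB: the XML snippet is a toy record and not the real UniProt schema, the table and column names are hypothetical, and SQLite (from the Python standard library) stands in for MySQL so that the example is self-contained.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Toy XML invented for this example; it is NOT the real UniProt XML schema.
XML = """<entries>
  <entry accession="X00001" locus_tag="demo_0001">
    <feature type="domain" start="10" end="120" description="Example domain"/>
    <feature type="binding site" start="45" end="45" description="Example site"/>
  </entry>
</entries>"""

# Hypothetical column names for a sequence-feature table.
COLUMNS = ["accession", "locus_tag", "feature_type",
           "feat_start", "feat_end", "description"]


def xml_to_rows(xml_text):
    """Yield one tuple per sequence feature found in the XML."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        for feat in entry.iter("feature"):
            yield (entry.get("accession"), entry.get("locus_tag"),
                   feat.get("type"), int(feat.get("start")),
                   int(feat.get("end")), feat.get("description"))


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


def load_csv(conn, csv_text):
    """Create the feature table if needed and insert the CSV rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature ("
        "accession TEXT, locus_tag TEXT, feature_type TEXT, "
        "feat_start INTEGER, feat_end INTEGER, description TEXT)"
    )
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)", reader)
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
load_csv(conn, rows_to_csv(xml_to_rows(XML)))
for row in conn.execute("SELECT locus_tag, feature_type, feat_start FROM feature"):
    print(row)
```

Keeping the CSV step explicit mirrors the description in the text; in practice the parsed rows could also be inserted directly without the intermediate file.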
Supporting Information