Prometheus, an omics portal for interkingdom comparative genomic analyses

Functional analyses of genes are crucial for unveiling biological responses, genetic engineering, and developing new medicines. However, functional analyses have largely been restricted to model organisms, representing a major hurdle for functional studies and industrial applications. To resolve this, comparative genome analyses can be used to provide clues to gene functions as well as their evolutionary history. To this end, we present Prometheus, a web-based omics portal that contains 17,215 genome assemblies from prokaryotic and eukaryotic genomes. This portal supports interkingdom comparative analyses via Gene Search, a domain architecture-based gene identification system, and users can easily and rapidly identify single genes or entire gene sets in specific pathways. Bioinformatics tools for further analyses are provided in Prometheus or through Bio-Express, a cloud-based bioinformatics analysis platform. Prometheus is a new paradigm for comparative analyses of large amounts of genomic information.


Introduction
The completion of the Human Genome Project (2003) was not an end but rather a new beginning for further functional genomic analyses. The ENCyclopedia of DNA Elements (ENCODE) was launched to begin investigating the functions of the identified human genes [1]. In addition, large-scale functional studies, such as interactome or network analyses, were performed in model organisms, including Arabidopsis thaliana, Saccharomyces cerevisiae, and Drosophila melanogaster. These efforts accumulated network information on various interactomes and gene functions. These vast amounts of biological information enabled functional studies that contributed to the unveiling of biological responses, cloning of genes of interest, and development of molecular markers for model organisms or medicines in humans [2,3]. Thus, the focus of functional analyses has shifted from candidate gene research to genome-wide research. However, this flood of information has largely been restricted to model organisms, and it has been challenging for researchers to apply these data to newly sequenced genomes. Since next-generation sequencing (NGS) technology was developed in the mid-2000s, an enormous amount of genomic information has been analyzed and amassed in public databases. As the number of sequenced genomes increased, many tools and pipelines were developed to investigate gene functions, identify gene families, and perform comparative genomic analyses. However, the application of comparative analyses is restricted to functional gene annotations and newly sequenced genome analyses. Newly sequenced genomes are initially compared to those that have previously been analyzed, including genomes of closely related species, to provide information on genome structure changes and gene repertoires. Such comparisons can also predict gene paralogs, which are genes related by duplication events, or orthologs, which are those related by speciation events [4][5][6].
As orthologs tend to be more similar in function than paralogs [7], they are widely used for functional gene annotations [8]. Moreover, recent gene-of-interest studies that include multigenome orthologs offer insight into their mechanisms for adapting to the environment [9,10]. However, these comparative genomic analyses were performed at genome-, genus-, or kingdom-wide [11][12][13] levels, thereby restricting comparisons at the species, family, or order level. To understand the evolution of genes of interest more precisely, interkingdom analyses are needed, particularly because many genes in eukaryotic genomes have universal common ancestors in Bacteria and Archaea [14].
Here, we report Prometheus (https://prometheus.kobic.re.kr), an omics portal for interkingdom comparative genomic analyses. We collected 17,215 genome assemblies from 16,730 species and constructed four primary databases to provide basic genome information, with more detailed information on individual genes provided in secondary databases. Researchers can then access detailed information on genes of interest, such as gene structure, domain architecture, subcellular localization, orthologs, and paralogs, as well as their sequences. In particular, Prometheus provides Gene Search to identify genes of interest based on their domain architectures from prokaryotes to eukaryotes and to perform various comparative analyses, such as comparison of chromosome sequences, sequence alignment, and phylogenetic analyses. Furthermore, researchers can perform various bioinformatics analyses with these and their own sequencing data in a cloud-based platform, Bio-Express. Prometheus presents a new paradigm for genome research, from single genes of interest to entire gene pathways.

Web interface
Prometheus provides data searches, configuration of data analyses, data visualization, and storage of user data. The interface is implemented using Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS) and uses the jQuery JavaScript library to modify web page contents. To visualize data, a dynamic web interface is built with Asynchronous JavaScript and XML (Ajax) using the JavaScript Object Notation (JSON) data format. Furthermore, the genome browser was constructed using Scalable Vector Graphics (SVG), and the phylogenetic viewer was constructed using JavaScript. The web interface of Prometheus supports cross-browser use.

Construction of a taxonomy-combined heatmap of the photolyase/cryptochrome family
Sequences for the photolyase/cryptochrome family of genes from different species in a previous study [15] were collected, and their domain architectures were investigated using InterProScan v5.0. Each of the subtypes reported in previous studies was investigated using Gene Search in Prometheus. The number of genes in each subfamily was calculated for individual species and visualized as a heatmap using R scripts. The taxonomic tree was constructed using phyloT in iTOL [16], an online tool that generates phylogenetic trees based on the NCBI taxonomy. Finally, the taxonomic tree and heatmap were combined using Adobe Illustrator.
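The counting step that precedes the heatmap can be illustrated with a minimal Python sketch. The authors used R scripts; the function name `count_matrix` and the toy input pairs below are hypothetical and serve only to show how per-species subfamily counts could be arranged into a species-by-subfamily matrix.

```python
from collections import Counter

def count_matrix(hits):
    """Build a species x subfamily count matrix.

    hits: iterable of (species, subfamily) pairs, one pair per identified gene.
    Returns (species_labels, subfamily_labels, matrix), with labels sorted.
    """
    counts = Counter(hits)
    species = sorted({s for s, _ in counts})
    subfamilies = sorted({f for _, f in counts})
    # A Counter returns 0 for missing keys, so absent combinations become 0.
    matrix = [[counts[(s, f)] for f in subfamilies] for s in species]
    return species, subfamilies, matrix

# Toy input for illustration only (not data from the study).
hits = [
    ("Arabidopsis thaliana", "plant CRY"),
    ("Arabidopsis thaliana", "plant CRY"),
    ("Arabidopsis thaliana", "CRY-DASH"),
    ("Escherichia coli", "CPD photolyase"),
]
species, subfamilies, matrix = count_matrix(hits)
for name, row in zip(species, matrix):
    print(name, row)
```

Each row of the resulting matrix corresponds to one leaf of the taxonomic tree, which is what allows the heatmap and tree to be aligned side by side.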

Bioinformatics analysis using a cloud-based analysis system, Bio-Express
The LAST [17], BLAST [18], Clustal Omega [19], MUSCLE [20], and InterPro [21] programs are run in the hybrid-cluster system Bio-Express. To support further genomic analyses of personal data such as RNA-seq, ChIP-seq, or genome resequencing data, Prometheus links to Bio-Express, where users can analyze the personal data stored in My Genes with the various analysis pipelines in Bio-Express. Bio-Express is built on Hadoop to support high-speed analysis of large amounts of data. To maintain large data sets, Prometheus uses HDFS, which divides the data into blocks of an optimized size and distributes them across multiple servers. This storage system maintains three copies of user data, providing stable storage and reducing the risk of data loss. The web server of Prometheus transmits tasks, progress, and results of data analysis to the Bio-Express server using Apache Thrift library-based Remote Procedure Call (RPC) and receives results in JSON format. The results of genomic analyses are stored in HDFS and downloaded in the web browser using HTTP. For large amounts of data, users can download their data using GBox (High-Speed Data Transmission), high-speed file transmission software that uses TCP/IP, and transferred user data are stored in HDFS.
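To make the storage arithmetic concrete, the following Python sketch estimates how many HDFS blocks and replicas a file would occupy. The three-copy replication comes from the text above; the 128 MiB block size is an assumption (a common Hadoop default), since the optimized block size used by Bio-Express is not stated, and the function name `hdfs_footprint` is hypothetical.

```python
def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Estimate block count, replica count, and raw storage for one file.

    block_bytes is an assumed value; replication=3 follows the text.
    HDFS does not pad the final block, so raw storage is simply
    file size x replication.
    """
    blocks = -(-file_bytes // block_bytes)  # integer ceiling division
    return blocks, blocks * replication, file_bytes * replication

one_gib = 1024**3
blocks, replicas, raw_bytes = hdfs_footprint(one_gib)
print(blocks, replicas, raw_bytes)
```

Under these assumptions a 1 GiB upload is split into 8 blocks, stored as 24 block replicas, and consumes 3 GiB of raw disk.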

Database construction
The primary and secondary data tables in Prometheus were constructed using the MySQL database management system. Primary data tables were created from data made openly available in five public databases, and secondary data tables were constructed by parsing the results of bioinformatics tools such as InterProScan, OrthoMCL [22], MultiLoc2 [23], and TargetP [24]. Detailed methods for database construction are described in Supplemental Note Section 1.

Concept and construction of Prometheus
Prometheus (http://prometheus.kobic.re.kr) provides an integrated pipeline for interkingdom comparative genomic analyses and comprises four major sections, Genome Archive, Gene Search, Bio-Express, and Genome Analysis. Users can identify genes of interest using Gene Search and investigate their domain architectures using InterPro in Genome Analysis. Furthermore, users can obtain additional species information via accessing the Korean Bioresource Information System (KOBIS) or perform further analyses by accessing the cloud-based Bio-Express (Fig 1).
To establish Prometheus, 17,215 genome assemblies from 16,730 species were collected and stored in four primary databases. The genomic information in Genome Archive (Fig 2A and Table 1) is arranged by taxonomic rank (obtained from NCBI), which users can access by clicking the species or common name in the taxonomic tree or using a keyword search. This general information provides details on genome assembly, annotation, and taxonomy. For eukaryotic genomes, distinct versions of genome assembly and annotation were provided, and so each version is stored separately (Fig 2B and S1 Table in S1 File). Prokaryotic genomic information is separated by strain to support metagenomics analyses. Genomes were classified according to criteria from RefSeq, which provided most of the genomic data (S1 Table in S1 File), to construct the database and to visualize the genomic information. In total, 435 eukaryotic genomes, 15,984 prokaryotic genomes, and 311 archaeal genomes were collected and assembled into the four primary databases containing information on assembled genomes, general feature formats (GFFs), coding sequences (CDSs), and protein sequences, for a total of 213,478,449 records (S2 Table in S1 File). Five secondary databases containing information on subcellular localization, domains, and homologs in the same or different species were constructed (S3-S5 Tables in S1 File). Taxonomic information in Genome Archive is stored in a taxonomy database, and general information on genome assembly and annotation is stored in a genome report database. In total, 11 databases were constructed with 1,163,053,603 records (S3-S5 Tables in S1 File).

PLOS ONE
We investigated other genome databases, such as CoGe [25], Ensembl [26], PLAZA 4.0 [27], and MicrobesOnline [28], to compare genome contents and data types (S6 Table in S1 File). CoGe and Ensembl provide more genome assemblies than Prometheus. However, CoGe provides only total numbers of genomes. Ensembl provides ortholog/paralog information between two genomes, whereas Prometheus provides ortholog/paralog information among multiple genomes in the same family. PLAZA 4.0 and MicrobesOnline are focused on comparative
General information on individual genomes is obtained using Genome Browser (Fig 2C), with zoom in/out functions ranging from 1× to 10× and a gene search function by position or gene name. Users can access and download an individual gene's information (CDS and peptide sequence) by keyword search or by clicking within the Genome Browser. Detailed information on individual genes is provided in Gene Viewer (Fig 2D), which users can access from the Genome Browser or from result pages in Gene Search. Bioinformatics analyses, including InterPro [21], OrthoMCL [22], MultiLoc2 [23], and TargetP [24], were performed using protein sequences of each species to construct six secondary databases, which are presented in separate sections within the Gene Viewer (Fig 2D). Using Gene Viewer, researchers can save time by accessing various gene information in one place instead of visiting individual websites to determine subcellular localization, putative orthologs or paralogs, and domain architectures.

Analyses of transcription factors and the TCA cycle in Gene Search
The major function of Prometheus is to perform interkingdom comparative analyses. To support this objective, secondary databases containing information on domain architectures and orthologs/paralogs of individual genes were constructed. Prometheus contains a total of over 60 million unique proteins, which were extracted from primary databases such as Ensembl, Phytozome, RefSeq, Solgenomics, and others (S7 Table in S1 File). Domain architectures of individual proteins were analyzed using InterPro and are shown as IPR terms. Thus, users can identify genes of interest with high performance in Gene Search by entering their domain architectures as IPR terms (S8 Table in S1 File). We validated the utility of Prometheus by performing an interkingdom investigation of transcription factors (TFs) and genes involved in the TCA cycle using Gene Search (Fig 3 and S7 and S8 Tables in S1 File). A pipeline (iTAK v1.7) [29] was used to identify plant TFs and classify protein kinases. TFs, transcriptional regulators (TRs), and kinases were identified by consensus rules mainly summarized from PlnTFDB [30], PlantTFDB [31] with families from PlantTFact [32], and AtFDB [29]. Domain architectures of each TF were investigated using InterProScan, and their domain architectures depicted by IPR terms were used for further analyses using Gene Search. To provide additional information about identified genes, the number of domain subtypes is depicted in a summary table in Gene Search and as a header of sequence data in a FASTA file (S1 Fig in S1 File). Users can categorize identified genes into each subtype. We identified and validated 79,960 genes from 15 gene families using the iTAK pipeline v1.7 [29] (Fig 3A and S9 Table in S1 File). The accuracy of our
Genes involved in the TCA cycle were further investigated with Gene Search to demonstrate the potential for applying comparative genomics at the pathway level.
As the TCA cycle is a fundamental metabolic pathway for survival in prokaryotes and eukaryotes, we selected this for an interkingdom comparative genomic analysis. A total of 435,044 genes were identified from 20 individual genes in the TCA cycle using Gene Search, and the ratios of species harboring each gene in the TCA cycle were shown as heatmaps (Fig 3B and S10 Table in S1 File). These results showed that some genes, such as those encoding isocitrate dehydrogenase (IDH1 and IDH2) and malate dehydrogenase (MDH1 and MDH2) evolved in a lineage-specific manner. Furthermore, the results show the lineage-specific rates of functionally redundant genes, such as those encoding succinate dehydrogenase and succinyl-CoA synthase. This investigation of the TCA cycle also provided information on the gene repertoires and the evolution of the TCA cycle in each kingdom. Thus, Prometheus provides information for evolutionary studies of single genes or those in specific pathways, including the distributions and rates of genes, as well as repertoires of gene orthologs in pathways. In addition, Prometheus provides the domain architectures of genes as well as their CDSs and/or peptide sequences.
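The per-lineage ratios shown in such a heatmap can be computed along the following lines. This is a minimal Python sketch with hypothetical names (`harboring_ratios`, the toy species and lineage labels) and invented input; it is not the authors' script, and it assumes a species counts as harboring a gene if Gene Search returned at least one hit for it.

```python
from collections import defaultdict

def harboring_ratios(species_lineage, gene_hits):
    """Fraction of species in each lineage that harbor each gene.

    species_lineage: {species: lineage} for every species in the database.
    gene_hits: {gene: set of species with at least one identified copy}.
    Returns {gene: {lineage: ratio between 0.0 and 1.0}}.
    """
    totals = defaultdict(int)
    for lineage in species_lineage.values():
        totals[lineage] += 1

    ratios = {}
    for gene, species_set in gene_hits.items():
        present = defaultdict(int)
        for sp in species_set:
            present[species_lineage[sp]] += 1
        ratios[gene] = {lin: present[lin] / totals[lin] for lin in totals}
    return ratios

# Toy input for illustration only (not data from the study).
species_lineage = {
    "sp_A": "Bacteria", "sp_B": "Bacteria",
    "sp_C": "Archaea", "sp_D": "Eukaryota",
}
gene_hits = {
    "gene_1": {"sp_D"},
    "gene_2": {"sp_A", "sp_C", "sp_D"},
}
ratios = harboring_ratios(species_lineage, gene_hits)
print(ratios["gene_2"])
```

A gene whose ratio is near 1 in one lineage and near 0 in the others would appear as a lineage-specific band in the heatmap.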

Tools for comparative analyses and personalized management system via My Genes in Prometheus
To support comparative analyses in Prometheus, essential tools such as LAST [17] (a program for comparing sequences at the chromosome level), BLAST [18], and InterPro [21] are provided in Genome Analysis (S2 Fig in S1 File). Users can monitor the progress of analysis in a personalized page, My Genes (S3 Fig in S1 File), and download the result files from each program via a file menu. In the case of data from InterProScan, the result file is shown in a graphic format and results are downloaded in a .tsv file format (S4 Fig in S1 File). Thus, users can investigate domain architectures of genes of interest and perform interkingdom identification using Gene Search.
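A downloaded InterProScan .tsv file can be turned into per-protein domain architectures and queried by IPR terms, roughly as sketched below in Python. The sketch assumes the standard InterProScan 5 TSV column order (protein accession in column 1, match start in column 7, InterPro accession in column 12). The function names, the protein identifiers, and the IPR accessions in the sample are arbitrary placeholders, and the subset-matching rule is an assumption rather than the documented behavior of Gene Search.

```python
from collections import defaultdict

def domain_architectures(tsv_text):
    """Map each protein to its IPR terms, ordered by match start position."""
    hits = defaultdict(list)
    for line in tsv_text.splitlines():
        cols = line.split("\t")
        # Skip matches without an integrated InterPro entry.
        if len(cols) < 12 or not cols[11].startswith("IPR"):
            continue
        hits[cols[0]].append((int(cols[6]), cols[11]))
    archs = {}
    for protein, entries in hits.items():
        ordered = []
        for _, ipr in sorted(entries):
            if ipr not in ordered:  # keep the first occurrence of each term
                ordered.append(ipr)
        archs[protein] = ordered
    return archs

def search(archs, query):
    """Return proteins whose architecture contains every queried IPR term."""
    wanted = set(query)
    return sorted(p for p, arch in archs.items() if wanted <= set(arch))

# Placeholder rows for illustration only.
sample = "\n".join([
    "protA\tmd5\t500\tPfam\tPF00001\tdesc\t10\t120\t1e-5\tT\t01-01-2020\tIPR000001\tdescA",
    "protA\tmd5\t500\tPfam\tPF00002\tdesc\t200\t400\t1e-5\tT\t01-01-2020\tIPR000002\tdescB",
    "protB\tmd5\t300\tPfam\tPF00002\tdesc\t5\t150\t1e-5\tT\t01-01-2020\tIPR000002\tdescB",
    "protB\tmd5\t300\tCoils\tCoil\t-\t160\t180\t-\tT\t01-01-2020",
])
archs = domain_architectures(sample)
print(search(archs, ["IPR000001", "IPR000002"]))
```

Querying with more IPR terms narrows the result set, which mirrors how a more specific domain architecture identifies a more specific gene subfamily.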
We performed a comparative analysis of genes in the photolyase/cryptochrome family using a gene set from a previous study [15] as a control (Fig 4 and S11 Table in S1 File). The domain architectures of the photolyase/cryptochrome subfamilies are the same, but their family IPR terms differ (Fig 4A), enabling a more accurate identification of each subfamily. The results also indicated lineage-specific distributions of photolyase/cryptochrome gene families in each kingdom. Furthermore, the gene repertoires of each subgroup of these families are shown in a combined taxonomy heatmap (Fig 4B), demonstrating lineage-specific evolution and the expansion of subgroups at the species level. These data demonstrate that Gene Search and bioinformatics tools in Genome Analysis in Prometheus support interkingdom comparative analyses. In summary, Prometheus provides the bioinformatics tools essential for comparative analyses, and users can combine these tools with interkingdom comparative analyses in Gene Search to unveil gene function or the evolution of genes/gene families.

Further genomic analyses using Bio-Express with personalized data via My Genes
Personal data, such as RNA-seq or ChIP-seq data, and sequence data downloaded from Prometheus (e.g., genome, CDS, and peptide) in Genome Archive or FASTA files from Gene Search can be uploaded and stored in My Genes (S3 Fig in S1 File) and further analyzed using the cloud-based Bio-Express platform (https://www.bioexpress.re.kr/). The Bio-Express system consists of cluster nodes for bioinformatics analysis, Hadoop Distributed File System (HDFS) storage for data deposition, cache solutions, and a distributed task scheduler (S5 Fig in S1 File). The Bio-Express hardware system consists of 900 CPU cores, 9.1 TB of memory, and 1.5 PB of disk storage in total [33]. Programs for bioinformatics analysis in Bio-Express are modularized and shown as icons (S6 Fig in S1 File). Users can construct their own analysis pipelines by selecting modularized programs and linking them with arrows; the programs provided in Bio-Express are summarized in S12 Table in S1 File. We performed a transcriptomic analysis in Bio-Express using the genome of Hibiscus syriacus [6] and RNA-seq data. For this, the TopHat2 [34] and Cufflinks [35] programs were used, and genes differentially expressed in tissues from a previous study were identified and visualized as a heatmap (S7 Fig in S1 File). Thus, users can perform bioinformatics analyses with personal data in My Genes by linking to Bio-Express. This combination of Prometheus and Bio-Express can provide a convenient and user-friendly analysis environment for non-bioinformatician scientists.
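A pipeline assembled by linking modules with arrows is, in effect, a directed acyclic graph, and a valid execution order can be derived by topological sorting. The Python sketch below illustrates the idea using the standard-library `graphlib` module (Python 3.9 or later). It is not the Bio-Express scheduler; the function name `execution_order` and the `Heatmap` module are hypothetical.

```python
from graphlib import TopologicalSorter

def execution_order(links):
    """Order modules so that every upstream program runs before its consumers.

    links: iterable of (upstream, downstream) pairs, one per arrow drawn
    between two modules. Raises graphlib.CycleError if the arrows form a loop.
    """
    sorter = TopologicalSorter()
    for upstream, downstream in links:
        # add(node, *predecessors): downstream depends on upstream.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())

# The RNA-seq example from the text, with a hypothetical final plotting module.
links = [("TopHat2", "Cufflinks"), ("Cufflinks", "Heatmap")]
order = execution_order(links)
print(order)
```

Representing the arrows as dependency pairs also makes it straightforward to reject pipelines that contain a cycle before any job is submitted.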

Discussion
Since NGS technology was developed and applied to biology, vast amounts of genomic data have accumulated. With these data, comparative analyses of species or genes can be performed to unveil gene function or evolution. For instance, the evolution of pungency in peppers was discovered by a comparative analysis with tomato and potato genomes [5]. However, only a small number of biologists can perform these comparative analyses using bioinformatics tools. Indeed, the accessibility of bioinformatics analysis is currently a major hurdle for ongoing biological research. Thus, we constructed Prometheus, a web-based omics portal for interkingdom comparative genomic analyses. Biologists can identify genes or gene families of interest using the domain architectures in Gene Search. Genes from multigene families containing various domain architectures can be detected, such as for the photolyase/cryptochrome family [15] and the nucleotide-binding leucine-rich repeat gene family [36]. Additional subtype information on identified genes is provided in the headers of their sequences in FASTA files.
The goal of combining kingdom-wide gene identification with subtype information is to provide evolutionary insight by detecting lineage-specific subtypes or subtype distribution patterns, as exemplified by the analysis of gene subtypes involved in the TCA cycle. Moreover, users can perform comparative analyses of single genes as well as sets of genes involved in specific signaling pathways. We found that genes containing specific domains showed high rates of accuracy in domain architecture-based Gene Search in Prometheus. However, the accuracy was reduced for genes without specific IPR terms, which is a limitation of domain architecture-based gene search systems using InterPro or the Pfam database. Nevertheless, this limitation will be minimized as Prometheus is updated with new releases of these databases.
To support comparative analyses, Prometheus incorporates various tools, such as LAST, Clustal Omega, and Phylogeny viewer, in Genome Analysis. This is a valuable addition, as there are currently few web sites for comparative analyses with large restrictive or functionally important gene families, such as TFs. For TFs in plants, there are two representative web sites, PlnTFDB [30] and PlantTFDB [31], but their gene repertoires differ because of their different rules for the identification of TFs [37]. Prometheus clears this particular hurdle via its domain architecture-based Gene Search system, thereby providing biologists with a powerful comparative analysis platform with various tools for further studies.
Prometheus provides information to users on individual genomes assigned by taxonomy in Genome Archive via Genome Browser. Here, users can download the genomic and peptide sequences and CDSs as well as upload their own data for further analyses in Prometheus or the cloud-based Bio-Express platform. Furthermore, users can access detailed information on genes of interest in the Gene Viewer page. The connection with Bio-Express enables Prometheus to provide various bioinformatics tools and allows biologists to analyze their own data in the same platform. Thus, unlike other comparative genomics portals or platforms, Prometheus provides tools not only for comparative analyses but also for genomic analyses, such as transcriptome or resequencing analyses.

Conclusion
Prometheus is an integrated platform for interkingdom comparative genomic analyses with additional support for other genomic analyses with the user's own data. Users can identify genes of interest from the domains Archaea, Bacteria, and Eukarya based on their domain architecture using Gene Search, as well as by conventional methods using sequence similarity. The domain architecture-based gene search can provide more precise gene sets than sequence similarity-based searches. Users can investigate detailed information including domain architectures, subcellular localization, and putative orthologs or paralogs of individual genes identified by Gene Search in Gene Viewer and predict their putative functions. Users can also carry out interkingdom analyses of large data sets for evolutionary studies. Analysis tools such as LAST, Clustal Omega, and Phylogeny viewer will support such studies. Thus, Prometheus offers biologists a new paradigm for comparative genome analyses and evolution studies. The platform and InterPro version will be updated annually with newly sequenced genomes to ensure that broad and precise data are available to researchers. Furthermore, newly developed tools for comparative genomic analyses will continue to be added to support various analyses. Finally, visualization of domain subtype architectures identified by Gene Search is now being developed and will be made available in an update in the near future.