PAN2HGENE–tool for comparative analysis and identifying new gene products

Advances in next-generation sequencing (NGS) platforms have had a positive impact on biological research, leading to the development of numerous omics approaches, including genomics, transcriptomics, metagenomics, and pangenomics. These analyses provide insights into the gene contents of various organisms. However, understanding the evolutionary processes of these genes requires comparative analysis, which is also an important tool for annotation. Using comparative analysis, it is possible to infer the functions of gene contents and identify orthologous and paralogous genes via their homology. Although several comparative analysis tools currently exist, most of them are limited to complete genomes. PAN2HGENE, a computational tool that identifies gene products missing from the original genome sequence and performs automated comparative analysis for both complete and draft genomes, can be used to address this limitation. In this study, PAN2HGENE was used to identify new products, which altered the behavior of the alpha value in the pangenome without altering the original genomic sequence. Our findings indicate that this tool represents an efficient alternative for comparative analysis, with a simple and intuitive graphical interface. PAN2HGENE has been uploaded to SourceForge and is available at: https://sourceforge.net/projects/pan2hgene-software


Introduction
Next-generation sequencing (NGS) platforms have sparked a dramatic change in the history of genome sequencing processes. NGS facilitates complete sequencing of genomes at relatively low costs, allowing the development of several other analyses including comparative analysis [1,2].
The main advantages of NGS are the production of large amounts of data, lower costs, and reduced sequencing time. However, the emergence of these platforms brought the additional challenge of handling large volumes of data, underscoring the need for efficient data storage and management [1]. Over the years, bioinformatics approaches have contributed to improved handling and storage of data. Public databases, including NCBI, EBI, and DDBJ, are used by researchers from various fields for handling such data. These databases store diverse biological information including sequencing, annotation, and genome assembly data [3]. Although these platforms provide an understanding of the genomics of an organism, new tools are required to gain insights into the functions of gene contents. Comparative analysis is effective for this purpose and offers high accuracy in the structural annotation process, as this type of analysis allows identification of orthologous genes by homology [3,4].
Numerous computational tools have been developed to perform this type of analysis. Among these, the ABC software was created for interactive navigation of genomic data and can be used to perform multiple sequence alignments. This tool allows quantitative data on the alignments and annotations of the genes under study to be displayed simultaneously, thus highlighting the similarities in their sequences and evolutionary rates. Its purpose is to facilitate comparative sequence analyses, such as visualization of phylogenetic trees and generation of summary graphs [5].
PanTools is a software package that features genome annotation, sequence addition, gene clustering, genome reconstruction, pan-genome comparison, and query functionality. Its implementation is based on the Neo4j graph database and has been applied to large sets of genomes (62 Escherichia coli genomes, 93 yeast genomes, and 19 Arabidopsis thaliana genomes). This program facilitates the construction of pan-genomic databases of many genomes with extensions for sequence addition and ontology annotations, among others. According to its creators, PanTools is the starting point for a collection base and is used as a linear reference in the field of comparative genomics [6].
ITEP, an integrated toolkit for genome exploration, consists of a series of command-line scripts that allow identification, comparison, and curation of protein families. This tool uses a set of Python libraries to access genome data and runs scripts, workflows, and analyses over a complete collection of genomes. ITEP proves to be an advantageous and flexible option for comparative analysis of microbial pan-genomes because its modular design allows new functionalities and analysis workflows to be added [7].
Fast-D is a local annotation tool that allows the assignment of orthologs based on a reference genome. Using FASTA files as input, this tool allows users to customize parameters and reference databases, offering command-line options and editing of the original configuration file. Fast-D has two annotation phases: structural and functional. The structural phase predicts biological features (CDSs, RNAs, and CRISPRs), and the functional phase assigns functions to the proteins predicted from the CDSs. Each step of the annotation process is implemented through modules developed in Python, allowing the addition of extensions and new features. Fast-D provides better results for well-characterized organisms (Actinobacteria, Firmicutes, and Proteobacteria) than for less-studied species, for which annotation against the standard database may leave numerous genes uncharacterized [8].
Although the tools presented above make important contributions to comparative analysis, most of them have limitations when run through a web interface or require extensive command lines, which makes them harder to use. Thus, we present PAN2HGENE, a computational tool that allows the identification of gene products missing from the original genomic sequence and performs automated comparative analysis using both complete and draft genomes, through a simple and intuitive graphical interface.

Tool validation
For tool validation, reads of fifteen Escherichia coli strains were used. These data are available in the NCBI database in SRA format (https://www.ncbi.nlm.nih.gov/sra) and were downloaded using fastq-dump, a tool from the SRA Toolkit package (https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/), with the --split-files <sra number> parameter for paired data. Table 1 lists the strains used, their SRAs, and library type.
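As a minimal sketch of the download step described above, the following Python snippet builds the fastq-dump call for one run. The helper name `fastq_dump_command` and the output directory `reads` are illustrative assumptions and are not part of PAN2HGENE; the accession used is one of the SRA numbers mentioned later in the text.

```python
import shlex
import subprocess


def fastq_dump_command(accession, outdir="reads"):
    """Build the fastq-dump call for one paired-end SRA run.

    --split-files writes the two mates to separate FASTQ files;
    --outdir selects the destination directory (hypothetical default).
    """
    return ["fastq-dump", "--split-files", "--outdir", outdir, accession]


def download(accession, outdir="reads"):
    """Run fastq-dump; requires the SRA Toolkit to be installed."""
    subprocess.run(fastq_dump_command(accession, outdir), check=True)


# Show the command without executing it.
cmd = fastq_dump_command("SRR5168216")
print(shlex.join(cmd))
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact call before any data are transferred.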
Twelve complete genomes and three draft genomes were used in this analysis. To evaluate the effectiveness of the tool, the data were also analyzed using other comparative analysis tools, including PGAPWEB [9], PGAP [10], and PANWEB [11]. These tools were chosen because they all use PGAP to perform comparative analysis within their pipelines.

Programming language and database
PAN2HGENE was developed using Java, a robust and multiplatform programming language, and NetBeans IDE 12.0 (https://www.oracle.com). The Swing library was used to create a graphical interface. The database manager used was MySQL 8.0.23. The following processes were carried out in addition to development.

Mapping
Bowtie2 software version 2.3.5.1 (http://bowtie-bio.sourceforge.net/bowtie2/) was used to map the raw reads against the input file, which can be a draft or complete genome in FASTA format. As a result, a FASTQ file containing unmapped reads was generated [12].
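The mapping step can be sketched as two Bowtie2 calls: one to index the reference and one to align the paired reads while keeping those that fail to align. The text does not state which Bowtie2 options PAN2HGENE uses, so the choice of `--un-conc` (pairs that do not align concordantly), the thread count, and all file names below are assumptions made for illustration.

```python
from pathlib import Path


def bowtie2_commands(reference_fasta, reads_1, reads_2,
                     workdir="mapping", threads=4):
    """Return (index_command, align_command) for a paired-end mapping.

    In the --un-conc path, Bowtie2 replaces '%' with 1 or 2 to name
    the per-mate FASTQ files of the non-aligning pairs.
    """
    work = Path(workdir)
    index_prefix = work / "reference_index"  # hypothetical index name
    build = ["bowtie2-build", str(reference_fasta), str(index_prefix)]
    align = [
        "bowtie2",
        "-p", str(threads),
        "-x", str(index_prefix),
        "-1", str(reads_1),
        "-2", str(reads_2),
        "--un-conc", str(work / "unmapped_%.fastq"),
        "-S", str(work / "mapped.sam"),
    ]
    return build, align


build_cmd, align_cmd = bowtie2_commands(
    "genome.fasta", "reads_1.fastq", "reads_2.fastq")
print(" ".join(build_cmd))
print(" ".join(align_cmd))
```

The two command lists can be passed to `subprocess.run(..., check=True)` in order once the working directory exists and Bowtie2 is installed.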

De novo assembly
SPAdes software version 3.14.1 was used to assemble the unmapped reads with default parameter values [13]. The files with raw paired reads are first checked with the bbmap tool (sourceforge.net/projects/bbmap/) for orphaned reads, called singletons. This step is necessary to avoid possible errors in the assembly process with SPAdes.
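To illustrate what the singleton check accomplishes, the sketch below separates read names that occur in both FASTQ files from those that occur in only one. It is not how bbmap is implemented; it assumes four-line FASTQ records and mates whose names differ only by an optional /1 or /2 suffix, and both function names are hypothetical.

```python
def read_names(fastq_lines):
    """Yield read names from FASTQ text (four lines per record),
    dropping the leading '@' and any trailing /1 or /2 mate suffix."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header line of each record
            name = line.strip()[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def split_pairs_and_singletons(names_1, names_2):
    """Return (paired, singletons) as sets of read names."""
    set_1, set_2 = set(names_1), set(names_2)
    return set_1 & set_2, set_1 ^ set_2


# Toy input: r1 has both mates, r2 and r3 are orphaned.
forward = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "ACGT", "+", "IIII"]
reverse = ["@r1/2", "TTGA", "+", "IIII", "@r3/2", "TTGA", "+", "IIII"]
paired, singletons = split_pairs_and_singletons(
    read_names(forward), read_names(reverse))
print(sorted(paired), sorted(singletons))
```

Only the paired names would be passed on to a paired-end assembly; the singletons can be discarded or supplied to the assembler as unpaired reads.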

Annotation
Comparative analysis requires standardized annotation of the genomic sequences (complete or draft genomes) of all organisms included in the analysis. To standardize the annotation, PAN2HGENE allows the user to choose between the RAST web platform and Prokka (rapid prokaryotic genome annotation) [14].

Similarity
Blast software version 2 (https://blast.ncbi.nlm.nih.gov/Blast.cgi) was used to perform a similarity search between the products from the annotation of the input file and the products obtained by annotating the assembly of the unmapped reads.

Comparative analysis
The PGAP 1.2.1 software [10] was used to perform the comparative analysis. All parameter values can be adjusted by the user in the PAN2HGENE graphical interface. The parameter values used in this study were as follows: method, GF; e-value, 1e-10; coverage and identity, 0.7. R software (https://www.r-project.org/) was used for analysis and visualization of the results.

Pipeline
The PAN2HGENE pipeline is executed in two parts (Fig 1). The first identifies gene products missing from the original genomic sequence. The second carries out a comparative analysis of the target organisms of the study using their updated genomic sequences. The steps that make up the first part are: (i) input data: a draft or complete genome sequence in FASTA format (contigs or complete genome) used as reference, and raw data (reads) in FASTQ format; (ii) mapping: using Bowtie2, the raw data are mapped against the reference input file (FASTA) to obtain a FASTQ file with unmapped reads; (iii) de novo assembly of unmapped reads using default parameter values; (iv) annotation of the reference input file and the result file generated from the de novo assembly of unmapped reads (both in FASTA).
The user can choose between the RAST web platform and Prokka. If the user chooses RAST, the annotation file is downloaded in EMBL format at the end of the process. If the user chooses Prokka, the annotation runs locally and a GBK annotation file is generated.
(v) identification of new products: the CDSs are extracted from the annotation files and organized into a local database, and new products are identified using Blast 2 by comparing the CDSs extracted from the annotation of the reference input file against the CDSs extracted from the de novo assembly result; (vi) update of the input file: products for which no similarity was found in the BLAST analysis are added at the end of the input file. After the update process, a second round of annotation is performed.
The second part consists of: (i) file generation for comparative analysis: creation of pep, nuc, and function files from the annotation file (EMBL or GBK); (ii) comparative analysis using PGAP; and (iii) plotting of the results using R software.
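The selection of new products in steps (v) and (vi) of the first part can be sketched as a filter over BLAST tabular output. The sketch assumes the default 12-column tabular format (-outfmt 6), with the CDSs from the de novo assembly as queries and the reference CDSs as subjects; the text does not specify the output format, the search direction, or any threshold, so the function name, the optional `max_evalue` cutoff, and the toy identifiers are all hypothetical.

```python
def new_products(candidate_ids, blast_tabular_lines, max_evalue=None):
    """Return candidate CDS ids that have no BLAST hit.

    Columns follow the default -outfmt 6 layout, where column 1 is the
    query id and column 11 is the e-value. If max_evalue is given,
    weaker hits are ignored (an illustrative option, not a documented
    PAN2HGENE parameter).
    """
    matched = set()
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if max_evalue is None or evalue <= max_evalue:
            matched.add(query_id)
    return [cid for cid in candidate_ids if cid not in matched]


# Toy data: cds_1 has a strong hit, cds_3 a weak one, cds_2 none.
hits = [
    "cds_1\tref_7\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550",
    "cds_3\tref_2\t35.0\t80\t50\t2\t1\t80\t5\t84\t0.5\t30",
]
candidates = ["cds_1", "cds_2", "cds_3"]
print(new_products(candidates, hits))
print(new_products(candidates, hits, max_evalue=1e-10))
```

The sequences of the returned identifiers are the ones that would be appended to the end of the input FASTA file before the second round of annotation.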

Results and discussion
New product identification
PAN2HGENE identified missing products in most strains of E. coli analyzed. Table 2 shows the quantity of these new products, with fourteen of the fifteen strains analyzed presenting new products. The results are organized according to the Sequence Read Archive (SRA) number, new products, hypothetical protein quantity, average product size, and the total amount of the product.

Comparative analysis
Attempts to perform the analysis with the PANWEB software resulted in errors for the organisms with the following SRA numbers: SRR5168216, SRR5470155, and
To perform the PGAPweb tests, it was necessary to standardize the input files. According to the PGAPweb tool manual, input files can follow three patterns (http://pgaweb.vlcc.cn/pgaweb.vlcc. cndoc). The pattern chosen in this analysis was the generation of files with the pep, nuc, and function extensions; an ad hoc script was used to create these files.
The results of the tests with PGAPweb indicated that, although this tool generated the pangenome graph, it provided neither the file used for graph generation nor the alpha values needed to determine whether the pangenome was open or closed.
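The text does not state which model yields the alpha value. One commonly used convention is the Heaps' law power-law model, in which the number of new genes found when the N-th genome is added follows n(N) = kappa * N^(-alpha), and the pangenome is considered open when alpha <= 1 and closed when alpha > 1. Assuming that convention, the sketch below estimates alpha by least squares on log-transformed counts. The function names are hypothetical and the input values are synthetic, generated only to exercise the code; they are not results from this study.

```python
import math


def fit_alpha(genome_counts, new_gene_counts):
    """Fit n(N) = kappa * N**(-alpha) by linear regression in log-log
    space and return (kappa, alpha). All inputs must be positive."""
    xs = [math.log(n) for n in genome_counts]
    ys = [math.log(g) for g in new_gene_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    kappa = math.exp(mean_y - slope * mean_x)
    return kappa, -slope  # alpha is the negated log-log slope


def classify(alpha):
    """Open if alpha <= 1, closed otherwise (assumed convention)."""
    return "open" if alpha <= 1 else "closed"


# Synthetic curve for genomes 2..15 with kappa = 500 and alpha = 0.6.
counts = list(range(2, 16))
new_genes = [500 * n ** -0.6 for n in counts]
kappa, alpha = fit_alpha(counts, new_genes)
print(round(kappa, 3), round(alpha, 3), classify(alpha))
```

With real data, the new-gene counts are usually averaged over many random orderings of the genomes before fitting, since a single ordering gives a noisy curve.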
For this reason, the desktop version of the PGAP software was used to perform the comparative analysis, using as input the files previously generated for the PGAPWeb test. The results obtained using PAN2HGENE and PGAP are shown in Fig 2. The first result refers to the pangenome analysis.
Comparison of the pangenomic analyses of the fifteen E. coli strains corroborates the results obtained in Gordienko's study [15], which characterizes the pangenome as open. The alpha
A comparison was also made between the results of pangenomic analysis using the RAST and Prokka annotation software (Fig 5). Because each annotation tool follows its own strategy, the comparative analysis can be influenced by the software used in the annotation standardization process. However, the graphical analysis of the results, as well as the mean and median values of alpha, shows that this difference exists but is not significant.
However, the main focus of the analysis carried out with the PAN2HGENE pipeline is to maximize the representation of the genetic content of the organisms used in the analysis, resulting in a more accurate comparative analysis.
Table 3 highlights some functions performed by PAN2HGENE in comparison with the BPGA [16] and Roary [17] software, which also perform comparative analysis.
As future work, the next versions of this tool will include an XML parser that allows the user to employ other engines, such as BPGA and Roary, to carry out the comparative analysis.