IIS – Integrated Interactome System: A Web-Based Platform for the Annotation, Analysis and Visualization of Protein-Metabolite-Gene-Drug Interactions by Integrating a Variety of Data Sources and Tools

Background
High-throughput screening of physical, genetic and chemical-genetic interactions brings important perspectives to the Systems Biology field, as the analysis of these interactions provides new insights into protein/gene function, cellular metabolic variations and the validation of therapeutic targets and drug design. However, such analysis depends on a pipeline connecting different tools that can automatically integrate data from diverse sources and result in a more comprehensive dataset that can be properly interpreted.

Results
We describe here the Integrated Interactome System (IIS), an integrative platform with a web-based interface for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest. IIS works in four connected modules: (i) Submission module, which receives raw data derived from Sanger sequencing (e.g. two-hybrid system); (ii) Search module, which enables the user to search for the processed reads to be assembled into contigs/singlets, or for lists of proteins/genes, metabolites and drugs of interest, and add them to the project; (iii) Annotation module, which assigns annotations from several databases to the contigs/singlets or lists of proteins/genes, generating tables with automatic annotation that can be manually curated; and (iv) Interactome module, which maps the contigs/singlets or the uploaded lists to entries in our integrated database, building networks that gather newly identified interactions, protein and metabolite expression/concentration levels, subcellular localization and computed topological metrics, and GO biological process and KEGG pathway enrichment. This module generates an XGMML file that can be imported into Cytoscape or visualized directly on the web.

Conclusions
We have developed IIS by integrating diverse databases, in response to the need for appropriate tools for a systematic analysis of physical, genetic and chemical-genetic interactions. IIS was validated with yeast two-hybrid, proteomics and metabolomics datasets, but it is also extendable to other datasets. IIS is freely available online at: http://www.lge.ibi.unicamp.br/lnbio/IIS/.

their species names and IDs (Table S4). This allows more complete networks to be built per species, since protein-protein interactions studied in a particular subspecies are added to the generated network of that species. However, if the user wants to see only the interactions for that specific subspecies, the original taxonomic ID of each protein is also available in the generated XGMML file, so the user can easily filter this information in the annotation table inside Cytoscape (Data Panel).
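The subspecies-to-species conversion described above can be sketched as follows. This is a minimal illustration, not the IIS implementation: the mapping holds only two example NCBI taxonomy pairs (the full mapping is the one in Table S4), and the function names, dictionary keys and protein labels are hypothetical.

```python
# Illustrative subset only; the real mapping is the one listed in Table S4.
SUBSPECIES_TO_SPECIES = {
    559292: 4932,  # Saccharomyces cerevisiae S288C -> Saccharomyces cerevisiae
    83333: 562,    # Escherichia coli K-12 -> Escherichia coli
}


def species_level(tax_id):
    """Return the species-level taxonomic ID, or the ID itself if unmapped."""
    return SUBSPECIES_TO_SPECIES.get(tax_id, tax_id)


def annotate_protein(label, tax_id):
    """Keep both IDs so that networks are built per species while the
    original subspecies can still be filtered afterwards."""
    return {
        "label": label,  # hypothetical placeholder for a protein identifier
        "original_tax_id": tax_id,
        "species_tax_id": species_level(tax_id),
    }


def filter_by_original_taxon(nodes, tax_id):
    """Mimic filtering the annotation table by the original taxonomic ID."""
    return [node for node in nodes if node["original_tax_id"] == tax_id]


nodes = [
    annotate_protein("protein_A", 559292),
    annotate_protein("protein_B", 4932),
]
```

Both example nodes end up in the same species-level network (`species_tax_id` 4932), while `filter_by_original_taxon(nodes, 559292)` recovers only the protein originally reported for the subspecies.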

Integration of experimental interaction data
GPMGDID is a non-redundant database which integrates all protein-metabolite-gene-drug interactions described in several public databases and for several organisms.
Interaction pairs are classified by experimental methodology (e.g. two-hybrid, pull down, genetic interference, etc.), organism and source literature (PubMed ID of the paper in which the interaction was described), while the proteins/genes involved in the interaction are characterized by Gene Ontology annotations (biological process, molecular function and cellular component) and KEGG pathways, which enables the compartmentalization and enrichment analyses performed in the INTERACTOME MODULE.
The experimental information about protein-metabolite-gene-drug interactions can be accessed in several public databases, but there is no standardization of protein identifiers or of the description of experimental methods, which hinders the integration steps.
Although there are some efforts to construct standardized databases, such as the PSI-MI TAB format for protein-protein interactions and the JSON format for small molecule-protein databases, many other problems related to different protein identifiers (UniProt ID, gene symbol, Ensembl ID or RefSeq) and source literature (PubMed ID or DOI) need to be solved in order to organize this information as an integrated database. The GPMGDID database and scripts were developed to solve these problems as described below (Figure S1). In order to do the correlation of several protein identifiers, a database named HMDB_DrugBank_DB which contain information related to non-redundant protein-protein interactions and metabolite/drug-protein interactions, respectively. In order to eliminate redundant information produced by the same interaction pair being described in more than one public database, the scripts insert the interaction pair into the table only if an interaction pair ID given by UniProt_ID1_UniProtID2_PubMedID does not already exist. These scripts also convert the protein identifiers to UniProtKB_AC using the Ipmapping_selected table, and if more than one protein matches the same UniProtKB_AC, the reviewed entry (manually curated in the Swiss-Prot database) is prioritized. As described in the "Organism classification" section above, the taxonomic IDs for subspecies were converted to their corresponding species ID (Table S4), and the insert_interactionsDB.pl and insert_HMDB_drugbank_db.pl scripts use this information when saving the interaction pair into the GPMGDID database, so that the taxonomic ID is kept at the species level.
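The redundancy check on interaction pairs can be illustrated with a short Python sketch. It is not the code of the Perl scripts named above: the function names and the record layout are hypothetical, and sorting the two accessions so that A-B and B-A collapse into one key is an assumption that the text does not state.

```python
def pair_key(uniprot_a, uniprot_b, pubmed_id, symmetric=True):
    """Build an ID of the form UniProtID1_UniProtID2_PubMedID.

    symmetric=True sorts the two accessions first (an assumption made
    for this sketch, not something described in the text).
    """
    if symmetric:
        uniprot_a, uniprot_b = sorted((uniprot_a, uniprot_b))
    return f"{uniprot_a}_{uniprot_b}_{pubmed_id}"


def insert_non_redundant(records):
    """Insert each (protein A, protein B, PubMed ID, source) record only
    if its interaction pair ID is not already present in the table."""
    table = {}
    for uniprot_a, uniprot_b, pubmed_id, source in records:
        key = pair_key(uniprot_a, uniprot_b, pubmed_id)
        if key not in table:
            table[key] = {
                "proteins": (uniprot_a, uniprot_b),
                "pubmed_id": pubmed_id,
                "source": source,
            }
    return table


# Toy records; the PubMed IDs and source names are placeholders.
records = [
    ("P04637", "Q00987", "PMID_1", "source_db_1"),
    ("P04637", "Q00987", "PMID_1", "source_db_2"),  # same pair, same paper
    ("Q00987", "P04637", "PMID_1", "source_db_3"),  # reversed order
    ("P04637", "Q00987", "PMID_2", "source_db_2"),  # same pair, other paper
]
table = insert_non_redundant(records)
```

In this toy input, the same pair reported by three sources for the same paper yields a single entry, while the pair described in a second paper is kept as a separate entry.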
Finally, the UniProtGO and Kegg2UniProt tables contain 1:N relationships between UniProtKB_AC and Gene Ontology IDs or UniProtKB_AC and KEGG pathways, respectively. These tables are created by the insert_uniprotgo.pl and insert_kegg.pl scripts and used to perform the analyses of protein/gene compartmentalization and enrichment in the INTERACTOME MODULE.

IIS pipeline
The Integrated Interactome System (IIS) integrates four different modules for processing, annotation, analysis and visualization of the interaction profiles. The system accepts three types of input data: chromatograms, protein/gene lists and metabolite/drug lists. Figure S2 shows the pipeline constructed to process these datasets in order to connect them to the GPMGDID database through the INTERACTOME MODULE. As shown in Figure S2, among the three types of input the chromatogram data are the most complicated to process, since they require several processing steps: chromatogram submission, chromatogram processing, reads annotation, reads clustering/assembly and contigs annotation. All results obtained from each step can be accessed by the user through the IIS interface. The chromatogram submission step receives the uploaded chromatograms as a ZIP file (each file containing from one to 96 chromatograms named according to their position in the 96-well sequencing plate, e.g. A01 to H12), checks the integrity of the ZIP file and of the individual chromatograms after uncompressing, organizes the uncompressed chromatograms in a directory structure (chromat_dir, edit_dir and phd_dir) to enable the execution of the PHRED and BDTrimmer programs, sends an email to the user summarizing the information about the chromatogram processing, and finally starts the reads annotation step by blasting the trimmed read sequences against the GenBank/NR (NCBI) protein database. All these steps are performed in the SUBMISSION MODULE. The reads annotation step performs only a partial annotation (BLASTx against GenBank/NR (NCBI) with an e-value threshold of 1e-10), allowing the user to verify the protein identity and homologous organism of each read. For chromatogram submissions in which Homo sapiens is the selected organism (organism = HS in the Nomenclature field of the SUBMISSION MODULE), the reads annotation pipeline prioritizes the best RefSeq alignment of a Homo sapiens hit.
The partial annotation of the reads, together with the submission report, can help the user evaluate the quality of the yeast two-hybrid (Y2H) experiments and sequencing.
Finally, in order to eliminate the redundant reads typically generated by Y2H and transcriptome assays, the reads are assembled into clusters (contigs and singlets) using the CAP3 program with default parameters (overlap length cutoff ≥ 40 and overlap percent identity cutoff ≥ 80).
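As an illustration of the checks performed at chromatogram submission, the sketch below validates an uploaded ZIP archive with the Python standard library. It is an assumption-laden example rather than the IIS code: the function name, the decision to raise errors, and the `.ab1` extension used in the usage note are hypothetical; only the 1 to 96 file limit and the A01 to H12 naming come from the text.

```python
import io
import re
import zipfile
from pathlib import PurePosixPath

# Well positions of a 96-well plate: rows A-H, columns 01-12.
WELL_NAME = re.compile(r"^[A-H](0[1-9]|1[0-2])$")


def check_submission(zip_bytes):
    """Return (file names, names that do not match a well position).

    Raises ValueError if the archive is corrupt or does not contain
    between 1 and 96 files.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        corrupt = archive.testzip()  # name of the first bad member, or None
        if corrupt is not None:
            raise ValueError(f"corrupt member in archive: {corrupt}")
        # Skip directory entries, which end with a slash.
        names = sorted(n for n in archive.namelist() if not n.endswith("/"))
    if not 1 <= len(names) <= 96:
        raise ValueError(f"expected 1 to 96 chromatograms, found {len(names)}")
    # Compare the file name without its extension against the well pattern.
    misnamed = [n for n in names if not WELL_NAME.match(PurePosixPath(n).stem)]
    return names, misnamed
```

For an archive containing `A01.ab1`, `H12.ab1` and `Z99.ab1`, the function would report `Z99.ab1` as misnamed, since Z99 is not a valid position on a 96-well plate.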

Annotation table creation
Nine databases were downloaded to be used in the ANNOTATION MODULE: Ensembl of proteins/genes and contigs/singlets are obtained through searching against our ID mapping using UniProtKB_AC as a unique identifier. For the contigs/singlets, a previous step is necessary in which the contig/singlet sequences are blasted (BLASTx) against the Swiss-Prot database [43] with an e-value threshold of 1e-20, and the association between contigs/singlets and UniProtKB_AC is performed by using the best alignment hit with a sequence identity ≥ 30%. Here users can choose either to blast their sequences against the complete Swiss-Prot database, by selecting the "All" option in the Organism field in the Annotation Builder window, or to blast against the Swiss-Prot database restricted to one specific organism. The annotation process for CDD (ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd) and PDB (ftp://ftp.ncbi.nlm.nih.gov/blast/db/pdbaa) is based on sequence similarity, using BLAST against these protein databases with e-value thresholds of 1e-10 and 1e-20, respectively. In the case of CDD, the RPSBLAST program was used for querying.
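The association between contigs/singlets and UniProtKB_AC can be sketched as a filter over BLAST tabular output. This is a hedged example, not the IIS pipeline: it assumes the standard 12-column tabular format (query, subject, percent identity, ..., e-value, bit score), Swiss-Prot subject IDs of the form `sp|ACCESSION|ENTRY_NAME`, and "best hit" meaning the highest bit score among hits passing both thresholds. The function names are hypothetical.

```python
def accession(subject_id):
    """Extract the accession from an ID such as 'sp|P04637|P53_HUMAN'."""
    parts = subject_id.split("|")
    return parts[1] if len(parts) >= 3 else subject_id


def best_hits(tabular_lines, max_evalue=1e-20, min_identity=30.0):
    """Map each query to the accession of its best passing hit.

    Columns used (0-based): 0 query, 1 subject, 2 percent identity,
    10 e-value, 11 bit score.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue or identity < min_identity:
            continue
        # Assumption of this sketch: the best hit has the highest bit score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (accession(subject), bitscore)
    return {query: acc for query, (acc, _) in best.items()}


# Toy alignment rows; all numbers are invented for illustration.
example = [
    "Contig1\tsp|P04637|P53_HUMAN\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-120\t420",
    "Contig1\tsp|P02340|P53_MOUSE\t80.0\t200\t40\t0\t1\t600\t1\t200\t1e-90\t330",
    "Contig2\tsp|Q00987|MDM2_HUMAN\t25.0\t150\t112\t0\t1\t450\t1\t150\t1e-30\t120",
    "Contig3\tsp|P38398|BRCA1_HUMAN\t60.0\t90\t36\t0\t1\t270\t1\t90\t1e-5\t50",
]
```

With the annotation thresholds, only Contig1 is assigned an accession in this toy input: Contig2 fails the identity cutoff and Contig3 fails the e-value cutoff.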

Integration with the GPMGDID database
The fundamental point that enables the integration between the IIS pipeline and the GPMGDID database is the use of UniProtKB_AC as the unique identifier, so both the contigs/singlets and the protein/gene lists must be converted to UniProtKB_AC, as explained above. For the contigs/singlets, their sequences are blasted (BLASTx) against the Swiss-Prot database [43] with an e-value threshold of 1e-20, and the association between contigs/singlets and UniProtKB_AC is performed by using the best alignment hit with a stringent sequence identity ≥ 95%. In the case of protein/gene input data, the user can submit a list of Gene Symbol, UniProtKB_AC or RefSeq identifiers, so the conversion from Gene Symbol and RefSeq to UniProtKB_AC is necessary. It is obtained by searching our ID mapping database (Ipmapping_selected table), considering the organism defined by the user and choosing a reviewed entry if more than one protein matches the same ID in the database.
For the metabolite/drug list, there is a complete equivalence between the metabolite/drug IDs accepted as input for the IIS pipeline and those used in the GPMGDID database.
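The identifier conversion with priority for reviewed entries can be sketched as below. This is an illustration only: the row layout, the `MappingRow` and `to_uniprot_ac` names, and the toy accession `UNREVIEWED_1` are hypothetical, and taking the first match when several candidates remain is an assumption not described in the text.

```python
from typing import NamedTuple, Optional


class MappingRow(NamedTuple):
    """One row of a hypothetical ID mapping table."""
    identifier: str   # Gene Symbol or RefSeq ID supplied by the user
    organism: str     # organism defined by the user
    uniprot_ac: str   # target UniProtKB accession
    reviewed: bool    # True for reviewed (Swiss-Prot) entries


def to_uniprot_ac(identifier, organism, rows) -> Optional[str]:
    """Convert an identifier to a UniProtKB accession for one organism,
    preferring a reviewed entry when several proteins match."""
    candidates = [
        row for row in rows
        if row.identifier == identifier and row.organism == organism
    ]
    if not candidates:
        return None
    reviewed = [row for row in candidates if row.reviewed]
    # Assumption: if several candidates remain, take the first one.
    chosen = reviewed[0] if reviewed else candidates[0]
    return chosen.uniprot_ac


# Toy table; UNREVIEWED_1 is a placeholder, not a real accession.
rows = [
    MappingRow("TP53", "Homo sapiens", "UNREVIEWED_1", False),
    MappingRow("TP53", "Homo sapiens", "P04637", True),
    MappingRow("Trp53", "Mus musculus", "P02340", True),
]
```

Here `to_uniprot_ac("TP53", "Homo sapiens", rows)` selects the reviewed accession even though the unreviewed row comes first, and an identifier that is absent for the chosen organism yields `None`.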

Experimental methods for the detection of types of interactions
The classification of interactions into protein-protein (pp), protein-gene (pg) and gene-gene (gg) interactions in the network is based on the method by which the interactions were detected. The list of experimental methods from each database that makes up GPMGDID was analyzed, and the following methods were considered to identify pg and gg types of interactions: "chromatin immunoprecipitation array", "chromatin immunoprecipitation assay", "chromatin immunoprecipitation assays", "DNase I footprinting", and "one hybrid" to characterize pg, and "genetic interference" to characterize gg. The remaining methods were considered to characterize pp. Interactions between proteins and metabolites or proteins and drugs were classified as protein-metabolite (pm) and protein-drug (pd) interactions, respectively.
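The classification rule above amounts to a lookup on the detection method. The following Python sketch shows one way to express it; the function and constant names are hypothetical, and matching method names case-insensitively is an assumption of this example.

```python
# Method names taken from the text, stored in lower case.
PG_METHODS = frozenset({
    "chromatin immunoprecipitation array",
    "chromatin immunoprecipitation assay",
    "chromatin immunoprecipitation assays",
    "dnase i footprinting",
    "one hybrid",
})
GG_METHODS = frozenset({"genetic interference"})


def classify_interaction(method, partner_kind="protein"):
    """Return 'pp', 'pg', 'gg', 'pm' or 'pd' for one interaction.

    partner_kind is 'protein', 'metabolite' or 'drug' and describes the
    molecule that interacts with the protein/gene.
    """
    if partner_kind == "metabolite":
        return "pm"
    if partner_kind == "drug":
        return "pd"
    normalized = method.strip().lower()  # assumption: ignore case and padding
    if normalized in PG_METHODS:
        return "pg"
    if normalized in GG_METHODS:
        return "gg"
    return "pp"  # every remaining method characterizes protein-protein
```

For example, `classify_interaction("one hybrid")` gives `"pg"`, while a method such as `"two hybrid"` falls through to the default `"pp"`.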

Nodes and edges attributes
Nodes and edges are annotated in the generated network according to diverse attributes that can be used to cluster nodes or compare different networks. These attributes are defined as follows:
Node attributes:
-ID: gene, metabolite or drug name registered in our database.
-UniProt ID: gene/protein entry defined in the UniProt database.
-Gene: gene name registered in our database according to UniProt.
-Metabolite: metabolite name registered in our database according to HMDB, YMDB or ECMDB.
-Metabolite ID: metabolite identifier registered in our database according to HMDB, YMDB and/or ECMDB.
-Drug: drug name registered in our database according to DrugBank.
-Drugbank ID: drug identifier registered in our database according to DrugBank.
-p-value: p-value calculated for each node to be included in the network (the default is to include nodes with p-value ≤ 0.05).
-node.label: gene, metabolite or drug name registered in our database, which can be modified when using Cytoscape.
-node.shape: shape of each type of node. We use circle ("ellipse" in the xgmml) for genes/proteins, square ("rectangle" in the xgmml) for metabolites, and triangle ("triangle" in the xgmml) for drugs.
-node.fillColor: color of each node defined by values in the RGB scale. The defaults are: input contigs/singlets or genes/proteins from the project = blue; metabolites/drugs = yellow; bait (if applicable) = red; first neighbors = green; second neighbors = orange; third neighbors = purple. If node color is selected to be relative to fold change (FC) values, the node colors of the molecules are defined as follows: up-regulated molecules = red; down-regulated molecules = green; non-regulated molecules = yellow; first/second/third neighbors = gray.
-node.size: size of each node. The default gene/protein node sizes are defined by the equation: node.size = degree + 20. The default metabolite and drug node sizes are equal to 20. If node size is selected to be relative to fold change (FC) values, the node sizes of the molecules are defined by linear relationships, considering three FC ranges: 1-10, 10-100 and 100-1000.
-Degree (pp): degree connectivity of each node (the number of neighbors of a node), considering only interactions between genes/proteins (pp = protein-protein).
-Biological Process (GO): all the terms separated by ";" that define the biological processes of each gene/protein according to Gene Ontology (GO).
-Cellular Component (GO): all the terms separated by ";" that define the cellular component of each gene/protein according to Gene Ontology (GO). Several children terms were grouped in the same ancestral term in order to have a more concise list of the main cellular compartments (Table S2).
-Selected CC: a unique term selected from the "Cellular Component (GO)" attribute for each gene/protein, used to separate the nodes in a layout that is easier to visualize and interpret. This term is selected according to the following order of priority: extracellular > cell wall > plasma membrane > mitochondrion > endoplasmic reticulum > golgi apparatus > endosome > centrosome > microtubule organising centre > lysosome > vacuole > glyoxysome > glycosome > peroxisome > amyloplast > apicoplast > chloroplast > plastid > cytosol > cytoplasm > nucleus (Table S2).
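Some of the default node attributes listed above can be expressed compactly in code. The sketch below is illustrative rather than the IIS implementation: the function and constant names are hypothetical, lower-casing the GO terms before comparison is an assumption, and the fold-change-based sizes are omitted because their linear relationships are not specified here.

```python
# Priority order for "Selected CC", highest priority first.
CC_PRIORITY = [
    "extracellular", "cell wall", "plasma membrane", "mitochondrion",
    "endoplasmic reticulum", "golgi apparatus", "endosome", "centrosome",
    "microtubule organising centre", "lysosome", "vacuole", "glyoxysome",
    "glycosome", "peroxisome", "amyloplast", "apicoplast", "chloroplast",
    "plastid", "cytosol", "cytoplasm", "nucleus",
]

# Shape names as written in the XGMML file for each node type.
NODE_SHAPES = {"protein": "ellipse", "metabolite": "rectangle", "drug": "triangle"}


def select_cc(cellular_component):
    """Pick one term from a ';'-separated 'Cellular Component (GO)' value,
    following the priority order; return None if no listed term is present."""
    terms = {t.strip().lower() for t in cellular_component.split(";") if t.strip()}
    for term in CC_PRIORITY:
        if term in terms:
            return term
    return None


def default_node_size(kind, degree=0):
    """Default size: degree + 20 for genes/proteins, 20 otherwise."""
    return degree + 20 if kind == "protein" else 20
```

A protein annotated with "nucleus; cytoplasm; plasma membrane" is placed in "plasma membrane", the highest-priority term of the three, and a protein with seven neighbors gets a default size of 27.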