13 Jan 2014: Shahbaaz M, Hassan, Ahmad F (2014) Correction: Functional Annotation of Conserved Hypothetical Proteins from Haemophilus influenzae Rd KW20. PLOS ONE 9(1): 10.1371/annotation/23d005b8-fe53-4b14-a31c-915be3e839b5. https://doi.org/10.1371/annotation/23d005b8-fe53-4b14-a31c-915be3e839b5
Haemophilus influenzae is a Gram-negative bacterium of the family Pasteurellaceae that causes bacteremia, pneumonia and acute bacterial meningitis in infants. The emergence of multi-drug-resistant H. influenzae strains among clinical isolates demands the development of new and better drugs against this pathogen. Our study combines a number of bioinformatics tools to predict the functions of previously unassigned proteins in the genome of H. influenzae. Extensive analysis of this genome identified 1,657 functional proteins, of which 429 have unknown function and are termed hypothetical proteins (HPs). The amino acid sequences of all 429 HPs were extensively annotated, and we assigned functions to 296 HPs with high confidence. We also characterized the functions of 124 HPs, but with less confidence. We believe that the sequence of a protein can be used as a framework to explain its known functional properties. Here we have combined the latest versions of protein family databases, protein motifs, intrinsic features of the amino acid sequence, and pathway and genome context methods to assign precise functions to hypothetical proteins for which no experimental information is available. We found that these HPs belong to various classes of proteins, such as enzymes, transporters, carriers, receptors, signal transducers, binding proteins, and virulence and other proteins. The outcome of this work will be helpful for a better understanding of the mechanism of pathogenesis and for finding novel therapeutic targets for H. influenzae.
Citation: Shahbaaz M, Md. Imtaiyaz Hassan, Ahmad F (2013) Functional Annotation of Conserved Hypothetical Proteins from Haemophilus influenzae Rd KW20. PLoS ONE 8(12): e84263. https://doi.org/10.1371/journal.pone.0084263
Editor: Eugene A. Permyakov, Russian Academy of Sciences, Institute for Biological Instrumentation, Russian Federation
Received: October 3, 2013; Accepted: November 21, 2013; Published: December 31, 2013
Copyright: © 2013 Shahbaaz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors sincerely thank Indian Council of Medical Research for financial assistance (Grant No. BIC/12(04)/2012). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Haemophilus influenzae strain Rd KW20 is a Gram-negative bacterium frequently isolated from the lower respiratory tract of patients with chronic bronchitis, which is the “fourth-most-common” cause of death in the United States. Owing to its comparatively small genome and its phylogenetic closeness to Escherichia coli, H. influenzae is a very convenient model organism for genomic and proteomic studies. The genome of H. influenzae was successfully sequenced; it consists of 1,830,140 base pairs in a single circular chromosome that contains 1740 protein-coding genes, 2 transfer RNA genes, and 18 other RNA genes. Because its whole genome has been sequenced, H. influenzae serves as a model organism for whole-genome annotation, computational analysis and cross-genome comparisons. Furthermore, the construction of genome-scale models of metabolic fluxes and whole-genome transposon mutagenesis analysis were first implemented in H. influenzae. In this study it is also used as a test genome to evaluate the performance of various bioinformatics approaches for proteome analysis, with the ultimate aim of determining the in silico properties of the protein set expressed by the bacterium under certain conditions.
Genomic analysis of 102 bacterial genomes shows that the respective genomic pool contains 45,110 proteins organized in 7853 orthologous groups with unknown function. Proteins with unknown function may be termed hypothetical proteins (HPs) or putative conserved proteins because they show limited correlation to known annotated proteins. HPs have not been functionally characterized or described at the biochemical and physiological level. Nearly half of the proteins in most genomes are HPs, and this class of proteins presumably has its own importance in completing genomic and proteomic information. We have been working on structure-based rational drug design, which always requires a selective target. Precise annotation of the HPs of a particular genome leads to the discovery of new structures as well as new functions, and helps to bring out a list of additional protein pathways and cascades, thus completing our fragmentary knowledge of the mosaic of proteins. Furthermore, novel HPs may also serve as markers and pharmacological targets for drug design, discovery and screening.
The use of advanced bioinformatics tools for sequence analysis and comparison is an initial step in identifying homologues, even when only part of a region is shared between proteins, and can lead to robust function prediction. The most commonly used method for functional prediction of gene products is the identification of related, well-characterized homologues using sequence-based search procedures such as BLAST. Multiple sequence alignment of the homologues of a family is a suitable method for identifying structurally or functionally important positions and structurally conserved domains. We have considered functional domains as the basis for inferring the biological role of HPs. Motif analysis is an obligatory step in the identification and characterization of HPs. Detection of common motifs among proteins, in particular those with absent or low sequence identity (e.g. less than 30%), may provide important clues for the function or classification of HPs into appropriate families. A series of signature databases are publicly available and are used for motif finding, including GenomeNet (which contains PROSITE, PRINTS, Pfam, ProDom and BLOCKS) and InterPro, searched using InterProScan. A potent method for motif searches is the MEME suite, a resource for investigating candidate functional and structural motifs or sites in HPs (Table 1). Furthermore, the study of protein interactions using the STRING database is crucial for understanding the functional role of individual proteins in a well-organized biological network.
Here we have used recent bioinformatics tools to assign functions to all HPs encoded by the H. influenzae genome. Receiver Operating Characteristic (ROC) analysis is used to evaluate the performance of these tools. We also measured the confidence level of each function prediction on the basis of the tools that support it. A function prediction has a high confidence level if more than three tools indicate the same function; if fewer than three tools agree, the function is considered less confidently predicted. In this way we assigned functions to 296 HPs of the H. influenzae genome with high confidence. We have also performed an extensive sequence analysis of proteins associated with virulence using tools such as VirulentPred and VICMpred, because H. influenzae is a causative agent of respiratory tract infection.
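The confidence rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `consensus_annotation`, the dictionary layout and the example labels are hypothetical. Because the text does not say how exactly three agreeing tools are treated, the sketch assumes that only more than three agreeing tools count as high confidence.

```python
from collections import Counter

def consensus_annotation(predictions, min_tools=3):
    """Pick the most frequently predicted function and rate its confidence.

    predictions maps a tool name to the function it predicts (None if no hit).
    Assumption: 'high' requires MORE than min_tools agreeing tools.
    """
    votes = Counter(label for label in predictions.values() if label)
    if not votes:
        return None, 0, "unassigned"
    label, support = votes.most_common(1)[0]
    confidence = "high" if support > min_tools else "low"
    return label, support, confidence

# Hypothetical predictions for one HP (labels invented for illustration)
example = {
    "BLASTp": "ABC transporter",
    "Pfam": "ABC transporter",
    "InterProScan": "ABC transporter",
    "SMART": "ABC transporter",
    "PANTHER": None,
}
print(consensus_annotation(example))
```

Returning the number of supporting tools alongside the label keeps the evidence visible, so a borderline case can be reviewed by hand instead of being reduced to a single flag.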
Materials and Methods
The computational framework used for functional annotation of HPs is given in Figure 1 and is divided into three phases, namely Phase I, II and III. Phase I includes the characterization and sequence retrieval of HPs by analyzing the genome of H. influenzae. Phase II comprises the automated annotation of various functional parameters using online servers. Phase III is the systematic performance evaluation of the bioinformatics tools by ROC analysis, using H. influenzae protein sequences with known function. The probable functions of the characterized HPs were predicted by integrating the functional predictions made in Phase II. In the latter phase, expert knowledge is used for performing the ROC analysis and for confidently annotating the functional properties of the HPs.
Methodology is divided into three phases: PHASE I. H. influenzae HP characterization and sequence retrieval from online databases. PHASE II. Extensive analysis of the sub-cellular localization, physicochemical parameters, virulence, function and domains of the HPs. PHASE III. Assessment of the predicted functions using proteins with known function from H. influenzae, and reliable prediction of the possible functions of the HPs.
We analyzed the genome of H. influenzae and found 1,657 proteins (http://www.ncbi.nlm.nih.gov/genome/). Of these, 429 proteins are characterized as HPs, and their FASTA sequences were retrieved from UniProt (http://www.uniprot.org/) using the primary accession number of each HP.
Expasy's ProtParam server was used for the theoretical measurement of physicochemical properties such as molecular weight, isoelectric point, extinction coefficient, instability index, aliphatic index and grand average of hydropathicity (GRAVY). The predicted parameters are listed in Table S1.
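Two of these parameters have simple closed forms, which can be sketched as follows. This is an illustration of the published formulas (the Kyte-Doolittle hydropathy scale for GRAVY, and the aliphatic index as X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)) with X in mole percent), not a reproduction of the ProtParam server; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = sequence.upper()
    # A non-standard residue raises KeyError rather than being silently skipped
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

def aliphatic_index(sequence):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    seq = sequence.upper()

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))

print(round(gravy("AVIL"), 3), round(aliphatic_index("AVIL"), 1))
```

A positive GRAVY value indicates an overall hydrophobic sequence, which is one reason such parameters help to flag candidate membrane proteins.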
Knowledge of sub-cellular localization can be used to characterize a protein as a drug or vaccine target. Proteins localized in the cytoplasm can act as possible drug targets, while surface membrane proteins are considered potent vaccine targets. Databases like UniProt provide valuable information about the sub-cellular location of proteins. Where experimental information about HP localization was absent, we used sub-cellular localization prediction tools such as PSORTb, PSLpred and CELLO. CELLO (version 2.0) is a two-level support vector machine based system whose standard datasets comprise 1444 and 7589 protein sequences for the prediction of bacterial and eukaryotic protein localization, respectively. PSLpred is used only for predicting the sub-cellular localization of proteins from Gram-negative bacteria. We used SignalP 4.1 for predicting signal peptides and SecretomeP for identifying protein involvement in non-classical secretory pathways. TMHMM and HMMTOP were used for predicting the propensity of a protein to be a membrane protein. The sub-cellular localization predictions for the 429 HPs are listed in Table S2.
The first step towards predicting the function of a protein is generally a sequence similarity search in the available gene and protein databases. We used BLASTp and HHpred to search for similar sequences with known function. BLAST is a popular bioinformatics tool, most frequently used for calculating sequence similarity by performing local alignments. The BLASTp search against the non-redundant protein sequences (nr) database returns 100 homologs for each HP, and proteins with low query coverage (<50%) or low sequence identity (<20%) are excluded. Proteins showing high sequence identity (>40%) and a low e-value (<0.005) are referred to as close homologs of HPs, and those with low identity (<26%) are considered remote homologs. The hit with the best values of these parameters is considered the probable function of the given HP. BLASTp was also used to check the availability of structural homologs in the Protein Data Bank (PDB). HHpred, which uses pairwise comparison of profile hidden Markov models (HMMs) for remote protein homology detection by searching protein databases such as PDB, SCOP and CATH, was also used to detect structural homologs. We used BLASTp for determining the sequence identity between two protein sequences and PRALINE for multiple sequence comparison (Table S3).
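The BLASTp thresholds above translate directly into a small filter. The following is a sketch under stated assumptions: the `Hit` record and `classify_hit` function are hypothetical names, and the label "intermediate" for hits that are neither close nor remote is introduced here only because the text does not name that band.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    subject: str      # identifier of the database sequence
    identity: float   # percent sequence identity
    coverage: float   # percent query coverage
    evalue: float     # BLAST expectation value

def classify_hit(hit):
    """Apply the coverage, identity and e-value thresholds given in the text."""
    if hit.coverage < 50 or hit.identity < 20:
        return "excluded"
    if hit.identity > 40 and hit.evalue < 0.005:
        return "close"
    if hit.identity < 26:
        return "remote"
    # Assumption: the text leaves this band unnamed
    return "intermediate"

# Hypothetical hits for one HP
hits = [
    Hit("subject_1", 62.0, 95.0, 1e-40),
    Hit("subject_2", 23.0, 70.0, 2e-4),
    Hit("subject_3", 55.0, 35.0, 1e-10),
]
for hit in hits:
    print(hit.subject, classify_hit(hit))
```

Checking the exclusion criteria first mirrors the order in the text: a hit with high identity but poor query coverage is discarded before it can be labelled a close homolog.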
The tools used for precise functional assignment of all 429 HPs from H. influenzae are described in Table 1. The functional domains of a protein are predicted using various publicly available databases such as Pfam, SUPERFAMILY, CATH, PANTHER, SYSTERS, SVMProt, CDART, SMART and ProtoNet (Table S4). The SYSTERS database was used for clustering proteins on the basis of their functions. We used BLASTp to search the SYSTERS database, and the output is obtained in the form of clusters of functionally related proteins. Clusters with an e-value below 0.005 are considered a proper classification of the HP. SVMProt was used for SVM-based classification of proteins into 54 functional families from their primary sequences. The significance of a classification is measured by its R-value and P-value (%); classifications with R-value >2.0 and P-value >60% are considered significant. CDART and SMART were used for similarity searches based on domain architecture and profiles rather than on direct sequence similarity. The Simple Modular Architecture Research Tool (SMART) searches for similar domains in Swiss-Prot, SP-TrEMBL and stable Ensembl proteomes in normal mode. A search result with an e-value below 0.005 was considered a significant match for the given HP.
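The significance cutoffs quoted in this paragraph can be collected in one place, as in the sketch below. The dictionary layout and the function name `is_significant` are hypothetical; only the three tools for which the text states a cutoff are handled, and any other tool raises an error instead of receiving a guessed threshold.

```python
EVALUE_CUTOFF = 0.005   # stated for SYSTERS clusters and SMART matches
SVMPROT_MIN_R = 2.0     # stated R-value threshold
SVMPROT_MIN_P = 60.0    # stated P-value threshold, in percent

def is_significant(hit):
    """Return True when a domain hit passes the cutoff stated for its tool."""
    tool = hit["tool"]
    if tool == "SVMProt":
        return hit["r_value"] > SVMPROT_MIN_R and hit["p_value"] > SVMPROT_MIN_P
    if tool in ("SYSTERS", "SMART"):
        return hit["evalue"] < EVALUE_CUTOFF
    raise ValueError(f"no cutoff stated for tool {tool!r}")

# Hypothetical hits for one HP
hits = [
    {"tool": "SVMProt", "r_value": 2.4, "p_value": 71.0},
    {"tool": "SMART", "evalue": 0.02},
    {"tool": "SYSTERS", "evalue": 1e-8},
]
print([h["tool"] for h in hits if is_significant(h)])
```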
Similarly, PANTHER is a comprehensively organized database of protein families, trees and subfamilies, used to derive evolutionary relationships from which the functions of HPs are inferred. An HMM-based search is performed on the PANTHER database for functional annotation of HPs, and important hits with e-value greater than 1e-3 are reported in the output. The ProtoNet (version 6.0) tree provides an automatic hierarchical clustering of the protein sequences. The “Classify your protein” option in ProtoNet is used to assign a biological function to HPs.
Protein sequence motifs are signatures of protein families and can often be used as tools for the prediction of protein function, particularly in enzymes, in which motifs are associated with catalytic functions. For motif discovery we used InterProScan, which combines the different protein signature recognition methods of the InterPro consortium, an integration of several large databases including PANTHER, Pfam, SMART, ProSite and SUPERFAMILY. The output generated by InterProScan comprises the checksum of the protein sequence, which is supposed to be unique; the e-value of the match, which should be less than 0.005; and the status of the match, either true (T) or unknown (?), which indicates the reliability of the result. MOTIF and the MEME suite were used to perform motif-sequence database searching and assignment of function. The MOTIF tool generates a very large set of output; to identify the probable function of an HP, we check whether the fold predicted for the HP by the SCOP database is also present in the functional annotations generated by MOTIF. For motif discovery using the MEME suite, we first group the protein sequences of the HPs into clusters using the CLUSS online server and then submit the clustered sequences to the MEME suite server. By default, the MEME suite server identifies three motif sites in the clustered HPs. The MAST module of the MEME suite then performs database searching to assign functions to the motifs discovered in the HPs.
Virulence factors analysis
Virulence factors (VFs) are described as potent targets for developing drugs because they are essential for the severity of infection. To identify these VFs we used VICMpred and VirulentPred. Both are SVM-based methods that predict bacterial VFs from protein sequences, with accuracies of 70.75% and 81.8%, respectively. Both methods use a five-fold cross-validation technique for the evaluation of various prediction strategies.
Functional protein association networks
The function and activity of a protein are often modulated by other proteins with which it interacts. Therefore, an understanding of protein-protein interactions provides valuable information for predicting the function of a protein. We used STRING (version 9.05) to predict the protein interaction partners of HPs. The interactions include direct (physical) and indirect (functional) associations, derived from experiments or co-expression. STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms wherever applicable.
The statistical estimation of diagnostic accuracy is considered an important step towards validating the predicted outcome of the adopted pipeline. There are various conventional methods for comparing the accuracy of predicted models, but ROC analysis is an extensively used method for analyzing and comparing diagnostic accuracy and provides the most comprehensive description of diagnostic accuracy available to date. We used six levels at which diagnostic efficacy can be evaluated. The two binary numerals “0” and “1” are used to classify a prediction as a true positive (“1”) or true negative (“0”). The integers 2, 3, 4 and 5 are used as the confidence rating for each case. The ROC analysis was carried out for the sequences of 100 proteins with known function from H. influenzae. We applied the in silico pipeline described above to predict the functions of these known proteins using various online bioinformatics tools. We then classified the predicted function of each protein against its already known function (Table S5 and S6). The classification results were submitted to “ROC Analysis: Web-based Calculator for ROC Curves” in the “format 1” form required by the software. This online software automatically calculates the ROC curve from the submitted data and reports the accuracy, sensitivity, specificity and ROC area. These parameters are used to validate the predicted functions of the HPs. The average accuracy of the pipeline is 96.25% (Table S7), which indicates that the functional annotations of the HPs are reliable and can be used in further experimental research.
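To make the reported quantities concrete, the sketch below computes accuracy, sensitivity, specificity and an ROC area from paired truth labels and confidence ratings. It is not the web calculator used in the study, which may fit the curve differently; here the area is the empirical, rank-based estimate (the fraction of positive-negative pairs in which the positive case receives the higher rating, with ties counted as one half). Reading the 0/1 numeral as the truth label, the function names and the example data are all assumptions made for illustration.

```python
def roc_area(truth, ratings):
    """Empirical ROC area: chance that a positive case outranks a negative one."""
    pos = [r for t, r in zip(truth, ratings) if t == 1]
    neg = [r for t, r in zip(truth, ratings) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def metrics_at_cutoff(truth, ratings, cutoff):
    """Accuracy, sensitivity and specificity when rating >= cutoff means 'positive'.

    Assumes both classes are present, otherwise a division by zero occurs.
    """
    pairs = list(zip(truth, ratings))
    tp = sum(1 for t, r in pairs if t == 1 and r >= cutoff)
    fn = sum(1 for t, r in pairs if t == 1 and r < cutoff)
    tn = sum(1 for t, r in pairs if t == 0 and r < cutoff)
    fp = sum(1 for t, r in pairs if t == 0 and r >= cutoff)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Invented example: 1 = prediction matches the known function, ratings on the 2-5 scale
truth = [1, 1, 1, 0, 0]
ratings = [5, 4, 2, 3, 2]
print(roc_area(truth, ratings))
print(metrics_at_cutoff(truth, ratings, cutoff=3))
```

Sensitivity and specificity depend on the chosen cutoff, whereas the ROC area summarizes performance over all cutoffs, which is why both kinds of figure are usually reported together.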
Results and Discussion
We have extensively analyzed the sequences of 429 HPs using BLAST, Pfam, PANTHER, CATH, CDART and SVMProt. Tools such as InterProScan, MOTIF and the MEME suite were used to discover functional motifs in the HPs. We assigned a proposed function to each of the 429 HPs present in H. influenzae (Table S3 and Table S4) and discovered motifs in 420 HPs using the MEME suite on 208 clusters predicted by the CLUSS online software tool (Table S8). Of these, 296 HPs are characterized with high confidence and are listed in Table 2; the less confidently annotated proteins are listed in Table S9. All sequence analyses were compiled. Among the HPs of H. influenzae we observed 139 enzymes, 57 transporters, 32 binding proteins, 21 bacteriophage-related proteins and 15 lipoproteins; the rest are involved in various cellular processes such as transcription, translation and replication (Figure 2). These analyses suggest a possible role for HPs in the development and pathogenesis of the organism, and the identified groups are described separately below.
The chart shows that, among the 429 HPs from H. influenzae, 41% are enzymes, 20% are proteins involved in transport, 12% are binding proteins, 7% are bacteriophage-related proteins and the rest are proteins involved in cellular processes such as transcription, translation and replication.
Enzymes produced by bacteria are key players in the survival of the organism in its host: they provide nutrients for growth and are responsible for pathogenesis, because they modify the local environment to favor growth inside the host and metabolize compounds inside the host. We characterized 139 enzymes. Knowledge of these enzymes is also important for understanding the host-pathogen interaction.
We identified 14 oxidoreductase enzymes, which are critically important for bacterial virulence and pathogenesis. It is well understood that disulfide bonds are important for the stability and/or structural rigidity of many extracellular proteins, including bacterial virulence factors. Bond formation is catalyzed by thiol-disulfide oxidoreductases (TDORs). The oxidoreductase SdbA, for example, is required for disulfide bond formation in S. gordonii, which in turn is required for autolytic activity. Protein P45154 contains a 2Fe-2S ferredoxin-type domain. Many bacteria produce protein antibiotics known as bacteriocins to kill competing strains of the same or closely related bacterial species. We identified protein P44743 as a radical SAM (S-adenosylmethionine) protein; radical SAM proteins are understood to play a significant role in the pathogenesis of an organism, and inhibition of these enzymes has been validated as effective in preventing lethal diseases.
Similarly, we identified 39 transferase enzymes, a class required for efficient spore germination and full virulence in bacteria such as Bacillus anthracis. Transferase enzymes are essential for the biosynthesis of lipoproteins, and bacterial lipoproteins play an important role in bacterial virulence. Proteins Q57022, P44064 and P45180 are glycosyl transferases; mutation of such enzymes affects extracellular polysaccharide (EPS) and lipopolysaccharide (LPS) biosynthesis and cell motility, and reduces the development of disease symptoms. We characterized protein P44256 as DNA polymerase IV; virulent strains have been observed to show a higher level of DNA polymerase activity than non-virulent strains, indicating a role in virulence.
Protein Q57544 was found to be a β-lactamase, the enzyme responsible for resistance against β-lactam antibiotics such as penicillins and cephalosporins. We annotated 56 hydrolase enzymes, a class with an established role in bacterial virulence; e.g., Kdo hydrolase is the main cause of virulence in Francisella tularensis, which is classified as a bioterrorism agent. Similarly, the nudix hydrolase encoded by the nudA gene in Bacillus anthracis is important for complete virulence.
There are 8 lyase enzymes. These are important for the virulence of a pathogen in its host. Protein P44717 is a cystathionine β-lyase, an enzyme that forms the cystathionine intermediate in cysteine biosynthesis and may be considered a target for pyridiamine anti-microbial agents. Similarly, isocitrate lyase is an enzyme of the glyoxylate cycle that catalyzes the cleavage of isocitrate to succinate and glyoxylate; together with malate synthase, it bypasses the two decarboxylation steps of the TCA cycle. The glyoxylate cycle is found to be up-regulated during pathogenesis, and this pathway is therefore used by bacteria, fungi, etc., for survival in their hosts.
Isomerase enzymes catalyze changes within one molecule by structural rearrangement, and isomerases such as peptidylprolyl cis/trans isomerases (PPIases) are involved in protein folding. These isomerases are considered surface-exposed proteins that are important for virulence and resistance to NaCl. We identified 13 isomerases and 5 ligases in the group of 139 enzymes. Ligase enzymes also contribute to virulence in hosts. E3 ligase activity has been found to be associated with the C-terminal region of XopL, a type III effector, which specifically interacts with plant E2 ubiquitin-conjugating enzymes to induce plant cell death and subvert plant immunity. There are also 4 HPs with kinase activity; kinases play a significant role in growth, differentiation, metabolism and apoptosis in response to external and internal stimuli. Thus, such enzymes are important for the survival of the pathogen and may serve as targets for drug design and discovery.
Transport processes play a pivotal role in cellular metabolism, e.g., in the uptake of nutrients or the excretion of metabolic waste products. We predicted 50 transporters, 3 carriers, 3 receptors and 1 signal transduction protein among the HPs. It has recently been shown that these proteins may be involved in virulence and are essential for the intracellular survival of pathogens. Protein P44691 was predicted to be a member of the ABC 3 transporter family, whose members are presumably involved in virulence because they are associated with the uptake of metal ions such as iron, zinc and manganese. This protein also helps in the attachment of pathogenic bacteria to the mucosal surfaces of host cells, which is a critical step in bacterial pathogenesis, and therefore presents itself as a putative drug target.
We identified proteins P44005 and P45280 as SNARE-associated Golgi proteins. The soluble N-ethylmaleimide-sensitive factor attachment protein receptor (SNARE) proteins play an essential role in compartment fusion in eukaryotic cells. They share a conserved motif, known as the SNARE motif, and have been classified as glutamine-containing SNAREs (Q-SNAREs) and arginine-containing SNAREs (R-SNAREs) on the basis of the conserved residue at the center of this motif. These proteins are central regulators of membrane fusion, so they are potential targets for intracellular organisms, which frequently rely on destabilizing host intracellular traffic. This finding suggests that, by mimicking SNAREs, some inclusion proteins can control intracellular trafficking.
Bacteriocin proteins contain an N-terminal domain with extensive resemblance to a [2Fe-2S] plant ferredoxin and a C-terminal colicin M-like catalytic domain. To gain entry into vulnerable cells, these proteins parasitize an existing iron uptake pathway by using a ferredoxin-containing receptor binding domain. Protein Q57133 is a transferrin-binding protein. Transferrins are a group of non-haem iron-binding glycoproteins, widely distributed in the physiological fluids and cells of vertebrates. These proteins are involved in iron transport within the circulatory system of vertebrates. Transferrins are important for bacterial virulence, but their role in virulence is still not fully understood. Membrane transferrin receptor-mediated endocytosis is a major route of cellular iron uptake, and the efficient cellular uptake of the transferrin pathway has shown potential for the delivery of anticancer drugs, proteins and therapeutic genes, primarily into proliferating malignant cells that overexpress transferrin receptors.
Thirty-two HPs are annotated as binding proteins, of which 15 are DNA-binding, 5 RNA-binding, 9 metal-binding and 3 ATP/coenzyme-binding proteins. In many HPs we identified a tetratricopeptide repeat (TPR), a structural motif involved in the assembly of various multi-protein complexes. TPR-containing proteins often play important roles in cell processes and are involved in virulence-associated functions.
HPs that function as DNA-binding proteins also contribute to virulence. The winged helix-turn-helix (wHTH) motif in sarZ proteins of Staphylococcus aureus contributes to virulence by binding to the cvf gene, which encodes alpha hemolysin. In the complex regulatory system of group A Streptococcus (GAS), the streptococcal regulator of virulence (Srv) is a member of the CRP/FNR family of transcriptional regulators, and members of this family possess a characteristic C-terminal helix-turn-helix (HTH) motif that facilitates binding to DNA targets. Point mutations in this motif alter the protein-DNA interaction, indicating that DNA-binding motifs are regulatory factors in bacterial virulence. RNA-binding proteins also contribute to the survival of the organism and control the virulence factors of pathogens.
Lipoproteins identified in bacteria are formed by lipid modification of proteins, which facilitates the anchoring of hydrophilic proteins to hydrophobic surfaces through hydrophobic interactions between the attached acyl groups and the cell wall phospholipids. This process has considerable significance in many cellular and virulence phenomena. We found 15 lipoproteins in the group of HPs; lipoproteins play crucial roles in adhesion to host cells, modulation of inflammatory processes and translocation of virulence factors into host cells. It has also been found that lipoproteins may function as vaccines. Knowledge of these facts may be used to generate novel countermeasures to bacterial diseases.
Structural motifs such as the helix-turn-helix are conserved in various organisms. Detection of these common patterns in a sequence indicates that the protein is mainly involved in the regulation of transcription. Transcription regulators such as HilC and HilD also show DNA-binding activity and contribute to the virulence of Salmonella enterica, where they are involved in the invasion of host cells. We found 18 transcriptional regulatory, 3 translation regulatory, 1 replication regulatory and 3 cell cycle regulatory enzymes/proteins. The regulatory protein RfaH is found in E. coli and enhances the expression of different factors that are thought to play a role in bacterial virulence. Furthermore, inactivation of rfaH decreases the virulence of a uropathogenic E. coli strain. Similarly, the RNA-binding protein Hfq has emerged as an important regulatory factor in a variety of physiological processes, including stress resistance and virulence, in various Gram-negative bacteria such as E. coli. Hfq modulates the stability or translation of mRNAs and interacts with numerous small regulatory RNAs. The cell cycle-related protein P44063 is involved in lipopolysaccharide biosynthesis and is important for understanding the virulence of H. influenzae, as proteins involved in this biosynthesis are considered primary virulence factors.
We used the consensus of VICMpred and VirulentPred to predict virulence factors among the 429 HPs and found 40 HPs that give a positive virulence score in both servers and can be used as potent targets for drug design. These are listed in Table 3. In this group of virulent proteins, we observed that protein P43936 is a PemK superfamily toxin of the ChpB-ChpS toxin-antitoxin system, which is involved in plasmid maintenance. We have also identified 30 bacteriophage-related proteins among the HPs. SuMu protein 1a, a bacteriophage-related protein, is known to show homology to IgA metalloproteinase and IgA1 protease, which are described as virulence factors in non-typeable H. influenzae. SuMu proteins are therefore considered highly virulent proteins.
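The consensus step amounts to keeping only the proteins that score positive in both predictors. A minimal sketch, assuming each server's output has already been parsed into a mapping from protein identifier to score (the function name, identifiers and scores below are placeholders, not values from Table 3):

```python
def consensus_virulent(vicmpred_scores, virulentpred_scores):
    """Return identifiers that have a positive score in both predictors."""
    shared = vicmpred_scores.keys() & virulentpred_scores.keys()
    return sorted(
        acc for acc in shared
        if vicmpred_scores[acc] > 0 and virulentpred_scores[acc] > 0
    )

# Placeholder identifiers and scores, for illustration only
vicmpred = {"HP_A": 1.2, "HP_B": -0.4, "HP_C": 0.7}
virulentpred = {"HP_A": 0.9, "HP_B": 1.1, "HP_C": -0.2, "HP_D": 1.5}
print(consensus_virulent(vicmpred, virulentpred))
```

Requiring agreement between two independently trained predictors trades some sensitivity for fewer false positives, which suits a shortlist of candidate drug targets.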
Using an innovative in silico approach, we have analyzed all 429 HPs from H. influenzae. Using ROC analysis and confidence-level measurements of the predicted results, we predicted the function of 296 HPs with high confidence and characterized them. We did not find enough evidence for the functional prediction of 124 proteins, and these sequences therefore require further analysis. The predictions of sub-cellular localization and physicochemical parameters are useful in distinguishing HPs with transporter activity from the rest of the proteins. Protein-protein interaction analysis also helps to reveal the involvement of such proteins in various metabolic pathways. Further, we were able to detect 40 virulence proteins essential for the survival of the pathogen, in particular protein Q57523, which shows the highest virulence score in VICMpred and is therefore regarded as the most virulent HP among the listed virulence proteins. Our results could facilitate the development of drugs or vaccines that specifically target the pathogen's systems without causing allergic or other side effects in the host. This in silico approach for functional annotation of HPs can be further used in drug discovery to characterize putative drug targets in other clinically important pathogens.
List of physicochemical parameters of 429 HPs from H. influenzae predicted by Expasy's ProtParam tool.
List of predicted sub-cellular localization of 429 HPs from H. influenzae.
List of annotated functions of 429 HPs from H. influenzae using BLASTp, STRING, SMART, INTERPROSCAN and MOTIF.
List of functionally annotated domains of 429 HPs from H. influenzae by CATH, SUPERFAMILY, PANTHER, Pfam, SYSTERS, CDART, SVMProt and ProtoNet.
List of annotated functions of 100 proteins with known function from H. influenzae using BLASTp, SMART, INTERPROSCAN and MOTIF for ROC analysis.
List of functionally annotated domains of 100 proteins with known function from H. influenzae by CATH, SUPERFAMILY, PANTHER, Pfam, SYSTERS, CDART, SVMProt and ProtoNet for ROC analysis.
List of accuracy, sensitivity, specificity and ROC area of various bioinformatics tools used for predicting function of HPs from H. influenzae obtained after ROC analysis.
List of clusters formed by the CLUSS online tool and of the motif sites and sequences predicted by MEME Suite in 429 HPs from H. influenzae.
Conceived and designed the experiments: MS MIH. Performed the experiments: MS MIH. Analyzed the data: MS MIH FA. Contributed reagents/materials/analysis tools: MS MIH FA. Wrote the paper: MS MIH FA. Validated the data: FA MIH. Maintained workstations: MS MIH.
- 1. Sethi S, Murphy TF (2001) Bacterial infection in chronic obstructive pulmonary disease in 2000: a state-of-the-art review. Clin Microbiol Rev 14: 336–363.
- 2. Murphy TF, Sethi S (1992) Bacterial infection in chronic obstructive pulmonary disease. Am Rev Respir Dis 146: 1067–1083.
- 3. Ball P (1996) Infective pathogenesis and outcomes in chronic bronchitis. Curr Opin Pulm Med 2: 181–185.
- 4. Cash P, Argo E, Langford PR, Kroll JS (1997) Development of a Haemophilus two-dimensional protein database. Electrophoresis 18: 1472–1482.
- 5. Evers S, Di Padova K, Meyer M, Fountoulakis M, Keck W, et al. (1998) Strategies towards a better understanding of antibiotic action: folate pathway inhibition in Haemophilus influenzae as an example. Electrophoresis 19: 1980–1988.
- 6. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
- 7. Wong SM, Akerley BJ (2008) Identification and analysis of essential genes in Haemophilus influenzae. Methods Mol Biol 416: 27–44.
- 8. Edwards JS, Palsson BO (1999) Systems properties of the Haemophilus influenzae Rd metabolic genotype. J Biol Chem 274: 17410–17416.
- 9. Papin JA, Price ND, Edwards JS, Palsson BB (2002) The genome-scale metabolic extreme pathway structure in Haemophilus influenzae shows significant network redundancy. J Theor Biol 215: 67–82.
- 10. Schilling CH, Palsson BO (2000) Assessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome-scale pathway analysis. J Theor Biol 203: 249–283.
- 11. Akerley BJ, Rubin EJ, Novick VL, Amaya K, Judson N, et al. (2002) A genome-scale analysis for identification of genes required for growth or survival of Haemophilus influenzae. Proc Natl Acad Sci U S A 99: 966–971.
- 12. Herbert MA, Hayes S, Deadman ME, Tang CM, Hood DW, et al. (2002) Signature Tagged Mutagenesis of Haemophilus influenzae identifies genes required for in vivo survival. Microb Pathog 33: 211–223.
- 13. Doerks T, von Mering C, Bork P (2004) Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. Nucleic Acids Res 32: 6321–6326.
- 14. Hawkins T, Kihara D (2007) Function prediction of uncharacterized proteins. J Bioinform Comput Biol 5: 1–30.
- 15. Galperin MY, Koonin EV (2004) ‘Conserved hypothetical’ proteins: prioritization of targets for experimental study. Nucleic Acids Res 32: 5452–5463.
- 16. Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, et al. (2009) Protein function annotation by homology-based inference. Genome Biol 10: 207.
- 17. Nimrod G, Schushan M, Steinberg DM, Ben-Tal N (2008) Detection of functionally important regions in “hypothetical proteins” of known structure. Structure 16: 1755–1763.
- 18. Hassan MI, Kumar V, Somvanshi RK, Dey S, Singh TP, et al. (2007) Structure-guided design of peptidic ligand for human prostate specific antigen. J Pept Sci 13: 849–855.
- 19. Hassan MI, Kumar V, Singh TP, Yadav S (2007) Structural model of human PSA: a target for prostate cancer therapy. Chem Biol Drug Des 70: 261–267.
- 20. Thakur PK, Kumar J, Ray D, Anjum F, Hassan MI (2013) Search of potential inhibitor against New Delhi metallo-beta-lactamase 1 from a series of antibacterial natural compounds. J Nat Sci Biol Med 4: 51–56.
- 21. Minion FC, Lefkowitz EJ, Madsen ML, Cleary BJ, Swartzell SM, et al. (2004) The genome sequence of Mycoplasma hyopneumoniae strain 232, the agent of swine mycoplasmosis. J Bacteriol 186: 7123–7133.
- 22. Lubec G, Afjehi-Sadat L, Yang JW, John JP (2005) Searching for hypothetical proteins: theory and practice based upon original data and literature. Prog Neurobiol 77: 90–127.
- 23. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
- 24. Rost B, Valencia A (1996) Pitfalls of protein sequence analysis. Curr Opin Biotechnol 7: 457–461.
- 25. Kanehisa M (1997) Linking databases and organisms: GenomeNet resources in Japan. Trends Biochem Sci 22: 442–444.
- 26. Sigrist CJ, Cerutti L, de Castro E, Langendijk Genevaux PS, Bulliard V, et al. (2010) PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 38: D161–166.
- 27. Attwood TK (2002) The PRINTS database: a resource for identification of protein families. Brief Bioinform 3: 252–263.
- 28. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al. (2012) The Pfam protein families database. Nucleic Acids Res 40: D290–301.
- 29. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, et al. (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 33: D212–215.
- 30. Henikoff JG, Henikoff S (1996) Blocks database and its applications. Methods Enzymol 266: 88–105.
- 31. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, et al. (2011) InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 40: D306–312.
- 32. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, et al. (2005) InterProScan: protein domains identifier. Nucleic Acids Res 33: W116–120.
- 33. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37: W202–208.
- 34. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, et al. (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 39: D561–568.
- 35. Metz CE (1978) Basic principles of ROC analysis. Semin Nucl Med 8: 283–298.
- 36. Anandakumar S, Shanmughavel P (2008) Computational Annotation for Hypothetical Proteins of Mycobacterium Tuberculosis. Journal of Computer Science & Systems Biology 1: 50–62.
- 37. Garg A, Gupta D (2008) VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens. BMC Bioinformatics 9: 62.
- 38. Saha S, Raghava GP (2006) VICMpred: an SVM-based method for the prediction of functional proteins of Gram-negative bacteria using amino acid patterns and composition. Genomics Proteomics Bioinformatics 4: 42–47.
- 39. Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, et al. (2003) ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 31: 3784–3788.
- 40. Gill SC, von Hippel PH (1989) Calculation of protein extinction coefficients from amino acid sequence data. Anal Biochem 182: 319–326.
- 41. Guruprasad K, Reddy BV, Pandit MW (1990) Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng 4: 155–161.
- 42. Ikai A (1980) Thermostability and aliphatic index of globular proteins. J Biochem 88: 1895–1898.
- 43. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157: 105–132.
- 44. Vetrivel U, Subramanian G, Dorairaj S A novel in silico approach to identify potential therapeutic targets in human bacterial pathogens. Hugo J 5: 25–34.
- 45. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32: D115–119.
- 46. Yu NY, Wagner JR, Laird MR, Melli G, Rey S, et al. (2010) PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 26: 1608–1615.
- 47. Bhasin M, Garg A, Raghava GP (2005) PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics 21: 2522–2524.
- 48. Yu CS, Chen YC, Lu CH, Hwang JK (2006) Prediction of protein subcellular localization. Proteins 64: 643–651.
- 49. Yu CS, Lin CJ, Hwang JK (2004) Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci 13: 1402–1406.
- 50. Emanuelsson O, Brunak S, von Heijne G, Nielsen H (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2: 953–971.
- 51. Bendtsen JD, Kiemer L, Fausboll A, Brunak S (2005) Non-classical protein secretion in bacteria. BMC Microbiol 5: 58.
- 52. Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567–580.
- 53. Tusnady GE, Simon I (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849–850.
- 54. Soding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33: W244–248.
- 55. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, et al. (1977) The Protein Data Bank. A computer-based archival file for macromolecular structures. Eur J Biochem 80: 319–324.
- 56. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, et al. (1978) The Protein Data Bank: a computer-based archival file for macromolecular structures. Arch Biochem Biophys 185: 584–591.
- 57. Hubbard TJ, Ailey B, Brenner SE, Murzin AG, Chothia C (1999) SCOP: a Structural Classification of Proteins database. Nucleic Acids Res 27: 254–256.
- 58. Sillitoe I, Cuff AL, Dessailly BH, Dawson NL, Furnham N, et al. (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41: D490–498.
- 59. Simossis VA, Heringa J (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 33: W289–294.
- 60. Gough J, Karplus K, Hughey R, Chothia C (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 313: 903–919.
- 61. Mi H, Muruganujan A, Casagrande JT, Thomas PD (2013) Large-scale gene function analysis with the PANTHER classification system. Nat Protoc 8: 1551–1566.
- 62. Meinel T, Krause A, Luz H, Vingron M, Staub E (2005) The SYSTERS Protein Family Database in 2005. Nucleic Acids Res 33: D226–229.
- 63. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ (2003) SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 31: 3692–3697.
- 64. Geer LY, Domrachev M, Lipman DJ, Bryant SH (2002) CDART: protein homology by domain architecture. Genome Res 12: 1619–1623.
- 65. Letunic I, Doerks T, Bork P (2012) SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res 40: D302–305.
- 66. Rappoport N, Karsenty S, Stern A, Linial N, Linial M (2012) ProtoNet 6.0: organizing 10 million protein sequences in a compact hierarchical family tree. Nucleic Acids Res 40: D313–320.
- 67. Gasteiger E, Jung E, Bairoch A (2001) SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol 3: 47–55.
- 68. Bairoch A, Apweiler R (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28: 45–48.
- 69. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, et al. (2002) The Ensembl genome database project. Nucleic Acids Res 30: 38–41.
- 70. Kelil A, Wang S, Brzezinski R (2008) CLUSS2: an alignment-independent algorithm for clustering protein families with multiple biological functions. Int J Comput Biol Drug Des 1: 122–140.
- 71. Kelil A, Wang S, Brzezinski R, Fleury A (2007) CLUSS: clustering of protein sequences based on a new similarity measure. BMC Bioinformatics 8: 286.
- 72. Baron C, Coombes B (2007) Targeting bacterial secretion systems: benefits of disarmament in the microcosm. Infect Disord Drug Targets 7: 19–27.
- 73. Zou KH, Warfield SK, Fielding JR, Tempany CM, William MW 3rd, et al. (2003) Statistical validation based on parametric receiver operating characteristic analysis of continuous classification data. Acad Radiol 10: 1359–1368.
- 74. Swets JA, Dawes RM, Monahan J (2000) Better decisions through science. Sci Am 283: 82–87.
- 75. Eng J (2013) ROC analysis: web-based calculator for ROC curves. Baltimore, Maryland, USA: Johns Hopkins University.
- 76. Bjornson HS (1984) Enzymes associated with the survival and virulence of gram-negative anaerobes. Rev Infect Dis 6 Suppl 1: S21–24.
- 77. Davey L, Ng CK, Halperin SA, Lee SF (2013) Functional analysis of paralogous thiol-disulfide oxidoreductases in Streptococcus gordonii. J Biol Chem 288: 16416–16429.
- 78. Parveen N, Cornell KA (2011) Methylthioadenosine/S-adenosylhomocysteine nucleosidase, a critical enzyme for bacterial metabolism. Mol Microbiol 79: 7–20.
- 79. Okugawa S, Moayeri M, Pomerantsev AP, Sastalla I, Crown D, et al. (2012) Lipoprotein biosynthesis by prolipoprotein diacylglyceryl transferase is required for efficient spore germination and full virulence of Bacillus anthracis. Mol Microbiol 83: 96–109.
- 80. McQuiston JR, Vemulapalli R, Inzana TJ, Schurig GG, Sriranganathan N, et al. (1999) Genetic characterization of a Tn5-disrupted glycosyltransferase gene homolog in Brucella abortus and its effect on lipopolysaccharide composition and virulence. Infect Immun 67: 3830–3835.
- 81. Li Q, Zhang Y, Sheng Y, Huo R, Sun B, et al. (2012) Large T-antigen up-regulates Kv4.3 K(+) channels through Sp1, and Kv4.3 K(+) channels contribute to cell apoptosis and necrosis through activation of calcium/calmodulin-dependent protein kinase II. Biochem J 441: 859–867.
- 82. Makioka A, Ohtomo H (1995) An increased DNA polymerase activity associated with virulence of Toxoplasma gondii. J Parasitol 81: 1021–1022.
- 83. Poole K (2004) Resistance to beta-lactam antibiotics. Cell Mol Life Sci 61: 2200–2223.
- 84. Okan NA, Chalabaev S, Kim TH, Fink A, Ross RA, et al. (2013) Kdo hydrolase is required for Francisella tularensis virulence and evasion of TLR2-mediated innate immunity. MBio 4: e00638–00612.
- 85. Edelstein PH, Hu B, Shinzato T, Edelstein MA, Xu W, et al. (2005) Legionella pneumophila NudA Is a Nudix hydrolase and virulence factor. Infect Immun 73: 6567–6576.
- 86. Ejim LJ, D'Costa VM, Elowe NH, Loredo Osti JC, Malo D, et al. (2004) Cystathionine beta-lyase is important for virulence of Salmonella enterica serovar Typhimurium. Infect Immun 72: 3310–3314.
- 87. Dunn MF, Ramirez Trujillo JA, Hernandez Lucas I (2009) Major roles of isocitrate lyase and malate synthase in bacterial and fungal pathogenesis. Microbiology 155: 3166–3175.
- 88. Reffuveille F, Connil N, Sanguinetti M, Posteraro B, Chevalier S, et al. (2012) Involvement of peptidylprolyl cis/trans isomerases in Enterococcus faecalis virulence. Infect Immun 80: 1728–1735.
- 89. Huang J, Huang Q, Zhou X, Shen MM, Yen A, et al. (2004) The poxvirus p28 virulence factor is an E3 ubiquitin ligase. J Biol Chem 279: 54110–54116.
- 90. Engh RA, Bossemeyer D (2002) Structural aspects of protein kinase control-role of conformational flexibility. Pharmacol Ther 93: 99–111.
- 91. Stephenson K, Hoch JA (2002) Histidine kinase-mediated signal transduction systems of pathogenic microorganisms as targets for therapeutic intervention. Curr Drug Targets Infect Disord 2: 235–246.
- 92. Freeman ZN, Dorus S, Waterfield NR (2013) The KdpD/KdpE two-component system: integrating K(+) homeostasis and virulence. PLoS Pathog 9: e1003201.
- 93. Garmory HS, Titball RW (2004) ATP-binding cassette transporters are targets for the development of antibacterial vaccines and therapies. Infect Immun 72: 6757–6763.
- 94. Jahn R, Scheller RH (2006) SNAREs – engines for membrane fusion. Nat Rev Mol Cell Biol 7: 631–643.
- 95. Fasshauer D, Sutton RB, Brunger AT, Jahn R (1998) Conserved structural features of the synaptic fusion complex: SNARE proteins reclassified as Q- and R-SNAREs. Proc Natl Acad Sci U S A 95: 15781–15786.
- 96. Grinter R, Milner J, Walker D (2012) Ferredoxin containing bacteriocins suggest a novel mechanism of iron uptake in Pectobacterium spp. PLoS One 7: e33033.
- 97. Cheng Y, Zak O, Aisen P, Harrison SC, Walz T (2004) Structure of the human transferrin receptor-transferrin complex. Cell 116: 565–576.
- 98. Kratz F, Beyer U, Roth T, Tarasova N, Collery P, et al. (1998) Transferrin conjugates of doxorubicin: synthesis, characterization, cellular uptake, and in vitro efficacy. J Pharm Sci 87: 338–346.
- 99. Singh M (1999) Transferrin As A targeting ligand for liposomes and anticancer drugs. Curr Pharm Des 5: 443–451.
- 100. Kondo Y, Ohara N, Sato K, Yoshimura M, Yukitake H, et al. (2010) Tetratricopeptide repeat protein-associated proteins contribute to the virulence of Porphyromonas gingivalis. Infect Immun 78: 2846–2856.
- 101. Kaito C, Morishita D, Matsumoto Y, Kurokawa K, Sekimizu K (2006) Novel DNA binding protein SarZ contributes to virulence in Staphylococcus aureus. Mol Microbiol 62: 1601–1617.
- 102. Doern CD, Holder RC, Reid SD (2008) Point mutations within the streptococcal regulator of virulence (Srv) alter protein-DNA interactions and Srv function. Microbiology 154: 1998–2007.
- 103. Ariyachet C, Solis NV, Liu Y, Prasadarao NV, Filler SG, et al. (2013) SR-like RNA-binding protein Slr1 affects Candida albicans filamentation and virulence. Infect Immun 81: 1267–1276.
- 104. Kovacs Simon A, Titball RW, Michell SL (2011) Lipoproteins of bacterial pathogens. Infect Immun 79: 548–561.
- 105. Olekhnovich IN, Kadner RJ (2002) DNA-binding activities of the HilC and HilD virulence regulatory proteins of Salmonella enterica serovar Typhimurium. J Bacteriol 184: 4148–4160.
- 106. Nagy G, Dobrindt U, Schneider G, Khan AS, Hacker J, et al. (2002) Loss of regulatory protein RfaH attenuates virulence of uropathogenic Escherichia coli. Infect Immun 70: 4406–4413.
- 107. Christiansen JK, Larsen MH, Ingmer H, Sogaard-Andersen L, Kallipolitis BH (2004) The RNA-binding protein Hfq of Listeria monocytogenes: role in stress tolerance and virulence. J Bacteriol 186: 3355–3362.
- 108. Wang L, Vinogradov EV, Bogdanove AJ (2013) Requirement of the lipopolysaccharide O-chain biosynthesis gene wxocB for type III secretion and virulence of Xanthomonas oryzae pv. oryzicola. J Bacteriol 195: 1959–1969.
- 109. Bukowski M, Lyzen R, Helbin WM, Bonar E, Szalewska-Palasz A, et al. (2012) A regulatory role for Staphylococcus aureus toxin-antitoxin system PemIKSa. Nat Commun 4: 2012.
- 110. Zehr ES, Tabatabai LB, Bayles DO (2012) Genomic and proteomic characterization of SuMu, a Mu-like bacteriophage infecting Haemophilus parasuis. BMC Genomics 13: 331.