Infections caused by Salmonella enterica, a Gram-negative facultative anaerobic bacterium belonging to the family Enterobacteriaceae, are major threats to the health of humans and animals. The recent availability of complete genome data for pathogenic strains of S. enterica opens new avenues for the identification of drug targets and drug candidates. We used genomic and metabolic pathway data to identify pathways and proteins that are essential to the pathogen and absent from the host.
We took the whole-proteome sequence data of 42 strains of S. enterica and of Homo sapiens, along with KEGG-annotated metabolic pathway data; clustered the protein sequences using CD-HIT; identified essential genes using the DEG database; discarded S. enterica homologs of human proteins in unique metabolic pathways (UMPs); and characterized hypothetical proteins with SVM-Prot and InterProScan. Through this core proteomic analysis we identified enzymes essential to the pathogen.
The identification of 73 enzymes common to 42 strains of S. enterica is the main strength of the current study. We propose all 73 unexplored enzymes as potential drug targets against infections caused by S. enterica. The study is comprehensive with respect to S. enterica in that it simultaneously considers every completely sequenced pathogenic strain. This comprehensiveness makes the study significant since, to the best of our knowledge, it is the first subtractive core proteomic analysis of unique metabolic pathways applied to any pathogen for the identification of drug targets. We applied extensive computational methods to shortlist a few potential drug targets according to druggability criteria, i.e. being non-homologous to the human host, essential to the pathogen, and playing a significant role in essential metabolic pathways of the pathogen. Subtractive proteomics was applied through a novel approach, i.e. by considering only proteins of the unique metabolic pathways of the pathogen and mining the proteomic data of all completely sequenced strains, thus improving the quality and applicability of the results. We believe that sharing the knowledge from this study will eventually lead to novel therapeutic regimens against infections caused by S. enterica.
Citation: Uddin R, Sufian M (2016) Core Proteomic Analysis of Unique Metabolic Pathways of Salmonella enterica for the Identification of Potential Drug Targets. PLoS ONE 11(1): e0146796. https://doi.org/10.1371/journal.pone.0146796
Editor: Dipshikha Chakravortty, Indian Institute of Science, INDIA
Received: August 6, 2015; Accepted: December 21, 2015; Published: January 22, 2016
Copyright: © 2016 Uddin, Sufian. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The study was supported by International Foundation for Science (IFS) grant# F/5378-1. The authors would also like to gratefully acknowledge the Higher Education Commission of Pakistan for providing fellowship during the study.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: KEGG, Kyoto Encyclopedia of Genes and Genomes; CD-HIT, Cluster Database at High Identity with Tolerance; DEG, Database of Essential Genes; UMP, Unique Metabolic Pathway; SVM, Support Vector Machine; KO, KEGG Orthology; FTP, File Transfer Protocol; NCBI-GI, National Center for Biotechnology Information GenInfo Identifier; COP, Cluster of Proteins; API, Application Programming Interface; BLAST, Basic Local Alignment Search Tool; BLASTp, Protein-Protein BLAST; VFDB, Virulence Factors Database; PDB, Protein Data Bank; SE, Salmonella enterica
Salmonella enterica is a Gram-negative facultative anaerobic intracellular bacterium. According to the Kauffmann-White classification scheme, more than 2500 serological variants (serovars) have been categorized into six subspecies [2, 3]. Most serovars have a broad host range, while some have adapted to specific hosts. The mechanism of adaptation is currently unclear. Typically, S. enterica serovars infect the host by the oral route, leading to one of three major syndromes (enterocolitis, bacteremia and enteric fever) or to asymptomatic chronic carriage. Human pathogens include serovars Typhi, Paratyphi, Typhimurium, Sendai, Choleraesuis, Dublin and many others.
Pathogenesis of Salmonella enterica begins with its entry into the host organism. Salmonella is usually acquired from the environment by contact with a carrier host or by oral intake of contaminated food or water. After ingestion, Salmonella survives the low pH of the stomach and enters the intestine, where it uses a type III secretion system to deliver effector proteins essential for intestinal invasion. From this point, bacterial progression within the host differs between non-typhoidal and typhoidal Salmonella. Non-typhoidal Salmonella serovars induce a localized inflammation which, in immunocompetent persons, results in enterocolitis with infiltration of polymorphonuclear leukocytes (PMNs) into the sub-mucosal epithelium. In typhoidal Salmonella, intestinal inflammation is moderate and consists largely of macrophage infiltration; the bacteria disseminate and reach the blood either directly, via the mesenteric lymph nodes, or by transport within leukocytes, causing bacteremia. Both types of Salmonella grow and persist in systemic tissues, where they adapt to the intracellular environment. The pathogen can escape from host cells using secretion systems.
A genome is the set of genes in a single functional organism, whereas the pangenome of a prokaryote is the set of non-redundant genes across its strains. It comprises a core genome of genes present in all strains, dispensable genes that are absent from one or more strains but not all, and genes that are unique to a single strain. Recently, microbial pangenomics has attracted the attention of the scientific community, driven by the availability of whole-genome sequence data for multiple strains of particular species [12–15]. In parallel, research on pan-proteomics was initiated to study similarities and differences at the protein level among the strains of a species [16–18]. As of October 13, 2015, only 45 target genes were reported in the DrugBank database for S. enterica, which covers only 1.6% of its core genome of 2,800 genes. Since the pathogen has developed resistance against conventional drugs, there is a dire need to find new therapeutic drug targets.
In the present study, we took the whole-proteome sequence data of 42 strains from 19 serovars of S. enterica and the KEGG-annotated metabolic pathway data of Homo sapiens, identified and discarded S. enterica homologs of human proteins in unique metabolic pathways (UMPs), and identified enzymes essential to the pathogen using the DEG database. We compared our results to a previous study that searched for new antimicrobial targets by focusing on different metabolic enzymes of a single serovar and comparing the results with other serovars at the genome level. In a more recent report, a pangenomic analysis of 22 complete and 23 draft genome sequences was performed. However, to the best of our knowledge, the current study is the first subtractive core proteomic analysis of unique metabolic pathways applied to any pathogen for the identification of drug targets, primarily essential enzymes.
1. Identification of UMPs of S. enterica
KEGG Brite hierarchy files for H. sapiens and the 42 strains of S. enterica, containing information about the genes of the respective metabolic pathways, were downloaded from the KEGG database. The metabolic pathways unique to the serovars (i.e. missing in the human host) were identified using KEGG Orthology (KO) IDs, and the corresponding genes were extracted. The UMPs absent in some strains were listed using in-house AWK scripts.
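The pathway comparison described here amounts to a set difference over pathway identifiers. The following is a minimal Python sketch of that idea, not the in-house AWK scripts used in the study. It assumes the pathway identifiers have already been extracted from the Brite hierarchy files; the function name `unique_pathways` and the toy identifiers in the usage note are illustrative.

```python
def unique_pathways(pathogen_by_strain, host_pathways):
    """Return (UMPs per strain, UMPs missing from individual strains).

    pathogen_by_strain: dict mapping a strain name to an iterable of pathway IDs
    host_pathways: iterable of pathway IDs present in the host
    """
    host = set(host_pathways)
    # A pathway is "unique" to the pathogen if the host lacks it.
    umps = {strain: set(ids) - host for strain, ids in pathogen_by_strain.items()}
    # Pool the UMPs of all strains, then report what each strain lacks.
    all_umps = set().union(*umps.values())
    missing = {
        strain: sorted(all_umps - found)
        for strain, found in umps.items()
        if all_umps - found
    }
    return umps, missing
```

For example, with two hypothetical strains `{"A": ["00010", "00550", "00791"], "B": ["00010", "00550"]}` and a host that has only `"00010"`, strain A keeps `{"00550", "00791"}` as UMPs and strain B is reported as lacking `"00791"`.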
2. Clustering common proteins of UMPs of 42 strains
The KEGG IDs of all genes from the UMPs were converted to the corresponding NCBI GIs using the KEGG API service. Amino acid sequences were retrieved for the respective strains from the NCBI FTP server using Fastblast. Genes encoding tRNA and rRNA were excluded, since the aim was to propose enzymes as drug targets. Furthermore, plasmid-encoded genes were not considered to be essential for the survival of the cell, as per the information available in the Database of Essential Genes (DEG). We noticed that some NCBI GIs had been discontinued and replaced by new GIs; we linked the new GIs to the old ones and retrieved the sequences. CD-HIT is a standalone command-line application that groups the sequences of a database on the basis of sequence identity. Orthologs within the 42 strains were identified by using CD-HIT (updated on August 27, 2012) to group protein sequences with at least 80% sequence identity into Clusters of Proteins (COPs), so that each COP could be processed as a unit in the subsequent steps of subtractive proteomics. The results were verified by comparison with the ElimDupes online server.
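To illustrate how COPs covering every strain can be picked out of a clustering result, the sketch below parses the `.clstr` text file that CD-HIT writes next to its output and keeps the clusters whose members span all strains. It is an assumption-laden example rather than the authors' pipeline: the helper names `parse_clstr` and `core_clusters` are hypothetical, and so is the `strain_of` mapping from sequence identifier to strain, which would have to be built from the downloaded proteomes.

```python
import re
from collections import defaultdict

# Member lines look like: "0<TAB>120aa, >seqA... *" or "1<TAB>118aa, >seqB... at 91.50%"
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Map cluster number -> list of member sequence IDs from CD-HIT .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def core_clusters(clusters, strain_of, n_strains):
    """Keep clusters whose members come from all n_strains distinct strains."""
    return {
        cid: members
        for cid, members in clusters.items()
        if len({strain_of[m] for m in members}) == n_strains
    }
```

With an 80% identity threshold the `.clstr` file would typically come from a run with `-c 0.8`; in the study's setting `n_strains` would be 42.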
3. Searching of non-homologous essential enzymes
To process all COPs for subtractive proteomic analysis at once, a novel strategy comprising two approaches was applied. In the first approach, the proteins of all COPs were subjected to BLASTp against the Homo sapiens proteome downloaded from the NCBI FTP server, and the output was analyzed for non-homologous proteins. In the second approach, 3 of the 42 strains were selected at random and the proteins of those strains were subjected to BLASTp against the human proteome. Both approaches are illustrated in S1 Fig. The BLASTp parameters are given in Table 1 (a). The results of both approaches were parsed with the BioPerl module SearchIO, and the better approach in terms of processing time was adopted for the next steps. The non-homologous COPs from the previous step were subjected to BLASTp against DEG version 10 to identify essential genes of the pathogen. The parameters are given in Table 1 (b). The KEGG Brite hierarchy, one of the important features of the KEGG server, contains information on the enzymes of metabolic pathways. The enzymes were extracted from the non-homologous essential COPs of S. enterica using the hierarchy files of the 42 strains.
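The homology filter can be sketched as follows. This example reads BLAST tabular output (the 12-column `-outfmt 6` layout, in which column 11 is the E-value) instead of using BioPerl SearchIO as in the study, and it treats a COP as non-homologous when none of its tested members has a hit at or below the cutoff. The function names are hypothetical, and the cutoff must be supplied by the caller because the Table 1 parameters are not reproduced here.

```python
def queries_with_hits(tabular_lines, max_evalue):
    """Return the query IDs that have at least one hit with E-value <= max_evalue."""
    hits = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column tabular record
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def non_homologous_cops(cop_members, hits):
    """Return the COPs none of whose tested members matched a host protein."""
    return sorted(
        cop for cop, members in cop_members.items()
        if not hits.intersection(members)
    )
```

The same two functions cover both approaches described above: passing every member of each COP corresponds to the first approach, and passing only the members from a few randomly chosen strains corresponds to the second.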
4. Searching the virulent genes
VFDB (Virulence Factors Database), containing protein sequences of all virulent genes, was downloaded, and the non-homologous COPs from 3 randomly selected strains were subjected to standalone BLASTp against the VFDB sequences to find virulent genes with a sequence identity of 70% or more. The parameters are given in Table 1 (c).
5. Characterization of the hypothetical proteins
The hypothetical proteins among the enzymes were identified in order to characterize their structure and/or function. All hypothetical protein sequences were subjected to standalone BLASTp against the protein sequences available in the PDB (Protein Data Bank), obtained from the PDB FTP server. The parameters are given in Table 1 (d). Queries with significant hits against the PDB were verified against the CD-HIT output, and those with ‘no hits’ were subjected to SVM-Prot and InterProScan version 4.0 for protein family prediction. The results were manually cross-checked with the CD-HIT output.
6. Validation from the literature
The non-homologous catalytic proteins considered as putative drug targets were validated against the DrugBank database and the published results of Becker et al. To do so, the gene symbols of the essential enzymes were expanded to their full names using the DAVID Bioinformatics tool and then searched manually in both sources.
Results and Discussion
1. Identification of UMPs of S. enterica
The metabolic pathways of each of the 42 strains of S. enterica were compared with the complete set of human metabolic pathways. On average, each strain has 117 metabolic pathways and at least 34 UMPs (Table 2), with all UMPs present in almost all strains. A heatmap showing the percentage presence of proteins in each pathway, and the pathways totally absent in individual strains, is given in Fig 2; the corresponding quantitative data are provided in S2 Table. Among the studied strains, only Typhi P-stx-12 was predicted to metabolize atrazine and thus may be resistant to it. However, the dataset lacked pathway information for β-lactam resistance and bisphenol degradation, which were the next most frequently absent pathways among the studied strains. The KEGG data for strains Heidelberg CFSAN002069 and Typhi CT18 were out of date; hence 22 and 11 NCBI GIs, respectively, were appended for these strains and are listed in S3 Table.
The heatmap shows the percentage presence and absence of genes in each metabolic pathway of the 42 strains of S. enterica.
2. Clustering common proteins of UMPs of 42 strains and searching of non-homologous essential enzymes
CD-HIT produced 537 COPs, each comprising more than one protein. Of these, 241 COPs contained at least 42 proteins belonging to the 42 strains of S. enterica. S4 Table lists the NCBI GIs of the orthologous proteins (genes) clustered in groups.
The complete human proteome was obtained from the NCBI FTP server (details in S1 Table). Proteins that are non-homologous to the host could be potential drug targets with a reduced risk of side effects or cross-reactivity of the drug with host proteins. It is therefore essential to assess the similarity of the shortlisted sequences to human proteins. To do so, we compared each COP with the individual human proteins using two separate approaches (details in the methods section). As stated earlier, each COP consists of proteins sharing at least 80% sequence identity; therefore, comparing (i) every single entry of the COPs or (ii) a few randomly selected entries of the COPs with the human proteins should give the same outcome. We used both approaches to test whether this holds. Both approaches gave exactly the same result, i.e. 198 out of 241 COPs were identified as non-homologous to humans (Table 3). The second approach was selected for the further steps of subtractive proteomics, as it was accurate and relatively fast. The COP names in Table 3 were assigned by the authors according to the most common name within the respective cluster. During the tabulation of the data (Table 3), we observed that some COPs shared exactly the same or closely similar names even though the member proteins of the different COPs showed low similarity to one another.
These COPs include Cytochrome BD-II Ubiquinol Oxidase (COP # 139 and 221), D-alanyl-D-alanine Carboxypeptidase (COP # 127 and 190), Lipopolysaccharide core biosynthesis protein (COP # 250, 339 and 384), Peptidoglycan Synthetase FtsI (COP # 65 and 67), PTS system Ascorbate-specific transporter IIC (COP # 129 and 164), Transcriptional regulator (COP # 17 and 167), Tricarboxylate transport membrane protein (COP # 109 and 476), Two component response regulator (COP # 378, 410 and 411) and Type III Secretion apparatus protein SpaR (COP # 341 and 344). From the similarly named COPs, we randomly selected a few proteins and subjected them to online BLASTp, which showed low similarity in each case. There are two possible explanations: either these sets of COPs represent isozymes, or human error occurred during GenBank submission. For instance, the proteins with NCBI GIs 194443076 and 194443845 share only 29% identity by BLASTp, although both have the same name and belong to the same strain. The beta subunits of subtypes 1 and 2 of the enzyme nitrate reductase shared more than 80% sequence similarity and were hence clustered in a single COP. The enzyme Succinate Dehydrogenase Cytochrome b556 large membrane was not characterized as an enzyme during the KEGG analysis; hence its UniProt ID is given in Table 3.
Additionally, we searched for essential and virulent genes among the 198 COPs by applying the same subtractive proteomics approach. The Database of Essential Genes (DEG) is a well-curated open-access database of essential genes from various organisms, ranging from single-celled prokaryotes to multicellular eukaryotes. Bacteria harbor various virulent genes that lead to pathogenicity; identifying virulence factors in the genome could therefore help elucidate the molecular mechanism of bacterial pathogenicity. The VFDB is an online resource containing information about virulent genes present in various microorganisms. Similar results were obtained from the 3 randomly selected strains: 138 out of 198 COPs were predicted by DEG to be essential for the bacteria (Table 3), and 42 out of 198 COPs were identified as virulent genes (Table 3). There were 73 enzymes among the 138 non-homologous essential COPs (Table 3). The NCBI GIs of each COP are presented in S5 Table. S1 Text contains important information regarding the accessibility of the NCBI GIs listed in S5 Table. The data illustrated in the pie chart in Fig 3 and tabulated in Table 4 reveal that the largest share of the targets (34%) belongs to the subclass ‘phosphoryl transferases’ or ‘kinases’, which are among the most favorable targets in drug discovery research.
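The successive filtering reported in this section (241 core COPs, 198 non-homologous, 138 essential, 73 enzymes) can be viewed as repeated set intersection. The sketch below shows that view in Python; `subtract` and `retention` are illustrative helper names, and the only figures used are the counts quoted in the text.

```python
def subtract(candidates, *filters):
    """Intersect candidates with each filter set in turn.

    Returns the surviving set and the candidate count after every step.
    """
    current = set(candidates)
    counts = [len(current)]
    for keep in filters:
        current &= set(keep)
        counts.append(len(current))
    return current, counts


def retention(counts):
    """Fraction of candidates retained at each step, rounded to 3 decimals."""
    return [round(after / before, 3) for before, after in zip(counts, counts[1:])]


# Counts reported in the text: core COPs, non-homologous, essential, enzymes.
print(retention([241, 198, 138, 73]))
```

Applied to the reported counts, roughly 82% of the core COPs pass the non-homology filter, about 70% of those are predicted essential, and about 53% of the essential COPs are enzymes.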
The pie chart reveals that 63% of the enzyme targets belong to Transferase class which is subdivided into phosphoryl (34%), glycosyl (19%) and other (10%) transferases.
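A class breakdown of this kind can be tallied directly from Enzyme Commission (EC) numbers, whose first digit gives the top-level class (2 for transferases). The sketch below is a hedged illustration only: the function name `class_shares` and the toy EC numbers in the usage note are assumptions, since the Table 4 data are not reproduced here.

```python
from collections import Counter

# Top-level EC classes keyed by the first field of an EC number.
EC_CLASSES = {
    "1": "oxidoreductases",
    "2": "transferases",
    "3": "hydrolases",
    "4": "lyases",
    "5": "isomerases",
    "6": "ligases",
    "7": "translocases",
}


def class_shares(ec_numbers):
    """Return the percentage of enzymes in each top-level EC class."""
    counts = Counter(
        EC_CLASSES.get(ec.split(".")[0], "unclassified") for ec in ec_numbers
    )
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}
```

For a toy list such as `["2.7.1.1", "2.4.1.1", "3.5.1.1", "2.7.13.3"]`, three of the four entries fall in the transferase class. Grouping on the first two fields instead (for example `2.7` or `2.4`) would give a subclass breakdown like the one shown in the pie chart.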
3. Characterization of the hypothetical proteins
Hypothetical proteins are those for which sequences are available but whose family and functional classification has not been established. As such, they may represent unidentified drug targets [38, 39]. Computational methods (e.g. Blast2GO, HMMscan, KEGG Automatic Annotation Server (KAAS), ProtParam server, PSORTb, SVMProt, etc.) are effective in annotating the functional and family classes of the large number of hypothetical sequences present in bacterial genomes [40–42]. The functional classification may help predict the mechanism of the metabolic pathway in which the protein is involved. To characterize the hypothetical proteins among the shortlisted COPs, we first determined how many proteins were hypothetical. There were 3,105 proteins in the 73 COPs, of which 114 were hypothetical (Table 5). The identifier details of these 3,105 enzymes are provided in S6 Table.
Next, we performed a BLASTp search using the 114 hypothetical sequences as ‘query’ and the sequences of the PDB as ‘database’, so that any homology to well-characterized PDB entries could help classify the hypothetical proteins. BLASTp returned hits for 81 queries, while the remaining 33 queries showed no hits (Table 5). The names of the hits obtained for the 81 queries were manually matched with the corresponding 24 COPs. The 33 queries with no similarity in the PDB were subjected to the bioinformatics tools SVM-Prot and InterProScan. The results for the 33 ‘no hits’ were confirmed by matching their names with the respective COPs. All results were consistent with the output of CD-HIT clustering.
4. Validation from the literature
A similar study was performed by Becker et al. using experimental techniques, so we compared it with our results obtained by the in silico approach. We also searched DrugBank for any existing drug targets against Salmonella. DrugBank reported 19 drug targets for S. enterica, of which 11 belonged to humans and the remaining 8 to the bacteria. The oxygen-insensitive NADPH nitroreductase was common to only 35 strains. Another five did not belong to UMPs. Only one of the 8 genes (i.e. Penicillin-binding protein) was present in the output of the current strategy. The results are summarized in Table 6. Becker and coworkers reported 155 essential enzymes for S. enterica serovar Typhimurium strain LT2 and compared them with various strains of S. enterica in an extensive experimental study. We compared our 73 identified enzymes with the results of Becker et al. and observed that 24 enzymes were shared (Table 3). Furthermore, the enzyme CheA (Chemotaxis Protein, COP # 49) was found to be essential in the current study, while Becker et al. suggested it to be non-essential. This discrepancy may be due to recent updates in the DEG.
We have performed an extensive computational analysis of S. enterica at the level of the core proteome to identify new potential drug targets. Subtractive proteomics was applied through a novel approach, i.e. by considering only proteins of the unique metabolic pathways of the pathogen and mining the proteomic data of all completely sequenced strains, thus improving the quality and applicability of the results. We identified 73 enzymes that are common to 42 strains of S. enterica, belong to unique metabolic pathways, are essential for pathogen survival and have no human homologs. These four characteristics suggest that the enzymes are potential drug targets and should be tested experimentally. We compared them to experimental data [Becker et al.], showing that 24 out of the 73 (~33%) enzymes are current drug targets. The remaining 49 enzymes are new potential drug targets. We have annotated the function of 114 hypothetical proteins unique to S. enterica, providing additional new potential drug targets. Finally, our organization of the available core proteomic data (available in S2, S4, S5 and S6 Tables) into different categories, e.g. clusters, organism codes, NCBI RefSeq IDs, etc., provides a basis for further studies.
S1 Fig. Strategy for subtractive proteomic analysis
S1 Table. Details of downloaded biological datasets
S2 Table. Number of Genes present in Unique Metabolic Pathways of 42 strains of S. enterica
S3 Table. Discontinued and Updated NCBI GIs of Heidelberg CFSAN002069 and Typhi CT18
S4 Table. Cluster of Proteins (COPs) formed using CD-HIT
S5 Table. Non-homologous Essential Enzymes of S. enterica 42 strains as drug targets
S6 Table. Protein Identifiers and Names of 73 COPs
The authors gratefully acknowledge the Higher Education Commission of Pakistan for providing a fellowship during the study.
Conceived and designed the experiments: RU MS. Performed the experiments: RU MS. Analyzed the data: RU MS. Contributed reagents/materials/analysis tools: RU MS. Wrote the paper: RU MS.
- 1. Popoff MY, Bockemuhl J, Gheesling LL. Supplement 2001 (no. 45) to the Kauffmann-White scheme. Research in microbiology. 2003;154(3):173–4. Epub 2003/04/23. pmid:12706505.
- 2. Betancor L, Yim L, Martinez A, Fookes M, Sasias S, Schelotto F, et al. Genomic Comparison of the Closely Related Salmonella enterica Serovars Enteritidis and Dublin. The open microbiology journal. 2012;6:5–13. Epub 2012/03/01. pmid:22371816; PubMed Central PMCID: PMCPmc3282883.
- 3. Coburn B, Grassl GA, Finlay BB. Salmonella, the host and disease: a brief review. Immunology and cell biology. 2007;85(2):112–8. Epub 2006/12/06. pmid:17146467.
- 4. Sun JS, Hahn TW. Comparative proteomic analysis of Salmonella enterica serovars Enteritidis, Typhimurium and Gallinarum. The Journal of veterinary medical science / the Japanese Society of Veterinary Science. 2012;74(3):285–91. Epub 2011/10/15. pmid:21997235.
- 5. Fierer J, Guiney DG. Diverse virulence traits underlying different clinical outcomes of Salmonella infection. The Journal of clinical investigation. 2001;107(7):775–80. Epub 2001/04/04. pmid:11285291; PubMed Central PMCID: PMCPmc199580.
- 6. Patel JC, Galan JE. Manipulation of the host actin cytoskeleton by Salmonella—all in the name of entry. Current opinion in microbiology. 2005;8(1):10–5. Epub 2005/02/08. pmid:15694851.
- 7. Haraga A, Ohlson MB, Miller SI. Salmonellae interplay with host cells. Nature reviews Microbiology. 2008;6(1):53–66. Epub 2007/11/21. pmid:18026123.
- 8. Wangdi T, Winter SE, Baumler AJ. Typhoid fever: "you can't hit what you can't see". Gut microbes. 2012;3(2):88–92. Epub 2011/12/14. pmid:22156762; PubMed Central PMCID: PMCPmc3370952.
- 9. Carter PB, Collins FM. The route of enteric infection in normal mice. The Journal of experimental medicine. 1974;139(5):1189–203. Epub 1974/05/01. pmid:4596512; PubMed Central PMCID: PMCPmc2139651.
- 10. Mastroeni P, Grant A. Dynamics of spread of Salmonella enterica in the systemic compartment. Microbes and infection / Institut Pasteur. 2013;15(13):849–57. Epub 2013/11/05. pmid:24183878.
- 11. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences of the United States of America. 2005;102(39):13950–5. Epub 2005/09/21. pmid:16172379; PubMed Central PMCID: PMCPmc1216834.
- 12. Deng X, Phillippy AM, Li Z, Salzberg SL, Zhang W. Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification. BMC genomics. 2010;11:500. Epub 2010/09/18. pmid:20846431; PubMed Central PMCID: PMCPmc2996996.
- 13. Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, Angiuoli SV, et al. Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome biology. 2010;11(10):R107. Epub 2010/11/03. pmid:21034474; PubMed Central PMCID: PMCPmc3218663.
- 14. Hao P, Zheng H, Yu Y, Ding G, Gu W, Chen S, et al. Complete sequencing and pan-genomic analysis of Lactobacillus delbrueckii subsp. bulgaricus reveal its genetic basis for industrial yogurt production. PloS one. 2011;6(1):e15964. Epub 2011/01/26. pmid:21264216; PubMed Central PMCID: PMCPmc3022021.
- 15. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, Gajer P, et al. The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. Journal of bacteriology. 2008;190(20):6881–93. Epub 2008/08/05. pmid:18676672; PubMed Central PMCID: PMCPmc2566221.
- 16. Lilburn TG, Cai H, Gu J, editors. The Core and Pan-Genome of the Vibrionaceae. Bioinformatics, Systems Biology and Intelligent Computing, 2009 IJCBS'09 International Joint Conference on; 2009: IEEE.
- 17. Yang L, Tan J, O'Brien EJ, Monk JM, Kim D, Li HJ, et al. Systems biology definition of the core proteome of metabolism and expression is consistent with high-throughput data. Proceedings of the National Academy of Sciences of the United States of America. 2015;112(34):10810–5. pmid:26261351.
- 18. Zhang L, Xiao D, Pang B, Zhang Q, Zhou H, Zhang L, et al. The core proteome and pan proteome of Salmonella Paratyphi A epidemic strains. PloS one. 2014;9(2):e89197. Epub 2014/03/04. pmid:24586590; PubMed Central PMCID: PMCPmc3933413.
- 19. Jacobsen A, Hendriksen RS, Aarestrup FM, Ussery DW, Friis C. The Salmonella enterica pan-genome. Microbial ecology. 2011;62(3):487–504. Epub 2011/06/07. pmid:21643699; PubMed Central PMCID: PMCPmc3175032.
- 20. Becker D, Selbach M, Rollenhagen C, Ballmaier M, Meyer TF, Mann M, et al. Robust Salmonella metabolism limits possibilities for new antimicrobials. Nature. 2006;440(7082):303–7. Epub 2006/03/17. pmid:16541065.
- 21. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic acids research. 2014;42(Database issue):D199–205. Epub 2013/11/12. pmid:24214961.
- 22. NCBI. NCBI FTP server 2013 [cited 2013 December 21]. Available: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/.
- 23. Hallam S. Fast Blast 2013 [cited December, 2014]. Available: http://www.cmde.science.ubc.ca/hallam/fastblast.php.
- 24. Luo H, Lin Y, Gao F, Zhang CT, Zhang R. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic acids research. 2014;42(Database issue):D574–80. Epub 2013/11/19. pmid:24243843.
- 25. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics (Oxford, England). 2012;28(23):3150–2. Epub 2012/10/13. pmid:23060610; PubMed Central PMCID: PMCPmc3516142.
- 26. HCV-Sequence-Database. ElimDupes [December, 2014]. Available: http://hcv.lanl.gov/content/sequence/ELIMDUPES/elimdupes.html.
- 27. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC bioinformatics. 2009;10:421. Epub 2009/12/17. pmid:20003500; PubMed Central PMCID: PMCPmc2803857.
- 28. NCBI. NCBI FTP server 2014 [updated January 6; cited 2014 January 11]. Available: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/.
- 29. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome research. 2002;12(10):1611–8. Epub 2002/10/09. pmid:12368254; PubMed Central PMCID: PMCPmc187536.
- 30. Chen L, Xiong Z, Sun L, Yang J, Jin Q. VFDB 2012 update: toward the genetic diversity and molecular evolution of bacterial virulence factors. Nucleic acids research. 2012;40(Database issue):D641–5. Epub 2011/11/10. pmid:22067448; PubMed Central PMCID: PMCPmc3245122.
- 31. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, et al. The RCSB Protein Data Bank: new resources for research and education. Nucleic acids research. 2013;41(Database issue):D475–82. Epub 2012/11/30. pmid:23193259; PubMed Central PMCID: PMCPmc3531086.
- 32. PDB. RCSB PDB FTP server 2014 [updated October 1; cited 2014 January 18]. Available: ftp://ftp.wwpdb.org/pub/pdb/derived_data/.
- 33. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research. 2003;31(13):3692–7. Epub 2003/06/26. pmid:12824396; PubMed Central PMCID: PMCPmc169006.
- 34. McWilliam H, Li W, Uludag M, Squizzato S, Park YM, Buso N, et al. Analysis Tool Web Services from the EMBL-EBI. Nucleic acids research. 2013;41(Web Server issue):W597–600. Epub 2013/05/15. pmid:23671338; PubMed Central PMCID: PMCPmc3692137.
- 35. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, et al. DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic acids research. 2011;39(Database issue):D1035–41. Epub 2010/11/10. pmid:21059682; PubMed Central PMCID: PMCPmc3013709.
- 36. Huang da W, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, et al. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic acids research. 2007;35(Web Server issue):W169–75. Epub 2007/06/20. pmid:17576678; PubMed Central PMCID: PMCPmc1933169.
- 37. Cohen P, Alessi DR. Kinase drug discovery—what's next in the field? ACS chemical biology. 2013;8(1):96–104. Epub 2013/01/02. pmid:23276252; PubMed Central PMCID: PMCPmc4208300.
- 38. Teh BA, Choi SB, Musa N, Ling FL, Cun ST, Salleh AB, et al. Structure to function prediction of hypothetical protein KPN_00953 (Ycbk) from Klebsiella pneumoniae MGH 78578 highlights possible role in cell wall metabolism. BMC structural biology. 2014;14:7. Epub 2014/02/07. pmid:24499172; PubMed Central PMCID: PMCPmc3927764.
- 39. Naqvi AA, Shahbaaz M, Ahmad F, Hassan MI. Identification of Functional Candidates amongst Hypothetical Proteins of Treponema pallidum ssp. pallidum. PloS one. 2015;10(4):e0124177. Epub 2015/04/22. pmid:25894582; PubMed Central PMCID: PMCPmc4403809.
- 40. Ravooru N, Ganji S, Sathyanarayanan N, Nagendra HG. Insilico analysis of hypothetical proteins unveils putative metabolic pathways and essential genes in Leishmania donovani. Frontiers in genetics. 2014;5:291. Epub 2014/09/11. pmid:25206363; PubMed Central PMCID: PMCPmc4144268.
- 41. Shahbaaz M, Hassan MI, Ahmad F. Functional annotation of conserved hypothetical proteins from Haemophilus influenzae Rd KW20. PloS one. 2013;8(12):e84263. Epub 2014/01/07. pmid:24391926; PubMed Central PMCID: PMCPmc3877243.
- 42. Cui T, Zhang L, Wang X, He ZG. Uncovering new signaling proteins and potential drug targets through the interactome analysis of Mycobacterium tuberculosis. BMC genomics. 2009;10:118. Epub 2009/03/21. pmid:19298676; PubMed Central PMCID: PMCPmc2671525.