Identification of Functional Candidates amongst Hypothetical Proteins of Treponema pallidum ssp. pallidum

Syphilis is a globally occurring venereal disease that is propagated through sexual contact. Its causative agent, Treponema pallidum ssp. pallidum, a Gram-negative spirochaete, is an obligate human parasite. The genome of the T. pallidum ssp. pallidum SS14 strain (RefSeq NC_010741.1) encodes 1,027 proteins, of which 444 are hypothetical proteins (HPs), i.e., proteins of unknown function. Here, we performed functional annotation of the HPs of T. pallidum ssp. pallidum using various databases, domain architecture predictors, protein function annotators and clustering tools. We analyzed the sequences of all 444 HPs and predicted the function of 207 HPs with a high level of confidence; the functions of the remaining 237 HPs were predicted with lower confidence. The annotated group includes various enzymes, transporters and binding proteins that may facilitate the survival of the pathogen and may therefore be possible molecular targets. Our comprehensive analysis helps to understand the mechanism of pathogenesis and may point to novel therapeutic interventions.


Introduction
Treponema pallidum ssp. pallidum has been experimentally established as the cause of venereal syphilis, a sexually transmitted disease (STD) that occurs worldwide [1][2][3][4]. T. pallidum ssp. pallidum is a Gram-negative bacterium, classified as a member of the family Spirochaetaceae [5]. Syphilis is most frequently transmitted through sexual contact, which has led to the pandemic spread of the disease [6]. The primary effects of infection appear as skin lesions at the site of infection [4]. The secondary and tertiary stages of syphilis are considered potentially lethal because of the prevalence of the organism throughout the body of the host [7,8]. The burden of the disease is severe: the World Health Organization reported 12 million new cases of venereal syphilis in 1999, most of them in developing countries [4].
The SS14 strain of T. pallidum ssp. pallidum was first isolated from the skin lesion of a patient with secondary syphilis [2,9]. The genome sequence of T. pallidum ssp. pallidum is available in the NCBI database and contains 1,087 genes encoding 1,027 proteins. The functions of 444 of these proteins have not been experimentally determined so far, and these proteins are termed hypothetical proteins (HPs). A hypothetical protein is one predicted to be encoded by an identified open reading frame, but for which no protein product has been confirmed or characterized [10]. Nevertheless, HPs possibly play important roles in the survival of the pathogen, and hence in disease progression [10,11]. Because T. pallidum ssp. pallidum depends completely on a mammalian host for survival, it is very difficult to work with experimentally. Its genomic sequence therefore offers a wealth of basic information that can be analyzed further to extract useful knowledge [3]. Functions of HPs from several pathogenic organisms have already been reported using sequence- and structure-based methods [11][12][13][14].
In this study, we used the sequenced genome of T. pallidum ssp. pallidum to explore the functions of these HPs with high precision, using well optimized bioinformatics tools described elsewhere [15]. To predict the functions of HPs with high confidence, their sequences were retrieved from the NCBI and analyzed with various bioinformatics tools for the prediction of physicochemical properties, sub-cellular localization, sequence similarity, virulence factors, etc. Moreover, HPs may act as potential virulence factors, which can be predicted by bioinformatics tools and targeted further in structure-based rational drug design [16][17][18][19][20]. The predicted functions of HPs were further validated using receiver operating characteristic (ROC) analysis, a statistical technique that helps to assess the performance of the bioinformatics tools used. We believe that such analyses expand our knowledge of the functional roles of the HPs of T. pallidum ssp. pallidum and provide an opportunity to discover novel potential drug targets [21].

Materials and Methods
Here we used our well optimized series of tools for the functional annotation of HPs [11,15,22]. The HPs were obtained from the NCBI (http://www.ncbi.nlm.nih.gov/genome/741), and the sequences of all 444 HPs were retrieved in FASTA format from the UniProt database (http://www.uniprot.org/) using their primary accession numbers.

Analysis of physicochemical properties
Physicochemical parameters of all HPs were analyzed using ExPASy's ProtParam server (http://web.expasy.org/protparam/). This online server performs the theoretical calculation of various physicochemical parameters such as molecular mass, isoelectric point, extinction coefficient, instability index, aliphatic index and grand average of hydropathicity (GRAVY). The predicted properties of HPs are listed in the S1 Table.

Sub-cellular localization
A precise estimate of the sub-cellular localization of a protein (such as cytoplasm, periplasm, inner membrane, outer membrane or extracellular space) is helpful in predicting its function at the cellular level. Previous studies suggest that proteins present in the cytoplasm can serve as drug targets, while membrane proteins found on the surface are considered to be vaccine targets [23]. An array of online sub-cellular localization tools was used to predict the location of HPs in T. pallidum ssp. pallidum. PSORTb, CELLO (v2.5) and PSLpred are effective tools for predicting the sub-cellular localization of a protein. SignalP 4.1 was used to predict signal peptide cleavage sites. SecretomeP 2.0 was used to predict non-classical protein secretion, i.e., signal peptide independent secretion. TMHMM and HMMTOP were used to predict transmembrane helices, which helps in the identification of membrane proteins. Detailed information on sub-cellular localization is listed in the S2 Table.

Sequence comparisons
To search for known functional homologues of HPs, we performed sequence similarity searches using BLASTp against the non-redundant (nr) protein database. We also performed HMM-based similarity searches using HMMSCAN, a module of the HMMER server that searches for similar domains and families. It works as an interface for searching the Pfam, TIGRFAMs, Gene3D and Superfamily databases of protein families and domains. Results of the sequence comparisons are listed in the S3 Table.
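Two of the physicochemical parameters named in the analysis above have simple closed-form definitions, which the following minimal sketch illustrates. It assumes the standard definitions (GRAVY as the mean Kyte-Doolittle hydropathy per residue, and the aliphatic index as a weighted sum of the mole percentages of Ala, Val, Ile and Leu); it is not the ProtParam implementation, and the function names and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues.

    Raises KeyError for non-standard residue codes (e.g. X, B, Z).
    """
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence):
    """Aliphatic index from the mole percentages of Ala, Val, Ile and Leu."""
    seq = sequence.upper()
    n = len(seq)

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / n

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


if __name__ == "__main__":
    toy = "AVIL"  # hypothetical four-residue sequence, for illustration only
    print(round(gravy(toy), 3), round(aliphatic_index(toy), 1))
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value an overall hydrophilic one, which is why the parameter is often read alongside the transmembrane helix predictions.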

Domain and function assignment
Proteins are classified into families and superfamilies on the basis of their sequence, structure and function by various protein classification tools such as CATH and SCOP. Here, we used a variety of tools to predict the function of HPs. We used PANTHER, a database that groups proteins into families and subfamilies and provides GO-based function assignment. Furthermore, the Pfam database was used to predict the function of proteins based on sequence similarity. We also performed protein classification by clustering, using SYSTERS and ProtoNet. SYSTERS is a protein family database that uses BLASTp to search for similar sequences and provides clusters of proteins formed on the basis of functional similarity, whereas ProtoNet provides a hierarchical classification of proteins. The CDART tool was used to identify conserved domains in HPs by searching the query sequence against the Conserved Domain Database (CDD). We also analyzed HPs using the Simple Modular Architecture Research Tool (SMART), which predicts the function of a protein based on its domain architecture. The motif search in protein sequences was done using InterProScan, which searches various available databases for function prediction. Results of function prediction based on these tools are listed in the S4 Table.

Virulence factor analysis
Identification of bacterial virulence factors can help to understand the mechanism of pathogenesis and to search for potential therapeutic targets [23,24]. We used VICMpred [25] and VirulentPred [26] to identify HPs that may be responsible for virulence in T. pallidum ssp. pallidum. Virulent HPs from T. pallidum ssp. pallidum are listed in the S5 Table.

Prediction of protein interaction network
Functional association among proteins is necessary to complete any biological process; knowledge of protein-protein interactions is therefore also helpful for predicting the function of a protein. Here we used STRING (version 9.1) [27] to predict the proteins that interact with HPs, and hence the involvement of each HP in a particular metabolic process.

Performance assessment
The predicted functions of HPs from the genome of T. pallidum ssp. pallidum were validated using receiver operating characteristic (ROC) analysis. This statistical analysis was performed using 100 sequences of proteins with known function (S6 Table). The functions of these proteins were predicted using the same pipeline adopted for the annotation of the HPs. The diagnostic efficacy was evaluated at six levels. A true positive or true negative prediction was coded as the binary numeral "0" or "1", and the integers 1, 2, 3, 4 and 5 were used as confidence ratings. The average accuracy of the pipeline was found to be 93.91% (S8 Table). The ROC analysis indicates high reliability of the bioinformatics tools used here (S7 and S8 Tables).
The level of confidence of each prediction was assigned on the basis of the number of tools predicting a similar function. If four or more tools clearly indicated a similar function for a particular HP, the prediction was considered to have a high level of confidence. If the function was predicted by fewer than four tools, the HP was not included in Table 1; function predictions at a low level of confidence are provided separately in the S9 Table.
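The ROC analysis described above can be sketched in a few lines. The example below is a minimal illustration, not the procedure used to produce the S7 and S8 Tables: it assumes that each benchmark protein carries a binary label (here 1 for a correct prediction, 0 for an incorrect one, which is an assumed coding) and an integer confidence rating from 1 to 5, and it builds the ROC curve by thresholding the rating. The function names and the toy data are hypothetical.

```python
def roc_points(labels, ratings):
    """Return ROC points (false positive rate, true positive rate).

    labels  : 0/1 values, 1 meaning a correct prediction (assumed coding)
    ratings : integer confidence ratings, e.g. 1..5
    A prediction counts as "called positive" when rating >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(ratings), reverse=True):
        tp = sum(1 for y, r in zip(labels, ratings) if r >= threshold and y == 1)
        fp = sum(1 for y, r in zip(labels, ratings) if r >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    # Hypothetical toy benchmark, not data from the S6 Table.
    labels = [1, 1, 1, 0, 1, 0]
    ratings = [5, 4, 4, 3, 2, 1]
    print(auc(roc_points(labels, ratings)))
```

An area close to 1 means that high confidence ratings are given almost exclusively to correct predictions, whereas an area near 0.5 means the ratings carry no information about correctness.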
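The four-tool rule above can be expressed as a small consensus function. The sketch below is only an illustration under stated assumptions: it treats two predictions as "similar" when their labels match after trimming and lower-casing, whereas judging functional similarity in practice may require manual curation. The function name, the threshold constant and the example predictions are hypothetical.

```python
from collections import Counter

MIN_TOOLS_FOR_HIGH_CONFIDENCE = 4  # threshold stated in the text


def classify_confidence(tool_predictions, min_tools=MIN_TOOLS_FOR_HIGH_CONFIDENCE):
    """Return (consensus_function, supporting_tools, confidence_level).

    tool_predictions maps a tool name to its predicted function label,
    or to None / "" when the tool returned no prediction.
    """
    counts = Counter(
        label.strip().lower()
        for label in tool_predictions.values()
        if label and label.strip()
    )
    if not counts:
        return None, 0, "low"
    function, support = counts.most_common(1)[0]
    level = "high" if support >= min_tools else "low"
    return function, support, level


if __name__ == "__main__":
    # Hypothetical predictions for one HP; the labels are illustrative only.
    example = {
        "Pfam": "kinase",
        "PANTHER": "Kinase",
        "SMART": "kinase",
        "InterProScan": "kinase",
        "CDART": "transferase",
    }
    print(classify_confidence(example))
```

Keeping the threshold as a parameter makes it easy to see how the split between the high-confidence and low-confidence groups would shift under a stricter or looser rule.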

Results and Discussion
The genome of the SS14 strain was sequenced to high accuracy by Matejková et al. [2] in 2008 using an oligonucleotide array strategy, but errors were observed in key features such as start codons (alternate or otherwise) and stop codons (due to sequencing errors). The complete genome sequence of the TPA Mexico A strain was later reported by Pětrošová et al. [28] using the Illumina sequencing technique. A recent report on the resequencing of T. pallidum ssp. pallidum strains Nichols and SS14 identified errors in 11.5% of all annotated genes, which were subsequently corrected [29]. Hence, we assume that the genome sequence of T. pallidum ssp. pallidum currently available in the database is free from experimental sequencing errors. Extensive sequence analysis of all 444 HPs with the tools described above allowed us to assign functions to 207 HPs with high confidence (Table 1). We also predicted functions for 237 HPs with a low level of confidence (S9 Table). We annotated the functions of these HPs using protein classification databases such as CATH, Superfamily, Pfam, PANTHER and SYSTERS. Recent experimental studies of the T. pallidum ssp. pallidum genome (Nichols strain) provide solid evidence supporting most of the predictions of this work [30]. All of these studies were performed on the Nichols strain, which shows slight variations from the SS14 strain of T. pallidum ssp. pallidum [2]. Apart from slight variations in some regions, we found substantial agreement between the data provided by these studies and the functions predicted in the present work. We categorized the 207 HPs into functional classes comprising 83 enzymes, 58 binding proteins, 28 transporters, 21 proteins involved in various cellular processes such as regulatory mechanisms, and 17 proteins exhibiting miscellaneous functions (Fig 1). These functional classes are described below.

Enzymes
Enzymes play a vital role in many key biochemical processes. About 40% of the annotated HPs are enzymes. T. pallidum ssp. pallidum is an obligate parasite and therefore depends on the host for most of its nutritional requirements [4]. Enzymes may facilitate its survival in the host by carrying out various cellular processes that keep it viable during the course of infection.
We found six oxidoreductases among these HPs of T. pallidum ssp. pallidum. These enzymes presumably play an essential role in the pathogenesis. B2S298 (HP TPASS_0151) is NADH-quinone reductase (NQR2/RnfD) which regulates expression of virulence factors in Vibrio cholerae [31]. It is also involved in sodium translocation and electron transport [31]. Most of the oxidoreductases are involved in iron-sulphur cluster transport [31].
There are 27 HPs predicted to be transferases. Many members of this class are involved in lipid biosynthesis, RNA processing and other significant cellular processes, and may thus contribute to bacterial pathogenesis and virulence. There are various kinases, such as B2S2P4 (HP TPASS_0296), which take part in coenzyme A biosynthesis [32]. B2S1Z8 (HP TPASS_0050) is predicted to be a phosphoribosyl transferase. Members of the PRTase family are involved in DNA processing and nucleotide metabolism [33]. Titz et al. [30] reported a similar function for the TP0050 gene product in the Nichols strain of T. pallidum ssp. pallidum, which shows significant similarity to HP TPASS_0050. B2S2Q5 (HP TPASS_0307) is a PASTA domain containing protein; this domain is found in penicillin binding proteins and serine/threonine kinases and has a special affinity for β-lactam antibiotics [34]. McKevitt et al. [35], in their study of T. pallidum ssp. pallidum (Nichols strain) antigens, predicted TP0307 to be a conserved hypothetical protein. They also characterized TP0750 and TP0494 as conserved HPs [35]. In the present work, we have assigned functions to their homologues in the SS14 strain, i.e., HP TPASS_0750 (B2S3Y0) and HP TPASS_0494 (B2S389), as nicotinate-nucleotide adenylyltransferase and zinc ribbon domain containing protein, respectively. B2S389 (HP TPASS_0494) and B2S3H9 (HP TPASS_0592) exhibit DNA-directed polymerase activity, suggesting a role in bacterial pathogenesis through the facilitation of regulatory processes. B2S492 (HP TPASS_0860) is a HAMP domain containing protein; HAMP is a characteristic domain of signal transduction proteins and helps in signal conversion [36]. The third class of enzymes is the hydrolases, which account for more than 50% of all characterized enzymes.
The majority of the proteins in the hydrolase class are membrane-bound proteins involved in various significant processes such as transmembrane transport, metal ion binding and cell wall degradation, and are thus associated with various virulence factors. A number of proteins with peptidase activity contain the LysM domain, which is responsible for cell wall degradation in prokaryotes [37] and helps various transmembrane transporters to carry out their functions. There are six phosphohydrolases in this group. They contain the conserved HD motif, which is characteristic of signal transduction systems [38], and have metal ion binding properties [39]. We found that B2S4K0 (HP TPASS_0963) and B2S4K9 (HP TPASS_0972) exhibit antibiotic resistance capacity and are involved in macrolide antibiotic transport [40]. Titz et al. [30] predicted TP0936, a counterpart of HP TPASS_0963 in the Nichols strain, to be an ABC transporter and described its involvement in membrane biogenesis. We predicted HP TPASS_0444 (B2S340) to be a peptidoglycan-binding protein.
The homologue of HP TPASS_0444 in the Nichols strain (TP0444) is predicted to be a conserved HP in the above mentioned study. We have assigned a function to the homologue of TP0877 in the SS14 strain (HP TPASS_0877) as a glycoprotease; TP0877 was characterized as a conserved HP in the gene expression analysis of Smajs et al. [41].
Lyases also play a key role in bacterial pathogenesis, as they are involved in various biosynthetic processes. B2S3A6 (HP TPASS_0512) shows 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase activity and is involved in isoprenoid synthesis. It may be a potential drug target [42].

Transporters
Transporter proteins are involved in the transport of nutrients that are required for various metabolic processes, and hence for the survival of the organism. These proteins also facilitate the transfer of virulence factors and are directly involved in infection [43]. We found 28 proteins with transporter functions, possibly involved in the transport of metal ions, virulence factors and biosynthesis assembly proteins. Some of these HPs are members of the ABC transporter class. B2S3C6 (HP TPASS_0534) is a V-type ATP synthase (subunit C), which may be involved in ATP synthesis and hence in providing energy for various metabolic processes of the pathogen [44]. B2S3F9 (HP TPASS_0567) is an MgtE N-terminal domain containing protein and helps in magnesium transport [45]. McKevitt et al. and Smajs et al. characterized its counterpart (TP0567) as an HP in their experimental studies [35,41]. Similarly, B2S3G4 (HP TPASS_0580) is an FMN-binding domain protein, which is involved in the electron transfer pathway [46]. Titz et al. [30] predicted the corresponding gene product of the Nichols strain (TP0580) to be an ABC transporter, whereas Smajs et al. [41] characterized it as a conserved hypothetical integral membrane protein. B2S3L4 (HP TPASS_0625) is an outer membrane protein (OmpA) that works as a receptor for T-even like phages. It also acts as a porin with low permeability, allowing the penetration of small solutes [47]. B2S460 (HP TPASS_0826) is predicted to be a mechanosensitive ion channel, which allows the efflux of solvent and solutes from the cytoplasm and thus plays a significant role in the survival of the pathogen [48]. B2S478 (HP TPASS_0846) contains a major facilitator superfamily domain and represents a class of membrane transporters involved in the transport of sugars, amino acids, drugs, various metabolites and a variety of ions [49].
B2S4D8 (HP TPASS_0906) and B2S4M3 (HP TPASS_0986) are multidrug transporters and confer multiple drug resistance, thus keeping the pathogen viable in the presence of drugs [50]. A detailed understanding of the functional mechanism of all these transporters will help in the discovery of effective drugs against them.

Binding proteins
We have characterized 58 of the 207 functionally annotated HPs as binding proteins. We further divided these into 13 DNA binding, nine RNA binding, 31 protein binding, three ion binding and two adhesion proteins. The DNA and RNA binding proteins are involved in various cellular and regulatory processes such as transcription, translation and recombination, and thus play a vital role in the survival and propagation of the pathogen in the host. Of the 31 protein binding HPs, 29 are tetratricopeptide repeat (TPR) containing proteins. TPR containing proteins are involved in protein-protein interactions and various metabolic and regulatory processes, and thus play an important role in virulence [51]. B2S214 (HP TPASS_0066) and B2S215 (HP TPASS_0067) are TPR containing proteins. Titz et al. [30] predicted their homologues in the Nichols strain (TP0066 and TP0067) to be involved in DNA metabolism. Homologues of the proteins predicted here to contain tetratricopeptide repeats were characterized as HPs by the McKevitt and Smajs groups [35,41]. Proteins showing 100% similarity between the two strains may therefore be considered to exhibit similar functions in the Nichols strain, which lends experimental support to these predictions. We found that B2S2J3 (HP TPASS_0246) and B2S3Y9 (HP TPASS_0752) show similarity to von Willebrand factor with a type A domain, which has been associated with various blood disorders [52][53][54]. The presence of a type A domain makes these proteins likely to be involved in significant activities such as cell adhesion and immune defense [55]. Such HPs may therefore be possible therapeutic targets, because they may contribute to bacterial pathogenesis through cell adhesion and immune defense mechanisms.

Cellular processes/regulatory proteins
There are 21 HPs presumably involved in various cellular and regulatory mechanisms that are important for the pathogenesis of T. pallidum ssp. pallidum. Most of these proteins are involved in cell division, chromosome segregation and condensation, sporulation and intercellular signaling, or are flagellar proteins involved in transport activity. These proteins may also be important for bacterial pathogenesis and can be treated as possible drug targets [56]. B2S2P5 (HP TPASS_0297) is presumably involved in sporulation and cell division. Titz et al. [30] predicted the involvement of its counterpart TP0297 (Nichols strain) in cell wall metabolism. B2S3T0 (HP TPASS_0702) is the prokaryotic chromosome segregation/condensation protein ScpA, whereas its homologue in the Nichols strain (TP0702) was characterized as an HP in the study of the T. pallidum ssp. pallidum transcriptome by Smajs et al. [41].

Proteins with miscellaneous functions
We found 17 HPs exhibiting miscellaneous functions such as cell signaling and solvent tolerance. B2S234 (HP TPASS_0086) is a PilZ domain containing protein that serves as a receptor for cyclic di-GMP, which acts as a secondary messenger in bacteria [57,58]. Cyclic di-GMP is involved in the regulation of exo-polysaccharide synthesis, bacterial motility, gene expression and host-pathogen interaction [57,58]. Hence, these HPs may also be significant in the pathogenesis of T. pallidum ssp. pallidum. B2S3A9 (HP TPASS_0515) and B2S424 (HP TPASS_0796) are organic solvent tolerance proteins responsible for antibiotic resistance [59]. Smajs et al. [41] characterized the Nichols strain homologue TP0796 as a conserved HP. B2S3B5 (HP TPASS_0522) is a colicin V production protein; colicin V is a bacterial toxin that disrupts the membrane potential of other sensitive cells, leading to their death [60]. B2S3F5 (HP TPASS_0563) is a DnaJ domain containing protein; this domain is an exclusive feature of the hsp40 family of molecular chaperones [61]. These molecular chaperones are involved in various significant processes such as protein folding, polypeptide translocation and protein degradation [61]. Knowledge of these HPs will be helpful in drug discovery by completing the mosaic of knowledge regarding host-pathogen interaction, especially in the case of T. pallidum ssp. pallidum.
We compared the group of HPs annotated with high confidence (Table 1) with the group not annotated with high confidence (S9 Table). For the comparison, we considered several characteristic features: average gene length, the number of predicted protein-protein interactions, gene expression level and predicted antigens. Surprisingly, we observed a difference between the average gene lengths of the two groups. The polypeptide chains in the second group are on average less than 40 amino acids long, which corresponds to a gene length of 120 bp. In the group of HPs predicted with a high level of confidence (n = 207), the average gene length is considerably higher. We infer that the shorter gene lengths have lowered the confidence level of the second group.
We further used STRING [27] to predict protein-protein interactions. When comparing the number of predicted protein-protein interactions in the two groups, we found no characteristic difference that could affect the confidence level of function prediction. For instance, STRING predicted 10 functional partners for HP TPASS_0017 (B2S1W5) from the first group, whereas it predicted four functional partners for HP TPASS_0004 (B2S1V4), an HP from the group with functions assigned at a low level of confidence. Conversely, it predicted only two functional partners for HP TPASS_0022 (B2S1X0) from the first group, whereas it predicted 10 functional partners for HP TPASS_0008 (B2S1V7) from the second group.
We checked the expression levels of genes from both groups using the study of Smajs et al. [41] and found no correlation between gene expression level and group membership. We also checked the number of predicted antigens using the investigation of T. pallidum antigens by McKevitt et al. [35]. We found 17 predicted antigens in the group of HPs with functions predicted at a high level of confidence and, against expectations, a higher number of predicted antigens, i.e., 24, in the second group. The comparison of the two groups with respect to gene length, predicted protein-protein interactions, gene expression levels and predicted antigens thus established no characteristic difference except for gene length, which is lower in the second group (n = 237). It should be noted, however, that no differences between the group of genes with a predicted function and the group of genes with a less accurately predicted function are observed when these results are compared with previously published experimental studies [35,41]. This may suggest that the degree of prediction accuracy does not necessarily allow functional genes to be identified unambiguously and has to be interpreted with caution.
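The conversion between polypeptide length and gene length used above follows from the three nucleotides that make up each codon. The sketch below is a minimal illustration of that arithmetic and of a group-wise comparison of mean lengths; the function names are hypothetical and the length lists are toy values, not data from Table 1 or the S9 Table.

```python
from statistics import mean

NUCLEOTIDES_PER_CODON = 3


def coding_length_bp(n_residues, include_stop_codon=False):
    """Length in base pairs of the coding sequence for n_residues amino acids."""
    length = NUCLEOTIDES_PER_CODON * n_residues
    if include_stop_codon:
        length += NUCLEOTIDES_PER_CODON
    return length


def mean_coding_length_bp(protein_lengths):
    """Mean coding length (bp) for a group of proteins given in residues."""
    return mean(coding_length_bp(n) for n in protein_lengths)


if __name__ == "__main__":
    print(coding_length_bp(40))  # 40 residues * 3 nucleotides = 120 bp
    # Hypothetical protein lengths (residues) for two illustrative groups.
    high_confidence = [310, 250, 480]
    low_confidence = [35, 42, 38]
    print(mean_coding_length_bp(high_confidence),
          mean_coding_length_bp(low_confidence))
```

Very short sequences offer few residues for domain and homology searches to match against, which is consistent with the inference that short gene length lowers annotation confidence.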

Virulent proteins
Gram-negative pathogens have frequently evolved features such as increased motility and cell adhesion, as well as mechanisms to cope with the immune response of the host, thereby increasing their virulence inside the host environment [62]. We used the VICMpred and VirulentPred servers to predict virulence factors in the group of 444 HPs. Of the 207 HPs annotated with high confidence, 19 were found to be virulent on the basis of the consensus sequence analysis (Table 2). It has already been hypothesized that targeting virulence factors provides a better therapeutic intervention against bacterial pathogenesis [63]. The HPs predicted to have virulent characteristics offer candidates for target-based therapies to clear an existing infection, and may further be considered for adjunct therapy alongside existing antibiotics or as potentiators of the host immune response [64]. Recently reported proof-of-concept results for antivirulence molecules at the preclinical stage suggest that the antivirulence concept could become a reality as a new antibacterial approach.

Conclusions
Functional annotation of 444 HPs from T. pallidum ssp. pallidum has been carried out using various in silico approaches, and functions have been assigned to 207 HPs with high confidence. The performance of the bioinformatics tools was assessed using ROC analysis and reported in terms of the accuracy and sensitivity of the predicting tools. HPs annotated with a low level of confidence are not considered further here. Our predictions indicate the functional importance of the HPs for the survival of the pathogen in the host. Our study facilitates the rapid identification of hidden functions of HPs, which are potential therapeutic targets and may play a significant role in a better understanding of host-pathogen interactions. Once these HPs are established as novel drug or vaccine targets, further research into new inhibitors and vaccines can be conducted.
Supporting Information S1