An integrative in-silico approach for therapeutic target identification in the human pathogen Corynebacterium diphtheriae

Corynebacterium diphtheriae (Cd) is a Gram-positive human pathogen responsible for diphtheria infection and once regarded for high mortalities worldwide. The fatality gradually decreased with improved living standards and further alleviated when many immunization programs were introduced. However, numerous drug-resistant strains emerged recently that consequently decreased the efficacy of current therapeutics and vaccines, thereby obliging the scientific community to start investigating new therapeutic targets in pathogenic microorganisms. In this study, our contributions include the prediction of modelome of 13 C. diphtheriae strains, using the MHOLline workflow. A set of 463 conserved proteins were identified by combining the results of pangenomics based core-genome and core-modelome analyses. Further, using subtractive proteomics and modelomics approaches for target identification, a set of 23 proteins was selected as essential for the bacteria. Considering human as a host, eight of these proteins (glpX, nusB, rpsH, hisE, smpB, bioB, DIP1084, and DIP0983) were considered as essential and non-host homologs, and have been subjected to virtual screening using four different compound libraries (extracted from the ZINC database, plant-derived natural compounds and Di-terpenoid Iso-steviol derivatives). The proposed ligand molecules showed favorable interactions, lowered energy values and high complementarity with the predicted targets. Our proposed approach expedites the selection of C. diphtheriae putative proteins for broad-spectrum development of novel drugs and vaccines, owing to the fact that some of these targets have already been identified and validated in other organisms.


Genomes selection
The thirteen C. diphtheriae strains, including three of the four biovars: gravis, mitis and belfanti (Table 1) were included in this study. The gene and protein sequences of these thirteen C. diphtheriae strains were retrieved from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). The different steps involved in this computational approach for genome-scale modelome prediction and for the prioritization of putative drug and vaccine targets are given in (Figs 1 & 2).
Prediction of core-modelome and identification of core genome To construct the core-modelome of C. diphtheriae, we followed a slightly modified protocol described by Hassan et al., 2014 [18]. High throughput structural modeling, MHOLline (http://www.mholline.lncc.br), was used to predict the modelome (whole-proteome set of protein 3D models) for each strain. MHOLline uses comparative modeling approach for protein 3D structure prediction through MODELLER [19]. Our workflow also includes BLASTp (Basic Local Alignment Search Tool for Protein) [20], HMMTOP (Prediction of transmembrane helices and topology of proteins), [21] BATS (Blast Automatic Targeting for Structures), FILTERS, ECNGet (Get Enzyme Commission Number), MODELLER, and PROCHECK [22].
MHOLline work on the basis of available template. It is probable that MHOLline cannot detect all the common conserved proteins due to the unavailability of the template. To overcome this probability, we used EDGAR (an Efficient Database framework for comparative Genome Analyses using BLAST score Ratios for pan-genomics analysis) to collect common conserved genome as well of all Cd strains [23]. Later, the results from MHOLine and EDGAR were compared and crosschecked to obtain the final dataset of common conserved proteins.

Identification of intra-species conserved proteins
Primarily, for the identification of highly conserved proteins with available 3D models in all Cd strains (! 95% sequence identity), the standalone release of NCBI BLASTp+ (v2.2.26) was adapted from the NCBI ftp. Site (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and installed on a local machine. Furthermore, a search was performed using NCTC13129 as a random reference genome for all strains. Comparative genomics/proteomics approach was next adopted for selecting the highly conserved proteins using an all-against-all BLASTp analysis with a cut-off value of E = 0.0001, as in many other essentiality studies before [6,13,15,18,24].

Essential and non-host homologous (ENH) protein targets
A subtractive genomics approach was next followed for the selection of conserved targets, which were essential to the bacteria [15]. Concisely, the set of proteins derived from the coremodelome of C. diphtheriae was subjected to the Database of Essential Genes (DEG) for homology analyses. The DEG encompasses experimentally validated data of currently available essential genomic elements like protein-coding genes and non-coding RNAs, from bacteria, archaea and eukaryotes. For a bacterium, essential genes form a minimal genome, i.e., a set of functional modules that has key roles in the emerging field of synthetic biology [25]. The cutoff values used for BLASTp were: E-value = 0.0001, bit score !100 and identity ! 25% [15,18]. The pool of essential genes was then subjected to NCBI-BLASTp (E-value = 0.0001, bit score !100 and identity ! 25%) against the human genome for filtering pathogen-essential hosthomologs [6]. The remaining set of pathogen-essential non-host homologs were additionally crosschecked with NCBI-BLASTp PDB database using the default values to find any remote structural similarity with the existing host homolog protein structures, keeping the cutoff level to 15% for query coverage. The biochemical pathways of these proteins have been checked using KEGG (Kyoto Encyclopedia of Genes and Genomes) [26], functionality using UniProt (Universal Protein Resource) [27], virulence using PAIDB (Pathogenicity island database) [28], and cellular localization using CELLO (subCELlular LOcalization predictor) [29]. The final list of targets was based on criteria described by Barh et al., 2011.

Essential and host homologous (EH) protein targets
We further extended our analyses to the set of protein targets that were essential to C. diphtheriae but homologous to host proteins. The essential protein targets deviating from the cutoff values for essential non-host homologous proteins were treated as host homologous proteins. This set of targets was also checked for pathway involvement, functional annotation virulence, and cellular localization as mentioned above.

Computational identification of druggable pockets
The information obtained from 3D structures and druggability analyses are important features for prioritizing and authenticating putative pathogen targets [30,31]. As mentioned above, for druggability analyses, the final list of essential non-host and host homologous protein targets were subjected to DoGSiteScorer in PDB format [32]. The DoGSiteScorer is an automated pocket detection and analysis tool for calculating the druggability of protein cavities. For each detected cavity the tool returns the pocket residues and a druggability score ranging from 0 to 1. Values closer to 1 indicate highly druggable protein cavity, i.e. the predicted cavities are likely to bind ligands with high affinity [32]. The DoGSiteScorer also calculates volume, depth, surface area, lipophilic surface, and further parameters for each predicted cavity.

Ligand libraries preparation, virtual screening and docking analyses
The ligand libraries were prepared from four different sources, compounds from ZINC database (ZINC drug-like molecules, ZINC Natural Product), natural compounds from literature survey [33] and the Di-terpenoid Iso-steviol derivatives (S1 Table). ZINC (drug-like molecules) contains 11,193 drug-like molecules, with Tanimoto cutoff level of 60% [34] and ZINC (Natural Product) contain 11,203 molecules. Whereas, the small library of natural compounds Intra-species subtractive modelomics workflow for conserved target identification in C. diphtheriae species. The table represents the total number of protein sequences as an input data fed to the MHOLline workflow (upper red arrow). The blue arrow represents the core genes of thirteen Cd strains. The rectangular boxes show how this workflow processes and filters a large quantity of genomic data for putative drug and vaccine target identification of a pathogen. contained 28 molecules and the library of Di-terpenoid Iso-steviol derivatives contained 31 molecules respectively. The structures of these molecules were constructed using MOE-Builder tool. The 3D structures were modeled and partial charges were calculated using MOE (Molecular Operating Environment). The energies of the modeled molecules were minimized using the energy minimization algorithm of MOE tool (gradient: 0.05, Force Field: MMFF94X, Chiral Constraint) [35]. The modeled molecules were saved in the.mol2 file format and subjected to docking analysis.
The 3D structures of proteins were examined for structural errors such as missing atoms, wrong bonds and protonation states in the MVD (Molegro Virtual Docker) [36]. The consensus set of protein cavities and those predicted with DogSiteScorer (druggability ! 0.80) were compared with the MVD detected cavities, for all Cd targets. The maximum numbers of residues from DoGSiteScorer falling in the cavities detected by MVD were merged and final grid was generated based on the consensus between the highest scoring pocket from DoGSiteScorer and cavities detected by MVD for docking. The most druggable cavity was subjected to virtual screening using MVD. The program comprises of three search algorithms for molecular docking analyses namely MolDock Simplex Evolution (SE), MolDock Optimizer [36] and Iterated Simplex (IS). We employed the MolDock Optimizer search algorithm, which is based on a differential evolutionary algorithm, using the default parameters that are a) population size = 50, b) scaling factor = 0.5 and c) crossover rate = 0.9. The orientations of docked molecules from the library of natural compounds and from the derivatives of Di-terpenoid Iso-steviol were analyzed in Chimera [37]. The 200 top ranked compounds (ZINC drug-like molecules, ZINC Natural Product) for each target protein were evaluated for shape complementarity and hydrogen bond interactions. This led to the selection of a final set of compounds with polypharmacology and polypharmacy characteristics for target proteins in C. diphtheriae.

Results and discussion
Modelome prediction and conserved targets identification in C. diphtheriae Among 13 strains of C. diphtheriae species, our employed methodology produced high-confidence 3D structural models from orthologous proteins in C. diphtheriae species through the efficient MHOLline workflow (Fig 3). A comparative structural genomics approach was followed where all the G2 sequences classified as "Very High", "High", "Good" and "Medium to Good quality" by MHOLline, from the 12 Cd strains, were aligned to the Cd NCTC13129 strain as a reference genome. First, we identified a set of common conserved proteins with a pre-defined sequence similarity of 95-100%. This resulted in a set of 463 protein sequences, being conserved in all Cd strains (S3 Table).

Protein targets as putative drug and vaccine candidates
The identification of essential proteins in C. diphtheriae was carried out where the core-modelome was compared to DEG (Database of Essential Genes). This filter drastically reduced the number of selected targets to 23 final targets. Further comparison of the corresponding protein sequences to the human host proteome resulted in a set of 8 targets as essential non-host homologous (ENH, Table 2) and a set of 15 targets as essential host homologous proteins (EH, Table 3).

Prioritization parameters for drug targets and vaccine candidates
There are several factors that can aid in determining potential therapeutic targets [30]. For vaccine candidates, the information about subcellular localization is important: Proteins that contain transmembrane motifs are favored [24,30,38,39]. The 23 essential proteins have a low molecular weight and all are localized in the cytoplasmic compartment of C. diphtheriae (Tables 2 & 3). After the druggability evaluation using DoGSiteScorer [32] for both essential non-host and host homologous conserved targets from C. diphtheriae, we could predict at least one druggable cavity for each Cd target. The host homologous proteins as therapeutic targets could adversely affect the host. Therefore, the first step in numerous in silico drug target identification approaches are filtering proteins homologous to host proteome. Thus, we only consider the eight pathogen-essential non host homologs for the docking studies [13,15,40]. For the eight pathogen-essential non host homologs (S2 Table) glpX, nusB, rpsH, hisE, DIP1084, DIP0983, smpB, and bioB 3, 0, 1, 0, 2, 0, 1 and 3 cavities with score > 0.80 were predicted. The cavity of each protein exhibiting the highest druggability score was subjected to docking analyses. The numbers of predicted cavities with their respective druggability scores are given in Tables 2 & 3.
The identified eight non-host homologous and essential Cd proteins could be novel therapeutic targets for Corynebacterium diphtheriae.
As per our knowledge, glpX, hisE and bioB proteins have been reported as potential drug target in Mtb. Protein nusB is a member of Nus-transcription Factor family that help bacteria in the process of elongation, transcription: translation coupling and termination. Some members of this family (nusG) has already been reported as drug target. Furthermore, rpsH and smpB are also reported as potential drug target by Folador et al., 2016 in their in silico study [41]. Protein DIP1084 is Putative iron transport membrane protein (FecCD-family) and DIP0983 is uncharacterized Hypothetical Protein that need to be characterized experimentally. Hence, these protein could be a good therapeutic target against Cd.

Virtual screening and molecular docking
For each target protein (glpX, nusB, rpsH, hisE, DIP1084, DIP0983, smpB, and bioB) four different libraries were separately screened. A total of 28 molecules from natural compounds library and 31 compounds from the derivatives of Di-terpenoid Iso-steviol library were docked. Furthermore, top 200 drug-like molecules from virtual screening analyses of two large libraries (ZINC drug-like molecules, ZINC Natural Product) were examined one-by-one for the selection of the final set of promising molecules that showed favorable interactions with

Validation of docking protocol
To validate the accuracy of MolDock program (MVD), the co-crystallized ligand of Biotin synthase, bioB (PDB ID; 1R30) was extracted and then re-docked into the binding pocket of receptor protein. The RMSD between docked and co-crystallized ligand was found to be 1.81 A˚, which shows that the adopted docking protocol is valid and can be used to correctly predict the binding pose of the ligands [35,42]. The superposition of co-crystallized ligands and docked is shown in Fig 4. NP_939302.1 (glpX, Fructose 1, 6-bisphosphatase II) is a key enzyme of gluconeogenesis and catalyzes the hydrolysis of fructose 1, 6-bisphosphate to form fructose 6-phosphate and orthophosphate. A reverse reaction catalyzed by phosphofructokinase in glycolysis, and the product, fructose 6-phosphate, is an important precursor in various biosynthetic pathways [43]. In all organisms, gluconeogenesis is an important metabolic pathway that allows the cells to synthesize glucose from non-carbohydrate precursors, such as organic acids, amino acids and glycerol. FBPases are members of the large superfamily of lithium sensitive phosphatases, which includes three families of inositol phosphatases and FBPases (the phosphoesterase clan CL0171, 3167 sequences, Pfam data base). The FBPases are already reported as targets for the development of drugs for the treatment of noninsulin dependent diabetes [44,45]. Based on a comparison with a crystallographic structure of the glpX template (PDB ID: 1NI9, GlpX from Escherichia coli), none of the active site residues were identified. The docking analysis was performed utilizing the highest scoring pocket obtained from DoGSiteScorer. Table 4 shows a set of 10 promising ligands according to their minimum energy values and the maximum number of hydrogen bond interactions from the four aforementioned libraries. Compounds ZINC67912153, ZINC13142972, Jacarandic Acid and 16-hydrazonisosteviol are shown in Fig 5. NP_939692.1 (nusB, Transcription antitermination protein NusB) is a prokaryotic transcription factor involved in antitermination processes, during which it interacts with the mRNA nut site at boxA portion. The crystal structure of M. tuberculosis and E. coli NusB proteins suggest that the basic N-terminal region of the molecule associates with the rRNA BoxA. Hypothetically, this is indicative of the so-called arginine rich RNA binding motif (ARM) in   Therapeutic targets identification for Corynebacterium diphtheriae the bacteriophage N protein, HIV tat and HIV rev. This suggestion is supported by the presence of a phosphate-binding site at the N-terminal end of α-A in each NusB protomer that includes a pair of conserved arginines, Arg10 and Arg14 [46]. The bismuth-dithiol solutions have been shown to selectively inhibit Escherichia coli rho transcription termination factor [47]. A comparison between the crystallographic structures of the NusB template (PDB ID: 1EYV, NusB from M. tuberculosis) and our modeled structure reveals that the conserved arginines were located at position 12 and 16 (Arg12 and Arg16) and are likely to contribute in the interactions. Although none of these residues are predicted to form hydrogen bonds with selected docked ligands, these molecules were predicted to interact with other residues in the pocket. Table 5 shows the 8 selected ligands from all the four libraries according to their  minimum energy values and the number of hydrogen bond interactions. The compounds ZINC15043210, ZINC00053531 Jacarandic Acid and 16-hydrazonisosteviol are shown in (Fig 6). A decent binding mode and good shape complementarity was observed in these complexes. NP_938900.1 (rpsH, 30S ribosomal protein S8) is an important RNA-binding protein that inhabits a central position within the small ribosomal subunit. It widely interacts with 16S rRNA and is vital for the correct folding of the central domain of the rRNA. The protein rpsH S8 also controls the synthesis of numerous ribosomal proteins by binding to mRNA. It binds exactly to very similar sites in the two RNA molecules. It is a ribosomal protein that has medium-size, and its role as a significant primary RNA-binding protein in the 30S subunit is discovered recently. The S8 mutations within the protein have been shown to result in defective ribosome assembly. In Escherichia coli, the S8-binding site within 16S rRNA has been investigated independently by a number of techniques including nuclease protection, RNAprotein crosslinking, RNA modification, hydroxyl-radical footprinting and chemical probing. The rpsH S8 protein is also one of the principal regulatory elements that control ribosomal protein synthesis by the translational feedback inhibition mechanism discovered by Nomura and colleagues [48]. It regulates the expression of the spc operon that encodes, in order, the ten ribosomal proteins L14, L24, L5, S14, S8, L6, L18, S5, L30 and L15 [49]. The active site residues of rpsH, based on a comparison with its template structure were Arg86, Tyr88, Ser107, Ser109, Gly124, Gly125 and Glu126. However, none of the molecules interacts with these residues ( Table 6); nonetheless they are predicted to interact with other residues of the binding NP_938502.1 (bioB, Biotin synthase) catalyzes the final step in the biotin biosynthetic pathway by converting dethiobiotin (DTB) to biotin. This reaction uses organic radical chemistry for inserting sulfur atom between non activated carbons C6 and C9 of DTB. BioB is a member of the "radical SAM" or "AdoMet radical" superfamily, which is categorized by the presence of a conserved CxxxCxxC sequence motif (C, Cys; x, any amino acid) that synchronizes an essential Fe 4 S 4 cluster, as well as by the use of S-adenosyl-Lmethionine (SAM or Ado-Met) for radical generation. AdoMet radical enzymes act on a wide variety of biomolecules. For example, BioB and lipoyl-acyl carrier protein synthase (LipA) are involved in vitamin biosynthesis; lysine 2,3-aminomutase (LAM) facilitates the fermentation of lysine; class III ribonucleotide reductase (RNR) and pyruvate formate lyase (PFL) catalyze the formation of glycyl radicals in their respective target proteins; and spore photoproduct lyase repairs ultraviolet light-induced DNA damage [50]. The protein bioB was reported as putative drug target in C. diphtheriae by Barh et al., 2011 in their in silico study [15]. A comparison between our modeled protein and template structures suggest Cys86, Cys90, Cys93 and Arg291 as the active residues. Although, only Cys86, Cys90 and Cys93 were found to interact with the compounds from our prepared libraries, the molecules were predicted to interact with other residues in the pocket. The binding mode of compounds with active site residues and low scores suggest a set of 10 molecules ( Table 7) as promising leads from our four libraries. The predicted binding NP_939612.1 (hisE, Phosphoribosyl-ATP pyrophosphatase) is the second enzyme in the histidine-biosynthetic pathway, hydrolyzing irreversibly phosphoribosyl-ATP to phosphoribosyl-AMP and pyrophosphate. It is encoded by the hisE gene, which is present as a separate gene in many bacteria and archaea but is fused to hisI in other bacteria, fungi and plants. As it is essential for growth as seen in in vitro experiments, HisE is a potential drug target for tuberculosis [51]. A comparison of template and target protein structures here showed that there was no reported information about ligand-residue/s association in the active site cavity. Hence, the cavity chosen for virtual screening was simply the one that presented the highest DogSiteScorer druggability score (>80). A list of best dock molecules is shown below ( Table 8  Therapeutic targets identification for Corynebacterium diphtheriae ribosome orients SmpB towards the small ribosomal subunit, and directs tmRNA towards the elongation-factor binding region of the ribosome. The tmRNA-SmpB rescue system is ubiquitous in bacteria, and is also found in some chloroplasts and mitochondria [52]. In this case the template structure (PDB ID: 1P6V) did not contain any ligand, and no reported information was found about the ligand-residue interaction in their cavities. Therefore, amongst the cavities identified by MVD, the best cavity for docking analysis was chosen in consensus with highest druggability score from the DogSiteScorer. ZINC31168211 was found to form the network of 12 hydrogen bonds with Asn9, Ser16, Val49, Ser50, Thr52, Asp53, Ser54, Thr109. Table 9 lists top compounds from respective libraries selected for this target while the binding modes of Rhein, 16-hydroxyisosteviol, ZINC01414475 and ZINC31168211 are also shown (Fig 10).
NP_939445.1 (DIP1084, Putative iron transport membrane protein, FecCD-family) The Pfam search for the protein showed that it has two main components, FecCD and ABC_trans. The FecCD is a subfamily of bacterial binding-protein-dependent transport systems family constituting transport system permease proteins involved in the transport of numerous compounds through the membrane. These transporters tend to catalyze the thermodynamically unfavorable translocation of substrates against a transmembrane concentration gradient through the coupling to a second, energetically favorable process. ABC systems can be categorized in three functional groups, as follows. Importers mediate the uptake of nutrients in prokaryotes. The nature of the substrates that are transported is very wide, including mono-and oligosaccharides, organic and inorganic ions, amino acids, peptides, iron-siderophores, metals, polyamine cations, opines, and vitamins [53]. Exporters are involved in the secretion of various molecules, such as peptides, lipids, hydrophobic drugs, polysaccharides, and proteins, including toxins such as hemolysin. The third category of systems is apparently not involved Therapeutic targets identification for Corynebacterium diphtheriae in transport, with some members being involved in translation of mRNA and in DNA repair. Table 10 shows a set of 11 high scoring compounds against the proposed target. Compound ZINC70454922 from ZINC NP library was predicted to form ten hydrogen bonds with relatively low docking score (Fig 11).
NP_939345.1 (DIP0983, Hypothetical protein DIP0983) is a conserved hypothetical protein. It is annotated as a possible lysine decarboxylase (LDC) in the Pfam database (PF03641) [54] due to the presence of the highly conserved PGGxGTxxE motif. Some enzymes i:e "Lonely Guy" LOG are often mis-annotated as lysine decarboxylases enzymes; it is apparently responsible for catalyzing L-lysine decarboxylation to produce the polyamine metabolite cadaverine [55]. Conversely, this annotation is not supported by any biochemical or functional data in any of the PGGxGTxxE motif containing LDC identified so far. This motif is highly conserved among a vast number of proteins with unknown function, predicted from bacterial, yeast, and plant; in Arabidopsis thaliana, all the genome-annotated LOG proteins are identified as LDC like proteins by protein family. Based on sequence BLAST against the PDB, LOG from Claviceps purpurea shares more than 30% identical residues with crystal structures of LDC-like proteins of unknown function, whose structures are already determined. Recently, lysine decarboxylase has been reported as a therapeutic target by Lohinai et al., 2015 for Periodontal Inflammation [56]. Here we listed 12 compounds showing good potency against our target tabulated in Table 11. Four of the compounds with promising docking results are shown in Fig 12. Among the drug-like molecule ZINC13142972 (1-[(2S, 3S, 4S, 5R)-3,4-dihydroxy-5-(hydroxymethyl) oxolan-2-yl]imidazo[1,2-b]pyrazole-7-carbonitrile) was predicted to show good results against two of our targets NP_939302.1 (glpX, Fructose 1,6-bisphosphatase II) and NP_939445.1 (DIP1084, Putative iron transport membrane protein, FecCD-family). It Therapeutic targets identification for Corynebacterium diphtheriae has been reported that at present 50% of drug molecules are either from natural source or their derivatives [57]. Interestingly, the compounds from second library of ZINC (Natural Product) showed better energy scores among all the libraries. Furthermore, from the library of natural compounds (28 molecules), Jacarandic Acid and Rhein were identified as the top ranked molecules and in silico analysis of the library (derivatives of diterpenoid isosteviol) suggest that compounds 16-hydroxyisosteviol, 16-hydrazonisosteviol, 17-hydroxyisosteviol, 16-17 dihydroxyisosteviol and 16-oxime, 17-hydroxyisosteviol were top ranked molecules, however, with much higher energy scores (less negative) than the top compounds from the ZINC libraries (ZINC drug-like molecules, ZINC Natural Product).

Conclusion
We utilized a bioinformatics pipeline for determining the conserved proteome of 13 strains of C. diphtheriae, and subsequently exploit 3D structural information, resulting in a small set of prioritized putative drug/vaccine targets, of which eight proteins are pathogen-essential, nonhost homologous and 15 are pathogen-essential, host-homologs. After a detailed structural comparison between host and pathogen proteins, we suggest that eight of the non -host homologs could be considered for antimicrobial chemotherapy in future studies on anti-diphtheriae drugs and vaccines. Moreover, the strategy described herein is of general nature and can also be employed to other pathogenic microorganisms.
Supporting information S1