Mining of potential drug targets through the identification of essential and analogous enzymes in the genomes of pathogens of Glycine max, Zea mays and Solanum lycopersicum

Pesticides are among the most widely used measures for controlling pests and diseases in crops, and their indiscriminate use poses a direct risk to human health and the environment around the world. As a result, there is a great need for new, less toxic molecules to be employed against plant pathogens. In this work, we used an in silico approach to study the enzyme-coding genes in the genomes of three commercially important plants, soybean (Glycine max), tomato (Solanum lycopersicum) and corn (Zea mays), as well as 15 plant pathogens (4 bacteria and 11 fungi), with the aim of revealing a set of essential and non-homologous isofunctional enzymes (NISEs) that could be prioritized as drug targets. By combining sequence and structural data, we obtained an initial set of 568 cases of analogy, of which 97 were validated and further refined, revealing a subset of 29 essential enzymatic activities with a total of 119 different structural forms, most belonging to central metabolic routes such as carbohydrate and amino acid metabolism. A further subset of 26 enzymatic activities has a tertiary structure specific to the pathogen and absent from plants, humans and Apis mellifera, which may be of importance for the development of specific enzyme inhibitors against plant diseases that are less harmful to humans and the environment.


Introduction
One of the major challenges for plant breeders is to maintain high levels of crop quality and production. Diseases caused by plant pathogens are one of the main factors limiting the productivity of major commodities such as soybean (Glycine max), corn (Zea mays) and tomato (Solanum lycopersicum) [1,2]. Pesticides are one of the most common means of controlling plant pathogens and are used in a wide variety of crops [3].

Datasets and clustering
The datasets of predicted proteins for each genome studied in this work were obtained from UniProtKB (version 2015_10, http://www.uniprot.org/) and RefSeq (version 70, http://www.ncbi.nlm.nih.gov/). These datasets contained several proteins annotated as "uncharacterized", "hypothetical" and/or "putative". Three plant genomes were analyzed: G. max, Z. mays and S. lycopersicum. Pathogens were chosen according to the geographic distribution of the disease, most of them with a cosmopolitan occurrence. The pathogens analyzed comprise eleven fungal and four bacterial genomes, all pathogenic to one or more of the plant species studied. Also included were the genomes of Homo sapiens, Apis mellifera (pollinator), Trichoderma harzianum (soil fungus) and Bacillus subtilis (plant growth promoting bacterium) (Table 1).
The complete, annotated set of enzymes was extracted from KEGG (release 73.0, January 2015) and contained 1,524,871 protein sequences from 298 Eukaryote, 3,014 Eubacteria and 175 Archaea genomes. Sequences shorter than 60 amino acids were removed. To cluster the sequences into groups based on sequence similarity, we used the AnEnPi pipeline [67]. A similarity score cut-off of 120 was used for all BLASTp pairwise comparisons. Results were parsed to obtain, for each enzymatic activity as defined by its Enzyme Commission (EC) number, files containing one or more groups of primary structures. If only one group was produced for a given enzymatic activity at the end of the clustering step, all of its sequences were considered homologous and that enzymatic activity was removed from the analysis. If more than one group was produced, sequences in the same group were considered homologous (score above 120), while sequences allocated to different groups were considered analogous (potential NISEs; score below 120). In other words, sequences allocated to the same group are expected to have similar tertiary structures, while sequences allocated to different groups are expected to have different folding patterns, reflecting their different evolutionary origins [34,35,68].
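The grouping rule described above can be illustrated with a minimal sketch. It is not the AnEnPi implementation: it assumes single-linkage grouping (two sequences share a group whenever a chain of pairwise scores at or above the cut-off connects them), and the function names, the input dictionaries and the treatment of a score of exactly 120 are hypothetical choices made for illustration.

```python
from collections import defaultdict

MIN_LENGTH = 60     # sequences shorter than this were removed (from the text)
SCORE_CUTOFF = 120  # BLASTp similarity score cut-off (from the text)


def cluster_by_score(lengths, pair_scores, min_length=MIN_LENGTH, cutoff=SCORE_CUTOFF):
    """Group the sequences of ONE EC number by pairwise similarity score.

    lengths:     dict sequence_id -> length in amino acids
    pair_scores: dict (id_a, id_b) -> BLASTp score for that pair
    Returns a list of sets; each set is one group of putative homologs.
    Assumption: single linkage, with score >= cutoff joining two sequences.
    """
    kept = [s for s, n in lengths.items() if n >= min_length]
    parent = {s: s for s in kept}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if a in parent and b in parent and score >= cutoff:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for s in kept:
        groups[find(s)].add(s)
    return sorted(groups.values(), key=lambda g: sorted(g))


def has_potential_nises(groups):
    """An EC number is kept only if clustering produced more than one group."""
    return len(groups) > 1


# Toy input: 'd' is too short, 'a' and 'b' are similar, 'c' is unrelated to both.
toy_lengths = {"a": 300, "b": 310, "c": 280, "d": 45}
toy_scores = {("a", "b"): 450, ("a", "c"): 60, ("b", "c"): 75, ("a", "d"): 200}
toy_groups = cluster_by_score(toy_lengths, toy_scores)
```

In this toy case the EC number would be retained, because `a` and `b` fall in one group and `c` in another.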

Protein function inference
The groups of homologous sequences generated from the KEGG dataset in the clustering step were used to reannotate (with the AnEnPi pipeline) the predicted proteins of the organisms in this study: each predicted protein was compared, in a pairwise manner, to every primary structure within each KEGG functional group. For the biochemical function inference, a cutoff value of 10^-20 was used, a highly restrictive value that gives greater reliability to the results [67,69-71]. Sequences that did not meet this threshold were removed from the analysis.
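As an illustration of this inference step, the sketch below filters BLAST hits by the 10^-20 cutoff and assigns each predicted protein to a functional group. It assumes the hits are available in the standard 12-column BLAST+ tabular layout (`-outfmt 6`), that the cutoff is applied to the e-value column, and that the best-scoring surviving hit decides the assignment; these choices, the function name and the `subject_to_group` mapping are assumptions for illustration, not details taken from AnEnPi.

```python
EVALUE_CUTOFF = 1e-20  # cutoff value from the text


def infer_functions(blast_lines, subject_to_group, cutoff=EVALUE_CUTOFF):
    """Assign each query protein to the group of its best hit passing the cutoff.

    blast_lines:      iterable of tab-separated lines with the columns
                      qseqid sseqid pident length mismatch gapopen
                      qstart qend sstart send evalue bitscore
    subject_to_group: dict KEGG sequence id -> (ec_number, group_id)  (hypothetical)
    Returns dict query_id -> (ec_number, group_id).
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff or subject not in subject_to_group:
            continue  # hit is not significant enough, or subject is unknown
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject_to_group[subject])
    return {query: group for query, (_, group) in best.items()}


# Toy hits: p1 matches two groups of the same EC number, p2 fails the cutoff.
toy_hits = [
    "p1\tk1\t80.0\t200\t40\t0\t1\t200\t1\t200\t1e-50\t300",
    "p1\tk2\t60.0\t200\t80\t0\t1\t200\t1\t200\t1e-30\t150",
    "p2\tk2\t30.0\t100\t70\t0\t1\t100\t1\t100\t1e-10\t50",
]
toy_mapping = {"k1": ("1.11.1.6", 1), "k2": ("1.11.1.6", 2)}
toy_annotation = infer_functions(toy_hits, toy_mapping)
```

Here `p1` is assigned to group 1 of EC 1.11.1.6 and `p2` receives no annotation.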

NISEs: Identification, structural validation and essentiality
The search for cases of analogy (NISEs) between enzymes from plants and pathogens was performed through the analysis of the groups produced after the clustering step and functional inference. For this, one of the modules of AnEnPi was used together with in-house scripts to parse and filter the results. To validate the identified NISEs, that is, to verify if the enzymes found are cases of evolutionary convergence, we classified the sequences in accordance with their folds using the SUPERFAMILY database. The information in this database is based on a collection of Hidden Markov Models [72], which represent the structural domains of proteins classified by SCOP [73]. Heteromultimeric enzymes, enzymes annotated with the term "subunit" and sequences without an associated fold were excluded from the final list. Fused domains were maintained in our analysis, as in the case of the family "Dimeric alpha + beta barrel", which is an evolutionarily conserved group of protein families [73,74]. Enzymes with the same EC number, but displaying different folds and, consequently, belonging to different superfamilies, were considered potential NISEs.
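The validation rule can be sketched as a comparison of fold labels per EC number. The sketch assumes each enzyme has already been assigned a superfamily (or none), treats a case as validated only when the host and pathogen fold sets for an EC number are disjoint, and does not model the exclusion of heteromultimeric enzymes; the record layout, function names and the `fold_*` labels are hypothetical placeholders, not SUPERFAMILY output.

```python
from collections import defaultdict

EXCLUDED_TERMS = ("subunit",)  # annotation term excluded in the text


def folds_by_ec(records):
    """Map EC number -> set of superfamilies, applying the exclusions.

    records: list of dicts with keys 'ec', 'superfamily' (str or None)
             and 'description' (free-text annotation).
    """
    folds = defaultdict(set)
    for rec in records:
        if rec["superfamily"] is None:
            continue  # no associated fold
        if any(term in rec["description"].lower() for term in EXCLUDED_TERMS):
            continue  # annotated as a subunit
        folds[rec["ec"]].add(rec["superfamily"])
    return folds


def validated_nises(host_records, pathogen_records):
    """Return {ec: (host_folds, pathogen_folds)} where the two fold sets differ.

    Assumption: 'different folds' is read as disjoint superfamily sets.
    """
    host = folds_by_ec(host_records)
    pathogen = folds_by_ec(pathogen_records)
    result = {}
    for ec in sorted(host.keys() & pathogen.keys()):
        if host[ec].isdisjoint(pathogen[ec]):
            result[ec] = (host[ec], pathogen[ec])
    return result


toy_host = [
    {"ec": "1.11.1.6", "superfamily": "fold_A", "description": "catalase"},
    {"ec": "4.2.1.1", "superfamily": "fold_C", "description": "carbonic anhydrase"},
    {"ec": "6.4.1.2", "superfamily": "fold_D", "description": "carboxylase subunit alpha"},
]
toy_pathogen = [
    {"ec": "1.11.1.6", "superfamily": "fold_B", "description": "catalase"},
    {"ec": "4.2.1.1", "superfamily": "fold_C", "description": "carbonic anhydrase"},
    {"ec": "6.4.1.2", "superfamily": "fold_E", "description": "carboxylase"},
    {"ec": "3.2.1.14", "superfamily": None, "description": "chitinase"},
]
toy_nises = validated_nises(toy_host, toy_pathogen)
```

Only EC 1.11.1.6 survives in the toy data: 4.2.1.1 shares a fold between host and pathogen, the host entry for 6.4.1.2 is a subunit, and the 3.2.1.14 entry has no fold.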
The Database of Essential Genes (DEG, version 14.7, October 2016, http://www.essentialgene.org/) was used as a reference in the search for essential activities in the pathogens studied. All enzymatic sequences identified as analogous were searched against the DEG database with BLASTp, using an e-value of 10^-5 as threshold. A second BLASTp search, with the same e-value threshold of 10^-5, compared all enzymatic sequences identified as analogous against the predicted proteins of organisms that should not be affected by an eventual inhibitor of the target identified in the phytopathogen (H. sapiens, A. mellifera, T. harzianum and B. subtilis).
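The two searches can be combined into a simple per-candidate flagging step, sketched below. The sketch assumes that a candidate counts as essential when it has at least one DEG hit at or below the threshold, and that any hit at or below the threshold in a non-target organism is worth flagging; the function names and the `(query, subject, evalue)` tuple layout are hypothetical.

```python
EVALUE_CUTOFF = 1e-5  # e-value threshold from the text


def queries_with_hits(hits, cutoff=EVALUE_CUTOFF):
    """Return the query ids that have at least one hit with e-value <= cutoff.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    """
    return {query for query, _, evalue in hits if evalue <= cutoff}


def classify_candidates(candidates, deg_hits, nontarget_hits):
    """Flag each analogous enzyme as essential and/or matched in a non-target organism."""
    essential = queries_with_hits(deg_hits)
    matched = queries_with_hits(nontarget_hits)
    return {
        c: {"essential": c in essential, "hit_in_nontarget": c in matched}
        for c in candidates
    }


toy_flags = classify_candidates(
    ["e1", "e2", "e3"],
    deg_hits=[("e1", "DEG_x", 1e-40), ("e2", "DEG_y", 0.5)],
    nontarget_hits=[("e1", "human_p", 1e-3), ("e3", "bee_p", 1e-12)],
)
```

In the toy data, `e1` is flagged as essential with no significant non-target hit, which is the profile of interest here.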

Data preparation, clustering and functional activity inference
After cleaning and preparation, the initial dataset obtained from KEGG was reduced to 1,225,682 protein sequences distributed over 3,893 enzymatic activities. After clustering, this dataset was used for the reannotation of the predicted proteins of the plants and phytopathogens, comprising 444,198 individual sequences in 2,096 enzymatic activities from the three plants and their 15 pathogens. Predicted proteins from H. sapiens, A. mellifera, T. harzianum and B. subtilis were also reannotated, comprising 114,914 individual sequences in 2,008 enzymatic activities. Annotation quality of the downloaded sets of predicted proteins varied greatly. Before the reannotation procedure, the best annotated organism among the plants was Z. mays, with approximately 90% of its proteins characterized, while S. lycopersicum had only 9% of its proteins annotated. Among the pathogens, P. ananatis had 83% of its entire conceptual proteome annotated and M. perniciosa had only 1.3% of its proteins characterized. After the functional inference step, where only enzymes were reannotated, on average 15% of the proteins of each organism were associated with an enzymatic activity (data not shown).

Potential NISEs: Identification and validation
Initially, a total of 568 cases of potential NISEs were identified, and from this set 97 cases were validated (Table 2; see S1 Table for more details). Sequences labeled with "subunit" or "chain" (324 cases), enzymes displaying the same fold (55 cases), and sequences without an associated fold in the SUPERFAMILY database (92 cases) were excluded. Cases of analogy were validated for all the pathogens studied: only one case was found for P. sojae and S. sclerotiorum, while 14 cases were found for A. flavus AF70. In total, 13 cases of analogy were found in the comparisons between G. max and its pathogens, 23 cases between S. lycopersicum and its pathogens, and 61 cases between Z. mays and its pathogens (Table 2).
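The counts in this paragraph are mutually consistent, as the short check below shows. All numbers are taken from the text; only the variable names are introduced here.

```python
# All numbers below are taken from the text.
initial_cases = 568
excluded = {
    "labeled 'subunit' or 'chain'": 324,
    "same fold in host and pathogen": 55,
    "no fold in SUPERFAMILY": 92,
}
validated = initial_cases - sum(excluded.values())

validated_per_host = {"G. max": 13, "S. lycopersicum": 23, "Z. mays": 61}

print(validated)                         # 97
print(sum(validated_per_host.values()))  # 97
```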
The validated NISEs (97 cases), comprising 39 different enzymatic activities, participate in central metabolic pathways including carbohydrate metabolism (13 enzymatic activities), amino acid metabolism (8), energy metabolism (6), biosynthesis of secondary metabolites (4) and lipid metabolism (4). Eight enzymatic activities belong to other pathways such as xenobiotics degradation, metabolism of cofactors and vitamins, nucleotide metabolism and metabolism of other amino acids (Fig 2). Note that one enzymatic activity may participate in more than one pathway.

Essential NISEs
After the validation step, a screening for essential enzymes was performed, revealing 58 cases of analogy. In the amino acid metabolism, several enzymes were identified as essential and analogous, such as carbonic anhydrase for R. solanacearum and A. flavus AF70; prolyl aminopeptidase for F. oxysporum 4287; and transaminase for A. flavus AF70, G. moniliformis, A. flavus NRRL3357 and F. oxysporum Fo5176. Chitinases were found to be essential and analogous for P. syringae and R. solanacearum (Table 3).
Analogous and essential enzymes were also found in the lipid metabolism and biosynthesis of secondary metabolites pathways. Acetyl-CoA carboxylase was identified in X. axonopodis and phospholipase A2 in C. graminicola. Ornithine carbamoyltransferase, identified in P. ananatis, participates in the amino acid metabolism (S2 Table). Some enzymatic activities found to be essential for some pathogens have not been identified as essential in others: these cases are represented by enzymes encoded by different genes. In this group we can cite enzymes belonging to the antioxidant system (AS).

Analogous enzymes in the antioxidant system
One group of enzymes that stood out among the validated NISEs, including non-essential activities, comprises the enzymes of the antioxidant system (AS). In all comparisons made between plants and their pathogens, except in the case of B. cinerea, the host enzyme and its counterpart in the pathogen are structurally different for at least one of the functional activities of the antioxidant system (Table 4). In total, 27 cases of analogy were found for the antioxidant system, including catalase (CAT), peroxidase (POX), superoxide dismutase (SOD), ferroxidase (HEPH) and peroxiredoxin (PRDX). In our results, CAT was identified as an essential enzyme for 9 of the 14 pathogens studied, and POX was identified as essential in E. turcicum, C. graminicola, P. syringae and R. solanacearum. SOD was identified as an essential enzyme for X. axonopodis. Among the pathogens analyzed, two species are represented by distinct strains, A. flavus (NRRL3357, AF70) and F. oxysporum (Fo5176, 4287); no differences were observed between the strains of either species. It is important to emphasize that the AS enzymatic activities are present in all the genomes included in the present work; however, only the cases of validated NISEs are shown, which explains the gaps in the absence/presence pattern observed for HEPH, PRDX and SOD (Table 4).

Specific structural forms
After obtaining the final list of validated, essential NISEs between the plant hosts and their pathogens, a search for these enzymatic activities was performed on the predicted proteins of H. sapiens, A. mellifera, B. subtilis and T. harzianum. The objective of this comparison was to determine whether the pathogen-specific structural forms also occur in the genomes of species that should not be affected by an eventual inhibitor targeting that particular structural form, mainly H. sapiens and A. mellifera. Of the 97 NISEs validated, 68 structural forms specific to the pathogen (in relation to the plant host, humans and bee) were found (Table 5). They are distributed over 26 enzymatic activities (16 of them essential). Of these 68 structural forms, 39 were present in T. harzianum and 17 in B. subtilis, which is expected since these organisms belong to the same kingdoms as the phytopathogens studied in this work (Fungi and Bacteria).
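The notion of a pathogen-specific structural form can be sketched as a set difference over fold labels. This is a simplified illustration that assumes every organism has already been reduced to a mapping from EC number to the set of superfamilies observed for it; the function name, the data layout and the `fold_*` labels are hypothetical.

```python
def specific_forms(pathogen_folds, reference_folds):
    """Return the pathogen folds that no reference organism has for the same EC.

    pathogen_folds:  dict ec -> set of superfamilies found in the pathogen
    reference_folds: dict organism -> dict ec -> set of superfamilies
                     (for example the plant host, H. sapiens and A. mellifera)
    Returns dict ec -> set of pathogen-specific superfamilies (empty sets omitted).
    """
    specific = {}
    for ec, folds in pathogen_folds.items():
        seen = set()
        for folds_of_organism in reference_folds.values():
            seen |= folds_of_organism.get(ec, set())
        unique = folds - seen
        if unique:
            specific[ec] = unique
    return specific


toy_pathogen_folds = {"1.11.1.6": {"fold_A", "fold_B"}, "4.2.1.1": {"fold_C"}}
toy_references = {
    "host": {"1.11.1.6": {"fold_A"}},
    "H. sapiens": {"4.2.1.1": {"fold_C"}},
    "A. mellifera": {},
}
toy_specific = specific_forms(toy_pathogen_folds, toy_references)
```

The same function could be called a second time with T. harzianum and B. subtilis as references to see which of the specific forms also occur in those beneficial organisms.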

Discussion
The correct description of analogous enzymes is important for the practical tasks of metabolic reconstruction and enzymatic nomenclature. In addition to this practical importance, these enzymes represent an important evolutionary phenomenon: their existence shows that evolutionarily independent solutions may appear for various biochemical problems [35]. The main works on the practical application of analogous enzymes describe studies of metabolic pathways and inhibitory targets for human pathogens [42,69,70]. In our study, we sought a practical application focused on the solution of an agronomic problem. Essential enzymes are one of the primary targets for the development of inhibitors of any kind; however, species that share essential enzymatic functions may inadvertently be affected by products developed with other applications in mind [75]. Pesticides are commonly targeted at these functions, and their damaging effects on several species, including humans and vital species such as pollinators and beneficial microorganisms, are reason for great concern [76][77][78]. In fact, it is estimated that approximately 35% of crops depend on pollinators for sexual reproduction, and pesticides are the main factor contributing to the current decrease of the pollinator population [44,79]. Through the joint use of primary structure, tertiary structure and essentiality data, beginning with 444,198 individual sequences comprising 2,096 enzymatic activities in 3 plants and 15 phytopathogens, we have disclosed a subset of analogous sequences in 29 essential enzymatic activities present in both the plant and the pathogen. These belong to several components of the central metabolism of plants and pathogens, being involved in carbohydrate metabolism, amino acid metabolism, the detoxification of reactive oxygen species and others, thus offering several opportunities as targets.

Table 4. Alternative enzymatic forms found among the enzymes of the antioxidant system. Columns: Organisms; Structural forms.
Interestingly, the subset of non-essential NISEs contains several enzymes important in the context of host-pathogen interactions, such as cellulases, chitinases, glutathione transferase and lysophospholipase. Blocking or inhibiting these enzymes would, in principle, decrease virulence and/or delay the defense mechanisms of the pathogen [80,81]. Inhibition of cellulases and chitinases has also been proposed as a strategy for the development of new antifungal drugs for aspergillosis in humans [22]. Glutathione transferase plays an essential role in the protection of necrotrophic fungi against toxic metabolites derived from plants and reactive oxygen species [82], while lysophospholipase has been implicated in virulence in Cryptococcus neoformans [83]. Some of the diversity found for the enzymes of the antioxidant system, both in enzymatic activities and in structural forms, may be explained by evolutionary pressures: during the co-evolution between plants and their pathogens, it is likely that different antioxidant enzymes of plants have adapted to overcome the pathogen virulence mechanisms [84,85]. The role of these enzymes in mechanisms of virulence, susceptibility to infections, development of drug targets and evaluation of pesticide effects has been studied for SOD [86][87][88][89][90], CAT [91][92][93][94] and POX [95].
Eighteen of the 29 enzymatic activities identified in this study as analogous and essential were found in databases of drug targets such as TDR Drug Targets (http://tdrtargets.org/), DrugBank (https://www.drugbank.ca/) and the Potential Drug Target Database (http://www.dddc.ac.cn/pdtd/), meaning they are being studied or employed as a drug target for at least one pathogen. Among them we can mention enzymes from the carbohydrate and amino acid metabolism such as lactoylglutathione lyase, acetyl-CoA carboxylase and carbonic anhydrase, and enzymes of the AS such as catalase, peroxidase, peroxiredoxin and superoxide dismutase. Since these enzymatic activities present multiple tertiary structures, we are not able to tell, from these data, which structure is under study; nonetheless, these findings give indirect support to our analyses, corroborating the idea that essential enzymes with specific structural forms have great potential as drug targets, as described in our study. Improvements in the annotation of genes and their products, and a better experimental characterization of enzymatic activities, would allow the use of less stringent criteria in our procedures, mainly in data cleaning and filtering, but also in clustering and structural validation, increasing the number of essential and analogous enzymes that could be further studied as potential drug targets.

Conclusions
The approach employed in this study enabled the elaboration of lists of essential and analogous enzymes, most belonging to the central metabolism and/or involved in host-pathogen interactions, with potential as drug targets. These enzymes provide an opportunity for the discovery of targets with considerable structural differences from their counterparts in beneficial organisms such as pollinators. Inclusion of structural data allows the disclosure of specific structural forms, facilitating the development of environment-friendly enzyme inhibitors, which may be of great importance for agricultural use.
Supporting information S1