The authors have declared that no competing interests exist.
Conceived and designed the experiments: AB AAI MJM IC JAV. Performed the experiments: AB. Analyzed the data: AB AAI. Wrote the paper: AB JAV.
Over the past several years fungal infections have shown an increasing incidence in the susceptible population, and caused high mortality rates. In parallel, multi-resistant fungi are emerging in human infections. Therefore, the identification of new potential antifungal targets is a priority. The first task of this study was to analyse the protein domain and domain architecture content of the 137 fungal proteomes (corresponding to 111 species) available in UniProtKB (UniProt KnowledgeBase) by January 2013. The resulting list of core and exclusive domain and domain architectures is provided in this paper. It delineates the different levels of fungal taxonomic classification: phylum, subphylum, order, genus and species. The analysis highlighted
Some fungi have become pathogenic to plants and, to a lesser extent, to animals. Under certain conditions their presence in the human body can pose a threat to human health, especially for immunocompromised patients. Yet, some fungi can also infect healthy individuals. The low sensitivity of the antifungal drugs available, together with the clinically observed resistance of some fungi, raises the demand for new alternative treatments. Proteins are biological molecules that perform essential functions within living organisms. Many of those functions are attributed to the varying folded structure of each protein. These structures are composed of functional units (also called domains), each one independently responsible for a fraction of the overall biological function. Understanding how the different block combinations are distributed across members of the same or similar families of organisms is important. For instance, exclusive domain combinations can hold particular acquired functions. Blocks displaying high mobility can play major roles in the organism's survival. The biological goal of this study was to analyse the functional implications of protein domains and domain combinations in the available fungal proteomes. This information can be used to highlight proteins and pathways that could potentially be used as drug targets.
There has been a significant rise in the incidence of fungal infection over the last few years. This has been partially due to an increase in the susceptible population as the result of blood cancer, intensive care, solid organ transplantation, or chronic granulomatous disease, in addition to a growing number of patients receiving high doses of corticosteroids or other immunosuppressive treatments
The rationale in the search for new antifungal targets in the pre-genomics era was based on the molecular study of genes associated with fungal viability or virulence. The advent of massively parallel sequencing technologies and their progressively reduced cost has enabled the sequencing of large numbers of genomes in a short period of time. As a result, it is now possible to look for broad-spectrum antifungal targets by detecting homologous proteins present in most of the available fungal proteomes
In the late 90s, the concept of protein domain was coined by Branden and co-workers to define an independent, compact and stable protein structural unit that folds independently of other such units
Whereas the domain repertoires of different organisms can be relatively similar, there can be a high variation in the number of the domain combinations. Therefore, the number of different domain architectures is more variable and is related to the organism complexity and lifestyle
Protein domain information can be used for many different purposes. For instance, the domain architecture can determine the overall protein function and be used to transfer genomic annotations in newly sequenced genomes
In this manuscript, we first perform an analysis of the protein domains and domain architectures derived from all the available complete fungal proteomes. We identify the core and exclusive domains and domain architectures at different levels of the fungal taxonomical classification (phylum, subphylum, order, genus, species) and characterize which domains are found to be promiscuous. Then, we use the obtained protein domain information to explore
Proteins from all the available complete fungal proteomes (137 organisms) were retrieved from UniProtKB (UniProt Knowledgebase, release 2013_01). The 137 fungal proteome sets represented 111 species from the following four fungal phyla: Ascomycota, Basidiomycota, Chytridiomycota and Zygomycota. In addition, the phylum Microsporidia was considered. Microsporidia are a controversial group of eukaryotic organisms of fungal origin, without mitochondria and peroxisomes. They were included in this study because they are regarded as the earliest-diverging clade of sequenced fungi
The related domain information for each protein was obtained from Pfam (release 27.0). On average, 67.0% of the proteins had at least one Pfam domain assigned. However, the Pfam coverage was not evenly distributed among the studied species. On the low coverage side, there were plant pathogens like
On average, each fungal protein in the studied set had 1.55 domains, a proportion that did not vary significantly between species. Overall, 5,279 different Pfam domains were found among the studied proteomes (35.6% of the total Pfam domains). Among them, 131 domains (2.5%) were found in every fungus analysed (they are ‘core’ domains), whereas 612 domains (11.6%) were found only in one organism. Interestingly, when Microsporidia organisms were not considered in the analysis, the number of fungal ‘core’ domains rose to 268. Overall, 70.8% of all domain types (3,740) appeared in both single and multidomain proteins. In addition, 964 domains (18.3%) were not seen in combination with any other domain, whereas 575 domains (10.9%) appeared exclusively in multidomain proteins (see
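The split between domains seen alone, only in combination, or in both contexts can be illustrated with a minimal Python sketch. It assumes each protein is represented as a tuple of its Pfam domain names, and it treats any protein with more than one domain instance (including repeats of the same domain) as multidomain; both choices, and the name `classify_domains`, are assumptions made for illustration rather than the procedure used in the study.

```python
from typing import Dict, Iterable, Set, Tuple

Architecture = Tuple[str, ...]


def classify_domains(architectures: Iterable[Architecture]) -> Dict[str, Set[str]]:
    """Split domain types by the kind of protein they were observed in."""
    in_single: Set[str] = set()  # domains seen as the only domain of a protein
    in_multi: Set[str] = set()   # domains seen in proteins with 2+ domain instances
    for arch in architectures:
        if len(arch) == 1:
            in_single.add(arch[0])
        elif len(arch) > 1:
            in_multi.update(arch)
    return {
        "both": in_single & in_multi,
        "single_only": in_single - in_multi,
        "multi_only": in_multi - in_single,
    }


# Toy input with placeholder domain names (not real data).
toy_architectures = [("D1",), ("D1", "D2"), ("D3",), ("D4", "D2")]
classes = classify_domains(toy_architectures)
```

In the toy input, `D1` lands in `both`, `D3` in `single_only`, and `D2` and `D4` in `multi_only`.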
Protein domain architectures were defined throughout this study as the ordered tuples of Pfam domains, listed from the protein N-terminus to C-terminus. Architectures differing in either domain count or domain order were computed as independent domain combinations, even if they had the same domain types. For instance, three different example domain architectures built from just two domain types could be: 1) D1∼D1∼D2; 2) D1∼D2; and 3) D1∼D2∼D1. All three would be considered different domain architectures because the number of domains, the order of the domains, or both differ.
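The definition above can be sketched in Python by storing each architecture as a tuple, which keeps both order and repeats. The helper names (`architecture`, `to_string`) and the start positions are hypothetical, and a plain ASCII `~` stands in for the separator used in the text.

```python
from typing import List, Tuple

Architecture = Tuple[str, ...]


def architecture(domain_hits: List[Tuple[int, str]]) -> Architecture:
    """Order (start_position, domain_name) hits from N- to C-terminus."""
    return tuple(name for _, name in sorted(domain_hits))


def to_string(arch: Architecture) -> str:
    """Render an architecture with '~' between consecutive domains."""
    return "~".join(arch)


# The three example architectures from the text, with made-up start positions.
a1 = architecture([(5, "D1"), (120, "D1"), (260, "D2")])
a2 = architecture([(5, "D1"), (120, "D2")])
a3 = architecture([(260, "D1"), (5, "D1"), (120, "D2")])  # hits given out of order

# Tuples compare by content and order, so all three remain distinct.
distinct = {a1, a2, a3}
```

Because tuples are hashable, architectures can be used directly as set members or dictionary keys when counting them across proteomes.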
In total, 21,853 unique domain architectures were identified in our set of 137 fungal proteomes, with 10,206 of them (46.7%) appearing exclusively in a single proteome and only 56 ‘core’ architectures (0.2%) present in all the fungi. However, when Microsporidia species were not taken into account, the number of ‘core’ architectures almost doubled: 107 domain architectures were found. Gene Ontology (GO) term annotations were added to give insight into their molecular function and/or biological activity (see ‘
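A minimal sketch of how ‘core’ and exclusive architectures could be derived from per-proteome sets is shown below. The function name, the proteome labels and the toy architectures are hypothetical; the same logic applies to domains if sets of domain names are passed instead.

```python
from collections import Counter
from typing import Dict, Hashable, Set, Tuple


def core_and_exclusive(
    items_by_proteome: Dict[str, Set[Hashable]],
) -> Tuple[Set[Hashable], Dict[str, Set[Hashable]]]:
    """Return (items present in every proteome, items unique to each proteome)."""
    counts: Counter = Counter()
    for items in items_by_proteome.values():
        counts.update(items)  # each set contributes at most 1 per item
    total = len(items_by_proteome)
    core = {item for item, n in counts.items() if n == total}
    exclusive = {
        proteome: {item for item in items if counts[item] == 1}
        for proteome, items in items_by_proteome.items()
    }
    return core, exclusive


# Toy data: architectures as tuples of placeholder domain names.
toy = {
    "fungus_A": {("D1",), ("D1", "D2"), ("D3",)},
    "fungus_B": {("D1",), ("D1", "D2"), ("D4", "D1")},
    "microsporidian_C": {("D1",), ("D5",)},
}
core_all, exclusive_all = core_and_exclusive(toy)

# Dropping one divergent proteome can only keep or enlarge the core set.
without_c = {name: items for name, items in toy.items() if name != "microsporidian_C"}
core_reduced, _ = core_and_exclusive(without_c)
```

In the toy data the core grows from one architecture to two once the divergent proteome is left out, which mirrors the kind of effect described for Microsporidia.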
Most of these 56 ‘core’ domain architectures (
The effect of removing Microsporidia from the analysis of ‘core’ domains and architectures highlights how different Microsporidia can be from the rest of the organisms analysed. To measure how significant this increase is, we compared it to the effect of removing the species from another phylum (Zygomycota). Note that the only strain from Zygomycota (
The same pattern of modest variance in the number of domains shown in the previous section was also observed at the level of the domain architectures (
At the species level, 637 domains and 10,582 domain architectures were identified as exclusive to one of the 111 fungal species analysed. In this case, proteomes from multiple strains belonging to the same species were grouped. Throughout the rest of the manuscript, organisms belonging to the phylum Microsporidia were treated like any other fungus. The different analyses were also performed without Microsporidia, but for simplicity those results are not shown, since we believe they add little novel information, unlike the results for ‘core’ domains and architectures included in the previous sections.
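Grouping strains before looking for exclusive items can be sketched as follows. The mapping `taxon_of` and all labels are hypothetical; the point is only that an architecture shared by two strains of one species still counts as exclusive to that species once the strains are merged.

```python
from collections import Counter
from typing import Dict, Hashable, Set


def group_by_taxon(
    items_by_proteome: Dict[str, Set[Hashable]], taxon_of: Dict[str, str]
) -> Dict[str, Set[Hashable]]:
    """Merge the item sets of all proteomes that map to the same taxon."""
    grouped: Dict[str, Set[Hashable]] = {}
    for proteome, items in items_by_proteome.items():
        grouped.setdefault(taxon_of[proteome], set()).update(items)
    return grouped


def exclusive_per_group(grouped: Dict[str, Set[Hashable]]) -> Dict[str, Set[Hashable]]:
    """Items that occur in exactly one group."""
    counts = Counter(item for items in grouped.values() for item in items)
    return {g: {i for i in items if counts[i] == 1} for g, items in grouped.items()}


# Two strains of a hypothetical species X and one strain of species Y.
strain_items = {
    "strain_1": {("D1", "D2"), ("D3",)},
    "strain_2": {("D1", "D2"), ("D4",)},
    "strain_3": {("D3",), ("D5",)},
}
species_of = {"strain_1": "species_X", "strain_2": "species_X", "strain_3": "species_Y"}

by_species = group_by_taxon(strain_items, species_of)
species_exclusive = exclusive_per_group(by_species)
```

Passing a strain-to-genus mapping instead of `species_of` gives the analogous grouping at the genus level.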
The average number of exclusive architectures per species was 96, ranging from 10 to 467 (see below). All the exclusive architectures for each species are provided as
The organisms with the fewest exclusive architectures were those from the primitive Microsporidia species
As a simple observation, species without other closely related species in the dataset tended to have larger numbers of exclusive domain architectures. The highest numbers were found in
The repertoire of exclusive protein domains looked slightly different. For a complete list of exclusive domains per species see
A similar analysis of the exclusive domain architectures was done at the genus level. In this case, all the proteomes from the species belonging to the same genus were grouped (
In parentheses, the number of species and the number of strains that belong to a given genus are indicated. The area occupied by each genus corresponds to the number of exclusive domain architectures, whereas the colour correlates with the number of exclusive domains present among those architectures (calculated as the number of domains divided by the number of domain architectures).
Genera (Nb. species/Nb. strains) | Sum of the number of strain-specific architectures | Number of genus-specific architectures | Increase (%) |
| 20 | 39 | 95.0 |
| 88 | 140 | 59.0 |
| 59 | 88 | 49.0 |
| 131 | 152 | 16.8 |
| 218 | 246 | 12.8 |
| 839 | 943 | 12.4 |
| 244 | 273 | 11.8 |
| 149 | 160 | 7.4 |
| 230 | 245 | 6.5 |
| 351 | 372 | 6.0 |
| 160 | 154 | 3.9 |
| 367 | 378 | 3.0 |
| 149 | 145 | 2.6 |
| 387 | 392 | 1.3 |
| 43 | 43 | 0 |
The table is sorted by the increase in the number of exclusive architectures in the genus when compared to the individual species. In parentheses, the number of species and the total number of strains are indicated.
The information is split by taxonomical levels and allows the screening of conserved domains and domain architectures in the studied organisms. In particular, an interesting case is that of the Domains of Unknown Function (DUFs) and their related domain architectures. The specificity and exclusiveness of these domains to certain taxonomical groups could help to elucidate the underlying biological functions.
Since we were interested in knowing which domains were essential for the different fungi, next we identified the list of promiscuous domains per organism, as explained in ‘
Pfam domain name | Description | Times in top 25 ranking of promiscuous domains | Average number of bigrams | Gene ontology (GO) terms |
AAA* | AAA family proteins often perform chaperone-like functions that assist in the assembly, operation, or disassembly of protein complexes | 132 | 17 | GO:0005524 ATP binding |
GATase* | Glutamine amidotransferase class-I | 123 | 7 | - |
SH3_1* | SH3 (Src homology 3) domains are often indicative of a protein involved in signal transduction related to cytoskeletal organization | 122 | 11 | GO:0005515 protein binding |
PX | PX domains bind to phosphoinositides. | 117 | 10 | GO:0005515 protein binding; GO:0007154 cell communication; GO:0035091 phosphatidylinositol binding |
PH* | PH stands for pleckstrin homology | 116 | 9 | GO:0005515 protein binding; GO:0005543 phospholipid binding |
SNF2_N | SNF2 family N-terminal domain. This domain is found in proteins involved in a variety of processes including transcription regulation, DNA repair, DNA recombination and chromatin unwinding | 115 | 12 | GO:0003677 DNA binding; GO:0005524 ATP binding |
Helicase_C | Helicase conserved C-terminal domain | 108 | 20 | GO:0003676 nucleic acid binding; GO:0004386 helicase activity; GO:0005524 ATP binding |
MMR_HSR1 | The full-length GTPase protein is required for the complete activity of the protein interacting with the 50S ribosome and binding of both adenine and guanine nucleotides, with a preference for guanine nucleotide | 98 | 8 | GO:0005525 GTP binding |
DEP* | Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP). The DEP domain is responsible for mediating intracellular protein targeting and regulation of protein stability in the cell | 89 | 5 | GO:0035556 intracellular signal transduction |
UBA* | UBA/TS-N domain. Found in several proteins having connections to ubiquitin and the ubiquitination pathway | 88 | 7 | GO:0005515 protein binding |
TPR_1* | Tetratricopeptide repeat | 86 | 9 | GO:0005515 protein binding |
zf-RING_2 | Ring finger domain | 80 | 11 | GO:0005515 protein binding; GO:0008270 zinc ion binding |
C1_1* | Phorbol esters/diacylglycerol binding domain (C1 domain). This domain is also known as the Protein kinase C conserved region 1 (C1) domain. | 76 | 5 | GO:0035556 intracellular signal transduction |
JmjC* | The JmjC domain belongs to the Cupin superfamily. JmjC-domain proteins are hydroxylases that catalyse a novel histone modification | 74 | 5 | GO:0005515 protein binding |
UCH* | Ubiquitin carboxyl-terminal hydrolase | 72 | 9 | GO:0004221 ubiquitin thiolesterase activity; GO:0006511 ubiquitin-dependent protein catabolic process |
BRCT* | BRCA1 C-terminus (BRCT) domain. Canonical BRCT phosphopeptide interaction cleft at a groove between the BRCT domains | 71 | 7 | - |
PHD* | PHD folds into an interleaved type of Zn-finger chelating two Zn ions in a similar manner to that of the RING and FYVE domains | 66 | 9 | GO:0005515 protein binding |
UBACT | Repeat in ubiquitin-activating (UBA) protein | 65 | 5 | GO:0005524 ATP binding; GO:0006464 cellular protein modification process; GO:0008641 small protein activating enzyme activity |
TPR_2 | Tetratricopeptide repeat | 65 | 7 | - |
RhoGEF | RhoGEF domain. Guanine nucleotide exchange factor for Rho/Rac/Cdc42-like GTPases. Also called Dbl-homologous (DH) domain. It appears that Pfam:PF00169 domains invariably occur C-terminal to RhoGEF/DH domains | 57 | 6 | GO:0005089 Rho guanyl-nucleotide exchange factor activity; GO:0035023 regulation of Rho protein signal transduction |
CBM_1 | Fungal cellulose binding domain | 24 | 13 | GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds; GO:0005576 extracellular region; GO:0005975 carbohydrate metabolic process; GO:0030248 cellulose binding |
Domains marked with an asterisk had been previously identified as promiscuous in animals, plants and fungi
The domain that was promiscuous in the highest number of organisms was the ‘ATPases Associated with cellular Activities’ (‘AAA’ or ‘AAA+’) domain, found to be promiscuous in 133 out of the 137 organisms studied. It belongs to a large and intensively studied protein superfamily. ‘AAA’ domains usually form a ring-shaped oligomeric complex that confers on them several activities through the energy-dependent unfolding of macromolecules
Interestingly, after ‘AAA’ and ‘GATase’, the most promiscuous domains identified were ‘SH3_1’ (found in 122 organisms) and ‘PX’ (117 organisms), two related domains which can interact with each other. ‘SH3’ domains are typically 40–60 amino acids long
It has been previously pointed out that protein domain promiscuity is a volatile feature throughout evolution
So far, in this study we have performed a detailed analysis of the protein domain and architecture content of the available fungal proteomes. The resulting information can be used with different purposes in mind. In our case, we decided to further mine the data and explore approaches to detect
Protein domain information of the human reference proteome (UniProtKB release 2013_01) was then analysed and compared with the information available for all the fungal organisms. At least one Pfam domain was retrieved for 72.8% of the human proteins, with an average of 2.06 domains per protein. The number of distinct Pfam domains found (5,519, 37.2% of all the Pfam domains) was slightly higher than the total number of distinct domains found across all the fungal proteomes analysed (5,279). However, almost three times more architectures were found in the combined fungal proteomes than in human (17,469 vs. 6,741, respectively). The numbers of domains and domain architectures either exclusive to or shared between human and fungi are represented in
The category “fungi” refers to the set of 137 organisms analysed.
In order to have a second view on the human and fungal domain contents, we again analysed the protein domain information, but this time at the level of the Pfam domain clans. Pfam clans consist of a series of evolutionarily related Pfam families which are believed to share a common ancestor
Based on the criterion of drug side effects, all proteins containing protein domains belonging to those 48 clans (corresponding to 77 Pfam domains) are, in our opinion, potentially good initial candidates to be considered as antifungal targets. Using the already available information about the exclusive domains per taxonomic level, the list of potential targets could then be tailored to, e.g., particular species, genera or phyla.
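The clan-level filter described above can be sketched with plain set operations. The sketch assumes a mapping from domain name to clan identifier; apart from ‘DinB’ and CL0310, which are taken from the text, every domain name and clan identifier below is a placeholder, and domains without a clan are simply ignored, which is a simplification.

```python
from typing import Dict, Set


def fungal_only_clans(
    fungal_domains: Set[str], human_domains: Set[str], clan_of: Dict[str, str]
) -> Set[str]:
    """Clans represented by at least one fungal domain and by no human domain."""
    def clans(domains: Set[str]) -> Set[str]:
        return {clan_of[d] for d in domains if d in clan_of}
    return clans(fungal_domains) - clans(human_domains)


def candidate_domains(
    fungal_domains: Set[str], human_domains: Set[str], clan_of: Dict[str, str]
) -> Set[str]:
    """Fungal domains whose clan has no member in the human proteome."""
    wanted = fungal_only_clans(fungal_domains, human_domains, clan_of)
    return {d for d in fungal_domains if clan_of.get(d) in wanted}


clan_of = {
    "DinB": "CL0310",        # from the text
    "FungalX": "CL_TOY_1",   # placeholder names and clan IDs from here on
    "HumanY": "CL_TOY_1",
    "SharedZ": "CL_TOY_2",
}
fungal = {"DinB", "FungalX", "SharedZ", "NoClanDomain"}
human = {"HumanY", "SharedZ"}

only_fungal_clans = fungal_only_clans(fungal, human, clan_of)
candidates = candidate_domains(fungal, human, clan_of)
```

Note that `FungalX` is absent from the human set at the domain level but is still excluded, because a related domain from the same clan occurs in human; this is the stricter behaviour the clan-level comparison is meant to provide.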
Just as one example, one of those detected 77 Pfam domains was the domain ‘DinB’ (PF05163, part of the clan CL0310), which was exclusively found in proteins from species from the genus
The second approach was to search for biological pathways that could be potential targets for antifungal compounds, focusing on proteins with promiscuous domains, since they might play a role in maintaining network stability
Overall, 3,675 fungal proteins in UniProtKB contained at least one of these 8 domains. We then looked for representation of these proteins in fungal pathways available in the public domain.
Protein(s) originally annotated | Domain architecture | Number of Proteins | Number of species | Metabolic pathway (UniPathway) | Metabolic pathway description |
Q0C8M3 | ketoacyl-synt∼Ketoacyl-synt_C∼Acyl_transf_1∼PS-DH∼Methyltransf_12∼KR∼PP-binding∼Condensation | 4 | 4 | UPA00875: lovastatin biosynthesis | Biosynthesis of lovastatin, an HMG-CoA reductase inhibitor produced by the fungus |
Q4WBW4; A1DBP9 | Esterase_phb∼CBM_1 | 34 | 22 | UPA00114: xylan degradation | Degradation of xylan, a polymer of xylose residues |
P15807; O14172 | NAD_binding_7∼Sirohm_synth_M∼Sirohm_synth_C | 125 | 96 | UPA00262: siroheme biosynthesis | Biosynthesis of siroheme, the cofactor for sulfite and nitrite reductases. Siroheme is formed by methylation, oxidation and iron insertion into the tetrapyrrole uroporphyrinogen III (Uro-III) |
Overall, three pathways present in UniPathway, containing these five proteins, were identified as potential targets for antifungals:
Lovastatin biosynthesis (for the domain ‘KR’). Lovastatin is commonly used as a cholesterol-lowering agent. In a fungal context, it is a compound that blocks the first step of terpene biosynthesis for the production of ergosterol (the main sterol in fungal cell membranes). It was first discovered in
Xylan degradation (for the domain ‘CBM_1’). Diverse microorganisms, including filamentous fungi, secrete enzymes capable of digesting xylan, a polysaccharide constituent of plant cell walls. These enzymes have been used in the industrial production of animal food, textiles and biofuels
Biosynthesis of siroheme (for the domain ‘Sirohm_synth_M’). Siroheme is a heme-like prosthetic group used by sulfite and nitrite reductases to convert sulfite into sulfide and nitrite into ammonia, respectively. This process is essential for the assimilation of sulfur and nitrogen by plants and, consequently, for life. Sulfite reductases are found in bacteria, plants and fungi but not in animals. Assimilation of all inorganic sulfur and the majority of nitrogen in the biosphere depends on the availability of siroheme. Without it, there would be no reduced sulfur available for the synthesis of cysteine and methionine. Animals are unable to reduce sulfate and must therefore obtain sulfur-containing amino acids from the diet
Unfortunately, the amount of pathway information from fungal organisms in the public domain is quite limited at present. It was not possible to find comparable ready-to-use information in other resources apart from UniPathway (see ‘
Since annotation of fungal pathways in resources like UniPathway is limited to the best-studied species, the protein domain architecture of the five proteins identified was searched in the rest of the fungal organisms to determine how common these three pathways were throughout the fungal kingdom. Assuming that proteins with the same domain architecture can have a similar function and can be involved in the same or similar pathways, the most ubiquitous pathway among the three was the “biosynthesis of siroheme”, present in 96 out of the 111 fungal species studied (
The 111 fungal species used in this study were classified in five groups taking into account the frequency with which they had been found in clinical samples (see ‘
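The architecture search described above amounts to an exact match on the ordered domain tuple. The sketch below uses the siroheme architecture from the table; the accessions, species labels and function name are placeholders, and an exact-match criterion is an assumption about how "the same domain architecture" was evaluated.

```python
from typing import Iterable, Set, Tuple

Architecture = Tuple[str, ...]
ProteinRecord = Tuple[str, str, Architecture]  # (accession, species, architecture)


def species_with_architecture(
    proteins: Iterable[ProteinRecord], target: Architecture
) -> Set[str]:
    """Species that have at least one protein with exactly the target architecture."""
    return {species for _, species, arch in proteins if arch == target}


SIROHEME: Architecture = ("NAD_binding_7", "Sirohm_synth_M", "Sirohm_synth_C")

# Placeholder records (not real accessions or species).
toy_proteins = [
    ("ACC0001", "species_a", SIROHEME),
    ("ACC0002", "species_a", SIROHEME),
    ("ACC0003", "species_b", ("NAD_binding_7",)),
    ("ACC0004", "species_c", SIROHEME),
]

found = species_with_architecture(toy_proteins, SIROHEME)
n_species = len({species for _, species, _ in toy_proteins})
coverage = len(found) / n_species  # fraction of species carrying the architecture
```

A partial architecture, such as the single-domain protein of `species_b`, does not count as a match under this criterion.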
Group 4 (the one including the species which are most commonly found in clinical samples) was characterized by 140 exclusive architectures. Among those, 11 architectures contained DUFs. Group 3 contained 674 unique domain architectures. The least clinically relevant groups, groups 2 and 1, had 1,801 and 3,048 architectures, respectively. They were the ones with the highest number of exclusive architectures.
When the analysis was performed at the level of the protein domains, 11 exclusive domains were found for at least one species in Group 3, whereas only two domains were found to be exclusive to Group 4.
Pfam domain name | Domain description | Pfam clan information | Group |
ATP1G1_PLM_MAT8 | ATP1G1/PLM/MAT8 family | - | 3 |
CTP_transf_3 | Cytidylyltransferase. This family consists of two main cytidylyltransferase activities: 1) 3-deoxy-manno-octulosonate cytidylyltransferase; 2) acylneuraminate cytidylyltransferase. NeuAc cytidylyltransferase of | CL0110: GT-A. This is the GT-A clan that contains diverse glycosyltransferases that possess a Rossmann like fold | 3 |
FRG | FRG domain. This presumed domain contains a conserved N-terminal (F/Y)RG motif. It is functionally uncharacterised | - | 3 |
HI0933_like | HI0933-like protein | CL0063: NADP_Rossmann. A class of redox enzymes is composed by two domain proteins. One domain, termed the catalytic domain, confers substrate specificity and the precise reaction of the enzyme. The other domain, which is common to this class of redox enzymes, is a Rossmann-fold domain | 3 |
PTS-HPr | PTS HPr component phosphorylation site | - | 3 |
SdiA-regulated | SdiA-regulated. This family represents a conserved region approximately within a number of hypothetical bacterial proteins that may be regulated by SdiA, a member of the LuxR family of transcriptional regulators. Some family members contain the Pfam:PF01436 repeat | CL0186: Beta_propeller. This large clan contains proteins that contain beta propellers. These are composed of between 6 and 8 repeats. The individual repeats are composed of a four stranded sheet | 3 |
Sugarporin_N | Maltoporin periplasmic N-terminal extension. This domain would appear to be the periplasmic, N-terminal extension of the outer membrane maltoporins | - | 3 |
TIR_2 | TIR domain. This is a family of bacterial Toll-like receptors | CL0173: STIR. Both members of this clan are thought to be involved in TOLL/IL1R-like pathways, by mediating protein-protein interactions between pathway components. The N-termini of SEFIR and TIR domains are similar, but the domains are more divergent towards the C-terminus | 3 |
Uma2 | Putative restriction endonuclease. This family consists of hypothetical proteins that are greatly expanded in cyanobacteria. The proteins are found sporadically in other bacteria. A small number of member proteins also contain Pfam:PF02861 domains that are involved in protein interactions. Solutions of several structures for members of this family show that it is likely to be acting as an endonuclease | CL0236: PDDEXK. This clan includes a large number of nuclease families related to holliday junction resolvases | 3 |
FixP_N | N-terminal domain of cytochrome oxidase-cbb3, FixP. This is the N-terminal domain of FixP, the cytochrome oxidase type-cbb3. The exact function is not known | - | 3 |
MFMR | G-box binding protein MFMR. It is between 150 and 200 amino acids in length. The N-terminal half is rather rich in proline residues and has been termed the PRD (proline rich domain), whereas the C-terminal half is more polar and has been called the MFMR (multifunctional mosaic region). It has been suggested that this family is composed of three sub-families called A, B and C, classified according to motif composition | - | 3 |
HEPN | HEPN domain | CL0291: KNTase_C. This alpha helical domain is found associated with a variety of nucleotidyltransferase domains | 4 |
Keratin_B2_2 | Keratin, high sulfur B2 protein | CL0520: Keratin_assoc. Families in this clan are cysteine-rich and are from proteins associated with Keratin | 4 |
Extended information about all domains exclusively found in at least one species from the described groups, according to their occurrence in clinical samples, can be found in
As a third approach to identify potential targets for drugs, the list of protein domains produced by Kruger and colleagues in a previous study
When we directly compared this list with the list generated in the previous section (according to the occurrence of the fungi in clinical samples), we found that none of the exclusive domains from proteins of Groups 3 or 4 had been identified as having small-molecule binding potential. Only seven domains exclusive to organisms belonging to the less clinically relevant Group 2 appeared to have small-molecule binding potential: ‘BH3’ (PF15285), ‘Cons_hypoth698’ (PF03601), ‘DUF2146’ (PF10220), ‘Gb3_synth’ (PF04572), ‘NCD2’ (PF04905), ‘bact-PGI_C’ (PF10432) and ‘TTKRSYEDQ’ (PF10212). Interestingly, these exclusive domains were present mainly in proteins of the multi-resistant species
Based on the structural similarities, we decided to include in the comparison all the Pfam domains belonging to the same clans as the 215 domains initially studied (extending the initial number to 1,193).
As a result of this second analysis, three domains were found exclusively in proteins from species of Group 3: ‘CTP_transf_3’ (PF02348), ‘HI0933_like’ (PF03486, also highlighted in the previous section) and ‘Uma2’ (PF05685). ‘CTP_transf_3’ is a domain found in cytidylyl-transferase membrane proteins, which are important regulatory enzymes in the synthesis of phospholipids in eukaryotic cell membranes
In this study, we have characterized the protein domain and domain architecture content of the available fungal proteomes (including the phylum Microsporidia) and we have shown how that information can be used
In analogous previous studies, protein domain order and domain repetitions were considered in a different way, assigning the same domain arrangement regardless of the number of consecutive copies of a single given domain
Another reason for considering the architectures as they were is that part of the approach developed in this project is planned to be used to improve automatic protein sequence annotation in UniProtKB
For example, knowing that the fungal promiscuous domain ‘PX’ is usually involved in targeting proteins to cell membranes, domain architectures including ‘PX’ domains might be of interest. Looking into this particular case, we identified the domain architecture ‘PXB∼PX∼DUF3818’, present in 106 proteins of all fungal phyla. The only protein manually annotated in UniProtKB with this domain architecture is Q06839, a peripheral membrane protein believed to be involved in the cell communication process (GO:0007154) in
As a result of this comprehensive study, we provide access to the full list of ‘core’ and exclusive domains and domain architectures at different levels of the taxonomic classification and also identify the promiscuous domains. This information can be a very valuable resource for researchers interested in comparative studies between different fungal organisms. Here, we have only highlighted some examples of how this information could be used, but it is clear to us that more focused studies could be performed on particular groups of organisms, using all the information generated here. This information could also be combined with genome features such as gene clusters (very frequently found in fungi, e.g. for genes involved in secondary metabolism) or synteny.
However, in this study we decided to focus on the possible application of this information in the detection of antifungal targets. We then followed three different approaches. First of all, we identified those protein domains and domain architectures that were present in fungi but not found in the human proteome. Secondly, using the promiscuous domains, we identified three pathways whose components could be targeted. Last, we created five groups of organisms depending on their occurrence in clinical samples and then inferred small-molecule protein domain binding information obtained in a recent study involving small molecules stored in the ChEMBL database. The results coming from these three approaches constitute just a first step and should be taken with caution, since they have different inherent limitations. It is also expected that these approaches will provide new information when new data (e.g. pathway related information) is made available in the public domain.
Analogous studies assessing the interaction between protein domains and small molecules have become more popular in the last few years. For instance, using data from Protein Data Bank (PDB) structures, more than thirteen thousand physical interactions between small molecules and protein domains were recently identified
Throughout this study we have used domain information coming from Pfam-A domains, so all the conclusions are limited to that context. One fact to consider is that protein sequence coverage in Pfam for fungi is lower than for other organisms, especially for species like
Overall, this manuscript provides a comprehensive analysis of protein domain and domain architectures in the available fungal proteomes and shows three approaches that can be used as a first step in the detection of new antifungal targets. These approaches could also be used for organisms with clinical interest other than fungi e.g. bacteria. Therefore, analogous analyses could be performed for different groups of pathogenic bacteria using as a starting point the scripts provided (available at
The proteomes used in this study were obtained from UniProtKB (
Domain information was obtained from Pfam
Additional functional annotation for Pfam domains was retrieved using Gene Ontology (GO) terms (
A local MySQL database was developed to store the protein sequence and domain information. The analysis of domain and architectures was performed using R (
The nomenclature of the fungal organisms followed in this study was the one used in UniProtKB (release 2013_01). The taxonomical classification used was the one provided by the National Center for Biotechnology Information (NCBI)
A method to measure the weighted bigram frequency (WBF), introduced by Basu and colleagues
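The raw bigram counts that a bigram-based promiscuity measure builds on can be sketched as follows. A bigram is taken here to be an ordered pair of adjacent domains within an architecture; treating the pair as ordered, counting distinct bigram types rather than occurrences, and the function names are all assumptions. The weighting step of the WBF itself is not reproduced in this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Architecture = Tuple[str, ...]
Bigram = Tuple[str, str]


def bigram_types(architectures: Iterable[Architecture]) -> Dict[str, Set[Bigram]]:
    """Map each domain to the distinct adjacent-pair types it takes part in."""
    bigrams: Dict[str, Set[Bigram]] = defaultdict(set)
    for arch in architectures:
        for left, right in zip(arch, arch[1:]):
            pair = (left, right)
            bigrams[left].add(pair)
            bigrams[right].add(pair)
    return bigrams


def rank_by_bigram_count(architectures: Iterable[Architecture], top: int = 25) -> List[str]:
    """Domains sorted by number of distinct bigram types (ties broken by name)."""
    bigrams = bigram_types(architectures)
    return sorted(bigrams, key=lambda d: (-len(bigrams[d]), d))[:top]


# Toy architectures built from domain names that appear in the table above.
toy_archs = [
    ("PX", "SH3_1"),
    ("SH3_1", "PH"),
    ("AAA", "AAA"),
    ("PX", "PH", "SH3_1"),
]
toy_bigrams = bigram_types(toy_archs)
ranking = rank_by_bigram_count(toy_archs)
```

Single-domain proteins contribute no bigrams, so a domain that only ever occurs alone never enters the ranking.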
Information in
Metabolic pathway information was retrieved from the public resource UniPathway (
Based on the frequency of appearance in clinical samples (according to two recent epidemiological prospective studies carried out in Spain
Group 4: Species with more than 100 isolates in any of the previously cited epidemiological studies.
Group 3: Species with more than 50 isolates in the study carried out by Puig-Asensio
Group 2: Species isolated at least once in any of the previously cited epidemiological studies.
Group 1: Species not isolated in the cited epidemiological studies, but present in clinical samples of the collection of fungal strains from the National Centre for Microbiology (Spain).
Group 0: Neither isolated in the two epidemiological studies cited nor present in the Spanish collection of fungal strains from the National Centre for Microbiology.
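The five-group scheme above can be expressed as a small decision function. This is a sketch under stated assumptions: the two isolate counts stand for the two epidemiological studies, the first argument is taken to be the study referred to in the Group 3 definition, and any further condition in that definition that is not visible in the text is ignored. All parameter names are hypothetical.

```python
def clinical_group(
    isolates_study_1: int, isolates_study_2: int, in_reference_collection: bool
) -> int:
    """Assign a species to Group 0-4 from isolate counts and collection presence."""
    most = max(isolates_study_1, isolates_study_2)
    if most > 100:                 # Group 4: more than 100 isolates in any study
        return 4
    if isolates_study_1 > 50:      # Group 3: more than 50 isolates in study 1 (assumed)
        return 3
    if most >= 1:                  # Group 2: isolated at least once in any study
        return 2
    if in_reference_collection:    # Group 1: only present in the strain collection
        return 1
    return 0                       # Group 0: found in neither source


examples = {
    "frequent": clinical_group(150, 0, True),
    "intermediate": clinical_group(60, 10, True),
    "occasional": clinical_group(10, 60, False),
    "collection_only": clinical_group(0, 0, True),
    "absent": clinical_group(0, 0, False),
}
```

The checks run from the most to the least clinically frequent group, so a species that meets several criteria is placed in the highest applicable group.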
A list of