The protein architecture in Bacteria and Archaea identifies a set of promiscuous and ancient domains

In this work, we describe a systematic comparative genomic analysis of promiscuous domains in genomes of Bacteria and Archaea. A quantitative measure of domain promiscuity, the weighted domain architecture score (WDAS), was used and applied to 1317 domains in 1320 genomes of Bacteria and Archaea. A functional analysis associated with the WDAS per genome showed that 18 of 50 functional categories were identified as significantly enriched in the promiscuous domains; in particular, small-molecule binding domains, transferases domains, DNA binding domains (transcription factors), and signal transduction domains were identified as promiscuous. In contrast, non-promiscuous domains were identified as associated with 6 of 50 functional categories, and the category Function unknown was enriched. In addition, the WDASs of 52 domains correlated with genome size, i.e., WDAS values decreased as the genome size increased, suggesting that the number of combinations at larger domains increases, including domains in the superfamilies Winged helix-turn-helix and P-loop-containing nucleoside triphosphate hydrolases. Finally, based on classification of the domains according to their ancestry, we determined that the set of 52 promiscuous domains are also ancient and abundant among all the genomes, in contrast to the non-promiscuous domains. In summary, we consider that the association between these two classes of protein domains (promiscuous and non-promiscuous) provides bacterial and archaeal cells with the ability to respond to diverse environmental challenges.


Introduction
Since Wetlaufer [1] described that consecutive residues in polypeptide chains tend to fold into more or less compact modules called domains, it has been generally accepted that domains are the protein evolutionary modules, and modular reuse has been demonstrated in all domains of life [2]. This modularity might be advantageous to the organisms, allowing signaling proteins PLOS  to expand their regulatory linkages and may elicit a broader range of control mechanisms either via modular combinations or through modulation of inter-modular linkages [3]. Therefore, proteins are modular, and it has been suggested that large proteins contain multiple domains, either with the same or with different structural folds [4]. In this regard, comparative analyses of proteins have revealed a substantial fraction of multidomain proteins in eukaryotic organisms [5,6]. Similar conclusions have been reached through the analysis of domain combinations, namely, that more complex organisms display greater proportions of domain combinations and duplications [7,8].
Recently, some of these domain combinations have been identified as stable during evolution, whereas others have been highly mobile [9]. Accordingly, domains are defined as promiscuous if they are combined with many other domains, suggesting that such combination of domains with different structures and functions is a major source of diversity and modulation of molecular functionality. A second group of proteins whose domain architecture is conserved has been identified, and this group contains domains with few associations, or monolithic domains. A recent analysis showed that the increase in domain combinations of 19 families of DNA-binding transcription factors (TFs) correlated with the complexity of the organisms studied; however, it is uncertain if this correlation can be extended to additional domains [10] or is a particular observation for proteins devoted to gene regulation.
Therefore, in this work we evaluated the tendency of protein domains to combine with other domains, i.e., domain promiscuity, in proteins of bacterial and archaeal genomes. We assigned domains based on a Superfamily hidden Markov models (HMM), and a quantitative measure of domain promiscuity, the weighted domain architecture score (WDAS), was used to analyze 1319 non-redundant bacterial and archaeal genomes. We further attempted to correlate this measure with function, genome size and protein fold ancestrality.

Prokaryotic genomes
A total of 5321 genomes corresponding to Bacteria and Archaea domains were downloaded from the NCBI server and filtered to exclude redundancy, using the criteria described in [11]. We performed comparative analyses using the open reading frames (ORFs) that encode predicted proteins in all organisms. Hence, a total of 1214 bacteria; and 105 archaeal non-redundant genomes were considered for this analysis.

Superfamily domain assignments
The predicted protein sequences (the proteome) in all organisms were scrutinized to assess their domain organization by using the Superfamily database assignments [12]. A library of 1659 Superfamily HMMs was compared to the protein sequences by using the script hmmscan.pl (HMMer 3.1b2 [13]) provided by Superfamily (v. 2015) that considers the following parameters: -E 10 -Z 15438 -hmmscan -threads 4. To generate consistent assignments from the output files, we ran the script ass3.pl -t n -f 4 -e 0.0001 (this e-value is considered as conservative default).

Evaluation of domain architecture diversity
The WDAS, a measure of domain architecture diversity, was computed as described elsewhere [14]. The WDAS considers the proteins containing a domain and the total number of proteins per genome via the inverse abundance frequency (IAF) statistic, as follows: where P t is the total proteins per genome and P d is the number of proteins with domain d.
To determine the diversity of architectures associated with a specific domain, the inverse variability (IV) is obtained from the inverse of the number of distinct partner domain families at the N-and C-terminal sides adjacent to a domain. The definition of the IV of a domain, d, is: where f d is the number of different domain families adjacent to domain d. Finally, the WDAS of a domain is the product of the IAF and the IV: The associations between the WDAS and genome size (measured in ORFs) were binned in 13 intervals without overlaps, with a width of 836 ORFs, as previously described [10]. In brief, the number of windows was calculated by using the Sturges's formula: k = 1+ log2N where k is the number of equal classes and N the number of data, rounding to the nearest integer, the k value. Thus, the width of classes was determined with the equation: c = R/k where R = high value-low value (Genome size in ORFs). Values resulting from the application of the above formulas were k = 13 and c = 836, thus 13 windows without overlaps were used with a width of 836 ORFs [10].

Promiscuity identification
In order to define the threshold to separate a domain as promiscuous and non-promiscuous, we computed an accumulative plot for the number of enriched functions associated to promiscuous and non-promiscuous domains at different scores of WDASs. In a posterior step, both datasets were adjusted with polynomial regression (S1 Fig). From this plot we found an increasing between the first, second and third intervals, that corresponds to values between 0 and 3.0 of WDAS. The numbers of enriched functions decays when a value of 0 to 4.0 was consider. Therefore, a WDAS of 3.0 could be considered as a threshold to identify promiscuous and non-promiscuous domains. This found is consistent, when the number of total domains per interval was consider, i.e. the number of domains exhibited a slight increase when values between 0 and 3.0 of WDAS were considered and that represents the 13.69% of the total of domains (S1 Fig). In summary, domains with a WDAS between 0 and 3.0 were defined as promiscuous.

Functional enrichment
For each genome, two lists of protein domains were considered, using a threshold of �2.7 for promiscuous and �2.8 for non-promiscuous domains. An enrichment analysis using a hypergeometric test was conducted on these lists, considering 50 different functional categories obtained from the Superfamily (SUPFAM) and Structural Classification of Protein (SCOP) databases. Such a distribution describes the probability of finding x domains associated with a particular category in a list of interest k, from a set of N domains, with m domains associated with the same category. The P-values were corrected for multiple testing with Benjamini-Hochberger method to decreases the false discovery rate, and a statistical significance at Pvalue of <0.05 was set.

Ancestrality of protein domains
To evaluate the protein domains ancestrality, superfamily assignments were associated to the phylogenomic tree of Protein Fold Families values previously described [15]. In brief, Caetano-Anolles et al. [15] considered a PSI-BLAST comparisons between PDB and genome sequences [16]; and the usage and sharing of protein folds was characterized with the: fold occurrence (G ij ), average genome occurrence (Ḡ i ), and fraction of genomes harboring a fold (f i ). G ij defines how often a protein fold (i) occurs in a given proteome (j). Ḡ i represents averages of G ij values. Ḡ i and f i measure the extent of fold sharing within each domain or combination of domains. Values were converted into linearly ordered multistate characters and normalized using an arbitrary scale (generally 0-20). Cumulative frequency plots were used to illustrate the accumulation of folds belonging to a protein class along a phylogenetic tree, and it is a function of distance in nodes from the root. These plots can be considered time plots of lineages [17] with a time axis defined in relative units.

Distribution of domains according to superfamily assignments
It has been previously described that the increase in complexity of the domain organization of proteins would substantially contribute to the evolution of organismal complexity [2,18]. In order to gain insights into protein architecture among prokaryotes, 1214 bacterial and 105 archaeal non-redundant genomes were analyzed in terms of their repertoire of protein domains. A coverage of assignment of 75.8% per genome, with some genomes with coverage of more than 90%, such as the gammaproteobacterium "Candidatus Evansia muelleri" (GCF_000953435.1), with coverage of more than 95% and representing the organism with the highest proportion of proteins with domain assignments. This result is consistent with the assignment percentage of proteins that were assigned with at least one Superfamily domain for Eukaryotes, Archaea, Bacteria and Viruses [12]. The genomes with the minor coverage of domain assignments corresponded to four mycoplasma genomes: Mycoplasma haemofelis Ohio2 (GCF_000186985.1), a hemoplasma species with a coverage of less than 30% and M. suis str. Illinois (GCF_000179035.2), M. ovis str. Michigan (GCF_000508245.1), and Spiroplasma kunkelii CR2-3x (GCF_001274875.1), with coverages of 33.4%, 33.6%, and 38.2%, respectively, suggesting that a large diversity of novel protein domains remains to be identified in those genomes and must be further explored (S1 Table).
We evaluated all bacterial and archaeal proteins in terms of their domain organization, identifying that 46.7% of proteins contain one domain, 31.3% contain two domains, 11.2% have three domains, and 10.6% contain more than four domains. It is evident from this distribution that more than 53% are multidomain proteins. The proportion of multidomain proteins follows a linear distribution, where large genomes contain a large proportion of two-, three-, or four-domains proteins, suggesting a positive correlation (R 2 = 0.955) between the abundance of multidomain proteins and the number of proteins per genome (S2 Fig and S1 Table), when a non-linear least squared to fit the dataset was used.
In this regard, domains are present in various combinations in multidomain proteins. Thus, in this work, promiscuous domains were defined as domains that reside in many different combinations, i.e. combined with many other domains, a property that suggests a high degree of flexibility to combining with other different domains. Hence, to determine the diversity of structural domains associated with the bacterial and archaeal genomes, we evaluated the repertoire of all non-redundant domains per genome. As shown in Fig 1, organisms with reduced genomes (less than 2000 ORFs) contained a low proportion of non-redundant domains (less than 500 different protein domains) with an increase in organisms with more than 2000 ORFs (around 700 protein domains); however, the maximum number of different domains reached a plateau of around 800 different protein domains in organisms with more than 5000 ORFs. Therefore, small organisms exhibit a low diversity of different structural domains, suggesting that superfamilies at large genomes have increased their repertoire mainly via diverse duplication events followed by evolutionary divergence, as it has been previously suggested in eukarya [19]. In this manner, organisms could recycle protein domains to increase their repertoire and respond to diverse environmental challenges.

Protein architecture provides functional diversity to bacterial and archaeal genomes
Based on the previous result and considering the hypothesis of domain recycling, we evaluated the protein architectures of the 1317 superfamilies. To this end, the complete set of proteins in 1320 bacterial and archaeal genomes was analyzed in terms of structural domain composition, according to the formula described in methods, with the WDAS, a measure of domain promiscuity, determined and plotted as a function of the genome size (see S1 File). Based on the WDAS, we defined that values closer to 0 represent promiscuity (diversity of their protein domain architectures), and higher values suggest no diversity in the protein architecture and, consequently these protein domains must be considered non-promiscuous, or monolithic.
To evaluate the set contribution of promiscuous and non-promiscuous domains to the total repertoire of domains associated with all the genomes, those domains with a WDAS between 0 and 3.0 (See methods) were considered promiscuous. Based on this definition, we found that an average less than 100 different domains per genome that corresponds to around 11.34% of the total protein domains per genome can be considered promiscuous (Fig 1), and species containing the highest proportions of this class of domains are also associated with large bacterial genomes, such as Archangium gephyr (GCF_001027285.1_ASM102728v1) and Sorangium cellulosum So0157-2 (GCF_000418325.1_ASM41832v1), among others. This finding is interesting because it shows that a low proportion of domains are associated with the highest number of architectural combinations, compared to non-promiscuous domains.

Functional association of promiscuous domains
Based on the definition of promiscuity, it was found that the proportion of promiscuous domains represents less than 10% of all non-redundant domains per genome, suggesting that a small fraction of the total repertoire of domains is determinant for the diversity of combinations of architectural identified so far. Therefore, in order to determine the preferential functions of promiscuous and non-promiscuos domains per genome, both datasets were evaluated, considering the functional annotations deposited in the SUPFAM and SCOP databases. In these databases, 50 different functional groups have been determined, and they are classified in eight broad categories (General; Information; Metabolism; Processes_EC, Processes_IC, and Regulation; Other; and N_A). From this analysis, 18 functional categories were identified as significantly enriched in the promiscuous domains in at least one genome, and in particular the Small-molecule binding domains were identified as enriched in 1219 genomes, whereas Signal transduction was enriched in 377 genomes and DNA-binding (transcription) and transferases were enriched in 153 and 118 genomes, respectively. Finally, the other 14 functional categories identified as enriched in minor proportions, such as Polysaccharide metabolism and Inorganic ion transport and metabolism and, were associated with 34 and 25 genomes, respectively (Fig 2).
In contrast, non-promiscuous domains were associated with 6 functional categories, with Unknown function, Translation, ribosomes, ribosome biogenesis; tRNA metabolism and DNA replication, recombination, repair, the most enriched. These functions, associated to the non-promiscuous domains were identified in 207, 112 and 8 genomes, respectively. Therefore, functional associations were different between promiscuous and non-promiscuous domains, suggesting that bias to DNA binding, Transferases, Small-molecule binding and Signal transduction functions are specifically associated with promiscuous domains . Fig 2.

Protein architecture as a function of genome size
In our second approach, we considered the domains as a function of their correlation to genome size, where the correlation between a WDAS for a given domain and the genome size (number of ORFs) could be positive or negative and a set of uncorrelated domains could be identified. To this end, R-values were calculated and two large groups were defined: those negatively correlated and those with no correlation. Therefore, we were interested in the set of negatively correlated domains (R-values between -1 and -0.4), because they showed WDAS values decreasing as a function of genome size, i.e., protein domains are less promiscuous in small genomes and they are more promiscuous in large genomes. In addition, only protein domains identified in more than 60% of the total genomes or 770 different genomes were considered in this analysis, because we were interested in those domains with a wide genomic distribution for our analysis (See S1 File).
From the complete set of 1317 protein domains analyzed, 52 protein domains had negative correlation values (R-values between -1 to -0.4), i.e., the number of architectures associated with them increased as a function of genome size (Table 1). In this category were, the protein domains primarily associated with small-molecule binding, such as ( Table 1. In summary, the four functional categories more abundant associated to this set of promiscuous domains are related to metabolism (other enzymes, and transferases), DNA-binding

The total number of non-redundant domains follows a logarithmic distribution (A), whereas promiscuous domains follow a polynomial distribution (B).
On the X-axis is the genome size (in ORFs), and on the Y-axis is the total number nonredundant (NR) domains identified. A non-linear least squared to fit the dataset was used. https://doi.org/10.1371/journal.pone.0226604.g001 and small-molecule binding. These functional categories suggest that these structural domains could be associated with ancient and global functions that define the basic functions of cellular maintenance, signaling and regulation of gene expression, and they are influenced by the dynamics of the genome architecture in bacteria [20]; i.e. an increase in genome complexity also increases the probable combinations of structural domains. In contrast, three main functions not associated with promiscuous domains were identified: Translation (SSF:100704), metabolism (redox reactions) (SSF:119536), and other enzymes (SSF:170266). These functions were identified in a small proportion of domains compared to promiscuous domains, suggesting that they are not as highly abundant as promiscuous domains.
Although there are diverse and interesting protein superfamily domains, there are two cases to illustrate the association of promiscuity with genome size: the P-loop-containing  The P-loop domain is the most prevalent domain of the nucleotide-binding protein folds, has been classified in the small-molecule binding functional category (HA) and represents 13.8% of the total domains identified as promiscuous. Proteins with this fold catalyze the hydrolysis of the beta-gamma phosphate bond of nucleoside triphosphate (NTP), and the energy from NTP hydrolysis is used to induce conformational changes in other molecules [21]. P-loop NTPases are characterized by two conserved sequence signatures, the Walker A motif (the P-loop proper) and Walker B motifs which bind the beta and gamma phosphate moieties of the bound NTP and a Mg 2+ cation, respectively [22]. Thus, when we analyzed the number of different domains associated with the P-loop, we found a large number of related domains in larger genomes versus smaller genomes, suggesting that these superfamilies not only increase their association with other domains but also with different architectures (Fig 3).
A second representative example is associated with the Winged helix-turn-helix DNAbinding domain. These structures are characterized by the presence of a third alpha-helix and an adjacent beta-sheet and are central components for DNA binding; the "wings" are small beta-sheets [23]. The recognition helix binds as described for regular helix-turn-helix motifs (i.e., with contacts in the major groove of DNA), and the extra secondary structural elements make different DNA contacts, often with the minor groove or the backbone of DNA [24,25]. The wHTH has been identified in almost all microorganisms, from Bacteria to Archaea, and includes diverse families, such as the Catabolite gene activator (CAP) family, in which global regulators (CRP and FNR) have been described in the bacterium Escherichia coli [26], the heat  Protein architecture in prokaryotic genomes shock and E2F/DP TFs, and the Ets domain family, among others [25]. The WDAS associated with the wHTH decreases as a function of the genome size, where larger genomes contain a higher proportion of these proteins, with diverse architectures and a high diversity of structural domains (Fig 4).
In summary, we suggest that the set of 52 superfamilies follows trends similar to those of the P-loop and wHTH superfamilies, reinforcing the notion that these superfamilies not only increase their associations with other domains but also their architecture diversity.

Promiscuous domains are ancient
To evaluate how promiscuity developed during evolution, we mapped the WDASs associated with all domains against the ancient approach described by [6]. This analysis is based on the hypothesis that the most promiscuous domains could be also ancient domains. Therefore, the 52 protein domains identified as promiscuous associated to the genome size and non-promiscuous were traced along with their ancestry according to the approach described by Caetano-Anolles et al. [15]. S3 Table. In brief, Caetano-Anolles et al. consider the timeline of protein domain evolution spanning~3.8 billion years of evolution, where a score of 0 represents the origin of domains and a score of 1 represents the present day. In this way, ancestrality is defined by ancestries of protein domain constituents derived from a structural phylogenomic census [6,15].
In Fig 5, we show the frequencies of the 52 promiscuous domains that correlated with the genome size and identified in this analysis as a function of the ancestry score. From this analysis, two main results can be described: the first is that promiscuous domains are mainly associated with ancient evolutionary events (i.e., antiquity scores close to 0), with few recent emergence events (scores closer to 1) (Fig 5A). In contrast, no promiscuous domains are distributed along the whole antiquity timeline, suggesting that their emergence has been at diverse times in evolution (Fig 5B). In addition, we determined that promiscuous domains are more abundant than non-promiscuous domains. In addition, a Mann-Whitney U test was used to compare the two datasets, finding an U-value of 1064 and because the distribution is approximately normal the Z-Score is -10.62215 with a p-value of < 0.00001, being significant at p < .01. In other words, the difference between the promiscuous and non-promiscuous datasets is statistically significant. In summary, the promiscuous domains are abundant and ancient, such as the P-loop-containing nucleoside triphosphate hydrolases (SF:52540) and the DNA-binding domain (wHTH), because they are recurrent and most ancestral among the universal domains, in comparison to non-promiscuous domains. Therefore, the interplay of promiscuous and no-promiscuous domains determines the architecture of proteins in all the genomes.

Conclusions
In this work, we evaluated 1317 superfamilies for their architectures, their abundance in bacterial and archaeal genomes, their functional roles and their antiquities. Our data reinforce that increased gene complexity also requires the development of mechanisms for gene regulation at the transcription level and small-molecule binding, where the P-loop and the wHTH are the most significant examples, suggesting that the interplay of these structural domains could  On the X-axes are antiquity assignments, i.e., how ancient each structural domain present in the universal enzymatic reactions is, as suggested by Caetano-Anolles et al [15]; a score of 0 represents an ancient event, whereas 1.0 increase the ability of an organism to recognize and respond to diverse environmental stimuli. These findings are relevant in the context of functional categories, because promiscuous domains are mainly associated with basic metabolic, signal transduction, and gene regulation functions. We also found 52 domains whose promiscuity value or WDAS depends on genome size, and this was reinforced by the number of domains associated with each of them. In summary, we consider the association between these two classes of protein domains (promiscuous and non-promiscuous) has provided, since ancestral times, bacterial and archaeal cells with the ability to respond to diverse environmental challenges.   Table. Antiquity and abundance of structural domains. Columns correspond to: Domain; Function; SCOP; Description; ancient_order; Ancient_score; R-value; Total_Domains; Archaea_Domains; and Bacteria_Domains. (TXT) S1 File. Boxplot of the architecture of the 52-superfamily domains as a function of genome size. On the X-axis of each graph, genome size ranges are displayed in 13 windows, with a represents recent domain emergence. On the Y-axis is the abundance (frequencies) of the superfamilies. The size of a circle represents the proportion of each domain in relation to the total protein domains.