β-Propeller Blades as Ancestral Peptides in Protein Evolution

Proteins of the β-propeller fold are ubiquitous in nature and widely used as structural scaffolds for ligand binding and enzymatic activity. This fold comprises between four and twelve four-stranded β-meanders, the so-called blades, which are arranged circularly around a central funnel-shaped pore. Despite the large size range of β-propellers, their blades frequently show sequence similarity indicative of a common ancestry, and it has been proposed that the majority of β-propellers arose divergently by amplification and diversification of an ancestral blade. Given the structural versatility of β-propellers and the hypothesis that the first folded proteins evolved from a simpler set of peptides, we investigated whether this blade may have given rise to other folds as well. Using sequence comparisons, we identified proteins of four other folds as potential homologs of β-propellers: the luminal domain of inositol-requiring enzyme 1 (IRE1-LD), type II β-prisms, β-pinwheels, and WW domains. Because, with increasing evolutionary distance and decreasing sequence length, the statistical significance of sequence comparisons becomes progressively harder to distinguish from the background of convergent similarities, we complemented our analyses with a new method that evaluates possible homology based on the correlation between sequence and structure similarity. Our results indicate a homologous relationship of IRE1-LD and type II β-prisms with β-propellers, and an analogous one for β-pinwheels and WW domains. Whereas IRE1-LD most likely originated by fold-changing mutations from a fully formed PQQ motif β-propeller, type II β-prisms originated by amplification and differentiation of a single blade, possibly also of the PQQ type. We conclude that both β-propellers and type II β-prisms arose by independent amplification of a blade-sized fragment, which represents a remnant of an ancient peptide world.


Introduction
The number of possible amino acid sequences available to proteins is tremendous. At the median protein length of 300 residues, there are 20^300 (~10^390) different amino acid combinations, a number so big that life could not possibly have explored all of these sequences to arrive at the current complement of proteins. Instead, it has become evident that the number of proteins observed today is much smaller and that most proteins resemble other proteins. The reason lies in the descent of modern proteins from autonomously folding units called domains. These domains gave rise to new proteins by amplification, recombination, and divergence, and they are thought to have mostly been established at the time of the last common ancestor.
On the structural level, proteins resemble each other even more than on the sequence level. Owing to biophysical constraints, unrelated sequences converged on the same set of only ~1000 folded conformations found in nature. Therefore, structural similarity alone cannot be used to assess whether two proteins have a common origin. The aforementioned vastness of sequence space, however, makes it unlikely that two sequences converge to a significant similarity, and sequence similarity is therefore considered the hallmark of homology.
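The size of this sequence space can be checked with a few lines of arithmetic. The snippet below is only an illustrative calculation; the function name `log10_sequence_count` is ours and not part of the original work.

```python
import math

def log10_sequence_count(length, alphabet_size=20):
    """Return log10 of the number of possible sequences of a given length."""
    return length * math.log10(alphabet_size)

# Median protein length of 300 residues, 20 standard amino acids.
exponent = log10_sequence_count(300)
print(f"20^300 is roughly 10^{exponent:.1f}")  # prints: 20^300 is roughly 10^390.3
```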
However, as homologies become more remote and the number of residues that can be compared decreases, it becomes progressively harder to establish statistically significant similarity between sequences over the background. In such cases it would be beneficial to include structural information into the comparisons, because, even though prone to convergence, structures diverge more slowly than sequences. A method to do this was recently introduced in order to establish cases of distant homology [14]. Its rationale is that homologs were almost identical in sequence and structure when they started to diverge from their common ancestor. Over time, these proteins accumulated differences, resulting in progressively lower similarities both in sequence and, more slowly, also in structure. Due to the continuity of this process, we expect to see a positive correlation between sequence and structure similarity for homologous proteins. In contrast, analogs should have varying degrees of structural similarity, mostly independent of sequence similarity, and sequence similarities should generally be low. Sequence and structural similarity scores of analogs are thus expected to be uncorrelated.
It is conceivable that specific local structures might restrict the possible amino acids at one or more positions of the protein, leading to a similar correlation between structure and sequence similarity. A test of this possibility in an evolutionary study of the origins of outer-membrane β-barrels did not uncover such a correlation [14], as expected from the observation that domains are multiply convergent at the supersecondary structure level without an accompanying increase in sequence similarity.
In previous studies, we established homology between proteins of different folds based on the analysis of common fragments [4,5,8,14]. These fragments were presumably already found in the last common ancestor of these proteins and were preserved until today even though the proteins themselves underwent fold-changing events. In one of our studies, we found that β-propellers, which adopt folds comprising 4 to 10 repeats of a 4-stranded β-meander called a 'blade', can be seen for the most part to have arisen by the independent amplification and diversification of one ancestral blade [15].
The β-strands in each of these blades are named A to D from the N-terminal innermost strand to the C-terminal outermost one [15]. In most β-propellers, β-strands from both the N- and C-terminal regions of the domain constitute the first blade and form a stabilizing velcro closure. Irrespective of the number of blades, β-strands A, B, and C of different blades are usually superimposable with a root-mean-square deviation (RMSD) of below 1 Å, even though the insertions between these β-strands vary [15].
Given their versatility in forming β-propellers with different blade numbers, it seemed possible that blades may represent ancient peptides that also gave rise to other folds. In this study we therefore extended our previous efforts by including structural information in the detection of β-propeller homologs. We used the aforementioned method of analyzing sequence similarity as a function of structural similarity to distinguish homology from cases of structure-induced sequence similarity. Here, we show the results of these analyses and report on four potential homologs of β-propeller blades.

Cluster Maps
Dataset SCOPβ+. We created the SCOPβ+ dataset by extending the all-β class of SCOP70 1.75 (current release, dated June 2009, clustered at 70% sequence identity), which we chose as a suitable background for distinguishing β-propeller homologs from analogs with similar secondary structure composition (see Results). The extension step was necessary to include potential β-propeller homologs that are not part of the all-β class of SCOP and to include structures not classified in SCOP.
To establish a scaffold for our extension, we first searched the PDB70 database as available on April 5, 2012 (PDB clustered at 70% sequence identity) using HHpred [16,17], a sensitive remote homology detection method based on the pairwise comparison of hidden Markov models (HMMs). As query, we used various β-propellers from SCOP and recurrently found matches to 4- to 8-bladed β-propellers (folds b.66-b.70), type II β-prisms (fold b.78), and WW domains (superfamily b.72.1).
The actual extension step started by including all proteins of the all-β class of SCOP70 and extending it by systematically searching PDB70 with all proteins of the aforementioned scaffolding groups. These searches were conducted using the global-alignment mode of HHsearch, the search procedure of HHpred, and matches below 40% probability were discarded. The similarity of some queries led to overlapping matches to the same template protein. We therefore considered all matches to one template in order of decreasing length and kept only those with more than 50% of their residues not already covered by previously accepted matches. In total, the SCOPβ+ dataset comprises 3223 entries.
For each entry of SCOPβ+, a multiple sequence alignment was computed with the buildali.pl script (a modified PSI-BLAST [18] procedure) and hhmake was used to convert the alignments to HMMs; both programs are part of the HHpred package. The HMMs of all entries were kept as query HMMs and additionally a single database HMM was created by merging all of them.
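The overlap filter described above can be sketched as a greedy procedure. This is a minimal illustration, not the code used in the study: it assumes each match is given as an inclusive (start, end) residue range on the template, and the function name `filter_overlapping_matches` and the parameter `min_new_fraction` are hypothetical.

```python
def filter_overlapping_matches(matches, min_new_fraction=0.5):
    """Greedily keep matches that mostly cover not-yet-covered template residues.

    matches: list of inclusive (start, end) residue ranges on one template
             (an assumed representation for this sketch).
    """
    covered = set()   # template residues claimed by accepted matches
    accepted = []
    # Consider matches in order of decreasing length.
    for start, end in sorted(matches, key=lambda m: m[1] - m[0], reverse=True):
        residues = set(range(start, end + 1))
        new_fraction = len(residues - covered) / len(residues)
        # Keep the match only if more than half of it is new coverage.
        if new_fraction > min_new_fraction:
            accepted.append((start, end))
            covered |= residues
    return accepted

# Three overlapping matches to the same hypothetical template.
print(filter_overlapping_matches([(1, 100), (40, 120), (90, 200)]))
# prints: [(90, 200), (1, 100)]
```

In this toy input, the match (40, 120) is dropped because all of its residues are already covered by the two longer matches.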

Clustering Procedure
We searched the SCOPβ+ database with each query HMM using HHsearch in global-alignment mode to obtain an all-vs-all matrix of similarity p-values. These p-values were extracted from the result files and converted to a CLANS [19] input file using the bio.io.hhpred and bio.io.clans modules of CSB, respectively [20]. The cluster map was computed from the input file using the force-directed layout method implemented in CLANS (attract and repulse value 10) at a p-value threshold of 1e-5 until equilibrium was reached.

Spurious Connections
We found false-positive connections in the cluster map and removed them after manual verification (dashed boxes in Figure 1). A representative example stems from the extension search with the N-terminal 7-bladed β-propeller in nitrous oxide reductase (SCOP d1fwxa2) as query. This search resulted in matches to a template protein (3HRP) comprising two domains: a 6-bladed β-propeller and an immunoglobulin-like E set domain. Due to a misaligned match, both template domains were covered and, instead of the expected β-propeller domain, almost the complete protein was included in SCOPβ+. In the cluster map, this protein was located amidst β-propeller proteins due to its β-propeller domain, but was also, and spuriously so, connected to the immunoglobulin domains.

Correlation of Structural and Sequence Similarity
Dataset. First, we created a template dataset consisting of all single-chain SCOP70 entries as well as the β-pinwheels and the luminal domains of inositol-requiring enzyme 1 (IRE1-LD) proteins. We created HMMs for all 13654 proteins in the dataset as described in the section on cluster map creation.
Next, we chose proteins for a 'background' dataset, which contains the SCOP all-β class structures that were neither β-propellers nor considered potential homologs of them, i.e. we excluded β-pinwheels, type II β-prisms, and WW domains. We used this dataset to evaluate which correlation levels are to be expected for structurally similar yet analogous proteins.
Finally, we assembled a query dataset of 583 blade-like structures from all β-propellers (SCOP folds b.66-b.70), type II β-prisms (b.78), and WW domains (superfamily b.72.1) of SCOP70, β-pinwheel fragments, and IRE1-LD fragments. This dataset contains blades and similar β-meanders that we extracted by manual inspection of the structures.
WW domains were restricted to four residues before the first and three residues after the second conserved tryptophan, similar to their Pfam definition (PF00397) [21].
The sequences of β-pinwheels are not continuous when considering β-strands A-D of one blade in structural order. This makes it impossible for TM-align to reasonably align β-pinwheel and β-propeller blades. Thus, we 'rewired' the main chain of all β-pinwheel blades by inserting the residues of β-strands B and C (the putative β-hairpin invasion) in between β-strands A and D of their blade. We mapped the positions of the reordered residues to the standard β-pinwheels and computed sequence scores using their HMMs.
Both IRE1-LD structures (PDB 2BE1 and 2HZ6) contain five potential blade homologs; however, two of them are not in a β-meander conformation but in a long, extended β-hairpin. We excluded the two elongated instances, as the structural alignment score for them would not be meaningful, and added the remaining three blade-like fragments to the dataset.
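The rewiring and position mapping can be pictured with a toy example. The sketch below is purely illustrative and rests on assumptions: segments are given as 0-based half-open index ranges on the original chain, and the function name `rewire_blade` and its arguments are hypothetical, not taken from the study.

```python
def rewire_blade(sequence, segment_a, hairpin_bc, segment_d):
    """Reorder one blade so the B/C hairpin sits between strands A and D.

    Each segment is a 0-based half-open (start, end) range on the original
    chain (an assumed representation for this sketch).
    Returns the reordered sequence and, for every reordered position, the
    index of the corresponding residue in the original chain.
    """
    original_positions = []
    for start, end in (segment_a, hairpin_bc, segment_d):
        original_positions.extend(range(start, end))
    reordered = "".join(sequence[i] for i in original_positions)
    return reordered, original_positions

# Toy chain: strand A, strand D, a linker, then the invading B/C hairpin.
chain = "aaaaddddxxbbbccc"
reordered, mapping = rewire_blade(chain, (0, 4), (10, 16), (4, 8))
print(reordered)    # prints: aaaabbbcccdddd
print(mapping[4])   # prints: 10
```

The returned position list is what allows scores computed on the original chain, for example from its HMM, to be looked up for each residue of the reordered blade.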
As the full-length proteins of all fragments in the query dataset are in the template dataset, we mapped the query fragments onto them, which allowed us to use the template HMMs for sequence score computations.
Correlation calculations. We aligned each query-template pair using TM-align [22] and obtained the query length-normalized TM-score, which we used as our structural similarity score. The TM-score is a value in the interval (0, 1], where perfectly identical structures have a value of 1 whereas random pairs of structures have a value of ~0.17 [23,24]. Sequence similarity scores were calculated by aligning the query and template HMMs according to the TM-align structural alignment using HHalign, the HHsearch scoring procedure. The score HHalign returned was normalized by the number of aligned residues. For the sake of simplicity, we call these sequence similarity scores 'HHalign-scores' for the remainder of the manuscript. In the final step, we used SciPy [25] to calculate the correlation between TM- and HHalign-scores for subsets of the query and template datasets.
Figure 1 (caption fragment). [...] propellers, primarily of viral origin, remain unconnected in sequence space, as discussed previously [15]. Clusters in dashed boxes were omitted in the detailed analysis after manual inspection (see "Spurious connections" in the cluster map section of the Methods). The purple groups are different superfamilies of the β-prism type I fold (b.77), unrelated to the β-prism type II fold discussed in this manuscript (see Fig. 2). The four clusters discussed in this manuscript are in red circles. doi:10.1371/journal.pone.0077074.g001
Correlation significance. To assess the statistical significance of each correlation, we assumed a linear dependency between TM- and HHalign-scores. For each set of comparisons (e.g. the set of scores of all comparisons of IRE1-LD queries against PQQ motif β-propeller templates), we did a linear regression (using SciPy) and computed a t-test with the null hypothesis that the slope is zero. In other words, we assessed whether the TM- and HHalign-scores are significantly related. We chose a significance level of 1e-3 and, unless otherwise noted, the correlation values in this manuscript imply significant relationships.
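A minimal sketch of such a significance test with SciPy is given below. It assumes that the reported correlation is the Pearson coefficient (the text does not name the coefficient) and uses `scipy.stats.linregress`, whose p-value tests the null hypothesis of a zero slope; the function name `sequence_structure_correlation` and the example scores are invented for illustration.

```python
import numpy as np
from scipy import stats

def sequence_structure_correlation(tm_scores, hhalign_scores, alpha=1e-3):
    """Regress HHalign-scores on TM-scores and test for a non-zero slope."""
    tm = np.asarray(tm_scores, dtype=float)
    hh = np.asarray(hhalign_scores, dtype=float)
    fit = stats.linregress(tm, hh)
    return {
        "correlation": float(fit.rvalue),   # Pearson r (assumed coefficient)
        "slope": float(fit.slope),
        "p_value": float(fit.pvalue),       # null hypothesis: slope is zero
        "significant": bool(fit.pvalue < alpha),
    }

# Invented scores: sequence similarity rising with structural similarity.
tm = np.linspace(0.3, 0.9, 30)
hh = 2.0 * tm - 0.5 + 0.05 * np.sin(np.arange(30))
print(sequence_structure_correlation(tm, hh)["significant"])
```

For a simple linear regression, the test on the slope and the test on the Pearson correlation are equivalent, so a single call yields both the correlation value and its significance.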

Results
To detect homologs of β-propellers, we clustered the SCOPβ+ dataset based on pairwise sequence similarities. Almost all β-propellers clustered together, as already observed in our previous analysis of the evolution of β-propellers (Fig. 1) [15]. For a detailed inspection, we concentrated on proteins with direct or transitive connections to β-propellers at a p-value cutoff of 1e-5 and omitted all others (Fig. 2A). To annotate groups within this map, we reclustered it at a more stringent cutoff (1e-15), which clearly resolved many groups and allowed us to annotate them by manual inspection. The annotations were transferred to the initial cluster map where the groups remained well defined and resolved, also at the less stringent cutoff used for this map (Fig. 2B).
The cluster map depicts the high degree of interconnectedness between different groups of β-propellers (Fig. 2). The biggest cluster acts as a hub for the connections to the outer clusters and is formed by 5- to 8-bladed propellers of known groups: WD40, KELCH, YWTD, YVTN, NHL [15], PQQ [26], Clathrin [27], and PD40 (Pfam PF07676). The proximity of these different groups in the cluster map indicates close homology, yet the different groups form distinguishable subclusters.
Adjacent to the hub, three β-propeller clusters are formed by the 4-bladed Hemopexin-like domain family (SCOP identifier b.66.1.1), the RCC1/BLIP-II superfamily (b.69.5), and the loosely connected 7-bladed Sema domain superfamily (b.69.12). Also directly connected to the hub is a large cluster formed by the Asp-Box β-propellers, which are mostly 6- and 7-bladed but also contain the only known 10-bladed β-propeller, Sortilin [28]. The Asp-Box β-propellers are further tightly connected to proteins of the 5-bladed glycoside hydrolase family 43 and more loosely to two 6-bladed Enterobacteria phage K1F β-propellers and to the Integrin cluster.
Interestingly, we found four groups of proteins in the cluster map that are not β-propellers, yet are connected to them: luminal domains of inositol-requiring enzyme 1 (IRE1-LD), type II β-prisms (BP2), β-pinwheels, and WW domains. These groups vary in the strength of their connections to β-propellers, from the loosely connected WW domains outgroup to the highly connected IRE1-LD.
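Selecting everything with a direct or transitive connection to a set of seed proteins amounts to a graph traversal over the thresholded p-value matrix. The sketch below is an illustration under assumptions, not the CLANS implementation: pairwise p-values are given as a dictionary keyed by identifier pairs, connections are treated as undirected, a pair counts as connected when its p-value is at or below the cutoff, and the name `connected_to_seeds` is hypothetical.

```python
from collections import deque

def connected_to_seeds(pvalues, seeds, cutoff=1e-5):
    """Return all entries directly or transitively connected to the seeds."""
    # Build an undirected neighbor map from pairs passing the cutoff.
    neighbors = {}
    for (a, b), p in pvalues.items():
        if a != b and p <= cutoff:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    # Breadth-first traversal starting from the seed entries.
    reached = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached

# Invented identifiers and p-values for illustration only.
pvalues = {
    ("prop1", "prop2"): 1e-20,
    ("prop2", "ire1"): 1e-8,
    ("ire1", "ig1"): 1e-3,    # too weak: not a connection at 1e-5
    ("ig1", "ig2"): 1e-30,
}
print(sorted(connected_to_seeds(pvalues, {"prop1"})))
# prints: ['ire1', 'prop1', 'prop2']
```

Rerunning the same traversal with a more stringent cutoff, such as 1e-15, breaks weak links and splits the map into smaller groups, which is the effect exploited for annotation.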
In the following sections, we report on our investigations of each of the four folds with respect to an origin from an ancestral blade.
Luminal Domain of Inositol-requiring Kinase 1
IRE1-LD (2BE1 and 2HZ6) is located within the main β-propeller cluster. This domain detects unfolded proteins in the endoplasmic reticulum as part of the unfolded protein response [29] and was predicted to adopt a β-propeller fold due to the detection of four blade-like repeats resembling those of an 8-bladed β-propeller [30]. However, both IRE1-LD structures were found to share a unique fold that consists of a flat anti-parallel β-sheet, formed by β-strands from two monomers as part of their homodimer interface, and α-helices on one side of the β-sheet that form a groove [31,32]. Further, the fold has two lobes that are described as a distorted β-barrel and a partial β-propeller for the yeast structure, and as two β-barrels for the human one [31,32].
Due to the striking proximity of IRE1-LD to β-propellers in the cluster map, we investigated this relationship in detail. We ran confirmatory HHpred searches with both IRE1-LD proteins as query against the full PDB70 database. The resulting matches were almost exclusively to β-propellers (yeast IRE1-LD: 252 of 258 matches to β-propellers; human IRE1-LD: 332 of 335) and all other matches had low probabilities. Except for a single low-scoring match, the RBB1NT domain of human retinoblastoma-binding protein 1 (2YRV) at 24% probability, all non-β-propeller matches were to WW domains and type II β-prisms, both of which are described later in this article. Reverse searches with the top-ranked β-propeller matches confirmed the connection to IRE1-LD.
Next, we were interested in whether state-of-the-art repeat detection methods would be able to automatically detect the four blade-like repeats previously found with a semi-automated procedure [30]. We ran the sensitive repeat detection tool HHrepID [33] with the two IRE1-LD sequences as query and both runs detected five repeats. The previously described repeats were the first, third, fourth, and fifth repeat in HHrepID, whereas the second repeat was newly detected. While the first, second, third, and fifth repeat had high probabilities (80%-92%), the probability of the fourth repeat varied between yeast and human IRE1-LD (37% and 89%). However, the sequence segment of this repeat was the same as previously reported and it aligned well to the other repeats.
Mapping the repeats onto the structure revealed that repeats 1, 2, and 5 are three-stranded β-sheets (Fig. 3). In contrast, repeat 3 contains a long central β-strand and two shorter β-strands that form N- and C-terminal β-hairpins with the central one. Repeat 4 comprises two long β-strands that form an elongated β-hairpin. Repeats 1 and 5 constitute the aforementioned partial β-propeller lobe, whereas repeat 2 is part of the putative distorted β-barrel lobe [31]. The elongated repeats 3 and 4 are part of the large β-sheet at the homodimeric interface.
To investigate the structural similarity between IRE1-LD repeats and β-propeller blades, we chose the yeast protein as a representative, as a β-strand of repeat 3 in the human structure is not solved. We superimposed repeats 5 and 1 onto two consecutive blades of the 8-bladed BamB β-propeller (3Q7M), which was the top match in the aforementioned HHpred run. Interestingly, this also superimposed the C-terminal β-hairpin of repeat 3 onto the third consecutive blade of the β-propeller, i.e. repeats 5, 1, and 3 are alignable to three consecutive blades. The superimposition aligns the three repeats to the outer blade β-strands, which is peculiar given that strand D is known to be the structurally least conserved one in β-propellers [15]. The newly detected repeat 2 is slightly more distorted than repeats 1 and 5 and therefore did not align as well to β-propeller blades. In a superimposition of repeat 2 and one BamB blade, repeat 3 again comes close to the subsequent β-propeller blade, albeit not as well as when repeats 5 and 1 are used to set the superimposition.
The aforementioned BamB, along with many other top matches of the IRE1-LD HHpred searches, belongs to the PQQ family of β-propellers. These proteins contain an 11-residue motif on β-strands C and D of each blade, which ends with a tryptophan at position 11 (Fig. 3) [26]. The motif comprises two key structural components: (1) residues 6 and 7 of one blade are arranged parallel to the indole ring of Trp11 from the previous blade and (2) the main chain carbonyl of residue 4 is hydrogen-bonded to the Trp11 indole NH group within the same blade [26]. We analyzed IRE1-LD with respect to these two features and found that they are mostly conserved in the structural interactions between repeats 5, 1, and 3. In yeast IRE1-LD, repeat 1 interacts with both structural neighbors, whereas in human IRE1-LD only the interaction between repeats 5 and 1 is seen due to missing density. The more distorted repeat 2, as well as the elongated β-hairpin-like repeat 4, do not show these characteristics. As the conserved residues of PQQ β-propellers are located in β-strands C and D, and play a structural role, it is less surprising that the IRE1-LD repeats align to the outer β-strands and not to the usually well-conserved β-strand A.
To further verify these findings, we applied a method that analyzes the correlation between structure and sequence similarity (in the following: sequence-structure correlation; see Methods) (Fig. 4A). We omitted IRE1-LD repeats 3 and 4, as their elongated β-hairpin-like structures make them unsuitable to compute sensible structural alignments to β-propeller blades. The correlation between structure and sequence similarity scores when comparing the IRE1-LD repeats to the background set (see Methods) was 0.11 (median TM- and HHalign-score: 0.38 and -0.18). As the background set is a subset of the SCOP all-β class, a low non-zero correlation was to be expected due to shared β-strand propensity. In comparisons of IRE1-LD to β-propellers with different blade numbers, 8-bladed β-propellers had the highest correlation value (correlation 0.72, TM 0.60, HHalign 0.59). We found that the overall highest correlation was achieved in the comparison to the aforementioned PQQ subset of 8-bladed β-propellers and these comparisons also had remarkably high sequence similarity scores (correlation 0.89, TM 0.63, HHalign 1.17). Even though IRE1-LD adopts a fold that is globally different from a β-propeller, our analysis indicates that IRE1-LD is closely related to PQQ β-propellers. The antecedent blades are still detectable as repeats even though they only have three β-strands remaining or have changed their conformation. The complexity of the IRE1-LD fold and the five PQQ-like repeats make it unlikely that this fold has arisen by amplification from a single blade. Instead, it is conceivable that a PQQ β-propeller underwent a massive fold change, which was retained due to its emergent usefulness in ER stress sensing.

Type II β-prism (BP2)
The second group of potential β-propeller homologs in our cluster map are type II β-prisms (BP2, SCOP fold b.78). Proteins with this fold form a superfamily of phylogenetically widespread lectins, referred to as Galanthus nivalis agglutinin-related lectins (GNA-related lectins) after the first structure of this fold [34]. The BP2 fold comprises three four-stranded β-meanders that are arranged around and orthogonal to a central pseudo-symmetry axis and are curved towards the center (Fig. 5A). Similar to β-propellers, which circularly permute between one and three β-strands of a terminal blade in order to hydrogen-bond their N- and C-termini and achieve increased stability (velcro closure), BP2 proteins also use velcro closure for their domain organization and dimerization [34,35]. The sugar-binding motif is located on the outer, concave side of up to three of the β-sheets [36,37]. Even though sugar binding is their most discussed function, GNA-related lectins also (1) possess anti-tumor, anti-fungal, and anti-viral activity [38,39], (2) bind the HIV surface glycoprotein GP120 [40], and (3) can be taste-modifying [41].
It is important to discriminate BP2 from the type I β-prism (BP1, b.77), which resembles BP2 structurally but has β-strands running parallel to the pseudo-symmetry axis. BP1 proteins also bind carbohydrates with up to three binding sites, and a common origin of BP1 and BP2 has been discussed without clear conclusion [42]. The large distance between BP1 and BP2 proteins in our cluster map (Fig. 1) indicates that even the most sensitive homology detection methods cannot connect them; thus, they should be considered analogs.
The BP2 cluster in our cluster map is an outgroup to the 8-bladed PQQ β-propellers, which are found in the central cluster. A multiple-structure alignment of the three β-sheets of a BP2 (1XD5) and the eight blades of a PQQ β-propeller (BamB, 3Q7M) shows that all three BP2 β-sheets align well with PQQ blades (Fig. 5B). Further, a conserved tryptophan in β-strand 4 of the BP2 β-sheets superimposes, with slightly different orientation, onto the conserved tryptophan in position 11 of the PQQ-specific motif (see IRE1-LD section). The major difference is a two-residue deletion in the BP2 β-sheets, corresponding to positions 5 and 6 in the PQQ motif (Fig. 5C). BP2 may compensate for the missing stabilizing interaction, which residue 6 provides in PQQ motif blades by coordinating the tryptophan sidechain, through the interaction of its three tryptophan residues in the core of the structure [43]. These differences to the conserved PQQ motif might explain the location of BP2 as an outgroup of PQQ β-propellers. To verify the presumed homology of BP2 and PQQ β-propellers, we analyzed their sequence-structure correlation (Fig. 4B). The similarity scores for structure and sequence comparisons between BP2 and the background set were low and uncorrelated (correlation 0.16 [...] Sequence searches with single BP2 β-meanders against PDB70 showed that these are more similar to each other than to any β-propeller blade, suggesting that the BP2 repeats were amplified from a single blade of a PQQ β-propeller.

β-pinwheels
Proteins that adopt the β-pinwheel fold are the third group with connections to β-propellers in our cluster map. They are DNA-binding modules of bacterial type IIA topoisomerases. The first structures with this fold were the C-terminal domains (CTD) of DNA gyrase A (GyrA, 1SUU) and of the topoisomerase IV ParC subunit (1WP5) [44]. DNA gyrase is capable of introducing negative supercoils into DNA; however, this function is lost upon removal of either its complete CTD or of a conserved motif therein, the GyrA box [45,46]. In contrast, topoisomerase IV, which antagonizes DNA gyrase by relaxing supercoiling, remains functional without the CTD but loses specificity for positive supercoiling [44].
Structurally, β-pinwheels resemble β-propellers, with four-stranded β-sheets circularly arranged around a central pore. Yet the folds differ due to a β-hairpin invasion between neighboring β-pinwheel blades (Fig. 6) [47]. Even though they are, strictly speaking, not β-propellers, SCOP classifies them into the 6-bladed β-propeller fold (b.68), where they constitute their own superfamily called "GyrA/ParC C-terminal domain-like" (b.68.10). Interestingly, β-pinwheel structures exist in different variants: completely closed circular forms and C-shaped open forms that can be planar or spiral-shaped. It has been suggested that GyrA always has six blades whereas the number in ParC varies from three to eight, and it was hypothesized that ParC evolved from GyrA [48].
In DALI searches [49] for structures similar to β-pinwheels, using the CTD of GyrA and ParC as query, the β-hairpin invasion leads to a clear separation of matches to β-pinwheels (Z-scores >16) and β-propellers (Z-scores <5), which are the top matches besides β-pinwheels. In these searches, the 6-bladed closed β-pinwheels were most similar to 6-bladed β-propellers, whereas the C-shaped forms with five or six blades had 7-bladed β-propellers as top matches. In these searches, we found six additional β-pinwheel domains (1ZI0, 1ZVU, 1ZVT, 3L6V, 3NO0, 3UC1) and conducted HHpred searches for all eight β-pinwheels against PDB70. We pooled the results into a non-redundant list and, after the self matches, 33 and 3 of the following 40 matches were to 7- and 8-bladed β-propellers, respectively, and only 4 low-scoring matches were to proteins of other folds. The majority of the β-propeller matches were to 7-bladed β-propellers with the WD40 motif, which is in agreement with the cluster map, where β-pinwheels almost exclusively connect to WD40 β-propellers. For confirmatory reverse searches, we used the 10 best β-propeller matches. In all cases, the best β-pinwheel match had a probability >50% and in eight of ten searches >80%. All reverse searches matched multiple β-pinwheels and the matches were interspersed with matches to various β-propeller groups. An earlier study had proposed RCC1 as the group of β-propellers with the highest similarity to β-pinwheels [50], but our analysis indicates only a transitive connection between these groups via the proteins of the main β-propeller cluster, a finding consistent with the previously noted lack of key RCC1 residues in gyrase A [51].
Due to the rather low sequence similarity of β-pinwheels and WD40 β-propellers, which is also evident from their distance in the cluster map, it is not surprising that the WD40 motif-defining tryptophan and aspartate residues are not conserved in β-pinwheels.
To investigate whether the sequence similarity between β-pinwheels and β-propellers could be structure-induced, we again computed sequence-structure correlations (Fig. 4C). Due to the β-hairpin invasion, TM-align is unable to align β-pinwheel and β-propeller blades in a reasonable way; therefore we created artificially reordered β-pinwheel blades (see Methods). The correlation of structure and sequence similarity between the reordered β-pinwheels and the background set was 0.12 (TM 0.39, HHalign -0.13), which is in line with the results for IRE1-LD and BP2. The correlation of scores between the reordered β-pinwheels and the WD40 β-propellers, which were their best sequence matches, was indistinguishable from the background (correlation 0.12, TM 0.56, HHalign 0.37). Both median scores are higher than for the background set, but there is no significant correlation between them, indicating that the sequence similarity may be structure-induced and thus pointing to a convergent origin of WD40 β-propellers and β-pinwheels (see Methods), as previously proposed [52].
The apparent similarity of β-pinwheels to β-propellers in sequence searches may be due to the two folds being formed by repeats of the same length and secondary structure. This is because the statistical significance of comparisons between repetitive proteins increases with the number of repeats that can be matched, even when the repeats individually have little or no detectable similarity. In this case, searches with single reordered β-pinwheel repeats did not show even low-scoring matches to β-propellers. We therefore conclude that this similarity is not indicative of homology.

WW Domain
The fourth group we found connected to β-propellers in our cluster map is the WW domain superfamily (b.72.1). Members of this superfamily adopt a fold of ~38 residues comprising a curved three-stranded β-meander with two highly conserved tryptophan residues [53]. The N-terminal tryptophan is located in the first β-strand and projects to the convex side of the β-sheet, whereas the C-terminal tryptophan is in the third β-strand and has its side chain on the concave side. Together with a conserved tyrosine in the central β-strand, the latter forms a binding site for proline-rich motifs (Fig. 7) [54]. WW domains are known to occur in tandems of up to four copies, and one reason for this amplification might be to increase binding affinity [55,56]. Structurally, a WW domain corresponds to three β-strands of one β-propeller blade.
In our cluster map, WW domains are loosely connected to the main β-propeller hub, and HHpred searches with single domains often had β-propellers as low-scoring matches, with similar results for the reverse searches. Since, as mentioned for β-pinwheels, the statistical significance of comparisons between repetitive proteins increases with the number of repeats that can be matched, we decided to compare searches with single domains to searches using several domains in tandem.
Searches of single WW domains (1E0L, 1E0N, 1PIN, 1WR4) with HHpred against PDB70 yielded matches to IRE1-LD and several β-propellers, scattered sparsely among other matches and mostly with probabilities below 40% (but occasionally as high as 70%). Although the second conserved tryptophan was in some cases aligned to the conserved tryptophan of PQQ β-propellers and IRE1-LD, many high-scoring matches did not have conserved residues at this position.
Searches of double WW domains (1O6W and 2JXW) showed an increase in the number and probabilities of matches to IRE1-LD and β-propellers, particularly to the 8-bladed PQQ β-propellers (up to 93%). Here, two consecutive blades frequently aligned with no or only a few gaps to the query WW domains, and the conserved C-terminal tryptophan residues in each repeat were aligned.
Searches of quadruple WW domains (gi|73919464:363-554, gi|2072503:300-477, gi|73921204:193-581) confirmed our previous results. Here again, BamB was among the top β-propeller matches (88% probability) and it covered the four WW domains with four consecutive blades, the conserved PQQ motif tryptophan of all four blades being matched to the second WW domain tryptophan.
To assess the structural similarity of WW domains and PQQ motif blades, we compared a double WW domain (1O6W) to its top-matching β-propeller, the 8-bladed BamB, in structure and sequence (3Q7M; Fig. 7B and C). The superimposition had a root-mean-square deviation (RMSD) of 1.9 Å over the three β-strands of the WW domain and the alignment was gapless.
As discussed for β-pinwheels, the tandem domains might have elevated scores due to the alignment of multiple consecutive repeats, which in this case might be further enhanced by the repetition of tryptophan at particular sequence intervals. Hence, this finding is not per se indicative of a homologous relationship.
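For readers unfamiliar with the measure, the RMSD quoted above can be illustrated with a minimal sketch. It assumes that the two sets of matched atoms (for example, Cα atoms of aligned residues) have already been superimposed; the superposition step itself is not shown. The function name `rmsd` and the coordinates are hypothetical and are not taken from the structures analyzed here.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates that are assumed to be already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        # Squared distance between the two atoms of a matched pair.
        squared += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates in Angstrom, for illustration only:
# every atom of set_b lies 2 Angstrom away from its partner in set_a.
set_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
set_b = [(0.0, 0.0, 2.0), (3.8, 0.0, 2.0), (7.6, 0.0, 2.0)]
example_rmsd = rmsd(set_a, set_b)
```

Because the squared deviations are averaged over matched atom pairs, the value depends on which residues are included in the comparison, which is why the RMSD is reported together with the region it covers.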
To gain more clarity on the issue of homology vs. analogy, we analyzed sequence-structure correlations (Fig. 4D). As in the aforementioned cases, the score correlation between WW domains and the background set was low, at 0.05 (TM 0.38, HHalign −0.41). To our surprise, neither of the β-propeller groups found in the HHpred analysis had significant correlations with WW domains (correlation against PQQ β-propellers −0.18, TM 0.46, HHalign −0.04). In conjunction with the sequence searches described above, we conclude that the similarity between WW domains and β-propellers is fortuitous and does not reflect common ancestry.
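The logic of the sequence-structure correlation can be sketched as follows. This is a minimal illustration, assuming a Pearson coefficient computed over pairwise comparisons, each of which yields one structure score and one sequence score; the measure actually used is defined in the Methods and in [14]. The function name `pearson` and all score values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores, one entry per pairwise comparison (illustration only).
structure_scores = [0.30, 0.40, 0.50, 0.60, 0.70]

# Case 1: sequence similarity rises together with structure similarity.
sequence_scores_coupled = [0.10, 0.25, 0.35, 0.55, 0.60]
# Case 2: sequence similarity varies independently of structure similarity.
sequence_scores_uncoupled = [0.40, 0.10, 0.50, 0.10, 0.40]

r_coupled = pearson(structure_scores, sequence_scores_coupled)
r_uncoupled = pearson(structure_scores, sequence_scores_uncoupled)
```

In this toy example the first case gives a coefficient close to 1 and the second a coefficient of essentially zero. In the analysis above, the absence of a significant correlation is what is interpreted as consistent with an analogous origin, even where both scores are elevated relative to the background set.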

Discussion
In our search for β-propeller homologs with different folds, we detected four candidate groups: IRE1-LD, BP2, β-pinwheels, and WW domains. These were connected to β-propellers at various levels of statistical significance in sequence comparisons. The question of their evolutionary relationship with β-propellers touches on the problem of distinguishing remote homologs from analogs, a problem that has been discussed for many decades [57,58]. In this study we have approached this question by complementing detailed, HMM-based sequence comparisons with a recently introduced method that evaluates possible homology based on the correlation between sequence and structure similarity [14]. Our results substantiate a homologous relationship between IRE1-LD, BP2, and β-propellers, but indicate that β-pinwheels and WW domains are most likely of analogous origin.
We have shown previously that β-propellers have arisen for the most part by the independent amplification and diversification of one ancestral blade [15]. A fundamental question in evaluating the evolutionary relationship of IRE1-LD and BP2 to β-propellers is thus whether they also trace their origin to a single blade. In the case of IRE1-LD, the individual repeats are not more similar to each other than to blades of PQQ motif β-propellers and part of the repeats occur in the same geometry. Overall, the IRE1-LD repeats are so similar to PQQ motif blades that they are found in the same sequence cluster, distinct from clusters formed by other β-propellers (Fig. 2). This suggests that IRE1-LD evolved from a PQQ motif β-propeller by a number of mutations that led to a substantial fold change, rather than by amplification of a single PQQ motif blade. We find that the path taken, however, cannot be reconstructed at this time by concatenation of known fold-changing mechanisms [1,2,59], since no intermediate forms appear to have survived. We note that the part of the IRE1-LD repeats that can still be related to PQQ motif blades by sequence similarity corresponds to blade β-strands B-D, strand A having been replaced in the process of fold change with heterologous segments of the polypeptide chain.
In the case of BP2, conversely, the high self-similarity of its repeating units and their distinctness from the blades of β-propellers indicate a monophyletic origin from an ancestral blade. While it remains unclear whether the BP2 and β-propeller folds arose concomitantly from the same ancestral blade, or whether BP2 emerged subsequently from the amplification of a β-propeller blade that made itself independent of its parent structure, we note that the particular similarity of BP2 to PQQ motif blades suggests the second scenario, with BP2 arising from the blade of a PQQ β-propeller. In this case, again, the part of BP2 repeats that can be related to PQQ motif blades by sequence similarity corresponds to blade β-strands B-D, strand A being formed by an N-terminal extension that completes each repeat consecutively, constraining the structure to an overall triangular shape (Fig. 5A). It thus seems possible that the BP2 fold arose by amplification of only the three C-terminal β-strands of a PQQ motif blade and that the N-terminal extension providing the fourth strand to each repeat is of heterologous origin. Experimentally, it may be possible to test the viability of this scenario by attempting to complement triple repeats of three-stranded β-meanders derived from the C-terminal part of PQQ motif blades with heterologous sequences in a phage display assay. Nevertheless, whether such a process actually led to the emergence of BP2 remains conjectural at this time, as a higher sequence similarity of BP2 repeats to blade β-strands B-D over other segments of three consecutive β-strands in PQQ β-propellers is not observable.
The homologous relationships highlighted here exemplify a problem of current protein classification systems. Due to their tree-like structure and their treatment of structural, i.e. analogous, aspects as the primary means of differentiation, these systems can only represent homologous connections between proteins that share the same fold. As a result, fold-spanning homology, as in the cases presented here, cannot be captured. To alleviate this issue, we recently proposed the "metafold" as a new classification level, where homologous proteins can be grouped across different folds [4]. The concept of metafolds can further be applied to bring together proteins that originated from the same ancestral peptide, yet show no global sequence similarity [60]. Once such a systematic grouping of proteins exists, all analogous criteria could be removed from the classification, which would result in a classification by natural descent.