Structural Characteristics of Novel Protein Folds

Folds are the basic building blocks of protein structures. Understanding the emergence of novel protein folds is an important step towards understanding the rules governing the evolution of protein structure and function, and towards developing tools for protein structure modeling and design. We explored the frequency of occurrence of an exhaustively classified library of supersecondary structural elements (Smotifs) in protein structures in order to identify features that would define a fold as novel compared to previously known structures. We found that a surprisingly small set of Smotifs is sufficient to describe all known folds. Furthermore, novel folds do not require novel Smotifs, but rather are new combinations of existing ones. Novel folds can be typified by the inclusion of a relatively higher number of rarely occurring Smotifs in their structures and, to a lesser extent, by a novel topological combination of commonly occurring Smotifs. When investigating the structural features of Smotifs, we found that the top 10% most frequent ones have a higher fraction of internal contacts, while some of the rarest motifs are larger and contain a longer loop region.


Introduction
Under physiological conditions most proteins self-assemble into unique structures that dictate their interactions with other molecules and determine their function. Protein structures can be decomposed into individually folding units, so-called folds [1]. A fold is determined from the number, arrangement, and connectivity (topology) of secondary structure elements [2]. Manually curated [3], semi-automated [4] and automated approaches [5,6] classify protein folds by organizing them into hierarchical systems. Due to the lack of a clear understanding of how to define and classify folds, these various subjective approaches carry substantial inconsistencies [2,7]. Meanwhile, recent studies paint a more nuanced picture of the fold universe of proteins, one that is more continuous in nature, where some higher density hubs formed by related structures correspond to and connect known folds [8,9,10,11]. Part of the motivation to rethink the nature of the protein fold universe is provided by the apparent success of molecular modeling efforts that use short amino acid segments from known protein structures to build up novel folds [12]. Additional motivation comes from anecdotal examples that identify structures representing transitions between previously described folds, which either result in a unification of different fold families or suggest removing fold definitions altogether [13,14]. One such example is described for the RIFT domain, where it is suggested that, starting from an ancestral RIFT domain, a strand invasion and a strand-swap event (with subsequent duplication and fusion events) resulted in the emergence of the swapped hairpin and double-psi beta barrel folds, respectively [15]. These folds cannot be interconverted with simple topological modifications, such as circular permutation, although their common evolutionary origin has been established.
Since the definition of complete folds is ambiguous, one has to consider structural definitions of smaller (local) entities, such as supersecondary structure elements, that could describe protein folds and the structure universe in a more quantitative and systematic manner. Supersecondary structure elements are defined as a number of regular secondary structure elements that are linked by loops (e.g. Rossmann, helix-turn-helix, four-strand Greek key, β-meander motifs etc.). Folds are formed by the overlapping combination of various supersecondary elements, which are shared among different proteins and sometimes highly repeated within the same one. This observation prompted the theory of a relic peptide world [16], which proposes that modern, stable proteins are the results of duplication, mutation, shuffling and fusion of a limited set of relic peptides. Various efforts have tried to explore possible tool sets of supersecondary elements, such as antiparallel β-sheets [17], αββ and ββα motifs [18], αα-turn motifs [19], four-helix bundles [20] and so on. Building on these earlier efforts, we introduced a new, general, supersecondary structure classification that fully describes all known protein structures [21]. In this schema a basic supersecondary motif, which we will refer to as an Smotif, is composed of two regular secondary structure elements linked by a loop. Smotifs are characterized in protein structures by the types of sequential secondary structures and by the geometry of the orientation of the secondary structures with respect to each other, as described by four internal coordinates [21,22].

Author Summary
Structural genomics efforts aim at exploring the repertoire of three-dimensional structures of protein molecules. While genome-scale sequencing projects have already provided us with all the genes of many organisms, it is the three-dimensional shape of gene-encoded proteins that defines all the interactions among these components. Understanding the versatility and, ultimately, the role of all possible molecular shapes in the cell is a necessary step toward understanding how organisms function. In this work we explored the rules that identify certain shapes as novel compared to all already known structures. The findings of this work provide possible insights into the rules that can be used in future work to identify or design new molecular shapes or to relate folds with each other in a quantitative manner.

Smotif geometrical classification and saturation in the PDB
We explored the frequency of occurrences of all Smotifs in all protein folds. We established an exhaustive library of 324 types of Smotifs. Previously we have shown that Smotifs are useful for loop prediction because loop conformations (as defined by the orientation of the bracing secondary structures) of up to 10-12 residues are exhaustively sampled in the PDB [21,23]. We further refined this observation by exploring the increase of coverage of Smotifs in the PDB over time (Fig. 1). Approximately 10 years ago all categories of Smotifs were already represented by at least one example.

Structural factors affecting Smotif occurrence
The occurrence of Smotif geometries in different types of protein folds is uneven (Fig. 2). Some Smotif geometries are ubiquitous and occur in many different folds, while others are specific to a few. Fig. 2 displays a ββ class Smotif (a) that is highly represented across different folds, corresponding to a geometry that tightly aligns two β-strands and thus allows many non-bonded contacts to be formed. Meanwhile, another Smotif within the ββ class (b), which is structurally similar but where one of the β-strands is tilted, has a very low occurrence within known folds. Similar trends can be observed for αα, αβ, and βα Smotifs: Smotifs forming extensive non-bonded interactions occur more frequently in known folds. We explored the normalized number of intra-motif non-bonded contacts as a function of Smotif frequency and found an exponential correlation between the number of contacts and the frequency of motif usage (correlation of r = 0.83 as fitted on a logarithmic scale), indicating that the most frequent motifs (top 10%) form more contacts. However, there is no statistically significant correlation for the rest of the Smotif frequencies (Fig. 3). Another suspected factor for Smotif preferences is their size, as large Smotifs simply cannot fit into smaller folds. Here we found no clear tendency except, once again, for the top 10% most frequent Smotifs, which indeed tend to be smaller: on average 12 residues (σ = 6) in total within the bracing secondary structures, without counting the variable number of loop residues, while motifs at all other frequencies are generally formed by 16 residues (σ = 8). The longer the loop connecting the bracing secondary structures, the more likely it is that contacts will be formed between nonproximal secondary structures, e.g. a ββ-type Smotif that connects strands of two β-sheets.
We also explored whether solvent accessibility is correlated with the frequency of Smotifs, as one could suspect that buried, conserved cores would be formed by frequently occurring Smotifs, while structural regions outside the common core would tend to comprise a higher proportion of rare Smotifs due to a less restrictive structural environment. However, we could not find any statistically significant correlation between the frequency of Smotifs and their exposure (Fig. S1).

Smotif distribution in novel and known folds
Since the repertoire of Smotifs seems to have come close to saturation (Fig. 1) [23], the question arises of what is really unique about a fold structure when it is identified as "novel". Detecting novel folds is a non-trivial task. Automated structural comparisons are often followed by manual inspection to characterize new protein structures. We explored proteins that were classified as novel at the time of their discovery in two expert-validated sources: the archives of SCOP [3] and the series of CASP experiments [24]. We found that proteins that were considered novel folds at CASP 3-6 meetings (years 1998-2004) [...] [26], those that are adopted by many different sequences, often with different functions, are built by Smotifs that occur with medium or high frequencies in existing folds. This implies that novel folds are composed of a new permutation of existing Smotifs and, specifically, that a structure will have a greater likelihood of being "novel" if it is enriched with rarely occurring Smotifs. This phenomenon becomes especially apparent when the relative frequency of occurrence of Smotifs drops below 0.09 (Fig. 5, Fig. S2, Fig. S3).
Two examples of the above observations are illustrated in Fig. 6. The first example is the new fold target T0181, discussed above (PDB code: 1nyn; Fig. 6A). The second example is a member of the immunoglobulin fold (PDB code: 1gyv; Fig. 6B), which is one of the most populated folds. Target T0181, a new fold structure, can be decomposed into 7 Smotifs, of which five are considered low frequency (i.e. frequency smaller than 0.01, or less than 1%). For the representative structure of the immunoglobulin fold (SCOP fold descriptor 48725, Immunoglobulin-like beta sandwich), the opposite situation occurs: five out of the 7 Smotifs that comprise the structure are very well represented (high frequency) in the pool of Smotifs (Fig. 6B).
One could speculate that some novel folds were discovered only recently simply because their structures are harder to determine experimentally. We used the XtalPred program [27] to predict the crystallizability of 347 new folds and 2802 known folds, all solved in approximately the same time period (since SCOP 1.73, released in 2007). We found that new folds from the most recent SCOP release, 1.75, indeed have a small tendency to be less amenable to experimental determination. However, XtalPred and other prediction methods for protein crystallizability rely heavily on known homologs of a query sequence. The rationale is that if a protein with a similar sequence has been solved before, this usually indicates that the particular protein family is more experimentally tractable. This artifact is illustrated in our analysis by the fact that while new folds from SCOP 1.75 do show less favorable XtalPred scores compared to known folds, this difference disappears in the case of new folds of SCOP 1.73 (Fig. 7).

Novel folds as an unusual combination of common Smotifs
Another plausible way to generate new folds is to combine otherwise common Smotifs in an unusual sequence, resulting in a new topology. To explore this, we calculated a Novelty Z-score for each protein, which was obtained from the product of individual Smotif frequencies. The hypothesis is that if the Novelty Z-score of some novel folds is similar to that of known folds, then the novelty in these cases must be a consequence of a never-before-seen combination of otherwise common Smotifs rather than a result of being constructed from rare Smotifs. While new folds from the CASP dataset do show a distribution of Novelty Z-scores biased towards low values (Fig. S4), in the case of SCOP 1.75 (Fig. S5) and SCOP 1.73 (Fig. S6) most novel folds are indistinguishable from already known structures in terms of their overall Novelty Z-scores, which indicates that these structures may indeed be a new topological arrangement of common Smotifs. However, one may note the more frequent extreme negative outliers in the distributions for the novel folds in these datasets (averages and standard deviations are -1.03 ± 1.1, 0.25 ± 1.35 and 0.0 ± 1.0 for the CASP dataset, SCOP 1.75, and SCOP 1.73, respectively). This means that although novel folds are often built using a higher proportion of rare Smotifs, in many cases these folds are novel because their Smotifs are assembled in an unusual sequence. This is illustrated with Target T0201 (CASP 6) and the 50S ribosomal protein L6P (PDB code 1s72, chain E), which share 3 out of 6 of their Smotifs (Fig. 8). However, the sequential arrangement of these shared Smotifs is different, yielding different topologies.
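As a rough illustration of how such a score might be computed, the following Python sketch takes a per-protein score to be the logarithm of the product of Smotif frequencies divided by the number of Smotifs, and then standardizes the scores across a set of proteins. The normalization by the number of Smotifs, the use of the natural logarithm, and the function names `novelty_score` and `z_scores` are assumptions made for this example and are not taken from the text.

```python
import math
import statistics
from typing import List, Sequence


def novelty_score(frequencies: Sequence[float]) -> float:
    """Assumed per-protein score: log of the product of Smotif
    frequencies, divided by the number of Smotifs (N)."""
    # Summing logs avoids floating-point underflow of a long product.
    return sum(math.log(f) for f in frequencies) / len(frequencies)


def z_scores(scores: Sequence[float]) -> List[float]:
    """Standardize scores with the population mean and standard deviation."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population standard deviation
    return [(s - mu) / sigma for s in scores]


# Hypothetical frequencies for three proteins (illustrative values only).
proteins = {
    "rare_rich": [0.002, 0.004, 0.15, 0.003],
    "mixed": [0.05, 0.12, 0.02, 0.20],
    "common_rich": [0.20, 0.18, 0.25, 0.15],
}
scores = [novelty_score(f) for f in proteins.values()]
for name, z in zip(proteins, z_scores(scores)):
    print(f"{name}: {z:+.2f}")
```

Under these assumptions, a protein enriched in rare Smotifs receives a low (negative) Z-score, whereas a protein built from common Smotifs in an unusual order would not be distinguished by this score alone.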

Discussion
Since the early 1990s, it has been clear that the universe of protein folds is much more limited and redundant than the sequences adopting them [28]. Structural biology and the recently launched Structural Genomics efforts have discovered a large subset of possible fold shapes. Many predictions suggest that most of the folds are already known [28,29,30]. Meanwhile, as many of the possible folds have been solved, the characteristic differences described earlier among fold definitions have become more blurred [8,10,31]. In practice, discovering all possible folds may be an impossible task, partly because it is now clear that the definition of folds is highly subjective [2], and partly because the distribution of folds is extremely uneven: while only a dozen superfolds seem to populate half of a typical genome, and only about 200 folds populate 2/3 of it, it is possible that many thousands of more rarely occurring shapes need to be discovered to reach 80-90% coverage of all possible shapes that were established during evolution [32,33].
In this work we explored the entirety of protein shapes from the perspective of their Smotif building blocks, which can be defined more objectively than the folds themselves, and which are observed to be nearly completely sampled in the currently known structures. Using this repertoire of Smotifs, we observed that novel folds can be distinguished from already discovered ones by the presence of rare Smotifs and, less often, by an unusual combination of otherwise common Smotifs. The most frequently used motifs have a higher average number of internal contacts, while some of the rarest motifs are larger and contain longer linker regions. These observations may be useful starting points for future work to identify or design sequences that are likely to constitute "novel" folds.
While in this work we defined Smotifs according to practical considerations and did not investigate whether these Smotifs, or a subset of them, could also serve as possible units for structural evolution, it is worth mentioning other studies that identified similar structural elements as possible building blocks of structural hierarchy using different approaches. The so-called Closed Loops were identified by their close Cα-Cα contacts from solution structures and were found to have a nearly standard size (27 ± 5 residues). This typical size distribution of Closed Loops was supported by polymer statistics, as it is the theoretical optimal size for loop closure, and Closed Loops were subsequently suggested to be a universal building block of protein folds [34,35]. In another approach, dynamic Monte Carlo simulations of the alpha-carbon chain in a lattice model with 24 nearest neighbors identified clusters of "most interacting residues", which serve as anchors for protein folding [36]. These anchors were found to be conserved hydrophobic clusters of residues that hold together the so-called Tightened End Fragments, which essentially correspond to the Closed Loop definition. Finally, a recent paper revisits the idea of ancient relic peptides of 20-40 residues in length that co-occur in different structural contexts and are suggested to be an ancestral pool of peptide modules [37].

Definition of an optimal classification of Smotif geometry
An Smotif is defined by two consecutive regular secondary structure elements (i.e. α-helix or β-strand) connected by a loop. The N- and C-terminal regular secondary structures of an Smotif are referred to as SS1 and SS2, respectively. Motif geometry refers to the local spatial arrangement of SS1 with respect to SS2, as introduced in [22], using four internal coordinates. Briefly, SS1 and SS2 were represented by their principal moments of inertia (M1 and M2). If P1 and P2 are the end point of SS1 and the start point of SS2, and L is the vector between P1 and P2, then plane P is defined by M1 and L, and plane C is defined by M1 and the normal to plane P. The geometry of an Smotif is expressed by four measures: the distance (D) between the C-terminus of SS1 and the N-terminus of SS2 (the distance between P1 and P2) and three angles: a hoist (δ), the angle between L and M1; a packing (θ), the angle between M1 and M2; and a meridian (ρ), the angle between M2 and plane C (Fig. 2 in [21]).
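The distance and the two unsigned angles can be illustrated with a short Python sketch. It assumes that the axis vectors M1 and M2 and the points P1 and P2 have already been computed elsewhere, and that M1 and M2 are oriented from the N- to the C-terminus; the function name `smotif_geometry` and the dictionary keys are hypothetical. The meridian is deliberately left out, because its 0 to 360 degree sign convention is specified in [22] rather than here.

```python
import math
from typing import Dict, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a: Vec) -> float:
    return math.sqrt(_dot(a, a))


def _angle_deg(u: Vec, v: Vec) -> float:
    """Unsigned angle between two vectors, in degrees (0 to 180)."""
    cos_uv = _dot(u, v) / (_norm(u) * _norm(v))
    cos_uv = max(-1.0, min(1.0, cos_uv))  # guard against rounding error
    return math.degrees(math.acos(cos_uv))


def smotif_geometry(m1: Vec, m2: Vec, p1: Vec, p2: Vec) -> Dict[str, float]:
    """Distance D, hoist and packing for one Smotif.

    m1, m2: axis vectors of SS1 and SS2 (assumed precomputed).
    p1, p2: end point of SS1 and start point of SS2.
    """
    l_vec = _sub(p2, p1)  # vector L from P1 to P2
    return {
        "D": _norm(l_vec),                 # distance between P1 and P2
        "hoist": _angle_deg(l_vec, m1),    # angle between L and M1
        "packing": _angle_deg(m1, m2),     # angle between M1 and M2
    }


# Two antiparallel axes whose connecting vector is tilted away from M1.
print(smotif_geometry((0, 0, 1), (0, 0, -1), (0, 0, 0), (3, 0, 4)))
```

In this toy case the distance is 5 Å and the packing angle is 180 degrees, as expected for two antiparallel elements.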
A library has been established that classifies each Smotif in all PDB structures. This library is organized in a two-level hierarchy. At the first level of classification, (i) Smotifs are identified according to the type of bracing secondary structures (αα, αβ, βα and ββ), following the definition of secondary structure by the DSSP program [39]. At the second level, (ii) Smotifs are grouped according to their geometry, as described above [21,22]. A protein structure can, therefore, be expressed as a string of overlapping Smotifs where the SS2 from one Smotif constitutes the SS1 in the following Smotif.
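The overlapping decomposition can be sketched in a few lines of Python. The example assumes that the regular secondary structure elements of a chain are already available in sequence order, labelled here with the DSSP-style codes "H" (helix) and "E" (strand); the function name `smotif_string` is hypothetical.

```python
from typing import List, Tuple


def smotif_string(sse_types: List[str]) -> List[Tuple[str, str]]:
    """Pair each secondary structure element with the next one.

    The SS2 of one Smotif is reused as the SS1 of the following
    Smotif, so n elements yield n - 1 overlapping Smotifs.
    """
    return list(zip(sse_types, sse_types[1:]))


# A chain with the element order strand, strand, helix, strand.
print(smotif_string(["E", "E", "H", "E"]))
# prints [('E', 'E'), ('E', 'H'), ('H', 'E')]
```

Each pair gives the first-level class of the Smotif; the second-level geometric class would be attached to each pair separately.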
The geometrical values used in the second level of classification are distributed in a continuous space. Distance is distributed between 0 and 40 Å (values larger than 40 Å are assigned to 40 Å), the δ and θ angles span from 0 to 180 degrees, and the ρ angle spans from 0 to 360 degrees. In order to compare Smotif geometries, the parameter spaces of geometrical values were binned, where each bin is defined by the 4 parameters described above. A range of binning sizes and parameter intervals was explored for the four variables in order to get the sharpest partitioning power of the geometrical space with the smallest possible number of bins (Fig. S7). The quality of the binning was assessed by calculating the RMSD (Root Mean Square Deviation) and the LGA scores [40] upon structural superposition for all Smotifs that were classified in the same or in different geometrical bins. The optimal bin partitioning for each parameter was obtained by studying the distribution of distance and angle values of Smotifs in SCOP 1.71 proteins and resulted in only 324 types of Smotif definitions using the following binning values: 4 Å bins for distance, 60 degree bins for δ and θ starting at 0 degrees, and 60 degree bins for ρ starting at 30 degrees. At this level of bin resolution the RMSD upon structural superposition of more than 75% of Smotifs that belong to the same geometrical bin falls below 1 Å (Fig. S7).
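A minimal Python sketch of how a geometry could be mapped to a four-dimensional bin with these bin widths is given below. The handling of boundary values (folding a distance of exactly 40 Å and angles of exactly 180 degrees into the last bin, and wrapping the meridian around 360 degrees) is an assumption, as is the function name `geometry_bin`. The sketch makes no attempt to reproduce the reported total of 324 types, which depends on details not given here.

```python
from typing import Tuple


def geometry_bin(distance: float, hoist: float, packing: float,
                 meridian: float) -> Tuple[int, int, int, int]:
    """Map (D, hoist, packing, meridian) to a tuple of bin indices."""
    d = min(distance, 40.0)                 # values above 40 A are set to 40 A
    d_bin = min(int(d // 4.0), 9)           # 4 A bins; 40 A joins the last bin (assumed)
    hoist_bin = min(int(hoist // 60.0), 2)      # 60 degree bins starting at 0
    packing_bin = min(int(packing // 60.0), 2)  # 60 degree bins starting at 0
    # 60 degree bins starting at 30 degrees, wrapping around 360 (assumed).
    meridian_bin = int(((meridian - 30.0) % 360.0) // 60.0) % 6
    return (d_bin, hoist_bin, packing_bin, meridian_bin)


print(geometry_bin(5.0, 30.0, 90.0, 20.0))   # prints (1, 0, 1, 5)
print(geometry_bin(55.0, 180.0, 0.0, 30.0))  # prints (9, 2, 0, 0)
```

Starting the meridian bins at 30 degrees means that a meridian of 20 degrees falls in the same bin as one of 350 degrees, which is why the example wraps the angle before dividing.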
A program that defines Smotifs is available upon request from the authors. The first comparison was based on the type of secondary structures and the geometry (D, hoist, packing, and meridian) of Smotifs. In a second, stricter comparison, the lengths of the flanking secondary elements (SS1 and SS2) were also compared. If these lengths differed by more than 2 or 4 residues in the case of strands or helices, respectively, the Smotifs were considered different.

Calculating frequencies of occurrences of Smotif classifications
To avoid redundancy when calculating the frequencies of Smotif occurrences for each four-dimensional geometric bin, only a single protein was selected from each protein fold (as defined by the SCOP database). Since fold families contain more than one protein structure, and structures that belong to the same fold may have a variable number of Smotifs, this selection process was repeated 100 times, randomly selecting a different protein in each analysis. Therefore, the frequency of occurrence of a given geometrical bin is the average of counts computed from 100 rounds of analysis for each family.

[...] where N is the number of Smotifs and fr is the frequency of Smotif i as calculated previously. Individual scores were converted into statistical Z-scores using the mean (μ) and standard deviation (σ) of the population of scores, as

$$\mathrm{Zscore}(i) = \frac{\mathrm{score}(i) - \mu}{\sigma} \qquad (2)$$
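The resampling procedure described in this section, in which one protein is drawn per fold and the counts are averaged over 100 rounds, can be sketched in Python as follows. The input layout (a dictionary mapping each fold to a list of proteins, each protein being a list of bin labels), the function name `average_bin_counts` and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List


def average_bin_counts(folds: Dict[str, List[List[Hashable]]],
                       rounds: int = 100,
                       seed: int = 0) -> Dict[Hashable, float]:
    """Average per-bin counts over repeated one-protein-per-fold draws.

    folds maps a fold identifier to its proteins; each protein is the
    list of geometric bin labels of its Smotifs (hypothetical layout).
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    totals: Counter = Counter()
    for _ in range(rounds):
        for proteins in folds.values():
            chosen = rng.choice(proteins)  # one representative per fold
            totals.update(chosen)          # count every Smotif bin it contains
    return {bin_id: count / rounds for bin_id, count in totals.items()}


# Toy input: fold f1 has two proteins with different Smotif content.
toy = {
    "f1": [["bin_a", "bin_b"], ["bin_a", "bin_a", "bin_c"]],
    "f2": [["bin_a"]],
}
print(average_bin_counts(toy, rounds=100))
```

Dividing each averaged count by the sum over all bins would give a relative frequency; whether that normalization matches the frequencies reported here is not stated in this section.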

Calculating non-bonded contacts in Smotifs
Internal contact ratio was calculated as the number of nonbonded atomic contacts (i.e. H-bonds, polar contacts, hydrophobic contacts) between SS1 and SS2 divided by Smotif size. Contacts were defined by the Contact of Structural Units (CSU) program [41]. CSU is based on the detailed analysis of interatomic contacts and interface complementarity. For every structural unit CSU calculates the solvent accessible surface of every atom and determines the contacting residues and type of interactions they undergo including all putative hydrogen bond contacts.

Calculating crystallizability
Protein crystallizability was predicted with the XtalPred server [27]. XtalPred predicts protein crystallizability by combining nine features, including length, length of predicted disorder, Gravy index, insertion score, instability index, percent of coil structure, and isoelectric point. Based on these features the protein is assigned to one of five crystallization classes: optimal, suboptimal, average, difficult and very difficult. Each class represents a different crystallization success rate observed in TargetDB [42]. Three SCOP domain datasets were compiled for submission to XtalPred: domains from "new folds" as defined in (1) SCOP 1.75 and (2) SCOP 1.73, respectively, and (3) domains in SCOP 1.75 that were added since the release of SCOP 1.73 and that were not new folds. This ensures that we are focusing on proteins that were solved at approximately the same time but were classified differently in terms of novelty. The amino acid sequences of the domains were obtained from the ASTRAL website (astral-scopdom-seqres-gd-all-1.75.fa, astral-scopdom-seqres-gd-all-1.73.fa). Sequence redundancy was removed among the domains using CD-HIT clustering [43] at a 95% sequence identity threshold. The SCOP 1.75 and 1.73 "new fold" domain datasets contained 170 and 177 representative sequences (out of 517 and 558 redundant sequences), respectively, and the SCOP 1.75 "known fold" dataset contained 2802 representative sequences (out of 13,043 redundant ones). Each amino acid sequence was submitted to XtalPred to calculate the crystallizability class.

Solvent accessibility of Smotifs
The corresponding PDB structure, chain identification and residue range were located for each Smotif (369,859 Smotifs in total). We calculated ACC values (water exposed surface area or number of water molecules in contact with the residue) using the DSSP program [44]. The average solvent accessibility of Smotifs was calculated by averaging the ACC values over all residues of the Smotif. We also calculated average ACC values for each Smotif excluding loop residues, which are usually exposed, but the conclusions were not affected. Proteins were grouped according to the number of structures per fold. Seven categories were described: new fold (blue rhomboid); folds with: 1 protein (green triangle), 2 to 10 (purple box), 10 to 50 (cyan box), 50 to 100 (orange circle), and more than one hundred proteins (red box), respectively. The values were plotted as a histogram of frequencies with a log scale on the X-axis. The same dataset and approach is used to avoid redundancy as in