Conceived and designed the experiments: NFF AF. Performed the experiments: NFF JMD. Analyzed the data: NFF JMD AF. Wrote the paper: NFF AF.
The authors have declared that no competing interests exist.
Folds are the basic building blocks of protein structures. Understanding the emergence of novel protein folds is an important step towards understanding the rules governing the evolution of protein structure and function and for developing tools for protein structure modeling and design. We explored the frequency of occurrences of an exhaustively classified library of supersecondary structural elements (
Structural genomics efforts aim at exploring the repertoire of three-dimensional structures of protein molecules. While genome-scale sequencing projects have already provided us with all the genes of many organisms, it is the three-dimensional shape of gene-encoded proteins that defines all the interactions among these components. Understanding the versatility and, ultimately, the role of all possible molecular shapes in the cell is a necessary step toward understanding how organisms function. In this work we explored the rules that identify certain shapes as novel compared to all already known structures. The findings of this work provide possible insights into the rules that can be used in future work to identify or design new molecular shapes or to relate folds with each other in a quantitative manner.
Under physiological conditions most proteins self-assemble into unique structures that dictate their interactions with other molecules and determine their function. Protein structures can be decomposed into individually folding units, so called folds
Since the definition of complete folds is ambiguous, one has to consider structural definitions of smaller (local) entities, such as supersecondary structure elements, that could describe protein folds and the structure universe in a more quantitative and systematic manner. Supersecondary structure elements are defined as a number of regular secondary structure elements that are linked by loops (e.g. Rossmann, helix-turn-helix, four-strand Greek key, β-meander motifs etc.). Folds are formed by the overlapping combination of various supersecondary elements, which are shared among different proteins and sometimes highly repeated within the same one. This observation prompted the theory of a relic peptide world
These previous observations motivated us to analyze the occurrence of Smotifs among protein folds and explore the question of what is really unique about a structure that is identified as “novel”. Does the emergence of a novel fold coincide with the emergence of novel Smotifs that are integrated into a structure with known ones? Is it possible to generate novel folds solely from existing Smotifs? What are the rules that guide combinations of Smotifs to an apparently novel fold? Is the novelty of a certain Smotif or the novelty of combining well-known Smotifs the driving force behind the appearance of novel folds? These questions might be relevant to shed light on the rules governing protein structure evolution. There are practical considerations to understanding the actual limits of the definition and novelty of a fold. Exploring these issues can aid in developing more accurate structure modeling tools and support the design and realization of new and experimentally accessible molecular shapes.
We explored the frequency of occurrences of all Smotifs in all protein folds. We established an exhaustive library of 324 types of Smotifs, as classified by their geometry, for each of the four combinations of possible bracing secondary structure elements. We have shown that this geometrical classification of Smotifs correctly captures local structural similarity (see Definition of optimal classification of Smotif geometry in
Each curve corresponds to one of the four Smotif categories: strand-strand (purple), strand-helix (green), helix-helix (blue) and helix-strand (red). The cumulative distribution on the plot is obtained by summing the first appearances of Smotifs in 324 geometrical definitions as a function of time.
The occurrence of Smotif geometries in different types of protein folds is uneven (
Smotif frequencies are shown separately for types of α-α (A), α–β (B), β–α (C), and β–β (D). A non-redundant library of folds (one randomly picked structure from each SCOP fold class) was decomposed into Smotifs and the resulting distributions are shown. Standard deviations are shown as error bars and were obtained by repeating the random selection process 100 times.
IC ratio refers to the average number of internal contacts per residue; loop length is the length of the connecting segment between the two regular secondary structures within Smotifs, and Smotif size is the total number of residues in the Smotifs.
Another suspected factor for Smotif preferences is their size, as large Smotifs simply cannot fit into smaller folds. Here we found no clear tendency except, once again, for the top 10% most frequent Smotifs, which indeed tend to be smaller: on average 12 (σ = 6) residues in total within the bracing secondary structures, not counting the variable number of loop residues, whereas motifs at all other frequencies are generally formed by 16 residues (σ = 8). The longer the loop connecting the bracing secondary structures, the more likely it is that contacts will be formed between non-proximal secondary structures, e.g. a ββ-type Smotif that connects strands of two β-sheets. A correlation was found between the length of the loop within Smotifs and the frequency of Smotif usage in folds among the 50% least frequent Smotifs. However, Smotifs extracted from new folds do not show a correlation between Smotif size or loop length and Smotif frequency; i.e. new folds are not necessarily formed by large Smotifs and do not necessarily have particularly long loops (data not shown).
We also explored whether solvent accessibility is correlated with the frequency of Smotifs, as one could suspect that buried, conserved cores would be formed by frequently occurring Smotifs, whereas structural regions outside the common core would tend to comprise a higher proportion of rare Smotifs because of a less restrictive structural environment. However, we could not find any statistically significant correlation between the frequency of Smotifs and their exposure (
Since the repertoire of Smotifs seems to have come close to saturation (
Each of the Smotifs was used as a probe to search a backdated database of protein structures. The PDB code, chain, start, and end residue position that match the specific Smotif is shown below each structure.
CASP meeting | # Smotifs | # Smotifs with new geometry | # Smotifs unique FSS |
3 | 62 | 0 | 2 |
4 | 72 | 0 | 3 |
5 | 42 | 0 | 4 |
6 | 59 | 0 | 4 |
SCOP dataset | | | |
1.73 | 4567 | 0 | 45 |
1.75 | 3489 | 0 | 42 |
Number of Smotifs with new geometrical classification after comparing with Smotifs extracted from protein structures already known.
Number of Smotifs that are formed by flanking secondary structures (FSS) SS1 and SS2 with unique lengths as compared to all previously known ones. For example, protein 1fw9 chain A was considered a new fold during the CASP4 meeting (target id. T0086). It has a ββ motif between residues 73 and 95. The specific Smotif geometry was present in the backdated protein databank, but none of the Smotifs with the same geometry had two beta strands of comparable length (beta strand lengths are 10 and 11 residues for SS1 and SS2, respectively).
When we explored the frequency of occurrence of Smotifs in the non-redundant set of known folds, we observed that novel folds have a larger fraction of Smotifs that have a low frequency of occurrence in the PDB (
Proteins were grouped according to the number of structures per fold. Seven categories were described: new folds (blue rhomboid); folds with: one protein (green triangle), 2 to 10 (purple box), 10 to 50 (cyan box), 50 to 100 (orange circle), and more than one hundred proteins (red box), respectively. The values were plotted as a histogram of frequencies with a log scale on the X-axis. The same dataset and approach are used to avoid redundancy as for
Two examples of the above observations are illustrated in
One could speculate that some novel folds were recently discovered simply because of difficulty in experimental determination, i.e. these structures are harder to solve. We used the XtalPred program
The normalized frequencies of crystallizability class scores (1 = optimal to 5 = very difficult) are plotted for domains from new folds in SCOP 1.75 (red diamonds) and in SCOP 1.73 (yellow squares), respectively, and for known folds (blue triangles).
Another plausible way to generate new folds is to combine otherwise common Smotifs in an unusual sequence, resulting in a new topology. To explore this, we calculated a Novelty Z-score for each protein, which was obtained from the product of individual Smotif frequencies. The hypothesis is that if the Novelty Z-score of some novel folds is similar to that of known folds, then the novelty in these cases must be a consequence of a never-before-seen combination of otherwise common Smotifs rather than a result of being constructed from rare Smotifs. And while new folds from the CASP dataset do show a distribution of Novelty Z-scores biased towards low values (
Three out of the six Smotifs that compose target T0201 are also present in the 50S ribosomal protein L6P (PDB code 1s72 chain E) but in a different topological arrangement. Structurally equivalent Smotifs between T0201 and 1s72 are depicted in the same color-coding. The sequence of Smotifs is also shown underneath for each protein. In each Smotif description the first two letters refer to the two secondary structures connected (E and H stand for strand and helix, respectively). The 4 letters after the underscore sign code the 4 geometrical variables describing the relative geometry of the Smotif (in order: the distance between the bracing secondary structures, and three angles: a hoist (δ), a packing (θ) and a meridian (ρ)
Since the early nineteen-nineties, it has been clear that the universe of protein folds is much more limited and redundant than the sequences adopting them
In this work we explored the entirety of protein shapes from the perspective of their Smotif building blocks, which can be defined more objectively than the folds themselves, and which are observed to be nearly completely sampled in the currently known structures. Using this repertoire of Smotifs, we observed that novel folds can be distinguished from already discovered ones by the presence of rare Smotifs and, less often, by an unusual combination of otherwise common Smotifs. The most frequently used motifs have a higher average number of internal contacts, while some of the rarest motifs are larger and contain longer linker regions. These observations may be useful starting points for future work on identifying or designing sequences that are likely to constitute “novel” folds.
While in this work we defined Smotifs according to practical considerations and did not investigate whether these Smotifs, or a subset of them, could also serve as possible units of structural evolution, it is worth mentioning other studies that identified similar structural elements as possible building blocks of structural hierarchy using different approaches. The so called
All structures from CASP 3,4,5,6 meetings
Similarly, we have downloaded all “new folds” from the SCOP 1.73 and 1.75 releases, 123 and 110 folds, respectively, that are part of a total of 1140 proteins. The list of new folds for earlier releases can be found at SCOP via History link (
A Smotif is defined by two consecutive regular secondary elements (i.e. α-helix or β-strand), connected by a loop. The N- and C-terminal regular secondary structures of a Smotif are referred to as SS1 and SS2, respectively. Motif geometry refers to the local spatial arrangement of SS1 with respect to SS2 as introduced in
A library has been established that classifies each Smotif in all PDB structures. This library is organized in a two-level hierarchy: in the first level of classification, (i) Smotifs are identified according to the type of bracing secondary structures: αα, αβ, βα and ββ according to the definition of secondary structure by the DSSP program
The geometrical values used in the second level of classification are distributed in a continuous space. Distance is distributed between 0 and 40 Å (values larger than 40 Å are assigned to 40 Å), δ and θ angles span from 0 to 180 degrees, and the ρ angle spans from 0 to 360 degrees. In order to compare Smotif geometries, the parameter spaces of geometrical values were binned, where each bin is defined by the 4 parameters described above. A range of binning sizes and parameter intervals were explored for the four variables in order to get the sharpest partitioning power of the geometrical space with the smallest number of possible bins (
A program that defines Smotifs is available upon request from the authors.
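The binning of the four geometrical parameters described above can be illustrated with a minimal Python sketch. The parameter ranges and the 40 Å cap come from the text; the number of bins per parameter (`N_BINS`), the equal-width intervals, and the function names are hypothetical choices made only for illustration and are not the authors' optimized binning.

```python
# Parameter ranges as stated in the text (distance in Å, angles in degrees).
RANGES = {
    "distance": (0.0, 40.0),
    "delta": (0.0, 180.0),   # hoist angle
    "theta": (0.0, 180.0),   # packing angle
    "rho": (0.0, 360.0),     # meridian angle
}

# Hypothetical bin counts, for illustration only.
N_BINS = {"distance": 4, "delta": 3, "theta": 3, "rho": 6}

ORDER = ("distance", "delta", "theta", "rho")


def bin_index(value, low, high, n_bins):
    """Map a continuous value to an equal-width bin index in [0, n_bins - 1]."""
    clamped = min(max(value, low), high)  # e.g. distances above 40 Å become 40 Å
    width = (high - low) / n_bins
    return min(int((clamped - low) / width), n_bins - 1)


def geometry_bin(distance, delta, theta, rho):
    """Return the four-dimensional bin (one index per parameter) of a Smotif."""
    values = dict(zip(ORDER, (distance, delta, theta, rho)))
    return tuple(
        bin_index(values[name], *RANGES[name], N_BINS[name]) for name in ORDER
    )
```

Under these assumed bin widths, two Smotifs would be treated as having the same geometry when `geometry_bin` returns the same tuple for both; a distance of 55 Å falls into the same distance bin as 40 Å because of the cap.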
All protein structures that were identified as “new folds” from SCOP releases 1.73 and 1.75 and CASP 3–6 meetings were decomposed into Smotifs. In the case of SCOP, each release identifies the new folds in comparison to the rest of the folds, while in the case of the CASP sets a Smotif library extracted from a backdated PDB was prepared for each CASP meeting. Within each pair of datasets (Smotifs from SCOP new folds versus existing folds, and Smotifs from CASP new folds versus the corresponding Smotif library from previously solved structures), Smotifs were compared to determine whether identical Smotifs exist in the novel folds and the previously defined folds. The first comparison was based on the type of secondary structures and the geometry (D, hoist, packing, and meridian) of Smotifs. In a second, stricter comparison, the lengths of the flanking secondary elements (SS1 and SS2) were also compared. If these lengths differed by more than 2 or 4 residues in the case of strands or helices, respectively, the Smotifs were considered different.
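The two levels of comparison described above can be sketched as follows. The length tolerances (2 residues for strands, 4 for helices) are taken from the text; the `Smotif` record, its field names, and the function names are hypothetical and serve only to illustrate the logic, not to reproduce the authors' program.

```python
from dataclasses import dataclass

# Maximum allowed length difference per secondary structure type (from the text):
# E = strand, H = helix.
LENGTH_TOLERANCE = {"E": 2, "H": 4}


@dataclass(frozen=True)
class Smotif:
    ss1_type: str        # "E" or "H"
    ss2_type: str        # "E" or "H"
    geometry_bin: tuple  # binned (distance, hoist, packing, meridian)
    ss1_len: int         # number of residues in SS1
    ss2_len: int         # number of residues in SS2


def same_geometry(a, b):
    """First comparison: same secondary structure types and same geometry bin."""
    return (a.ss1_type, a.ss2_type, a.geometry_bin) == (
        b.ss1_type, b.ss2_type, b.geometry_bin)


def same_with_lengths(a, b):
    """Second, stricter comparison: SS1 and SS2 lengths must also be comparable."""
    if not same_geometry(a, b):
        return False
    return (abs(a.ss1_len - b.ss1_len) <= LENGTH_TOLERANCE[a.ss1_type]
            and abs(a.ss2_len - b.ss2_len) <= LENGTH_TOLERANCE[a.ss2_type])


def is_novel(query, library, strict=False):
    """True if no Smotif in the (backdated) library matches the query."""
    match = same_with_lengths if strict else same_geometry
    return not any(match(query, known) for known in library)
```

With this sketch, a ββ Smotif with strands of 10 and 11 residues would not be novel by geometry if the library holds the same bin, yet would count as having unique flanking secondary structures under `strict=True` if every library entry in that bin differs by more than 2 residues in either strand.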
To avoid redundancy when calculating the frequencies of Smotif occurrences for each four-dimensional geometric bin, only a single protein was selected from each protein fold (as defined by the SCOP database). Since fold families contain more than one protein structure, and structures that belong to the same fold may have a variable number of Smotifs, this selection process was repeated 100 times, randomly selecting a different protein in each analysis. Therefore, the frequency of occurrence of a given geometrical bin is the average of counts computed from 100 rounds of analysis for each family.
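The repeated random selection described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input layout (a mapping from fold identifier to a list of proteins, each protein being a list of geometry-bin labels), the function name, and the fixed random seed are hypothetical and not part of the original method description.

```python
import random
from collections import Counter


def average_bin_frequencies(folds, n_rounds=100, seed=0):
    """Average per-bin Smotif counts over repeated non-redundant samples.

    folds: dict mapping a fold identifier to a list of proteins, where each
           protein is a list of geometry-bin labels (one per Smotif).
    In each round one protein is drawn at random from every fold and its
    Smotif bins are counted; counts are then averaged over all rounds.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    totals = Counter()
    for _ in range(n_rounds):
        for proteins in folds.values():
            totals.update(rng.choice(proteins))
    return {bin_label: count / n_rounds for bin_label, count in totals.items()}
```

For a fold represented by a single protein the result is simply that protein's bin counts; for folds with several members, the averaged counts reflect how often each bin appears across the members that happened to be sampled.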
Each of the proteins in the database was converted into a string of Smotifs. Thus, a protein having 5 regular secondary structures would be expressed as a string of 4 overlapping Smotifs. For each protein, a normalized probability score of observing such a string of Smotifs was calculated:
Internal contact ratio was calculated as the number of non-bonded atomic contacts (i.e. H-bonds, polar contacts, hydrophobic contacts) between SS1 and SS2 divided by Smotif size. Contacts were defined by the Contact of Structural Units (CSU) program
Protein crystallizability was predicted with the XtalPred server
The corresponding PDB structure, chain identification and residue range were located for each Smotif (369,859 Smotifs in total). We calculated ACC values (water exposed surface area or number of water molecules in contact with the residue) using the DSSP program
Solvent accessibility scores of Smotifs as calculated by DSSP. Average solvent accessibility values are plotted as a function of Smotif frequency in α-α (A), β-α (B), α-β (C), and β-β (D) Smotifs.
Distribution of the frequency of Smotif geometries in SCOP 1.75. Proteins were grouped according to the number of structures per fold. Seven categories were described: new fold (blue rhomboid); folds with: 1 protein (green triangle), 2 to 10 (purple box), 10 to 50 (cyan box), 50 to 100 (orange circle), and more than one hundred proteins (red box), respectively. The values were plotted as a histogram of frequencies with a log scale on the X-axis. The same dataset and approach are used to avoid redundancy as in
Distribution of the frequency of Smotif geometries in SCOP 1.73. Proteins were grouped according to the number of structures per fold. Seven categories were described: new fold (blue rhomboid); folds with: 1 protein (green triangle), 2 to 10 (purple box), 10 to 50 (cyan box), 50 to 100 (orange circle), and more than one hundred proteins (red box), respectively. The values were plotted as histogram of frequencies with a log scale in the X-axis. The same dataset and approach is used to avoid redundancy as in
Histogram of Novelty Z-scores of known folds in CASP dataset. Z-scores were binned by increments of 0.1. Overlaid are the Novelty Z-score for each individual new fold target submitted to CASP meetings
Histogram of Novelty Z-scores of known (red) and new (blue) folds in SCOP 1.75 dataset. Z-scores were binned by increments of 0.1.
Histogram of Novelty Z-scores of known (red) and new (blue) folds in SCOP 1.73 dataset. Z-scores were binned by increments of 0.1.
Structural similarity vs. geometry binning. Panels A and B show the distribution of LGA score
We thank Dr. Vilas Menon for the critical reading of the manuscript and Dr. Lukasz Slabinski for batch-processing a large number of sequences for the XtalPred analysis.