Mapping the Geometric Evolution of Protein Folding Motor

Polypeptide chain has an invariant main-chain and a variant side-chain sequence. How the side-chain sequence determines fold in terms of its chemical constitution has been scrutinized extensively and verified periodically. However, a focussed investigation on the directive effect of side-chain geometry may provide important insights supplementing existing algorithms in mapping the geometrical evolution of protein chains and its structural preferences. Geometrically, folding of protein structure may be envisaged as the evolution of its geometric variables: ϕ, and ψ dihedral angles of polypeptide main-chain directed by χ1, and χ2 of side chain. In this work, protein molecule is metaphorically modelled as a machine with 4 rotors ϕ, ψ, χ1 and χ2, with its evolution to the functional fold is directed by combinations of its rotor directions. We observe that differential rotor motions lead to different secondary structure formations and the combinatorial pattern is unique and consistent for particular secondary structure type. Further, we found that combination of rotor geometries of each amino acid is unique which partly explains how different amino acid sequence combinations have unique structural evolution and functional adaptation. Quantification of these amino acid rotor preferences, resulted in the generation of 3 substitution matrices, which later on plugged in the BLAST tool, for evaluating their efficiency in aligning sequences. We have employed BLOSUM62 and PAM30 as standard for primary evaluation. Generation of substitution matrices is a logical extension of the conceptual framework we attempted to build during the development of this work. Optimization of matrices following the conventional routines and possible application with biologically relevant data sets are beyond the scope of this manuscript, though it is a part of the larger project design.


Introduction
The three dimensional native structure of a protein defies all concepts of aesthetics that nature generally tend to practice in its creation. However, these folded structures and the guiding principles of its construction is an important scientific problem, yet to be completely solved. The physical folding code, folding mechanism and a perfect algorithm that predicts the fold from sequence has been a matter of intense scrutiny for more than five decades now [1]. The scientific community has invested significant amount of time and resources in deciphering this

Rotors and Stators of the Folding motor
Structure of a protein chain can be geometrically mapped by its backbone dihedral angles ϕ and ψ. Side-chain dihedral angles χ 1 and χ 2 and their contributions were more a concern for protein designers and crystallographers, while they optimize side-chain packing of a given fold. Allowed regions of ϕ and ψ combinations in a Ramachandran plot is very limited, and therefore structural diversity of protein secondary structure is also limited to α-helices, β-sheets and connecting segments. In 2009, Cole and Bystroff proposed the phone cord model for helix formation and favouring of right handed helices over left handed ones in the crossovers of βαβ super-secondary structures [28]. Phone cord model highlights the creation of a torque on two ends of the helix during its formation, which may either be relieved by the rotation of one end of the chain or by its pivoting. Protein polymeric chain does not behave like a random flight chain with un-restricted freedom of movement, instead is a stiff chain mainly due to the geometric restrictions of its dihedral angle rotations. Chain stiffness is a general term used to describe the degree of angular correlation between neighbouring bonds, and is often measured in terms of characteristic ratio. Characteristic ratio of a random flight chain is 1. Characteristic ratio calculated by P.J. Flory for polypeptide chains ranges between 9.0 and 9.5, mainly due to the poly L (or hypothetically poly R) stereochemistry of polypeptide chain [29]. This indicates that the folding of this stiff protein chain is a directed motorized process, with the motive force generated through non-bonded interactions and dispensed through dihedral angle motions.
A self-folding polypeptide chain may hence be assumed as a motor with four rotors ϕ, ψ, χ 1 , χ 2 , (Fig 1A) and all other elements being stators. The motive force in either direction of folding or unfolding, may come from the mutual dispensation of enthalpic and entropic forces within the chain, subject to reaction conditions. The necessary torque required to fold a polypeptide chain, may come from within, and thermodynamics of folding has already been studied extensively. We were curious to find out how this torque is manifested sequentially such that the chain can move into helix and sheets and eventually assume a unique fold for a given sequence. It is safe to assume that this torque is manifested through dihedral angle movements. Therefore, we attempted to map the sequential geometric events of folding by following the movements of four dihedral angles during secondary structure formation and breaking. A postsecondary structure motion of this motorized process culminating in a fold, is unique for a given sequence and hence cannot be brought to a common platform.

Materials and Methods
Our statistical analysis is based on a 22,997 strong non-redundant protein dataset available on the PISCES server [30], consisting of 14,556,807 amino acid residues, altogether forming 406,799 helices and 576,240 β-sheets. The set of non-redundant protein structures as provided in the PISCES server were downloaded from PDB and dihedral angles were calculated in the given structures. The output was segregated into three groups: β-sheets, α-helices and others as per their main chain dihedral angle distributions. To these groups, we also added the flanking amino acids of each secondary structure, stretching to three amino acids on either side. Further, each of these lists were reduced into a 37 X 37 residue frequency matrix with each value in the matrix representing the number of amino acids present in a 10 0 by 10 0 angle grid, as per the value of their dihedral angles. Such matrices were constructed for every flanking region and secondary structure positions for various combinations of dihedral angles. After the construction of such matrices, we plotted them as contour maps in order to analyse the quantitative distribution of amino acid main and side chain dihedrals in protein structures.

Results Dataset
Our logical deductions are based on the statistical analysis of a non-redundant protein data set of about 23,000 protein structures from Dunbrack's library [30]. We extended the broad conclusions of this study by generating three substitution matrices for protein sequence alignment. These matrices were compared with the standard ones and their relative performances are elaborated in the latter sections of this manuscript.
In an attempt to map the early events of folding and understanding the driving force responsible for the sequential events, we have classified the dihedral angles of amino acids preceding and succeeding the basic secondary structure elements-helices and sheets. Therefore, we have separately classified amino acids at H-3, H-2, H-1, H, H+1, H+2 and H+3 positions each getting into and leaving the helical structures. Similarly, we have S-3, S-2, S-1, S, S+1, S+2 and S+3 for sheets. We have taken the whole sheet segment and not just a strand for better clarity and precision of data.

Relative motions of secondary structure forming dihedral angle rotors
We follow the motions of dihedral angles by mapping the dihedral angle basins of three residues going into the secondary structure, and three residues leaving it as described in the previous paragraph. We observe that, the behaviour of polypeptide chain dihedral angle rotors are analogous to an enigma machine with the information regarding targeted secondary structure is enciphered in it. Rotor orientations are distinctly different for different secondary structures. We observed that helix formation is a result of concerted motion of main chain and side-chain rotors, in varying degree of participation and differential mode of rotations ( Fig 1A). We first evaluated the orientation of χ 1 and χ 2 dihedral angles, because they are successive rotors of the same branch. We had an interesting observation that χ 1 and χ 2 dihedral angles are counter-rotating while formation and breaking of secondary structures under all conditions ( Fig 1B). A more interesting observation is that χ 1 is also counter-rotating with ϕ while formation and breaking of the helix. Therefore, folding-unfolding machine has these three rotors counter-rotating, while forming any secondary structure. The next important observation is that ψ dihedral angle rotor is always left handed in its orientation except while breaking of sheet. This further indicates that ψ dihedral angle do not have a decisive role in the evolution and choice of secondary structure formation (Fig 1B-1D).
Helix formation and breaking. Helix is formed when χ 1 rotor rotates to the left. ϕ dihedral angle rotor complements in counter direction by rotating to the right by 80°. While helix breaking, χ 1 and ϕ rotors assumes reverse orientation. χ 2 dihedral angle maintains its mandate of counter-rotation with χ 1 rotor, during formation and breaking. The ψ dihedral rotor maintains a right handed rotation during both helix formation and breaking (Fig 2).
Sheet formation and breaking. Sheet or a beta strand is formed when χ 1 rotor rotates (clock-wise) to the right unlike helix. ϕ dihedral angle rotor complements in counter rotating direction by rotating to the left full circle 360 0 (Fig 3). χ 2 dihedral angle maintains its mandate of counter-rotation with χ 1 rotor, during formation and breaking of sheet as well. Both ϕ and ψ rotate in counter direction while breaking. Rotations of ϕ in counter directions, while formation and breaking is hence, a common feature for both helices and sheets.
Therefore, trace of dihedral angle geometry tells us an interesting story, untold so far to the best of our knowledge, which can be summarized as follows: i) Geometric evolution of helix patterns of rotors during helix formation (green) and helix breaking (red) are shown d) Orientation of rotors resulting in the formation and breaking of sheeted structures.
doi:10.1371/journal.pone.0163993.g001 and sheet are distinctly different and in most cases opposite; ii) Rotor motions, through which these differential evolution patterns are orchestrated are unique for a given secondary structure; and (iii) the resultant secondary structure formation is a result of this differential rotor motions (Figs 1, 2 and 3 and Figures A1-A4 in S1 File), though the constitution of differential driving forces for these secondary structures are still unclear.

Dihedral angle rotor patterns and amino-acid choices
From the above analysis, we may conclude that there are two distinctly different rotor motions operative for helix and sheet formation. A perturbation to this rotor motion will result in a sequence segment of coil that connects between successive secondary structure elements. In this general scenario, it would be interesting to see whether amino-acids will respond differently to specific secondary structures, while rotor motions are executed by the chain. This is of a very significant importance, in light of the fundamental question of sequence-structure relationship in the protein folding problem. Main chain of any protein sequence is a constant, but the side-chain is believed to influence its evolution to a complete functional fold. So far, studies along these lines, were biased to focus on the chemistry, polarity, β-branching, hydrophobicity, aromaticity, size etc. Our significant lead from the first part of this investigation is that the rotation patterns of side-chain rotors directly influence main chain rotor motions, which will eventually decide the structure. Based on the standard parameters, Gln, Ile, Pro, Trp, Tyr and Val were selected as representative residues initially, to evaluate our objective to investigate sidechain geometric influences on main chain structure evolution. The dihedral angle distributions were calculated and plotted as described in the previous section for each of the amino acids and classified as helices and sheets. The residue-wise data thus generated, was similar not only to the cumulative dataset presented in the previous sections, but also brought out new insights into our perspective (Fig 4, Figure A5 in S1 File). The Linear side chains, both light (Gln) and bulky (Trp & Tyr) side-chains showed similar dihedral angle rotations. The χ 1 for Pro, like ϕ wasn't allowed much conformations as compared to others, but was present in two continuous basins viz. (-150)-(-180) and (130)-(180) ( Figure A6-A25 in S1 File). The left handed helix conformation was also seen at flanking positions in most of the residues with more prominence in Gln and Lys, while completely missing in Ile, Val and Pro (A26-A39 in S1 File). This observation is in concordance with the observations by Kleywegt et al. [31].
In our earlier analysis, we found that χ 1 and χ 2 are counter-rotating while secondary structure formation, and direction of χ 1 rotor is the most critical factor influencing ϕ rotor. Therefore, we mapped χ 1 vs χ 2 and ϕ vs χ 1 , basins during secondary structure formation and breaking (Fig 4,  Figures A6-A17 in S1 File). Interestingly, all seven amino acids we choose are distinctly different in their basin distribution profiles ( Figures A6-A44 in S1 File). This partially answers how structure of a sequence is unique, owing to the observation that the side-chain and main chain-rotor that directs their structural evolution and pathway, are distinctly different and unique.

Translation of rotor preferences to substitution matrices: MIDMAT Series
The data generated had information about dihedral angle distribution for amino acids in protein structure. We decided to extend the study to all amino acids and test it on a sequence alignment platform. Hence, a methodology was developed to convert this distribution data into three possible substitution matrices. The majority of the scoring matrices are based on the evolution of the protein sequence, whereas our scoring matrices are derived from the experimental structural data. To construct the matrices, the geometrical arrangements of dihedral angles were distributed into four basins from +150 to -150 (via ±180), -150 to -50, -50 to +50 and +50 to +150 degrees in the rotational plane. The frequency of appearance of each amino acid was calculated for each basin and the basin energy for each reference basin was calculated  as per the following equation: Where, ΔG is the energy of the basin, R is the Universal Gas Constant, T is temperature (kept constant at 25°C), N x is the number of basin entries, and B is the variable which was computed in three different ways for three different matrices proposed.
For each of the matrices, B represents the following; i) the sum of hits in remaining basins, ii) average of remaining basins and iii) average over all the four basins. Thus, to generate these three values of B for the first basin (Basin1), we used the following set of equations for each matrix where, B i is the value of B for i th basin and b i is the sum of total number of entries in i th basin The basin energies for each amino acid were thus calculated for the four rotors (ϕ, ψ, χ 1 and χ 2 ). Comparative estimates of corresponding basins for any given amino acid pair was measured based on the Euclidean distance calculation as per the following equation.
where, D xy is the distance (indicates the dissimilarity between the rotor preferences of two amino acids) from amino acid x to amino acid y, and ϕ i , ψ i , χ 1i , χ 2i represent the basin energies for the i th basin for ϕ, ψ, χ 1 and χ 2 dihedral distributions for the amino acids x and y.
The distances value calculated for all amino acid pairs has the very nature and physical meaning of a typical substitution matrix. The values of this distance matrix were then normalized by dividing all the values by a common denominator, 1000.
The values thus generated in the distance matrix were further processed to calculate substitution values between each pair of two amino acids x and y (SV xy ) by subtracting each of them with the median value of the distance matrix and multiplying with a normalization factor of 0.5 as illustrated below.
To calculate the substitution value of an amino acid with itself (e.g. ala with ala), we took the maximum substitution value (SV) from the substitution matrix and added it to the existing maximum substitution value of the same amino acid.
The values were then rounded off to the nearest decimal point to give the final substitution score required for the construction of three substitution matrices, MIDMAT1 MIDMAT2 and MIDMAT3 (Tables 1, 2 and 3).
doi:10.1371/journal.pone.0163993.t001 Table 2. MIDMAT 2 substitution matrix. MIDMAT 2 substitution matrix constructed using a different approach than MIDMAT1 as discussed in results section, from the same set of 22,997 non-redundant structures from PISCES server. The amino acids are represented as single letter codes.

Evaluation of MIDMAT1, MIDMAT2 and MIDMAT3
The constructed matrices, MIDMAT1, MIDMAT2 and MIDMAT3 were tested for their efficacy as substitution scoring matrices by employing them in the BLAST program. CASP11 data set was used as a test set, because majority of proteins in CASP11 have minimal sequence similarity with the structures available in PDB at the time of submission. All three matrices were successful in aligning the sequences from CASP11 dataset with their respective PDB IDs, now available in PDB. The results were promising for the MIDMATs and MIDMAT3 consistently produced a better score and E-value for the best matching alignment than MIDMAT1 and MIDMAT2. This, however, is a simple qualitative demonstration highlighting the utility of three matrices generated solely on the basis of dihedral angle geometries of representative protein dataset. We have employed BLOSUM62 (25) and PAM30 (24) as standards. For a set of 67 proteins used for CASP11, the number of common and unique hits is represented by the Venn diagrams for MIDMATs, BLOSUM62 and PAM30 (Fig 4). As is evident from Fig 5, MID-MAT1 showed most similar results to PAM30, a matrix used for aligning highly similar sequences of a significantly small length, usually less than 50. MIDMAT3 is the most promising among all MIDMAT matrices in terms of scores and E-values. This, however, is a simple suggestive demonstration highlighting the utility of three matrices generated solely on the basis of dihedral angle geometries of representative protein dataset.

Discussion
Folding of a protein molecule may be envisaged as an evolution of a given polypeptide chain from random to fixed geometry; with geometrical fixation of the final fold is decided by its amino acid sequence. Polypeptide chain has a characteristic ratio approximately 9 times more than a random flight chain indicating the constraint in the chain due to peptide bond, poly L stereochemistry and steric interaction. Practically, hugely restricted combinations of ϕ and ψ dihedral angles of main chain and χ 1 , χ 2 etc. dihedral angles of side chain are the only variables in this complex process. Since main chain is a constant for all the chains, we made an assumption that the side chain geometrical variables χ 1 and χ 2 are guiding main chain geometrical variables ϕ and ψ. The results obtained were rather surprising with secondary structure elements showing distinctly different geometrical rotation patterns while formation and breaking. We modelled the protein molecule as an 'enigma machine' with the directive of its geometrical evolution encoded in its sequence. The four geometrical variables are modelled as four rotors of this molecular machine. Amino acid sequence is essentially its side chain sequence and therefore, we could logically assume that side chain rotors χ 1 and χ 2 are directing main chain rotors. We could show that their differential motions differentiate secondary structure formations. We further examined seven amino acids of diverse side chain type to evaluate their rotor motion. The observations that all seven were distinctly different in their side chain and main chain rotor combinations further strengthened our belief that side chain geometry controls main chain geometries of a poly-peptide chain. We further extended the study to all amino acids by calculating free energy of each dihedral angle combination basins from their population statistics. All amino acids are compared against each other and quantified in the form of 3 substitution matrices. Their performances are compared by plugging in a BLAST engine. The results of these comparative analysis with standard substitution matrices like BLOSUM62 and PAM30 further underlies the efficacy of these substitution matrices in sequence alignment tools. However, a detailed study on their relative efficacy when compared to other scoring matrices (BLO-SUM62 and PAM) is beyond the scope of this paper, hence not attempted. We believe that our study is throwing light on the following fundamental aspects of protein folding. i) How an amino acid sequence choose a secondary structure. ii) How side chain geometries and its combination with main chain geometries directs the evolution of a random chain to its prescribed fold and iii) How different sequences generally end up in different fold. The emphasis on 'How' rather than 'why' in our investigation may perhaps helped us in proposing some important observations and a model, though they are of suggestive nature than that of a conclusive one.