The Ramachandran Number: An Order Parameter for Protein Geometry

Three-dimensional protein structures usually contain regions of local order, called secondary structure, such as α-helices and β-sheets. Secondary structure is characterized by the local rotational state of the protein backbone, quantified by two dihedral angles called ϕ and ψ. Particular types of secondary structure can generally be described by a single (diffuse) location on a two-dimensional plot drawn in the space of the angles ϕ and ψ, called a Ramachandran plot. By contrast, a recently-discovered nanomaterial made from peptoids, structural isomers of peptides, displays a secondary-structure motif corresponding to two regions on the Ramachandran plot [Mannige et al., Nature 526, 415 (2015)]. In order to describe such ‘higher-order’ secondary structure in a compact way we introduce here a means of describing regions on the Ramachandran plot in terms of a single Ramachandran number, R, which is a structurally meaningful combination of ϕ and ψ. We show that the potential applications of R are numerous: it can be used to describe the geometric content of protein structures, and can be used to draw diagrams that reveal, at a glance, the frequency of occurrence of regular secondary structures and disordered regions in large protein datasets. We propose that R might be used as an order parameter for protein geometry for a wide range of applications.


Introduction
Many three-dimensional protein structures consist of regions of local order called secondary structure [1]. Consequently, the study of secondary structure has occupied a crucial role in structural biology [1][2][3][4][5][6][7][8][9]. A key insight from this study is the recognition that the conformation of a protein backbone near a given amino acid residue can be specified largely by two dihedral angles, called ϕ and ψ, as shown in Fig 1(a) (a third angle, ω, usually takes one of two values, defining trans and cis conformations [5,6]). Ramachandran and co-workers deduced that peptide backbones inhabit only certain regions of dihedral angle (ϕ, ψ) configuration space [10]. Plots drawn in terms of this configuration space are called Ramachandran plots [1,9,11,12], and they are among the most important innovations in structural biology, enabling immediate assessment of the geometric nature of protein structures [13][14][15].
In general, residues that comprise particular protein secondary structures, such as α-helices and β-sheets, correspond to distinct, localized regions on the Ramachandran plot; see Fig 1(b). However, the possibility of secondary structure built from more than one rotational state, i.e. more than one region on the Ramachandran plot, was introduced in 1951 by Pauling and Corey. They proposed a 'pleated sheet' motif in which protein residues alternate between right- and left-handed forms of the α-helix.
While not yet seen in nature, simulations indicate that α-pleated sheets can form as kinetic intermediates in unfolding processes [17][18][19][20]. More generally, a broad range of protein structures could in principle be built from polypeptide motifs possessing two rotational states [21,22]. In the non-natural world, protein-mimetic polymers do form large-scale stable structures that simulation indicates harbor a secondary-structure motif built from more than one rotational state. The peptoid nanosheet [23] is a molecular bilayer that possesses macroscopic extent in two dimensions. It is made from peptoids, structural isomers of peptides. The nanosheet is flat because its constituent peptoid polymers are linear and untwisted, properties that result from the fact that backbone residues along each polymer alternate between two twist-opposed rotational states [16]. This secondary-structure motif, called a Σ-strand, corresponds to two regions on the Ramachandran plot, as shown in Fig 1(c).
To describe this structure and its possible generalizations it is convenient to be able to describe regions on the Ramachandran plot with a single number, so that the state of each residue along a polymer backbone can be compactly described. The desire for such a description is the motivation for this paper. We introduce in Section 2.1 a structurally meaningful combination of ϕ and ψ that we call the Ramachandran number, R. Given a way of describing regions of the Ramachandran plot in terms of one number instead of two, one can then draw diagrams that give insight into protein geometry that is difficult to obtain by other means. In Section 2.2 we show that R can be used to assess in a compact manner the geometric content of protein structures, and can be used to draw diagrams that reveal at a glance the frequency of occurrence of regular secondary structures and disordered regions in large protein datasets. We also suggest that R may be useful in the analysis of intrinsically-disordered proteins, whose three-dimensional structures are less well understood than those of globular proteins [9,24,25,26]. We conclude in Section 3.
Fig 1. (a) The state of a residue within a peptide (top) and a peptoid (bottom) can be largely specified by the two dihedral angles ϕ and ψ. (b) Regular protein secondary structures, such as α-helices and β-sheets, correspond to single diffuse regions on a plot drawn in terms of ϕ and ψ, called a Ramachandran plot (see Methods). (c) Peptoid Σ-sheets [16] harbor a secondary-structure motif in which backbone residues alternate between two regions on the Ramachandran plot. In order to describe each region in terms of a single number, so that the state of each residue in a backbone can be compactly indicated, we describe in this paper the development and properties of a structurally meaningful combination of ϕ and ψ that we call the Ramachandran number, R. [Panel (a) was adapted from an image found on Wikimedia Commons by Dcrjsr (CC BY 3.0). The contours in (b) and (c) represent regions within which 70% of a secondary structure resides; see Section 4.1.]

One possible Ramachandran number
One physical factor that suggests a compact way of describing the Ramachandran plot is the sense of residue twist implicit in the plot, which changes sign as one crosses the negative-sloping diagonal; see Fig 2(a). Structures whose backbones occupy the bottom left-hand triangle of the Ramachandran plot have a right-handed (dextrorotatory or D) sense of twist, while structures in the top right-hand triangle have a left-handed (levorotatory or L) sense of twist [1,16] (note that the terms 'L' and 'D' are also independently used to distinguish between distinct amino acid enantiomers [27][28][29][30]). This observation suggests an indexing system that proceeds from the bottom left of the plot to the top right of the plot. To gain insight into how this should be done, we built protein backbones with dihedral angles chosen from designated regions of the Ramachandran plot. We examined the behavior of the end-to-end distance R_e of polymers built in this way (the polymer radius of gyration behaves similarly). This behavior is shown in Fig 2.
To construct such an indexing system we take the Ramachandran plot axes to have the range [−λ/2, λ/2), where λ = 360° [1,11,13,15]. We divide the plot into a square grid of (360°σ)² sites, where σ is a scaling factor that is measured in reciprocal degrees. We shall show that it is straightforward to make σ large enough that the error incurred upon converting angles from structures in the protein databank to Ramachandran numbers and back again is less than the characteristic error associated with the coordinates of structures in that database. Given a choice of grid resolution σ, we define the integer-valued Ramachandran number
$$R_{\mathbb{Z}}(\phi,\psi) \equiv \lfloor \phi' \rceil + \Lambda \lfloor \psi' \rceil, \tag{1}$$
in which the coordinates
$$\phi' \equiv (\phi - \psi + \lambda)\,\sigma/\sqrt{2} \tag{2}$$
and
$$\psi' \equiv (\phi + \psi + \lambda)\,\sigma/\sqrt{2} \tag{3}$$
correspond to a clockwise rotation by 45°, a shift, and a rescaling of the original coordinates ϕ and ψ (see Section 4.2), and the parameter
$$\Lambda \equiv \lfloor \sqrt{2}\,\lambda\sigma \rceil. \tag{4}$$
In these relations the symbol ⌊x⌉ means the integer closest to the real number x, e.g. ⌊2.49⌉ = 2 and ⌊2.51⌉ = 3.
Combined as a single expression, Eqs (1)-(4) read
$$R_{\mathbb{Z}}(\phi,\psi) = \left\lfloor (\phi - \psi + \lambda)\,\sigma/\sqrt{2} \right\rceil + \left\lfloor \sqrt{2}\,\lambda\sigma \right\rceil \left\lfloor (\phi + \psi + \lambda)\,\sigma/\sqrt{2} \right\rceil. \tag{5}$$
The integer-valued Ramachandran number runs between R_ℤ,min = R_ℤ(−λ/2, −λ/2) and R_ℤ,max = R_ℤ(λ/2, λ/2). The range of these values depends on σ, which makes R_ℤ difficult to grasp intuitively. Therefore, for ease of plotting, we define the real-valued, normalized Ramachandran number
$$R(\phi,\psi) \equiv \frac{R_{\mathbb{Z}}(\phi,\psi) - R_{\mathbb{Z},\min}}{R_{\mathbb{Z},\max} - R_{\mathbb{Z},\min}}, \tag{6}$$
which is practically independent of σ. Given that ϕ, ψ ∈ [−λ/2, λ/2), i.e. [−180°, 180°), the ranges of R_ℤ and R are respectively [R_ℤ,min, R_ℤ,max) and [0, 1). The closest approximations to the original coordinates ϕ and ψ that may be retrieved from R_ℤ are
$$\tilde{\phi} = \frac{\lfloor R_{\mathbb{Z}}/\Lambda \rfloor + R_{\mathbb{Z}} \mathbin{\%} \Lambda}{\sqrt{2}\,\sigma} - \lambda \tag{7}$$
and
$$\tilde{\psi} = \frac{\lfloor R_{\mathbb{Z}}/\Lambda \rfloor - R_{\mathbb{Z}} \mathbin{\%} \Lambda}{\sqrt{2}\,\sigma}, \tag{8}$$
where ⌊x⌋ is the largest integer not exceeding the real number x, and α%β is the remainder obtained upon dividing the integer α by the integer β. Eqs (6) to (8) define our mapping of the dihedral angles to the Ramachandran number, i.e. (ϕ, ψ) → R_ℤ → R, and the subsequent approximate recovery of those angles, R_ℤ → (ϕ̃, ψ̃). We show in Section 4.3 that this recovery can be done to within the characteristic precision of the protein databank. By 'slicing' across the Ramachandran plot we group together structures that might be relatively distant in dihedral angle space, the more so as we approach the negative-sloping diagonal (near R = 0.5). One consequence of this grouping is that the set of structures described by a small interval of R displays a distribution of properties, such as end-to-end distance, as shown in Fig 3(a). The mean of this distribution gives rise to a smoothly-varying trend, but the variance of this distribution is nonzero, and is largest near R = 0.5. Some unavoidable structural coarse-graining therefore occurs upon going from the Ramachandran plot to the Ramachandran number.
Despite this drawback, we shall show that R can function as an order parameter for protein geometry, in large part because the Ramachandran plot is in general relatively sparsely occupied: many hypothetical structures that possess distinctly different structural properties but that would be assigned similar Ramachandran numbers simply do not arise in the protein world. Consequently, R can resolve the major classes of protein secondary structure, such as the α and β motifs; see Figs 3(b) and 4.
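The mapping and its approximate inverse can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper: it assumes the rotated coordinates ϕ′ = (ϕ − ψ + λ)σ/√2 and ψ′ = (ϕ + ψ + λ)σ/√2, the grid width Λ = ⌊√2λσ⌉, and that R_ℤ,min and R_ℤ,max are evaluated at the corners (−λ/2, −λ/2) and (λ/2, λ/2). The function names, the half-up rounding rule and the example angles are our own choices for illustration.

```python
import math

LAMBDA_DEG = 360.0  # λ: full width of each Ramachandran axis, in degrees


def nearest_int(x: float) -> int:
    """Integer closest to x (the symbol ⌊x⌉); exact halves round up here."""
    return math.floor(x + 0.5)


def grid_width(sigma: float) -> int:
    """Λ: number of grid sites along the rotated ϕ′ axis."""
    return nearest_int(math.sqrt(2.0) * LAMBDA_DEG * sigma)


def r_integer(phi: float, psi: float, sigma: float) -> int:
    """Integer-valued Ramachandran number for angles given in degrees."""
    scale = sigma / math.sqrt(2.0)
    phi_rot = (phi - psi + LAMBDA_DEG) * scale  # ϕ′: rotate, shift, rescale
    psi_rot = (phi + psi + LAMBDA_DEG) * scale  # ψ′
    return nearest_int(phi_rot) + grid_width(sigma) * nearest_int(psi_rot)


def r_normalized(phi: float, psi: float, sigma: float) -> float:
    """Real-valued Ramachandran number R in [0, 1)."""
    half = LAMBDA_DEG / 2.0
    r_min = r_integer(-half, -half, sigma)  # assumed lower corner of the plot
    r_max = r_integer(half, half, sigma)    # assumed upper corner of the plot
    return (r_integer(phi, psi, sigma) - r_min) / (r_max - r_min)


def angles_from_r_integer(r_z: int, sigma: float) -> tuple[float, float]:
    """Approximate (ϕ, ψ) in degrees recovered from the integer-valued number."""
    width = grid_width(sigma)
    phi_rot, psi_rot = r_z % width, r_z // width  # remainder and floor division
    norm = math.sqrt(2.0) * sigma
    return (psi_rot + phi_rot) / norm - LAMBDA_DEG, (psi_rot - phi_rot) / norm


if __name__ == "__main__":
    sigma = 1000.0            # illustrative grid resolution, reciprocal degrees
    phi, psi = -60.0, -45.0   # illustrative α-helix-like angles (our choice)
    r_z = r_integer(phi, psi, sigma)
    print("R_Z =", r_z, " R =", r_normalized(phi, psi, sigma))
    print("recovered angles:", angles_from_r_integer(r_z, sigma))
```

Python integers have arbitrary precision, so large values of σ do not overflow in this sketch. Angles very close to the corner ϕ − ψ = λ can round ϕ′ up to Λ and would need separate handling.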

Properties and uses of R
The indexing system defined by Eqs (6) and (1) collapses the Ramachandran plot into a single line, the Ramachandran number R. As shown in Fig 4, this number can act as an order parameter for types of polymer secondary structure. Given such an order parameter, we can then draw diagrams that reveal the abundance and spatial connectivity of different forms of secondary structure within polymers.
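A short worked limit may help make this 'single line' concrete. The derivation below is our own sketch, not a result stated in the text. It assumes R_ℤ = ⌊ϕ′⌉ + Λ⌊ψ′⌉ with ϕ′ = (ϕ − ψ + λ)σ/√2, ψ′ = (ϕ + ψ + λ)σ/√2 and Λ = ⌊√2λσ⌉, and that R_ℤ,min and R_ℤ,max are evaluated at (−λ/2, −λ/2) and (λ/2, λ/2). Rounding is neglected in the last step, which is reasonable when σλ is large.

```latex
\begin{align*}
R_{\mathbb{Z},\max} - R_{\mathbb{Z},\min}
  &= \Lambda \left\lfloor \sqrt{2}\,\lambda\sigma \right\rceil = \Lambda^{2}, \\
R &= \frac{\lfloor \psi' \rceil}{\Lambda}
   + \frac{\lfloor \phi' \rceil - \lfloor \lambda\sigma/\sqrt{2} \rceil}{\Lambda^{2}}
   \;\approx\; \frac{\phi + \psi + \lambda}{2\lambda}
   \qquad (\sigma\lambda \gg 1).
\end{align*}
```

The second term is at most of order 1/Λ, so to leading order R measures position along the positive-sloping diagonal. For illustrative α-helical angles (ϕ, ψ) = (−60°, −45°) this gives R ≈ 255/720 ≈ 0.35, and for illustrative β-sheet angles (−120°, 130°) it gives R ≈ 370/720 ≈ 0.51, close to the values quoted for Fig 6.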
In Fig 5 we show four different molecular structures described in terms of (a) spatial configurations, (b) the Ramachandran plot, (c) a histogram (bar code or 'R-code') of R-values, and (d) a plot of R versus residue number (the structure of the coiled coil was deduced by a number of authors, Ramachandran among them [32][33][34][35]). The R-code of panel (c) can be regarded as a way of assaying the residues of a protein by geometry, much as gel electrophoresis, which results in similar-looking pictures, is used to tell apart macromolecules by their size and charge. The plot of R versus residue number in panel (d) reveals the spatial connectivity of distinct ordered domains. It shows the distinct segments of secondary structure (α-helix and β-sheet) and loop regions in the proteins, and shows that the peptoid Σ-strand's residues alternate between twist-opposed rotational states. This representation makes clear that the two rotational states of the Σ-strand motif are incorporated within a single type of secondary structure; the Ramachandran plot alone does not distinguish between that outcome and the alternative, that the two rotational states exist within two distinct types of secondary structure.
[Figure caption, continued] However, many structures distant in dihedral angle space but close in R do not arise in proteins; the Ramachandran diagram is in general relatively sparsely occupied. Consequently, R can resolve the major types of protein secondary structure, which can be inferred from the fact that lines parallel to the negative-sloping diagonal (marked), along which R varies only slowly, can touch each region of known secondary structure (colored) individually. This sensitivity allows R to function as an order parameter for protein geometry.
R can be used to compactly describe the abundance of secondary structure types within large protein datasets, as shown in Fig 6. There we show histograms of R ('R-codes') for proteins belonging to distinct SCOP classes [36]. These diagrams identify a number of trends within this dataset. As expected, proteins belonging to classes 'a' and 'b' are rich in α-helical (R ≈ 0.36) and β-sheet (R ≈ 0.52) regions, respectively. More surprisingly, α-helical regions are abundant in all protein classes, even in the 'all-β' class 'b'. Loop regions (R ≈ 0.62) are also prominent; loops connect regions of ordered secondary structure. The R-code also highlights the symmetry of the peptoid backbone about the twist-free region R ≈ 0.5 (panel (b)).
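An R-code of this kind is a normalized histogram of R over all residues in a dataset, and can be sketched as follows. For brevity the sketch uses the large-σ approximation R ≈ (ϕ + ψ + λ)/(2λ), which is our simplification of the grid-based definition. The function names, the bin count and the synthetic angle clusters are illustrative assumptions and are not data from SCOP.

```python
import random

LAMBDA_DEG = 360.0  # λ: full width of each Ramachandran axis, in degrees


def ramachandran_number(phi: float, psi: float) -> float:
    """Large-σ approximation R ≈ (ϕ + ψ + λ) / (2λ); angles in degrees."""
    return (phi + psi + LAMBDA_DEG) / (2.0 * LAMBDA_DEG)


def r_code(angles, n_bins: int = 100) -> list[float]:
    """Normalized histogram of R over an iterable of (ϕ, ψ) pairs."""
    counts = [0] * n_bins
    for phi, psi in angles:
        r = ramachandran_number(phi, psi)
        # Clamp to the valid bin range in case an input lies on or past an edge.
        index = max(0, min(int(r * n_bins), n_bins - 1))
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def synthetic_angles(center, spread, n, rng):
    """Made-up Gaussian cloud of (ϕ, ψ) pairs around a chosen center."""
    return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for helix-like and sheet-like residues (not real data).
    data = (synthetic_angles((-60.0, -45.0), 10.0, 700, rng)
            + synthetic_angles((-120.0, 130.0), 10.0, 300, rng))
    code = r_code(data)
    peak = max(range(len(code)), key=code.__getitem__)
    print("most populated bin starts at R =", peak / len(code))
```

Plotting the returned frequencies as a grayscale strip, one strip per protein class, would give a picture in the spirit of Fig 6.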
R can also be used in a time- and space-resolved way, as shown in Fig 7. Here we show the results of molecular dynamics simulations of the peptoid nanosheet [16,23], which reveal the existence of the Σ-strand secondary structure motif in which residues possess two distinct rotational states. A time series of the R-code of the bilayer (panel (b)) shows the emergence of the Σ-strand motif within molecular dynamics simulations via a breaking of the initially-imposed molecular symmetry. In panel (c), we show the geometric state of each residue in one peptoid as a function of time, revealing the emergence of the Σ-strand structure and the subsequent fluctuations of individual residues on a nanosecond timescale.
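A residue-by-residue, frame-by-frame view of this kind can be sketched as below. The sketch again uses the large-σ approximation R ≈ (ϕ + ψ + λ)/(2λ) as a simplification, and labels each residue by the side of the twist-free value R = 0.5 on which it falls ('D' below, 'L' above). The trajectory is entirely made up for illustration: the angle values, the tolerance and the function names are our assumptions and are not taken from the simulations of Ref. [16].

```python
LAMBDA_DEG = 360.0  # λ: full width of each Ramachandran axis, in degrees


def ramachandran_number(phi: float, psi: float) -> float:
    """Large-σ approximation R ≈ (ϕ + ψ + λ) / (2λ); angles in degrees."""
    return (phi + psi + LAMBDA_DEG) / (2.0 * LAMBDA_DEG)


def r_trajectory(frames):
    """Map frames[t][i] = (ϕ, ψ) of residue i at time t to a table R[t][i]."""
    return [[ramachandran_number(phi, psi) for phi, psi in frame]
            for frame in frames]


def twist_pattern(r_row, tol: float = 0.01) -> str:
    """One character per residue: 'D' for R < 0.5, 'L' for R > 0.5, '0' near 0.5."""
    symbols = []
    for r in r_row:
        if abs(r - 0.5) <= tol:
            symbols.append("0")
        else:
            symbols.append("D" if r < 0.5 else "L")
    return "".join(symbols)


def synthetic_frames(n_before: int = 3, n_after: int = 3, n_residues: int = 8):
    """Invented trajectory: untwisted residues, then alternating twist states."""
    untwisted = [(-90.0, 90.0)] * n_residues
    alternating = [(-90.0, 70.0) if i % 2 == 0 else (90.0, -70.0)
                   for i in range(n_residues)]
    return [list(untwisted) for _ in range(n_before)] + \
           [list(alternating) for _ in range(n_after)]


if __name__ == "__main__":
    for t, row in enumerate(r_trajectory(synthetic_frames())):
        print(t, twist_pattern(row))
```

Rendering the table R[t][i] as a heat map, with time on one axis and residue number on the other, would correspond to the style of panel (c).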
Such time- and space-resolved analysis of polymer backbones may be useful for the analysis of intrinsically disordered proteins (IDPs), whose backbone conformations are heterogeneous in space and time [9,24,25,26]. Insight into the behavior of IDPs can likely be obtained by tracking the state of every residue of a protein as a function of time, which can be done with R in

Conclusions
The Ramachandran plot is a central element of structural biology. We have introduced here a way of describing regions of the Ramachandran plot in terms of a single Ramachandran number, R, which is a structurally meaningful combination of ϕ and ψ. There are many possible ways of constructing such a number, and the one we have chosen is sensitive to the local twist state and degree of compactness of a polymer backbone. Given the ability to describe a two-dimensional space with a single number, one can draw diagrams that furnish insight into polymer structure that is difficult to obtain through other means. For instance, we have shown that R can be used to describe the geometric content of protein and protein-inspired structures, in a space- and time-resolved way, and can be used to draw diagrams that reveal at a glance the frequency of occurrence of regular secondary structures and disordered regions in large protein datasets. We speculate that R may also be useful in analyzing the behavior and evolution of intrinsically-disordered proteins (IDPs), important to e.g. the study of diseases [37]. Such proteins are less well-characterized than globular proteins [1]. IDPs spend substantial amounts of time in unfolded or disordered conformations [9,24,25,26], but may harbor local or transient regions of structure such as α-helices [38]. The Ramachandran number described here may be a useful complement to existing bioinformatics metrics for IDP sequences [39] for understanding the behavior of these proteins in simulations [40][41][42][43][44] and experiments [45][46][47][48]. More generally, R may be useful as an order parameter for polymer geometry for a wide range of applications.
Fig 7. (a) … [16,23] show the existence of the Σ-strand secondary structure motif, within which residues possess two distinct rotational states (colored red and blue in the bottom-right-hand cutaway). (b) A time series of the R-code of the bilayer shows the emergence (to the right of the vertical dotted line) of the Σ-strand motif within molecular dynamics simulations. Polymers in these simulations were initially fully extended, and adopted the Σ-strand motif upon relaxation of their backbone constraints. (c) Geometric state of each residue in one peptoid as a function of time, revealing the emergence of the Σ-strand structure and the subsequent fluctuations of individual residues on a nanosecond timescale.

Obtaining polymer (protein/peptide/peptoid) statistics
The contours in Figs 1b and 3b describe the distribution of secondary structures in a Ramachandran plot, while Fig 4 represents histogram distributions of secondary structures on the Ramachandran line. To obtain statistics on all secondary structures (excepting the polyproline II helix; see below), a protein structure database was obtained from the Structural Classification of Proteins or SCOPe (Release 2.03) [36] that contains proteins with no more than 40% sequence identity (downloaded from http://scop.berkeley.edu/downloads/pdbstyle/pdbstylesel-gs-bib-40-2.03.tgz). Secondary structural elements such as α-helices, 3₁₀-helices and β-sheets were identified using the DSSP algorithm [49][50][51].
The statistics for polyproline II helices (used to generate the green distributions in Figs 3b and 4) were obtained from segments within 16,535 proteins annotated by PolyprOnline [52] to contain three or more residues of the secondary structure.
Fig 6a represents R-codes for entire classes of proteins. We utilized the SCOP classification system and the individual proteins from each class were extracted from the SCOP dataset described above. Altogether, there were 8560 proteins that were amenable to analysis from this database.
The distribution for the Σ-sheet on a Ramachandran plot (Fig 1c) and in an R-code (Figs 5 and 6b) was obtained from a 50 nanosecond interval of a molecular dynamics trajectory [16]. Fig 7(b) and 7(c) describe a trajectory of the same system before and after the symmetry-enforcing backbone restraints were lifted. This process was part of a molecular dynamics equilibration step used in Ref. [16].

Coordinate transformation used to obtain R
Eqs (6)-(8) were obtained by rotating the Ramachandran plot so that contours of constant polymer extension R_e (see Fig 2) lie roughly horizontal in the new coordinate representation. To this end we define
$$\begin{pmatrix} \phi' \\ \psi' \end{pmatrix} \equiv \frac{\sigma}{\sqrt{2}} \left[ \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \phi \\ \psi \end{pmatrix} + \lambda \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right], \tag{9}$$
so that ϕ′ and ψ′ are obtained by rotating the original coordinate system (ϕ, ψ) clockwise by 45°, shifting the resulting coordinates linearly (so that the new coordinates are non-negative), and rescaling the result by the grid resolution σ. Fig 8(a) shows this transformation graphically. Indexing the new coordinate system according to Eq (1) corresponds to the counting system shown in Fig 8(b). To undo the transformation Eq (1) approximately we compute
$$\tilde{\phi}' = R_{\mathbb{Z}} \mathbin{\%} \Lambda \tag{10}$$
and
$$\tilde{\psi}' = \lfloor R_{\mathbb{Z}}/\Lambda \rfloor. \tag{11}$$
We then insert Eqs (10) and (11) into the equations ϕ′ = (ϕ − ψ + λ)σ/√2 and ψ′ = (ϕ + ψ + λ)σ/√2 and solve for ϕ and ψ, which yields Eqs (7) and (8).
Fig 9. Dihedral angles converted to Ramachandran numbers can be recovered only approximately, but the error incurred during this back-mapping can be made much smaller than the standard error (typically 1 Å) associated with structures in the protein databank. Here we show the root-mean-squared deviation (RMSD) in dihedral angles (a) and in protein α-carbon spatial coordinates (b) generated upon taking 8560 protein structures obtained from SCOP [36], converting their dihedral angles to Ramachandran numbers, and recovering approximately those dihedral angles using Eqs (7) and (8). The parameter σ indicates the grid resolution used to calculate R; see Eq (1).

Recovering dihedral angle values from the Ramachandran number
It is convenient to be able to retrieve from the Ramachandran number a good approximation to the dihedral angles used to calculate it. This can be done for a range of choices of grid resolution σ. We took 8560 protein structures obtained from SCOP [36]; see Section 4.1. For a given protein we took the dihedral angles associated with each residue, and used these to compute the 3D protein structure (given values of the ω dihedral angle). We carried out the conversion from dihedral angles to Ramachandran number, defined by Eq (1), and from this used Eqs (7) and (8) to obtain an approximation to the original dihedral angles. We used these approximate angles (with the original values of the ω dihedral angle) to compute the 3D structure of the protein. We then calculated the root-mean-squared deviation (RMSD) between the original and recovered sets of angles and heavy-atom positions, shown in Fig 9. For a range of values of grid resolution σ we find these RMSD values to lie well within the 1 Å characteristic of the protein databank. For the calculations done in this paper we took σ = 10⁵ reciprocal degrees.
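The angular part of this round-trip test can be sketched on random angles, as below. This is an illustration under stated assumptions rather than a reproduction of Fig 9: it uses randomly drawn dihedral angles instead of SCOP structures, it assumes ϕ′ = (ϕ − ψ + λ)σ/√2, ψ′ = (ϕ + ψ + λ)σ/√2 and Λ = ⌊√2λσ⌉, and it does not rebuild atomic coordinates. The function names, sample size and σ values are our own choices.

```python
import math
import random

LAM = 360.0  # λ: full width of each Ramachandran axis, in degrees


def to_r_integer(phi: float, psi: float, sigma: float) -> tuple[int, int]:
    """Return (R_Z, Λ) for angles in degrees, rounding to the nearest integer."""
    width = math.floor(math.sqrt(2.0) * LAM * sigma + 0.5)
    scale = sigma / math.sqrt(2.0)
    phi_rot = math.floor((phi - psi + LAM) * scale + 0.5)
    psi_rot = math.floor((phi + psi + LAM) * scale + 0.5)
    return phi_rot + width * psi_rot, width


def from_r_integer(r_z: int, width: int, sigma: float) -> tuple[float, float]:
    """Approximate (ϕ, ψ) recovered from R_Z by remainder and floor division."""
    phi_rot, psi_rot = r_z % width, r_z // width
    norm = math.sqrt(2.0) * sigma
    return (psi_rot + phi_rot) / norm - LAM, (psi_rot - phi_rot) / norm


def round_trip_rms(sigma: float, n_samples: int = 2000, seed: int = 0) -> float:
    """RMS error per angle (degrees) after converting to R_Z and back."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Sample away from the plot corners, where rounding ϕ′ up to Λ
        # would wrap around in this simple sketch.
        phi = rng.uniform(-170.0, 170.0)
        psi = rng.uniform(-170.0, 170.0)
        r_z, width = to_r_integer(phi, psi, sigma)
        phi_back, psi_back = from_r_integer(r_z, width, sigma)
        total += (phi_back - phi) ** 2 + (psi_back - psi) ** 2
    return math.sqrt(total / (2 * n_samples))


if __name__ == "__main__":
    for sigma in (1.0, 10.0, 100.0, 1000.0):
        print(f"sigma = {sigma:g}: RMS angle error = {round_trip_rms(sigma):.2e} deg")
```

Because each rotated coordinate is rounded by at most half a grid site, the error in each recovered angle is bounded by 1/(√2σ) degrees in this sketch, so it falls in inverse proportion to σ.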