Coexistence of Flexibility and Stability of Proteins: An Equation of State

We consider a recently suggested “equation of state” for natively folded proteins, and verify its validity for a set of about 5800 proteins. The equation is based on a fractal viewpoint of proteins, on a generalization of the Landau-Peierls instability, and on a marginal stability criterion. The latter allows for coexistence of stability and flexibility of proteins, which is required for their proper function. The equation of state relates the protein fractal dimension , its spectral dimension , and the number of amino acids N. Using structural data from the protein data bank (PDB) and the Gaussian network model (GNM), we compute and for the entire set and demonstrate that the equation of state is well obeyed. Addressing the fractal properties and making use of the equation of state may help to engineer biologically inspired catalysts.


Introduction
Proteins are one of the major components of living cells. They constitute more than half of the cell's dry weight, and are responsible for the execution of most cellular functions required for life, including among others, catalysis and molecular recognition within and between cells and their surroundings. Understanding the relationships between structure, internal dynamics, and enzymatic activity at the single-molecule level could pave new ways to manipulate individual molecules.
Two seemingly conflicting properties of native proteins, such as enzymes and antibodies, are known to coexist. While proteins need to keep their specific native fold structure thermally stable, the native fold displays the ability to perform large amplitude conformational changes that allow proper function [1]. This conflict cannot be bridged by compact objects which are characterized by small amplitude vibrations [2]. Recently, however, it became evident that proteins can be described as fractals; namely, geometrical objects that possess self similarity [3,4]. Adopting the fractal point of view to proteins makes it possible to describe within the same framework essential information regarding topology and dynamics.
Based on the fractal viewpoint, we have recently derived a universal equation of state for protein topology. The same fractal viewpoint allows describing the near equilibrium dynamics of native proteins. We have recently shown that it leads to anomalous dynamics [5]. For example, the autocorrelation function of the distance between two a-carbons on a protein is predicted to decay anomalously, first, at short times, as 1{t d and later, at long times, as t {b , where d and b are exponents that depend on various fractal dimensions. This type of relaxation has been recently observed in single molecule experiments [6,7]. Closely related is the anomalous diffusion of an a-carbon that is predicted by the fractal model, where the mean square displacement is found to increase as *t d . Such a behavior has also been recently observed in molecular dynamics simulations [8].
Natively folded proteins can be characterized by broken dimensions: the fractal and spectral dimensions [2,4,5,[9][10][11][12]. The mass fractal dimension d f describes the spatial distribution of the mass within the protein via the scaling relation M r ð Þ*r d f , where M r ð Þ is the mass enclosed in a sphere of radius r [3]. The spectral dimension d s governs the density of low frequency vibrational normal modes via the scaling relation g v ð Þ*v ds{1 , where g v ð Þdv is the number of modes in the frequency range v,vzdv ½ [13]. While for regular three dimensional (3D) lattices both d s and d f coincide with the usual dimension of 3, for proteins it is usually found that d s v2 and 2vd f v3, leading to an excess of low frequency modes and a more sparse fill of space [2,4,12]. Importantly, the regime d s v2 is associated with the so-called Landau-Peierls instability, where the amplitude of vibrations increases with the number of residues N [14,15]. As this amplitude overcomes a threshold value, it may cause the protein to unfold [2,12].
The Landau-Peirels instability is most readily derived using the density of states. The static mean square displacement (MSD) of an a-carbon, which is essentially the so-called B i -factor, averaged over all a-carbons of the protein, may be expressed as where m is the average mass of an amino acid. Since g v ð Þ*v ds{1 , it follows that if d s v2 the integral diverges with the lower bound v min . The latter depends on the protein radius of gyration R g and the number of residues N as v min *R g {d f =ds *N {1=ds . This leads to S Du i ð Þ 2 T*N 2=ds{1 , which increases with N for d s v2. In particular, the static MSD of surface residues has been argued to grow as We have proposed a marginal stability criterion [16], in which most proteins ''exploit'' the Landau-Peierls instability to attain large amplitude vibrations, which is required for their proper function, yet maintaining their native fold. Thus proteins are assumed to exist in a thermodynamic state close to the edge of unfolding. Based on this and the Landau-Peirels instability of the surface residues, Eq. (2), a general equation of state has been proposed that relates between the spectral dimension d s , the fractal dimension d f , and the number of amino acids along the protein backbone N: where b is a molecular fit parameter depending on the temperature T, the GNM spring constant c, and the GNM cutoff R c : b<ln(cR c 2 /k B T)) [12]. It has been shown that this equation is obeyed by about 500 proteins regardless of their source or function [12]. In the present study we check the validity of Eq. (3) for a much larger set of over 5,000 proteins, using a range of statistical methods, and show that also for this very large set Eq. (3) is beautifully fulfilled. This supports the marginal stability criterion that led to this equation.

Methods
We have used all data files present in the Protein Data Bank (PDB) [17] and filtered out proteins exceeding 95% sequence identity and proteins that have ligands, RNA, or DNA. We have also removed incomplete data files, files that contained data of the a-carbons alone, and also files of proteins smaller than 100 amino acids that are too small to be characterized as fractals. With this screening the set has been reduced to 5793 proteins.
The fractal and spectral dimensions were calculated for all 5793 proteins in similar ways to the procedure described by [12]. Finding the protein center of mass and placing the origin of coordinates at the ten nearest a-carbons, the mass was calculated as a function of the distance r on a log-log scale. The fractal dimension d f has been obtained as the slope of this plot for distances below the protein gyration radius R g , averaged over the ten origin of coordinates, see examples in Fig. 1. It should be noted that when a few alternative locations of an atom are given, only the ''A'' location (usually the most abundant one) has been used.
To find the spectral dimension d s , we calculate the cumulative density of normal vibrational modes representing the number of modes up to a frequency v. To obtain the vibrational modes, we used a frequently applied elastic model for protein vibrations, the Gaussian network model (GNM) [12,[18][19][20][21][22][23]. Two values were taken for the interaction distance cutoff R c , that describes the range of the interaction between an acarbons pair, R c = 6 Å and R c = 7 Å . Plotting on a log-log scale G v ð Þ against the frequency v, the slope in the low frequency range (containing about 24% of the modes, independent of the protein type or size N) To deal with the large number of proteins in this set, both procedures were automated using suitable computer codes. The  automatically calculated spectral dimension values were compared (for the case R c = 7 Å ) to the manually obtained values for the set of 543 studied in [12]. We found almost vanishing mean of the difference between the two results (0.0034), showing that the error is statistical, and a low standard deviation (0.083), suggesting good agreement between the two methods of calculation.

Results
The results for the whole set appear in the supporting information S1 and are shown in Figs. 3 and 4 (for R c = 6 Å and 7 Å , respectively), where we plot the combination 2=d s z1 d f against 1=lnN. In order to present the whole set of data, we designed a (smoothed) colored histogram based on a 100|100 grid, where a pixel color represent the number of proteins associated with the pixel. The data is first fitted to Eq. We also fitted the data to an equation resembling Eq. (3) but in which the value ''1'' is replaced by a free parameter a: Remarkably, the colored histogram shows a ridge roughly centered at the best fitting theoretical lines.
To improve the accuracy of the analyses, a subset was constructed containing only those proteins whose both d f and d s values have been determined with a very high precision, such that the squared correlation coefficients for the power-law fits of both M r ð Þ and G v ð Þ were in the range R 2 .0.99. Accordingly, this subset for R c = 6 Å (containing 4249 proteins) is not identical to the subset for R c = 7 Å (containing 3890 proteins), see the supporting information S1 for details. The results are presented in Figs. 5-6. Fitting to Eq. (3) (dashed lines) leads to b = 4.476 for R c = 6 Å (Fig. 5, cc = 0.576) and b = 3.078 for R c = 7 Å (Fig. 6, cc = 0.593). Fitting the data to Eq. (4) (full lines), yields a = 0.952 and b = 4.747 for R c = 6 Å (Fig. 5, cc = 0.577), and a = 0.833 and b = 4.031 for R c = 7 Å (Fig. 6, cc = 0.611).
Although the data analysis presented in Fig. 3-6 appears complete, it fails to give equal weight to proteins of different sizes. All four different data sets used above are very rich in proteins of small (100-200 residues) and intermediate size, a consequence of        In Figs. 9-10 we present results from the high precision subset of 4249 proteins. Fitting to Eq. (3) (dashed lines) leads to b = 4.535 for R c = 6 Å (Fig. 9, cc = 0.941) and b = 3.124 for R c = 7 Å (Fig. 10, cc = 0.937). Fitting the data to Eq. (4) (full lines), yields a = 1.065 and b = 4.155 for R c = 6 Å (Fig. 9, cc = 0.945), and a = 0.917 and b = 3.609 for R c = 7 Å (Fig. 10, cc = 0.946). Here, as well, all lines pass through almost error bars. This refined analysis gives an even stronger support to Eq. (3).

Discussion
All correlation coefficients mentioned above (Figs. 3-10) are considered excellent. In addition, the values of a are close to the theoretically predicted value a = 1, similar to the set of 543 proteins studied by [12]. In particular, the fits of the data to Eq. (4) for all data sets belonging to R c = 6 Å (shown in Figs. 3,5,7 and 9) yields a values that are remarkably close to 1. The distribution of the data in all four sets appears as a ridge that is roughly centered at the best fitting theoretical lines (Figs. 3-6), and when the binning procedure is being used, all lines pass well through the error bars (Figs. 7-10). We believe that these results strongly confirm the universal behavior described by Eq. (3), thereby supporting the theoretical arguments leading to this equation.
Importantly, a is found to be particularly close to 1 when the binning procedure is introduced, in which we analyze the mean value of 2=d s z1 d f , for a given N, for its dependence on N. In these cases we also obtain remarkably good correlation coefficients, significantly better than those obtained without binning. This suggests that, as a group, proteins follow the equation of state, although the error bars indicate that there are other factors present that cause deviations from the equation. These factors could be related to the protein specific structure and/or function.
The distribution of the data in all four sets appears as a ridge that is roughly centered at the best fitting theoretical lines (Figs. [3][4][5][6], and when the binning procedure is being used, all lines pass well through the error bars (Figs. 7-10). We believe that these results strongly confirm the universal behavior described by Eq.
(3), thereby supporting the theoretical arguments leading to this equation.
To conclude, our analysis confirms the fractal nature of proteins and supports the predicted universal equation of state (3). This suggests that the majority of proteins in the PDB exist in a marginally stable thermodynamic state, namely a state that is close to the edge of unfolding. This could be related to the fact that enzymes require flexibility and large internal motion to function properly [1]. We suggest that Eq. (3) can be used as a tool in the design of artificial enzymes [24]. Interestingly, fractal-like properties have also been suggested to appear in the configuration space of peptides [25].

Supporting Information
Supporting Information S1 A file containing the mass fractal dimension d f and spectral dimension d s of all proteins analyzed, divided into the four data sets described in the text: (i) GNM cutoff