Inferring the Rate-Length Law of Protein Folding

We investigate the rate-length scaling law of protein folding, a key undetermined scaling law in the analytical theory of protein folding. Available data yield statistically significant evidence for the existence of a rate-length law capable of predicting folding times to within about two orders of magnitude (over 9 decades of variation). Unambiguous determination of the functional form of such a law could provide key mechanistic insight into folding. Four proposed laws from literature (power law, exponential, and two stretched exponentials) are tested against one another, and it is found that the power law best explains the data by a modest margin. We conclude that more data is necessary to unequivocally infer the rate-length law. Such data could be obtained through a small number of protein folding experiments on large protein domains.


Introduction
A deep understanding of protein folding must involve a description of the general mechanisms involved. It is reasonable to suspect this will consist of a simple model, based on microscopic physics, expressed in a simple mathematical language. Such a model would show how biological sequences are able to employ physics to spontaneously self-assemble into intricate molecular machines.
Simple models of this sort often start by postulating a mechanism of folding, and then derive the consequences of that mechanism [1][2][3][4][5][6][7][8][9][10][11][12]. This suggests it might be possible to infer the general mechanisms of protein folding by verifying the specific predictions of these models. For a simple model of protein folding, however, there is a limited set of general experimental trends that can be readily predicted. One such experimental trend is the law governing how folding times scale with chain length. This is perhaps the simplest comparison of theory and experiment possible, but has not yet been unambiguously inferred despite nearly two decades of active research [2,7].
Polymer theory provides strong precedent for using chainlength scaling laws as a point of connection between experiment and theory. In polymer theory, for instance, comparing the theoretically predicted scaling of the longest relaxation time and polymer diffusion constant as a function of polymer chain length demonstrates deficiencies in the classical Rouse model [13]. Inclusion of hydrodynamic interactions -as is done e.g. in the Zimm model [14] -is necessary to get the correct scaling laws for these kinetic parameters [15]. Thus, comparing chain-length scaling laws in theory and experiment yields new scientific insight, namely that hydrodynamic effects contribute significantly to polymer dynamics in solution. Our hope is that by determining a rate-length law for protein folding, similar comparisons between experiment and theory will yield new insight into how proteins fold.
The rate-length law is also an interesting result in and of itself. Such a law can be viewed as a statement of the computational complexity of protein folding -given a problem of size N (residues), how does one expect the time-to-solution (folding) to scale? Levinthal pointed out that an exhaustive search would result in exponential scaling, and suggested that this would result in unreasonably large folding times [16]. Thus, in many ways, a resolution to Levinthal's paradox is likely to be phrased directly as a rate scaling law, either non-exponential (polynomial) or exponential with an explicitly small exponential factor.
A number of issues complicate inferring such a law from experiment, most importantly the fact that available kinetic data on protein folding spans a very limited range of chain lengthsabout 30 to 300 residues [17]. The statistical power of the data is inherently limited by the fact that protein domain sizes barely span a single order of magnitude, and that most studies of folding have focused on small, well-behaved model systems.
Further complicating the inference of the rate-length scaling law is the fact that chain length is certainly not the only factor affecting folding rates. In fact, it has been argued that it is a fairly weak predictor of the folding time [18,19]. For instance, there seems to be some correlation of folding times with topological complexity of the native state, such that if two proteins have the same number of residues, but different folds, they may take different amounts of time to fold [18,20]. Moreover, even protein mutants with the same native structure can have at least 3 orders of magnitude variation in their folding rates [21]. Thus, we expect that experimental data on the scaling of folding time with chain length should be very noisy, and difficult to statistically estimate.
Nonetheless, as we will demonstrate, there does seem to be a significant correlation between the number of residues (N) and folding times (t). Many different mathematical forms of this scaling law have been postulated, either from theory or empirically, but all fall into one of three basic classes. Shaknovich [2], Cieplak [3,4], and co-workers have proposed a power-law, t*N n . We recently constructed a model that suggested exponential scaling, t*e aN [1], consistent with predictions made by Zwanzig, Szabo and Bagchi [5,6]. Finally, Thirumalai [7,8], Muñoz [9], Takada [10], Finkelstein [11], and co-workers have suggested a stretched exponentials, t*e aN b , with b as 1=2 or 2=3. Wolynes has proposed the law may conditionally change between all four suggested models [12].
Each model for the rate-length law derives from a different model of the fundamental physics of folding. Exponential rate laws have been estimated based on models of protein folding as a biased conformational search [1,5,6]. Stretched exponentials have been derived by modeling folding as a critical nucleation process [11] or random energy model [7][8][9]. Power laws [2][3][4] and stretched exponentials [10] have been posited empirically based on simulation. Finally, lack of a strong rate-length relationship was suggested based on experimental data [18,19]. Clearly, different postulates for folding mechanisms lead to different predictions for the rate-length law (or lack thereof). Thus, inferring the rate-length law of protein folding may directly inform our understanding of the mechanisms by which all proteins fold.
In what follows, we develop and apply two methods for choosing between these models and evaluate how each proposed model performs.

Modeling Methods
Below, we outline two complementary methods for inferring which proposed scaling law is the most reasonable. First, we present a method capable of fitting each model's parameters to the known data, and examining how well each model explains the data. Next, we investigate a second discriminatory method, which proposes that folding times must be below a certain threshold value to be biologically viable. It has been demonstrated that in the crowded milieu of the cell, proteins must fold rapidly to avoid aggregation or degradation. We suggest that this implies that we can check the reasonableness of any model by seeing if its prediction for this threshold time is reasonable with what has been observed empirically in biology.
In this study we focus on single-domain globular proteins. Kinetic data for the folding times of proteins were taken from the KineticDB [17], which reports protein folding times at zero denaturant, near room temperature, and under neutral pH. Other data sets exist [22][23][24], but were not consistent with one another -despite this, they yielded very similar results (Fig. S1, Table S1 in File S1).

Direct Method: Likelihood Maximization
We want to estimate the parameters for each proposed form of the scaling law. In what follows, we adopt a model that accounts not only for this scaling law, but all other factors (topology, experimental conditions, etc.) via a random Gaussian component. Thus, by fitting each model, we not only learn parameters for each proposed model, but also get an estimate for the relative importance of these other factors in determining folding times.
We assert the following model for the folding time, are the proposed folding rate laws, X represents a random variable distributed (independently and identically) as a zero-mean Gaussian, X *N (0,s 2 ), and t 0 is a fit constant accounting for units of time. These models include the four proposed ratelength laws suggested in the literature and a null model that corresponds to the lack of a rate-length relationship. By adding X to the logarithm of the folding time (1), we model random variation in relative terms, and it enters as a multiplicative factor. The random variable X models all contributions to the folding rate not accounted for by the chain length. In a firstprinciples model, this would possibly include effects due to specific sequences, folded state topology, experimental conditions, or perhaps other factors. Here, we lump these terms into a single random variable in order to focus on the chain length exclusively.
Equation (1) implies t is distributed as a log-normal, with location parameter f (N) and scale parameter s. The likelihood of the entire data set (assuming n independent measurements) is We have three parameters for each model, s, t 0 , and a or n for the exponential and power-law families, respectively. We have fit these parameters by maximizing the likelihood L. Model comparison can then be performed by investigating the ratio of the likelihoods of two alternative models (Table 1). We have adopted the simple likelihood approach (versus a full-fledged Bayesian analysis) because the number of fit parameters are small and equal for each model, the models are simple and lowdimensional, and we have little prior information about the parameters. See Fig. S3 and Table S2 in File S1 for a Bayesian analysis and comparison. Note that parameters (a, n) obtained from likelihood maximization of this model will be equivalent to those obtained by performing least-squares regression for the stated models on the logarithms of the folding times. Our maximum likelihood approach provides an estimate of the width of the fits involved (in the form of s) and also a mechanism for rigorous model comparison via likelihood ratios.

Indirect Method: Biological Limits
We postulate that there exists a critical time, t c , that places a biological upper bound on folding times. Specifically, if a protein folds slower than this time (i.e. twt c ) then that protein will be much more likely to aggregate during the course of folding, and therefore is evolutionarily selected against.
The majority of biologically observed proteins should have folding times less than t c , but we postulate that some proteins will have greater times. These proteins are those that receive help folding from chaperones or other cellular machinery. It has been estimated that about C&10% of proteins fall into this category [25].
Together, these assumptions allow us to build a model for the predicted distribution of protein chain lengths. The size distribution of domains (Fig. 1) can be roughly approximated by a Gaussian with parameters m N and s N . In that case, where f {1 (t c =t 0 )~N c is the chain length corresponding to t c for a specific model (power law, exponential, etc.), and C is the percentage of proteins with folding times slower than t c . This framework is, of course, an approximation. There are undoubtedly many other factors affecting the optimal sizes of proteins beyond merely their folding times. Metabolic efficiency, structural packing constraints [26,27], and the behavior of specific proteins in their local cellular environments certainly play a role. Nonetheless, the concept of an upper limit to the folding times is reasonable, and our aim here is to simply extract some general comments about the reasonableness of predicted folding times, rather than make quantitatively accurate predictions.

Results and Discussion
Direct fitting of all proposed models to the available data yields reasonable results for each (Fig. 2). Each model reports a scale parameter (s) of approximately 3, which indicates that 68% of proteins will have folding times within a factor of e s &20 from the time predicted by the rate law, and 95% will be within a factor of e 2s &400. Thus, with knowledge of only the chain length alone, one can estimate the folding time of a protein to within about 2 orders of magnitude and be correct 95% of the time. Since the available data spans folding times of more than 9 orders of magnitude (between 1:9 : 10 {7 and 9:9 : 10 2 seconds), this demonstrates that chain length captures much of the variation in folding times.
To further assess the presence or absence of an overall ratelength law the likelihood of a null model, which modeled the  logarithm of folding rates as a Gaussian distributed independent of chain length (f (N)~const:), was computed. This null model was found to be significantly less likely (at least a factor of 10 9 less probable) given the data than any of the four rate-length models proposed (Table 1).
Do the data support any one model? The power-law model is slightly favored by comparing the likelihoods that each model generated the observed data (Table 1). In such comparisons, typically a ratio of 10 2 or greater is considered significant, and often models differ by hundreds of orders of magnitude [28] thus, the power law model is better supported by the data, but only by a modest margin. Further, an attempt to fit the stretched exponential form with b as a variable parameter resulted in an unreasonably small value of b along with a very large value of a, resulting in a fit that is very close to the power law (Fig. S2 in File S1). Finally, the power law model has the smallest fit s, indicating that it explains the most variation in the data, and attributes less to orthogonal factors.
It is clear, however, that there is little difference between the models in the range of available data. These models diverge significantly only for very large proteins (Fig. 3). We conclude that, with given experimental data, a direct statistical analysis cannot support any one of the theoretically proposed rate-length laws.
Despite this, each law yields significantly different predictions for the distribution of folding times, generated by transforming the known distribution of domain sizes into each of the different models (Fig. 2). The most significant differences are in the tails of these distributions, where the exponential forms predict much longer folding times for the largest proteins ( Table 2). The power law model predicts no proteins fold in times longer than an hour, while the exponential forms show a significant number of proteins with folding times longer than a day (Fig. 3).
An evaluation of the reasonableness of these folding time distributions is provided by the critical time t c for each model (Table 3). For reasonable values of C (&10%, predicted from experiment [25]), the power law t c is on the order of minutes. The exponential forms, on the other hand, predict t c is on the order of hours. Compare this to some cellular benchmarks, as done by Rollins and Dill (personal communication): the 16 second average time to synthesize a protein in E. Coli (325 residues60.05 s {1 synthesis rate) [29], the 30 seconds it takes for GroEL to refold a protein [30,31], or the 20 min E. Coli doubling time.
Indeed, picking a reasonable value of t c and calculating the probability of observing the empirically observed domain size distribution (Fig. 4) shows that for values of t c *10 seconds, the power law model is clearly the best. However, for any values of t c greater than 100 seconds, the exponential laws are much better models.    Figure 4. The probability that each of the rate-length laws reproduces the experimentally observed domain size distribution given a value of t c . Given a rate-length law mapping chain length to folding times, if the distribution of sizes of protein domains is modeled as a Gaussian, choosing a timescale t c above which proteins need assistance to fold, and an approximate fraction C of proteins that need assistance to fold fully specifies the distribution of domain sizes (eq. 3; here, C~0:10). That distribution was used as a model for the generation of the empirically observed domain size distribution reported in Fig. 1,. Here, we plot the probability that the empirically observed distribution resulted from the model indicated. Note the yaxis is log 10 and divided by 10 3 for clarity, so small differences on the plot are actually quite large. doi:10.1371/journal.pone.0078606.g004

Conclusions
We conclude that while chain length is a statistically significant factor for predicting folding rates, current experimental data do no support any one of the proposed models for the rate-length relationship over any other. While the power law model appears to best explain the available raw data, it results in very fast predicted folding times. The exponential forms, while doing a marginally poorer job of explaining the raw data, yield a distribution of folding times more in line with what we expect from biology. Given the current available data, no clear victor emerges.
Previous theories have claimed that simply predicting one of the four laws investigated here is strong evidence in support of that theory. This is manifestly not the case -not only must the proposed law be reasonable, but it must also predict reasonable parameter estimates, and even then the supporting evidence the rate scaling law can provide given current data is limited. Conversely, the analytical theories mentioned here are not ruled out by the current available data. This is most striking in the case of the exponential form, since exponential scaling of the folding times has often been associated with Levinthal's paradox. This study shows that exponential scaling is reasonable given current experimental data, so long as the exponential scaling constant (a) is sufficiently small.
Clear evidence for any one rate law remains missing. However with a few clear examples of very large globular proteins (500 residues or larger) capable of folding unassisted in vitro, it might be possible to discriminate between the models proposed here. Figure 3 clearly shows the divergences between predicted folding times for large proteins, and shows how just a few data points in this extreme regime might be able to begin differentiating between the proposed models investigated here.

Supporting Information
File S1 Supplemental Information. Figure S1. The rate-length data from each source used, plotted together. Even though there is some overlap in reported sequences, many identical proteins have different chain lengths or folding times reported. We combined these data for our analysis, eliminating only identical measurements. Figure S2. The parameter fit for a stretched exponential, log t!aN b , with b a free parameter. Notice how the fit is perfectly straight on a log-log plot, a characteristic trait of power laws. Figure S3. The parameter posteriors for each model are sharply peaked around their modes. Plotted here is the maximum (not the marginal value) of the posterior at various values of the key parameters a or n. One can see that the likelihood has a sharply peaked value along this dimension. Table S1. The KineticDB dataset with mutants included. Since there are only a few proteins with mutants, and there are many mutants for these few proteins, this database gives artificially more weight to those individual proteins.