Current address: DE Shaw Research, New York, New York, United States of America

Improved the method for calculating the cluster free energies: FM FP AL. Conceived the kinetic model: FM. Designed the kinetic model: AL SP. Improved the kinetic model: FM FP. Designed the method to perform the simulated T-jump experiment: SP. Produced most of the results and their interpretation: FM. Analyzed the data: FP AL SP. Discussed the results: FM FP AL SP. Wrote the manuscript: FM FP AL SP.

The authors have declared that no competing interests exist.

Trp-cage is a designed 20-residue polypeptide that, in spite of its size, shares several features with larger globular proteins. Although the system has been intensively investigated experimentally and theoretically, its folding mechanism is not yet fully understood. Indeed, some experiments suggest a two-state behavior, while others point to the presence of intermediates. In this work we show that the results of a bias-exchange metadynamics simulation can be used for constructing a detailed thermodynamic and kinetic model of the system. The model, although constructed from a biased simulation, has a quality similar to those extracted from the analysis of long unbiased molecular dynamics trajectories. This is demonstrated by a careful benchmark of the approach on a smaller system, the solvated Ace-Ala_{3}-Nme peptide. For the Trp-cage folding, the model predicts that the relaxation time of 3100 ns observed experimentally is due to the presence of a compact molten globule-like conformation. This state has an occupancy of only 3% at 300 K, but acts as a kinetic trap. Instead, non-compact structures relax to the folded state on the sub-microsecond timescale. The model also predicts the presence of a state at

Understanding the mechanism by which proteins find their folded state is a holy grail of computational biology. Accurate all-atom simulations have the potential to describe such a process in great detail, but, unfortunately, folding of most proteins takes place on a time scale that is still not accessible to routine computer simulations. We introduce here an approach that allows for constructing an accurate kinetic and thermodynamic model of folding (or other complex biological processes) using trajectories in which the process under investigation is forced to happen in a short simulation time by an appropriate external bias. An important strength of this approach is the possibility of identifying and characterizing misfolded conformations that, in some proteins, are related to important diseases. We use this method to study the folding of Trp-cage, predicting the structure of the folded state and the presence of several intermediates. We find that, surprisingly, fully unstructured “unfolded” states relax towards the folded conformation rather quickly. The slowest relaxation time of the system is instead related to the equilibration between the folded state and another compact structure that acts as a kinetic trap. Thus, the experimental folding time would be determined primarily by this process.

Understanding protein folding thermodynamics and kinetics is a central issue in molecular biology

A system that is almost ideal for theoretical investigation is the Trp-cage (TC5b) _{10}-helix (residues 11–14) and a polyproline II helix at the C-terminus. The folding mechanism of this system has been studied with several experimental techniques. Calorimetry, circular dichroism spectroscopy (CD)

By atomistic modeling the Trp-cage folding has been studied using several different approaches

In atomistic simulations of biological systems, after an exhaustive exploration is achieved, it is necessary to extract from the trajectory the relevant metastable conformations, to assign their occupation probability, and to compute the rates for transitions among them. Several methods have been developed for this purpose

In this paper we present a method that allows exploiting the statistics accumulated in a bias exchange metadynamics run

A cluster analysis is performed on the BE trajectories in a possibly extended CV space, assigning each configuration explored during the biased dynamics to a reference structure (bin) that is close by in CV space.

Next, the equilibrium population of each bin is calculated from the BE simulations using a weighted histogram analysis method (WHAM)

Finally, a kinetic model is constructed by assigning rates to transitions among bins. The transition rates are assumed to be of the form introduced in Ref.

The model constructed in this manner is designed to optimally reproduce the long time scale dynamics of the system. It can be used, for example, for characterizing the metastable misfolded intermediates of the folding process. The advantage of using biased trajectories, besides the acceleration of slow transitions, is a greatly enhanced accuracy of the estimated free energy at transition state regions.

This approach is first illustrated on the Ace-Ala_{3}-Nme peptide (hereafter Ala_{3}). This system is simple enough to allow benchmarking the results against a long standard MD simulation. For this system the model reproduces with excellent accuracy the kinetics and thermodynamics observed in the unbiased run. The same approach is then applied to the Trp-cage miniprotein. A model is built that allows describing the folding process, computing the folding rates and the NMR spectra, and simulating a T-jump experiment. The scenario that emerges is in good agreement with the available experimental data. By kinetic Monte Carlo (KMC)

In the BE approach

Low-dimensional free energy projections are often not very insightful: in complicated processes such as protein conformational transitions, each minimum in a low-dimensional profile may correspond to several different structures. To estimate the relative probability of the different structures, one needs a way to estimate the free energy in a higher-dimensional space (e.g. NR).

In this section a novel method to address this issue is described. The idea is to exploit the low-dimensional free energies obtained from BE to estimate, by a weighted-histogram procedure, the free energy of a finite number of structures that are representative of all the configurations explored by the system. These structures are determined by performing a cluster analysis, namely grouping all the frames of the BE trajectories in sets (

The bins must cover densely all the configuration space explored in BE, including the barrier regions.

The distance in CV space between nearest-neighbor bin centers must not be too large. As will be shown in the following, this is necessary for constructing the rate model.

The population of each bin in the BE trajectory has to be significant, otherwise its free energy estimate will be unreliable.

A set of bins that satisfies these properties is defined here by dividing the CV space into small hypercubes forming a regular grid. The size of the hypercube is defined by its side in each direction:
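The grid partition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_to_grid_bins`, the grid origin `cv_min` and the population threshold `min_count` are hypothetical, and periodic CVs such as dihedral angles would need to be wrapped into a single period before binning.

```python
import numpy as np

def assign_to_grid_bins(cv_frames, cv_min, side, min_count=1):
    """Assign each frame to a hypercube of a regular grid in CV space.

    cv_frames : (n_frames, n_cv) array of collective-variable values
    cv_min    : (n_cv,) grid origin (hypothetical parameter)
    side      : (n_cv,) hypercube side along each CV direction
    min_count : bins visited fewer times than this are discarded
    Returns per-frame bin labels (-1 for discarded bins), the bin
    centers and the bin populations.
    """
    cv_frames = np.asarray(cv_frames, dtype=float)
    cv_min = np.asarray(cv_min, dtype=float)
    side = np.asarray(side, dtype=float)
    # Integer grid index of each frame along every CV direction.
    idx = np.floor((cv_frames - cv_min) / side).astype(int)
    # Occupied hypercubes, one label per frame, and the bin populations.
    cells, labels, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    labels = labels.reshape(-1)
    # Keep only bins whose population is significant.
    keep = counts >= min_count
    new_label = np.full(len(cells), -1)
    new_label[keep] = np.arange(keep.sum())
    centers = cv_min + (cells[keep] + 0.5) * side
    return new_label[labels], centers, counts[keep]
```

Choosing `side` is a trade-off between the requirements listed above: smaller hypercubes keep neighboring bin centers close, while larger ones increase the population of each bin.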

The canonical weight of each bin is estimated by a weighted histogram procedure based on the metadynamics bias potentials. The derivation that we report follows ref. _{3} and 22 ns for Trp-cage), metadynamics has explored all the available CV space. At the end of the simulation, an estimate of the free energy is the average of _{3} and

In order to simplify the notation we have neglected the position-dependence of _{3} and Trp-cage we used an upper bound for

The free energy estimate given by Eq. 6 is affected by an error

Within this framework, the average value of an observable

The enthalpy

Using Eq. 9 together with Eq. 10 allows extrapolating the average value of the observables for a few tens of K around the temperature at which the simulation is performed. The uncertainty on
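As a rough illustration of how bin free energies and enthalpies can be turned into averages at nearby temperatures, the sketch below assumes that each bin is weighted by its Boltzmann factor and that the enthalpy and entropy of each bin are constant over the extrapolation range. These assumptions and all function names are ours; the block is not a transcription of Eq. 9 or Eq. 10.

```python
import numpy as np

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def bin_populations(free_energy, temperature):
    """Normalized Boltzmann weights of the bins (free energies in kcal/mol)."""
    f = np.asarray(free_energy, dtype=float)
    # Shift by the minimum for numerical stability; it cancels on normalization.
    w = np.exp(-(f - f.min()) / (KB * temperature))
    return w / w.sum()

def average_observable(obs, free_energy, temperature):
    """Population-weighted average of a per-bin observable."""
    return float(np.dot(bin_populations(free_energy, temperature), obs))

def extrapolate_free_energy(free_energy, enthalpy, t_sim, t_new):
    """First-order extrapolation F(T') = H - T' S with S = (H - F) / T.

    Assumes H and S of each bin do not change between t_sim and t_new.
    """
    f = np.asarray(free_energy, dtype=float)
    h = np.asarray(enthalpy, dtype=float)
    entropy = (h - f) / t_sim
    return h - t_new * entropy
```

For example, `average_observable(obs, extrapolate_free_energy(f, h, 298.0, 282.0), 282.0)` would give an estimate of the observable at 282 K from data collected at 298 K, under the stated assumptions.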

In this section we describe a way of constructing an approximate kinetic model describing transitions between the bins introduced in the previous section. Constructing the model requires estimating the rates

The rates given by Eq. 12 are used in a KMC algorithm
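A kinetic Monte Carlo run over a network of bins can be sketched as below. The rate expression used here, a single prefactor `k0` times exp(-ΔF/2kT) between neighboring bins, is an assumed stand-in chosen because it satisfies detailed balance; it is not a reproduction of Eq. 12, and the function names are hypothetical.

```python
import numpy as np

def rate_matrix(free_energy, neighbors, k0, kT):
    """Build rates[a, b] for the jump a -> b between neighboring bins.

    Assumed form: k0 * exp(-(F_b - F_a) / (2 kT)), which obeys detailed balance.
    """
    n = len(free_energy)
    rates = np.zeros((n, n))
    for a, b in neighbors:
        df = free_energy[b] - free_energy[a]
        rates[a, b] = k0 * np.exp(-df / (2.0 * kT))
        rates[b, a] = k0 * np.exp(df / (2.0 * kT))
    return rates

def kmc_trajectory(rates, start, n_steps, rng):
    """Gillespie-type kinetic Monte Carlo on a discrete set of bins."""
    state, t = start, 0.0
    states, times = [state], [t]
    for _ in range(n_steps):
        out = rates[state]
        total = out.sum()
        if total == 0.0:
            break  # absorbing bin: no outgoing transitions
        # Waiting time is exponential with mean 1 / (total escape rate).
        t += rng.exponential(1.0 / total)
        # The destination is chosen with probability proportional to its rate.
        state = int(rng.choice(len(out), p=out / total))
        states.append(state)
        times.append(t)
    return np.array(states), np.array(times)
```

Observables such as mean first passage times between groups of bins can then be accumulated from the returned `states` and `times` arrays.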

The diffusion matrix entering in Eq. 13 is estimated using the approach of Ref. ^{5}×10^{5}). The notation

Using these probabilities one evaluates the logarithm of the likelihood to observe the sequence of bins obtained by MD. This is given by
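The likelihood evaluation can be illustrated with the sketch below. It assumes that the transition probabilities at a given time lag are obtained by exponentiating the generator built from the rates, and that the log-likelihood is the sum of the logarithms of the probabilities of the observed bin-to-bin transitions. The function names and the `stride` argument are hypothetical, and the optimization over the prefactor is left out.

```python
import numpy as np

def transition_probabilities(rates, lag):
    """P[a, b] = probability of being in bin b a time `lag` after bin a.

    rates[a, b] is the rate of the jump a -> b; the generator Q has
    minus the total escape rate on the diagonal, and P = exp(Q * lag).
    """
    q = rates - np.diag(rates.sum(axis=1))
    w, v = np.linalg.eig(q)
    # Scale each eigenvector column by exp(eigenvalue * lag), then transform back.
    return np.real((v * np.exp(w * lag)) @ np.linalg.inv(v))

def log_likelihood(bin_sequence, trans_prob, stride=1):
    """Log-likelihood of an observed sequence of bins sampled every `stride` frames."""
    seq = np.asarray(bin_sequence)[::stride]
    p = trans_prob[seq[:-1], seq[1:]]
    if np.any(p <= 0.0):
        return -np.inf  # the model forbids an observed transition
    return float(np.log(p).sum())
```

In a maximum-likelihood fit, `log_likelihood` would be evaluated for trial values of the rate prefactor and the value giving the largest result retained; repeating this for several lags shows whether the estimate converges.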

Applying this procedure the prefactor of the rate Eq. 12, which has the form of a jump process among a discrete set of states, is directly optimized. This is a clear advantage with respect to other methods for computing

The approach described in the previous two sections has been carefully benchmarked on solvated Ala_{3}. For this system, it was possible to compare the predictions of the kinetic model with the results of a very long (∼2 µ

All the BE and MD simulations were performed using the GROMACS suite of programs _{3} was placed in a periodic cubic box containing 1052 TIP3P water _{3}. The SETTLE algorithm

The conformations of Ala_{3} are specified by its six backbone dihedral angles (_{3}) were considered in order to assign the main conformations of the system, denoted by

The system was also simulated using bias exchange metadynamics (BE)

The computational setup used in Ref.

The proton chemical shift deviations (CSD) and ring current shifts (RCS) of a specific configuration were estimated using the SHIFTS program

The Trp solvent accessible surface area (SASA) was calculated for each bin averaging over all the configurations belonging to a bin using the program g_sas in the GROMACS distribution

Ala_{3} is a simple polypeptide that has been extensively used as a benchmark system. Although small, this system shows several protein-like features, such as intramolecular hydrogen bonds and a fragment of _{3} system will be exposed.

The system was simulated using BE _{3}-Nme system section). As expected BE improves the sampling of saddle regions (see _{3} are six one-dimensional free energy profiles (see

Even in this simple system the different structures (see

Correlation between the bin free energies calculated using Eq. 6 applied to BE simulation data and using the standard thermodynamics relation

The equilibrium population of each of the

MD | 34.3% | 12.6% | 22.0% | 0.050% |

BE | 32.1% | 12.0% | 22.3% | 0.085% |

A kinetic model of Ala_{3} was built according to the procedure introduced in the

Panel A: correlation between the MFPT among the four regions in

Dependence of the slope

As a general comment, even in the worst cases investigated (short

The results presented here were obtained analyzing, with the method introduced in the

The set of bins used for constructing the rate model was defined partitioning the five-dimensional CV space in small hypercubes according to the procedure outlined in the _{3} system in the case of the Trp-cage an extended ergodic MD simulation is not available, as equilibrating the system would require performing a run of several tens of

As in the Ala_{3} case, the free energies of the bins were used for estimating the rates of the transitions between all the neighbouring bins according to Eq. 12. The diffusion matrix entering Eq. 13 was evaluated using the maximum likelihood approach described in the

The maximum likelihood analysis has been repeated sampling the MD trajectory at several different time lags

The rate model described in the

Panel A: metastable sets (clusters) detected by MCL method using

Cluster | 1 | 2 | 3 | 4 | 5 |
% occupancy | 58.3±0.8 | 24.6±0.7 | 7.0±0.3 | 1.2±0.1 | 2.8±0.2 |
 | 0.0±1.9 | 5.0±2.6 | 11.7±3.8 | 13.8±5.3 | 38.2±5.3 |
 | 0.0±1.9 | 2.9±2.6 | 6.5±3.8 | 4.1±5.3 | 30.7±5.3 |
 | 1.82±0.05 | 4.44±0.03 | 6.76±0.04 | 5.54±0.06 | 6.08±0.05 |
Trp SASA (Å^{2}) | 47.1±0.6 | 70.5±1.0 | 126.4±0.7 | 116.7±1.0 | 140.4±0.8 |
Helical residues | 5.31±0.02 | 2.91±0.03 | 3.86±0.04 | 0.66±0.03 | 1.70±0.03 |

The properties of the clusters depicted in ^{2}, which compares with the value of 47.1±0.6 Å^{2} observed in the folded cluster. This indicates that Trp is shielded from the solvent also in cluster 2. Arg16 forms a

Hydrophobic contacts within 3.9 Å and hydrogen bonds (Å) are displayed. The distances (Å) between selected protons of Leu7, Pro12, Arg16 and Trp6 are shown for the 3 most populated clusters. The corresponding values can be compared with the unfolded state NOE contact distances reported in Ref.

In order to characterize in more detail the nature of the clusters described in the previous section, it is useful to consider their NMR properties. As only clusters 1 and 2 are compact and show a significant content of secondary structure, the investigation is restricted here to these two clusters.

In

Panel A: correlation between experimental and calculated

Although the correlation is good, the proportionality factor between theoretical and experimental CSDs is 0.46 in the full ensemble of bins and 0.6 in cluster 1. To investigate the origin of these variations, two 20 ns equilibrium MD simulations were performed, at 282 K (the experimental temperature) and at 300 K, starting from the NMR structure and with the same computational setup used in the BE simulation. At both temperatures the proportionality factor with experimental CSDs is 0.8 instead of 1; 0.8 must therefore be considered the reference value for our computational setup. This optimal factor of 0.8 is obtained if the CSDs are computed on the lowest free energy bin of cluster 1. The slope difference between 0.6 (cluster 1) and 0.8 may be ascribed to small inconsistencies between the ensembles of structures generated by BE and by an unbiased MD starting from the NMR structure. The further slope variation when the calculation is extended to the full ensemble of bins is most likely a consequence of calculating NMR properties at 298 K instead of at the experimental temperature of 282 K, where the population of cluster 1 is larger.

Using a similar procedure (see

The fluorescence relaxation after a temperature jump (T-jump) was estimated according to the procedure outlined in the
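One way to picture a simulated T-jump on a discrete rate model is to start from the bin populations at the initial temperature and let them relax under the rates of the final temperature, following an observable such as the Trp SASA. The sketch below propagates the master equation by eigendecomposition. This is an assumption made for illustration (a KMC average over many trajectories would serve the same purpose), and the function name is hypothetical.

```python
import numpy as np

def relax_observable(rates, p0, obs, times):
    """Relaxation of a population-averaged observable after a sudden jump.

    rates : rates[a, b] for the jump a -> b at the final temperature
    p0    : bin populations at the initial temperature
    obs   : per-bin value of the observable (e.g. Trp SASA)
    times : times at which the signal is evaluated
    """
    # Generator of the master equation dp/dt = Q^T p for a column vector p.
    q = rates - np.diag(rates.sum(axis=1))
    w, v = np.linalg.eig(q.T)
    # Expand the initial populations on the eigenvectors.
    coeff = np.linalg.solve(v, p0)
    signal = []
    for t in times:
        p_t = np.real((v * np.exp(w * t)) @ coeff)
        signal.append(p_t @ obs)
    return np.array(signal)
```

The relaxation times of the signal are the inverses of the nonzero eigenvalues of the generator (in absolute value), so a multi-exponential fit of the returned curve can be compared with the slowest eigenmodes.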

Here the dynamics of the system is investigated in more detail, still using the rate model introduced in the

Times (inverse of rates) for the transitions between the relevant clusters are shown on the arrows. The uncertainty on each transition time due to both the error on the free energies and the position-dependence of

The approach presented here exploits the trajectories of multiple metadynamics simulations for building a thermodynamic and kinetic model of complex processes (e.g. protein folding) whose description requires a large number of collective variables. The aim of the model is to reproduce the long time scale dynamics of the system and to extract the metastable sets (clusters) of the kinetic process. These states may correspond, for example, to misfolded conformations. The model is constructed as follows: in a first step the equilibrium probabilities of a finite set of conformational states, or bins, are determined by a weighted-histogram procedure exploiting the low-dimensional free energies estimated by metadynamics. In a second step an approximate description of the kinetics is obtained by estimating the transition rates among the bins. The diffusion matrix entering in the model is estimated by a maximum-likelihood procedure _{3}-Nme peptide in explicit solvent using the six backbone dihedral angles as CVs. For this system equilibrium MD trajectories on the microsecond timescale are sufficient to sample the relevant conformational space and were used as a reference to evaluate the accuracy of the kinetic model obtained from the BE results. The bin free energies obtained with the method presented here are in excellent agreement with free energies computed from equilibrium MD. The transition rates among neighboring bins are used to run a long KMC simulation. The mean first passage times among selected states obtained in this way are in agreement with those extracted from the reference MD simulations.

Trp-cage is a designed miniprotein that, due to its small size and fast folding rate, has been the object of several theoretical investigations. Here this system is analyzed with a new method, introduced in this paper, that allows deriving a kinetic model of the system by analyzing a set of biased MD trajectories. The model shows the presence of several metastable states (clusters). The most populated one can be classified as the folded state. The second most populated cluster has a

In spite of the presence of several intermediates both the simulated T-jump experiment (see

In conclusion, we have presented an approach aimed at constructing a rate model for complex biomolecular processes starting from a set of biased MD trajectories. One could argue that other approaches aimed at the same purpose are based on less severe assumptions. Distributed simulation techniques allow computing the folding rates directly, and have been applied successfully for studying folding in explicit solvent of even larger systems

Cartesian coordinates of folded state (cluster 1) reference structure in Protein Databank format.

(0.02 MB TXT)

Cartesian coordinates of cluster 2 reference structure in Protein Databank format.

(0.02 MB TXT)

Cartesian coordinates of cluster 3 reference structure in Protein Databank format.

(0.02 MB TXT)

Cartesian coordinates of cluster 4 reference structure in Protein Databank format.

(0.02 MB TXT)

Cartesian coordinates of cluster 5 (compact molten globule) reference structure in Protein Databank format.

(0.02 MB TXT)

Structures of the attractors for the relevant free energy basins of Ala_{3} found in the MD and BE simulations. Inset: Schematic picture of the Ala_{3} test system. The dihedral angles φ and ψ displayed in the figure are chosen as CVs for the BE simulation. They are labeled with a suffix according to their position along the chain.

(4.96 MB TIF)

Free energy profiles as a function of φ1 (see _{G}(s,t) during a BE simulation between 1 and 8 ns; after ∼5 ns the bias potential converges and grows parallel to itself. Panel B: Free energy profile from the 1.8 µs MD simulation compared with the profiles obtained from three independent BE simulations. The 3 BE profiles are obtained by applying eq. 2.

(0.49 MB TIF)

Correlation between free energies of neutral walker and WHAM for Trp-cage. Correlation between the bins free energy evaluated using the approach described in the

(0.39 MB TIF)

Simulated Trp-SASA T-jump of Trp-cage. Simulated Trp SASA evolution as a function of time at 298 K starting from an initial distribution at 291 K (black line). The red line is a double exponential fit to the data. The two time constants of the fit are τ_{1} = 248 ns, τ_{2} = 2313 ns. The diffusion matrix entering in the kinetic model was calculated using several MD simulations for a cumulative time of ∼500 ns. A time lag of 12 ns was used in the maximum likelihood approach for calculating D.

(1.16 MB TIF)

Free energy profiles of Ala3 along the six backbone dihedral angles. The profiles are calculated using eq. 2 on the last 10 ns of a 30 ns BE simulation.

(0.15 MB TIF)

Free energy profiles as a function of time for Ala3 obtained with a 30 ns BE simulation. −V_{G} is reported for each backbone dihedral angle at several times after the filling time. Each time is represented with a different color: black (10 ns), red (11 ns), green (12 ns) and blue (13 ns). The parallel growth in time of the metadynamics bias potential is evident from the picture.

(0.36 MB TIF)

Diffusion matrix of Trp-cage as a function of the time lag. A few elements of the diffusion matrix are reported. An MD trajectory of ∼500 ns and the maximum likelihood approach explained in the manuscript are used for calculating D at each time lag. After approximately 8–10 ns the diffusion matrix elements show a converging behaviour.

(1.00 MB TIF)

Bin network topology at T = 298 K projected on three dimensions: Cα contacts, dihedral correlations and α-helix fraction. Each bin is represented as a sphere whose size and color are associated with the free energy (kcal/mol). The locations of the lowest free energy bins of the folded state and of the molten globule (cluster 5) are indicated in the figure.

(3.07 MB TIF)

Diffusion matrix tables and corresponding rates.

(0.07 MB PDF)

We are very grateful to David Chandler for several precious suggestions. We also thank Vanessa Leone, Paolo Carloni, Rolando Hong and Xevi Biarnes for useful discussions and for reading the manuscript before submission.