Conceived and designed the experiments: AK RO. Performed the experiments: AK. Analyzed the data: AK TG MM. Contributed reagents/materials/analysis tools: AK. Wrote the paper: AK RO.

Current address: Department of Medicine, Children's Hospital Boston, Harvard Medical School, Boston, Massachusetts, United States of America

The authors have declared that no competing interests exist.

Calculation of the free energy of protein folding and delineation of its pre-organization are of foremost importance for understanding, predicting and designing biological macromolecules. Here, we introduce an energy smoothing variant of parallel tempering replica exchange Monte Carlo (REMS) that allows for efficient configurational sampling of flexible solutes under the conditions of molecular hydration. Its usage to calculate the thermal stability of a model globular protein, Trp cage TC5b, achieves excellent agreement with experimental measurements. We find that the stability of TC5b is attained through the coupled formation of local and non-local interactions. Remarkably, many of these structures persist at high temperature, concomitant with the origin of native-like configurations and mesostates in an otherwise macroscopically disordered unfolded state. Graph manifold learning reveals that the conversion of these mesostates to the native state is structurally heterogeneous, and that the cooperativity of their formation is encoded largely by the unfolded state ensemble. In all, these studies establish the extent of thermodynamic and structural pre-organization of folding of this model globular protein, and achieve the calculation of macromolecular stability

The importance of accurately defining the molecular ensembles of proteins was recognized early by Levinthal, who concluded that folding of a random coil by way of a diffusive search of its combinatorially vast conformational space is incompatible with the biological energies and timescales of protein folding

Structured unfolded states have been observed in a variety of proteins

Study of this question has been made difficult by the spectroscopic limits of resolving microscopic ensemble sub-states that exist under the conditions of physiologic temperature, pressure, and hydration

Usage of Monte Carlo (MC) algorithms that utilize simultaneous changes of many conformational variables, such as loop torsion MC and replica exchange MC (REM), has shown promise in efficiently calculating convergent ensembles of proteins in aqueous solution ^{−½}, where

Here, we introduce another such variant, termed replica exchange MC with energy smoothing (REMS), that does so by manipulating the energy expression itself. We show that in spite of deforming the free energy surface to some extent, REMS yields apparently canonical free energy distributions in the energetic regime of biological systems. Consequently, we apply REMS to simulate the thermal folding of a small globular protein, the 20-residue Trp cage TC5b, under the near physiologic conditions of molecular hydration. We show that such an approach can be used for efficient and accurate calculation of protein stability

TC5b is a small globular protein, consisting of several natural and redesigned structural motifs (^{3}) to accommodate a fully extended TC5b in explicit water, a 100 ps MD trajectory phase prior to replica exchange to achieve equilibration (

The structure is composed of an N-terminal α-helix with its α-helical/secondary Q^{5}∶K^{8} salt bridge (red), type I β-turn S^{13}-S^{14}-G^{15} with its β-turn/tertiary D^{9}∶R^{16} salt bridge (blue), and a hydrophobic core that includes both α-helical Y^{3}∶W^{6} and tertiary W^{6}∶P^{19} interactions (gold mesh).

A. Instantaneous potential energy (

We used a smoothing time of 200 fs for the calculation of the Metropolis criterion during REMS (

A. Histograms of potential energies (_{V}^{*}, solid stars) leads to a significant underestimation of the heat capacity of water at low temperature. Sizes of symbols represent ±1σ.

A. Mean probabilities 〈_{n}_{m}^{6}∶P^{19} (closed squares), hydrophobic core Y^{3}∶W^{6} (open circles), salt bridge D^{9}∶R^{16} (closed stars), α-helical Y^{3}∶L^{7} (solid circles) and the β-turn D^{9}∶S^{14} (open squares) hydrogen bonds between initial and final structures as a function of replica exchange for the 363 K replica. These measures were chosen because their non-local nature should be most sensitive to initial configuration memory effects. The total length of the REMS simulation exceeds the apparent computational time constant of self-diffusion by nearly three orders of magnitude.

In order to examine the origin of thermal stability of TC5b, we calculated the apparent stabilities of various conformational motifs of TC5b as a function of temperature (

A. Fraction of formed α-helix L^{2}YIQWLK^{8} (dashed black), β-turn S^{14} (solid green), and polyprolyl helix P^{18} (dotted blue), as defined using self-consistent clustering and enumeration of their backbone dihedral angles. Note that P^{18} remains unchanged in its backbone conformation due to its definition in CHARMM. Individual α-helical residues have varying thermal stability, with the more N-terminal ones being less stable, consistent with the existence of α-helical fraying. B. Fraction of formed α-helical salt/secondary bridge Q^{5}∶K^{8} (solid red), α-helical hydrogen bond Y^{3}∶L^{7} (dotted red), β-turn/tertiary salt bridge D^{9}∶R^{16} (solid blue), β-turn hydrogen bond D^{9}∶S^{14} (dashed green), tertiary hydrophobic core W^{6}∶P^{19} and Y^{3}∶P^{19} (solid and dashed black), and secondary hydrophobic core Y^{3}∶W^{6} (dashed red), as defined by using self-consistent clustering and enumeration of their distances. Note that the α-helical salt/secondary bridge is only partially formed at low temperature, even though the rest of the structure is nearly fully folded by other measures. Similarly, the secondary hydrophobic core Y^{3}∶W^{6} persists even at high temperature, where the rest of the protein is largely unfolded by other measures. Importantly, a substantial amount of residual native structure persists at high temperature. C. Fraction of formed mean α-helical structure (dashed black), mean β-turn structure (solid green), mean tertiary structure (solid black) in the REMS calculated ensembles, and native fraction measured experimentally using chemical shift dispersion (squares), as adapted from the first study of TC5b

Stabilities of both local and non-local structural motifs exhibit an apparently sigmoid melting transition (

Experimental studies of TC5b indicate a substantial amount of residual structure in the unfolded state ensemble at high temperature

In order to discover the origin of residual structure at high temperature, we applied a graph-based approach designed to learn the natural coordinates of high-dimensional data. By embedding the molecular ensemble in a graph based on geometric similarity, and projecting the individual structures onto a manifold that preserves nearest-neighbor geometric relations of this graph, we are able to distinguish globally organized configurations, termed mesostates, from groups of structures composed of unrelated conformations (

Mapping of the unfolded state ensemble, as calculated using the 363 K replica, onto the two top coordinates of its locally linear embedding space (open black circles), and the two top coordinates of its principal component projection (solid green circles). Principal component analysis fails to discern mesostate structure of the unfolded state ensemble, with the entire ensemble located near the origin of the PCA projection. On the other hand, displacement along the manifold from the origin of the LLE map coincides with the formation of native-like mesostates, containing: 1) α-helical/secondary salt bridge (red), 2) β-turn/tertiary salt bridge (blue), 3) α-helix and α-helical hydrophobic core, and 4) nearly native configurations with both the α-helix and the tertiary hydrophobic core.

In order to estimate the extent of pre-organization of the thermal folding of TC5b by the residual structure of the high temperature ensemble, we calculated the apparent cooperativities of forming pairs of conformations into configurations, as expressed by the probabilities of forming these configurations conditional on the formation of their constituent conformations (

P_{pair} | P_{i} | P_{j} | P_{pair}/(P_{i}P_{j})
0.30 | 0.14 | 0.19 | 11
0.54 | 0.19 | 0.59 | 4.8
0.27 | 0.59 | 0.16 | 2.9
0.29 | 0.16 | 0.27 | 6.7
0.51 | 0.27 | 0.29 | 6.5

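Conditional probabilities of forming pairs of native interactions, as listed, with P_{pair} the probability of forming both interactions together, P_{i} and P_{j} the probabilities of forming the individual interactions, and P_{pair}/(P_{i}P_{j}) the apparent cooperativity.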
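The ratio in the last column of the table can be illustrated with a short sketch. It assumes that each sampled structure has already been classified, by whatever clustering is in use, as having interaction i and interaction j formed or not; the function name `cooperativity` and the Boolean-series input are hypothetical conveniences, not part of the original analysis. A ratio of 1 corresponds to independent (non-cooperative) formation.

```python
from typing import Sequence


def cooperativity(formed_i: Sequence[bool], formed_j: Sequence[bool]) -> float:
    """Return P_pair / (P_i * P_j) for two per-structure indicator series."""
    if len(formed_i) != len(formed_j) or len(formed_i) == 0:
        raise ValueError("need two non-empty series of equal length")
    n = len(formed_i)
    p_i = sum(1 for a in formed_i if a) / n
    p_j = sum(1 for b in formed_j if b) / n
    # Joint probability: both interactions formed in the same structure.
    p_pair = sum(1 for a, b in zip(formed_i, formed_j) if a and b) / n
    if p_i == 0.0 or p_j == 0.0:
        raise ValueError("ratio is undefined when an interaction never forms")
    return p_pair / (p_i * p_j)


# Arithmetic check against the first tabulated row (0.30, 0.14, 0.19).
first_row_ratio = 0.30 / (0.14 * 0.19)
```

Evaluating the same ratio for the remaining rows reproduces the tabulated values to the stated precision, which is the basis for reading the last column as P_{pair}/(P_{i}P_{j}).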

Apparent coupling between local (conformational) and non-local (configurational) contacts has been noted earlier during the folding of Gō lattice polymers, where its origins were related to the details of the potential energy function defining the native state

The apparent cooperativity between forming concomitant α-helical and tertiary hydrophobic cores of TC5b exceeds the expected non-cooperative value by nearly a factor of 3 (

In order to assess how the thermal folding reaction can proceed by way of configurational mesostates, we examined the folding ensemble at the midpoint of its folding transition as comprised by the 310 K replica, by using graph manifold learning. At the folding midpoint, the unfolded and native state ensembles are equi-populated, and their inter-conversion defines all of the possible folding pathways _{1}_{2}_{3}_{1}

Mapping of TC5b's folding ensemble at the midpoint of its thermal transition, as calculated using the 310 K replica, onto the top three coordinates of its LLE manifold. Displacement along the _{1}_{2}^{8} that is part of the N-terminal α-helix in the NMR structure. Displacement along the _{3}

Insofar as the free energy of flexible polymers can be described by a configurational partition function, our study shows that molecularly adapted variants of replica exchange, including REMS introduced here, can be used for the calculation of the free energy and cooperativity of protein folding

Furthermore, the successes and failures of current

To understand the origin of protein stability and cooperativity, we chose to examine a protein the folding of which is well characterized structurally, thermodynamically, and kinetically. The smallest such protein is the 20-residue Trp cage ^{1}), α-helical/secondary salt bridge (Q^{5}∶K^{8}), β-turn/tertiary salt bridge (D^{9}∶R^{16}), and optimized hydrophobic stack (Y^{3}∶W^{6}). In addition, TC5b contains a naturally occurring type I β-turn S^{13}-S^{14}-G^{15}, type II polyproline helix P^{17}-P^{18}-P^{19}, and a hydrophobic core containing both local secondary L^{2}-Y^{3}-I^{4} and non-local tertiary W^{6}∶P^{18} and Y^{3}∶P^{19} interactions.

The NMR structure of TC5b (PDB code 1L2Y; model 1) was used as the starting configuration for our studies. The structure was solvated under periodic boundary conditions using a 60×60×60 Å^{3} cubic box of equilibrated TIP3 water, and energy minimized using the CHARMM27 potential energy function in the presence of one randomly placed chloride ion to yield electroneutrality ^{3} in volume, containing a total of 21,640 atoms and 7,112 water molecules. Such size and equilibration were necessary to thermalize and unfold this protein (see below). This system was used as the initial state for molecular dynamics equilibrations in the canonical (

For REM, we utilized the MMTSB Tool Set, a recently developed collection of Perl scripts that interface with CHARMM _{n}_{m}_{n}_{m}_{n}_{n}_{ts}_{m}_{m}_{ts}_{B}_{n}_{m}

Upon each exchange of replicas neighboring in temperature, another exchange using the new pairs of neighboring replicas was attempted in order to maximize the tempering effect and the movement of replicas across the sampled temperature range. Upon a completed exchange, velocities of the exchanged configurations were rescaled to the new temperatures, another exchange was attempted 2 ps later, and the entire REMS simulation was produced for a total of 4,710 exchanges, while discarding 100 initial exchanges, corresponding to more than 0.3 µs of aggregate MD time, and sampling more than 150 million configurations.
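A minimal sketch of one exchange attempt between two replicas neighboring in temperature may help fix ideas. It uses the standard parallel tempering Metropolis criterion and assumes that energy smoothing amounts to replacing each instantaneous potential energy with its arithmetic mean over the smoothing window; that reading, the function names, and the choice of kcal/mol units are assumptions made for illustration, not the MMTSB or CHARMM implementation.

```python
import math
import random
from typing import Sequence

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def smoothed_energy(window: Sequence[float]) -> float:
    """Arithmetic mean of potential energies sampled over the smoothing window."""
    if len(window) == 0:
        raise ValueError("smoothing window is empty")
    return sum(window) / len(window)


def exchange_probability(e_n: float, t_n: float, e_m: float, t_m: float) -> float:
    """Metropolis probability of swapping the configurations of replicas n and m."""
    beta_n = 1.0 / (K_B * t_n)
    beta_m = 1.0 / (K_B * t_m)
    delta = (beta_n - beta_m) * (e_m - e_n)
    return 1.0 if delta <= 0.0 else math.exp(-delta)


def attempt_exchange(window_n: Sequence[float], t_n: float,
                     window_m: Sequence[float], t_m: float,
                     rng: random.Random) -> bool:
    """Accept or reject one swap using smoothed (window-averaged) energies."""
    p = exchange_probability(smoothed_energy(window_n), t_n,
                             smoothed_energy(window_m), t_m)
    return rng.random() < p


def velocity_scale(t_old: float, t_new: float) -> float:
    """Factor applied to velocities after an accepted swap (kinetic energy ~ T)."""
    return math.sqrt(t_new / t_old)
```

In this sketch a swap is always accepted when the higher-temperature replica holds the lower smoothed energy, and is otherwise accepted with a probability that decays exponentially in the energy gap, so closer temperature spacing raises the acceptance rate.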

Energy smoothing of REMS is equivalent to introducing an error into the calculation of the Metropolis criterion, and consequently produces non-stationary distributions of Markov chains of configurations. Though different in origin, this feature of REMS is analogous to the lack of stationary distributions produced by other tempering methods such as variants of Jump-walking (J-walking), where the conventional MC walker is allowed large transitions sampled from a different temperature ensemble, yielding generally non-stationary distributions of states

In order to evaluate the suitability of REMS to actually recover canonical energy distributions, we calculated the constant volume heat capacity of pure water: C_{v} = (〈U^{2}〉–〈U〉^{2})/k_{B}T^{2}. Because heat capacity reports squares of energy fluctuations, it is an extremely sensitive measure of the equipartition of energy that characterizes canonical ensembles. For this purpose, we used a 20×20×20 Å^{3} box of equilibrated TIP3 water under periodic boundary conditions, simulated using MD in the canonical ensemble for 1 ns, using MD protocol as described above, at four different temperatures: 273, 285, 299, and 313 K. We carried out a REMS simulation of the same system, using replicas at 273, 285, 299, and 313 K, simulated for 1,000 exchanges attempted every 1 ps with
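The fluctuation formula for C_{v} given above can be evaluated directly from a series of sampled potential energies. The following is a sketch under stated assumptions: energies are in kcal/mol, the samples are treated as equally weighted, and the function name `heat_capacity` is hypothetical. The variance is computed from centered values, which is algebraically equal to 〈U^{2}〉–〈U〉^{2} but less prone to round-off.

```python
from typing import Sequence

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); unit choice is an assumption


def heat_capacity(energies: Sequence[float], temperature: float) -> float:
    """Constant-volume heat capacity from energy fluctuations, in kcal/(mol K)."""
    n = len(energies)
    if n == 0:
        raise ValueError("no energy samples")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    mean_u = sum(energies) / n
    # <U^2> - <U>^2, written as the mean squared deviation from <U>.
    var_u = sum((u - mean_u) ** 2 for u in energies) / n
    return var_u / (K_B * temperature ** 2)
```

Because the estimate depends on the second moment of the energy, it converges more slowly than the mean energy, so comparing it between conventional MD and REMS at the same temperature is a demanding test of the sampled distribution.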

To evaluate the computational efficiency of REMS, we calculated mean transition probabilities of exchanging pairs of replicas adjoining in temperature during the course of the simulation of the thermal folding of TC5b. As can be seen from _{0−N}_{0−N}

In the analysis of structures of calculated ensembles, we use the term conformation to refer to geometries of individual interactions, and configuration to refer to molecular geometries of groups of interactions. Although canonical structures, such as α-helices and β-turns, have defined regular geometries, conformations in solution at ambient temperature exhibit considerable plasticity. Thus, we utilized a self-consistent method for defining conformational basins using a stepwise optimal clustering algorithm based on a self-organizing neural net, as implemented in ART-2 by Brooks and coworkers

In this manner, we examined the formation of the N-terminal α-helix by clustering (^{2}YIQWLK^{9} polypeptide backbone and intrahelical hydrogen bond distances between backbone amide hydrogens and carbonyl oxygens, formation of the α-helical/secondary Q^{5}∶K^{8} salt bridge by clustering the distance between side chain Q carboxamide oxygen and K amine nitrogen, formation of the β-turn by clustering (^{14} and the hydrogen bond distance between backbone D^{9} carbonyl oxygen and side chain S^{14} hydroxyl hydrogen, formation of the β-turn/tertiary salt bridge D^{9}∶R^{16} by clustering the distance between side chain D carboxylate carbon and R guanidino nitrogen, formation of the polyproline helix by clustering (^{18}, and lastly, formation of the hydrophobic core by clustering contact distances among side chain Y^{3} phenol carbon ζ, W^{6} indole carbon δ, and P^{19} imido carbon δ. For all conformational variables, probabilities of forming native conformations were calculated by using clusters with near native centroids, as referenced to the NMR structure of TC5b.

Because probabilities of forming structural configurations, such as folding intermediates, cannot be derived from conformational probabilities _{pair}

In order to discover configurations that involve more than four-body interactions described above, we applied non-linear graph manifold learning techniques. Conventionally, study of high dimensional data such as atomic protein folding trajectories has been done using linear methods such as principal component analysis (PCA). PCA works by computing linear projections of greatest variance from the top eigenvectors of the data covariance matrix, thereby preserving the covariance structure of the data. However, because the global structure of high dimensional data is not necessarily linear, low dimensional linear principal components fail to capture this structure adequately (
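The linear projection that PCA relies on can be sketched in a few lines. This example finds only the top principal component, and it does so by power iteration on the covariance matrix rather than by a full eigendecomposition; that substitution, the function names, and the toy data layout (one row per structure, one column per coordinate) are assumptions made to keep the sketch dependency-free.

```python
import math
from typing import List, Sequence


def covariance_matrix(data: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance of the columns of `data` (rows are observations)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    d = len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_principal_component(data: Sequence[Sequence[float]],
                            iterations: int = 500) -> List[float]:
    """Unit vector of greatest variance, found by power iteration."""
    cov = covariance_matrix(data)
    d = len(cov)
    # Uniform start vector; it must not be orthogonal to the top eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    return v
```

Every structure is then described by its scalar product with this single fixed direction, which is why data lying on a curved manifold can collapse toward the origin of a PCA projection.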

Our LLE input data set was dimensioned using the Cartesian coordinates of heavy atoms of TC5b (154 atoms × 3 (_{i}_{W}_{i}_{i}_{j}_{ij}x_{j}^{2}. The low dimensional manifold that preserved these locally linear neighbor relations was constructed by minimizing _{ψ}_{i}_{i}_{j}_{ij}ψ_{j}^{2}, where _{i}_{i}
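In the standard formulation of locally linear embedding, the two minimizations described above take the following form. The cost function names E_W and E_ψ, the neighbor set N(i), and the normalization constraints are taken from the general method and are assumptions here rather than details confirmed for this study.

```latex
\begin{align}
E_W &= \sum_i \Bigl| x_i - \sum_{j \in N(i)} W_{ij}\, x_j \Bigr|^{2},
  & \sum_{j} W_{ij} &= 1, \\
E_\psi &= \sum_i \Bigl| \psi_i - \sum_{j} W_{ij}\, \psi_j \Bigr|^{2},
  & \sum_i \psi_i &= 0, \quad \frac{1}{N} \sum_i \psi_i \psi_i^{\mathsf{T}} = I .
\end{align}
```

The weights W_{ij} are obtained first, with the input coordinates x_i held fixed, and are then held fixed in turn while the low dimensional coordinates ψ_i are found; in the general method the second problem reduces to finding the bottom non-constant eigenvectors of (I − W)^T (I − W).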

Our approach is related to other graph-based studies of molecular ensembles _{i}_{i}

We thank Leslie Greengard for helpful discussions, Alex Proekt for comments on the manuscript, and Michael Feig for help with the MMTSB Tool Set. This study utilized the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, MD.