Analysis of the Free-Energy Surface of Proteins from Reversible Folding Simulations

Computer generated trajectories can, in principle, reveal the folding pathways of a protein at atomic resolution and possibly suggest general and simple rules for predicting the folded structure of a given sequence. While such reversible folding trajectories can only be determined ab initio using all-atom transferable force-fields for a few small proteins, they can be determined for a large number of proteins using coarse-grained and structure-based force-fields, in which a known folded structure is by construction the absolute energy and free-energy minimum. Here we use a model of the fast folding helical λ-repressor protein to generate trajectories in which native and non-native states are in equilibrium and transitions are accurately sampled. Yet, representation of the free-energy surface, which underlies the thermodynamic and dynamic properties of the protein model, from such a trajectory remains a challenge. Projections over one or a small number of arbitrarily chosen progress variables often hide the most important features of such surfaces. The results unequivocally show that an unprojected representation of the free-energy surface provides important and unbiased information and allows a simple and meaningful description of many-dimensional, heterogeneous trajectories, providing new insight into the possible mechanisms of fast-folding proteins.

The equilibrium kinetic network (EKN) is an undirected capacitated network which describes the equilibrium kinetics of the system (1,2). The EKN is obtained by clustering the trajectory into nodes which contain sets of similar structures.
To reduce the dimensionality of the system before the clustering procedure is applied we use bond principal component analysis (BPCA). BPCA is performed with selected atom pairs only to reduce computational expense further as well as to increase the signal to noise ratio (see Reference (3) for a similar approach, where all atom pairs are used). To select atom pairs for BPCA, the potential of mean force associated with their distance is first determined for each pair in the system. Examples of two such profiles are shown in Figure 1. Atom pairs which give a profile with more than one free energy basin (the right-hand panel in Figure 1, for example) are included in the PCA, and all others are discarded (the left-hand panel in Figure 1). BPCA is performed by calculation of an N × N covariance matrix, where N is the number of selected atom pairs: where i and j are the labels of the pairs, r p i is the distance between the two atoms in pair p i at time t andr represents the corresponding mean value. The eigenvectors corresponding to the three largest eigenvalues of C ij (i.e., the first three principle components) are taken to be the basis vectors for the three-dimensional space in which clustering is performed. This "PCA space" is divided into cubes of equal size, each corresponding to a cluster. The size of the cubes are chosen such that there are 20000 clusters in total (this number is chosen to be as large as possible, but with the requirement that clusters are sufficiently populated to be statistically representative and not too large for numerical problems due to the diagonalisation of rate matrix). Each frame of the trajectory is assigned to a cluster according to its position in "PCA space", and the number of transitions (n ij ) between each pair of nodes (i and j) are counted. To impose detailed balance (which holds in the limit of infinitely large statistics) the capacities in the EKN are taken as c ij = (n ij + n ji )/2.

Building the FEP
The free energy profiles for the EKN are constructed using a procedure (4, 5) which employs folding probability (6) (pfold) as the reaction coordinate. Folding probability was originally proposed as a reaction coordinate to separate folded and unfolded states in protein folding, but can be used as a reaction coordinate to separate any two regions of the configuration space. To compute the folding probability for each node of the EKN one has to specify two nodes A and B; these are usually taken to be the most populated nodes of two basins of interest. When no information about the basins is available it is convenient to take the most populated node (which is likely to represent the native state) as node A. To avoid choosing a node in the same basin as node A, a node B is created outside the network (4). Node B is connected with every node (i) in the network with capacity c Bi = λN i , where N i = j n ji is the population of node i, and λ is taken to be 10 = 0, where p ji are the components of the transition probability matrix obtained from the EKN (i.e., p ji = c ji / k c ki ). The nodes are then sorted according to their values of p f old , which all lie in the range 0 to 1. Next, the FEP is built by considering a set of points, p c between 0 and 1, and using them to "cut" the network into a set with p f old < p c (set A) and a set with p f old > p c (set B). For each value of p c a point of the profile (Z A , −kT ln(Z AB )) is computed, where Z A is the population of set A and Z AB is the number of transitions between sets A and B. It is convenient to transform the Z A coordinate into a coordinate (labelled "natural coordinate") where the diffusion coefficient is constant; this is done by numerically integrating the equation dx = (2/ √ π)dZ A /Z AB (7).

Convergence of the free-energy representations
To confirm that sampling is sufficient and that the free-energy surface represented either as one-dimensional projection over the "natural coordinate" or as an equilibrium kinetic network is converged we separated the trajectory of the wild-type protein model into two halves, and performed the analysis on each half separately. The resulting FEPs are shown in Figure 2.
As mentioned in the Analysis section, if parallel pathways are present then only the height of the largest barrier is meaningful. Figure 2 clearly shows that there is excellent agreement between the two halves of the trajectory in the region of the highest barrier between the denatured and intermediate states. The differences between the two profiles at low values of natural coordinate (in the basins representing the native and intermediate states) result from the overlap between these basins on the natural coordinate. A better test of agreement is to compare the two SEKNs ( Figure 3). The agreement between the two is good: the same five states are identified, with the same major folding pathways. Along these pathways the same rates are measured, within statistical error. Larger differences in rates are observed for transitions characterised by small rates (e. g. i1 → n2) due to limited sampling of these routes. However, these rates have a negligible effect on the overall kinetics of the system.