Detection of Functional Modes in Protein Dynamics

Proteins frequently accomplish their biological function by collective atomic motions. Yet the identification of collective motions related to a specific protein function from, e.g., a molecular dynamics trajectory is often non-trivial. Here, we propose a novel technique termed “functional mode analysis” that aims to detect the collective motion that is directly related to a particular protein function. Based on an ensemble of structures, together with an arbitrary “functional quantity” that quantifies the functional state of the protein, the technique detects the collective motion that is maximally correlated to the functional quantity. The functional quantity could, e.g., correspond to a geometric, electrostatic, or chemical observable, or any other variable that is relevant to the function of the protein. In addition, the motion that displays the largest likelihood to induce a substantial change in the functional quantity is estimated from the given protein ensemble. Two different correlation measures are applied: first, the Pearson correlation coefficient that measures linear correlation only; and second, the mutual information that can assess any kind of interdependence. Detecting the maximally correlated motion allows one to derive a model for the functional state in terms of a single collective coordinate. The new approach is illustrated using a number of biomolecules, including a polyalanine-helix, T4 lysozyme, Trp-cage, and leucine-binding protein.


Introduction
Collective motions are essential for biological functions in proteins [1]. They are involved in numerous biological processes including enzyme catalysis, channel gating, allosteric interactions, signal transduction, and recognition dynamics. The observed motions are as diverse as hinge, shear, or rotational motions of entire subunits, opening motions of molecular lids, loop motions, partial unfolding, or subtle rearrangements of amino acid side chains [2]. Understanding the functional mechanisms of such proteins requires both to identify the protein's collective motions and to relate the observed motions to the protein's biological function.
Diverse experimental methods have been applied to elucidate collective motions including nuclear magnetic resonance (NMR) [3,4], X-ray crystallography [5], as well as single-molecule fluorescence [6] or electron-transfer measurements [7]. Complementary to experiments, molecular dynamics (MD) simulations are a widely used techniques to investigate collective motions in proteins [8]. A state-of-the-art approach to elucidate collective motions from the protein dynamics is principal component analysis (PCA) [9][10][11][12]. PCA is commonly used to extract the collective motions with the largest contribution to the variance of the atomic fluctuations. Alternatively, normal mode analysis (NMA) has been extensively used to identify low-frequency collective modes [13][14][15][16]. Such modes are expected to correspond to large atomic displacements and are therefore assumed to be important to protein function. In addition, elastic network models are an established approach to assess motions intrinsic to the protein structure [17].
Established methods such as PCA and NMA elucidate largescale and low-frequency modes, respectively, but do not necessarily yield collective motions directly related to protein function. Here, we propose a novel analysis technique termed 'functional mode analysis' (FMA) that aims to elucidate collective motions directly related to a specific protein function. As input, the technique requires a set of protein structures, together with a 'functional quantity' that can be expressed as a single number for each input structure. The structures typically derive from an MD simulation, but a (large) set of X-ray or NMR structures is equally well suited. The chosen 'functional quantity' can be quite general and could correspond to some geometric, electrostatic, or chemical observable, or any other variable that might be relevant to the function of the protein. Typical examples for the functional quantity could include the openness of a channel, active site geometry, or cleft solvent accessibility. Given that input, the technique seeks the collective protein motion that is maximally related to the functional quantity. In other words, the technique aims to explain variations in the functional quantity in terms of collective motions.
When relating a functional quantity (from now on termed f ) to collective motions, two quite different motions might be of interest. First, the motion that displays the largest correlation to f . This motion is unaffected by the energy landscape of the protein, and it will be referred to as 'maximally correlated motion' (MCM). It is particularly interesting for quantities f of which the dependence on the protein structure is complex and therefore unclear. An example for such a complex quantity would be the R-value in Xray refinement. Second, the physical motion that actually accomplishes substantial deviations in f , in accordance with the protein's energy landscape, is frequently of interest. Because many different motions might affect f , we use the input structure ensemble to estimate the most probable collective motion that accomplishes a substantial change in f . That motion will be referred to as ensemble-weighted MCM (ewMCM). Depending on the question addressed, the MCM or the ewMCM (or both) can provide insight into the relation between function and motion. Therefore, both motions are considered by the proposed framework.
This paper is organized as follows. First, we describe the analysis technique. Subsequently, four examples for FMA are presented, applying the approach to a polyalanine helix, T4 lysozyme, Trpcage, and leucine-binding protein. The examples illustrate the use of FMA in detecting functionally relevant collective motions.

Theory and Concepts
Let us consider the simulation trajectory x(t)[R 3N of the protein atoms or of a subset of the protein atoms such as the backbone or the heavy atoms. x(t) denotes the 3N cartesian coordinates of N atoms. The coordinates are known at N t times, i.e., t[ft 1 , . . . ,t Nt g. For each time t, an arbitrary scalar functional quantity f (t) is given which can be computed from the protein coordinates and/or velocities. Note that for the following presentation of FMA, the time t is only an index to label the input structures x(t) and should not imply that the structures must correspond to a time series. Instead, the structures x(t) may equally well derive from, e.g., Monte Carlo sampling or from a large ensemble of experimental structures.
Maximally correlated motion (MCM). We seek a normalized collective vector a[R 3N of protein atoms such that the motion along a is maximally correlated to the change in the functional quantity f (t). Therefore, the motion along a is referred to as 'maximally correlated motion' (MCM). The MCM as a function of time t is given by the projection where S Á Á ÁT denotes the average over all times t.
In the present study, two measures are applied to quantify the correlation between f and p a . First, Pearson's correlation coefficient defined by where cov(f ,p a ) denotes the covariance between f (t) and p a (t), and s f and s a denote the standard deviations of f (t) and p a (t), respectively. The Pearson coefficient measures only linear correlation. Second, the mutual information (MI) between f and p a given by [18] I(f ,p a )~ð Here, P(f ',p' a ) denotes the joint probability distribution of f and p a , and P 1 (f ') and P 2 (p' a ) denote the marginal probability distributions of f and p a , respectively. The MI measures any kind of interdependence between f and p a , including non-linear and higher order correlation. Note that if (and only if) f and p a are independent, P(f ',p' a )~P 1 (f ')P 2 (p' a ) holds, the logarithm in eq.
(3) vanishes, and the MI equals to zero. Hence, the MI can be interpreted as the probability weighted deviation from the case of f and p a being independent. Reduction of dimensionality. Before optimizing a (via maximization of R or of the MI), a reduction of the dimensionality of the optimization problem is frequently required. Even when restricting the analysis to a subset of the protein atoms (such as the backbone), the long autocorrelations in protein dynamics may otherwise lead to an overfitted collective vector a. A common procedure to reduce the dimensionality of protein dynamics is principal component analysis (PCA) [10]. PCA allows one to determine a small set of collective vectors with the largest contribution to the mean square fluctuations (MSF) of the atomic coordinates.
For convenience and to clarify the nomenclature we briefly sketch the PCA in the following. Given the 3N cartesian atomic coordinates x i (t) (i~1, . . . ,3N), the elements of the covariance matrix C of the atomic positions are given by Before computing C, translation and rotation of the entire biomolecule is removed by superimposing the simulation trajectory onto a reference structure. Diagonalization of C yields a set of 3N orthonormal eigenvectors e i with corresponding eigenvalues s 2 i . The eigenvectors are typically ordered according to descending eigenvalues and referred to as PCA vectors. The projection p i (t)~½x(t){SxT : e i is called i th principal component (PC) and quantifies the position of the protein along the i th PCA vector.
The MSF of the atoms can be decomposed into contributions from different principal components, In protein simulations the first 10-20 PCs (with the largest eigenvalues) often account for a large fraction (80-90%) of the atomic MSF, and higher PCs describe smaller motions such as angle vibrations [10]. Hence, if large protein motions are expected to dominate changes in the f (t), the first few PCA vectors are a reasonable basis set to construct a.
We stress that PCA vectors are only one possible basis for a. Other possibilities include normal modes or modes derived from

Author Summary
Proteins are flexible nanomachines that frequently accomplish their biological function by collective atomic motions. Such motions may be characterized by hinge, shear, or rotational motions of entire protein domains, loop movements, or subtle rearrangements of amino acid side chains. In many cases it is far from obvious how collective motions are related to a particular biological task. Therefore, we propose a novel technique termed ''functional mode analysis'' that, based on an ensemble of structures, aims to detect a collective motion that is directly related to a particular protein function. From the given set of protein structures, together with a ''functional quantity'', the technique seeks the collective motion that is maximally correlated to the functional quantity. The chosen functional quantity can be quite general; typical examples could include the openness of a channel, active site geometry, or cleft solvent accessibility. Because the proposed framework is highly general, we expect the approach to be useful to a wide range of applications. To illustrate the new technique, we apply functional mode analysis to molecular dynamics trajectories of a polyalanine-helix, bacteriophage T4 lysozyme, Trp-cage, and Leucine-binding protein.
full correlation analysis [16,19]. For some quantities f (t) angles in dihedral space or the cartesian coordinates may also provide a useful coordinate system.
Maximization of the Pearson coefficient R. Assuming that f is approximately a linear function of the PCs, the collective vector a can be derived by maximizing the Pearson coefficient R (eq. (2)). We construct a as a linear combination of the first d PCA vectors, a~X d i~1 a i e i . Here, coefficients a i denote the coordinates of a with respect to the basis set fe i g.
As shown in the supporting Text S1, a maximum in the absolute value of R can be found by numerically solving the coupled linear set of equations For the present study, a~(a 1 , . . . ,a d ) was normalized after computation via eqs. (5). Note that for the maximization of R, the normalization of a is not strictly necessary because R is invariant to the norm of a.
It is instructive to note that maximizing R provides a quantitative model for f as a function of the PCs p i (t). A model for f allows one to predict f from a given protein structure, and, in turn, propose new structures that generate a particular value of f (i.e., a specific functional state). Let denote the model for f (t). The d parameters b i are fitted to the data f (t) by minimizing the mean square deviation (MSD) between f (t) and its model m f (t), i.e., S½f {m f 2 T?min. In multivariate regression analysis this approach is referred to as 'ordinary least square estimation' [20]. The minimum of S½f {m f 2 T with respect to the parameters b i is found by solving the coupled set of linear equations as shown in Text S1. By comparison to eqs. (5), the b i are identical to a i with the exception that the a i may be scaled by an arbitrary factor without changing R. Hence, m f (t) can be rewritten in terms of p a via m f (t)~Sf Tzn (p a (t){Sp a T), where n is a constant. Maximization of the mutual information (MI). If f depends non-linearly on atomic positions, the Pearson coefficient might be an insufficient measure to detect correlation between f and p a . In such cases, we apply the MI as correlation measure because it can detect any kind of interdependence. Optimizing the MI is computationally more demanding as compared to the optimization of R. The methodological details for the MI optimization are described in the section 'Iterative optimization of the mutual information'.
Maximizing the MI yields the collective vector p a that, by construction, provides as much information on f as possible. Because the functional relation between p a and f can be arbitrary, optimizing the MI does not directly provide a quantitative model for f (as in the case of optimizing R, compare previous paragraph). A natural approach to nevertheless yield a model for f is to fit a general curve g(p a ) (such as a spline [21] or a polynomial) to the p a {f data points. Given a protein structure x, this model allows one to predict the quantity f from the structure via f &g((x{SxT) : a).
Contributions of principal components to f (t). PCA modes have frequently been shown to be important to protein function [8,10,22]. However, a functionally important motion may be spread over a number of PCA modes. To understand the protein's function, it is therefore instructive to quantify the influence of different PCA modes on the functional quantity f (t), in particular if the PCA modes are related to intuitive motions such as hinge-bending or torsional modes.
Let us first consider the case of a linear model for f (eq. (6)). Using the linear model for m f (t) as an approximation for f (t) (eq. (6)), the variance of f (t) can be approximated by The expression in square brackets is the contribution of the i th PC to the variance of f (t). When using the same set of simulation frames for the PCA and for constructing the model m f (t), all cov(p i ,p j ) vanish for i=j and eq. (8) simplifies to In case of a non-linear dependence between f and p a , var(f ) cannot be decomposed into contributions from different PCs. Instead, the variance of p a can be written as the right hand side of eq. (8), except for the b i being substituted by the a i . This way, fluctuations of the motion correlated to f (but not the variance in f itself) can be decomposed into contributions from different PCs.
Ensemble-weighted maximally correlated motion contributing to f . The MCM along the collective vector a displays the largest correlation to f (t) as measured from R or from the MI. However, due to the protein's energy landscape, the motion parallel to a may be severely restricted. This fact is schematically illustrated in Fig. 1. Let us assume that the functional quantity f increases to the right in Fig. 1. Then, irrespective of the energy landscape (thin lines in Fig. 1A-C), the MCM is parallel to the direction of increasing f . However, if the energy landscape restricts the motion parallel to the MCM (Fig. 1B), a displacement along the MCM will actually occur through a motion which is non-parallel to the MCM, but which is in accordance to the energy landscape. Such motions would have a substantial projection on the MCM, but are not identical to the MCM.
In addition to the MCM, we therefore seek the most probable collective motion that accomplishes a specific displacement along the MCM. We apply the input structure ensemble to estimate the most probable motion, and refer to that motion as 'ensembleweighted maximally correlated motion' (ewMCM). The ewMCM is shown as black arrows in Fig. 1 for three different hypothetical energy landscapes. Note that the ewMCM is pointing in the direction that accomplishes a displacement along the MCM with a motion of smallest energy increase (i.e., with the largest probability). If the energy landscape does not favor any direction (Fig. 1A), the ewMCM is parallel to the MCM. In contrast, if the energy landscape highly favors motions in one direction over motions in another direction (Fig. 1B), the ewMCM may strongly deviate from the MCM. An intermediate situation is shown in Fig. 1C. It should be emphasized that the ewMCM is the collective motion in the given input ensemble which accomplishes the displacement along the MCM. In case of limited sampling, a second input ensemble may accomplish a displacement along the MCM through a different ewMCM, rendering the ewMCM highly dependent on the input ensemble. This characteristic of the ewMCM motivates the term 'ensemble-weighted'.
Let us assume that the MCM a~X d i~1 a i e i has been optimized such that p a (t)~X d i~1 a i p i (t) is maximally correlated to f (t), as measured from R or from the MI (see above). Thus, the set of a i are fixed in the following, and p a quantifies the functional state of the protein. In the input ensemble, the collective variable p a varies between its minimum p min a~m in(p a (t),t[ft 1 , . . . ,t Nt g) and its maximum p max a~m ax(p a (t),t[ft 1 , . . . ,t Nt g). To define the ewMCM, we choose an arbitrary but fixed value for p a , denoted p Ã a [ ½p min a ,p max a , between its extremes p min a and p max a . As ewMCM we consider the most probable collective displacement v Ã (from the average structure SxT) that generates the functional state p Ã a , Here, P(v Ã ) denotes the probability of the collective atomic displacement v Ã . In the following we restrict the ewMCM v Ã to the subspace spanned by the first d PCA vectors fe i g. Then, the ewMCM can be expressed as v Ã~X d i~1 p Ã i e i and the condition (9) can be rewritten as where P(p Ã 1 , . . . ,p Ã d ) denotes the probability for a particular set of PCs p Ã 1 , . . . ,p Ã d . The p Ã i were estimated as follows. First, to simplify the nomenclature, let assume the mutual covariances between the PCs to equal zero. Then, P(p 1 , . . . ,p d ) can be approximated via where P i denotes the marginal probability distribution of the i th PC, and N p is a normalization constant. Here, the p i were assumed to be mutually independent and normally distributed. (If the PCs were constructed from a different set of frames than the frames used for the FMA, the covariances between different PCs may not vanish, rendering the assumption P(p 1 , . . . ,p d )& P d i~1 P i (p i ) a poor approximation. In that case we switch to new coordinates (q 1 , . . . ,q d )~K(p 1 , . . . ,p d ) with zero mutual covariances. Here, the transformation matrix K is computed from a PCA on the p 1 , . . . ,p d .) Using the approximation in eq. (11), the maximization a is straightforward using Lagrange multipliers. The calculation yields Note that the components a i of the MCM are weighted by the variance s 2 i of p i of the given ensemble, further justifying the term 'ensemble-weighted'. To visualize the ewMCM for the given ensemble, successively increasing values for p Ã a can be chosen between, e.g., p min a and p max a . For each value p Ã a , eq. (12) provides a set of PCs p Ã i and, hence, a structure x~SxTz The structures can be depicted in common molecular visualization software.

Cross-Validation
MD simulations of proteins can be subject to long autocorrelations. The maximization of R or MI can lead to overfitting if too many free parameters a i are used in the optimization. It is therefore essential to cross-validate the derived model for f (t) with an independent set of simulation frames. A convenient approach to cross-validate the optimization is to divide the simulation into frames for model building and for cross-validation. Accordingly, R or MI is optimized applying the model building set only, yielding a correlation R m between data and model. Subsequently, the derived model is validated by predicting f (t) from the derived model using the cross-validation set only, yielding a correlation R c . Note that in the present context the term 'predict' should not imply any prediction into the future. Instead, the exact (or true) f (t) as computed from all atomic coordinates is compared with f (t) as computed from the model, making only use of the functional collective coordinate p a (t). Hence, we apply the term 'predict' as common in, e.g., the pattern recognition literature [23]. Using this approach, overfitting is indicated by a substantially smaller R c as compared to R m .
How many basis vectors d should be used to construct a? The optimal number d will highly depend on the simulation system and the observable f (t). The reasonable choice for d can be identified by plotting R m and R c as a function of d. As long as no overfitting occurs, R c increases with d, indicating an improvement of the model. As soon as R c decreases with d or becomes substantially smaller than R m , the model is overfitted.

Iterative Optimization of the Mutual Information
The collective vector a~X d i~1 a i e i that yields the largest MI between f and p a must be optimized iteratively. The optimization procedure for the MI was implemented as follows. The initial guess a 0 for a is generated randomly, or corresponding to the optimized Pearson coefficient (eqs. (5)). Subsequently, I(f ,p a ) is optimized via a sequence of rotations a 'z1~R' a ' which effect only three coefficients a i' , a j' , and a k' . Hence, where For each optimization step, the i ' , j ' , and k ' are randomly chosen from the d dimensions.
' , approximately n p~1 50 points are uniformly distributed on a unit sphere, each point corresponding to a possible 3D rotation with rotation angles w i and h i (i~1, . . . ,n p ). The MI I i (w i ,h i ) is computed for each of the n p rotations. Subsequently, a set of spherical harmonics (up to order 5) is fitted to the n p discrete I i (w i ,h i ), yielding a continuous and smoothed estimate of the MI as a function of the rotation angles w and h. The fit was implemented as a least-square fit using singular value decomposition. Eventually, the function I(w,h) is optimized by Powell's method [24], yielding the best 3D rotation matrix R 3D ' of the ' th optimization step. The 3D rotations are repeated in different i ' {j ' {k ' subspaces until convergence.
The MI I(f ,p a ) is estimated from the discrete data sets by a binning procedure. Accordingly, the probability distributions P 1 (f ') and P 2 (p' a ) are approximated by counting the occupancies n f ,i and n p,i of f (t) and p a (t), respectively, in bins i~1, . . . ,N b . For the present study, the number of bins N b was found to have a minor effect on the results. A reasonable choice was N b~5 0. Likewise, P(f ',p' a ) are approximated by a two-dimensional (2D) binning, yielding the 2D occupancy n fp,ij . The MI is estimated via where Df and Dp a denote the bin widths of n f ,i and n p,i , respectively. Note that the technique proposed here does not require an estimate of the absolute MI, but only of the relative change in MI due to a rotation R 3D ' . Sophisticated and computationally demanding methods such as kernel density estimates are therefore unnecessary.

Simulation Setup
The Fs 21 helix (originally introduced by Lockhart et al. [25], Sequence Ace-A 5 [AAARA] 3 A-NH 2 ) was modeled with PyMol [26]. The structures of T4 lysozyme (T4L) and leucine-binding protein (LBP) were taken from the protein data bank (PDB codes 256L [27] and 1USG [28], respectively). Likewise, the first structure in the NMR ensemble derived by Neidigh et al. (PDB code 1L2Y [29]) was used as initial structure for the simulations of Trp-cage. The Fs 21 , T4L, Trp-cage, and LBP structures were placed into dodecahedral simulation boxes and solvated with 8828, 8479, 3042, and 17581 explicit water molecules, respectively. All simulation systems were neutralized by adding chloride ions. In addition, 150 mM sodium chloride was added to the Trpcage and LBP systems.
The Fs 21 helix and LBP were simulated with the AMBER03 [30] force field and the TIP3P water model was applied [31]. Ion parameters were taken from Smith et al. [32]. Trp-Cage and T4 lysozyme were simulated with the OPLS all-atom force field [33] and the TIP4P water model [34]. All simulations were carried out using the GROMACS simulation software [35,36]. Electrostatic interactions were calculated at every step with the particle-mesh Ewald method [37,38]. Short-range repulsive and attractive dispersion interactions were described by a Lennard-Jones potential, which was cut off at 1.0 nm (0.8 nm for the AMBER03 simulation). The SETTLE [39] algorithm was used to constrain bond lengths and angles of water molecules, and LINCS [40] was used to constrain the peptide bond lengths, allowing a time step of 2 fs.
The temperature in the Fs 21 and T4L simulations was kept constant by weakly (t~0:1 ps) coupling the system to a temperature bath [41] of 300 K. Likewise, the pressure was kept constant by weakly coupling the system to a pressure bath of 1bar with a coupling constant t of 1 ps. The LBP and Trp-cage systems were coupled to a Nosé-Hoover thermostat [42,43] (t~2 ps) at 300 K and 400 K (to trigger unfolding), respectively, and the pressure was kept at 1 bar using the Parrinello-Rahman pressure coupling scheme [44] (t~5 ps).
The Fs 21 helix was simulated for 250 ns and the structure was written to the hard disk every picosecond. During the simulation the helix partially unfolded and refolded for a number of times. Because we chose to consider collective motions of an intact helix only, the secondary structure was determined for every frame with DSSP [45] and only frames with a complete helix were used for further analysis. Approximately 53.000 such frames were found which were combined into one 'trajectory' of 53 ns with time step 1 ps. The lysozyme simulation system was simulated for 460 ns, and the LBP system for 100 ns. The Trp-cage protein was simulated 8 times for 40 ns with different initial velocities. The volume of the catalytic cleft of T4L was estimated as explained and illustrated in supporting Figure S1.

Results
In the following we apply FMA on four biological examples, and demonstrate how quite different functional quantities f can be related to collective protein motions. In the first three examples of increasing complexity (Fs 21 helix, T4 lysozyme, and Trp-cage) the Pearson coefficient turned out to be sufficient to detect correlations between the respective functional quantity and collective motions. With the final example (leucine-binding protein) we demonstrate how the MI can elucidate correlation in cases where the Pearson coefficient fails.
As a first and trivial example we analyze collective motions related to the end-to-end distance of the Fs 21 helix. The example (including figures) is presented in supporting Text S2 and illustrates the application of FMA in some detail. Because the PCA vectors correspond to the harmonic modes of a simple helical spring, the decomposition of the end-to-end distance into contributions from different PCA modes is particularly instructive in this example.
Here we demonstrate how FMA can be applied to determine the collective motions which are putatively involved in the enzymatic activity of T4L. Two functional quantities f (t) related to the enzymatic activity are considered for the analysis. (i) The volume of the catalytic cleft V cleft , highlighted as red surface in Fig. 2A, and (ii) the distance d ED between the carboxyl groups of the catalytic residues Asp20 and Glu11 (Fig. 2B). The volume V cleft is biologically significant because opening and closure of the cleft is expected to be involved in substrate binding and product release. The distance d ED is a direct measure of the geometry of the catalytic site. According to the textbook mechanism proposed by Phillips [55], Glu11 protonates the glycosidic oxygen while Asp20 stabilizes the produced oxocarbenium ion intermediate. Hence, the carboxyl groups of Glu11 and Asp20 must simultaneously arrange closely to the glycosidic bond. d ED is therefore an easily assessable observable that probes enzymatically active configurations. The distance between the carboxyl groups was measured as the distance between the C d atom of Glu11 and the C c atom of Asp20.
In the following, the results of the FMA of V cleft and d ED are presented in a relatively compact fashion. For a more detailed presentation of FMA we refer to the illustrative a-helix example in supporting Text S2. In a first step, the basis set fe i g was derived by a PCA on the backbone atoms, using the 460-ns T4L simulation. The first 20 PCA vectors were used as basis set for the FMA. The motions along the first three PCA vectors are shown in Fig. 2C. The first PCA vector corresponds to the well-studied hingebending mode of T4L [46,51], and the second to a twisting mode, mainly characterized by a rotation of the smaller (N-terminal) lysozyme domain. The third PCA vector corresponds to the torsion of the N-terminal domain with respect to the C-terminal domain. In Figures. 2D and 2F, V cleft and d ED are plotted as a function of simulation time, respectively. The first 180 ns (250 ns for d ED ) were applied as model building sets (red background), the remaining frames as cross-validation sets (green background). For both V cleft and d ED , the respective collective vector a was optimized by maximizing R, yielding linear models for V cleft and d ED . The models are validated in figs. 2E and 2G, showing scatter plots between simulation data and models using the respective cross-validation sets only. Strong cross-validated correlation (R c~0 :95 and 0.90, respectively) between model and data was found, confirming the validity of the models. (The corresponding scatter plots using the model building sets are presented in Figure  S3.) Note that side chain fluctuations of Glu11 and Asp20 cannot be modeled from backbone PCA modes. Yet the derived model for d ED favorably correlates with the data (R c~0 :90), indicating that side chain fluctuations have only a minor effect on d ED . Figure 2H shows the R c values for V cleft and d ED as black and blue curves, respectively, as a function of the number of PCA vectors d used in the FMA. Apparently, the first PC already provides a good model for V cleft (R c~0 :89). In contrast, at least 15 PCs are required to construct a good model (R c w0:85) for d ED .
The convergence of FMA of V cleft with the number of frames in the model building set is analyzed in Fig. 3. The figure plots R c and R m between the V cleft data and V cleft model as a function of the simulation time in the model building set. All remaining frames of the 460-ns trajectory were applied for cross validation. Remarkably, using only 10 ns for model building yields a reasonable model (R c w0:85) for the remaining frames. In contrast, using less than 0.5 ns for model building yields a highly overfitted model, as visible from the large R m as compared to R c . The related analysis for the FMA of d ED is presented in Fig. S2B. Figure 4 presents the collective vectors related to V cleft and d ED , as well as the contributions of different PCs to V cleft and d ED . The results for V cleft are presented as black bars and curves, the results for d ED in blue. The coefficients a i of a (or b i of the linear model, eq. (6)), are shown in Fig. 4A. Note that b i quantifies the effect of the i th PC on V cleft (or d ED ) per nanometer displacement in PCA space. Because the variance of the PCs rapidly decay with increasing PC index i (Fig. 4B), only the first PCs substantially contribute to the variances of V cleft and d ED (Fig. 4C). Remarkably, the first PC almost completely accounts for var(V cleft ), whereas the second PC accounts for var(d ED ). Figure 4D presents the cumulative contributions of the PCs to var(V cleft ) and var(d ED ) as derived from the respective models as solid curves, and the variances of var(V cleft ) and var(d ED ) as dashed curves. The plot confirms that the models indeed account for a large fraction the variances of V cleft and d ED , respectively.
Which are the ewMCMs contributing to V cleft and d ED ? Applying eq. (12) shows that the ewMCMs contributing to V cleft and d ED are highly related to PCA vectors 1 and 2, respectively, a finding in agreement to Fig. 4C. For the illustration of these motions we therefore refer to the PCA vectors depicted in Fig. 2C. In addition, the MCM and the ewMCM for both V cleft and d ED are shown in supporting videos S1 and S2.
Taken together, the FMA provides a comprehensive picture of the collective motions involved in the catalytic activity of T4L. The hinge-bending mode (Fig. 2C) dominates closing and opening of the catalytic cleft, presumably facilitating substrate binding and release. Surprisingly, this mode leaves the active site geometry virtually unaffected. In contrast, the twisting mode dominates the distance d ED between the carboxyl groups of Glu11 and Asp20. Hence, a major collective rotation of the N-terminal domain with respect to the C-terminal domain may be required to position Glu11 and Asp20 into an enzymatically active configuration.

Initial Unfolding of Trp-Cage
Trp-cage is a 20-residue miniprotein designed by Neidigh et al. [29]. With a folding time of 4 ms [56] Trp-cage is the fastest folding  protein currently known. Trp-cage is characterized by a central tryptophan side chain (Trp6) which is surrounded by an a-helix (residues 2-8), a 3 10 -helix (residues [11][12][13][14], and a C-terminal polyproline helix (Fig. 5A). Here we use Trp-cage as a model system to demonstrate how FMA can be applied to study the structural determinants underlying the initial unfolding process of a protein. To this end, the hydrophobic solvent-accessible surface area (HSAS) is used to quantify the state of unfolding.
Compared to the distances and volumes considered so far as functional quantities f (t), explaining the HSAS by single collective mode is challenging. The HSAS is subject to strong noise and is a non-linear function of the atom coordinates. Only the linear parts of the dependence of the HSAS on the PCs is expected to be successfully captured by the linear model of eq. (6). The non-linear dependence on the PCs (that would have to be described as cross terms of the PCs) will appear as a noisy deviation from the model.
We use FMA to determine the collective motions related to the change in the HSAS, and hence, to the initial unfolding of Trpcage. Three questions are addressed: (i) To which extent can a model based on a single collective motion explain a highly nonlinear quantity such as the HSAS? (ii) Which collective motions increase the HSAS and, hence, represent the initial unfolding of Trp-cage? And (iii), can a model derived only from fluctuations in the folded state predict the HSAS during an unfolding event? To observe multiple unfolding events, eight 40-ns simulations were performed at a temperature of 400 K. The HSAS during the eight simulations is shown in Fig. 5B. In six of the eight simulations, the Trp-cage unfolded after simulation times between 10 and 27.5 ns (blue background in Fig. 5B). In the other two simulations the Trpcage remained folded for the complete simulation time. From the eight simulations, all frames of the folded Trp-cage (yellow background in Fig. 5B) were combined into one 'folded trajectory' of 183.3 ns (916503 frames).
The HSAS of the 'folded trajectory' is plotted in Fig. 5C. The first 100 ns were applied as model building set in the FMA (red background), the remaining 83.3 ns as cross-validation set (green background). The basis set for the FMA was taken from a PCA of all heavy atoms (after a least-square fit of the backbone atoms onto the 1L2Y structure). The first 40 PCA vectors were used as basis set. The PCA vectors are not visualized because they do not correspond to easily interpretable motions. From the model building set, a linear model for the HSAS was derived using eqs. (7), and the model was validated using the cross-validation set (Fig. 5D). The correlation between data and model is substantially weaker (R c~0 :62, R m~0 :70, see Fig. S3D) than in the previous examples. As expected, the HSAS in the folded state is only partially captured by the linear model. To reach a similar model quality as in the previous examples, additional non-linear cross terms of the PCs would have to be included into the model. Without such additional terms, the deviation from the linear model appears as noise. Analysis of the difference between data and model shows that the noise is normally distributed around zero with a standard deviation of 0.36 nm 2 (not shown).
To avoid overfitting, R m and R c are plotted as a function of the number of PCA vectors d used as basis set (Fig. 5E). Both R m and R c increase up to d~40, corresponding to an improvement of the model. For dw40, only R m increases, but R c decreases with d. Hence, using more than 40 PCA vectors as basis set would yield an overfitted model.
The ewMCM related to the increase in HSAS is visualized in Fig. 6A. The motion is mainly characterized by a lift-off motion of the polyproline helix with respect to the a-helix. The ewMCM and the MCM along a are also shown in video S3. The components a i of a (or b i of the model) are shown in Fig. 6B, and the variances of the PCs in Fig. 6C. As visible from Fig. 6D, many PCs contribute to the variance of the HSAS var(HSAS). Remarkably, PCs with larger index i substantially contribute the the HSAS, although they hardly contribute to the MSF of the atom positions (compare Fig. 6C). The cumulative contribution of the PCs to var(HSAS) (Fig. 6E) indicates that approximately 50% of the variance (corresponding to 70% of the standard deviation) of the HSAS in the folded state are explained by the linear model.
Can the model for the HSAS derived from fluctuations in the folded state predict the HSAS during unfolding? To assess this particularly rigorous test for the validity of the model, the HSAS during six independent unfolding events was monitored (blue background in Fig. 5B). Figure 7A displays two examples for the HSAS during unfolding events as black curves, and the HSAS predicted by the model as gray curves. The corresponding plots for all six unfolding events are shown in Fig. S4. Good agreement is found with correlation coefficients R between data and model in the range of 0.72 to 0.88 (insets in Fig. 7 and S4). The respective scatter plot of HSAS data versus model as combined from all six unfolding events is shown in Fig. 7B. Reasonable agreement (R~0:81) between data and model is found. Note that such unfolding events are not present the model building set. Hence, the model derived only from the folded state displays predictive power during a process (initial unfolding) which did not occur in the model building set. Noteworthy, favorable correlation between data and model during unfolding can only be achieved if the collective unfolding modes are at least partially sampled in the folded state fluctuation. If unfolding modes do not sufficiently fluctuate in the folded state, no correlation between such modes and the HSAS can be detected. Figures 7 and S4 show, however, that folded state fluctuations in this case are sufficient to construct the HSAS model (and hence, the functional mode) that holds during initial unfolding.

A Non-Linear Example: RMSD of Leucine-Binding Protein
As an example for a non-linear correlation between a functional quantity f (t) and collective motions we consider the root mean square deviation (RMSD) of the backbone atoms of l-leucinebinding protein (LBP). LBP is a two-domain transport protein (Fig. 8A) that is subject to a large hinge motion (0.7 nm RMSD) upon ligand binding [28,57]. The RMSD was computed with respect to the (apo) crystal structure. The RMSD increases as the protein deviates from the reference, irrespective of the direction of the collective motion. Hence, it can not be explained in terms of a linear function of a collective coordinate. Because the RMSD is frequently assessed in MD studies, we use it as a model quantity to demonstrate the use of mutual information (MI) in FMA.
The RMSD is plotted in Fig. 8B. The collective vector a was optimized such that p a (t) displays the maximal MI to the RMSD. To this end, the first 80 ns of the simulation were applied as model building set, omitting the first nanosecond for equilibration (red background in Fig. 8B), and the remaining frames were applied as cross-validation set (green background). The first 10 PCA vectors from a PCA of the backbone atoms were used as basis set for a (not shown). Figure 8C presents the RMSD versus the optimized collective coordinate p a for the model building set (red dots) and the cross-validation set (green dots). As expected, the RMSD and p a are substantially correlated, and the correlation is highly nonlinear as visible from the non-linear RMSD-p a point cloud. The non-linear model for the RMSD was constructed by fitting a cubic B-spline (black curve in Fig. 8C) to the RMSD-p a points of the model building set. Using this model, the RMSD of the crossvalidation set was predicted and compared to the RMSD from the simulation (red dots in Fig. 8D). Excellent agreement (R c~0 :97) between data and model is found.
For comparison, we tried to derive a linear model for RMSD using the Pearson coefficient R as correlation measure. However, this linear model has little predictive power as visible from modelversus-data scatter plot of the cross-validation set (black dots in Fig. 8D, R c~0 :42). The failure of R to detect the correlation between RMSD and a single collective motion is also apparent from Fig. 8E which plots R m and R c as a function of the number of PCA vectors d used as basis set. Irrespective of d, R c (black curve) is substantially smaller than R m (blue curve), indicating overfitting. In this example, using more than 10 PCA vectors as basis set would increase R c , but R c remains much smaller than R m . Note that the model derived using the MI as correlation measure displays excellent correlation between data and model for both model building and cross-validation sets (green and red curves in Fig. 8E, respectively). Figure 9F presents the analysis of the convergence of FMA with the number of frames in the model building set by plotting R c and R m as a function of simulation time in the model building set. All remaining frames of the 100-ns trajectory were applied for cross validation, and the first 10 PCA vectors were used as basis set. When optimizing the MI (red and green curves), a model building set of 11 ns is sufficient to derive a good model (R c §0:9) for the remaining simulation. In contrast, using less than 5 ns for model building may yield a highly overfitted model, as visible from a  small (or even negative) R c (red curve). When optimizing the Pearson coefficient (black and blue curves), the model quality as measured from R c is substantially poorer as compared to the model optimized via MI. In addition, R c is highly depdendent on the length of the model building set, further emphasizing that the the Pearson coefficient is an unsuitable measure to assess correlation between the RMSD and a single collective mode.
The ewMCM that effects the optimized collective coordinate p a (and hence, the RMSD) is visualized in Fig. 9A. The motion is characterized by a large hinge of the two domains with respect to each other. The collective motion is decomposed into the PCs in figs. 9B-D. The coordinates a i of a are sown in Fig. 9B and the variances of the PCs in Fig. 9C. Figure 9D displays the contributions of the PCs to the variance of p a , indicating that the first PC almost completely accounts for the variance of p a . This finding is expected because the first PC is constructed such that it accounts for the largest possible fraction of the RMSD. Hence, Fig. 9D may be considered as a further validation of the technique.

Discussion
We have presented a novel analysis technique termed functional mode analysis (FMA) to systematically identify functionally relevant collective motions in proteins dynamics. Given an arbitrary quantity f (t) considered relevant to the function of the protein of interest, the approach extracts the linear collective motion which is maximally correlated to f (t). We have used two different measures to quantify the correlation between f (t) and the collective motion. (i), the Pearson coefficient which measures linear correlation only, and (ii), the mutual information (MI) which can assess any kind of correlation including non-linear and higher order correlation. The 'maximally correlated motion' (MCM) must be distinguished from the 'ensemble-weighted MCM' (ewMCM) that -based on the sampling in the input ensemblehas the largest likelihood to contribute to large fluctuations in f (t). Numerous proteins accomplish their biological function by structural transitions such as hinge motions, domain rotations, or side chain reorientations [2]. For many proteins it is far from obvious how functionally relevant quantities are related to such collective atomic motions. In such cases, the proposed technique is expected to elucidate relations between collective motions (i.e. protein dynamics) and protein function. Moreover, the MCM or the ewMCM can be enhanced or steered during a follow-up simulation, allowing one to trigger functional transitions or to enhance the sampling of rare functional events [58,59]. Alternatively, the MCM can be biased to compute free energy differences between different functional states (using, e.g., umbrella sampling).
The success of the technique rests on two prerequisites: (i) The collective motion a must be representable by a linear combination of the chosen basis set fe i g. At the same time, the basis set should not be too large to avoid overfitting. (ii) Sufficient (linear or nonlinear) correlation between f (t) and the single collective motion must be detectable. Future efforts will focus on ameliorating the first condition.
Optimization of the collective vector a via the Pearson coefficient is closely related to the construction of a model for f (t) as a linear function of a set of given collective coordinates. When using MI as correlation measure, a non-linear model for f (t) can be constructed from a general set of functions (such as splines). The given collective coordinates used as basis set may correspond to motions along PCA vectors, normal modes, or to motions in any other useful coordinate system such as rotations of dihedral angles. If these given coordinates have an intuitive meaning (such as the hinge-bending mode in T4L), the derived model can quantify the contributions of intuitive collective motions to the variance of f (t), and hence, provide a functional interpretation of the collective dynamics.
The source code of an FMA implementation is available on the authors' web site http://www.mpibpc.mpg.de/groups/de_groot/ software.html.

Supporting Information
Text S1 Text S1 Found at: doi:10.1371/journal.pcbi.1000480.s001 (0.06 MB PDF) Text S2 FMA of the end-to-end distance of an a-helix  binding site was set up by placing the test atoms on a grid of spacing 1 Å (red block in fig. S1A). The block was placed into the catalytic cleft of a reference structure of T4 lysozyme (T4L). Each T4L structure from the simulation trajectory was fitted onto the reference structure using a least square fit on the backbone atoms. Figure S1A shows one example of a fitted structure, together with the block of test atoms. Subsequently, the test atoms which overlapped with the fitted structure were removed with the genbox tool ( fig. S1B). Every remaining atom contributed 1 Å 3 to the cleft volume.  Fig. 5B), an increasing fraction (e.g. 20%) was used as model building set, whereas the remaining fraction (e.g. 80%) of the folded states were applied as cross validation set. As visible from the black R c curve, applying more than 30% of the simulation hardly improves the prediction for the remaining frames. The respective plots for the lysozyme cleft volume and the RMSD of leucine-binding protein are shown in Figs Figure S4 Predictive power of the model for the hydrophobic solvent-accessible surface (HSAS) during six initial unfolding events. HSAS (black curves) during unfolding events and the prediction for the HSAS by the model (red curves). Note that the model was derived only from fluctuations in the folded state. The correlation R between model and data (printed as insets) lies in the range of 0.72 to 0.88. To facilitate the comparison between data and model in these plots, all HSAS curves were slightly smoothed by running averages. The R-values were computed from the nonsmoothed data (not shown).