On identifying collective displacements in apo-proteins that reveal eventual binding pathways

Binding of small molecules to proteins often involves large conformational changes in the latter, which open up pathways to the binding site. Observing and pinpointing these rare events in large scale, all-atom, computations of specific protein-ligand complexes, is expensive and to a great extent serendipitous. Further, relevant collective variables which characterise specific binding or un-binding scenarios are still difficult to identify despite the large body of work on the subject. Here, we show that possible primary and secondary binding pathways can be discovered from short simulations of the apo-protein without waiting for an actual binding event to occur. We use a projection formalism, introduced earlier to study deformation in solids, to analyse local atomic displacements into two mutually orthogonal subspaces—those which are “affine” i.e. expressible as a homogeneous deformation of the native structure, and those which are not. The susceptibility to non-affine displacements among the various residues in the apo- protein is then shown to correlate with typical binding pathways and sites crucial for allosteric modifications. We validate our observation with all-atom computations of three proteins, T4-Lysozyme, Src kinase and Cytochrome P450.


NAP: the projection formalism and some definitions
The concept of non-affine displacements have been thoroughly developed for periodic crystals in the work of Ganguly et al. (see Refs. 25,26 and 27 of the main text). This is demonstrated in Fig.A for a two dimensional crystal. While for an affine displacement there is a single, homogeneous linear transformation, which takes us to the final displaced coordinates from the initial reference, coordinates, non-affine displacements can never be realised by such a linear transformation of initial coordinates. In the Fig.A, we show an affine shear transformation, while the non-affine one is a resultant of a shear plus an arbitrary displacement of particle 1. Such a resultant transformation can never be realised by a single linear transformation of the initial coordinates. We define non-affinity as the least squared error incurred by 1 representing such an arbitrary transformation by a "best fit" linear affine transformation over the local neighbourhood of any particular atom. An identical definition of course exists in three dimensions. Fig A. The affine transformation is actually a shear transformation, while the non-affine one is a resultant of shear plus displacement of particle 1. Such a resultant transformation can never be realized by a single linear transformation of the initial coordinates.
In order to adapt this formalism for proteins, two main modifications are necessary.
Firstly, in a crystal, the reference configuration is the ideal lattice position of the atoms. In a protein, this is not so straightforward. We have taken as reference the native or average structure of the protein to calculate displacements. The reference coordinates of each atom i is thus R i while the instantaneous position is r i . The displacement, u i = r i − R i . Secondly, in an ideal crystal with a single atom basis, the neighborhood of every atom is identical. For crystals with a basis, the number of unique neighborhoods is equal to the number of atoms in the basis. In a protein, on the other hand, the structural environment of every atom is unique. We have therefore fixed a "coarse graining radius" R Ω which is used to define the neighborhood Ω i around every atom. Note that all neighboring atoms lying within a sphere of radius R Ω around the central atom i belongs to Ω i . The number of atoms n Ω within Ω i neighborhoods of each atom in a protein therefore varies from atom to atom.
The quantity R Ω is a parameter of our method, which, in principle needs to be chosen with care. We discuss how R Ω influences our results later.
Once R Ω and the reference coordinates for every atom is fixed, the calculation proceeds as follows. The non-affine parameter, or local NAP value is given by, where the 3 × 3 dimensional deformation matrix D = I + ε, I the identity matrix and ε the sum of local strain and rotation. Writing the above equation in component form gives us, The minimization over the components of ε for any particle i can be carried out in a straight forward manner leading to the following result.
The NAP may now be obtained by substituting ε * in Eq.2. This completes the calculation of NAP for any atom i. The NAP susceptibility is obtained by calculating the statistical fluctuation of this value. To obtain greater insight into the meaning of NAP, and the nonaffine eigenmodes etc. one needs to reformulate the least squared minimization problem outlined above within a projection operator formalism. We discuss this procedure below.
We begin by defining the relative displacements ∆ j = u j − u i , where we have suppressed the index i for simplicity. Using a term by term expansion one can show that NAP = min Carrying out the minimization with respect to the column vectorε we obtainε * = Q∆, where Q = (R T R) −1 R T . Substituting in the expression for NAP, we finally obtain the microscopic value of NAP at the atom i as, NAP = (P∆) T (P∆) = ∆ T P∆ where P = I − RQ.
Note that P T = P i.e. P is symmetric. Similarly P 2 = P i.e. P is also idempotent. The matrix P is therefore a projection operator which projects out the non-affine component of the relative displacements ∆. The ensemble average of NAP is therefore given by NAP = (P∆) T (P∆) = Tr P ∆ T ∆ P = Tr PCP, where we have used C to denote the average local displacement correlator and Tr denotes the trace operation. The displacement correlator for the entire protein in Fourier space is of course the dynamical matrix. NAP therefore measures the local displacement correlations projected onto the non-affine sub-space.
The eigenmodes of the 3n Ω × 3n Ω dimensional matrix C exhaust all possible fluctuations of the n Ω atoms within Ω i . Out of these only 9 modes are affine. All the rest of the modes are non-affine. The matrix PCP therefore has 9 zero eigenvalues which span the null space of this matrix. The eigenmodes of PCP corresponding to non-trivial eigenvalues constitute the important non-affine fluctuations of the atoms within Ω i . The softest eigenmode corresponding to the largest eigenvalue of PCP contributes the most to NAP and constitutes the most important non-affine mode. The NAP value is basically the absolute value of all the non-affine modes of displacement. These modes of displacements basically result in reconfiguration of the native or the reference structure or conformational transformations.
The "gap", Γ i is the difference between the first (largest) and second (second-largest) eigenvalues. Fig 1b basically demonstrates that the distribution of eigenvalues is non-uniform and the largest eigenvalue is separated from the rest by a large gap Γ i and also this gap is relatively more prominent for the active regions than for the inert regions of the protein. In The NAP susceptibility is a response function for NAP under the effect of local field to which the protein system is exposed, just like C v or isochoric heat capacity is a response function for internal energy of a system exposed to temperature gradient and compressibility is a response function for volume of a closed under the effect of pressure gradient. Just like fluid with high compressibility (like gases) can be compressed easily compared to those with low compressibility (liquids), similarly those regions of the proteins which have high value for NAP susceptibility undergo huge non-affine transformations in their conformations compared those with low NAP susceptibility.
Similarly, NAP spatial correlation function measures the correlation between non-affine displacement fluctuations at two different locations.

Notes on sensitivity of simulation-length on NAP Analysis
We had previously performed an analysis of the effect of simulation length over the NAP analysis. The NAP susceptibility values for all the particles in 200ns long trajectory was found have trends similar to that in case of 400ns trajectory. This is also seen in the Fig.B.
The time-step of the MD frames chosen for both the case was 10ps. So increasing the simulation length does not much change the results. But increasing this time gap with a proprotionate increase of simulation length would help us in getting a better averaging. This is because the MD frames that we would be taking into account would be comparatively less correlated to each other when we increase the time gap between them. In other words, the practice of increasing the time-gap between the snapshots should be adopted to ensure that the snapshots are less correlated to each other and better averaging is obtained.  : 133 ,76 ,145 ,11 ,21 ,106 ,20 ,108 ,29 ,122 ,9 ,104. The relative values of the spatial NAP-correlation for the 5 locations shown in Fig.4 The relative values of the spatial NAP-correlation for the 5 points shown in Fig.4

Description of Videos
We have included four videos which show prominent non-affine displacements. In each of these videos we have applied the eigenvectors corresponding to the largest eigenvalue of the local PCP matrix to cause displacements of the atoms corresponding to the native structure of the protein. Note that these are not MD simulations. NAME Phenomenon S1 opening-up of helices 4 and 6

S2
opening-up of helices 7 and 9 S3 close to open configuration S4 allosteric contact: activation loop and Helix-αc 1. The video S1 demonstrates the opening of the helices 4 and 6 in order to enable the ligand to enter the binding site using the pathway depicted in Fig. 2 a of the manuscript. Note the twisting of the helices in this video and in V2 below.
2. The video S2 demonstrates the opening of the helices 7 and 9 in order to enable the ligand to enter the binding site using the pathway depicted in Fig. 2 b of the manuscript.
3. The video S3 demonstrates the non-affine modes for the residues having high spatial NAP correlation shown in Fig.4(a), thereby demonstrating the allosteric contact between regions 3 and 5 of T4Lys. The color scheme used is same as Fig.4(a). The loop in vicinity of region 3 of which ARG137 is a part is shown in black . and the loop of which TYR18 is a part, being a neighbourhood of region 5, is shown in yellow. The Alfa-carbon atoms of the two residues (TYR18 , ARG137 of regions 5 and 3 respectively) having high spatial NAP correlation are shown in pink solid bubbles. Here we see the residue 22(GLU) yellow, initially forming a hydrogen bond with residue ARG(137) black moving apart from residue 137 black. Thus, the non-affine modes at residue 18 and 137 act as nucleating precursors which tend to take T4Lys molecule from the closed conformation to open conformation.
4. The video S4 demonstrates the non-affine modes for the residues having high correlation shown in Fig.4(b), thereby demonstrating the allosteric contact between regions 1 and 5 of SRC-Kinase. The color scheme used is same as Fig.4(b). The activation loop alongwith residue Tyr416 being a part of region 5 is shown in yellow. and the Alfa-c helix alongwith residue 328 being a neighbourhood of region 1 is shown in blue. The Alfa-carbon atoms of the two residues (Tyr416 , Val328 of regions 5 and 1 respectively) having high spatial NAP correlation are shown in pink solid bubbles. Here we see the residue 416 protruding outside from the activation loop region thereby becoming more exposed and facilitating the phosphorylation. At the same time, the Alfa-c helix is seen to be undergoing an inward rotation. Non-affine modes analysis show that the incipients for the outwards protruding of Tyr416 and incipients for the inwards rotation of Alfa-c helix have an allosteric contact. Apart from this, there is a slight distortion of the salt-bridge between 295LYS and 310GLU. The two residues 295LYS and 310GLU are shown in silver small spheres and stick form.

Notes on the sensitivity of choice of neighborhoodvolume
In d-dimensional space, for a Coarse-graining volume (Ω i ) containing n Ω number of particles , there are d 2 possible affine modes and the remaining 3N −d 2 modes are non-affine. Therefore in 3 dimensions, out of the 3n Ω modes, 9 are naturally affine and the remaining 3n Ω − 9 are non-affine. So, if there are only 3 particles inside Ω i , the number of non-affine modes would be zero and hence the NAP value, which is the resultant magnitude of all the 3n Ω − 9 non-affine modes after minimizing it, for the particle i would always be zero and that would never allow us to capture any non-affinity in the neighborhood of the particle i. So, the bare minimum number of particles for NAP calculation has to be 4. However dense regions of the protein involve many neighboring particles which can potentially influence the motion of the region. In the above figure the eigen spectrum for three different radii have been plotted : 0.3nm , 0.4nm , 0.5nm for the residue GLN 105. This residue has the highest NAP susceptibility in the helix 5. The helix 5 acts as hinge region for the major non-affine transformations required during the binding through the helix-4-and-6 gateway.
averaging process. We wanted to take into account the statistics of all particles in calculating the NAP, rather than selectively choosing backbone or side-chain. Also, such an averaging practice actually helps us in bypassing the non-uniform behavior of side chain and main-chain atoms in displaying the non-affine signals or displacements. The same averaging applies for the study of allostery and susceptibility calculations.

Relation with PCA
In our work non-affine modes of fluctuations are projections from the total displacements ∆ by allowing the operator or matrix P to act on ∆ i.e. we calculate P∆. We therefore perform a normal mode analysis over this projected subspace containing solely the non-affine modes The ghost cyan image is the final orientation of the protein after the application of the most prominent (highest eigenvalue) eigenmode obtained from the Principal Component Analysis. For comparison with our NAP susceptibility calculations, the regions of large susceptibility are highlighted with the same color scheme as in Fig.5 of the manuscript. Note that now no helix twisting motion can be seen in region 1 and 2. The PCA displacements are seen to be mostly affine.