Normal mode-guided transition pathway generation in proteins

The biological function of proteins is closely related to its structural motion. For instance, structurally misfolded proteins do not function properly. Although we are able to experimentally obtain structural information on proteins, it is still challenging to capture their dynamics, such as transition processes. Therefore, we need a simulation method to predict the transition pathways of a protein in order to understand and study large functional deformations. Here, we present a new simulation method called normal mode-guided elastic network interpolation (NGENI) that performs normal modes analysis iteratively to predict transition pathways of proteins. To be more specific, NGENI obtains displacement vectors that determine intermediate structures by interpolating the distance between two end-point conformations, similar to a morphing method called elastic network interpolation. However, the displacement vector is regarded as a linear combination of the normal mode vectors of each intermediate structure, in order to enhance the physical sense of the proposed pathways. As a result, we can generate more reasonable transition pathways geometrically and thermodynamically. By using not only all normal modes, but also in part using only the lowest normal modes, NGENI can still generate reasonable pathways for large deformations in proteins. This study shows that global protein transitions are dominated by collective motion, which means that a few lowest normal modes play an important role in this process. NGENI has considerable merit in terms of computational cost because it is possible to generate transition pathways by partial degrees of freedom, while conventional methods are not capable of this.


Introduction
Proteins are essential components of living cells. Each protein has its own biological function, which is accompanied by conformational change of the protein. Therefore, studying this conformational change is necessary to understand the underlying mechanism of its biological functions. Various experimental methods have been developed as part of this effort. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Specifically, NMR [1,2], Raman spectroscopy [3,4], electron cryo-microscopy [5,6], atomic force microscopy [7,8], and terahertz time-domain spectroscopy [9,10], as well as timeresolved methods such as x-ray scattering [11,12], the transient grating method [13][14][15] and light scattering [16] have expanded our understanding of the relationship between structure and function in biomolecules. Although the local vibrational movement of a protein in metastable states can be relatively easily captured by those experimental methods, it is still challenging to observe its global dynamics experimentally. This is due to the existence of a high-energy barrier between two conformational states, and also because capturing a very dynamic transition state is difficult with current experimental techniques. Thus, various computational attempts have sought to find intermediate structures from the known stable structures.
Molecular dynamics (MD) simulation [17,18] is a representative computational method that can be used to analyze the dynamics of proteins in atomic detail. However, the conventional MD simulation is inappropriate for predicting large-scale transitions due to its computational cost, despite recent efforts to overcome time limitations [19][20][21]. Instead, the prediction of transition pathways based on a simplified potential function, called the elastic network model (ENM), has flourished in recent years. Unlike MD simulation, which integrates an empirical potential function, ENM exploits a Hookean potential function and can significantly reduce the computational cost. The ENI first attempted to find intermediate structures by interpolating the distance between two end-point conformations of a target protein [22][23][24]. This transition pathway generation technique was already implemented on an online morph server called KOSMOS [25]. The mixed elastic network model (MENM) was also developed to study large-scale conformational transitions [26]. In the MENM method, the Boltzmann-weighted Go potentials for the end-point structures are combined into a smooth potential function, which interpolates conformation. The interpolated ENM (iENM) was proposed by constructing a double-well potential function from the ENMs of two end-point structures [27]. To improve conventional coarse-grained ENM-based methods without the loss of physical reality, the hybrid ENI considers the rigidity information of conformation when generating transition pathways [28]. In addition, the ANMPathway used two end-point ENMs and constructed two-state potential resulting in a cusp hypersurface [29]. The minimum energy conformation on this cusp hypersurface is regarded as the transition conformation.
Meanwhile, normal mode analysis (NMA) based on coarse-grained modeling has addressed predicting collective motions at low frequency and succeeded in explaining biological functions in terms of the collective motion [30][31][32][33][34]. Since the transition process mostly shows the large collective motion, there have been efforts to employ low-frequency mode shapes predicted by NMA to the transition pathway. Firstly, collective MD iteratively obtains the transition pathways by combining NMA based on ENM, which guides the collective dynamics with MD evaluating local dynamics and atomic interactions [35]. The optimized torsion-angle normal modes in internal coordinates generated more accurate transition pathways than the Cartesian modes [36]. Next, in the anisotropic network model-Monte Carlo (ANM-MC) method, normal modes are also used for describing intermediate structures on transition. Then, MC algorithm is applied to minimize their conformational energy in order to predict the transition pathway successfully [37]. Such NMA-based transition pathway prediction tools were also addressed through online web servers. NMSim reproduces the feasible transition pathways using rigid-cluster NMA in Cartesian coordinates [38], while iMODS performs the pathway generation based on NMA in internal coordinates [39].
In this work, we present a new simulation method based on ENM, which is called the normal mode guided elastic network interpolation (NGENI). This method can generate pathways on a large-scale transition process by iteratively performing NMA. The conventional ENI method generates conformational pathways between the initial and the final conformations by obtaining displacement vectors iteratively toward a direction which linearly reduces rootmean-square deviation (RMSD) between the two given structures [22][23][24]. On the other hand, the displacement vectors in NGENI are composed of a linear combination of normal mode vectors. Because of this methodological difference, NGENI generates theoretically more reliable pathway than ENI of which the pathway doesn't take into account any physical aspects such as the intrinsic thermal vibration, but only complies with the given topological constraints in Cartesian space. As the existing morphing methods such as MENM, iENM and ANMPathway also utilize either topological constraints or potential energy landscape for prediction of large-scale transition pathways, our method is expected to be a new solution considering both aspects in a balanced way. In addition, NGENI is able to adjust degrees of freedom in the simulation by determining the number of normal modes to be used. This enables us to reduce the computational cost. The validity of NGENI has been verified with extensive testing by comparing NGENI to ENI.

A set of proteins
In this work, we choose a set of 8 proteins for which two end-point structures are available at the Protein Data Bank (PDB). The RMSD between the two end-point structures was more than 3 Å for all tested proteins. Such a topological difference is enough to test whether a protein undergoes conformational changes that cause its own biological functions. The information about these proteins is listed in Table 1.

Outline of NGENI
The purpose of the NGENI method is to construct a pathway between two end-point structures based on low-frequency modes, which are most relevant to biological function. The pathway comprises consecutive displacement vectors between an intermediate structure and the next one. To obtain these vectors, we established an objective function in which potential energy linearly decreases with respect to the RMSD value by interpolating the distance between spatially close residues (more detailed description on RMSD calculation is given in S1 Text). Here, the coordinates are collected from the position of the alpha carbon atoms (C α s), and spatially close residues are connected by linear springs based on the ENM concept [40,41].
The key idea of the NGENI is to generate transition pathways based on normal mode vectors. A linear combination of normal mode vectors yields the corresponding intermediate structure.
By adjusting the contribution weighting of each mode vector, we can define the vector set that satisfies the desired value of potential energy. Iteratively, a series of these displacement vectors are obtained from the initial to the final conformation to create a possible transition pathway.
Here, the number of iterations is preset by multiplying the RMSD value between the two endpoint structures by 10 in order to get a smooth transition pathway with the RMSD increment of 0.1 Å every iteration step. The overall scheme of NGENI is shown in Fig 1 and more details about the proposed objective function are also described in the following chapter.

Cost function
For a protein composed of n residues (e.g., C α s based on coarse-grained ENM), the coordinates for the two end-point structures are denoted by {x i } and {y i }, respectively. When we use the m lowest normal modes in the simulation, we can define the displacement vector for the i th residue as follows where v i,m is the m th normal mode vector of the i th residue and c m is a weighting constant of the m th normal mode. Thus, we can define and Here, we introduce a cost function as follows where k i,j is an element of the linking matrix which contains binary information about virtual spring connection in ENM. The value of l i,j is the desired distance between the i th residue and j th residue at a certain intermediate state. The value of l i,j can be determined as where α is a scale factor that interpolates between the initial distance kx i − x j k and the final one ky i − y j k. It ranges from 0 (i.e., initial) to 1 (i.e., final) with an increment of 1/s. Again, s is the total number of iterations here. We simplify the cost function in Eq (4) in order to obtain the optimum solution of C w . First, we define C i,j as a part of the cost function This equation can be simplified into a quadratic equation in terms of C w by using a Taylor series approximation for small values of V i C w and V j C w . where and E 3 is the 3 by 3 identity matrix. Then, we write Eq (6) as and In C ð1Þ i;j in Eq (10), we define P ð1Þ i;j 2 R 3Â3 as Then, we can rewrite Considering all the spring connections, we can obtain the following equation where Λ (1) 2 R m×m is defined as Next, for C ð2Þ i;j in Eq (11), taking P ð2Þ Then, We can also simplify the term such that where Lastly, let Λ (3) be a constant such that Consequently, we can derive a quadratic form of the cost function by substitution of Eqs (16), (20), and (21) into (9).
Our goal is to determine the value of C w , which minimizes Eq (22). To do that, the C w has to satisfy the following equation Through this computation, we can obtain the optimum weighting constant C w and determine the optimum displacement vectors from an intermediate state to the next one. Iteratively, this process can generate a conformational transition pathway between the given two endpoint structures. From a computational point of view, the whole process can be divided into two parts: NMA and the main computational part from Eqs (9) to (23). First, the time complexity of NMA is O(nm 2 ) when using the "eigs" function in MATLAB. This function is well designed to find the largest/smallest magnitude eigenvalues of sparse matrix efficiently using Krylov subspace methods including Lanczos and Arnoldi algorithms [42,43]. Next, in the main computational part, the most computational effort is required to construct the cost function in Eq (16)

Verification of NGENI by using the full degrees of freedom
We first evaluated the performance of NGENI with all normal modes (full-NGENI) by comparing it with the conventional ENI pathways [22][23][24] in terms of average RMSD values. For 8 large-scale transition proteins, we obtained average RMSD which is the average of RMSD values between two corresponding intermediate conformations generated by full-NGENI and conventional ENI for all iteration steps. As a result, the negligibly small RMSD values indicate that the full-NGENI generated similar pathways to those of conventional ENI for all cases ( Table 2). This is because the full normal mode vectors can take the complete set of degrees of freedom (DOF) into account. Mathematically speaking, this is nothing more than a different representation of the bases that constitute the given topological space. Therefore, we confirmed that the full-NGENI could generate reliable pathways for a large-scale transition process based on the fact that the conventional ENI pathway has already been verified elsewhere [22][23][24]. Potential of NGENI with partial degrees of freedom As low frequency modes dominantly show collective motion, one can significantly reduce computational cost without loss of generality of the pathway by using only those modes. Here, we define another NGENI with partial DOF, called optimum-NGENI, and test whether the optimum-NGENI is still able to generate reasonable transition pathways. Fig 2 illustrates this concept, in which each curved line represents a transition pathway inside a cylindrical space spanned by corresponding normal modes. The smaller search space of the optimum-NGENI enables us to dramatically reduce the computational cost. The size of the searching space can be easily adjusted by the number of normal modes taken in the optimum-NGENI, but it has not been determined whether this optimum number satisfies all general cases.
To determine the optimum number of lowest normal modes used in the optimum-NGENI, the quality of the optimum-NGENI pathway was evaluated using RMSD between intermediate conformations and the final given structure for every iteration step. If these RMSD values are less than its experimental resolution, then the proposed optimum-NGENI with a particular number of lowest normal modes is considered to satisfy the convergence condition. Although the optimal number of normal modes may be different in each case, 30 lowest normal modes seem to be sufficient to generate reliable pathways, as shown in Table 3.
For further assessment of this convergence condition, Fig 3 presents the transition pathways of two proteins: adenylate kinase and D-allose binding protein. Adenylate kinase catalyzes the transfer of a phosphoryl group from ATP to AMP [44] and undergoes rigid body motions of the NMP bind and LID domains with two pairs of hinges connecting each domain to CORE domain [45]. Fig 3A (upper) includes an actual simulation result showing that the optimum-NGENI successfully generates the rigid body movements of adenylate kinase. In addition, both full-NGENI and optimum-NGENI pathways are compared to each other in Fig 3A  (lower). Unlike full-NGENI, the error of the optimum-NGENI pathway is accumulated at the end and obviously caused by the missing DOF. However, this error is acceptable compared to the experimental resolution of the adenylate kinase structure. This result suggests that only a small portion of the lowest normal modes is sufficient to predict transition pathways without loss of generality because this portion has enough information to comprehend biologically important collective protein motion. Fig 3B also confirms similar results for D-allose binding protein, which has three hinge points between two domains [46]. The upper figure in Fig 3B shows that the generated pathway demonstrates the hinge movement without difficulty, and the lower one verifies that the pathway of optimum-NGENI meets the convergence condition. Furthermore, their reverse transition pathways (from closed to open form) were also generated by optimum-NGENI. They not only satisfy the convergence condition but also preserve realistic geometry during the reverse transition (see S3 Text).  Availability of optimum-NGENI for large proteins Thus far, we have defined optimum-NGENI as the NGENI using only the first 30 lowest normal modes through the evaluation of convergence condition of generated transition pathways. Now, we test if optimum-NGENI is still able to generate transition pathways for relatively large proteins such as group II chaperonin. The detailed structural information of group II chaperonin is provided in Table 4. As shown in Fig 4A, the optimum-NGENI pathway successfully describes the hinge-bending motions of the intermediate domains which play a key role in the folding mechanism of the group II chaperonin. Moreover, RMSD values of optimum-NGENI and ENI are compared to each other in Fig 4B. Although the optimum-NGENI pathway shows higher RMSD error at the end stage, it can still converge below the experimental resolution. Lastly, the bond lengths and bond angles are measured for evaluating physical reality of the proposed pathway ( Fig  4C). We have also confirmed that difference with the two end-point structures is negligible for bond length (less than 0.03Å).  Ideally speaking, there is no size limitation to optimum-NGENI. For large system, we can expect much higher computational efficiency by using a finite number of meaningful normal modes as the driving force of pathway generation. Although there is still an argument on how to predetermine the number of normal modes required to capture the system dynamics without any loss of generality, empirically speaking (also supported by our case study results), the first 30 lowest normal modes are enough to generate the transition pathway successfully. Also, it is noted that this number is not determined by the size of protein structure but the complexity of conformational transition.

Quality of optimum-NGENI pathway
To address whether the optimum-NGENI method achieves goals as good as the conventional ENI, the performance of both methods is compared using the average RMSD values for our protein set including the group II chaperonin (Table 5). Here, these values are obtained from averaging RMSD values between two corresponding intermediate conformations generated by optimum-NGENI and ENI for every iteration step. As the average value is smaller than the corresponding resolution value for all cases, topological difference between two pathways is negligible. To evaluate quality of the optimum-NGENI pathways for adenylate kinase and Dallose binding protein, we compared their weighting constants with those of full-NGENI ( Fig  5). For both cases, the first 30 weighting constants of the full-NGENI were very similar to those of the optimum-NGENI, in the sense that several lowest modes dominantly influence the transition pathway in proteins by generating large-scale and collective motion. Quantitatively speaking, the correlation coefficients between the two methods are 0.988 and 0.992 for adenylate kinase and D-allose binding protein, respectively. This result indeed validates that the proposed optimum-NGENI method can generate transition pathways as good as the conventional ENI does with only the fixed number of lowest normal modes (i.e., 30 in this context). The weighting constants for all the other protein pathways are also provided in S1 Fig.

Computational complexity of optimum-NGENI
The main advantage of NGENI is that one can incorporate large collective motions effectively when predicting transition pathways in proteins, because NGENI generates transition pathways considering geometric constraints and physical mechanics, despite the simple interpolation method. Of course, the computational cost of NGENI is higher than that of ENI because NGENI has to perform NMA at every iteration step to update intermediate conformations. Using a finite number of normal modes, however, the optimum-NGENI can overcome this drawback because it drastically reduces computational burden in the main computation in which the next intermediate conformation is determined by displacement vectors obtained from NMA. A rough mathematical calculation with big O notation yields that the optimum-NGENI follows O(nm 2 ), whereas the conventional ENI follows O(n 2 ) where n is the number of residues and m is a constant number of normal modes utilized in optimum-NGENI (see Materials and methods and S2 Text for more details). As the size of protein increases, the conventional ENI method requires much more computational time than optimum-NGENI. This relationship is verified by comparison of the actual computation time for both methods. For appropriate comparison, we take into account the average computing time to obtain each intermediate conformation. Fig 6 shows that computation time of the ENI (denoted by red circles) grows quadratically with respect to protein size, while the corresponding computation time of optimum-NGENI (denoted by blue quadrangles) increases linearly. Therefore, the optimum-NGENI method can be a reliable alternative to the conventional ENI by balancing physical realism and computational cost, regardless of protein size.

Conclusions
There are significant challenges in using experimental techniques to capture temporally lengthy, large-scale protein dynamics at the atomic level, so simulation methods play an important role in filling this gap by generating transition pathways between different conformational states, which are strongly related to biological functions. However, there is still concern regarding simulation reliability and computational cost. To compensate for this weakness, this work proposes a new morphing method called NGENI that interpolates the distance between spatially close residues based on a linear combination of normal mode vectors. This key idea helps us generate topologically allowable and physically reliable pathways.
Furthermore, the optimum-NGENI successfully provides in-depth study on transition pathway generation. First, it can elucidate how well a minimum number of collective modes generate protein transition pathways. Second, the concept of the optimum weighting constant can be also interpreted as a quantitative measure of the contribution of each mode to the transition pathway. Third, it compromises computational cost with the physical realism of the generated transition pathway by taking only a fixed number of lowest normal modes as a basis for searching space.
Consequently, it is expected that the optimum-NGENI not only assesses degrees of collectivity in protein dynamics, but also captures its functional transition pathway through a linear combination of several intrinsic vibration modes.