Automated fitting of transition state force fields for biomolecular simulations

The generation of surrogate potential energy functions (PEF) that are orders of magnitude faster to compute but as accurate as the underlying training data from high-level electronic structure methods is one of the most promising applications of fitting procedures in chemistry. In previous work, we have shown that transition state force fields (TSFFs), fitted to the functional form of MM3* force fields using the quantum guided molecular mechanics (Q2MM) method, provide an accurate description of transition states that can be used for stereoselectivity predictions of small molecule reactions. Here, we demonstrate the applicability of the method for fit TSFFs to the well-established Amber force field, which could be used for molecular dynamics studies of enzyme reaction. As a case study, the fitting of a TSFF to the second hydride transfer in Pseudomonas mevalonii 3-hydroxy-3-methylglutaryl coenzyme A reductase (PmHMGR) is used. The differences and similarities to fitting of small molecule TSFFs are discussed.


Introduction
Understanding how enzymes achieve their catalytic function is one of the grand challenges of chemistry and biology. Studying enzymes using computational methods has produced highly impactful work, as highlighted by the award of the Nobel Prize in 2014 1 for the development of multiscale methods such as the Quantum Mechanics/Molecular Mechanics (QM/MM) method. 2 Because enzymes consist of tens of thousands of atoms, using even low level electronic structure methods is cost prohibitive for the full system. Furthermore, extensive sampling of the conformational space, e.g. by molecular dynamics simulation at the microsecond time scale for the enzyme, possible ligands, and the surrounding water molecules, is necessary to obtain physically meaningful results. To enable such simulations, a range of classical force fields that approximate atoms and bonds as masses connected by springs have been developed. 3,4 The accuracy of these simulations is dependent on the quality of the force field used. 5 As a result, extensive validation studies of the force field functional form as well as the parameters themselves have been performed.
The use of machine learning (ML) methods in science and technology has expanded exponentially in recent years, in part due to the rapid expansion in computational power and available datasets. In chemistry, applications of ML range from basic research through material research to drug discovery. 6 More pertinent to the topic of the present study, ML has been applied to force field and PEF parameterization given its strengths in pattern recognition. [7][8][9][10] There are numerous examples in materials chemistry, where the accurate description of large systems to predict material properties demanded a cheap method at high accuracy. 11 Another well-recognized example is the ANI-1 potentias 12 that use active learning and neural network algorithms to take high-level QM data to create transferable ML potentials. 13 Even though the development of ML methods for the treatment of enzymatic reactions provides an alternative to the computationally expensive QM/MM methods, there have been comparatively few ML applications for force field development reactions and/or biomolecular systems. One reason is that in most cases, the new potential energy surfaces created break away from the restrictions of a predefined functional form. This is less likely to be successful for the case of the study of reactions in biomolecular systems, as exemplified by Pseudomonas mevalonii 3-hydroxy-3-methylglutaryl-CoA Reductase (PmHMGR) shown in Figure 1. Here, the vast majority of the system (shown in grey) is well described by extensively validated classical force fields. However, these cannot describe the substrate, cofactor, or residues involved in the transition state the reaction (show in color). The large dataset needed for training of an ML PEF for the reactive center is not available from experimental data and cannot be generated from high-level electronic structure calculations due to their high computational cost. Here, we propose an alternative approach that is reminiscent of transfer learning where the functional form and extensively validated parameters of a classical force field (in the present case, AMBER) are used and retrained for a subset of the structure that includes the bond breaking and making atoms as well as key active site residues and cofactors (shown in color in Figure 1) using the quantum-guided molecular mechanics (Q2MM) method that was originally developed for the parameterization of small molecule force fields, especially TSFFs. 14,15 As mentioned before, most of the work done on the use of ML for all-atom force fields has been focused on small molecules or solvents using functional forms determined by e.g. a neural network. 16,17 There are a few examples of the use of ML for fitting predefined functional forms using both linear and non-linear regression algorithms in the literature to reproduce training data from appropriate electronic structure methods. ML in the form of a genetic algorithm was used to optimize a polarizable force field from ab initio QM data 18 as well as the parameterization of reactive force fields. 19 The Parsely force field for small molecules uses QM data for parameterization of an AMBER-lineage with SMIRNOFF atom specification. 20 Similarly, the AMBER-15 Force Balance force field 21 for use with the TIP3P-Force Balance water model 22 is fitted using a weighted least-squares method. The AMOEBA-2013 force field also was optimized using automated techniques to obtain a general polarizable protein force field. 23 However, these studies concern ground state (GS) force fields that are not able to describe bond breaking and making steps of an enzymatic reaction where a TSFF is needed.
One of the best established 24 automated fitting procedures for the parameterization of both ground state and transition state force fields (TSFF) is the Quantum Guided Molecule Mechanics (Q2MM) approach that has been used extensively for the development of TSFF for the prediction of stereoselectivity of small molecule reactions. [25][26][27][28] To the best of our knowledge, the only application of Q2MM to biomolecular systems is a TSFF for transition-state docking of small molecular drugs to P450 enzymes to identify potential sites of hydroxylation. 29,30 However, the code used for this fitting procedure is to the best of our knowledge not widely available.
Q2MM uses training data from electronic structure (usually density functional theory) reference calculations to automatically parameterize molecular mechanics TSFF based on the MM3* PEF. The details of this process for asymmetric catalysis by small molecules have been covered elsewhere 14,15,31 and will not be elaborated on here. Here, we will describe the application of the Q2MM method to derive TSFFs of a predefined functional form compatible with the AMBER-family force fields with particular attention to the differences to the fitting of small molecule TSFFs. We will also discuss the interfacing of the Q2MM tools to the AMBER suite of molecular dynamics programs and demonstrate this workflow for the case of a TSFF for the second hydride transfer of PmHMGR.

Fitting Methods
Q2MM fits the FF parameters by minimizing the value of objective or loss function, where x i 0 is the reference data point, x i is the FF data point, and w i is the weight for that data point. The minimization step in the parameter space is calculated using gradient-based method such as the Newton-Raphson technique and simplex method. 32 The gradient-based method is general and utilizes the Jacobian matrix J where and p j is j-th parameter, which is calculated in many programs using numerical differentiation and therefore the rate-determining step. Thus, the simplex method is often used to avoid the high cost of numerical derivatives. 33 The simplex method in Q2MM is modified to move toward the best point(s) in the parameter space using a bias of reflection point. 32 The modified simplex method has shown to have faster convergence than the Raphson-type methods up to ca. 40 parameters. 31 Thus, it is used to optimize a medium-sized parameter set or a subset of the larger parameter set.
Q2MM, unlike most traditional methods for fitting system-specific FF parameters, 22 at the transition state geometry, the eigenvalues contain one significantly negative value with its eigenvector representing the reaction vector. By providing Hessian matrix information in the objective function, Q2MM uses information on both the transition state geometry and the shape of the potential energy surface around it when fitting the FF parameters. However, because Q2MM fits these parameters to represent the transition state, which is a saddle point, as a minimum on the potential energy surface, the matrix element that corresponds to the negative eigenvalue is, inevitably, altered during the fitting process. This leads to an increase of the objective function value.
To address this and the fact that the algorithms in most molecular force field-based programs 38, 39 38 are designed to optimize towards minima rather than transition states, a small modification to Q2MM is  The Q2MM Flow Scheme The following parameterization scheme is specific towards the implementation of the AMBER20 39 interface of Q2MM and its use for large large biomolecular systems. Details of the method regarding parameterization of TSFF for asymmetric catalysis using other programs such as Macromodel have been documented elsewhere. 15 As an example of using Q2MM for a large biomolecular system, the development of a TSFF for the second hydride transfer transition state of PmHMGR, [43][44][45] shown in Figure   1, will be discussed. Examples of the files discussed in this section as well as the final TSFF are given in the Supporting Information. The Q2MM code itself, which contains the interface to the AMBER Suite of programs, and several published TSFFs are freely available on the Q2MM/CatVS github repository (github.com/q2mm).
In order to develop a TSFF for an enzyme, the first step is to define a model system that includes the reactive species and the relevant parts of the protein involved in catalysis to generate the training data for the TS of this model system. For the example discussed here, the QM/MM or theozyme 46 model incorporated the relevant residues in the QM region derived from our previous studies 43, 44 of the mechanism of HMGR and shown in Figure 2, though other model systems were also explored. 43 Since this model system is derived from electronic structure calculations, only the most essential atoms should be included for efficiency of the fitting procedure even though the methodology is equally applicable to larger numbers of refitted atoms. A fixedatoms.txt file is created to include any atoms frozen in the electronic structure calculation (Figure 2, green atoms). Because the frozen atoms create unphysical Hessian elements, the weight of the Hessian values associated with these atoms are set to zero during the parameterization. Results of transition state optimizations, in a .log file, contain the energetic and geometric data that are used by Q2MM in the parametrization and are thereby included in the Q2MM input as reference. Currently, Q2MM supports interfaces to Gaussian 47 and Jaguar 38 .log files as training data for the parameterization process. The .log file is also used to create a .mol2 file of the model system using the RESP protocol in AMBER. The .mol2 file contains updated partial charges of all the atoms in the model system at the TS and is for used throughout the parametrization. At this point, new atom types should be assigned to the atoms directly involved in the reaction, as their properties will be sufficiently different from that of the parent force field. This allows for the parameters defined by the TSFF to be restricted to a specific atom in the entire system. The atoms to be reparametrized in the case discussed here are shown in Fig. 3A. It should be noted that this procedure is analogous to transfer learning in that parameters trained to a much larger dataset (standard parameters of the Amber force field) and extensively validated in the literature are used as a starting point for retraining a much smaller subset for which smaller training data sets are available. It is a key difference from the development of TSFFs for transition metal catalyzed reactions 14,15,25,26 where there are usually no parameters available for the transition metal environment. As a result, a much larger training set is needed in those cases to achieve a reliable TSFF. Even though the number of atoms to be retrained is usually larger for the case of enzyme catalyzed reactions, the use of a transfer learning approach makes the fitting procedure much more effective because the vast majority of atoms only undergoes minor perturbations in proceeding from the ground state to the transition state of the reaction. The .mol2 file should also be used to generate the force field modification (frcmod) file, using antechamber program of AMBER. 39 The .frcmod file needs to be updated accordingly to be used in Q2MM, examples of which can be found in the documentation on github. All parameters such as bonds, angles, and dihedrals for atoms directly involved in the reaction should be included in the .frcmod file.
Transition state parameters are different from the ground state ones, so initial guesses of the bond lengths and angles should be for the system at the TS as described by the QM reference data. The estimation of the parameters prevents optimization to local nonphysical minima of the objective function and decreases the number of iterations required for parameterization. Force constants are initially set to standard values based on the Generalized Amber Force Field (GAFF), 48 and initial estimations for dihedrals should in our experience be avoided to prevent over-parameterization. The parameter.py module of Q2MM generates a list of a specified parameter type to be optimized that references the .frcmod file line and includes the range of values acceptable for that parameter type. Partial charges should remain unchanged throughout the course of the parameterization. The order of parameterization ( Figure 2) is largely the same as discussed earlier. 15 The force constants should be optimized first while ensuring that the optimized value stays above 32.2 kcal mol -1 Å -2 for bond distances and angles and 3.2 kcal mol -1 Å -2 for dihedrals. Subsequently, the bond length parameter can be refined to reflect the reference data. Bond angles can be optimized after the bond lengths while ensuring that the optimized values are within reasonable ranges. If the optimized angles deviate towards unreasonable values, then this angle parameters value should be "tethered" to the reference data to prevent major deviations during optimization. The tether is defined as a weight value associated that would thereby control the deviation of the parameter being optimized. A higher tether weight should be used in the first round of optimization, then slowly decreased to zero in subsequent optimizations cycles.
Finally, the Vn terms for the torsional potentials are fit to the Hessian data first before being further refined. A second round of optimizations should be performed with a 1% convergence criterion for the penalty function to allow parameter refinement to be closer to the reference data parameters. Additional optimization cycles can be performed as needed until a working transition state force field has been obtained. For enzymatic systems, a working TSFF is obtained when an optimization step changes the objective function by less than a 1% and the values and parameters are deemed realistic by the given user.
Additionally, the resulting force field should be tested in a large-scale molecular dynamics simulations in conjunction with the ground state force field to describe the remainder of the protein (shown in gray in Figure 1). The TSFF will have to be parsed to generate new residue types that contain reparametrized atoms and new library files will need to be created to read into the leap module of AMBER20. This could also involve setting conditions that allow the reacting atoms to have more than the standard amount of bonds in a system. Other important considerations are adjusting the time step of the simulation to account for the vibrations of the reacting atoms and potentially removing the SHAKE algorithm for hydrogens in the TSFF. A short MD simulation should then be performed to ensure that the total energy of the system remains stable with the TSFF in combination with the ground state FF that would be used for the rest of the enzyme.

Application to PmHMGR
This method described above was employed for the second hydride transfer TS of PmHMGR.
Here, the reference data for the training of the TSFF were obtained from QM/MM calculations where the atoms indicated in Figure 2 were treated at the ONIOM-(B3LYP/6-31G(d,p):AMBER) level of theory. 43,44 This includes the side chains of H381, K267, D283 and E83 as well as the substrates and cofactor as shown in Figure 3 and the hmgrqm.log example file in the Supporting Information. As the functional form of the underlying force file to which to fit the TSFF to, AMBER99SB and GAFF for atoms on residues and substrates were used, respectively, as seen in the ts2.frcmod file. During parameterization, the full size of the substrates and cofactor, along with the backbone and sidechains of the residues mentioned above, were included while calculating the MM data ( Figure 3B). As discussed earlier, the bonding character and partial charges of the atoms directly involved in the TS change in going from the ground to the transition state. Furthermore, the nicotinamide ring of the cofactor is dearomatized. To describe these perturbations, new atom types were introduced as indicated in Figure 3A. It is worth reemphasizing that the initial parameters for these new atom types were derived from the standard ground state AMBER99SB parameters and then trained to reproduce the electronic structure results in the training data. In this specific case, only parameters directly associated with these atoms (within 3 bonds) were reparametrized for the TSFF.
As shown in Figure 3 B, the TSFF successfully reproduced the geometries around the reacting center of the active site and could successfully be incorporated into the rest of the enzyme that is treated with a traditional ground state force field. Using this, the enzyme could be simulated at the transition state on the microsecond timescale. The results of these studies will be discussed elsewhere.

Conclusions
In this contribution, we have discussed an automated workflow that combines the Q2MM method with transfer learning-type approaches for the generation of fast and accurate TSFFs for large biomolecular systems. Application of the workflow to the second hydride transfer of HMGR, an enzyme of high biomedical importance, shows that the transition state of this reaction can be accurately reproduced by the TSFF derived by this workflow.
The use of machine learning to generate potential energy functions that are orders of magnitude faster to compute than their training data, which often are derived from accurate but slow electronic structure calculations, is a promising application of ML in chemistry. The work presented here uses the philosophy of transfer learning and applies it to the parametrization of TSFF by retraining of well validated existing force fields as oppose to creating completely new atom types and parameters, as is done in the generation of small molecule TSFF that cover transition metal catalyzed reactions. The results are an early example for using only electronic structure reference data and a much larger number of parameters adjusted in the biomolecular TSFF than in the earlier cases of small molecule TSFFs. They show that idea derived from ML can be used to parameterize a TSFF to simulate enzymes at the transition state ~10 4 times faster than the underlying electronic structure methods, allowing for molecular dynamics simulation for system sizes and timescales well beyond the accessibility of DFT-based methods. 13