The Effective Fragment Molecular Orbital Method for Fragments Connected by Covalent Bonds

We extend the effective fragment molecular orbital method (EFMO) into treating fragments connected by covalent bonds. The accuracy of EFMO is compared to FMO and conventional ab initio electronic structure methods for polypeptides including proteins. Errors in energy for RHF and MP2 are within 2 kcal/mol for neutral polypeptides and 6 kcal/mol for charged polypeptides similar to FMO but obtained two to five times faster. For proteins, the errors are also within a few kcal/mol of the FMO results. We developed both the RHF and MP2 gradient for EFMO. Compared to ab initio, the EFMO optimized structures had an RMSD of 0.40 and 0.44 Å for RHF and MP2, respectively.


Introduction
The need to study very large systems in an efficient manner has led to the development of many computational schemes trying to cope with the limitation in computational resources. Linear (or nearly linear) scaling methods have long been of particular interest because they allow, within their respective framework [1][2][3][4][5][6][7][8][9][10][11], large systems to be treated by quantum mechanics. In particular, the use of fragments [12,13] is very attractive for doing calculations of large systems.
Recently, we developed the effective fragment molecular orbital (EFMO) method [14], which builds upon the fragment molecular orbital (FMO) method [15][16][17][18][19][20], and combines it with effective fragment potentials (EFP) [21][22][23]. EFMO is different from EFP, FMO and FMO/EFP [24,25] in several ways. For instance, the EFPs are computed on-the-fly from gas phase FMO fragment calculations and used for classical interactions of separated dimers and many-body effects. Extending the earlier work [14] limited to molecular clusters at the RHF level, we now present the methodology to treat fragments connected by covalent bonds at the MP2 level.
This article is organized as follows. First, we briefly outline the theoretical background of EFMO. We proceed to discuss the change in methodology needed to include fragmentation across covalent bonds in EFMO, including an overview of how fragment bonds are treated. The addition of correlation in EFMO is also presented here. Second, we benchmark the EFMO energy against ab initio calculations on three different sets of polypeptides and compare to FMO. We apply our findings to proteins and protein like structures. The quality of the gradient together with timings are also presented here. Water clusters are also briefly revisited. Finally, we summarize our results and discuss future directions.

Theoretical Background
In FMO, the total two-body (FMO2) non-correlated energy of a system consisting of N fragments (also called monomers) is given as Here E I (E IJ ) is the energy of monomer I (dimer IJ) in the electrostatic potential (ESP) of the other N{1 (N{2) fragments. The monomers converge in the field of ESP, requiring selfconsistent charge (SCC) iterations. Dimers converge in the field of ESP of the N{2 monomers.
The total non-correlated EFMO energy of a system of N fragments is where E 0 I is the gas phase energy of monomer (or fragment) I. E 0 IJ is the gas phase dimer energy of dimer IJ. The second sum in equation 2 is the pairwise correction to the monomer energy and only applies for dimers separated by a distance less than R resdim .

E POL
IJ and E POL tot are the classical pair polarization energy of dimer IJ and the classical total polarization energy, respectively. The final sum over E ES IJ is the classical electrostatic interaction energy and applies to dimers separated by a distance greater than R resdim . The fragment separation distance R I,J was defined previously [14]. Since EFMO only involves gas phase energy (and gradient) evaluations, only one SCC iteration is required.
In EFMO, the classical terms in the energy expression (equation 2) are calculated from expressions in the EFP perturbation expansion of the interaction energy [21,22]. Based on the converged fragment calculations, EFP parameters are derived on-the-fly completely automatically by computing atom centered monopoles, dipoles, and quadrupoles [26] and dipole polarizability tensors for each electron pair. [27].
The analytical gradient derived previously [14] is reformulated for fragments connected by covalent bonds, and also extended to MP2.

Covalent Bonds
For fragmentation across covalent bonds, no corrections to the basic equation of EFMO is needed. However, the inclusion of fragmentation across bonds requires a change in the methodology. In this paper, we show how fragmentation is carried out on protein backbones, this methodology is transferable to other systems just as FMO was applied to inorganic systems such as zeolites [28] and nanowires [29].
In regular FMO, two different schemes of fragmentation is possible. Common to both is that one specifies pairs of atoms which defines fragment boundaries ( Figure 1). Each detached bond is made of a bond attached atom (BAA) and a bond detached atom (BDA). The latter donates an electron to the fragment containing the BAA. One scheme is the hybrid orbital projection (HOP) approach [16], which allows full variational treatment of molecular orbitals (MO) across the bond during the fragment SCF. The other is the adapted frozen orbital (AFO) method [28,29] which freezes the occupied orbital that describes the bond [30]. EFMO uses the latter method, and for completeness we include a discussion of this particular scheme in this work.
In AFO, a model system around the BAA and BDA is constructed ( Figure 2). RHF calculations are carried out on this system, followed by an Edminston-Ruedenberg localization [31]. The occupied orbital which has the largest overlap with the BDA and BAA is identified as the special bond orbital (SBO) shown on Figure 3. This orbital, along with several virtual orbitals on the BDA is stored for later use in monomer and dimer SCF calculations.
For polypeptides, which is the main focus of this study, there is one SBO per pair of BAA and BDA. This SBO is associated with the fragment that contains the BAA. After the computation of all model systems, monomer calculations are done, followed by a Foster-Boys localization, where the SBO is kept frozen, i.e. not allowed to mix with the rest of the orbitals. This leads to a polarizable point in the centroid of the SBO (Figure 3), obtained from the model system across the bond ( Figure 2). We have thus successfully eliminated the need to manually parametrize the bonds between pairs of fragments.
In the original formulation of EFMO, the electric field arising from a static multipole or induced dipole in fragment I is screened by a Tang-Toennis type expression.
Here, a and b are the screening parameters associated with fragments I and J, respectively. The distance parameterR R is the vector between an induced dipole in fragment I and any of the electric moments in fragment J. The above expression is also the default in EFP [21,22] with the parameters a~b~0:6. We emphasize that the screening parameters are associated with fragments and not individual polarizable points.

Correlation
The introduction of correlation energy in the EFMO method follows previous work in FMO [32][33][34]. The total correlated energy of a system of N fragments is given as.
Here E COR is given as the sum of monomer correlation energies E COR I and pairwise corrections, i.e.
where E COR IJ is the correlation energy of dimer IJ. The distance parameter R corsd determines whether or not correlation is included for a specific dimer. The value of the parameter is discussed in the computational methodology section below. Note that for the correlation energy any size-extensive post-HF scheme can be used.

Computational Methodology
All ab initio and fragment calculations were carried out in a locally modified version of GAMESS [35]. EFMO was parallelized with the generalized distributed data interface [36]. In all calculations, the 6-31G(d) [37][38][39] basis set was employed throughout unless specified otherwise. In all the geometry optimizations, a convergence criterion of 5:0 : 10 {4 Hartree/ Bohr was used.
For FMO (and EFMO), the AFO scheme was used throughout with the default settings for bond definitions (LOCAL = -RUEDNBRG in $CONTROL and RAFO(1) = 1,1,1 in $FMO). The parameters for the electrostatic treatment of dimers R resdim and the threshold for the inclusion of correlation effects R corsd were both set to 2.0 (RESDIM = 2.0 RCORSD = 2.0 in $FMO) unless otherwise specified. The distances are relative to the vander-Waals radii of atoms (see ref [14] for details). The screening parameter for all fragments are set to 0.1 for fragments with and without the SBO (SCREEN(1) = 0.1,0.1 in $FMO), respectively unless specified otherwise.
The following structures used in this study were taken from previous work by Fedorov et. al. [32,34,41] This includes a-helices (a{(ALA) n ) and b-sheets (b{(ALA) n ) of alanine, Chignolin (PDB code: 1UAO) and the Trp-cage (PDB code: 1L2Y). Correlation effects on molecular clusters is carried out by investigating the structures from our previous study [14]. The crystal structure of the 42 residue protein Crambine (PDB code: 1CRN) is also included and protonated using the PDB2PQR tool [42,43].
The three polypeptides used in this study were constructed by selecting six neutral (at pH = 7) amino acids AIVGLT (P1) and AVSNTL (P2) as well as four neutral and two non-neutral (at pH = 7) residues AVKNTD (P3) and padded with two glycine residues at each end for a total peptide length of 10 residues. The polypeptides were protonated (at pH = 7) using the PDB2PQR tool. P1 had neutral termini (arguments -neutralc -neutraln) while P2 and P3 both had charged termini. For each polypeptide, a conformational search was carried out to locate twenty different structures using the ObConformer tool of the Open Babel package [44,45]. They were finally minimized using PM6 [46] in MOPAC [47] with a bulk solvent (EPS = 80.1).
Only results for two residues per fragment are discussed in detail below, and the results for one residue per fragment are shown in the supporting information (Table S1). We note that because of the large charge transfer in some charged systems the one residue per fragment division leads to very considerable errors.
When interpreting the accuracy of the results, the following quantities of errors are defined for energies. The error in energy.
the average deviation of conformers and the mean average deviation (MAD) for conformers Here, M is FMO2/HOP, FMO2/AFO or EFMO and X is RHF or MP2. I runs through N conformers of polypeptides. To evaluate the quality of the EFMO gradient, numerical gradients (+E num ) were calculated on a-(ALA) 10 and compared to its analytical counterpart (+E ana ) by the root mean square (rms) deviation of the individual elements and the maximum deviation N A in equation 9 is the number of atoms in the molecule of interest, i in equation 10 runs through 3N A atomic coordinates.
To measure the compactness of a protein we use the radius of gyration R 2 g given as Results and Discussion

Application to Polypeptides
The performance of EFMO has a critical dependence on the screening parameter (equation 3, Figures S1, S2 and S3, and Tables S2 and S3) because of the close position of a) induced dipoles located at the centroid of the SBO in one fragment and b) the nearby electrostatic moments and induced dipoles in another (especially, adjacent) fragment. In the following, the screening parameter for all fragments is a~0:1 unless otherwise specified. Figure 4 shows the MAD results obtained for two residues per fragment for all three polypeptides (P1, P2 and P3) using FMO2/ HOP, FMO2/AFO and EFMO for both RHF  For the charged polypeptide P2, MAD ( Figure 4) increases by roughly a factor of two. The factor is about 3 for P3 (from 2.02 kcal/mol to 5.94 kcal/mol for the RHF energy). The inclusion of charged residues results in larger induced dipoles, which has a negative impact on the accuracy of the energy in EFMO. The accuracy of charged systems may be ameliorated by solvent screening. [48][49][50].
If one considers the average deviation (equation 7 and Figure 5) instead, it is interesting to note that EFMO compares well with FMO2, and the agreement for P3 is perhaps fortuitous (the error is less than 0.5 kcal/mol for EFMO-MP2). The maximum deviations for EFMO, however, are larger in all cases by roughly a factor of two.
For all three peptide ensembles, there is a good correlation between the compactness of the peptide conformation (measured by the radius of gyration, equation 11) and the error in the energy (see supporting information Figures S4, S5 and S6). More compact structures place the charged groups closer to the polarizable points at the fragment boundaries resulting in large induced dipoles and errors in the total energy.

Application to Proteins
The above benchmark of EFMO serves as an initial probe for how the energy behaves for polypeptides as the number of residues per fragment and screening parameters change. Based on those tests, we now apply EFMO to proteins or protein-like structures. The alanine polypeptides are particularly good for studying any systematic error, albeit they are not a representative benchmark for real proteins.
In Table 1  The results from the a-helices and b-sheets are somewhat more detrimental. With the exception of the RHF EFMO results, the errors are roughly additive for the poly-alanine peptides, so the errors are discussed on a per residue basis. For a-helices, the error in energy increase with system size from 22.94 (0.32) kcal/mol for a{(ALA) 10 to 0.18 (218.94) kcal/mol for the large a{(ALA) 40 helix, which corresponds to an average error per residue of 0.29 (0.03) kcal/mol for a{(ALA) 10 and less than 0.01 (20.47) kcal/ mol for a{(ALA) 40 . The a-helices tend to illustrate the case of over-polarization. For a{(ALA) 10 , the total polarization energy is small (212.89 kcal/mol) but as the system system size increase, so does the total polarization energy (273.81 kcal/mol) in a nonlinear fashion. We note that the MP2 energy for a{(ALA) 20 and a{(ALA) 40 increases linearly with system size but the RHF energy does not. The over polarization is also observed for FMO2/AFO, although the MP2 energies are much better (below 2 kcal/mol) which can only be attributed a better wave function of the individual fragments and their pairs. The b-sheets have errors which are lower than in the a-alanines the errors are from 0.60 (0.89) kcal/mol to 4.05 (6.46) kcal/mol for b{(ALA) 10 and b{(ALA) 40 , respectively. Overall, the average error per residue becomes 0.06 (20.50) kcal/mol and 0.10 (0.16) kcal/mol for b{(ALA) 10 and b{(ALA) 40 , respectively. The b-sheets are planar and not prone to the same over-polarization (the b{(ALA) 40 has a polarization energy of around 50 kcal/mol).
As noted above, the a-helices and b-sheets illustrate two very different polypeptides. The inaccuracy of EFMO for them is somewhat alleviated by the fact that the errors in energy for Chignolin and the Trp-cage proteins are smaller than the a-helices and b-sheets. The Trp-cage has 20 residues and its error in energy of 22.87 (24.21) kcal/mol lie around the corresponding a-helices and b-sheets of the same size 22.75 (29.66) kcal/mol to 1.74 (2.78) kcal/mol, respectively. The same is true for Chignolin.

Gradients and Geometry Optimizations
A key strength of EFMO over other similar methods [7][8][9][10][11] is the availability of the gradient. The gradient of FMO2/AFO has been investigated previously for zeolites [29] where errors in gradient were found to be +E rms : 0:2 : 10 {3 Hartree/Bohr and +E max : 1:4 : 10 {3 Hartree/Bohr when compared to numerical derivatives (equations 9 and 10) although with a smaller basis set than in this study. It was found, that even though these deviations  were present, geometry optimizations did result in satisfactory structures.
In this study, we present an investigation of the EFMO gradient comparing numerical and analytical values for proteins (Table 2). It has roughly the same accuracy-related issues found for zeolites, specifically around the bond regions where rms and maximum errors for FMO2-RHF/AFO with and without the electrostatic potential is 0:51 : 10 {3 Hartree/Bohr, 3:43 : 10 {3 Hartree/Bohr and 0:76 : 10 {3 Hartree/Bohr and 4:71 : 10 {3 Hartree/Bohr, respectively which is on par with what was found for zeolites. The latter result is particularly interesting as it is the FMO2/AFO result on top of which we add the EFP terms to obtain EFMO (equation 2).
Several different approaches to tackle the gradient were attempted. The first is the original approach taken for molecular clusters which is to transfer the gradient terms of the induced dipolesm m ind to the nearest atom only, in this study named EFMO org . This is a clear improvement over the FMO2/AFO (without the ESP) result (+E rms : 0:73 : 10 {3 Hartree/Bohr, +E max : 3:50 : 10 {3 Hartree/Bohr), but some deviations in gradient get worse using EFMO and will be discussed further below. Removing all torque contributions (EFMO nt ) reveals further improvements (+E rms : 0:68 : 10 {3 Hartree/Bohr, +E max : 3:30 : 10 {3 Hartree/Bohr). Another approach, specifically for the induced dipole (EFMO ntzpct ) is to do a percentage based distribution of the induced dipoles based on the distance between two atoms (supporting information Text S1 and Figure  S7). This only applies if the induced dipole is between two atoms and the gradient is distributed based on a percentage of the entire bond length. This further improves the results, but the improvement (+E rms : 0:66 : 10 {3 Hartree/Bohr, +E max : 3:27 : 10 {3 Hartree/Bohr) reveals that the main source of the error is not due to EFMO (Figure 6), but pertains to approximations in the FMO2/AFO gradient. To make sure that the induced dipoles do not cause major problems, an approach was tried to not evaluate the electric field from the static multipole moments and the induced dipoles, both in the energy and the gradient, of adjacent fragments, that is fragment I covalently bound to fragment J does not induce dipoles in I and vice versa. Results with (EFMO ntzpctzadj ) and without (EFMO ntzadj ) percentage based distribution of induced dipoles are (+E rms : 0:66 : 10 {3 Hartree/ Bohr, +E max : 3:73 : 10 {3 Hartree/Bohr) and (+E rms : 0:66 : 10 {3 Hartree/Bohr, +E max 3:74 : 10 {3 Hartree/Bohr) offer no clear advantage over EFMO ntzpct on the RHF level of theory, and consequently MP2 data are not presented.
From Figure 6, it is clear that EFMO fixes some of the issues that FMO2/AFO has, but evidently creates a few new ones at atom indices 111 (backbone nitrogen), 155 (backbone carbonyl), 231 (backbone nitrogen) and 236 (backbone C a ). Common to all is that it is around the bonding region. Evidently, small perturbations in the geometry, specifically around the bonding region, has large implications for the generated EFP parameters. For FMO2-MP2/ AFO and EFMO-MP2 (Figure 7 and Table 2), the errors in the gradient decrease for the EFMO ntzpct methodology (+E rms : 0:61 : 10 {3 Hartree/Bohr, +E max : 2:89 : 10 {3 Hartree/Bohr) while FMO2-MP2/AFO errors are very similar to the corresponding RHF values.
Finally, geometry optimizations were carried out for a-(ALA) 10 using the 6-31G(d) basis set and the EFMO ntzpct procedure. Figure 8 shows the improvement in energy as a function of the number of steps taken in a geometry optimization. The obtained optimized structures have the lowest energies when comparing to all the taken steps, even for one residue per fragment. Compared to RHF (MP2) optimized structures, the rms between the optimized structures are 0.40 (0.44) angstrom (EFMO with one residue per fragment did slightly worse). This can be compared to the 0.3 angstrom that was obtained for FMO2-RHF with HOP previously [41].
EFMO offers a gradient whose quality is similar to FMO2/ AFO calculations but at a reduced cost. The quality of the FMO2/AFO gradient could be improved if fully analytic derivatives available such as what was done by Nagata et. al. for HOP [51][52][53]. Another improvement can be obtained with an addition of the derivatives of the EFP monopoles (and higher order multipoles) as outlined by Xie et al. [5] We recommend EFMO ntzpct for geometry optimizations of polypeptides.

Molecular Clusters
Inclusion of correlation in EFMO (equation 4) warrants a new benchmark of the water clusters that was used in the original EFMO paper. In Table 3, results for MP2 energies are shown for R resdim~Rcorsd~2 :0 for various basis sets. Since there are no covalent bonds, the screening parameter was given its original value of a~0:6. In the original EFMO paper, the errors in energy for water clusters were discussed per hydrogen bond (HB) due to EFMO only describing higher order many-body effects for polarization (see ref [14] for full details), thus, the error is a lack of many-body terms per HB. For EFMO-MP2, only monomer and ab initio dimers are considered correlated and the lack of treatment separated dimers gives rise to new errors but we expect Table 2. Errors in gradient of EFMO and FMO2/AFO for the a{(ALA) 10 polypeptide using RHF and MP2.  these to be small. EFP does include dispersion terms [54], but these are not included in this work. The EFMO-MP2/6-31G(d) results deviate by a maximum of 0.78 kcal/mol per HB, which is worse than FMO2-MP2/6-31G(d) which deviates by a maximum of 20.43 kcal/mol per HB.
Increasing the basis set shows that the EFMO errors are 0.02 and 20.05 kcal/mol per HB for 6-31+G(d) and 6-31++G(d), respectively. For FMO2, the respective errors are 20.76 and 20.48 kcal/mol. The errors we observe for the larger clusters

Timings
In our previous study [14], EFMO-RHF for molecular clusters were two (five) times faster than the corresponding FMO2 energy (gradient) calculation. In Table 4, results for Chignolin and the Trp-cage are presented for 5 nodes using 2 cores per node. All timings were carried out on Intel Xeon X5550 CPUs. Here, using EFMO-MP2 instead of EFMO-RHF increases the computation time by roughly a factor of two (from 14.0 minutes to 29.5 minutes for Chignolin using R resdim~2 :0). For FMO2, the same calculation takes 38.5 minutes and 58.6 minutes, respectively. An EFMO-RHF gradient evaluation for Chignolin takes only three minutes longer than the energy, but becomes a five-fold increase when running EFMO-MP2 gradients. The same trends are observed for the Trp-cage. We note a significant speedup when lowering the cutoff distances R resdim and R corsd , especially for the larger Trpcage. When the cut-off distances go down, the number of ab initio dimers decrease. Especially MP2 gradients require much CPU time due to the number of integrals that needs to be transformed [40].
We note that lowering of the cutoff distances R resdim and R corsd can have significant impact on the accuracy [18,32] like we observed for molecular clusters [14], however for a modest lowering of the thresholds to R resdim~Rcorsd~1 :5, the energy deviations from ab initio are not affected greatly (Table S3).

Summary
The effective fragment molecular orbital (EFMO) method is a merger of the effective fragment potential (EFP) method and the fragment molecular orbital (FMO) method and combines the general applicability of the FMO method (for example, to flexible biomolecules) with the speed of the EFP method. In this work, we have introduced new methodology needed to make EFMO work for systems with covalent bonds such as proteins. This, together with the analytical gradient provides an agile tool to treat proteins at a reasonable level of theory. We also showed how to incorporate electron correlation via Mø ller-Plesset perturbation theory.
We made an extensive study on small polypeptides to assess the need for screening when dealing with covalent bonds and found that an additional screening is needed compared to regular EFP. We showed that the deviations in energy on proteins are on par with FMO2 to within a few kcal/mol when using two residues per fragment. For example, Chignolin is reproduced to within 0.1 kcal/mol compared to FMO2. Timings were consistent with our previous work. We obtained two to five times speedup when using EFMO over FMO2 for RHF. The speedup was somewhat lower when employing MP2 gradients, resulting in speedups between 1.6 and 2.3.
There are many ways in which the EFMO method can be improved and extended, for example, interfacing EFMO with the polarized continuum model (PCM) or the classical dispersion interaction in EFP [54] which would enable us to lower R corsd compared to R resdim , thus speeding up the evaluation of the gradient greatly. Another direction is to follow the multilayer FMO method [55] and the recent frozen domain FMO (FMO/ FD) method [56].
FMO has been applied [57][58][59] to a number of chemical problems, [60] and we expect that EFMO can be a useful method on its own, for example, in the structure optimization of proteinligand complexes and other studies related to drug design.   Text S1 Detailed description of the percentage based distribution of the gradient between two nearby atoms. (TEX)