Interface of the Polarizable Continuum Model of Solvation with Semi-Empirical Methods in the GAMESS Program

An interface between semi-empirical methods and the polarized continuum model (PCM) of solvation successfully implemented into GAMESS following the approach by Chudinov et al (Chem. Phys. 1992, 160, 41). The interface includes energy gradients and is parallelized. For large molecules such as ubiquitin a reasonable speedup (up to a factor of six) is observed for up to 16 cores. The SCF convergence is greatly improved by PCM for proteins compared to the gas phase.


Introduction
Continuum solvation models such as the polarized continuum model (PCM) [1] and the conductor-like screening model (COSMO) [2] offers a computational efficient model of solvation for molecules treated with electronic structure methods. This paper describes the implementation of an interface between the conductor-PCM (C-PCM) model [2,3,4] and the NDDO-based semi-empirical methods implemented in GAMESS [5] (MNDO [6], AM1 [7], and PM3 [8]). There has been several different implementations of semi-empirical/PCM interfaces [2,9,10,11,12] and this work follows the implementation proposed by Chudinov et al. [9] However, we also implement the corresponding energygradient terms and both the energy and gradient terms are parallelized and tested on relatively large systems such as the protein ubiquitin. This paper is organized as follows. 1) We review the relevant expressions for the semi-empirical/PCM interface. 2) We present results of solvation free energies and compare them to previous results. 3 ) We test the numerical stability for geometry optimizations and vibrational analyses. 4) We present timings and parallelization speed-ups for protein-sized systems. 5) We summarize our findings and provide possible ideas for future improvements.

Background and Theory
In PCM, a molecule (the solute) is placed inside a solvent-cavity usually described by introducing interlocked spheres placed on the atoms of the molecule. The solvent is described as a polarizable continuum with dielectric constant e. The interaction between the solute and the solvent is described by the apparent surface charges (ASCs). The PCM equations are solved numerically by dividing the surface area up into a finite set of elements called tesserae with a corresponding ASC q i , an area a i and a positionr r i . There are several implementations of the PCM [13] and in this study we focus on the conductor-like PCM (C-PCM) [2,3,4]. For high dielectric solvents such as water C-PCM yields nearly identical results to the more generally applicable integral-equation-formalism PCM (IEF-PCM) [14] but requires less computational resources.
For C-PCM the ASCs q are determined by solving the following matrix-equation where the matrix C has the elements and V is the potential of the solute in the solvent for each tessera i. The potential V(i) on tessera i is given as where A runs over all nuclei in the solute at positionr r A carrying a charge Z A . P is the density matrix of the solute and V mn (i) are the interaction integrals over basis functions on a tessera i given as For NDDO methods the right hand side of equation 4 is the interaction between a point charge on the surface (represented as s's' in the NDDO approach) and the basis functions of the solute molecule on atom A. The (s's'Dmn) integrals needed in equation 4 are listed in Table 1 for s and p functions. The integrals are rotated from a local ideal coordinate system onto the molecular coordinate system. The local coordinate system is defined by the distance between the atom A containing the basis functions mn and the tessera iR (s's'Dp s p s )~1 4 (s's'Dp p p p )~1 2 Here, D 1 and D 2 are empirical parameters describing chargeseparation for the multipoles. They are defined elsewhere. [15] Following Chudinov et al. [9] the density parameters a l are set to zero in this work and are therefore not shown in the equations.
The electrostatic interaction of the ASCs q on the surface and the molecule is treated by introducing the following one-electron contribution to the Fock matrix where Finally, the PCM electrostatic interaction free energy is calculated as Optimization of the molecular geometry in the PCM field requires the derivative of G with respect to an atomic coordinate the last term is computed analytically [16]. The derivative of the potential with respect to an atomic coordinate is done analytically and we give explicit expressions for all terms in Text S1.

Computational Details
The semi-empirical/PCM interface was implemented in a locally modified version of GAMESS [5]. The semi-empirical energy and gradient evaluations were allowed to run in parallel but no efforts were made to parallelize the integral evaluation or the assembly of the Fock matrix since the diagonalization is the major computational bottle-neck for large systems. The evaluation of the electrostatic potential (equation 3) and its derivative (equation 15) was parallelized. We note that the remaining semiempirical integral-derivatives in GAMESS is evaluated numerically.
We compared our implementation to that of Chudinov et al. for twenty smaller ammonium and oxonium type molecules used in that study. The structures were generated from their SMILES string (see Table 2 and Table 3) using Open Babel [17,18] and optimized in the gas phase and afterwards using the newly implemented code.

Electrostatic Solvation Free Energies
The electrostatic solvation free energies are presented in Tables 2 and 3 for ammonium and oxonium species calculated using PM3/PCM and compared to results published by Chudinov et al. [9] In general, our results underestimate the electrostatic solvation free energy by an average of 21.3 kcal mol {1 and 21.9 kcal mol {1 . The main source of the difference is likely the   [28] often referred to a D-PCM) while we use the C-PCM implementation. The solvation free energies from these implementations can differ by several kcal/ mol even for neutral molecules [29]. (While the reference describes a comparison of D-PCM to IEF-PCM, IEF-PCM and C-PCM yield nearly identical solvation free energies for water.) Another likely source of error is that we use the GEPOL-GB scheme where Chudinov et al. uses a more elaborate scheme to reach convergence of the solvation free energies by subdividing the surfaces incrementally.

Vibrational Frequencies
To test the numerical accuracy of the PCM gradients we optimized the molecules listed in Tables 2 and 3 Table 4 three of the geometry optimizations (A1, O1, and O2) do not converge. A1 can be made to converge by skipping the update of the empirical Hessian matrix (UPHESS = SKIP) but this does not appear to be a general solution to the problem. While some gradient components in these minimizations are quite large the optimizing algorithm eventually settles on a zero step size causing the optimization to effectively stall. The cause of this behavior is not clear since it is only observed for the smallest systems and was not investigated further. The resulting geometries still lead to a positive definite Hessian and the frequencies are not unusually different from the gas phase values.

. As indicated in
In four cases (A7, O4, O6, O8 and O9) the vibrational analyses yields imaginary frequencies between 26 and 200 cm {1 . In the case of O8 and O9 this also occurs for the RHF/STO-3G calculations and in the case of O7-O9 this also occurs for PM3 structures optimized in the gas phase. In most cases the imaginary frequency is associated with the O+ ion and a neighbouring methyl group. The most likely source of these imaginary frequencies is a flat PES associated with the O+ group combined with numerical inaccuracies in the PCM and PM3 gradients.

Timings
In Table 5 we show absolute timings for single point energy and gradient evaluations of proteins either in the gas phase, using DIIS to obtain convergence, or by including the PCM field either with or without SCF convergence acceleration. None of the listed proteins converged in the gas phase without DIIS and even then the SCF converged only for the three smallest proteins: Chignolin, Tryptophan-Cage and Crambine.
The cost of optimizing the wavefunction in PCM is between two (Crambine) and three (Chignolin and Tryptophan-cage) times more expensive than without. For Chignolin, which is the smallest protein in our test set, it took 21 SCF iterations to converge in PCM while only 13 for PCM/DIIS and 14 for PCM/SOSCF. The other proteins converged within 17 iterations without convergence acceleration and within 14 iterations with. For absolute timings regarding larger proteins, Crambine, Trypsin Inhibitor and Ubiquitin finished in 1293, 3455 and 6732 seconds with PCM without convergence accelleration, but are slower (1314, 3649 and 8777 seconds, respectively) with PCM and DIIS enabled. Using SOSCF did not result in an appreciable decrease in CPU time. The increase in CPU time when using DIIS is due to the extra matrix operations associated with this method, which represent the computational bottleneck for sem-empirical methods.
Evaluating the ASC potential derivative (equation 15) analytically has a negligible computational cost compared to evaluating the wavefunction as can be seen from the last column of Table 5.
The relative speedup from running in parallel in the gas phase is shown on Figure S1 where no improvement is observed beyond 4 cores (with a speed up factor of 3) and is not discussed further. The PM3/PCM timings (Figure 1) show better improvement when utilizing multiple cores for all systems. The smaller systems obtain some improvement (a factor 3.4 and 4.2 for Chignolin and Tryptophan-cage, respectively) whereas the larger systems sees improvements of 5.7, 5.7 and 5.9 for Crambine, Trypsin Inhibitor and Ubiquitin, respectively. In all cases maximum speed up is reached for 16 cores because the use of 24 cores introduces some communication overhead which degraded performance.

Conclusions
An interface between semi-empirical methods and the polarized continuum model (PCM) of solvation successfully implemented into GAMESS following the approach by Chudinov et al. [9] The interface includes energy gradients and is parallelized.
For very small systems we found some numerical instability problems in the gradient which caused geometry convergence failure, but geometry optimization appears robust for larger molecules. The use of PCM occasionally introduces imaginary frequencies in the Hessian analysis, but this was also found for RHF/STO-3G PCM calculations and even in a few semiempirical gas phase calculations so these problems do not appear to be specific to the to the current implementation. We therefore consider the current implementation a working code for all practical purposes, but welcome feedback from readers who encounter numerical stability problems for large molecules.
For semiemprical methods the most time CPU-intensive part of the calculation remains the solution of the SCF equations. This part of the code was already parallelized in GAMESS and we show, for the first time, that this implementation applies to semiempirical methods and the new PCM interface. For large molecules such as Ubiquitin a reasonable speedup (up to a factor of six) is observed for up to 16 cores.
It will be interesting to see how much the numerical stability and computational efficiency will improve once the interface is combined with the recently developed FIXSOL/FIXPVA2 method developed by Li and coworkers [30]. We are currently working on implementing the PM6 method in GAMESS to further increase the accuracy and range of application that this new interface offers.

Supporting Information
Text S1 Analytical derivative of the interaction potential. (PDF) Figure S1 Speedup by using multiple cores single gas phase gradient evaluation. (TIFF)