Validating a Coarse-Grained Potential Energy Function through Protein Loop Modelling

Coarse-grained (CG) methods for sampling protein conformational space have the potential to increase computational efficiency by reducing the degrees of freedom. The gain in computational efficiency of CG methods often comes at the expense of non-protein like local conformational features. This could cause problems when transitioning to full atom models in a hierarchical framework. Here, a CG potential energy function was validated by applying it to the problem of loop prediction. A novel method to sample the conformational space of backbone atoms was benchmarked using a standard test set consisting of 351 distinct loops. This method used a sequence-independent CG potential energy function representing the protein using -carbon positions only and sampling conformations with a Monte Carlo simulated annealing based protocol. Backbone atoms were added using a method previously described and then gradient minimised in the Rosetta force field. Despite the CG potential energy function being sequence-independent, the method performed similarly to methods that explicitly use either fragments of known protein backbones with similar sequences or residue-specific /-maps to restrict the search space. The method was also able to predict with sub-Angstrom accuracy two out of seven loops from recently solved crystal structures of proteins with low sequence and structure similarity to previously deposited structures in the PDB. The ability to sample realistic loop conformations directly from a potential energy function enables the incorporation of additional geometric restraints and the use of more advanced sampling methods in a way that is not possible to do easily with fragment replacement methods and also enable multi-scale simulations for protein design and protein structure prediction. These restraints could be derived from experimental data or could be design restraints in the case of computational protein design. C++ source code is available for download from http://www.sbg.bio.ic.ac.uk/phyre2/PD2/.


Introduction
The prediction of protein structure to atomic level resolution and the design of de novo proteins with large scale backbone sampling are largely unsolved problems although there has been a great deal of progress in recent years. Both problems require the ability to rapidly sample a large number of backbone conformations. Sampling protein conformational space using full atom models can be prohibitively computationally expensive so a variety of different approaches have been developed to reduce the search space. This can be achieved by using coarse-grained (CG) protein models, by assembling backbone models from short fragments taken from known protein structures or by a combination of both of these methods.
Coarse-grained models have been increasingly used for modelling large biomolecules over long time scales due to the computational efficiency provided by these methods [1][2][3]. These models vary in the degree of coarse-graining with some models representing multiple amino acid residues with one interaction centre [4], some representing each amino acid residue with a small number of interaction centres [5][6][7][8][9][10][11][12][13], and others that are intermediate between minimal and full atom models [14][15][16]. Potential energy functions for CG models have been most commonly derived using statistics from from the Protein Data Bank (PDB) together with a suitable reference state [2]. Potential energy functions derived this way are known as knowledge-based or statistical potentials. It is also possible to derive CG potential energy functions from physical principles [17].
While CG models in the past were mostly used as toy models to study the general principles of protein folding [18,19] they are now becoming sufficiently accurate and transferable to be used for more directly useful applications. For example, CG models are widely and successfully used in protein structure prediction methods with both lattice models [6,8] and off-lattice methods [20][21][22]. CG models coupled with fragment replacement methods have been particularly successful. Backbone fragments are generally assembled in a Monte Carlo based procedure to assemble a new overall fold. As well as reducing the search space, these methods also have the advantage of guaranteeing models that have protein-like local conformational features. When these techniques are used for modelling loops, a loop closure method is required to ensure that the end of the loops connect the anchor residues in a geometrically correct way. Another disadvantage is that it is not easy to sample conformations using fragment replacement with additional restraints that could come from experimental information or for protein design applications. Fragment replacements are inherently non-local and highly disruptive moves so acceptance rates can be very low with additional restraints. It is also harder to use more advanced sampling techniques such as metadynamics [23] or umbrella sampling [24] as fragment replacement violates detailed balance in most common implementations [25] and this would be even more difficult when coupled with loop closure methods as is necessary in loop modelling. The ability to sample loop conformations with protein-like local structural features directly from a CG potential energy function could be one way of avoiding these problems.
The accuracy of full-atom reconstruction depends on the level of coarse-graining [16]. A number of methods have been developed to rapidly reconstruct mainchain atoms from C a atoms [26][27][28][29]. Sidechains can then be added to the mainchain using fast rotamer-based methods [30,31]. When transitioning between CG and full atom models it is important to retain good model structure quality. However, even in many full atom molecular mechanics force fields the modelling of backbone torsion angles has been problematic but recently efforts have been made to address this [32,33]. A key feature of the C a CG potential used in this study is its emphasis on protein-like local structure [11].
For most protein sequences, experimentally determined structures of homologous sequences are available and can be used as templates for accurate modelling [34,35]. These homology models often have missing sections of the peptide chain where new residues have been inserted during the course of evolution. In these cases these loops will need to be predicted using de novo methods. Loop modelling is also important for computational protein design applications where the backbone structure needs to be redesigned in order to satify some functional constraints [36][37][38]. Loop modelling presents a rigorous and stringent test of de novo structure prediction methods due to their high degree of structural variability and a weaker sequence-structure relationship compared to secondary structure elements. While many loop prediction methods have been previously described [39][40][41][42][43][44][45][46][47][48][49][50], there is only one study on the use of C a CG models for loop prediction without the use of backbone fragments from known protein structures [51].
In this paper we validate a previously developed sequenceindependent CG potential energy function [11] by comparing its performance to some existing fragment and loop closure based methods. Full atom models are constructed from the sampled CG models, gradient minimised in a full atom potential energy function. The lowest energy structures were found to predict loop conformations surprisingly well with a high proportion of sub-Angstrom RMSD predictions.

Results
The method presented here was compared to RAPPER [41] and FALCm4 [52] as the use of the same test set enables a direct comparison using the same metrics presented in those papers. RAPPER was taken as representative of methods that use a dihedral angle build-up method while FALCm4 was taken as representative of fragment replacement methods. The aim of this work was to determine whether it was possible to sample loops within the radius of convergence of full-atom refinement methods using a coarse-grained C a model.

Loop Conformational Sampling
The loop prediction benchmark test proposed by Fiser et al [39] and filtered by DePristo et al [41] was used to assess the performance of the loop modelling methodology. This set contains loop targets of two to twelve residues in length. For each target, 4000 backbone loop conformations were sampled using a simulated annealing protocol (see Methods) using the potential energy function described in equation (1). As an additional control a further set of 4000 backbone loop conformations were sampled for each target where only the E bond (C a -C a pseudo-bond term) and E bump (C a -C a steric repulsive term) terms were included. This was carried out in order to determine the degree to which the other terms in the potential energy function were enhancing conformational sampling and is referred to as the ''control'' in the following text. The PD2 method introduced in this paper ensures that loops are always fully closed and the anchor residues are never moved. This is not always the case with the other loop sampling methods [50].
The RAPPER [41] and FALCm4 [52] methods were benchmarked using the same test set used in this paper. In both of these methods 1000 loop conformations were sampled rather than 4000 in this paper. In order to allow direct comparison with the results produced by RAPPER and FALCm4, 1000 loop conformations were resampled from the 4000 generated loops to estimate comparable statistics using the R statistical package ''boot'' to carry out a stratified bootstrap with 1000 replicates ( Figure 1A). All RMSD-G values were calculated using the backbone heavy atoms N, CA, C, and O without superposition as defined by DePristo et al [41]. The best RMSD-G values were comparable to the RAPPER and FALCm4 methods and significantly better than the control ( Figure 1A and Table S1 in File S1). Ensemble RMSD-G values were similar to FALCm4 but lower than for RAPPER ( Figure 1B and Table S2 in File S1). Interestingly, a higher proportion of the PD2 loop ensemble lay below the 1 Å and 2 Å RMSD-G than both RAPPER and the control ( Figure 1C and 1D, Tables S3 and S4 in File S1). This shows that near native loops were frequently sampled and could enhance the chance of selecting the correct conformation. At this stage no sequence information was incorporated into the PD2 loop sampling method but it still appeared competitive with methods that did include this information. RAPPER samples residue dependent discrete Ramachandran angles while FALCm4 is a fragment replacement-based method that selects fragments based on sequence similarity.

All-atom Structure Refinement and Model Selection
Sidechains were added to the generated backbone loops using the default Rosetta simulated annealing repacking algorithm and the whole loop (including the backbone) was then gradient minimised in a iterative manner as described in the Methods. The lowest energy loop was taken as the prediction. The results were comparable to existing methods and in some cases better (Table 1). Overall the method successfully predicted to sub-Angstrom accuracy 196 out of 351 loops in the test set (examples shown in Figures 2 and 3). In comparison, the control sampling method predicted 91 out of 351 loops in the test set to sub-Angstom accuracy and most of these were the short loops. Of the 174 loops of 8 residues or longer, 48 were predicted to sub-Angstom accuracy but none in the control. This indicates that while sequence independent coarse-grained statistical potential was significantly improving conformational sampling, the control method can successfully sample sub-Angstrom conformations only in the short loops where extensive search is possible. Previous studies have shown that exhaustive conformational searching taking into account crystal contacts together with a good all atom energy function can produce extremely good results [42]. However, this approach does not scale well, can take days of computational time to run and does not seem to work well on all loop test sets [50].
As a measure of backbone structure quality, the Ramachandran distribution was calculated for all generated loop decoys ( Figure 4). Most features of the Ramachandran were reproduced in the loop decoys however there is still room for improvement. The dihedral angle distribution of the generated backbones is a function of both the C a atom positions and of the method used to rebuild the mainchain atoms from the C a positions.

Loops from Recently Deposited Structures
Predictions were carried out on a new loop set taken from recently desposited structures with low sequence or structural similarity to solved structures deposited in the PDB before April 2012 (see Methods; Tables 2 and S5 in File S1; Figure 5). Of these seven loops, sub-Angstom conformations were sampled for 5 loops but no sub-Angstom conformations were sampled by the control. Two sub-Angstrom predictions were made but none were made for the control method. The CG potential energy function appears to be consistently sampling lower RMSD and lower energy loop conformations for both the original test set and the new test set.

Discussion
We have shown that CG sampling techniques have the potential to be viable methods for atomic resolution loop prediction. This could be further improved with more advanced sampling techniques such as metadynamics [23] and incorporating sequence-dependent terms in the CG potential energy function. As loops are sampled from a potential energy function it would be possible to include extra restraints from experimental data or from contact prediction [53]. The CG potential energy function used in this work was initially developed as a method for the design of de novo backbone scaffolds [11]. The results of the paper confirms that it is sampling protein-like loop conformations more frequently than the control and that it works surprisingly well despite the sequence-independent nature of the energy function. It would be possible to incorporate functional geometric constraints as part of a computational protein design pipeline with large scale backbone sampling.
A high proportion of the sampled loop ensemble appears to be close to the native conformation ( Figure 1 and Table S4 in File S1). This suggests that CG loops could be clustered prior to fullatom refinement in order to save time. It is also notable that the minimised native loop almost always has the lowest Rosetta energy (Figures S1 to S10 in File S1). This supports a previous observation that the main bottleneck in de novo protein structure prediction appears to be conformational sampling [54]. This work suggests that CG models of protein structures as part of a hierarchical approach can achieve atomic level accuracy.

C a Potential Energy Function
The a-carbon potential energy function used a sub-set of terms from a previously described potential energy function [11] that is derived using a 27-''letter'' structural alphabet [55]. This was composed of 5 terms (1).
Where E local was composed of harmonic pseudo bond angle, dihedral terms and reference energies which vary as a function of their structural alphabet classification, E bond was a pseudo bond term, E bump was a soft steric repulsive term, and E hbond was a pseudo hydrogen bonding statistical potential term using pseudo N and O atoms as defined by Levitt [5]. The E local reference energy  terms are parameterised such that the equilibrium distributions of each structural alphabet ''letter'' and each pair of consecutive ''letters'' reproduces that observed in the PDB (protein data bank) [11].  Figure 5. Backbone RMSD-G vs. Rosetta energy scatter plots for loops taken from a newly deposited set of structures ( The move set consists of crankshaft moves (analogous to backbone backrub moves), bond moves where two a-carbon atoms are moved by equal amounts in opposite directions along the bond vector and angle moves where two outside a-carbon atoms are rotated by equal and opposite amounts such that the bond angle is changed and the rotation axis is normal to the plane defined by the three a-carbon atoms. All three of these move types are local moves that do not propagate along the whole chain.

Backbone Potential Energy Function
A backbone potential energy function was used for conjugate gradient minimisation after rough backbone atom positions were added to a-carbon models using a previously described method [27] in order to regularise the backbone stereochemistry. This included bond angle, bond length, torsion, improper torsion, 1-4 Lennard-Jones and 1-5 Lennard-Jones terms taken directly from the OPLS-UA force field [56], a soft steric repulsive term to prevent backbone clashes (as described in [11]) and reimplementation of the Rosetta backbone-backbone hydrogen bonding statistical potential [57].
Ensemble Generation Initial loop a-carbon positions were generated by linear interpolation between the fixed anchor a-carbon with the addition of a small random vector displacement followed by conjugate gradient minimisation using only the E bond and E bump terms from (1). The initial positions were then relaxed in the full a-carbon potential using Monte Carlo simulated annealing for a total of 12000 steps. Conformations were generated by an inner cycle of 400 simulated annealing steps at the a-carbon level followed by the addition of initial backbone positions by a fast look-up method [27]. The annealing schedule consisted of 50 steps with b~0:2, 100 steps with b linearly increasing from b~0:2 to b~1:2 and finally 250 steps with b~1:2 (where b~1 kBT ). These conformations were accepted and then minimised in the backbone potential energy function if the number of residues in the loop with w/y dihedral angles that lay in strictly forbidden regions of the nonresidue specific Ramachandran plot, n forb , was ƒ maxf0:1|length,n lowest forb g, where n lowest forb was the lowest previously accepted value of n forb . This was designed to prevent the algorithm getting stuck with no acceptable loops. At this stage 47% of generated loops were rejected.

Gradient Minimisation and Selection with the Rosetta Energy Function
Sidechains were added using the default Rosetta simulated annealing repacking algorithm and the loop atoms gradient minimised in the Rosetta all atom potential energy function using a PyRosetta [58] script consisting of 15 outer cycles and 4 inner cycles. Each of the 4 inner cycles consisted of sidechain repacking followed by gradient minimisation. The weight of the repulsive term of the Lennard-Jones energy was gradually ramped up during the 4 inner cycles in the order 0.02, 0.25, 0.550 and finally 1.0. This was designed to replicate the Rosetta Fast Relax protocol [59]. The backbone and sidechains of the rest of the protein were kept fixed in their experimentally observed positions and the lowest energy structure generated during the protocol was retained. The lowest energy loop decoy was selected as the final prediction.

Selection of New Loops from Recently Deposited Structures
Protein structures solved after April 2012 with novel folds were determined using a hierarchical approach based first on sequence similarity and then on structural similarity. First, the sequences of all structures solved after this date and greater than 20 amino acids in length (10,239) were BLASTed [60] against all PDB sequences deposited before this date. Any matches with a reported BLAST E-value v10 {4 were removed as clear homologues leaving 1350 sequences. The corresponding structures of these 1350 sequences were then structurally compared to a representative set (pairwise sequence identity v30%) of the PDB taken from the PISCES [61] server with a date before April 2012 using MAMMOTH [62]. Any protein with a MAMMOTH hit with an E-value v10 {2 were discarded as structurally similar to an earlier deposited structure, leaving 361 proteins with potentially new folds. Many of these proteins were short (v100 residues) suggesting that they may not constitute a genuine fold. After removing any structures v100 residues, this left 24 potential structures. Finally, of these 24 structures, any that were not high resolution crystal structures (ƒ2:1Å), contained chain breaks/missing residues, or had no loops in the range 8-12 were removed leaving a final set of 4 structures and 7 loops (Table 2). Loops were determined as contiguous sections of coil or turn as defined by STRIDE [63].

Supporting Information
File S1 Supporting figures and tables. (PDF)