Transferable Coarse-Grained Potential for De Novo Protein Folding and Design

Protein folding and design are major biophysical problems, the solution of which would lead to important applications especially in medicine. Here we provide evidence of how a novel parametrization of the Caterpillar model may be used for both quantitative protein design and folding. With computer simulations it is shown that, for a large set of real protein structures, the model produces designed sequences with similar physical properties to the corresponding natural occurring sequences. The designed sequences require further experimental testing. For an independent set of proteins, previously used as benchmark, the correct folded structure of both the designed and the natural sequences is also demonstrated. The equilibrium folding properties are characterized by free energy calculations. The resulting free energy profiles not only are consistent among natural and designed proteins, but also show a remarkable precision when the folded structures are compared to the experimentally determined ones. Ultimately, the updated Caterpillar model is unique in the combination of its fundamental three features: its simplicity, its ability to produce natural foldable designed sequences, and its structure prediction precision. It is also remarkable that low frustration sequences can be obtained with such a simple and universal design procedure, and that the folding of natural proteins shows funnelled free energy landscapes without the need of any potentials based on the native structure.


Introduction
Computer simulations of the protein folding process have in the last ten years reached amazing level of description and accuracy [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]. The power of the computers and the understanding of the physics that governs folding allows now for a large screening of the experimental data for instance collected in the Protein Data Bank [17]. From a theoretical point of view a successful approach is the''minimal frustration principle'' (MFP) [18][19][20] in which protein folding is described as a downhill sliding process in a low frustration energy landscape (''funnelled'' shaped) towards the native state. While MFP has been proven for lattice heteropolymers [19,[21][22][23][24][25][26][27], in more realistic protein representations a residual frustration which prevents the systematic prediction of the native structure of natural sequences is often observed. Off-lattice instead MFP is used as a main justification for the use of structure-based potentials such as the GO [28] and elastic models [29]. In fact, there is still space for development of transferable models that are capable of systematic associating the experimentally determined native structure to natural sequence. Surprisingly, with the exception of few notable examples [30][31][32][33][34][35][36][37][38], it has also been extremely difficult to artificially construct sequences capable of folding into given target protein structures. The group of David Baker [38] introduced a novel procedure to select sequences with low frustration capable of correctly refolding in vitro to their target structure with a success rate between 8% and up to 40% of the total trials. In their work the authors have introduced a set of rules for the design of the local amino acids interactions to disfavour non-native states. After many iterations, a refolding calculation filters out about 90% of the initial sequences that are found not to have a funnelled energy landscape. The complexity of Baker's procedure demonstrates that is not easy to produce sequences with low frustration.
Here we present a novel protein model where low frustration folding is observed both for natural and designed sequences, the latter obtained without the need of negative design. The novel model is obtained from the optimization of the residue-residue and residue-solvent interaction energy terms under the condition that a large number of sequences designed for 125 test proteins are equal to the corresponding natural sequences. As a result, designed sequences with our model are for several properties comparable to natural ones and fold with a low frustration free energy landscape. We additionally demonstrated that for 15 additional randomly selected proteins, notoriously difficult to fold [39,40], the natural sequences correctly refolded to their corresponding native structures with a remarkable precision between 2.5 and 5 Å . In other words both quantitative protein design and folding are possible simultaneously. We anticipate that our methodology will have direct application for protein design and structure prediction, but also we expect that it will become a reference point for the development of alternative protein models. For instance, a more or less accurate description can be obtained by adding or removing details from our model, under the condition that the maximum valence principle remains satisfied. the case of proteins we have shown (Caterpillar model [43]) that the minimum set of directional interactions translates into the combination of just the backbone molecular geometry and the backbone hydrogen bond interactions (see Fig. 1).
In what follows we will show that by optimizing the interactions under the condition that natural and designed sequences are the same at constant amino acid composition for a large set of proteins, we will also quantitatively predict the folded structures of natural and designed sequences with similar accuracy. This is possible because our model includes the correct set of interactions that satisfy the MVP and, accordingly, the design procedure [43] alone is capable of predicting if a sequence, either natural or artificial, will fold to the target structure. Since we cannot model the particular evolutionary pressure that determined the natural amino acid composition, we chose to keep it constant. Such pressure could be due to many factors such as the particular function of the protein or the difficulty of synthesizing each amino acid type. The ansatz of this work is that folding and design can occur also outside such conditions and that is possible to design a foldable artificial protein from an infinite bath of amino acids. Hence, the above evolutionary pressure is taken into account by fixing the composition to the natural one. The optimization scheme that we used is the maximum entropy principle (MEP) already tested for proteins by Seno et al. [44]. MEP states that the more information is used to model a system the lower the associated entropy will be [45]. Hence, in order to find the optimal parameters that require the least amount of information, all is needed is to maximize the entropy associated with the probability of observing a given protein P(S i ,C j ), where S i indicates to the sequence and C j the three dimensional structure, under the sequence similarity and normalization constraints defined in work of Seno et al. [44]. The derivation follows closely the one used in the work of Seno et al. [44] (the full derivation is in the Supplemental Material together with the details about the model and simulations techniques) and we determined that the entropy maximum corresponds to the values of the model parameters (E, E HOH , and V in Eq. 1) at which the amino acid hydrophobic/hydrophilic (HP) profile [46] and the interaction energy of each residue with all other are simultaneously equal to the natural ones: where the index i runs over the N Seq designed sequences for each protein j of length N j , the E Sol is the hydrophobicity scale of each residue (see Eq.S3 in the SM), while the c i k 's are the contribution to the total energy of each residue calculated within the Caterpillar model. The last term instead guarantees that the Shannon entropy associated to the matrix elements E kl (H are the histograms) is maximized to avoid an uniform matrix. We phenomenologically determined E Shannon~8 :0 for the scaling term to be a good value. Note that here and in the following, energies are given in units of k B T Ref , where T Ref is a reference temperature that sets the scale of the interactions, hence all simulation temperatures are given in units of T Ref . It is important to stress that T Ref is not necessarily the folding temperature or the environment temperature, but all the energies can be rescaled to have T Ref matching the physical temperature. In fact, in what follows we will show that all proteins studied fold approximately at the same temperature, one could think to rescale the energies to set the folding temperature to the one observed in nature. A schematic representation of the algorithm is reported in Fig. 2.
To the best of our knowledge our work is the first of his kind to optimize the model parameters by reducing the differences between natural and designed sequences and, thanks to the MVP, is the simplest (in terms of the number of parameters needed) to successfully and quantitatively reproduce both sequences and structures of natural proteins to high precision.

Parameters Optimization
We began by selecting a protein training set from the Protein Data Bank (PDB) [17], which includes all the proteins that obey the following conditions: a X-Ray structural resolution below 1.5 Å , are made of single chains of length ranging from 20 to 200 residues, and do not contain any DNA or RNA. According to the stated conditions we selected 125 proteins (see Tab. S2 of the SM for the complete list of the PDB id's). It is important to stress that we did not select for specific experimental conditions, in particular pH and temperature during the measurements fluctuate significantly among the proteins in the set. In Fig. 3 and Fig. S3 we plot the comparison of the natural to the designed sequences, the latter obtained with the MEP optimized interaction parameters. The plot shows strong correlation (.0.9) between the total energy of the designed (abscissa) and natural (ordinate) sequences, and between the profiles of the residue the HP profiles and the energy contribution (  For a trial set of the E, E HOH , and V parameters and for each protein in the training set a large number (10 5 ) of sequences with composition fixed to the natural one are generated following the design scheme in the SM. The scoring function F score (Eq.S16 in the SM) is then evaluated and the trial parameters are accepted or reject according to a Metropolis like scheme. New parameter sets are generated at each iteration, and the sequences of the proteins in the training set are redesigned by 10 5 simple pair residue swapping moves, which are accepted or rejected according to a standard Metropolis algorithm with the energy defined in Eq.S4 (see SM). During each design iteration, the HP and energy profiles (Eq.S16 in the SM) are averaged over the observed sequences weighted by their Boltzmann weight. The averaging guarantees that the profiles are calculated over the most probable sequences that, as we showed previously [43], are robust against mutations and are more thermally stable. After ,10 8 iterations the interaction parameters converged to their final values: V~21:0+0:5 and E HOH~0 :015+0:001, and the residue-residue E interaction parameters which are listed in Tab. S1 of the SM. conclude that, for all 125 proteins in the training set, the designed proteins and natural proteins are equally compatible sequences to their respective target structures, strongly suggesting that our procedure may now be used to design realistic protein sequences. We applied the MEP derived parameters to design 15 randomly selected from independent training sets [39,40], and characterized by different secondary and tertiary motives. The top five resulting sequences for each target structures are listed in Tab. S4 of the SM. It is important to stress for this design we relaxed the constraint on the amino acid composition used during the optimization. Hence, the folding of the designed sequences does not depend on the previous knowledge of the natural amino acid composition, nevertheless the amino acids composition of the artificial sequences is similar to the natural one (see Tab. S3). It has to be said that the artificial sequences appear unusual with repeats of the same amino acid (e.g. for 1gab WDDMIIRRRRFVVYYLWGSMTAEVEAEKGTNGFYYHHHD-FGTKKKAQQQSNNL). Such repeats could be due to the approximations of the model, however it is important to remember that we did not include in the design any information about the function of the protein. In fact there is no reason to expect that natural sequences are the only one capable of folding, and we want to stress again that additional constraints applied during the design procedure would dramatically reduce the volume of the sequence space [47] reducing the probability of repeats. We believe the latter to be the main cause of the repeats and we tested this hypothesis forcing into the design procedure and additional constraint expressly rejecting mutations that would result into a repetition of the same amino acid for 3 residues forward and backwards. The resulting sequences are listed in Tab. S6 of the SM. We were surprised to find that sequences with energy comparable to the unconstrained ones had a lower number of permutations (log(N P )*107 instead of 108). Nevertheless, the now more reasonable looking sequences folded in a similar fashion with respect to their unconstrained alternatives (see Fig.S4). Finally, we will show below the model is capable of refolding also several natural sequences, demonstrating that in the model the presence of repeats is not necessary to stabilize natural protein structures. It is important to stress that during the last step of the MEP optimization the fewer sequences generated with fixed composition do not present the repeating patterns (see Table S5). However, this is an interesting problem and deserves a dedicated study that is beyond the objective of this work. Objective of ongoing research is also to experimentally test whether such sequences are capable of folding to the predicted target structures.

Protein design
In order to apply the model to the folding of both designed and natural sequences, we need to balance the residue energy term with the backbone hydrogen bond term (parameter a in Eq.S4 in the SM). The energies can be rescaled by choosing the value of a for which designed sequences fold best to their target structures [43]. Hence, we selected four designed sequences from Tab. S4 (PDB ids 2l09, 3mx7,chain A of 3obh, and 1qyp), and for each sequence we  Fig. S1 in the SM for details). The plot shows for each protein a funnelled profile with a global minimum very close to the respective target structure (DRMSD [ ½1:5{2:0 Å ). So at least below the folding temperature the proteins seems to follow a downhill process. This observation would need a verification with a study of the folding dynamics. The refolding free energy profiles shown in Fig. 4 prove that realistic protein sequences with low frustration folding free energy landscapes can now be designed with a straightforward positive design scheme.

Refolding of natural sequences
The next logical step is to asses the behaviour of the model when refolding natural sequences and prove that folding as well can be performed to a quantitative level with the model. For this we randomly selected 15 proteins known to be difficult to fold (from Tsai et al. [39] and from the 9th edition of the well known Critical Assessment of Techniques for Protein Structure Prediction [40]) and we performed folding simulations of their natural sequences. The results are plotted in Fig. 5(a) and 5(b), where we have superimposed all the computed free energy profiles. Although, the details of each profile might not be clearly visible, a first fundamental feature is apparent, namely the concentration of the free energy minima in the region between 1.5 and 2 Å DRMSD which remarkably is also the same regions observed for the design proteins. A second important result is the funnel shape common to all free energy profiles providing definite proof of the capability of the model of capturing the low frustration folding of natural proteins with a rather high precision. In fact when the predicted conformations of the folded states are compared to the experimentally determined structures, the two overlapped with a precision between 2.4 and 4.1 Å RMSD (see top inset of Fig. 5(a)) which is surprisingly accurate especially considering the simplicity of the model. An alternative comparison of the refolded structures to their native targets is reported in Tab. S7 produced with the ''MaxCluster'' program from Alex Herbert (http://www.sbg.bio.ic.ac.uk/maxcluster/index.html) and the TM-scoring function introduced by Zhang et al. [48,49]. It is important to note that the configurations with the lowest energy are not necessarily equal to the ones corresponding to the minimum of the free energy, however, in most cases, they are very similar. This is due to the strong directional nature of the hydrogen bonds which makes them very sensitive to thermal fluctuations. As a consequence, there are isolated structures that might have a lower energy but are not very stable at finite temperature. We would like to stress that, since the native sequences fold in a similar fashion compared to the designed ones and we did not observe the native sequences themselves as an outcome of the design process (see Tab. S4-S6), we could speculate that, at least within the Caterpillar model, the space of folding sequences is much wider than the one comprising only of natural sequences.

Discussion
To the best of our knowledge our coarse-grained protein model is the simplest, in terms of the number of parameters needed, with a transferable energy function capable of achieving such precision for the prediction of the native folded structures. Also it is one of the very few models that allows for both quantitative proteins design and folding, the latter demonstrated by free energy calculations. It is remarkable that low frustration sequences can be obtained with such a simple and universal design procedure, and that the folding of natural proteins shows funnelled free energy landscapes without the need of any potentials based on the native structure [50].
Although, the artificial sequences present some unnatural features like repetitions of some amino acids, the sequences designed with a natural amino acid composition share many features with the natural occurring ones, and the native structures of the latter are correctly predicted by our model. Hence, we expect that our designed proteins (see Tab. S4), once synthesized, may fold to the structures used as design targets, which may also represent the ultimate and most important test of our methodology. We hope that our methodology will become an useful tool in experiments requiring alterations of natural proteins, or the total redesign of target protein structures. Of course, constraints on the composition can always be applied to the design procedure with no major changes in the procedure. Moreover, the prediction power of the model gives us high confidence that our design methodology may be directly used to tackle important open problems of drug design, or used in a multi-scale approach where the results from our model could be refined with a more accurate but also a computationally more expensive protein model.
Finally, this work not only extends our previous results obtained with the Caterpillar model, but also strengthens the connection among all our work on lattice heteropolymers and protein unrelated systems such as patchy polymers [41,51]. The success of the same design strategy for all these systems demonstrates that the maximum valence principle is a sufficient condition to satisfy for the generalized design of low frustration sequences and the prediction of their proper native state. The profiles have a common funnel shape and show a clustering of the free energy minima in the region 1.5 and 2 Å DRMSD consistent with the results obtained for designed sequences. In b) we plot the free energies for proteins with the worst (2ptl) and the best (3nmd-E) distance of the folded structure from the native one. For the latter the free energy profile shows a minimum remarkably close to the native state probably due to the highly simplified structure of protein 3nmd-E. The minimum of 2ptl, on the other hand, is located further away from the low DRMSD values than the other proteins. This apparent discrepancy is due to the definition of the DRMSD which includes the contribution from the C a atoms located in the long unstructured tail from the residue 1 to 18. Since the probability of observing that particular conformation in solution is very low, it follows that the particular realization of the native structure has a large entropy penalty. However if we measure the overlap ignoring the contribution from the tail we see that the predicted structure of the protein core is again reasonably close to the experimentally determined one (<5.2 Å RMSD). In the insets we compare the experimental structures (in yellow) superimposed to the equilibrium configurations (in red), and we show that the proteins refolded with a precision between 2.4 and 4.1 Å RMSD.    )   Table S3. Comparison of the average composition of the designed sequences and the natural sequences used in the parameter optimization. It is important that since we do not model Cys-Cys bond and the Proline rigid bond we have excluded them from the design alphabet. This is why the frequency associated to those amino acids is zero in the designed sequences. We are currently working on implementing such special cases in the Caterpillar model. We have highlighted in bold the amino acids types with the largest discrepancies namely: Histidine, Methionine, Tryptophan, Tyrosine. Such amino acids are know to be the one with the lowest appearance frequency in nature. Since we did not impose any restriction on the design procedure over the relative abundance of amino acids in nature it is not surprising to find the largest discrepancies in the composition for such amino acids. doi:10.1371/journal.pone.0112852.S007 (PDF)  Table S5. Sequences obtained during the last step of the matrix optimization procedure. The amino acid composition is identical for all sequences and the first is the natural sequences taken from the pdb file. doi:10.1371/journal.pone.0112852.S009 (PDF) Table S6. Designed sequences under the additional constraint that local repetition of up to 5 residues are forbidden. doi:10.1371/journal.pone.0112852.S010 (PDF )   Table S7. Summary of the refolded structures with the natural sequences. The DRMSD value is taken form the minimum of the folding free energy (see Fig. 5), while the Overlap, the gRMSD and the TM-score are calculated using the Max Cluster program from Alex Herbert (http://www.sbg.bio.ic.ac.uk/maxcluster/ index.html). The Overlap is a measure of percentage of matched structural elements between the native and the refolded structures. The gRMSD and TMscore are calculated over the overlapping structural elements. TM-score was defined by Zhang et al. [48,49]. The low overlapping value for 2ptl and 1vif are due to the unstructured sections of the proteins. doi:10.1371/journal.pone.0112852.S011 (PDF) Text S1. Supplemental Material containing the details about the model and simulations techniques together with the derivation of the scoring function with the Maximum Entropy Principle. doi:10.1371/journal.pone.0112852.S012 (PDF)