Predicting Peptide Structures in Native Proteins from Physical Simulations of Fragments

It has long been proposed that much of the information encoding how a protein folds is contained locally in the peptide chain. Here we present a large-scale simulation study designed to examine the extent to which conformations of peptide fragments in water predict native conformations in proteins. We perform replica exchange molecular dynamics (REMD) simulations of 872 8-mer, 12-mer, and 16-mer peptide fragments from 13 proteins using the AMBER 96 force field and the OBC implicit solvent model. To analyze the simulations, we compute various contact-based metrics, such as contact probability, and then apply Bayesian classifier methods to infer which metastable contacts are likely to be native vs. non-native. We find that a simple measure, the observed contact probability, is largely more predictive of a peptide's native structure in the protein than combinations of metrics or multi-body components. Our best classification model is a logistic regression model that can achieve up to 63% correct classifications for 8-mers, 71% for 12-mers, and 76% for 16-mers. We validate these results on fragments of a protein outside our training set. We conclude that local structure provides information to solve some but not all of the conformational search problem. These results help improve our understanding of folding mechanisms, and have implications for improving physics-based conformational sampling and structure prediction using all-atom molecular simulations.


Sequence coverage of fragment simulations
The 8-mer, 12-mer, and 16-mer fragment simulations cover 100%, 88.7%, and 76.7% of the entire sequence space of the 13 proteins considered, respectively. Sequence spans for 8-mers, 12-mers and 16-mers are shown in Figures 1, 2 and 3.   2 Classification success for the best logistic regression models Table 1 shows the success of classifying native and non-native contacts given by our bestparameterized logistic regression models. The best classification successes of models with fewer included terms are also shown, for comparison.  3 Testing the performance of the best logistic regression models against random null distributions Recall the model-selection procedure we used: To judge the quality of each model, we used an accuracy-based measure, where a is the fraction of incorrectly-classified native contacts, and b is the fraction of incorrectly-classified non-native contacts. The Q quantity was used to forward-select models with increasing numbers of logistic regression parameters.
There is complicating issue when using accuracy-based measures to select models. In the case where the data contains many more non-native contacts than native contacts (or vice versa), a high classification accuracy may not reflect a significant improvement over a random null distribution, per se. To test this possibility for our selected models, we built a null distribution of contact metrics to test the random-case performance of our models.
Because there are correlations between contact metrics due to chain connectivity, considerable care was taken to construct null distributions for contact metrics that preserved these correlations. We did this by constructing the null distribution on a fragment-by-fragment basis. For each fragment, the values of the contact metrics were retained, while the assignment of native and non-native contacts was randomized according to a per-fragment bootstrapping procedure. For each fragment, a random contact map was drawn (with replacement) from the full data set. This reassignment procedure, across the entire set of fragments, was repeated 1000 times to construct a distribution of random-case realizations.
We next performed several tests to compare the performance of our selected models on the actual data versus the random null data, which we will describe below. Part (A) of each figure shows the classification success for native contacts given by the model, for both random null data and the actual data. Part (B) of the figures shows the same, for non-native contacts. The green line shows a distribution of classification successes, across all 1000 realizations of null set data, whereas the blue line simply marks the classification success achieved when the classification model is applied to the actual data. In general, non-native contacts are predicted with more significance than native contacts, a trend which increases with chain length and the number of metrics used to train the logistic regression model. This trend is due (at least in part) to the fact that the ratio of non-native to native contacts increases with chain length.
native classification successes, across all 1000 realizations of null set data. The star marks the classification successes when the model is applied to actual data. The contour plot is an interpolation made using the program Mathematica. In all cases, the p-values for joint classification success is less than 0.001 (which is the limit of what we can say given 1000 null realizations).
Part (D) of each figure shows the model quality, Q, for the selected models across a range of classification thresholds. The blue line shows the model quality when applied to the actual data. The green line shows the average model quality across the 1000 random null realizations, as a function of classification threshold. The error bars above and below the green line show the standard deviation across random null realizations, for each classification threshold.
Part (E) of each figure is similar in form to Part (D), but instead shows the Matthews Correlation Coefficient (MCC) as a function of classification threshold (1). The MCC is a quantity used to characterize the quality of a binary classification from the full "confusion matrix" containing the numbers of true positives (T P ), false positives (F P ), true negatives (T N ) and false negatives (F N ). It is been shown to be a more balanced measure of classification accuracy when the classes are of very different size. The MCC is defined as: The results shown in parts (D) and (E) show that our chosen regression models have classification accuracies better than random. It is interesting to note that the models are somewhat predictive even when applied to our random null data. This may be partly due to the mismatch in the numbers of native and non-native contacts, but it may also be due to correlations present in the data independent of sequence. For instance, native contacts in general have small sequence separation, which may correlate with the propensity of a random polymer chain to make local contacts more frequently.
Part (F) of each graph shows a graph of the true positive rate versus the false positive rate as the classification threshold is varied. This is otherwise known as a receiver operating characteristic (ROC) curve (1). In the case where a model has no power to discriminate whatsoever between native and non-native contacts, the ROC curve is a straight diagonal, shown for reference as a red line. The more discriminative power a model has, the more the ROC curve should be off the diagonal, which is what we see in this graph. For all models, the ROC curve for the best classification models applied to the actual data (blue) is farther from the diagonal than the ROC curve for the random null data (green). The null data ROC curve shown is the average across all 1000 null distribution realizations.
Part (G) of each graph shows, for reference, the number of true positives (T P ), false positives (F P ), true negatives (T N ) and false negatives (F N ) for each classification model, applied to the true set of actual contact metrics. (Note that the total number of contact metrics contained in the set may be larger than the number of contacts in the full set of proteins, because the set includes contact metrics from all the fragment simulations, many of which contain overlapping protein sequences).    Fig. 6. The performance of the best one-term 16-mer classification model on actual data versus random null data (see description above for details).  Fig. 7. The performance of the best two-term 16-mer classification model on actual data versus random null data (see description above for details).  Fig. 8. The performance of the best three-term 16-mer classification model on actual data versus random null data (see description above for details).  Fig. 9. The performance of the best (four-term) 16-mer classification model on actual data versus random null data (see description above for details).

Regression models trained on local and non-local contacts only
Can we achieve better logistic regression models to classify contacts as native or non-native, if we train separately on the data for local contacts (|i − j| ≤ 4) and non-local contacts (|i − j| ≥ 5)? The training data was divided into two groups, one with contacts having a sequence separation of 3 or 4 residues, and another with sequence separations of 5 residues or more. The best logistic regressions for the local and nonlocal data are shown in Tables 2  and 3, respectively, and the model relevances (R) for each metric are shown in Figures 10, respectively. Overall, the classification success for the local-only or nonlocal-only data was comparable, but never as high as the classification success using the combined data. Contact probabilities remain the most important metric in the local-only and nonlocal-only regressions, although for local contacts, the model relevance (R) values for the CPROB metric was smaller in the 8-mer and 12-mer regression models. This is probably because local contacts are easily sampled by the chain regardless of sequence, which adds noise to the classification problem.

Benchmark simulations of 16-mer fragments
One of the challenges in the ZAM protocol (2) is sampling a diverse population of conformations using short fragment simulations. To explore different possible topologies, harmonic contact restraints are added to 16-mer simulations to bias the sampling. At this stage, it is particularly critical to sample beta-hairpin structures, which generally take longer to form in short simulations because of their nonlocal topology.
Here, as a test of our sampling methodology, we simulate a set of peptide fragments shown to sample beta-hairpins during the course of the ZAM algorithm for up to 100 ns. Additionally, we simulate 16-mer fragments of a protein outside our training set (PDB code: 1whz), with several different combinations of contact restraints.
There are two issues we try to address with these simulations. The first is to test whether convergence can be achieved with longer simulation times, and if so, whether this can better discriminate between native and non-native hairpin decoys. We find that longer simulation times do not necessarily result in better convergence to native structures, validating our use short fragment simulations. This also suggests that the accuracy of our simulation results is limited more by our forcefield potential than by sampling.
The second issue we address is to test whether the harmonic restraints added at the 16-mer stage can yield better sampling of hairpins, and if this is the case, whether hairpin restraints overly bias the sampling. By comparing across simulations with different sets of restraints, we find that harmonic restraints can help to sample a diversity of conformations in short REMD simulations, and that our prediction results are robust to perturbations from the restraint potentials.

Molecular dynamics simulation.
The AMBER ff96 force field (3) with the solvation model of Onufriev, Bashford, and Case (4) was used to perform replica exchange molecular dynamics (REMD) simulations (5). Each simulation was 100 ns in length, with ranging from temperatures 270-700K. Clustering was done to both reduce the amount of data to process, and to generate good representatives of conformational basins predicted by the forcefield. A set of 10 or less representative conformations, clustered to ∼2Å RMSD by a modified K-means algorithm, is extracted from the (lowest-temperature) data and used for the starting configurations of future rounds of simulation.
The ZAM (Zipping and Assembly Method) protocol for simulating fragments is as previous described in (2; 6). In the early "growth" stage of ZAM, we rely on short molecular dynamics sampling of 8-mer peptide fragments which are then grown to 12-mers for further simulation. At the 16-mer stage, several alternative topologies for the cluster conformations extracted from the 12-mer simulations are explored by adding harmonic contact restraints (with a force constant of 0.5 kcal/mol/Å 2 ) to the residue sidechain centroids.
20 simulations of beta-hairpin systems were performed for 8 peptide fragments taken from 7 proteins, representing 40,000 ns of total simulation, or about 11 CPU-years (assuming 10 ns/day per processor). A summary of the fragments whose native states are hairpins is described in Table 4. A summary of the "decoy" fragments, whose native states are not hairpins, is described in Table 5. The native structures of the 8 fragments simulated here are shown in Figure 11. The decoys are either amphipathic helices or helix-turn-helix motifs.

Results
Here we briefly summarize the main findings observed in our hairpins simulations. We note that most of these features are also observed in our test set of peptide fragment simulations. Figures 19 through 38 show the snapshots of conformations sampled over time for each fragment simulation. The coloring convention is the same as in Figure 11. Each graph shows the conformation clusters and their populations in increments of 10 ns, for the lowest temperature of the simulations. The last nanosecond of each 10 ns-increment was used to generate the conformation clusters. shown in the plot. Note that in many cases, these structures appear to fluctuate quite a bit. This is due to a combination of the intrinsic conformational fluctuations seen in any one replica, and the temperature-swapping done in replica exchange molecular dynamics (REMD).

Native hairpin structures are better sampled when the fragment is centered around the beta-turn.
This may be because hairpins out of the larger tertiary context will not tolerate unpaired ends, and instead prefer to make either ionic (salt-bridge) interactions, or hydrogen-bonded states like helices. This effect may be further magnified by the implicit solvent model used, which favors compact states. The T0340 simulations, having the native turn slightly offcenter in the sequence, are a good example of this. The conformations sampled in the first 10 ns of restrained and unrestrained simulations are loosely defined by side chain interactions, although some beta-hairpins are sampled with high probability in simulations with (S26,I32) restrained.
2. An (i, i + 5) restraint works well to bias the sampling toward beta-hairpins. Figure 12 are contact maps showing contact probabilities across all 16-mer simulations, grouped by the sequence separation of restrained residues. It is interesting to note that similar conditional probabilities of observing hairpins is observed in the statistics of native protein structures. From a set of 3465 protein structures taken randomly the SCOP database (7) (1 structure, or 2 if existing, from each unique SCOP class), we computed the conditional probabilities of contacts in protein structures, given that an (i, i + 5) contact was already present. The results show a high conditional probability of hairpin-like conformations, regardless of sequence. This is consistent with previous work in polymer models showing how this effect can arise from intrachain excluded volume (8).

Shown in
Restraints help in sampling beta-hairpins at early times.
Do longer simulation times help fragments converge to native-like structures?.
In general, no. There are several reasons why this may be the case: (1) simulations longer than the 100 ns performed here would be needed, (2) the physical model we used is not perfect, or (3) tertiary context is needed to drive them into their native states. While our simulations do not address the tertiary context, we do observe anecdotal evidence of both (1) and (2). Of course, without complete convergence, it is difficult to assess the quality of our forcefield, other than by the results of our predictions (described in the main text). For many of our hairpin simulations, there exists quite a bit of variability in the lowesttemperature conformations, and/or population shifts on time scales comparable to the length of the simulation. The simulation results also have features consistent with well-known inaccuracies of Generalized Born models (10; 11? ), including possible over-stabilization of helices for hairpin fragments with known native states (Figure 19), and some problems with overly stable salt-bridge formation.
Overall, the variability of conformations sampled in these simulations at early times, along with the directed sampling achieved with contact restraints (Figures 20, 22, 23), is further validation that 10 ns of REMD simulation is sufficient for obtaining good sampling.
16-mer fragment simulations are robust to restraint perturbations.
To test whether contact prediction scores are robust with respect to restraints, we compared the results of our usual ZAM protocol of using restrained REMD simulation against 1) the results of simulations that had no restraints, and 2) the results of simulations with competing sets of restraints which were allowed to exchange between replicas. We find that these different protocols have little effect on the outcome of the simulations, as shown by the similarity in prediction scores ( Figure 13).

Conformation scores for 8-mer and 12-mer fragment simualtions
We compute a conformation score, C, for a given molecular conformation as follows: Here, i runs over all contacts in the conformation, and j runs over all fragment simulations which contain contact i. We computed conformation scores for all the cluster conformations extracted from 8-mer, 12mer, and 16-mer 1whz fragment simulations. For 8-mers and 12-mers, we observe a correlation (albeit noisy) between a high value of C and a near-native (low-RMSD) structure. Figure  14 shows the RMSD-to-native plotted versus prediction scores for all cluster conformations extracted from 8-mer and 12-mer fragment simulations of 1whz. The plots are reasonably funnel-shaped, that is, the higher the prediction score, the closer that conformation is to the native state.

An examination of decoy structures for 1whz
Recall that for our test protein 1whz, non-native "decoy" conformations were found that gave high native conformation scores. We compared these conformations to the sequencebased predictions from I-sites, a library of local sequence-structure correlations (12). Local I-sites predictions with the highest confidence scores often have corresponding fragment structures that are experimentally stable in solution. It has been shown that I-sites fragments can be used as 'initiation sites' to generate a set of decoy structures that can then be further optimized for protein structure prediction (13; 14). This idea has been used to build an automated protein structure prediction server (14) which uses a hidden Markov model based on I-sites, called HMMSTR, combined with optimization from the Robetta algorithm (15; 16; 17).
When the 1whz sequence is submitted to the HMMSTR/Robetta structure prediction server (14) (with an option set to avoid the use of multiple sequence alignments), the N-terminal beta-strand that our fragment simulations predict as a helical decoy is also predicted to be helical, with a high confidence score of 0.85 ( Figure 18). When multiple sequence alignment is used, this helical I-sites fragment receives much lower confidence scores. This observation is further evidence that statistical occurrences of peptides in structural databases are related to their physical free energies. T0283   T0288  T0311   T0358  T0335  T0363   2reb  1e68  1gb1  1ail   T0309   T0340 1srl Fig. 15. Contact prediction 'logit' scores of the best regression models trained on 8-mer simulations, for all proteins in our test set. Shading is as described in Figure 5 of the main text.  Fig. 16. Contact prediction 'logit' scores of the best regression models trained on 12-mer simulations, for all proteins in our test set. Shading is as described in Figure 5 of the main text.  Fig. 17. Contact prediction 'logit' scores of the best regression models trained on 126mer simulations, for all proteins in our test set. Shading is as described in Figure 5 of the main text.

HMMSTR (without multiple sequence alignment)
HMMSTR (full) Fig. 18. The I-sites library gives a high confidence value to a helical decoy structure we found in our ZAM fragment simulations. Left: When multiple sequence alignments by PSI-BLAST are not allowed, the top I-sites motifs and top 5 HMMSTR/Robetta server predictions predict the same helical decoy structure (cyan) we find using molecular simulation. Right: When multiple sequence alignment is used in the prediction, the decoy helix gets filtered out.