
Conceived and designed the experiments: VAV MSS KAD. Performed the experiments: VAV MSS. Analyzed the data: VAV. Contributed reagents/materials/analysis tools: VAV MSS. Wrote the paper: VAV. Edited the manuscript: MSS KAD.

The authors have declared that no competing interests exist.

It has long been proposed that much of the information encoding how a protein folds is contained locally in the peptide chain. Here we present a large-scale simulation study designed to examine the extent to which conformations of peptide fragments in water predict native conformations in proteins. We perform replica exchange molecular dynamics (REMD) simulations of 872 8-mer, 12-mer, and 16-mer peptide fragments from 13 proteins using the AMBER 96 force field and the OBC implicit solvent model. To analyze the simulations, we compute various contact-based metrics, such as contact probability, and then apply Bayesian classifier methods to infer which metastable contacts are likely to be native vs. non-native. We find that a simple measure, the observed contact probability, is largely more predictive of a peptide's native structure in the protein than combinations of metrics or multi-body components. Our best classification model is a logistic regression model that can achieve up to 63% correct classifications for 8-mers, 71% for 12-mers, and 76% for 16-mers. We validate these results on fragments of a protein outside our training set. We conclude that local structure provides information to solve some but not all of the conformational search problem. These results help improve our understanding of folding mechanisms, and have implications for improving physics-based conformational sampling and structure prediction using all-atom molecular simulations.

Proteins must fold to unique native structures in order to perform their functions. To do this, proteins must solve a complicated conformational search problem, the details of which remain difficult to study experimentally. Predicting folding pathways and the mechanisms by which proteins fold is thus central to understanding how proteins work. One longstanding question is the extent to which proteins solve the search problem locally, by folding into sub-structures that are dictated primarily by local sequence. Here, we address this question by conducting a large-scale molecular dynamics simulation study of protein fragments in water. The simulation data was then used to optimize a statistical model that predicted native and non-native contacts. The performance of the resulting model suggests that local structuring provides some but not all of the information to solve the folding problem, and that molecular dynamics simulation of fragments can be useful for protein structure prediction and design.

It has long been proposed that much of the information encoding how a protein folds is contained locally in the peptide chain. Indeed, the success of fragment insertion methods for

The question we raise here is not about the success rates of secondary structure predictions. Secondary structure prediction methods such as PSIPRED use knowledge bases of known native structures and can achieve prediction success rates near 80% (as judged by

There are previous studies using molecular dynamics simulations of peptide fragments for structure prediction. Bystroff and Garde performed 10-ns explicit-water simulations using the AMBER ff94 forcefield for 64 8-residue fragments to show that observed helicity correlates well with I-sites predictions

To what extent did our simulations of peptide fragments sample native-like structures? From the native structures of our target sequences, we determined alpha helical and tight turn types across each target sequence using the secondary structure classification algorithm STRIDE

We find that the fragment simulations sample diverse structures. Conformational clustering (see

(A) For each target sequence and fragment length, the C-alpha RMSD-to-native values (in Å) for all representative cluster conformations along the target sequence are shown. Each line on the plot corresponds to a cluster conformation, color-coded by native secondary structure: alpha-helix (yellow), beta-hairpin (cyan), or other turn types (magenta). The relative shading of each line is proportional to its population fraction. The horizontal axis is the sequence position along the protein chain. (B) The fraction of cluster conformations that sample within a particular RMSD-to-native, across all fragment simulations of a given chain length. For comparison, the black line shows the results for a random distribution of C-alpha RMSD values calculated from native protein structures (see

These fragments typically sample native-like conformations.

Does running longer simulations lead to more native-like structures? We found this not to be the case. On seven different hairpin fragments, we performed 20 REMD simulations (with and without various contact constraints) for a total of 100 ns (

Our data provides an opportunity to draw inferences about what physical properties of intrachain contacts are predictive of whether a peptide conformation is native or not. To do this, we train probabilistic classifier models on several contact metrics, and interrogate the results. For each set of simulated fragments (8-mers, 12-mers, and 16-mers), we explored two kinds of per-contact classification models: a naive Bayes model and a logistic regression model (see

Which classification model best predicts native or non-native contacts from short fragment simulations? In all cases, the logistic regression model gave better classifications than the corresponding naive Bayes model, so we present only the results from the logistic regression models. Likewise, in all cases contacts defined by a 7Å distance cutoff performed significantly worse than those defined by an 8Å cutoff, so we present results only for the latter. The best logistic regression coefficients for 8-mers, 12-mers, and 16-mers are shown in

Length | Distance Method | ||||||

8 | −2.4388±0.1004 | 2.6401±0.2354 | — | −0.0524±0.019 | — | — | |

12 | −2.311±0.057 | 2.594±0.157 | — | −0.0363±0.0074 | −0.0327±0.0085 | — | |

16 | −2.166±0.033 | 2.194±0.113 | 0.093±0.0064 | −0.025±0.0037 | 0.0079±0.0041 | — |

What metrics are the best predictors of whether a simulated fragment has formed native contacts? We examined several metrics (see

Each metric is calculated on a per-contact basis from the simulation data. Further details are in

The

This is interesting because it might be expected that including multi-body terms would be more predictive than just the pairwise contact formation probability, since protein stability is likely to involve non-additivities that could only be captured in complex terms. Instead, we find that simple pairwise terms are the most predictive, with the multi-body terms producing small negative regression coefficients. The negative coefficients can be interpreted as providing a slight correction to the over-counting due to correlation between pairwise contact probability terms.

Results are shown for models built from the (A) 8-mer simulation data, (B) 12-mer data, and (C) 16-mer data. For each contact definition we tested (

We also tested whether we could obtain better classification models by training on local contacts (or nonlocal contacts) alone. We found that, overall, the classification success for the local-only or nonlocal-only data was comparable, but never as high as the classification success using the combined data (see

Now, given the parameters obtained from the logistic-regression models described above, we can compute the probability that a given simulated peptide conformation has native contacts.
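As a minimal sketch of this step, the logistic model maps a linear combination of contact metrics to a probability between 0 and 1. The function name, the dictionary layout, and the coefficient values below are illustrative assumptions, not the fitted parameters reported in the coefficient table.

```python
import math

def native_probability(metrics, coefficients, intercept):
    """Logistic model: P(native) = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    logit = intercept + sum(coefficients[name] * value
                            for name, value in metrics.items())
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients, chosen for illustration only.
example_coefficients = {"CPROB": 2.5}
example_intercept = -2.0

# A rarely formed contact versus a frequently formed contact.
p_low = native_probability({"CPROB": 0.1}, example_coefficients, example_intercept)
p_high = native_probability({"CPROB": 0.9}, example_coefficients, example_intercept)
```

With a positive coefficient on contact probability, a contact that forms more often in simulation receives a higher predicted probability of being native.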

Predictions were made using the best logistic regression models built from the 8-mer, 12-mer, and 16-mer simulations.

In the case where the data contains many more non-native contacts than native contacts, a high classification accuracy may not reflect a significant improvement over a random null distribution,

Above the diagonal, the grayscale values at each contact position correspond to ‘logit’ values

Next, we tested our model on a protein outside our training set: 1whz (PDB ID: 1whz), a 70-residue CASP6 target with an

The ribbon diagram of the X-ray crystal structure was made with PyMOL.

The upper diagonal shows the logit scores

These models make per-contact predictions. However, we are interested in predictions for whole peptide conformations. To turn our contact-based scores into conformation-based scores, we compute a score,

We computed conformation scores for all the cluster conformations extracted from 8-mer, 12-mer, and 16-mer 1whz fragment simulations. For 8-mers and 12-mers, we observe a correlation (albeit noisy) between a high value of

Each dot represents a cluster conformation, color-coded according to its region along the protein sequence: residues 1–20 (cyan), residues 12–39 (magenta), residues 28–53 (yellow), and residues 42–70 (cyan). On the left (residues 1–20 and 28–53) are examples of high conformational cluster scores predicting native structures, while on the right (residues 12–39 and 42–70) are examples of high-scoring decoy structures.

We have performed computer simulations of short peptides (8-mers, 12-mers, and 16-mers) using the AMBER 96 force field and the OBC implicit solvation model. Our aim was to see whether the metastable structures of these fragments bear any resemblance to the conformations those fragments adopt in the native states of the proteins in which they appear. We find that the peptide contact probabilities in a logistic regression model lead to a 76% success rate for 16-mers in correctly classifying contacts as either native or non-native. Across the chain lengths studied, the false negative rates (native contacts classified as non-native) of our best logistic regression models range from about 30–45%. The false positive rates (non-native contacts classified as native) vary from about 20–40%. These results show that the predicted peptide conformations in water are significantly more native-like than would be expected from random conformers. Previously, Bystroff and Garde also showed a 75% success rate at predicting native helicity across 64 8-mer fragments simulated using AMBER ff94 and explicit TIP3P water

These results may have useful application in physics-based methods, like ZAM

While our fragment simulations show that some peptide fragments sample native-like states, the sampling still produces many false positives and false negatives. This is consistent with the information-theoretic studies of Crooks and Brenner

Our dataset of peptides was 8-mer, 12-mer, and 16-mer fragments of 8 CASP7 target sequences and 5 other protein sequences with known structures taken from the PDB (see

PDB id | CASP target | Name | Residues | Residues in PDB | Fragment simulations: 8-mers | Fragment simulations: 12-mers | Fragment simulations: 16-mers |
2hh6 | Yes | T0283 | 112 | 112 | 36 | 4 | 12 |
2gzv | Yes | T0288 | 93 | 93 | 30 | — | — |
2h4o | Yes | T0309 | 76 | 63 | 24 | 7 | 23 |
2ict | Yes | T0311 | 94 | 94 | 31 | 9 | 32 |
2hep | Yes | T0335 | 85 | 42 | 13 | 5 | 11 |
2he4 | Yes | T0340 | 90 | 90 | 29 | 16 | 23 |
2hjj | Yes | T0358 | 87 | 75 | 28 | 13 | 24 |
2hj1 | Yes | T0363 | 97 | 87 | 31 | 13 | 23 |
2reb | No | RecA | 60 | 60 | 19 | 6 | 8 |
1e68 | No | Bacteriocin | 70 | 70 | 22 | 21 | 33 |
1gb1 | No | Protein G | 56 | 56 | 49 | 45 | 21 |
1ail | No | NS1 | 70 | 70 | 63 | 37 | 17 |
1srl | No | src SH3 | 56 | 56 | 49 | 45 | — |

Quantity | 8-mers | 12-mers | 16-mers | All fragment lengths |
Total number of contacts | 4236 | 9865 | 19360 | — |
Simulation replicas | 15 | 15 | 20 | — |
Total number of simulations | 424 | 221 | 227 | — |
Simulation time | 31800 | 16575 | 22700 | — |
Total simulation time | — | — | — | 71.1 |
CPU years (10 ns/day) | — | — | — | 8.7 |

We used the AMBER ff96 force field

We simulated the fragments using the ZAM (Zipping and Assembly Method) protocol described in

Classification models were trained on five different contact-based metrics, calculated on a per-contact basis from the simulation data: 1) contact probability (CPROB), 2) a distance profile score (DPROF), 3) a mutual stability score (MSTAB), 4) a mutual cooperativity score (MCOOP) and 5) mesoentropy score (MESO) (

Contact probability is calculated as the fraction of sampled states that have inter-residue distances less than 8Å (we also tested 7Å, and three different distance definitions; see Training and Testing). The distance profile score (DPROF) was developed to obtain more information about the interaction of two residues as a function of distance, by extracting the potential of mean force
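The contact probability defined above can be sketched as a simple counting operation over sampled inter-residue distances. The function name and the toy distance list are assumptions for illustration; how distances are measured (the distance definition) and which sampled states are included are not specified by this sketch.

```python
def contact_probability(distances, cutoff=8.0):
    """Fraction of sampled states whose inter-residue distance is below `cutoff` (in Å)."""
    if not distances:
        raise ValueError("no sampled distances")
    return sum(1 for d in distances if d < cutoff) / len(distances)

# Toy inter-residue distances (Å) for one residue pair; not simulation data.
toy_distances = [5.2, 7.9, 8.4, 10.1, 6.3]

cprob_8 = contact_probability(toy_distances)       # default 8 Å cutoff
cprob_7 = contact_probability(toy_distances, 7.0)  # alternative 7 Å cutoff
```

Tightening the cutoff can only lower or preserve the computed probability, since every state counted at 7 Å is also counted at 8 Å.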

These metrics are designed to characterize, for any given contact, the average extent to which it interacts cooperatively with other contacts (considered as pairs of contacts), which may indicate thermodynamic folding cooperativity.

The mutual stability and cooperativity scores can best be described by considering pairwise distributions of contact probabilities

For a particular pair of contacts

The mesoentropy score is related to the backbone entropy. It measures the distribution of backbone dihedral mesostates, defined by Ho and Dill

Given the metrics computed above for the peptide conformations observed in the solution simulations, we now ask whether those metrics can be combined to make the best possible predictions of the peptide's structure in the native state of the protein. For each contact observed in our database of simulated fragments, we have a set of

The ‘naive Bayesian’ approach would be to assume that, for any contact, our set of calculated metrics

Using Equations 1 and 2, and taking the logarithm of the ratio of

Since

Substituting

A potential improvement to the ‘naive Bayes’ model is the

Solving for

In practice, these coefficients (and their error estimates) are found with a maximum-likelihood optimization using Newton-Raphson gradient minimization. The optimization is equivalent to least-squares linear regression in the nonlinear ‘logit’ variables
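To illustrate the kind of optimization described here, the following is a minimal sketch of Newton-Raphson maximum-likelihood fitting for a logistic model with an intercept and a single predictor (for example, contact probability). It is not the authors' code: the function name, the restriction to one predictor, the convergence threshold, and the toy data are all assumptions. Error estimates are taken from the inverse of the negative Hessian evaluated at the last iterate.

```python
import math

def fit_logistic_newton(x, y, iterations=25):
    """Fit P(y=1|x) = 1/(1+exp(-(b0+b1*x))) by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the negative Hessian
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    # Standard errors: square roots of the diagonal of the inverse negative Hessian.
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    return (b0, b1), (se0, se1)

# Toy data (not from the paper): a contact metric and native (1) / non-native (0) labels.
toy_metric = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_native = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

(toy_b0, toy_b1), (toy_se0, toy_se1) = fit_logistic_newton(toy_metric, toy_native)
```

Each Newton step solves a weighted linear least-squares problem, which is one way to see the connection to linear regression in the logit variables. The toy labels overlap in the metric, so the maximum-likelihood estimate is finite; perfectly separable data would need regularization or a step limit.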

Note the similarity of the logistic regression model (Equation 6) to the naive Bayesian approach (Equation 4), with

We built both naive Bayes and logistic regression models for 8-mer, 12-mer, and 16-mer fragments separately. For the naive Bayes models, this involved empirically computing histograms in

For each kind of model, in order to determine the best combinations of metrics on which to train the model, we built separate models for all (2^{5}−1) = 31 combinations of the five contact metrics (CPROB, DPROF, MSTAB, MCOOP and MESO). In addition, for each of the models, we tested three different inter-residue distance definitions (

To avoid over-fitting, the training data used to construct each model was divided randomly into five groups so that independent models could be built for each group. Additionally, 1/5 of the data in each group was set aside for testing the model, and the other 4/5 of the data was used to train the model. This means that for each model, there were 25 independent testing and training rounds: 5 independent model-building rounds, each with 5 leave-one-out trials of testing and training.
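The 25 rounds described above can be sketched as a nested random split: five groups, each divided into five folds, with each fold held out once for testing. This is an illustrative reading of the procedure; the function name, the fixed random seed, and the choice to split by individual data item are assumptions (the text does not say whether splits were made per contact or per fragment).

```python
import random

def five_by_five_rounds(items, n_groups=5, n_folds=5, seed=0):
    """Yield (group, fold, train, test) for nested random splits of `items`."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Divide the shuffled data into independent groups.
    groups = [shuffled[g::n_groups] for g in range(n_groups)]
    for g, group in enumerate(groups):
        # Within each group, hold out one fifth for testing in turn.
        folds = [group[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [item for k, fold in enumerate(folds) if k != f for item in fold]
            yield g, f, train, test

# Toy example: 100 item indices give 25 rounds of 16 training and 4 testing items.
rounds = list(five_by_five_rounds(range(100)))
```

Because the groups are disjoint, models built in different groups never share training data, while within a group every item is used for testing exactly once.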

To assess which model was the best, we used a statistical hypothesis testing scheme to find a model that most successfully classifies native contacts as well as non-native contacts. Consider a test where we use the statistic

For the naive Bayes models built for each fragment length, the model that yielded the highest model quality (Q) when applied to testing data was chosen as the best model. For the logistic regression models, the 25 rounds of testing and training produced a series of models across which

For each simulation, the probability of a contact being native can be estimated by Equation 7. However, in the case where there are multiple simulations of the same contact (in overlapping fragment simulations), we can use all of the simulation data to estimate this probability. Assuming that each of

A null distribution in C-alpha RMSD values for 8-mers, 12-mers and 16-mers was calculated by taking 10000 random pairwise samples of 8-mer, 12-mer and 16-mer fragments from a set of 3465 protein structures taken from the SCOP database

Because there are correlations between contact metrics due to chain connectivity, considerable care was taken to construct null distributions for contact metrics that preserved these correlations. We did this by constructing the null distribution on a fragment-by-fragment basis. For each fragment, the values of the contact metrics were retained, while the assignment of native and non-native contacts was randomized according to a per-fragment bootstrapping procedure. For each fragment, a random contact map was drawn (with replacement) from the full data set. This reassignment procedure, across the entire set of fragments, was repeated 1000 times to construct a distribution of random-case realizations.
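One possible reading of this per-fragment reassignment is sketched below: each fragment keeps its own metric values, while its native/non-native labels are replaced by a whole contact map drawn with replacement from the data set. The data layout (dictionaries with "metrics" and "native" lists), the function name, and the assumption that all fragments of a given length share the same contact indexing are illustrative choices, not details from the paper.

```python
import random

def null_realizations(fragments, n_realizations=1000, seed=0):
    """Yield label-randomized copies of the data: metrics kept, contact maps redrawn."""
    rng = random.Random(seed)
    contact_maps = [frag["native"] for frag in fragments]
    for _ in range(n_realizations):
        realization = []
        for frag in fragments:
            # Draw an entire contact map (with replacement) so that correlations
            # among the labels within one fragment are preserved.
            drawn = rng.choice(contact_maps)
            realization.append({"metrics": frag["metrics"], "native": drawn})
        yield realization

# Toy fragments with two contacts each (hypothetical values).
toy_fragments = [
    {"metrics": [0.1, 0.2], "native": [1, 0]},
    {"metrics": [0.3, 0.4], "native": [0, 0]},
    {"metrics": [0.5, 0.6], "native": [1, 1]},
]
null_sets = list(null_realizations(toy_fragments, n_realizations=10))
```

Drawing whole contact maps, rather than shuffling individual labels, is what keeps the within-fragment correlations intact in each random realization.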

Supplemental Data and Results


We appreciate the help provided by Andrej Sali and his group. We had helpful discussions with Imran Haque. We thank Fred Davis for his help compiling statistics from the SCOP database.