Predicting the Tolerated Sequences for Proteins and Protein Interfaces Using RosettaBackrub Flexible Backbone Design

doi:10.1371/journal.pone.0020451

Figure 1.

Scheme for predicting the tolerated sequences for a protein fold or interaction.

The input is at least one protein structure from the Protein Data Bank (PDB entry 2QMT in the example). Rosetta first creates an ensemble of backbone conformations using the backrub method [31], then predicts sequences consistent with each conformation in the ensemble, scoring each trial sequence–structure combination using the Rosetta score12 scoring function, and finally combines the sequences into a predicted sequence profile. This approach ignores potential covariation between side chains. To speed up calculations, the scoring function is split into one-body terms describing the intrinsic energy of a particular residue conformation, and two-body terms between residues; these residue-residue interaction terms are assumed to be pairwise additive. One- and two-body terms are pre-calculated and stored in an interaction graph [42] such that optimization of sequence–structure combinations for entire proteins only takes seconds using look-up tables of interaction energies. For the interaction graph, vectors of residue self-energies (one body) are stored on the vertices (green circles) and matrices of residue interaction energies (two body) are stored on the edges (thick black lines). Computed interaction energies within proteins, between proteins, or between groups of residues can be reweighted to generate custom fitness functions for specific applications. This flexibility in scoring residue groups allows modeling of separate requirements, such as those to maintain residues required in an interaction interface with a binding partner. Reweighting of groups and group interactions is typically done only for protein-protein interactions. (For the monomeric GB1 domain shown here, no reweighting was applied.)
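The look-up scheme in this caption can be illustrated with a minimal Python sketch. It is not Rosetta code: the positions, amino acid choices, energy values, and the names `one_body`, `two_body`, `total_energy`, and `best_assignment` are all hypothetical, and a brute-force search stands in for Rosetta's actual optimizer. The sketch only shows how pairwise-additive energies stored on vertices and edges can be summed, and optionally reweighted per edge, without recomputing any interaction.

```python
from itertools import product

# Toy interaction graph with hypothetical energies (not Rosetta score12 values).
# Vertices: one-body (self) energy for each residue choice at each position.
one_body = {
    0: {"A": 0.5, "L": -1.0},
    1: {"A": 0.2, "V": -0.5},
    2: {"G": 0.0, "F": -0.8},
}
# Edges: two-body energy for each pair of residue choices at interacting positions.
two_body = {
    (0, 1): {("A", "A"): 0.0, ("A", "V"): -0.2, ("L", "A"): -0.1, ("L", "V"): -0.6},
    (1, 2): {("A", "G"): 0.0, ("A", "F"): -0.3, ("V", "G"): 0.1, ("V", "F"): 0.4},
}

def total_energy(assignment, edge_weights=None):
    """Sum precomputed one- and two-body terms, assuming pairwise additivity.

    edge_weights optionally rescales individual edges (default weight 1.0),
    mimicking a custom fitness function built by reweighting interactions.
    """
    edge_weights = edge_weights or {}
    energy = sum(one_body[pos][aa] for pos, aa in assignment.items())
    for (i, j), table in two_body.items():
        weight = edge_weights.get((i, j), 1.0)
        energy += weight * table[(assignment[i], assignment[j])]
    return energy

def best_assignment(edge_weights=None):
    """Exhaustively enumerate all combinations using only table look-ups."""
    positions = sorted(one_body)
    choices = [sorted(one_body[pos]) for pos in positions]
    candidates = (dict(zip(positions, combo)) for combo in product(*choices))
    return min(candidates, key=lambda a: total_energy(a, edge_weights))

if __name__ == "__main__":
    print(best_assignment())
    print(best_assignment(edge_weights={(1, 2): 2.0}))
```

Exhaustive enumeration is feasible only for a toy graph of this size; the point of the sketch is that every candidate is scored from stored tables alone.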

Table 1.

Summary of tolerated sequence prediction performance on different datasets using the generalized protocol described here.

Figure 2.

Prediction of tolerated sequences for GB1 fold stability.

Frequently observed amino acids in phage display are enriched in the GB1 prediction. A. The structure (PDB code 1FCC) of Streptococcal GB1 (blue) is shown bound to the Fc domain of human IgG (green). The core and peripheral residues that were randomized in phage display are shown with sticks and transparent spheres. The side chain atoms (starting at C-beta) of these amino acids are at least 7 Å away from any atom of the Fc domain, making residues selected at these positions unlikely to interact directly with the Fc domain. B. Amino acids are ranked individually for each sequence position by computationally predicted frequency (using the Boltzmann factor kT = 0.23, as described in the main text). Wild type residues, which were used in protein ensemble generation, are shown in red. The dashed line indicates a typical cutoff of picking the top 5 amino acid choices at each position. C. Sequence logos (LOLA, University of Toronto) are shown for predictions with two different Boltzmann factors. The relative degree of specificity (in terms of bits of information, y-axis) shows good correspondence between prediction and phage display. Increasing the Boltzmann factor lowers the overall specificity and brings the absolute frequencies closer to phage display.
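How a Boltzmann factor turns scored sequences into ranked per-position frequencies can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: each sequence is assumed to receive a weight exp(-(E - E_min)/kT), the sequences and scores are invented, and the function names `boltzmann_profile` and `top_ranked` are hypothetical.

```python
import math
from collections import defaultdict

def boltzmann_profile(scored_sequences, kT):
    """Combine (sequence, score) pairs into per-position amino acid frequencies.

    Assumption: each sequence is weighted by exp(-(E - E_min) / kT), so lower
    scores contribute more and a larger kT flattens the distribution.
    """
    e_min = min(score for _, score in scored_sequences)
    length = len(scored_sequences[0][0])
    columns = [defaultdict(float) for _ in range(length)]
    total = 0.0
    for sequence, score in scored_sequences:
        weight = math.exp(-(score - e_min) / kT)
        total += weight
        for position, amino_acid in enumerate(sequence):
            columns[position][amino_acid] += weight
    return [{aa: w / total for aa, w in column.items()} for column in columns]

def top_ranked(column, n=5):
    """Return the n amino acids with the highest predicted frequency."""
    ranked = sorted(column.items(), key=lambda item: (-item[1], item[0]))
    return [aa for aa, _ in ranked[:n]]

if __name__ == "__main__":
    # Hypothetical two-residue sequences with made-up scores.
    scored = [("LV", -2.0), ("LI", -1.8), ("AV", -1.0)]
    for kT in (0.23, 0.59):
        profile = boltzmann_profile(scored, kT)
        print(kT, [top_ranked(column) for column in profile], profile)
```

With these toy inputs, raising kT from 0.23 to 0.59 lowers the frequency of the dominant amino acid at each position, which mirrors the loss of specificity described for panel C.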

Figure 3.

hGH/hGHR interface tolerance prediction.

The generalized Rosetta 3 protocol described here was applied to rank human growth hormone (hGH) amino acids by computationally predicted frequency. The residue positions shown and their ordering are taken from previously published results using the Rosetta 2 protocol (Humphris & Kortemme, Table 2 [23]). Wild type residues, which were used in protein ensemble generation, are shown in red. For each position, an average of 59% of the amino acids observed in phage display (≥10% experimental frequency) are predicted within the top five computationally ranked amino acids (above dashed line). Overall performance was similar to previous results of the Rosetta 2 protocol. Amino acids (other than wild-type) included in the computationally selected library from the Rosetta 2 protocol are indicated with a star. If the same number of amino acids at each position is used as defined in the computational library in [23], Table 2, the Rosetta 3 protocol misses two frequently observed amino acids included by Rosetta 2 (V67 and L176). Conversely, the Rosetta 2 protocol misses three frequently observed amino acids included by Rosetta 3 (S21, A21, and E22). Both protocols share similar false positive predictions. However, the Rosetta 3 histidine reference energy reweighting (see Methods) eliminates 6 out of 8 histidine false positives (H*).
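The recovery statistic quoted in this caption (the share of frequently observed amino acids that fall within the top-ranked predictions, averaged over positions) can be computed along these lines. The sketch assumes profiles are stored as one dictionary of frequencies per position; the function name `top_n_recovery` and the example numbers are hypothetical and are not data from the paper.

```python
def top_n_recovery(predicted, observed, n=5, min_freq=0.10):
    """Average, over positions, of the fraction of frequently observed amino
    acids (experimental frequency >= min_freq) found among the top n predictions.

    predicted, observed: lists of {amino_acid: frequency} dicts, one per position.
    Positions with no frequently observed amino acid are skipped.
    """
    fractions = []
    for predicted_column, observed_column in zip(predicted, observed):
        frequent = {aa for aa, freq in observed_column.items() if freq >= min_freq}
        if not frequent:
            continue
        ranked = sorted(predicted_column, key=lambda aa: (-predicted_column[aa], aa))
        top = set(ranked[:n])
        fractions.append(len(frequent & top) / len(frequent))
    if not fractions:
        raise ValueError("no position has a frequently observed amino acid")
    return sum(fractions) / len(fractions)

if __name__ == "__main__":
    # Made-up profiles for two positions; n=2 keeps the toy example readable.
    predicted = [{"L": 0.5, "V": 0.3, "A": 0.2}, {"K": 0.6, "R": 0.3, "E": 0.1}]
    observed = [{"L": 0.7, "A": 0.25, "V": 0.05}, {"R": 0.9, "K": 0.1}]
    print(top_n_recovery(predicted, observed, n=2))
```

In the toy data, position 1 recovers one of two frequent amino acids and position 2 recovers both, so the averaged recovery is 0.75.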

Figure 4.

PDZ/peptide interface tolerance predictions.

Shown are five representative examples of predictions with the generalized protocol, compared to experimental data from phage display. The Erbin V83K interface prediction involved introducing the indicated point mutation (V83K) into the PDZ domain prior to backrub ensemble generation (an example of a “premutated” position).

Figure 5.

Sequences from later genetic algorithm generations contribute more in interface design prediction than in protein stability design prediction.

The total Boltzmann weights in the final PWM for the new sequences sampled in each generation were calculated. The distribution of contributions for each generation across the 200 simulations (one simulation for each backbone in the backrub ensemble) is shown. Boxes span from the first quartile to the third quartile, with the line indicating the median. Whiskers extend to the most extreme data point within 1.5 times the interquartile range of the box. Circles show data points beyond that limit. A. Because the fitness function used for protein-protein interfaces (here shown for a complex between the second PDZ domain of DLG1 and peptides) is different from the fitness function used for optimization of side chain packing, the genetic algorithm is important for enriching the population in sequences predicted to be better binders. B. For optimization of protein fold stability (designing positions in the GB1 core), the initial full protein design phase is very effective at finding a low energy sequence, which dominates the contribution to the position weight matrix (PWM) when the same Boltzmann factor (kT = 0.23) is used. C. When the Boltzmann factor is optimized to minimize the average absolute difference between experiment and computation (kT = 0.59), the contribution of the later generations increases significantly.
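The Boltzmann factor optimization mentioned for panel C can be sketched as a simple scan over candidate kT values. The sketch makes several simplifying assumptions that are not taken from the paper: it treats a single position with one score per amino acid instead of weighting whole sequences, it uses a grid of candidate values, and the scores and the names `predicted_frequencies`, `mean_abs_difference`, and `scan_kT` are hypothetical.

```python
import math

def predicted_frequencies(scores, kT):
    """Boltzmann-weighted frequencies from {amino_acid: score} at one position."""
    e_min = min(scores.values())
    weights = {aa: math.exp(-(e - e_min) / kT) for aa, e in scores.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}

def mean_abs_difference(predicted, observed):
    """Average absolute frequency difference over all amino acids seen in either."""
    amino_acids = set(predicted) | set(observed)
    differences = (
        abs(predicted.get(aa, 0.0) - observed.get(aa, 0.0)) for aa in amino_acids
    )
    return sum(differences) / len(amino_acids)

def scan_kT(scores, observed, candidates):
    """Return the candidate kT whose prediction is closest to the observed data."""
    return min(
        candidates,
        key=lambda kT: mean_abs_difference(predicted_frequencies(scores, kT), observed),
    )

if __name__ == "__main__":
    # Made-up scores; the "observed" frequencies are synthetic, generated at
    # kT = 0.6 so that the scan has a known answer to recover.
    scores = {"L": -2.0, "V": -1.5, "A": -1.0}
    observed = predicted_frequencies(scores, 0.6)
    print(scan_kT(scores, observed, [0.2, 0.4, 0.6, 0.8]))
```

A grid scan is the simplest choice for a sketch; any one-dimensional minimizer could replace it without changing the objective.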
