Figure 1.
A stability threshold model of protein evolution.
Proteins are assumed to be functional if and only if they are more stable than some minimal threshold (the value used in the figure is typical for natural proteins [53]; note that more stable proteins have more negative ΔG values). When a particular destabilizing mutation occurs, the evolutionary result will depend on the stability of the proteins in the parent population. When the parent proteins are sufficiently stable (top panel), the mutant protein still satisfies the threshold, and so the mutation has the opportunity to spread by neutral genetic drift. But when the parent proteins are not sufficiently stable (bottom panel), the mutant protein fails to stably fold, and is eliminated by natural selection. Therefore, the probability that a mutation that induces a given stability change ΔΔG will have an opportunity to spread by neutral genetic drift is simply the probability that the parent protein is stable enough to absorb that change and still satisfy the threshold.
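The threshold rule in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the threshold of -5.0, and the parent stabilities are hypothetical values chosen for the example, not numbers taken from the figure.

```python
def mutation_is_neutral(parent_dg, ddg, dg_threshold):
    """Return True if the mutant still satisfies the stability threshold.

    Stabilities follow the caption's sign convention: more negative
    means more stable, so the mutant must be at or below the threshold.
    """
    return parent_dg + ddg <= dg_threshold


def fraction_neutral(parent_dgs, ddg, dg_threshold):
    """Fraction of parent proteins in which the mutation would be neutral."""
    neutral = sum(mutation_is_neutral(dg, ddg, dg_threshold) for dg in parent_dgs)
    return neutral / len(parent_dgs)


# Hypothetical stabilities (kcal/mol) and threshold, for illustration only.
THRESHOLD = -5.0
stable_parents = [-9.0, -8.5, -8.0]    # analogous to the top panel
marginal_parents = [-5.5, -5.2, -5.1]  # analogous to the bottom panel

print(fraction_neutral(stable_parents, 2.0, THRESHOLD))    # mutation tolerated
print(fraction_neutral(marginal_parents, 2.0, THRESHOLD))  # mutation eliminated
```

Averaging over a population of parent stabilities, as `fraction_neutral` does, is the empirical counterpart of the probability described in the last sentence of the caption.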
Figure 2.
Stability distributions and fixation probabilities.
The panel at left shows the probability that a protein in an evolving population will have a particular extra stability, as given by Equation 3. The panel at right shows the probability that a mutation that causes a particular stability change will be neutral, as given by Equation 4. The units of stability are arbitrary; for concreteness here we give them units of kcal/mol.
Figure 3.
An example phylogenetic tree.
This tree shows the sequence data for five sequences at a single site. The amino acid codes at the tips of the branches show the residue identities for the five sequences at this site. The variables at the internal nodes are the amino acid identities at the site for the ancestral sequences, and must be inferred. The branch lengths are proportional to the time since the divergence of the sequences.
Figure 4.
Prior distributions over the ΔΔG values.
The “regularizing priors” are peaked at a moderately destabilizing value to capture the general knowledge that most mutations are destabilizing. The “hydrophobic priors” capture the knowledge that mutations that cause large changes in hydrophobicity are often more destabilizing. These priors are peaked at a value equal to the absolute value of the difference in amino acid hydrophobicity (as defined by the widely used Kyte-Doolittle scale [81]). For example, the prior for a mutation from hydrophobic valine (V) to similarly hydrophobic leucine (L) is peaked near zero, while that for a mutation from valine to charged lysine (K) is peaked at a much more destabilizing value. The “informative priors” are peaked at the ΔΔG values predicted by the state-of-the-art physicochemically based program CUPSAT [8], and so are designed to leverage extensive pre-existing knowledge about ΔΔG values. All the priors are fairly loose to make the ΔΔG values responsive to their effect on the likelihood. The priors also help regularize [80] the ΔΔG predictions by biasing them towards a reasonable range.
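As a rough illustration of where the hydrophobic priors are peaked, the sketch below computes the absolute difference in Kyte-Doolittle hydropathy between two residues. The table holds the standard Kyte-Doolittle values; the function name is hypothetical, and any scaling or units the priors apply to this difference are not reproduced here.

```python
# Standard Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_prior_peak(wildtype, mutant):
    """Absolute hydropathy difference between two residues.

    Hypothetical helper: returns only the raw difference that the caption
    says the hydrophobic prior is peaked at, with no further scaling.
    """
    return abs(KYTE_DOOLITTLE[wildtype] - KYTE_DOOLITTLE[mutant])


# The caption's two examples: valine to leucine, and valine to lysine.
print(f"V->L: {hydrophobic_prior_peak('V', 'L'):.1f}")
print(f"V->K: {hydrophobic_prior_peak('V', 'K'):.1f}")
```

The valine-to-leucine difference is small and the valine-to-lysine difference is large, matching the qualitative contrast described in the caption.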
Figure 5.
Experimentally measured and predicted ΔΔG values for the 68-residue cold shock protein.
The plots at left show the predictions made by the CUPSAT physicochemical modeling program, the consensus approach, and the PIPS phylogenetic inference program using the informative, regularizing, and hydrophobicity priors. To the right is the phylogenetic tree of 763 sequences that was utilized by the PIPS program. The R² values are the squared Pearson correlation coefficients.
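For readers who want to reproduce the kind of statistic reported in these plots, a squared Pearson correlation between measured and predicted values can be computed as in the minimal sketch below. The function name and the sample numbers are hypothetical and are not data from the figure.

```python
def squared_pearson(xs, ys):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical measured and predicted stability changes, for illustration only.
measured = [0.5, 1.2, -0.3, 2.0, 0.9]
predicted = [0.7, 1.0, 0.1, 1.6, 1.1]
print(squared_pearson(measured, predicted))
```

Because the coefficient is squared, it does not distinguish positive from negative correlation, so the sign of the relationship should be checked on the plot itself.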
Figure 6.
Experimentally measured and predicted ΔΔG values for the 156-residue ribonuclease HI protein.
The plots at left show the predictions made by the CUPSAT physicochemical modeling program, the consensus approach, and the PIPS phylogenetic inference program using the informative, regularizing, and hydrophobicity priors. To the right is the phylogenetic tree of 239 sequences that was utilized by the PIPS program. The R² values are the squared Pearson correlation coefficients.
Figure 7.
Experimentally measured and predicted ΔΔG values for the 109-residue thioredoxin protein.
The plots at left show the predictions made by the CUPSAT physicochemical modeling program, the consensus approach, and the PIPS phylogenetic inference program using the informative, regularizing, and hydrophobicity priors. To the right is the phylogenetic tree of 213 sequences that was utilized by the PIPS program. The R² values are the squared Pearson correlation coefficients.
Figure 8.
Performance of the phylogenetic inference approach as a function of the number of sequences used.
The PIPS predictions using informative priors were run using subsets of all of the available protein sequences. The resulting predictions were then correlated with the experimental ΔΔG values (top) or the PIPS ΔΔG predictions obtained using all available sequences (bottom). The R² values are the squared Pearson correlation coefficients. For each number of sequences used, the PIPS predictions were made using 10 different random sequence subsets, and the displayed R² values are the average correlations over these 10 subsets. For cold shock protein, the subsets were made at intervals of 20 sequences, while for ribonuclease HI and thioredoxin they were made at intervals of 10 sequences.
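The subsampling procedure described in this caption can be outlined as below. This is a sketch under stated assumptions: `score_subset` is a hypothetical placeholder for running the predictions on a subset and correlating them with a reference, and the sequence names and the toy scoring function are invented for the example.

```python
import random
from statistics import mean


def average_score_over_subsets(sequences, subset_size, score_subset,
                               n_subsets=10, seed=0):
    """Average a score over several random subsets of a fixed size.

    score_subset is a caller-supplied function (hypothetical here) that
    takes a list of sequences and returns a single number, for example a
    squared correlation with experimental values.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    scores = []
    for _ in range(n_subsets):
        subset = rng.sample(sequences, subset_size)  # draw without replacement
        scores.append(score_subset(subset))
    return mean(scores)


# Toy stand-ins: 100 placeholder sequence names and a score that simply
# grows with subset size. A real analysis would rerun the predictions here.
all_sequences = [f"seq{i}" for i in range(100)]
toy_score = lambda subset: len(subset) / len(all_sequences)

for size in range(20, len(all_sequences) + 1, 20):  # intervals of 20 sequences
    print(size, average_score_over_subsets(all_sequences, size, toy_score))
```

Averaging over several random subsets at each size reduces the dependence of the curve on which particular sequences happen to be drawn.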
Figure 9.
Predicted stability effects of known temperature-sensitive and revertant mutations to H1 hemagglutinin.
In the plots at left, bars indicate the distribution of predicted ΔΔG values for all single mutations, while symbols show predicted ΔΔG values for the temperature-sensitive and revertant mutations. At right is the phylogenetic tree utilized by the PIPS program. The tree labels give the hemagglutinin subtypes and corresponding numbers of sequences. The PIPS predictions are made using the informative priors.
Figure 10.
Locations of the predicted and confirmed stabilizing mutations to H1 hemagglutinin.
The full hemagglutinin trimer is shown in green, with the HA1 chains in dark green and the HA2 chains in light green. The temperature-sensitive mutation (ts-134 [104]–[106]) is shown with red spheres. The yellow spheres show the mutations that were predicted to be stabilizing by the PIPS program. The blue spheres show the four predicted mutations that were experimentally confirmed to actually increase the temperature stability. The structure is PDB code 1RVZ [107].
Table 1.
Plaque growth of influenza A/WSN/33 (H1N1) viruses carrying mutations in hemagglutinin.
Figure 11.
Plaque assays of wildtype, temperature-sensitive (ts), and ts influenza with predicted stabilizing hemagglutinin mutations.
All four of the single mutations allow the virus to plaque at higher temperatures than the ts parent. The multiple mutants plaque more effectively at higher temperatures than the single mutants. Mutations are named according to the numbering scheme described in Table 1.