Inferring Stabilizing Mutations from Protein Phylogenies: Application to Influenza Hemagglutinin

One selection pressure shaping sequence evolution is the requirement that a protein fold with sufficient stability to perform its biological functions. We present a conceptual framework that explains how this requirement causes the probability that a particular amino acid mutation is fixed during evolution to depend on its effect on protein stability. We mathematically formalize this framework to develop a Bayesian approach for inferring the stability effects of individual mutations from homologous protein sequences of known phylogeny. This approach is able to predict published experimentally measured mutational stability effects (ΔΔG values) with an accuracy that exceeds both a state-of-the-art physicochemical modeling program and the sequence-based consensus approach. As a further test, we use our phylogenetic inference approach to predict stabilizing mutations to influenza hemagglutinin. We introduce these mutations into a temperature-sensitive influenza virus with a defect in its hemagglutinin gene and experimentally demonstrate that some of the mutations allow the virus to grow at higher temperatures. Our work therefore describes a powerful new approach for predicting stabilizing mutations that can be successfully applied even to large, complex proteins such as hemagglutinin. This approach also makes a mathematical link between phylogenetics and experimentally measurable protein properties, potentially paving the way for more accurate analyses of molecular evolution.


Introduction
Knowledge of the impact of individual amino acid mutations on a protein's stability is valuable both for understanding the protein's natural evolution and for altering its properties for engineering purposes. Experimentally measuring the effects of mutations on protein stability is a laborious process, so a variety of methods have been devised to predict these effects computationally. Most existing methods rely on some type of physicochemical modeling of the mutation in the context of the protein's three-dimensional structure, often augmented by information gleaned from statistical analyses of protein sequences and structures. These types of methods are moderately accurate at predicting the effects of mutations on the stabilities of small soluble proteins [1][2][3][4][5][6][7][8]. There is little or no published data evaluating their performance on the larger and more complex proteins that are frequently of greatest biological interest, although it might be expected to be worse given the greater difficulty of modeling larger structures.
An alternative approach to predicting the effects of mutations on protein stability utilizes the information contained in alignments of evolutionarily related sequences. This approach, which was originally introduced by Steipe and coworkers [9], envisions an alignment of related sequences as representing a random sample of all possible sequences that fold into a given protein structure. Based on a loose analogy with statistical physics, the frequency of a given residue in the sequence alignment is assumed to be an exponential function of its contribution to the protein's stability (just as the Boltzmann factor in statistical physics relates the probability of a microscopic state to the exponential of its energy). This is often called the ''consensus'' approach, since it always predicts that the most stabilizing mutation will be to the most commonly occurring (consensus) residue. The consensus approach has proven to be surprisingly successful, with a wide range of studies supporting the basic notion that stabilizing residues tend to appear more frequently in sequence alignments of homologous proteins [10][11][12][13][14][15][16][17].
But although it is often effective, the consensus approach suffers from an obvious conceptual flaw: alignments of natural proteins do not represent random samples of all possible sequences encoding a given structure, but instead are highly biased by evolutionary relationships. A particular residue might occur frequently because it has arisen repeatedly through independent amino acid substitutions, or it might occur frequently simply because it occurred in the common ancestor of many related sequences in the alignment. The sequence evolution of even distantly related protein homologs is non-ergodic (as evidenced by the fact that sequence divergence continues to increase with elapsed evolutionary time), and so this problem will plague all natural sequence alignments. Therefore, it would clearly be desirable to extract information about protein stability from sequence alignments using a method that accounts for evolutionary relationships.
In fact, there are already highly developed mathematical descriptions of the divergence of evolving protein sequences. The widely used likelihood-based methods for inferring protein phylogenies employ explicit models of amino acid substitution to assess the likelihood of phylogenetic trees [18]. However, these methods make no effort to determine how selection for protein stability might manifest itself in the ultimate frequencies of amino acids in an alignment of evolved sequences. Instead, in their simplest form, these phylogenetic methods simply assume that there is a universal ''average'' tendency for one particular amino acid to be substituted with another (these ''average'' substitution tendencies are typically given by PAM, BLOSUM, or JTT matrices). More advanced phylogenetic methods sometimes allow for different ''average'' substitution tendencies for different classes of protein residues (such as surface versus core residues, or residues involved in different types of secondary structures) [19][20][21][22][23][24]. Still other methods use simulations or other structure-based methods to derive site-specific substitution matrices for different positions in a protein [25][26][27][28]. However, none of these methods relate the substitution probabilities to the effects of mutations on experimentally measurable properties such as protein stability, nor do they provide a method for predicting the effects of the mutations from the protein phylogenies.
Here we present an approach for using protein phylogenies to infer the effects of amino acid mutations on protein stability. We begin by describing a conceptual framework that quantitatively links a mutation's effect on protein stability to the probability that it will be fixed by evolution. We then show how this framework can be used to calculate the likelihood of specific phylogenetic relationships given the stability effects of all possible amino acid mutations to a protein. Our actual goal is to do the reverse, and infer the stability effects given a known protein phylogeny. To robustly accomplish this, we use Bayesian inference with informative priors derived from an established physicochemical modeling program. We compare the inferred stability effects to published experimental values for several proteins, and show that our method outperforms both the physicochemical modeling program and the consensus approach. Finally, we use our method to predict mutations that increase the temperature-stability of influenza hemagglutinin, a complex multimeric membrane-bound glycoprotein for which (to our knowledge) stabilizing mutations have never previously been successfully predicted by any approach. We introduce the predicted stabilizing mutations into hemagglutinin, and experimentally demonstrate that several of them increase the temperature-stability of the protein in the context of live influenza virus. Overall, our work presents a unified framework for incorporating protein stability into phylogenetic analyses, as well as demonstrating a powerful new approach for predicting stabilizing mutations.

Results
A framework relating the biophysical impact of amino acid mutations to the frequency with which they are fixed during neutral evolution We begin by introducing a conceptual framework that relates the probability that a specific amino acid mutation will be selectively neutral (and so have an opportunity to spread by genetic drift) to its effect on protein stability. Because this conceptual framework forms the starting point for subsequent mathematical inference, it is necessarily highly simplified. It is based on several assumptions which, although motivated by biophysical considerations, are subject to many exceptions. Below we outline these assumptions, and mention some of the exceptions. We hope the reader will become convinced that this conceptual framework strikes a reasonable balance between being realistic and mathematically tractable. The conceptual framework that we describe has previously been successfully employed in simulations [29,30], and later in theoretical treatments [31,32], of protein evolution.
We assume that evolution selects only for a protein's biochemical function, and is indifferent to its precise stability provided that the protein folds with sufficient stability to perform its function. This assumption is imperfect, since some proteins are natively unfolded [33], only kinetically stable [34], or specifically selected for marginal stability in order to aid in regulation [35]. In addition, mildly destabilized proteins might retain partial function while being subject to weak negative selection. This assumption nonetheless captures the overriding idea that most proteins have evolved to fold to stable structures in order to perform biochemical functions that are the actual dominant targets of natural selection. With this assumption, proteins can be viewed as having to satisfy a minimal stability threshold in order to avoid being culled by natural selection (see Figure 1).
We further assume that all protein mutants that satisfy the stability threshold are equally functional, while all mutants that fail to satisfy the threshold are nonfunctional. This assumption has the mathematically desirable property that it neatly divides all mutants into one of two categories (sufficiently stable or nonfunctional). Of course, we recognize that this assumption is not strictly true, since one could fill many pages documenting mutations that are deleterious despite preserving stability. For example, mutations can specifically interfere with a protein's function (such as altering an enzyme's activity)-but experiments have shown that such mutations are rare compared to the much larger number that affect stability [36][37][38][39]. Mutations can also be deleterious if they increase a protein's propensity to aggregate [40][41][42][43] or interfere with its folding [44] or unfolding [16,45] kinetics-but quantifying a mutation's impact on stability provides a partial proxy for these effects since aggregation propensity [40], folding rate [46][47][48][49], and kinetic stability [16] are correlated with stability. Mutations can also have other deleterious effects, such as altering mRNA stability [50], codon usage [51], or the accuracy and efficiency of translation [43,51,52]. We mention these myriad exceptions to explicitly acknowledge their existence. Nonetheless, from here

Author Summary
Mutating a protein frequently causes a change in its stability. As scientists, we often care about these changes because we would like to engineer a protein's stability or understand how its stability is impacted by a naturally occurring mutation. Evolution also cares about mutational stability changes, because a basic evolutionary requirement is that proteins remain sufficiently stable to perform their biological functions. Our work is based on the idea that it should be possible to use the fact that evolution selects for stability to infer from related proteins the effects of specific mutations. We show that we can indeed use protein evolutionary histories to computationally predict previously measured mutational stability changes more accurately than methods based on either of the two main existing strategies. We then test whether we can predict mutations that increase the stability of hemagglutinin, an influenza protein whose rapid evolution is partly responsible for the ability of this virus to cause yearly epidemics. We experimentally create viruses carrying predicted stabilizing mutations and find that several do in fact improve the virus's ability to grow at higher temperatures. Our computational approach may therefore be of use in understanding the evolution of this medically important virus.
forward we will use the concept of stability threshold selection to develop a mathematical relationship between protein stability and evolution. Figure 1 illustrates the stability threshold view of evolution that we have just described. In this figure, a protein's stability is quantified by its free energy of folding (its DG f ), and the effects of mutations by the change they induce in the free energy of folding (their DDG values) [53]. For proteins that do not fold reversibly, some alternative experimental measure of stability (such as resistance to thermal denaturation [54] or proteolysis [55]) is clearly required, but the concept remains the same. The key implication of Figure 1 is that the evolutionary impact of a mutation can depend on the stability of the parent protein into which it is introduced, with a moderately destabilizing mutation being neutral in the context of a stable parent but lethal to a marginally stable parent. That mutational tolerance is indeed enhanced by extra stability in this fashion has been experimentally verified for several proteins [56][57][58]. This idea provides a basis for forsaking the traditional approach of using pre-specificied ''average'' amino acid substitution matrices, and instead adopting the view that the frequency of a particular substitution tells us something about its impact on protein stability. Much of the rest of this paper deals with the mathematical mechanics of how to use the substitution frequencies implied by a set of protein homologs to infer the effects of individual mutations on stability.

Sequence evolution without any selection
To introduce the mathematical analysis, begin by considering protein sequence evolution in the absence of any selection on amino acid composition. Even in the absence of selection, some amino acid substitutions are more likely than others due to the structure of the genetic code and unequal frequencies of different types of nucleotide mutations. In order to express the probabilities of various types of mutations only in terms of amino acid identities, assume that the distribution of codons encoding each amino acid is always at equilibrium. For example, assume that all glycines at all times have the same probability of being encoded by the GGG codon. With this assumption, the current state of a residue can be described by its amino acid identity rather than its codon identity (see [59] for an evolutionary model that operates at the codonlevel). Given that a particular position is currently amino acid y, let c xy denote the probability that a single nucleotide mutation to the codon at this position changes the identity to amino acid x. Nonsense mutations (to stop codons) are assumed to be immediately eliminated by selection, and so leave the codon unchanged. All other mutations are assumed to be neutral. Therefore, all nonsense and synonymous mutations contribute to c yy , and all nonsynonymous mutations contribute to c xy with x=y. Denote the set of all 20620 = 400 values of c xy as C~c xy x,y[ A,C,D,E,F ,G,H,I,K,L,M,N,P,Q,R,S,T,V , f j È W ,Y g:g. Note that 1~P x c xy , since each mutation either leads to a new amino acid (x=y) or leaves the amino acid unchanged (x~y).
Let A be the 20620 matrix with off-diagonal elements A xy~cxy and diagonal elements A yy~{ P x=y c xy . Let u be the rate at which an individual codon experiences a nucleotide mutation, so that each codon experiences an average of ut mutations after an elapsed time of t. It is assumed that all codons in the protein experience the same mutation rate u. As will be seen below, the full model still allows variation in the rate at which substitutions accumulate at different residues, but this variation is caused by selection for stability rather than by differences in the underlying rate of mutation. Without selection for stability, the probability Figure 1. A stability threshold model of protein evolution. Proteins are assumed to be functional if and only if they are more stable than some minimal threshold (in the figure, DG threshold f~{ 5 kcal=mol, which is a typical value for natural proteins [53]; note that more stable proteins have more negative DG f values). When a particular destabilizing mutation (DDG~3 kcal=mol) occurs, the evolutionary result will depend on the stability of the proteins in the parent population. When the parent proteins are sufficiently stable (top panel), the mutant protein still satisfies the threshold, and so the mutation has the opportunity to spread by neutral genetic drift. But when the parent proteins are not sufficiently stable (bottom panel), the mutant protein fails to stably fold, and is eliminated by natural selection. Therefore, the probability that a mutation that induces a stability change of DDG will have an opportunity to spread by neutral genetic drift is simply the probability that the parent protein has a stability DG f vDG threshold that a residue that is initially y will be x after an elapsed time of t is given by the element M xy t ð Þ of the matrix M t ð Þ~exp ut A ð Þ [18]. After a sufficiently long period of time (t??), the probability to find some specific amino acid x is the same across all positions of the protein, and is given by element x of the right eigenvector of A corresponding to the unique zero eigenvalue (the uniqueness of this eigenvector is guaranteed by the Perron-Frobenius theorems, since A plus the identity matrix will be an irreducible and acyclic stochastic matrix). Of course, real proteins tend to prefer some amino acids at certain positions, such as hydrophobic residues in the core. The substitution model that has just been described fails to account for these preferences. The next section explains how this problem can be remedied by incorporating selection for stability.

Substitution probabilities in the presence of selection for stability
The situation described in the previous section changes fundamentally in the presence of selection for protein stability, since mutations will be eliminated if they destabilize a protein beyond the threshold. Specifically, let DG f be the stability (free energy of folding [53] . This description, in which DDG r xy s ð Þ for mutating residue r is a function of the parent sequence s as well as the residue identities x and y, is completely general. However, it is not useful. The reason for this lack of utility is that there are 20 L different possible protein sequences s, since each of the L positions in the protein can take on any of the 20 amino acids. Since L for a typical protein is several hundred residues, the number of different DDG r xy s ð Þ values exceeds the number of atoms in the Universe. This many values cannot reasonably be specified a priori or inferred from available sequence data. However, the situation can be made more tractable by assuming that DDG r xy s ð Þ is independent of s, and so is equal to the same value of DDG r xy for all sequences. This assumption is equivalent to saying that the DDG values for mutations to different residues are independent and additive, which implies that the DDG value of a mutation does not depend on the sequence background in which it appears. This assumption is clearly not completely true, since protein stability depends on cooperative interactions among many residues. However, empirically it appears that the assumption of independent and additive DDG values is nonetheless actually rather good. For example, a number of biochemical studies have indicated that the DDG values for a modest number of amino acid mutations are frequently independent and additive [60][61][62][63][64][65]. Of particular relevance is a study by Fersht and Serrano [65] of the amino acid substitutions separating the homologous proteins binase and barnase, which have 85% sequence identity. They found that combinations of these substitutions had additive effects on stability, indicating that the DDG r xy values are very nearly constant among the sequences that occurred during the evolutionary divergence of these two proteins. This high degree of independence and additivity of experimentally measured DDG values may be due to the fact that pairwise amino-acid interaction potentials can be accurately approximated by independent sites [28,66]. Regardless of the underlying reasons, at least at modest levels of sequence divergence, there is experimental evidence that the approximation of constant DDG r xy values is quite accurate. Assuming that DDG r xy is independent of the particular sequence background greatly reduces the number of these values that need to be determined. To count the number of unique DDG r xy values, note that any closed loop in the space of protein sequences yields no net change in stability. That is, DDG r xy~0 (since there is no stability change when there is no mutation), DDG r xy zDDG r yx~0 (since mutating y to x and then back to y does not change the sequence), and DDG r xy zDDG r zx zDDG r yz~0 (since this combination of mutations leaves the sequence unchanged). Therefore, all DDG r xy values can be determined with reference to mutating an arbitrarily chosen amino acid, which is here taken to be alanine (A). There are 19L different DDG r xA values, since each of the L residues can be mutated to any of the 19 non-alanine amino acids. The specification of all DDG r xA values allows any DDG r xy value to be calculated as All DDG values are therefore uniquely determined by the set This paper will show that the elements of G can be reasonably inferred using informative Bayesian priors.
First, assume that G is known and consider the problem of using this knowledge to determine whether selection will tolerate a particular mutation to some specified protein sequence. Let DG extra Þ be the extra stability of a sequence composed entirely of the alanine reference amino acid. The extra stability of any protein sequence s can be calculated from G and DG extra where s r is the amino acid at residue r of sequence s. Under the stability threshold model, mutating residue r of the folded protein with sequence s to x is acceptable to selection if and only if It may be possible to use this formulation to develop a mathematically tractable description of protein evolution. However, the situation is complicated by the fact that the acceptability of a mutation depends on the protein sequence s. Therefore, describing protein evolution using Equation 2 requires estimating the stability of each sequence that occurs along the phylogenetic tree, and averaging over all possible sequence paths. This paper circumvents this difficult task by making the additional (mean-field) approximation that the acceptability of a specific mutation depends on the average distribution of DG extra f , rather than on the exact stability of the protein sequence in which the mutation occurs. In other words, we take the probability that mutating residue r from y to x is neutral to be equal to the probability that DG extra f ƒDDG r xy . This mean-field approximation eliminates all coupling between substitutions at different sites in the protein.
With this mean-field approximation, the issue becomes determining the average distribution of stabilities in an evolving population of proteins. This problem has been treated previously by simulations [29] and mathematically through matrix [31] and diffusion [32] equation approaches. The average distribution of stabilities turns out to depend on the degree of polymorphism in the population, with highly polymorphic populations (those with the product of the population size N and the per sequence per generation mutation rate m much greater than one) evolving to greater average stabilities than populations that are mostly monomorphic (those with Nm%1) [31,67,68]. Here we will consider only the case where the population is mostly monomorphic, so that all proteins tend to have converged to the same stability before a new mutation occurs (as is the case for the proteins shown in Figure 1). This choice is dictated by the fact that we are unclear how to incorporate the secondary selection for mutational robustness that occurs in highly polymorphic populations [31,67,68]. We acknowledge that some of the proteins that we analyze later in this paper (particularly influenza hemagglutinin) may actually evolve in populations that are highly polymorphic, and suggest that a mathematical treatment recognizing this fact is an area for future research. Given our choice to consider only the case where the population is mostly monomorphic, we will adopt the mathematical formalism described in [31] for the limit when Nm%1 (the more compact diffusion-equation approach of Shakhnovich and coworkers [32] cannot be used since it only applies when Nm&1). Following [31], we discretize the continuous variable of extra protein stability DG extra f into small bins of width b, and assign a protein to bin i if it has extra stability such that 1{i ð Here B is some large integer giving an upper limit on the number of stability bins (so that all proteins in the evolving population have DG extra f w{Bb). Note that all folded proteins fall into one of these bins, since proteins with DG extra f w0 fail to fold under the stability threshold model. Reference [31] finds that the distribution of average protein stabilities is well approximated by an exponential (see the middle panels of Figure 2 of this reference, or alternatively Figure 2A of [29]), such that the probability p o i ð Þ that a protein in the evolving population has extra stability that falls in bin i is where aw0 is a constant describing the steepness of the exponential. Figure 2 shows this distribution of protein stabilities graphically. Note that this exact mathematical form for p o i ð Þ is not proven in [31], but simply that all numerical solutions give distributions for p o i ð Þ that resemble this form. Other mathematical forms could be chosen for p o i ð Þ without altering the mathematical analysis that follows, although they might affect the actual numerical values that are ultimately inferred for the DDG values. In particular, in highly polymorphic populations, the distribution of stabilities is peaked at a value slightly below the stability threshold (see right panels of Figure 2 of [31], Figure 2 of [32], or Figure 2B of [29]) rather than being an exponential. However, any distribution in which highly stable proteins are rare and marginally stable proteins are common should lead to qualitatively similar inferred DDG values, since the subsequent analysis only employs the cumulative distribution function of p o i ð Þ in a rather coarse manner. Given the definition of p o i ð Þ in Equation 3, the exact numerical for a simply sets a scale for the DDG values (in conjunction with the bin size b, it determines their units). As is described later in this paper, in our actual computational implementation, we chose a value for a that placed the magnitude of the inferred DDG values in the same dynamic range as the informative priors.
Using the mean-field approximation for DG extra f , the probability that a mutation is neutral can now be computed from p o i ð Þ and DDG r xy . Stabilizing mutations are always neutral, while destabilizing mutations are neutral with a probability equal to the fraction of time they will not unfold a protein with extra stability drawn from p o i ð Þ. Mathematically, the probability f r xy that mutating residue r from y to x is neutral is where t Á Á Á r is the nearest integer function. Figure 2 graphically illustrates the probability that a mutation will be neutral given its DDG value. Define G r to be the matrix with off-diagonal elements G r xy~f r xy c xy and diagonal elements G r yy~{ P x=y f r xy c xy . The probability that a substitution changes position r of the protein from its original identity of amino acid y to amino acid x after an elapsed time t is therefore given by element M r xy t ð Þ of the matrix M r t ð Þ defined by . The panel at right shows the probability f that a mutation that causes a stability change of DDG will be neutral, as given by Equation where u is the per codon mutation rate as defined above. The previous section showed that in the absence of selection for stability, the probability of finding some specific amino acid at a position was equal for all positions in the limit of long time. With selection for stability, this is no longer the case. Let the probability p r x of finding residue x at position r in the long-time limit be given by element x of the vector p r . The vector p r represents the stationary solution to Equation 5, and so is the probability vector (entries sum to one) that satisfies the eigenvector equation where I is the identity matrix. Given a value of G r , the uniqueness of p r is guaranteed by the Perron-Frobenius theorems, since IzG r is a nonnegative and acyclic stochastic matrix. Since G r depends on the DDG r xy values for the stability effects of mutations, the probabilities of observing amino acids at specific positions in the sequence depends on their stability contributions.

Bayesian framework for inferring DDG values from sequence data
The previous section describes how the probabilities of specific substitutions to an evolving protein are shaped by the set G of DDG values. In practice, we simply have some set S of homologous protein sequences. The inference problem is how to estimate G from S. In so doing, we will also need to estimate C, u, and the phylogenetic relationship among the sequences. The approach we will take is to use Bayesian inference [18,[69][70][71] to estimate G from S. Sadly, the approach is not fully Bayesian, since computational limitations require some important quantities to be estimated by alternative means. Hopefully in the future, the computation can be recast in fully Bayesian terms.
The inference problem begins with the set S of homologous protein sequences. Here it is assumed that these proteins have diverged from a common ancestor by point mutations (any insertions/deletions are ignored), and that there is no recombination within the protein coding sequences. It is further assumed that all of the homologous sequences can be aligned in a fashion that puts their residues in a one-to-one correspondence. In mathematical terms, S~s k 1ƒkƒN È É consists of N homologous sequences of length L, with s k denoting the kth sequence. For each sequence s k , we know the identity s k r of the amino acid at position r (where 1ƒrƒL). The set of amino acid identities for all N proteins at a single site r is denoted by S r The evolutionary relationship among the sequences is given by some phylogenetic tree T . Here T is taken to specify both the topology and branch lengths of a rooted phylogenetic tree, as shown in Figure 3.
Using the prescription of the previous section to calculate the substitution probabilities, it is possible to calculate the likelihood Pr S G,C,u,T j ð Þof observing some set of sequences given the DDG values. Here we briefly outline how this calculation would proceed, closely paralleling the description by Felsenstein [18] of the pruning-based likelihood calculation method he developed [72,73]. Making the standard phylogenetic assumption that evolution at each site is independent, Consider the computation for some specific site r. Figure 3 shows the phylogenetic tree T giving the evolutionary relationship among N~5 sequences, and the sequence data S r ð Þ for site r of these sequences. Given this tree in Figure 3, the likelihood for site r is computed by summing over the twenty possible amino acid identities at each internal node, Assuming the lineages are independent, the probabilities on the right side of Equation 8 can be decomposed as a product, where the M r xy t ð Þ and p r x values are calculated from G, C, and u using Equations 5 and 6. Note that Equation 9 assumes that the sequences have evolved for a sufficiently long period of time that the probability of observing residue x at position r at the root of the tree is the long-time limit p r x . Using the pruning approach of Felsenstein, Equations 8 and 9 can be combined to yield Equations 7 and 10 provide a method for computing Pr SjG,C,u,T ð Þ . But goal of this analysis is to infer the DDG values from the sequences, which is equivalent to computing Pr G S j ð Þ. Using Bayes' Theorem, Solving this equation would yield a fully Bayesian inference of G by summing over the unknown variables C, u, and T . In principle, this equation could also be used to estimate another phylogenetic variable (such as T ) by swapping this variable with G in the equation.
However, in practice, the approach taken here will not be the fully Bayesian estimate given by Equation 11. Instead, to reduce the variable sampling space, other methods will be used to make single-value estimates for each of C, u, and T , so that where C, u, and T have been assigned fixed values. Given a prior Pr G ð Þ over the DDG values, the right-hand side of Equation 12 can be estimated numerically. One attractive aspect of this approach is that there is a basis for specifying a meaningful prior over G, since DDG values can be measured experimentally [53,74], or more easily predicted with at least mild accuracy by one of the available physicochemical modeling programs [1][2][3][4][5][6][7][8]. Equation 12 can in principle be solved by Markov chain Monte Carlo (MCMC) methods [69][70][71] to yield a full estimate of the probability distribution Pr GjS ð Þ. But we are interested in obtaining estimates for the individual DDG values contained in G, since it is these values that have physical meaning. Therefore, we take the DDG values of the maximum a posteriori valueĜ G of G, defined aŝ In the next section, we describe the specific computational approach we have used to solve Equation 13 to obtain the DDG values from an alignment of homologous protein sequences.

Implementation of a computational approach for inferring DDG values from sequence data
In this section, we describe the computer program we have developed to infer DDG values from the sequences of protein homologs by solving Equation 13. Solving this equation requires specification of the phylogenetic tree T , the underlying amino acid mutation probabilities C, the mutation rate u, and a prior distribution Pr G ð Þ over the DDG values. Solving the equation also requires a numerical method for maximizing the argument of the argmax G function. We implemented our strategy using the Python programming language, and termed the resulting program PIPS (Phylogenetic Inference of Protein Stability). This program was used to analyze cold shock protein, ribonuclease HI, thioredoxin, and H1 influenza hemagglutinin as described below. The PIPS source code and the full raw data from the analyses in this paper are available at http://openwetware.org/wiki/User:-Jesse_Bloom.
We built the phylogenetic tree T from the set S of homologous protein sequences using the PHYLIP package [75]. The protein sequences of the homologs were aligned using ProbCons [76] (for cold shock protein, ribonuclease HI, and thioredoxin) or MUSCLE [77] (for influenza hemagglutinin). Phylogenetic trees of these aligned protein sequences were then constructed using the distance-based method of PHYLIP's ''neighbor'' program. For cold shock protein, ribonuclease HI, and thioredoxin, the trees were built using the UPGMA method to create rooted trees that conformed to the assumption of a molecular clock. For influenza hemagglutinin, the variation in the date of isolation of the sequences is substantial relative to their divergence, so the neighbor-joining method (no molecular clock) was used to construct a tree which was rooted to an outgroup sequence.
We calculated the underlying amino acid mutation probabilities C under the assumption that each amino acid was equally likely to be encoded by any of its codons. The probability c xy that a single mutation changed amino acid y to x was the probability that a random nucleotide mutation to one of the codons for y yielded a codon for x, averaged over all of the codons for y. There is evidence that the transition-to-transversion ratio for influenza evolving in humans is somewhere in the range of five [78], so for hemagglutinin we assumed that the nucleotide mutations were made with this bias. We are aware of no clear evidence about the transition-to-transversion ratio for cold shock protein, thioredoxin, and ribonuclease HI, so for these proteins we assumed a ratio of 0.5, which is the expectation in the absence of any mutational bias [79]. We recognize that more accurate amino acid mutation probabilities are likely to be derived from a codon-based model [59], and suggest that incorporating such a model is an area for future work.
The mutation rate u represents the number of nucleotide mutations to a codon that occur for each substitution that is fixed along the branches of the phylogenetic tree (branch lengths are measured in amino acid substitutions per site). Since our program is not yet sufficiently advanced to co-estimate u from the sequence data, we had no strong rationale for assigning a particular value to u. We chose a value of u~5, which corresponds to 20% of nucleotide mutations leading to a tolerated amino acid mutation. While we cannot provide an independent justification for this choice of u, the inferred DDG values were fairly insensitive to the choice of u for values between 3 and 20.
One of the strengths of our approach is that it allows for the use of informative priors Pr G ð Þ over the DDG values. These priors can serve two purposes. One purpose is simply to prevent overfitting by regularizing [80] the DDG values by biasing them towards a central reasonable range. A second purpose is to actively incorporate some of the substantial existing knowledge about how protein structure and amino-acid character influence DDG values. One piece of this knowledge is simply the general fact that most mutations to proteins are destabilizing, and so have DDGw0. It is also known that mutations that cause large changes in the hydrophobicity of amino acids are often more destabilizing. At a more detailed level, there are a number of physicochemical modeling programs that attempt to make quantitative predictions of DDG values from protein structural information [1][2][3][4][5][6][7][8]. We tested phylogenetic inference with priors incorporating information at all three of these levels, as shown in Figure 4. At the most basic level, we used ''regularizing priors'' that simply biased all the DDG values towards the generally observed range of mildly to moderately destabilizing. A second set of ''hydrophobic'' priors were based on the idea that mutations that cause large changes in amino acid hydrophobicity will tend to be more destabilizing. For these priors, the prior estimate for each DDG value was equal to the absolute value of the difference in the hydrophobicities of the wildtype and mutant amino acids, as given by the widely used Kyte-Doolittle hydrophobicity scale [81]. These hydrophobic priors therefore predicted that mutations that caused large changes in hydrophobicity would be highly destabilizing (DDG&0), while those that led to small changes in hydrophobicity would have little effect on stability (DDG&0). A third set of ''informative priors'' were designed to leverage the full available knowledge about the effects of mutations on stability. This knowledge is most completely encapsulated in various physicochemically-based prediction programs [1][2][3][4][5][6][7][8], which utilize a wide range of structural and biophysical information to make quantitative DDG predictions for individual mutations. We chose one of these programs, CUPSAT [8], to predict DDG CUPSAT values for all single aminoacid mutations from the protein crystal structures. We chose the CUPSAT program because it has a publicly available webserver (http://cupsat.tu-bs.de) and has reported benchmarks that equal or exceed those of other prediction programs [8]. The prior estimate for each mutation was then the DDG CUPSAT value predicted by CUPSAT, after rescaling the predictions as described below. For all three sets of priors, the prior Pr DDG r xA À Á for mutating residue r from A to x was a beta distribution probability density function peaked at the prior estimate for that mutation. The beta distribution functions were defined so that the sum of the alpha and beta parameters equaled three, and with the functions going to zero at the upper and lower limits of the allowed range for the DDG values. These prior functions are therefore broad, and loosely bias the DDG values toward the prior estimates. Examples of the priors are shown in Figure 4. The overall prior probability for the set G of DDG values was defined to the be product of the prior probabilities for the individual DDG r xA values, Pr DDG r xA À Á . In order for the phylogenetic inference to work effectively, it is necessary that the priors fall in the same numerical range over which the likelihood function is responsive to changes in the DDG values. The actual DDG values of the phylogenetic inference approach have arbitrary units, so placing the priors in an appropriate dynamic range simply requires that the relevant parameters have compatible relative values. We set a DDG range of g~20, so that for all DDG values, {gƒDDGƒg. The values of the bin size b and the parameter a in Equation 3 are arbitrary, but serve to set the scale for how DDG values affect the substitution probabilities. We chose a value of b~1, and a value of a such that p o i ð Þ falls to one percent of its previous value every g=2 bins (this is a~{log 0:01 ð Þ= g=2 ð Þ). This scaling means that the substitution probabilities as a function of the DDG values can cover a large dynamic range of four orders of magnitude given the limits for the DDG values set by g. It is then necessary to choose priors that fall in the same dynamic range. For the regularizing priors, the prior estimate had a value of five for all DDG values, which corresponds to a moderately destabilizing mutation. For the hydrophobicity priors, we did not rescale the values obtained by taking the absolute value of the difference in Kyte-Doolittle hydrophobicities, since these values already fall in a reasonable range of zero to nine. For the informative priors, we rescaled the DDG CUPSAT values to bring them into an appropriate range. Specifically, we rescaled them so that the difference between the values at the 10th and 90th percentiles was g=2 and the mean DDG CUPSAT value was g=4, and truncated outlier values so that g=4{2g=5ƒDDG CUPSAT ƒg=4z2g=4.
Solving Equation 13 requires a numerical method for finding the valueĜ G of G that maximizes the a posteriori probability. The DDG values for the different positions of the protein are independent, so we maximized the 19 DDG r xA values for each position separately. For each residue r, we first set the DDG r xA values to random numbers drawn from a normal distribution with a mean of zero and a standard deviation of g=2. For each DDG r xA , we then performed a line search to find the value that represented the nearest local maximum in the a posteriori probability. We repeated this procedure for the next DDG r xA value, until we had performed line searches for all 19 values. This constituted one iteration of maximization of the DDG r xA values; we continued performing iterations until no further local adjustments in any of the DDG r xA values increased the a posteriori probability. This maximization algorithm is stochastic, and we cannot guarantee that it converges to the global maximum (or indeed converges at all). However, in practice it always converged rapidly, and repeating the procedure with different random starting values led to highly similar DDG r xA values at the completion of the maximization. We considered this ample evidence that this rather ad hoc algorithm was a sufficient method for solving the argmax G function of Equation 13. Implementing a more sophisticated gradient-based maximization is an area for future research, and may lead to improvements in computational speed. However, the The ''regularizing priors'' are peaked at the moderately destabilizing value of DDG~5 to capture the general knowledge that most mutations are destabilizing. The ''hydrophobic priors'' capture the knowledge that mutations that cause large changes in hydrophobicity are often more destabilizing. These priors are peaked at a value equal the the absolute value of the difference in amino acid hydrophobicity (as defined by the widely used Kyte-Doolittle scale [81]). For example, the prior for a mutation from hydrophobic valine (V) to similarly hydrophobic leucine (L) is peaked near zero, while that for mutation from valine to charged lysine (K) is peaked at a much more destabilizing value. The ''informative priors'' are peaked at the DDG values predicted by the state-of-the-art physicochemically based program CUPSAT [8], and so are designed to leverage extensive pre-existing knowledge about DDG values. All the priors are fairly loose to make the DDG values responsive to their effect on the likelihood. The priors also help regularize [80] the DDG predictions by biasing them towards a reasonable range. doi:10.1371/journal.pcbi.1000349.g004 PIPS program described above was sufficiently fast to be run on a laptop computer to give the predictions described in the next few sections.

Comparison of phylogenetically inferred DDG values with existing experimentally measured values for small soluble proteins
We first tested the phylogenetic inference approach on existing experimentally measured DDG values. Most published DDG values are for mutations to a few small soluble proteins. We examined the ProTherm [82] database, and found that the proteins with the most DDG values were bacteriophage T4 lysozyme, sperm whale myoglobin, Bacillus amyloliquefaciens barnase, Bacillus subtilis cold shock protein, Escherichia coli ribonuclease HI, and E. coli thioredoxin. We then searched for sequences with least 50% identity to each of these six proteins in the UniRef100 database [83]. We found a substantial number of homologous sequences for cold shock protein (763 sequences), ribonuclease HI (239 sequences), and thioredoxin (213 sequences). We therefore chose these three proteins as the subjects of our analysis. For each protein, we extracted from the original references all available experimentally measured DDG values for single amino acid substitutions, to obtain a total of 76 DDG values for cold shock protein [84][85][86][87][88][89], 31 DDG values for ribonuclease HI [90][91][92][93][94][95][96], and 32 DDG values for thioredoxin [14,16,97,98].
In order to provide points of comparison, we first examined the ability of the physicochemical modeling program CUPSAT [8] and the consensus approach to predict the experimentally measured DDG values for these three proteins. We used the CUPSAT webserver to predict DDG values from the protein crystal structures (PDB codes 1CSP [99] for cold shock protein, 2RN2 [100] for ribonuclease HI, and 2H6X for thioredoxin). We calculated the consensus approach predictions using the standard Boltzmann form where DDG is the negative logarithm of the ratio of the frequencies of the mutant and wildtype residues in the alignment of homologous sequences (with a pseudocount of one added to the count for each amino acid before calculating the frequencies). We then used the PIPS program described in the previous section to make DDG predictions by a phylogenetic inference approach. PIPS predictions were made using each of the three sets of priors (informative, regularizing, and hydrophobic) described in the previous section. Figures 5,6,and 7 show the correlations between the predicted and experimentally measured DDG values for each of the three proteins. For each of the three proteins, all methods made predictions that were correlated with the experimental DDG values (with R 2 values ranging from 0.25 to 0.60), although there was also always substantial scatter in the correlation plots. In general, the PIPS program appeared to perform slightly better with the informative priors than with either the regularizing or hydrophobic priors. The PIPS program with the informative priors modestly but consistently outperformed both CUPSAT and the consensus approach (with the R 2 values for the PIPS program exceeding those for CUPSAT and the consensus approach by amounts ranging from 20% to two-fold). Because these correlations are with experimental data spanning a wide range of stabilizing and destabilizing DDG values, it is difficult to discern whether PIPS is also clearly better at identifying the most stabilizing mutations (the metric that would be most relevant for engineering protein stability), although it perfoms at least as well as consensus and CUPSAT in this respect. In any case, we interpret the higher overall correlations obtained with PIPS to indicate that for small soluble proteins, the phylogenetic inference approach with informative priors is more accurate than both a state-of-the-art physicochemical modeling program and the consensus approach. In the remainder of this work, all PIPS predictions are made with the informative priors.
The phylogenetic inference approach determines 19L different DDG r xA values for each protein, where L is the length of the protein. Because such a large number of parameters is being inferred, it is interesting to examine how the performance of the phylogenetic inference depends on the number of sequences used. One way to do this is to make PIPS predictions using a random subset of all of the available sequences, and then to correlate these predictions with the experimentally measured DDG values, or with the PIPS predictions made using all available sequences. We performed such an analysis for all three proteins. Figure 8 shows the results of this analysis. Not surprisingly, using larger numbers of sequences improves the accuracy of the predictions, as measured by the correlations with both the experimental DDG values and those predicted by PIPS using all available sequences. However, the correlations are still quite good when only a fraction of the available sequences are used. These results indicate that although it is obviously advantageous to use more sequences, phylogenetic inference performs fairly well even if only 50 or 100 sequences are used. We suggest that both the informative and regularizing [80] aspects of the Bayesian priors serve to prevent overfitting and guarantee reasonable predictions even when the number of sequences is small.

Test of phylogenetic inference approach's ability to identify known temperature-sensitive and revertant mutations to influenza hemagglutinin
We next tested the phylogenetic inference approach on the more difficult problem of identifying stabilizing mutations to influenza hemagglutinin. Hemagglutinin is a 565-residue trimeric membrane-bound glycoprotein that mediates the binding and fusion of influenza virus with target cells, making it much larger and more complex than most proteins that have been successfully modeled using physicochemical approaches. Influenza has been the subject of intensive sequencing efforts, and so a large number of hemagglutinin sequences are available in the publicly accessible Influenza Virus Resource [101] (http://www.ncbi.nlm.nih.gov/ genomes/FLU/FLU.html). However, these sequences contain unusual patterns of phylogenetic relationship, due to the distinctive selection pressures operating on influenza in humans [102] and birds [103], as well as the fact that most sequencing has focused on a few subtypes of special interest (such as avian H5N1 and human H3N2 and H1N1 viruses). The complexity of the hemagglutinin protein and the strong evolutionary relationships among the available sequences are likely to make the prediction of stabilizing mutations a challenging problem for any method.
As an test data set, we used a collection of previously described mutants to the hemagglutinin of the A/WSN/33 (H1N1) influenza virus. This set contains two temperature-sensitive virus mutants that can replicate only at reduced temperatures (34uC but not 39.5uC) due to a failure of the hemagglutinin protein to be transported to the cell membrane at elevated temperatures [104]. The hemagglutinin proteins of these temperature-sensitive viruses also show an increased loss of hemagglutination activity at high temperature, suggesting general defects in both folding and stability [104]. Each of the two temperature-sensitive viruses is defective due to a different single amino-acid mutation in hemagglutinin [105]. These two temperature-sensitive mutations constitute our set of ''destabilizing'' mutations. For one of the two temperature-sensitive mutants (the one designated as ts-134 in [104][105][106]), a collection of second-site revertant mutations in hemagglutinin have been isolated by selecting for viruses that have regained the ability to grow at elevated temperatures [105,106]. These revertant mutations presumably enhance hemagglutinin's folding and/or stability. There are 16 different revertant mutations, which constitute our set of ''stabilizing'' mutations.
We tested the ability of the CUPSAT program, the consensus approach, and the PIPS program (using the informative priors) to distinguish the temperature-sensitive and revertant mutations. The CUPSAT predictions were made using the crystal structure of the A/PR/8/34 (H1N1) hemagglutinin (PDB code 1RVZ [107]), which is closely related to the A/WSN/33 (H1N1) hemagglutinin (90% sequence identity over the 487-residue portion of the protein present in the crystal structure). For sequence data, we used the full-length hemagglutinin sequences (lab strains excluded) present in the Influenza Virus Resource [101] at the time of our initial analysis (September, 2007). We made no restriction on the host species of the virus, since we assume that the basic requirements for protein folding and stability should be similar in all hosts. We restricted our analysis to those hemagglutinin subtypes with at least close to 50% protein sequence identity to H1 hemagglutinins (sequences from the H1, H2, H5, H6, H8, H9, H11, H12, H13, and H16 subtypes) and excluded sequences from more distantly related subtypes (H3, H4, H7, H10, H14, and H15). This yielded 1,911 unique hemagglutinin sequences, which were used to build the phylogenetic tree shown in Figure 9. Figure 9 shows the predicted stability effects of the temperaturesensitive and revertant mutations from each of the three methods. The CUPSAT program had no ability to distinguish the temperature-sensitive and revertant mutations, since it predicted the stability effects of all of these mutations to be clustered together near the center of the distribution of effects for all mutations. This suggests that either hemagglutinin is too large or mobile for effective physicochemical modeling, or that the CUPSAT program is overfit on the set of small soluble proteins on which it was parameterized (which includes cold shock protein, ribonuclease HI, and thioredoxin). The consensus approach could partially distinguish the temperature-sensitive and revertant mutations, predicting most of the revertant mutations to be more stabilizing than the temperature-sensitive mutations. However, the PIPS program was clearly the most successful approach, cleanly predicting that all of the revertant mutations should be more stabilizing than both of the temperature-sensitive mutations. These results support the findings of the previous section that the PIPS program is more accurate than either the physicochemical modeling program or the consensus approach, and suggest that the extent of its superiority over physicochemical modeling is greater for more complex proteins such as hemagglutinin.

Prediction and experimental verification of new stabilizing mutations to influenza hemagglutinin
We next tested whether the phylogenetic inference approach could predict entirely new stabilizing mutations to influenza hemagglutinin. Our experimental strategy for performing this test was to introduce stabilizing mutations predicted by PIPS into A/ WSN/33 (H1N1) hemagglutinin carrying a known temperaturesensitive mutation (the single mutation responsible for the ts-134 phenotype [105]) and examine whether these predicted stabilizing mutations actually allowed the virus to grow at elevated temperatures.
The PIPS analysis described in the previous section identified 23 different mutations to A/WSN/33 H1 hemagglutinin that were predicted to be the most highly stabilizing (these are the mutations with PIPS DDG values less than 25 that appear in the small leftmost bar of the histogram in Figure 9). Seven of these mutations are to residues in the antigenic sites of H1 hemagglutinin (as delineated in [108]), and so are likely subject to positive selection for diversification. Since one of the basic assumptions of the phylogenetic inference approach is that mutations are selected only for their effects protein stability, we excluded these seven mutations. Another mutation occurs in the N-terminal signal sequence, and so was excluded since it is not present in the final folded structure. Three of the mutations occured in the HA2 polypeptide; we excluded these three mutations since the temperature-sensitive mutation is found in the HA1 polypeptide. This left 12 predicted stabilizing mutations in the HA1 polypeptide. The locations of these predicted stabilizing mutations in the three-dimensional structure are shown in Figure 10. None of these mutations is among the known revertants [106] described in the previous section.
We introduced these 12 predicted stabilizing mutations into the hemagglutinin gene on the background of the temperaturesensitive mutation using site-directed mutagenesis. We then created the mutant viruses at 34.0uC using the influenza reverse genetics system [109], as described in more detail in the Methods section. The viruses were then plaqued on confluent Madin Darby  Table 1 summarizes the results of these plaque assays. The wildtype virus plaqued at all five temperatures, with some reduction in plaque size and clarity at 40uC. The virus carrying the temperature-sensitive mutation in hemagglutinin plaqued at 34.0uC and 35.5uC, formed smaller and more opaque plaques at 37.0uC, and formed no visible plaques at 38.5uC and 40.0uC. Of the 12 mutant viruses, one failed to express in the reverse genetics system. Three appeared to be slightly less stable than the temperature-sensitive parent virus, plaquing only at 34.0uC and 35.5uC. Four had similar profiles to their parent virus, plaquing well at 35.5uC but only weakly at 37.0uC. The other four mutant viruses exhibited clearly enhanced thermotolerance, plaquing well at 37.0uC and weakly at 38.5uC.
To confirm the increased temperature stability of viruses carrying the four apparently stabilizing mutations, we re-grew the viruses from the encoding plasmids and again plaqued them at various temperatures that now included 38.0uC. The results of these plaque assays are shown in Figure 11. All four mutants were clearly more thermotolerant than their temperature-sensitive parent, although still less so than the wildtype virus. To test whether the stabilizing mutations had cumulative effects, we constructed a double-mutant carrying two of the stabilizing mutations, and a triple-mutant carrying three of the stabilizing mutations. As can be seen in Figure 11, these multiple mutants were more thermotolerant than the single mutants, as indicated by better plaquing at 38.5uC.

Discussion
The most compelling evidence for the essential validity of the phylogenetic inference approach presented here is also the source of its greatest potential utility -the fact that it is able to predict experimentally measured mutational effects on stability. We found that it predicted known DDG values for single amino acid mutations to small soluble proteins with an accuracy exceeding that of either of two existing strategies, the consensus approach or a state-of-the-art physicochemical modeling program. Phylogenetic inference also was able to distinguish between known temperature-sensitive and revertant mutations to influenza hemagglutinin, a large multimeric protein that evolves under distinctive selection pressures. The extent to which phylogenetic inference outperformed the consensus approach and especially physicochemical modeling was greater for hemagglutinin than for the small soluble proteins, suggesting that it may be most useful on precisely the more complex proteins that are often of greatest interest in biology and biomolecular engineering.
Our most stringent test of the phylogenetic inference approach was to use it to predict new mutations to hemagglutinin that rescued the growth of a temperature-sensitive influenza virus. Of the 12 predicted stabilizing mutations, four were indeed detectably stabilizing, four had little effect, three were slightly destabilizing, and one appeared to be lethal. How good (or bad) was this performance? Because we did not experimentally test CUPSAT and consensus predictions of stabilizing mutations to hemagglutinin, we cannot directly compare these two methods to phylogenetic inference in this respect. Comparison of the three methods on the set of previously known stabilizing mutations to hemagglutinin (Figure 9) strongly suggests that CUPSAT is unable to reliably distinguish stabilizing and destabilizing mutations to hemagglutinin, but only weakly suggests that phylogenetic inference is superior to the consensus approach in this regard. It therefore remains possible that the consensus approach would have made equally successful predictions. Another benchmark of the phylogenetic inference approach's predictions would be a comparison with a set of random single amino acid mutations to hemagglutinin. But the extensive amount of work required to generate and characterize such a panel of random mutants dissuaded us from carrying out such an experiment. Others seem to have been similarly dissuaded, since we are unaware of any published analyses of the stability effects of truly random Figure 8. Performance of the phylogenetic inference approach as a function of the number of sequences used. The PIPS predictions using informative priors were run using subsets of all of the available protein sequences. The resulting DDG predictions were then correlated with the experimental DDG values (top) or the PIPS DDG predictions obtained using all available sequences (bottom). The R 2 values are the squared Pearson correlation coefficients. For each number of sequences used, the PIPS predictions were made using 10 different random sequence subsets, and the displayed R 2 values are the average correlations over these 10 subsets. For cold shock protein, the subsets were made at intervals of 20 sequences, while for ribonuclease HI and thioredoxin they were made at intervals of 10 sequences. doi:10.1371/journal.pcbi.1000349.g008 mutations even to more experimentally tractable proteins. However, there have been coarse-grained analyses in the form of protein engineering experiments that screen for random mutations that enhance stability. Such experiments typically isolate one detectably stabilizing mutation for every 300 to 1,000 screened (frequencies of 0.4% for an esterase [110], 0.1% for subtilisin [111], 0.1% for a haloalkane dehalogenase [112], 0.2% for a phytase [113], and 0.1% for a fructosyl-amino acid oxidase [114], although methodologies vary widely). Assuming these frequencies can be extrapolated to hemagglutinin, the phylogenetic inference approach's success rate of four in 12 represents an improvement of two to three orders of magnitude over the random expectation -although of course two-thirds of the predicted stabilizing mutations still failed to enhance the virus's thermotolerance. Given these results, as well as the improved but still imperfect predictions of known DDG values, we can simultaneously ask both why the phylogenetic inference approach performs so well and why it does not perform better.
The phylogenetic inference approach performs so well because it ties protein stability to the underlying selection pressures, and so can draw from the full evolutionary histories of homologous proteins. Existing sequence-based strategies such as the consensus approach only consider the final evolved sequences, and so may miss some of the information contained in the substitution probabilities implied by the protein phylogeny. Physicochemical modeling utilizes knowledge about the biophysical forces that determine a protein's structure. But analyzing mutations with physicochemical modeling is more difficult than simply scoring the relative energies of different conformations of the same sequence, since a mutation can induce a change in the unfolded state. Computational descriptions of the unfolded state are still in their infancy, so it may be a long time until physicochemical modeling incorporates all of the subtleties needed to make fully accurate predictions. However, the phylogenetic inference approach leverages the incomplete but substantial knowledge already encapsulated by physicochemical modeling to build informative priors. These priors serve as reasonable initial guesses for the mutational effects on stability, which are then improved based on the substitution probabilities extracted from the protein phylogenies. Both the power of physicochemical modeling and the number of available protein sequences are likely to continue to increase, and as they do, the accuracy of the phylogenetic inference approach should improve correspondingly.
Why does the phylogenetic inference approach not perform better? The approach involves a number of mathematical and conceptual approximations. We are inclined to believe that the most limiting is the idea that all selection on amino acid substitutions occurs along the single additive dimension of protein stability. Clearly this assumption is inaccurate for the (probably small [36][37][38][39]) fraction of residues specifically involved in protein function. But it is also imperfect for the much larger fraction of residues with no direct functional role. These residues are constrained by selection on properties in addition to stability, including folding efficiency [44], kinetic stability [16,45], and resistance to aggregation [40][41][42][43]. Furthermore, even the biologically relevant measure of stability is somewhat unclear. The study of protein stability was pioneered [115] on small proteins that fold reversibly in vitro, allowing for true thermodynamic measurements of DG f and DDG values [53]. However, many proteins do not fold reversibly [45,54,116], and even for those that do, the measured stabilities can be sensitive to the solvent conditions [89,117,118], which are usually quite different from the in vivo cellular milieu [119]. The saving grace from these complications is that different measures of protein stability (thermodynamic, thermal, chemical, proteolytic, kinetic) are substantially correlated with each other [16,55,61,120], and to a Figure 10. Locations of the predicted and confirmed stabilizing mutations to H1 hemagglutinin. The full hemagglutinin trimer is shown in green, with the HA1 chains in dark green and the HA2 chains in light green. The temperature-sensitive mutation (ts-134 [104][105][106]) is shown with red spheres. The yellow spheres show the mutations that were predicted to be stabilizing by the PIPS program. The blue spheres show the four predicted mutations that were experimentally confirmed to actually increase the temperature stability. The structure is PDB code 1RVZ [107]. doi:10.1371/journal.pcbi.1000349.g010 lesser degree with folding efficiency [46][47][48][49] and resistance to aggregation [40]. The phylogenetic inference approach works to the extent that all of these properties can be grouped under the generalized concept of protein stability, and fails to the extent that mutations have distinct effects on each of them. So the inability of some of the predicted stabilizing mutations to rescue influenza's thermotolerance simply means that they did not compensate the original hemagglutinin defect (poor transport from the Golgi and decreased resistance to thermal inactivation [104])-they may still benefit related properties that were not compromised in this particular virus. Ultimately, such issues can be addressed only by relating the full spectrum of a mutation's biophysical effects to its tendency to be fixed by evolution, a type of analysis that should also help resolve the hotly debated question of what selection pressures account for observed patterns of protein evolution [121,122]. Despite these issues, the approach presented here is a clear conceptual improvement over the traditional concept of matrices specifying fixed ''average'' amino acid substitution tendencies that are unrelated to on any specific experimental measurement. Even recent work [19][20][21][22][23][24][25][26][27][28] that uses sophisticated simulations or structural analysis to derive site-specific substitution matrices ultimately fails to connect the substitution tendencies along protein phylogenies to any experimentally tangible properties of the mutations. By making such a connection, our approach reverses the usual tactic of maximum likelihood and Bayesian phylogenetic tree reconstruction. In those methods, some amino acid substitution model is assumed, and then used to infer a phylogenetic tree. Here we have assumed the phylogenetic tree, and then used it to infer the effects of individual mutations on stability. Ultimately, it would be most satisfactory to infer both the phylogenetic tree and the stability effects directly from the protein sequences, perhaps with the assistance of informative priors derived from physicochemical modeling. Performing such a dual inference would of course raise daunting computational issues of adequately sampling from the distributions of both possible tree topologies and mutational effects. However, progress in such a direction could ultimately lead to strategies for analyzing homologous sequences that yield useful information about both evolutionary histories and protein biophysics.

Cloning of plasmids
The eight bidirectional polymerase I/polymerase II influenza reverse genetics plasmids [109] for the A/WSN/33 (H1N1) strain (pHW181-PB2, pHW182-PB1, pHW183-PA, pHW184-HA, pHW185-NP, pHW186-NA, pHW187-M, and pHW188-NS) as well as the null cloning plasmid (pHW2000) were kind gifts from Robert G. Webster at St. Jude Children's Research Hospital. The plasmid pHW184-HA-ts134 was constructed by introducing the single mutation responsible for the ts-134 temperature-sensitive phenotype [105] (Y173H in the numbering scheme where the Nterminal methionine is zero) into hemagglutinin by strand overlap extension PCR, and cloning the insert into the BsmBI restriction sites of pHW2000. A similar procedure was then used to individually construct the 12 predicted stabilizing mutations shown in Table 1 on the background of this temperature-sensitive mutation to yield the plasmids pHW184-ts134-D51K, pHW184-ts134-D51R, etc. The accuracy of all plasmids was confirmed by sequencing the hemagglutinin genes and immediate flanking sequences.
During influenza infections, cells were maintained in influenza growth medium with trypsin (IGM+T), consisting of OptiMEM I (Gibco 31985) supplemented with 0.01% HI FBS, 0.3% bovine serum albumin (BSA, Invitrogen 15260-37), P/S, 100 mg/ml calcium chloride, and 2 mg/ml of tosyl-phenylalanyl-chloromethyl-ketone (TPCK)-treated trypsin (Sigma Aldrich T-8802) Figure 11. Plaque assays of wildtype, temperature-sensitive (ts), and ts influenza with predicted stabilizing hemagglutinin mutations. All four of the single mutations allow the virus to plaque at higher temperatures than the ts parent. The multiple mutants plaque more effectively at higher temperatures than the single mutants. Mutations are named according to the numbering scheme described in Table 1. doi:10.1371/journal.pcbi.1000349.g011 [123]. For plaque assays, 2X IGM+T was prepared from OptiMEM I powder packets (Gibco 22600-050) with 2.4 g of sodium bicarbonate per packet in addition to 2X concentrations of the other components of IGM+T. The TPCK-trypsin was always added fresh immediately before use.

Influenza reverse genetics
The influenza virus was reconstituted from the eight bidirectional reverse genetics plasmids [109] by co-transfecting 250 ng of each plasmid into a co-culture of MDCK and 293T cells in a 6well plate. The co-cultures were seeded the day before with 5610 5 293T and 3610 5 MDCK cells so that the plates were 50-80% confluent at the time of transfection. All transfections were performed using Mirus Transit293 transfection reagent. Posttransfection, plates were maintained at 34 muC in order to allow growth of temperature-sensitive viruses. At 12-18 hours posttransfection, the media was changed to IGM+T (with two washes with phosphate buffered saline, PBS). After 24 hours of growth in IGM+T, 500 ml of the supernatant was passaged to fully confluent MDCK cells in IGM+T to expand the virus. The supernatant from the passage plate was collected after an additional 24-48 hours of growth, at which point significant virus-induced cell cytopathic effects were typically observed. The virus-containing supernatant was passed through a 0.45 mm filter, aliquoted, and stored at 280uC. All experiments involving influenza virus were performed in accordance with Biosafety Level 2 containment procedures.

Plaque assays
For viral plaque assays, 6-well plates were seeded with 3.5610 5 MDCK cells per well so that they reached full confluence in 48 hours. Frozen aliquots of virus were thawed and serial 10-fold dilutions of virus were made in IGM+T. The confluent MDCK cells were washed twice with PBS, and then inoculated with 700 ml of the appropriate virus dilution. The 6-well plates were then transferred to a tissue culture incubator set at the appropriate temperature for 45 minutes, with occasional gentle tilting of the plate to spread the inoculum. An overlay medium was prepared by mixing equal volumes of 2X IGM+T and a 2.4% Avicel microcrystalline cellulose (FMC Biopolymer RC-581) suspension [124]. After the 45 minute incubation, 3 ml of overlay was added to each well and the plates were grown at the appropriate temperature undisturbed for 72 hours. The overlay was then removed by aspiration and the residual Avicel was removed by washing twice with PBS. The cell layer was stained by a 10-20 minute incubation with 0.1% crystal violet in 20% ethanol. The stain was removed with two additional PBS washes, and the plaques were photographed using a gel imager to yield photos like those shown in Figure 11.
Every effort was made to perform the plaque assays consistently, but there was still moderate variation in plaque size, number, and morphology when virus from the same stock was independently plaqued on different days (possibly due to slight variations in the conditions of MDCK cells). Because of the large amount of labor involved, it was of course impossible to perform all of the plaque assays on the same day. Figure 11 shows representative results, but some of the variation in plaque size and morphology may still be due to day-to-day variation. However, all mutants shown in Figure 11 were plaqued in independent experiments on different days using different initial viral stocks, and presence/absence of plaques at the different temperatures was repeatable, despite the modest variations in plaque morphology As can be seen in Figure 11, a crescentshaped patch sometimes appeared in the lower-left corner of the MDCK monolayer. This patch occasionally appeared even in the absence of virus, and is probably due to cell drying or death rather than viral growth.