Correlated Electrostatic Mutations Provide a Reservoir of Stability in HIV Protease

HIV protease, an aspartyl protease crucial to the life cycle of HIV, is the target of many drug development programs. Though many protease inhibitors are on the market, protease eventually evades these drugs by mutating at a rapid pace and building drug resistance. The drug resistance mutations, called primary mutations, are often destabilizing to the enzyme and this loss of stability has to be compensated for. Using a coarse-grained biophysical energy model together with statistical inference methods, we observe that accessory mutations of charged residues increase protein stability, playing a key role in compensating for destabilizing primary drug resistance mutations. Increased stability is intimately related to correlations between electrostatic mutations – uncorrelated mutations would strongly destabilize the enzyme. Additionally, statistical modeling indicates that the network of correlated electrostatic mutations has a simple topology and has evolved to minimize frustrated interactions. The model's statistical coupling parameters reflect this lack of frustration and strongly distinguish like-charge electrostatic interactions from unlike-charge interactions for of the most significantly correlated double mutants. Finally, we demonstrate that our model has considerable predictive power and can be used to predict complex mutation patterns, that have not yet been observed due to finite sample size effects, and which are likely to exist within the larger patient population whose virus has not yet been sequenced.

where q i is the charge of atom i, B i is its Born radius, f ij = r 2 ij + B i B j exp(−r 2 ij /4B i B j ) is the GB distance between atoms i and j (r ij is the distance between the charges), and u e = 1/2(1/ǫ w − 1/ǫ in ) (ǫ in is the solute dielectric constant and ǫ w is the solvent dielectric constant) [1]. We used the wildtype crystal structure of HIV protease subtype B (PDB ID 1NH0) [2] to compute G (f ) e , and a maximally extended chain A from 1NH0 with backbone dihedral angles set to 180 • (except for proline) and sidechain rotamer states set to all-trans to compute G (u) e . For both the folded and unfolded structures, unit charges, corresponding to a specific charge signature for the 99 residues, were placed on the most distal carbon atoms. All other side chains atoms are neutralized. Partial charge dipoles of ±0.4e are placed on every backbone amide and cabonyl group to preserve helix dipole effects [3].
Posing the inverse problem for the Potts model for sequence probabilities Our Potts model for correlated electrostatic mutations deals with a reduced sequence length of 18 (from 99 amino acids) and a reduced amino acid alphabet size of 3. For length N = 18 and a Q = 3 letter alphabet (0, +, -), there are 3 18 = 387, 420, 489 possible sequences or unique charge signatures. For every signature, we wish to calculate the probability of that sequence under two models; an independent model which preserves the database derived univariate marginals and a mean field model (Bethe approximation) which preserves both the database derived univariate and bivariate marginals. We call this the pair correlation model and we refer to the probabilities of a sequence under the independent and pair correlation model as P 1 and P 2 respectively. P 1 probabilities of individual sequences can be obtained by simply taking the product of the observed univariate marginals at each position.
where A i = {0, +, −} is one of the three possible charges at position i and P obs i (A i ) is the observed univariate marginal of charge A i at position i. All observed univariate and bivariate frequencies are derived from the Lee database [4].
The probability of a sequence under the pair correlation model P 2 , on the other hand, is determined by a model which generates sequences that preserve both the observed univariate and bivariate marginals. In order to do so, we fit the sequence signature probabilities to a 3-state Potts model where the Hamiltonian, H is described by field and coupling parameters, which reflect the mutation frequency at a site and the strength of the statistical coupling between two sites respectively.
where P 2 (A 1 , ..., A N ) is the probability of a sequence of length N consisting of charges A i at position i which preserves the univariate and bivariate marginals. The Hamiltonian can be defined as is the coupling between charges A i and A j at positions i and j and Z is the partition function.
This model is the maximum entropy solution to the probability distribution that matches the single-point and double-point correlations [5]. Note that if there were two possible states at each site instead of three, this Potts model would be equivalent to the famous Ising model, which is widely applied in the study of spin glass systems in statistical physics. For a system of 18 positions and 3 states, 18 × 3 = 54 λ i (A i ) and 18 2 × 3 2 = 1377 λ ij (A i , A j ) parameters need to be determined to accurately describe the Hamiltonian which preserves the observed marginals. But the paramaters are not independent as we can apply conditions known as gauge constaints which connect the parameters [6]. For each position Ai λ i (A i ) = 0 and for each pair of positions Ai λ ij (A i , A j ) = 0. This results in two free field parameters per position and four free coupling paramaters for every pair of positions. In this work, following Weigt et al. [6], we have chosen the free parameters so as to maximize the fields and minimize the couplings, on average. In a future communication, we will investigate how the choice of the free parameters affects the information carried by the couplings about spatial proximity.
This inverse problem of determining the fields and couplings, given the univariate and bivariate marginals, is computationally challenging [6]. This problem has been described in the literature as the inverse pairwise Ising problem (for two states) and it is computationally expensive because exact methods to determine the marginals from an initial set of Hamiltonian parameters are slow and therefore iterative procedures to search for the field and coupling parameters for many positions and more than two states is unfeasible. Our own previously described method of fitting pair marginals using iterative proportional fitting (IPF) of log-linear model parameters is slow and may not converge within a desired time frame for the problem at hand [7]. Other proposed methods such as Monte Carlo sampling have been applied on Ising models with a few sites, but may require exponential computational time to converge [8]. The approach we have taken involves iterative inference on a probabilistic graphical model using belief propagation described by Weigt et al. 2009 [6,9]. The difference between our approach and the approach taken by Weigt and coworkers is that while converging the bivariate marginals, we use a mean field model which includes pair correlations in the Bethe approximation [10]. Applying the Bethe mean field appoximation consistently is just as accurate as using the fluctuation dissipation approach taken by Weigt and coworkers [6,9]. We will compare the two approaches in a future communication.

Outline of the algorithm
The algorithm iteratively converges upon the field and coupling parameters using gradient descent. The outline of the algorithm is as follows.
1. For a given set of field and coupling parameters, determine the corresponding univariate and bivariate marginals using Belief Propagation and the Bethe mean field approximation.
2. Compare these computed marginals to the observed marginals and update the field and coupling parameters.
3. Repeat steps 1 and 2 until the Bethe mean field approximated univariate and bivariate marginals determined from the updated fields and couplings converge to their observed values from the sequence alignment.

Belief propagation
Belief Propagation (BP) or the Sum-Product algorithm is an iterative procedure applied to efficiently calculate all the marginals in a tree-like graph [11]. It consists of leaf nodes passing messages to their parents, which in turn process these messages and pass them onwards towards the root. The root then sends messages back to its children nodes and so on until the messages eventually reach the leaf nodes. At this point, for acyclic graphs, the messages will converge [12,13]. For cyclic graphs, this method is approximate but may converge after several cycles of message passing [14][15][16].
The self-consistent belief propagation equations we have implemented in this paper follow the approach of Weigt et al, 2009 [6]. The probability distribution of the pair correlation model P 2 is given by Equation 3. For this distribution, the corresponding BP message update rule is where P i→j (A i ) is the local message passed from node i to node j. This message is a function of the field at i and the product of all incoming messages from the neighbors of i, not including j. The BP propagation messages are passed locally between nodes with random initial values for the messages.
Updates are made and the process is repeated until the messages converge. The proportionality constant is such that the messages at a site sum to 1. Once the messages have converged, marginals are evaluated by taking the product of the field at a site with all the incoming messages to that site Since our implementation of the network is a completely connected undirected graph, with all nodes interconnected to one another, belief propagation is not guaranteed to converge [12,16,17]. However belief propagation on cyclic graphs, called loopy belief propagation, may closely approximate the solutions after several iterations [14][15][16]. For our problem, the marginals are known quantities and it is the fields and couplings that we wish to find. Therefore, we actually have an inverse problem; we need to find the fields and couplings given the marginals. This can be achieved by taking the ratios of Equations 6 and 7, a trick described by Weigt et al. 2009 [6,9], thus allowing us to write the message from i to j in terms of the known marginal at i.
Equation 8 can be used to force the univariate marginals estimated by BP to be the observed marginals. As a result, the field parameters never require updating; once the messages converge, the fields can be explicitly calculated using Equation 7. In other words, the univariate marginals are always conserved.
On the other hand, the predicted bivariate marginals need to match the observed bivariate marginals. This can be approximated by the following equation: where A i and A j are the mutations at positions i and j, λ ij (A i , A j ) is the statistical coupling parameter between i and j, P i→j (A i ) is the message passed from i to j, P j→i (A j ) is the message passed from j to i and Z is the partition function. This equation has been proven by Yedidia and coworkers to be mathematically equivalent to the Bethe approximation, a mean field model, and is what we apply in our code to approximate the bivariate marginals in our system [12,15,17].
Algorithm in detail , which is the database derived frequency of a double mutation. If the couplings have converged, then stop. If the couplings have not converged by a desired amount, update λ ij (A i , A j ) as follows where ǫ is the gradient descent step size, set to 0.0001, and repeat steps 2, 3 and 4 until the pair probabilities converge.

Sampling sequences from probability distributions
Several tests of the finite size effects require us to sample sequences drawn from the probability distribution described by either the independent model or the pair correlation model. To randomly sample sequences from these models, we make use of the inverse transform sampling algorithm. In the first step of this algorithm, a cumulative distribution function (CDF) is constructed from the modelled probabilities for a subset of sequences, for example all sequences with 6 mutations. Using subsets allows us to reduce the universe of available sequences and enables faster sampling. Random floating point values between 0 and 1, from a random number generator, are then used to inversely map the CDF back to a specific, randomly picked sequence. Quantile functions for this inverse mapping are saved in a lookup table to allow for efficient sampling. The samples drawn are independent and uncorrelated; hence no burn-in time is required. In effect, we use a Gibbs sampler, i.e. sequences are drawn according to their probabilities under the model and all draws are accepted.

Associating mutation patterns with protease inhibitors
A drug-annotated sequence alignment consisting of 38,420 HIV protease isolates of multiple subtypes was downloaded from the Stanford HIV database on April 7th, 2010 [18]. This dataset was used determine which inhibitors are significantly associated with specific electrostatic mutations patterns. Since multiple isolates in this database are associated with a single patient, only the most recent subtype B isolate for each patient was extracted, leaving 13,286 protease sequences. Upon examination, many sequences come from patients undergoing antiretroviral therapy with one or more protease inhibitors. However, the majority of sequences come from patients who have not been exposed to any drugs. The difference in the proportion of sequences with a particular mutation pattern in the drug-naive cohort as compared to the proportion of sequences with the same mutation pattern but exposed to a specific drug cocktail, can be used as a measure of association between that drug cocktail and the mutation pattern.
To do find significant associations, the sequence alignment was converted into strings of charge states "n", "-" and "+" as described above, and used to calculate p drug , the proportion of sequences with a unique mutation pattern and exposed to a particular set of drugs, and p naive , the proportion of sequences with the same mutation pattern but exposed to no drugs. A pooled sample proportion t-test was then performed to determine the significance of association for a drug combination with a group of mutations, with the z-score for the null hypothesis of p drug − p naive = 0 given by The standard error, SE is where p = (p drug n drug + p naive n naive )/(n drug + n naive ) is the pooled sample proportion, n drug is the number of sequences exposed to the particular combination of drugs, and n naive is the number of sequences not exposed to drugs.   Figure S3. Comparison between the observed and predicted mutivariate marginals for 2, 3 and 4 mutations. Predicted marginals determined using belief propagation in the Bethe approximation are plotted against the observed marginals for sets of 2, 3, and 4 mutations. The correlation between P bethe ij and P obs ij is 1.00. The correlation between P bethe ijk and P obs ijk is 0.98. The correlation between P bethe ijkl and P obs ijkl is 0.90. Figure S4. Distribution of pair correlation model probabilities for sequences in the tail of the Lee distribution that are observed (red) or unobserved (blue) in the Stanford database. The histogram in red is the distribution of pair correlation model probabilities for sequences found in the tail of the Lee database that also exist in the Stanford database. The histogram in blue is the distribution of pair correlation model probabilities for sequences that are not observed in the Stanford database. The null hypothesis which states that the means of these two distributions are equal, has a low p-value of < 10 −4 , indicating that the null hypothesis must be rejected. Therefore, the difference between the means of these two distributions is statistically significant.   Tables   Table S1. Electrostatic mutation patterns with the highest probabilities under the pair correlation model and the drug combinations they are most strongly associated with. Shown are the top 5 patterns with 2, 3 and 4 electrostatic mutations for which the pair correlation model predicted probability, P 2 , is the highest, together with the drug combination they are most significantly associated with. Drug combinations are listed in order of treatment. The test of statistical association between drugs and electrostatic mutation patterns is based on the the Stanford database [18] (SI Methods). The proportion of sequences with the mutation pattern and exposed to a specific drug was compared to the proportion of sequences with the same mutation pattern but exposed to no drugs. The null hypothesis is that that the two proportions are equal, and the p-value to test the significance of this hypothesis is listed alongside the drug combination. NFV: Nelfinavir, IDV: Indinavir, SQV: Saquinavir, RTV: Ritonavir, APV: Amprenavir. The acronym PI, protease inhibitor, is used in the Stanford database when the drug was unknown. The D30N, N 37D, Q61E, N 88D pattern is not significantly associated with any drug combination.  Table S2. Prediction of novel electrostatic mutation patterns. Shown are 25 electrostatic mutation patterns with the highest probabilities under the pair correlation model that are not observed in the Lee database [4]. P 2 is the probability of the sequence under the pair correlation model, N LEE is the number of times the mutation pattern was found in the Lee database [4], N ST is the number of times the mutation pattern was found in the Stanford database [18]. If the sequence is found in the Stanford database, it may be significantly associated with specific drugs combinations. The drug combinations listed are in order of treatment and have strong p-values of association with the mutation pattern. The test of statistical association between drugs and electrostatic mutation patterns is described in SI Methods. NFV: Nelfinavir, IDV: Indinavir, SQV: Saquinavir, RTV: Ritonavir, APV: Amprenavir. The acronym PI, protease inhibitor, is used in the Stanford database when the drug was unknown.  Table S3. The most statistically deviated pairs of mutations. The 10 most statistically deviated double mutations in the Lee database relative to the independent model. The measure used to test for deviation is dev(i, j) =