Strong Selection Significantly Increases Epistatic Interactions in the Long-Term Evolution of a Protein

Epistatic interactions between residues determine a protein’s adaptability and shape its evolutionary trajectory. When a protein experiences a changed environment, it is under strong selection to find a peak in the new fitness landscape. It has been shown that strong selection increases epistatic interactions as well as the ruggedness of the fitness landscape, but little is known about how the epistatic interactions change under selection in the long-term evolution of a protein. Here we analyze the evolution of epistasis in the protease of the human immunodeficiency virus type 1 (HIV-1) using protease sequences collected for almost a decade from both treated and untreated patients, to understand how epistasis changes and how those changes impact the long-term evolvability of a protein. We use an information-theoretic proxy for epistasis that quantifies the co-variation between sites, and show that positive information is a necessary (but not sufficient) condition that detects epistasis in most cases. We analyze the “fossils” of the evolutionary trajectories of the protein contained in the sequence data, and show that epistasis continues to enrich under strong selection, but not for proteins whose environment is unchanged. The increase in epistasis compensates for the information loss due to sequence variability brought about by treatment, and facilitates adaptation in the increasingly rugged fitness landscape of treatment. While epistasis is thought to enhance evolvability via valley-crossing early-on in adaptation, it can hinder adaptation later when the landscape has turned rugged. However, we find no evidence that the HIV-1 protease has reached its potential for evolution after 9 years of adapting to a drug environment that itself is constantly changing. We suggest that the mechanism of encoding new information into pairwise interactions is central to protein evolution not just in HIV-1 protease, but for any protein adapting to a changing environment.


Introduction
The interactions between the residues within a single protein (epistasis) often determine the structure and function of a protein [1]. Epistatic effects shape the protein fitness landscape and thus guide the evolution of a protein given its genetic background [2]. At the same time, the environment influences the fitness effects of mutations and their epistatic interactions: a drug-resistance mutation that is crucial in a drug environment might have a significant fitness cost in the absence of drugs, requiring compensatory mutations [3] that are likely epistatic. In order to predict evolution over the short term, exquisite knowledge of the fitness landscape is necessary [4,5], and epistastic interactions play a significant role in shaping that landscape [6][7][8][9]. Such interactions have been implicated in the evolution of drug resistance in influenza [10], malaria [11], as well as antibiotic resistance in Escherichia coli [12].
To understand the long-term evolution of a protein, evidence of epistatic interactions at one point in time is not sufficient, as interactions may change over time as the environment changes. However, as epistatic interactions are deduced from measuring the fitness effect of both singleton mutations as well as pairs, obtaining evidence of changes in epistasis is likely prohibitive. Here we show that we can use the mutual (shared) information between protein loci (residues) as a proxy for epistasis, and use publicly available databases to study the long-term evolution of epistatic interactions within a protein. Because these interactions also impact the amount of information a protein encodes about its environment, monitoring changes in epistasis also allows for more accurate estimates of how information evolves over time.
Using information as a proxy for epistasis is not a new idea (see, e.g., [13] and references cited therein, as well as [14] for an application to gene networks), and it has its limitations as positive information between residues is a sufficient (but not a necessary) condition for epistasis (see Supporting text S1). As a consequence, it is possible that two residues interact epistatically but show no information. (On the other hand, if two residues have positive information, they must interact epistatically). Keeping in mind this limitation, studying pairwise information instead of epistasis allows us to use the wealth of time-course sequence data of a protein to study its long-term evolution. We focus here on HIV-1's protease, a small protein necessary for production of mature and infectious HIV-1 particles that is a target of drugs used in treatment of HIV infection, because ample protein sequence data are available from treated and untreated patients spanning multiple years. Covariance between residues in HIV's protease has been discussed before [15][16][17]. We note that it has been suggested that such covariation between positions can in principle be due to population subdivision rather than epistasis [18], but such an effect is ruled out in the present data as the trends are consistent among all data sets sampled, while covariation due to population subdivision would show random effects depending on the subsampling of sequences. Thus, we assume here that all covariation is evidence of epistasis.
Because as of yet a cure remains elusive, the treatment of HIV-1 infection entails a life-long dependence on antiretroviral drugs. However, due to the high mutation rate of the virus [19], evolution of drug resistance is common, requiring frequent screenings for resistance mutations and subsequent adjustments in treatment. The list of resistance mutations [20] is updated periodically and is useful in designing treatment strategies [21]. While mutations that cause resistance to protease inhibitor drugs diminish the ability of the drug to bind to the protease, these mutations are often associated with loss of viral fitness and protease stability [22]. Several compensatory mutations must follow that restore the protease stability, and together with the resistance mutations, they bring about coordinated structural rearrangements resulting in a protease that is both resistant and functional [23][24][25][26]. Thus, drug resistance is achieved by a combination of multiple resistance and co-occurring compensatory mutations, all of which are likely epistatic. Understanding the evolution of drug resistance thus requires us to understand the changes in epistasis that go on as the protein adapts to a landscape that is ever changing as new drugs enter the marketplace.
Since HIV-1 protease sequences from both treated and untreated individuals are publicly available, we can compare the evolution in a changing landscape that is likely to force changes in epistasis to the evolution in a "control" landscape that is constant in time. However, this "control trajectory" must be treated with care, as resistance mutations can spread into the untreated group via new infections [27,28]. Still, the available data provide an unprecedented opportunity to study changes in epistatic patterns in an adapting as opposed to a non-adapting population.
In the following, we first study how the selective pressure of a drug environment affects loci in the protease individually, and then focus on how interactions between loci are affected. The latter study allows us to calculate how the total amount of information the protein encodes about its environment evolves in response to changes in the environment. Finally, we study the network of epistatic pairs in the protein, and how the network changes as the protein adapts.

Results
High selection pressure in drug environment leads to high variation at multiple loci in the protease We analyzed the amino acid sequences of the HIV-1 protease from patients that never received any anti-viral treatment (untreated group), as well as patients that received treatment (treated group), collected in the years 1998-2006 (see Materials and Methods). The HIV-1 protease is a 99 residue enzyme that forms a homodimer which, in order to create the components for a new virion, cleaves the synthesized polyprotein into the active components [29]. Because an inactive protease results in uninfectious viruses, the protease has been one of the first targets of anti-viral drugs.
Calculating the per-site-entropy for each of the 99 positions (a measure of the amount of sequence variation at that position, see Methods) shows that some protease positions are highly variable even in the absence of treatment, but that more loci become variable upon treatment ( Figure 1). This increased sequence variability allows the protein to explore mutations that might confer resistance to the drugs. Moreover, per-site entropies gradually increase over the years, especially in individuals taking antiviral drugs (Supplementary Figure S1).
Studying the entropy differences between treated and untreated protease sequences highlights the protein loci that have undergone a marked increase in variation (positions 10,20,46,54,71,73,82, and 90; all of which have been previously associated with drug resistance,

Epistatic interactions between residues increase following treatment
Information about a protein's environment is stored not only in individual residues of the molecule, but also in the manner in which these residues interact epistatically. We calculated the information estimate I 1 (the component of information that disregards interactions) and the information measure I 2 that includes the pairwise dependencies as measured by mutual information between every pair of positions (see Materials and Methods). An increase in entropy per site corresponds to a decrease in per-site information (I 1 ), as observed for the treated protease sequences over the years (Figure 3, top panel, slopes for treated and untreated data are not significantly different). However, the sum of mutual information of all pairs of positions (the component of information due solely to pair-wise interactions) gradually increases ( Figure 3, middle panel), suggesting that the total information content of the protein shifts to pairwise correlations following treatment (slopes for treated and untreated data are significantly different, p ≤ 0.001).
As this pairwise mutual information is sufficient for epistasis (Supplementary text S1), the trend suggests that epistatic interactions are crucial in the evolution of protease in a drug environment. In contrast, the sum of pairwise mutual information, and hence a protein-wide measure of epistasis, remains fairly constant in a drug-free environment ( Figure 3, middle panel). It should be noted that most of the treated data for year 2003 came from two phase-III clinical trials that focused on antiretroviral drug tipranavir (2900 sequences out of ≈3400 sequences) [30]. Resistance to tipranavir requires accumulation of several mutations, more than the mutations required for other protease inhibitors, and this higher genetic barrier to resistance makes it suitable for salvage therapy for patients already experiencing resistance to other drugs. The substantial decrease in I 1 for the year 2003 thus can be attributed to an increased entropy as a consequence of accumulating an increased number of mutations required for resistance to tipranavir.
Adding the sum of the pairwise mutual information to I 1 gives I 2 , which measures the information content of protein while considering pairwise dependencies between positions in addition to per-site variability ( Figure 3, bottom panel, slopes for treated and untreated data are not significantly different). I 2 gives a more accurate measure of the information content of the protein, although it does ignore any further higher-order interactions between protein residues. While there is currently no evidence that such higher-order correlations are important, some authors have discussed this issue [31]. I 2 shows a slight increase in treated population at initial years, suggesting that the high selection pressure of drug-environment is increasing the overall information content of the protein, but this increase stabilizes in later years. Longitudinal data (protease sequences derived from the same patient at two time-points: first and second isolate) further supports the observation that treatment decreases I 1 but increases the sum of pairwise mutual information, i.e., treatment increases sitewise variability as well as the extent of epistatic interactions in the protein (Supplementary Figure S2). For patients that went from 'untreated' to 'treated' state, as well as patients that continued treatment, I 1 showed a slight decrease in the latter isolate (Supplementary Figure S3, top panel). Stopping treatment shows a slight increase in I 1 , suggesting that entropy (amino acid variation) decreases when the selection pressure of drugs is removed, however, this observation is based on a small sample size (153 sequences Figure 3: Estimates of the information content of the HIV-1 protease. Filled black circles represent untreated data and blue triangles represent treated data. I 1 (see Eq. 4) decreases over time in untreated as well as treated populations (top panel), indicating temporal increase in sequence variability. In contrast, the sum of pairwise mutual information significantly increases upon treatment (p ≤ 0.001, middle panel). On adding the sum of pairwise mutual information to I 1 , we obtain a comprehensive measure of information that considers pairwise interactions between residues (I 2 , Eq. 5). I 2 appears to increase over time in treated data, although the slope is not significantly different from that seen in the untreated data. Error bars represent ±1SD.
in 'treated to untreated' category). The sum of pairwise mutual information also increases when treatment is initiated or continued (Supplementary Figure S2, middle panel), suggesting that information is redistributed towards interactions upon treatment. We note that while the longitudinal data agrees with the temporal trends for I 1 and I 2 , it does not add statistical power to those conclusions due to small sample sizes.

Epistatic interactions become spatially localized under high selection pressures
Residues in a protein form a network of cooperative interactions that mitigate the effects of perturbations [32]. Allosteric communication in a protein is also mediated by a network of residues, propagating information to distant sites [33].
We can look at the changes in epistasis between pairs by studying the network of epistatic interactions, where a connection is drawn between any two loci that show significant positive Each network represents the 99 residue-long protease monomer chain, with nodes representing sequence positions (node with position 1 is colored in darker red) and edges representing the mutual information between pairs of nodes. The weight and thickness of an edge corresponds to the information between the nodes connected by the edge (only those edges with p ≤ 0.05 are shown). Pairwise information, and thus epistatic effects, are fairly constant in drug-free environment, but gradually increase in treated data.
information, see Fig. 4. This network also shows that the extent of protein-wide epistatic interactions increased over time in treated data, but not in the untreated data. We can map the high information loci on the protease structure (blue and red regions on the molecule in Figure 5) for the treated and untreated groups, and study the distance between epistatically interacting loci. While interacting loci are spatially apart in untreated data from the earliest time point (red lines and residues), this is not the case in treated data (blue lines and residues, Figure 5).
To study if there is a trend, we plot the distribution of information as a function of residue distance separating the interacting pairs within the molecule for early and late time points, as well as for treated and untreated groups. We find that the network of interactions becomes more localized when selection pressures are increased in a drug environment ( Figure 6, top row, data for year 1998). Epistatic interactions also appear to become spatially condensed over time from 1998 to 2006 in drug-free environment ( Figure 6), but is is not clear if the effect is solely due to the long-term evolution of the protease or due to complex treatment histories of patients on antiviral therapies.

Discussion
Epistatic interactions between residues shape the fitness landscape of a protein. However, fitness landscapes and epistatic interactions themselves are not constant and change with environment. Even for a singular evolutionary outcome, several alternative mutational path- Figure 5: Epistatic interactions between pairs of residues in treated (blue lines) and untreated (red lines) proteases from the year 1998. The interacting residues are numbered, and shown in red for untreated data, and in blue for treated data (on both chains). For clarity, we show the interactions deduced from the treated samples on the left chain of the HIV-1 protease homodimer only (blue lines), and interactions inferred for the untreated subjects on the right chain only (red lines). Only those interactions are shown where information is greater than 0.1, indicating strong epistasis. The epistatic interactions are spatially more distant for proteases evolving in the absence of drugs, and become more close-range upon treatment. Figure generated using Chimera [34].
ways are accessible to an evolving protein [5]. Thus, it is difficult to predict the evolutionary trajectory of a protein from fitness effects of mutations along with mutation rate and population size, albeit tracing the evolutionary history of a protein to understand the processes underlying its long-term evolution is feasible. While evolution experiments with bacteria, viruses, and yeast provide direct evidence of evolution-in-action to study various aspects of evolutionary dynamics and understand why certain evolutionary paths are chosen [35][36][37], indirect evidence such as sequence data collected over several years is a valuable resource to retrace the evolutionary steps taken by a protein on an adaptive fitness landscape over a long period of time. HIV-1 protease sequences are one such resource that contain the "fossils" of the evolutionary trajectory (albeit in a statistical form) of the protein, in an environment that is constantly changing due to introduction of several protease inhibitor drugs over the years.
The HIV-1 protease responds to the selection pressures of treatment by accumulating mutations that confer drug resistance. At the same time, several positions in the protease show considerable neutrality in the absence of treatment ( Figure 1). Indeed, a complete mutagenesis of the protease showed that several sites are insensitive to mutation in the absence of a selection pressure, and thus appear neutral [38]. Yet, some of those mutations that are neutral on their own often appear in tandem with known resistance mutations, possibly compounding the effect of the resistance mutations [39]. Even resistance mutations usually  do not confer resistance in isolation, requiring multiple mutations to appear before resistance is achieved [40]. Because on average a mutation destabilizes the protein fold by about 1 kcal/mol, proteins cannot accumulate multiple resistance mutations without running the risk of thermal instability [41]. How is it possible then that HIV-1 can become resistant to multiple drugs, often accumulating over 10 mutations in the protease alone [24,26,42]? We suspect that many of the mutations that have no direct effect on drug resistance are in fact restabilizing mutations that must occur in tandem with drug resistance mutations, ensuring that the fold is stable. High mutation rate and large population size of HIV-1 allows the virus to explore multiple mutations simultaneously, and thus is likely to quickly find beneficial epistatic interactions between protein residues. This is corroborated by our observation in Figure 3 that the sum of pairwise mutual information increases while I 1 decreases over time, as the protease adapts to multi-drug regimens, i.e., treatment increases variability per site as well as epistatic interactions, over time.
Administering different protease inhibitors as they reach the market guides the evolutionary trajectory of the protein by accumulating more mutations, as reflected by the increasing entropy and pairwise mutual information over the years. The increase in entropy at protease positions agrees well with the physiochemical changes in the residue properties at that position. These directed changes in physiochemical properties of protease positions could help answer why the observed evolutionary paths were chosen out of the numerous possibilities.
In recent years, the incidence of drug-resistance has been declining due to single-pill therapies that promote adherence to treatment [43]. This observation is in agreement with the stabilization of the temporal trends of I 1 , sum of pairwise mutual information, and I 2 in the later years ( Figure 3). However, the increase in epistasis as approximated by the sum of pairwise mutual information due to treatment is substantial, indicating the potential of the protein to achieve viability by correlated adjustments in the presence of drugs. High entropies at several positions in untreated proteases shows that the protein has a few neutral sites that are not strongly conserved (Figure 1), but treatment increases per-site entropies (and thus variability) as well as mutual information (and thus epistasis) between positions. This suggests that even in the face of increasing sequence variability, epistatic interactions constrain the evolutionary trajectory of the protein towards the desired outcome when confronted with high selective pressures such as treatment. Increase in epistasis increases the connectivity in the network of cooperative residue interactions, allowing the protein to absorb the stability loss of resistance mutations. Previous studies also find pairwise interactions between resistance mutations in the HIV-1 protease [17]. Despite its small size (only 99 residues), there is no evidence yet that the protein has reached its limit for finding correlated adjustments in response to treatment, even though the sum of pairwise mutual information that measures correlations between mutations has stabilized in later years (Figure 3). Better treatment designs such as single-pill once-a-day therapies may keep the virus load in check, and thus its potential for evolution.
It is interesting to note that while epistatic interactions accumulate over-time under highselection pressure environments (treatment), protease sites were relatively independent in its natural course of evolution (no treatment) (Figure 4). This supports the observation that the network of cooperative interactions in a protein is relatively sparse [33]. In addition, the spatial organization of epistatic interactions also became localized under selection pressures of drugs (Figures 6 and 5). The fitness advantage of restricting epistatic interactions to spatially close residues is not clear, although it may be that it is difficult to maintain distant epistatic interactions when the total number of epistatic interactions increase as seen due to treatment.
To our knowledge this is the first study that tracks the evolution of epistatic interactions within a protein over a significant period of time as a population adapts to a changing fitness landscape, along with a "control" population in a constant landscape. We find that overall the protein must increase the amount of information it has to function in the novel environment, but that single substitutions on average lead to a decrease in information as sequence variability is increased. The dual goals of having resistance mutations as well as increased information content can be achieved by storing information in the correlations between residues, giving rise to more and pairs with more and more significant epistasis between them. At the same time, epistatic effects appear to become confined to pairs that are closer to each other within the structure of the protein. Ideally, the present findings should be corroborated by longitudinal studies that sample sequences more frequently, and that measure epistasis directly rather than via the information-theoretic proxy. Such studies would help significantly to understand the tempo and mode of adaptation.

HIV-1 Protease sequences
The protease sequences for HIV-1 subtype B were collected from the HIV Stanford database [http://hivdb.stanford.edu] on September 17, 2013. Sequences were categorized based on the year they were collected from patients. We focused our analysis on the years 1998 to 2006, in which more than 300 protease sequences are available for both treated and untreated sequence datasets (Table 1). Sequences obtained from patients receiving ≥ 1 protease inhibitors are labeled as treated, while sequences from patients not receiving any protease inhibitors are labeled as untreated.  Treated  1998  376  680  1999  1145  946  2000  375  806  2001  476  575  2002  1982  464  2003  1237  3399  2004  2163  436  2005  1424  338  2006  1354  341 We also obtained longitudinal data (protease sequences collected from the same patient after a time interval) from HIV Stanford database for the following: i) patient was untreated at first as well as second isolate collection (untreated to untreated): 596 sequences, ii) patient untreated at first isolate collection but treated with protease inhibitors at second isolate collection (untreated to treated): 395 sequences, iii) patient treated at both isolate collections (treated to treated): 921 sequences, and iv) patient treated at first isolate collection but untreated at second isolate collection (treated to untreated): 153 sequences. The time between first and second isolate collections ranged from one month to a few years.

Information stored in HIV-1 protease
The information content of biomolecules can be estimated using information-theoretic constructs [44][45][46]. This information content is relative to the environment within which the molecule functions, and can change when the environment changes even if the sequence remains the same. Because a changed environment usually translates into reduced information content (reflecting reduced fitness), the virus seeks to recover the information and achieve previous levels by evolving drug resistance. Note that while levels of information content approaching the wild-type can be achieved, the newly acquired information is different from the information lost.
The (Shannon) entropy at each position i (the "per-site-entropy") in the protease sequence is calculated as: where p a is the probability of observing a th amino acid at position i in the protease sequence. These probabilities are obtained from examining multiple alignments of sequences. For convenience, we take logarithms to the base 20, so that the maximum entropy at each site is 1 mer, and the maximum entropy of a polymer of length is mers [45]. The mutual information between two positions i and j is given by: where H i,j is the joint entropy for positions i and j of the protease sequence [45][46][47][48]. The total information content of a protein of length (taking into account all correlations between residues) is given by (adapting a formula for the "multi-information" of events due to Fano, 1961): The terms not shown in Eq. (3) are the higher-order corrections I(i : j : k : m) etc., up to I(i 1 : i 2 : ... : i ), with alternating signs. Corrections due to interactions between three or more sites are expected to be small, but cannot be estimated using the present data because the samples are too small. H max is the sum of maximum entropy at every site, and thus is equal to [45]. We further define the first-and second-order information estimates I 1 and I 2 as follows: Thus, I 1 measures the information content of the protein without considering any interactions among protein residues, and I 2 includes information contained in pairwise interactions between protein positions over and above I 1 , but ignores any other higher-order interactions between residues. The second term in equation (5) is the sum of pairwise mutual entropies (informations) for all pairs of residues in the protein. Note that if information was only described by I 1 , mutations that provide resistance reduce the information in the ensemble, as they increase sequence entropy. Increasing information can then only be achieved via correlated mutations. Note that as we see the correlation information increase in response to drugs, we see also that some sites become more conserved, which also serves to increase information in the ensemble of sequences.

Finite-size corrections for entropy estimates
Small datasets do not correctly estimate the per-site entropies due to dataset-size dependent observed frequencies of residues: this introduces a bias in the entropy and mutual information calculations (see, e.g., [50,51]). Several priors and estimators have been proposed to estimate entropies from under-sampled probability distributions [51], and our analysis suggests that a sample size of 300 with NSB (Nemenmann, Shafee, Bialek) entropy bias correction gives reliable estimates of entropy and mutual information values (see Supplementary text S1). Since the number of treated and untreated protease sequences in our dataset is different across years (Table 1), we sampled (with replacement) 10 sets/year of 300 sequences each to account for sample size bias. We calculate bias-corrected entropies and pairwise mutual information for the subsampled datasets, and report the average values (along with ±1SD). Because several protease sequences had gaps at the sequence ends, we calculated the I 1 , I 2 , and sum of pairwise mutual information for a truncated sequence length (from positions 15 to 90 instead of positions 1 to 99 of the protease sequence, see Supplementary figure S3).

Statistically significant epistatic interactions
To determine the pairs of protein loci with statistically significant information (and hence epistasis) we shuffle the residues at each position in yearly datasets for untreated and treated protease sequences. This shuffling maintains the per-site entropies as the residue composition at each position is maintained (H i and H j ), but any covariation between residues at positions i and j is destroyed [16]. By generating 100 such randomized-by-column datasets for each original dataset, and recomputing information for every pair of positions from these randomized datasets, we determined the p-value for the original information for every pair of positions. P-value is computed as the fraction of times the information from the randomized dataset equalled or exceeded the original information] (if one randomized dataset gives information between a loci-pair higher than the original information, then the p-value for that information is 0.01). For every year, we selected only those position pairs that have significant information with p ≤ 0.05. These pairs are shown in Figure 4.

Statistical analysis of temporal trend of protein information
To compare the temporal trends of I 1 , I 2 , and sum of pairwise mutual information for untreated and treated sequence data, we first fit linear regression models to the yearly data using the R statistical analysis platform [52] and then determine if the slopes of the regression models (for treated and untreated datasets) are significantly different or not.
Although the data used in this analysis is not longitudinal (protease sequences are collected from different patients participating in different clinical trials/studies across the years), the linear regression model is fit to the average values of the response variables (I 1 , I 2 , and sum of pairwise mutual information) calculated by sampling ten sets of 300 sequences each, and thus reflect the approximate temporal trends of the response variables with respect to the two factors (untreated and treated).

Relationship between information and epistasis
We derive replicator-mutator equations for a simplified model of two alleles at two loci, and show that in such a model, positive information is a sufficient, but not a necessary condition for epistasis (Supplementary text S1). The model can be generalized to two loci with 20 alleles in a straightforward manner. Thus, high information between two loci guarantees that there is epistasis between them, but there may be epistatic pairs that are missed by an information-theoretic analysis.

Distance between protease residues
Inter-residue distances were determined using Bio.PDB, a biopython module for analysis of crystallographic structures [53]. Since the protease is a dimer but the sequence data is that of a single chain, we assume that both chains are identical and compute distances between residues from protease chain A in PDB structure 1F7A [54,55].

Relationship between information and epistasis
In order to understand the relation between information and epistasis between two loci, we derive the replicator-mutator equations for two loci in mutation-selection balance, to show that the fitness differential between them due to epistasis creates a genotype frequency distribution that gives rise to information between the loci. Let us consider four genotypes (two alleles 'A' and 'a', at two loci): AA: type 0 (our wild type) Aa: type 1 aA: type 2 aa: type 3 that undergo mutation with rate µ per unit time (see figure below): The probability to find each of these genotypes in an infinite population depends on the fitness and probabilities of the other genotypes. In a discrete update scheme, the probability to find type i at time t + 1 is related to the same quantity at time t via wherew is the mean fitnessw = 3 i=0 p t i w i , and F is the fidelity of replication F = 1−2µ−µ 2 . It is easy to show that p t+1 i = 1 as long as p t i = 1. Equations (1)(2)(3)(4) can be solved numerically iteratively, but alternatively the fixed point (the p i in the limit t → ∞) can be calculated by solving for the right eigenvector of the associated Markov matrix.
Armed with the equilibrium probabilities p i , we can calculate the information between loci as follows. First we define p(A) and p(a) for the first and second locus: giving us the marginal entropies of the first and second locus The joint entropy of both loci is then The shared entropy, or information, is We can the relate this information to the epistasis between loci calculated as [4,56] There are other ways of defining epistasis between loci (see, e.g., [57]), but the qualitative relation between information and epistasis is not affected. An extreme example occurs when w 0 = w 3 and w 1 = w 2 = 0, that is, when the double mutant has the same fitness as the wild type, but the intermediates are lethal. In that case, it is necessary to cross a lethal fitness value to reach the double mutant aa. In this case of reciprocal sign epistasis [37], E = −∞, and If p 0 = p 3 = 0.5 (full equilibration), this is 1 bit of information.
To study the general relationship between epistasis and information, we calculated both epistasis and information starting with w 0 = 1 (wild type fitness), and random fitness values (between 0 and 1) for the single and double mutants w 1 , w 2 , and w 3 . The equilibrium genotype probabilities were obtained by iterating the replicator-mutator equations until they had stabilized (30000 updates of genotype probabilities p 0 , p 1 , p 2 , and p 3 using equations (1-4), with starting genotype probabilities: p 0 = 1, p 1 = p 2 = p 3 = 0).
We find that the absolute value of epistasis |E| is positively correlated with information (Pearson correlation coefficient 0.354). It is clear that positive information is a sufficient (but not necessary) condition for epistasis. Thus, high information between two loci guarantees epistasis between those two loci, but there may be epistatically interacting loci that are missed when focusing only on information, as some loci can interact epistatically without being informative about each other. The direction of epistasis (the sign of E) cannot be determined solely from information.  Figure 8: Correlation between epistasis and information. Each point corresponds to information and absolute value of epistasis calculated for one of 5000 combinations of w 0 , w 1 , w 2 , and w 3 . w 0 is always 1, and other fitness values are randomly assigned between 0 and 1.

Bias in entropy and mutual information calculations
It is well-known that entropy and information estimations from frequencies (maximum-likelihood estimators) are unreliable when sample sizes are small, due to under-sampled probability distributions [58]. We tested several estimators for entropy and, in particular, information calculations: Maximum likelihood (ML) estimator: empirical values of entropy calculated from observed frequencies.
Minimax estimator: Bayesian estimates of the bin frequencies using the Dirichletmultinomial pseudocount model, pseudocount = √ n/20 (n = number of sequences, 20 because of 20 possible residues at each position).
Chao Shen (CS) estimator: Proposed by Chao and Shen in 2003 [62]. Shrink estimator: Proposed by Hausser and Strimmer in 2009 [63]. NSB estimator: Proposed by Nemenman, Shafee, and Bialek [51]. To compare the performance of these estimators as sample size is increasing, we computed entropies at positions 54 and 82 of HIV-1 protease (a pair that is known to show substantial correlation [20]) for the 2003 treated data for gradually increasing sample sizes (total number of sequences in this data = 3600). Entropy and information estimates improve as sample size increases. Based on this analysis, we chose the NSB entropy estimator for our subsampled datasets of size 300 each. Entropy for positions 54 (X) and 82 (Y) of HIV-1 protease as calculated by the different entropy estimators listed above, for increasing sample sizes (top two panels). The blue horizontal lines represent empirical entropy estimates from 3600 sequences, and thus represent the "true" values that the estimators should achieve at smaller sample sizes.The lower two panels show the joint entropy and mutual information estimates for these two positions. It is clear that among the entropy estimators tested, the NSB estimator gives the most reliable estimates of entropy and information down to samples as small as 100 sequences.

Supplementary Text S2
Changes in physico-chemical properties at sites mirror per-site entropy changes Our analysis of residue properties show that the increase in entropy is often associated with changes in residue charge or size at that position (Figure 2 main text). The "Isoelectric point" (pI) is the pH at which there is no charge on the residue: positively charged residues have a higher pI (Arg R: 10.76, Lys K: 9.74) and negatively charged residues have a low pI (Asp acid D: 2.77, Glu acid E: 3.22, see table below for pI and residue-weight of all residues). A negative pI difference implies that residues with a higher isoelectric point (more positively charged, thus more basic) are replaced by acidic and negatively charged residues upon treatment ( Figure  2, main text, middle panel). Positive pI difference means that position is becoming more basic upon treatment. For position 30, although the entropy increase is not as significant as that at other loci, the loss of negative charge at the position (mutation from Aspartic acid to Asparagine, a major drug-resistance mutation) is substantial (Figure 2, main text, middle panel). Position 20, on the other hand, shows a decrease in pI following treatment, suggesting that the position has become more acidic. Several protease positions also show a marked shift in residue-weight at positions posttreatment. Residue-weight difference at a position is positive when heavier residues occupy that position after treatment, as is the case for positions 71 (emergence of Valine in place of Alanine: A71V), 73 (Serine is preferred to Glycine in some treated sequences: G73S), and 90 (Methionine replaces Leucine in 50% of treated sequences, L90M; Figure 2, bottom panel in main text). These heavier residues seen in treated protease sequences are also larger in size than the residues in untreated sequences, and thus could create potential steric clashes unless compensatory mutations occur elsewhere. We also observe negative residue-weight differences at positions 36, 46, 54, and 83, suggesting that smaller residues are now occupying these positions to either avoid steric clashes due to other changes in the protein (accessory mutations), or to change the protein-drug interaction (resistance causing mutations).  Figure S2: Information measures from longitudinal data. The categories are i) treated to untreated (tr=>untr), ii) treated to treated (tr=>tr), iii) untreated to treated (untr=>tr), and iv) untreated to untreated (untr=>untr). Grey and hatched bars represent the first and second time-point, respectively. I 1 decreases as entropy increases due to treatment (top panel), while sum of pairwise mutual information increases due to treatment (middle panel). The net information content of the protein, as approximated by I 2 , does not show any change in the two isolates collected from the same patient.