HIV-1 Subtype B Protease and Reverse Transcriptase Amino Acid Covariation

Despite the high degree of HIV-1 protease and reverse transcriptase (RT) mutation in the setting of antiretroviral therapy, the spectrum of possible virus variants appears to be limited by patterns of amino acid covariation. We analyzed patterns of amino acid covariation in protease and RT sequences from more than 7,000 persons infected with HIV-1 subtype B viruses obtained from the Stanford HIV Drug Resistance Database (http://hivdb.stanford.edu). In addition, we examined the relationship between conditional probabilities associated with a pair of mutations and the order in which those mutations developed in viruses for which longitudinal sequence data were available. Patterns of RT covariation were dominated by the distinct clustering of Type I and Type II thymidine analog mutations and the Q151M-associated mutations. Patterns of protease covariation were dominated by the clustering of nelfinavir-associated mutations (D30N and N88D), two main groups of protease inhibitor (PI)–resistance mutations associated either with V82A or L90M, and a tight cluster of mutations associated with decreased susceptibility to amprenavir and the most recently approved PI darunavir. Different patterns of covariation were frequently observed for different mutations at the same position including the RT mutations T69D versus T69N, L74V versus L74I, V75I versus V75M, T215F versus T215Y, and K219Q/E versus K219N/R, and the protease mutations M46I versus M46L, I54V versus I54M/L, and N88D versus N88S. Sequence data from persons with correlated mutations in whom earlier sequences were available confirmed that the conditional probabilities associated with correlated mutation pairs could be used to predict the order in which the mutations were likely to have developed. Whereas accessory nucleoside RT inhibitor–resistance mutations nearly always follow primary nucleoside RT inhibitor–resistance mutations, accessory PI-resistance mutations often preceded primary PI-resistance mutations.


Introduction
HIV-1 is a highly mutable pathogen.In the decades since it entered human populations, it has accumulated extensive sequence variation leading to the development of different subtypes and recombinant forms [1].Although the enzymatic targets of therapy are among the most conserved parts of the HIV-1 genome, these too can develop marked variation, particularly in the setting of selective antiretroviral drug pressure.Indeed, it is not uncommon for drug therapy to select for protease and reverse transcriptase (RT) variants containing substitutions at more than 10% of their amino acids [2].However, despite this high degree of mutation, the spectrum of possible virus variants appears to be limited by patterns of amino acid covariation.
In 2003, we published two studies that examined the extent of covariation among RT and protease residues in the presence and absence of antiretroviral therapy [3,4].Despite the relatively large size of the datasets in these studies-2,244 protease sequences and 1,210 RT sequences-there were insufficient data to examine patterns of covariation of different mutations at the same position.As more sequence data have become available, we are now analyzing covariation among mutations (rather than positions) in protease and RT.This expanded analysis uses a highly specific measure of covariation, the Jaccard similarity coefficient, and a multidimensional scaling based on this coefficient.In addition, we examine the relationship between conditional probabilities associated with a mutation pair and the order in which those mutations develop in viruses for which longitudinal sequence data are available.

Protease
Protease sequences from 3,982 protease inhibitor (PI)naive individuals and from 3,475 PI-experienced individuals were available for analysis.The PI-experienced individuals had received a median of 1 PI (interquartile range, 1-3).
Jaccard similarity coefficients and their standardized Z scores were calculated for all pairs of mutations at different positions present three or more times among the sequences from PI-naive and PI-experienced individuals.Among 19,203 pairs of mutations from the PI-experienced individuals, 161 pairs were significantly associated after adjusting for multiple comparisons by controlling the family-wise error rate at ,0.01.Of these 161 pairs, 92 (57%) were positively associated (Z .5.1, unadjusted p , 4.4 3 10 À7 ) and 69 (43%) were negatively associated (Z , À5.0, unadjusted p , 4.8 3 10 À7 ).Table 1 shows the Jaccard similarity coefficients and conditional probabilities of the 40 strongest positively associated protease mutation pairs and the ten strongest negatively associated protease mutation pairs.Table S1 shows the complete list of 161 statistically significant mutation pairs.
For the positively associated mutation pairs, Table 1 also contains two columns with data on the temporal order in which correlated mutations occurred in sequences with both mutations from persons in which an earlier sequence was available that contained only one of the two mutations.For example, the first row shows that among persons with both I54V and V82A in whom an earlier sequence contained only one of these two mutations was available, I54V occurred first in nine (26%) of 34 people, and V82A occurred first in 25 (74%) of 34 people (p , 0.01).In contrast, the fourth row shows that among persons with both A71V and L90M, each of the mutations was as likely to occur first (26 of 51 versus 25 of 51; p ¼ NS). Figure S1 plots the relationship between the log of the ratio of the conditional probability of two mutations versus the log of the ratio in which two mutations develop, indicating that the conditional dependence between mutations is highly correlated with the order in which the mutations develop when they occur together (r 2 ¼ 0.56, p , 0.001).
Among the 18 positively associated pairs in Table 1 containing a major and an accessory PI-resistance mutation (as defined in Methods), the accessory mutation appeared first more often in 12 of the 18 pairs.There were several striking patterns of temporal association among these 18 pairs of correlated major and accessory mutations.The major mutation L90M preceded the accessory mutation G73S in 31 of 34 persons for whom temporal data were available.In contrast, the accessory mutation L63P preceded L90M in 160 of 172 persons, and the accessory mutations L10I and A71V preceded the major mutation I84V in 51 of 59 and 35 of 38 persons, respectively.
Figure 1 plots the mutations along axes representing the first two principal components.The first principal component accounted for 10% of the total inertia and separates the nelfinavir-resistance mutations D30N and N88D from the main group of PI-resistance mutations.The second principal component accounted for 7% of the total inertia and separates V82A-associated mutations (I54V, L24I, and M46L) from L90M-associated mutations (M46I, G73S, and I84V).Finally, the lower-left part of the figure contains a cluster with seven of the 11 mutations recently reported to be associated with phenotypic and clinical resistance to the newest PI, darunavir (V32I, L33F, I47V, I50V, I54L/M, and L76V).
Among 7,131 pairs of mutations in sequences from PI-naive persons, 65 pairs were significantly associated (family-wise error rate , 0.01; Table S2).All but three of the positive associations among PI-naive persons were weaker (i.e., had a lower Z score) than the positive associations among treated persons in Table 1.

Reverse Transcriptase
RT sequences from 2,601 RT inhibitor-naive and from 5,188 RT inhibitor-experienced individuals were available for analysis.The RT inhibitor experienced individuals had received a median of three nucleoside RT inhibitors (NRTIs; interquartile range, 2-4) and zero nonnucleoside RT inhibitors (NNRTIs; interquartile range, 0-1).
Jaccard similarity coefficients and their standardized Z scores were calculated for all pairs of RT mutations at different positions present three or more times among the sequences from RT inhibitor-experienced and -naive persons.Among 65,624 pairs of mutations from the RT inhibitor-experienced persons, 327 pairs were significantly associated after adjusting for multiple comparisons by controlling the family-wise error rate at ,0.01.Of these 327 pairs, 213 (65%) were positively associated (Z .5.2, unadjusted p , 2 3 10 À7 ) and 114 (35%) were negatively associated (Z , À5.0, unadjusted p , 5 3 10 À7 ).Table 2 shows the Jaccard similarity coefficients and conditional probabilities of the 40 strongest positively associated RT mutation pairs and the ten strongest negatively associated RT mutation pairs.Table S3 shows the complete list of 327 statistically significant RT mutation pairs.
Positively associated mutation pairs consisted primarily of Type I or II thymidine analog mutations (TAMs; as defined in Methods); accessory NRTI mutations that occurred in combination with Type I or II TAMs (K43E, E44D, V118I, H208Y, D218E); and Q151M-associated mutations (V75I,

Author Summary
The identification of which mutations in a protein covary has played a major role in both structural and evolutionary biology.Covariation analysis has been used to help predict unsolved protein structures and to better understand the functions of proteins with known structures.The large number of published genetic sequences of the targets of HIV-1 therapy has provided an unprecedented opportunity to identify dependencies among mutations in these proteins that can be exploited to design inhibitors that have high genetic barriers to resistance.In our analysis, we identified many pairs of covarying drug-resistance mutations in HIV-1 protease and reverse transcriptase and organized them into clusters of mutations that often develop in a predictable order.Inhibitors that are active against early drug-resistant mutants are likely to be less prone to the development of resistance, whereas inhibitors that are active against fully evolved clusters of mutations may be useful drugs for salvage therapy.
F77L, F116Y).Among the top 40 associated mutation pairs, there were only three positive associations between Type I and II TAMs (M41L, L210W, and T215Y with D67N).The strongest significant association between an NRTI and an NNRTI mutation was between L74V and Y181C (J ¼ 0.17, Z ¼ 8.9, unadjusted p , 1 3 10 À11 ).Of note, the associations between the five accessory mutations listed above and Type I and II TAMs have also previously recently been described by  Svicher and coworkers [6] and Cozzi-Lepri and coworkers in independent datasets [7].The conditional probabilities and the temporal data columns show that each of the accessory NRTI mutations consistently follows the Type I or II TAMs.Among 12 pairs with a TAM and an accessory mutation, the TAM occurred first more often in all 12 pairs and was preceded by the accessory mutation in only 6% of pairs.In addition to the five accessory mutations in Table 2 (K43E, E44D, V118I, H208Y, and D218E), other NRTI mutations that consistently followed TAMs included the known treatmentselected mutations T69D and T69N. Figure S2 plots the relationship between the log of the ratio of the conditional probability of two mutations versus the log of the ratio in which two mutations develop, indicating that the conditional dependence between mutations is highly correlated with the order in which the mutations develop when they occur together (r 2 ¼ 0.81, p , 0.001).
The Jaccard dissimilarity coefficients associated with the 561 pairs of 34 mutations were used for a multidimensional scaling.The mutations included in this analysis were the 23 positively associated mutations in Table 2 and 11 additional clinically relevant NRTI-resistance mutations (K65R, A62V, T69ins, L74I/V, V75M, Y115F, M184V, and K219R/E/N). Figure 2 plots the mutations along axes representing the first two principal components.The first principal component accounts for 13% of the total inertia and separates the TAMs from the Q151M-associated mutations, whereas the second principal component accounts for 9% of the total inertia and separates the Type I and Type II TAMs.A62V, K65R, and Y115F are mutations that cluster with Q151M but may also occur with Type II (but not Type I) TAMs.D67N is a Type II TAM that can also occur with Type I TAMs, and it therefore occurs between Type I TAMs and Type II TAMs in terms of the second principal component.The non-TAM mutations, M184V and L74V, demonstrated no clustering with other NRTI-associated mutations.
At several positions, there was sufficient data to contrast covariation patterns for different mutations (Table 2, Figure 2, and Table S3).The Type I TAM, T215Y, clustered with other Type I TAMs, whereas the Type II TAM, T215F, clustered with other Type II TAMs.K219Q/E were Type II TAMs that cluster with other Type II TAMs.In contrast, two less common mutations at this position (K219N/R) were positively associated with Type I TAMs.T69D was associated with both Type I and Type II TAMs, whereas T69N was associated only with Type II TAMs.L74V was associated with the NNRTI-resistance mutations L100I, K103N, and Y181C, whereas L74I was associated with M41L.V75I was associated with Q151M-associated mutations, whereas V75M was positively associated with the Type I TAMs.
Among 19,431 pairs of mutations in sequences from RT inhibitor-naive persons, 41 pairs were significantly associated (family-wise error rate ,0.01;Table S4).However, all of the positive associations among RT inhibitor-naive persons were weaker (i.e., had a lower Z score) than the positive associations among treated persons in Table 2.

Discussion
In this analysis of amino acid covariation in protease and RT sequences from more than 7,000 persons infected with HIV-1 subtype B viruses, we confirmed several previously reported patterns of amino acid covariation and identified many new patterns of covariation.Multidimensional scaling further organized many of the correlations into clusters of co-occurring mutations.RT covariation was dominated by the distinct clustering of the TAMs and Q151M-associated mutations, and by the separation of the Type I and Type II TAMs.Protease covariation was dominated by the clustering of nelfinavir-associated mutations (D30N and N88D), two main groups of PI-resistance mutations associated either with V82A or L90M, and a newly identified cluster of the mutations V32I, L33F, I47V, I50V, I54L/M, and L76V.This new cluster of mutations is associated with decreased susceptibility to all PIs, including the salvage therapy PIs amprenavir and lopinavir and the recently approved PI darunavir.Although none of the sequences in this study were from patients who received darunavir, this drug is highly similar to amprenavir and is affected by the same PIresistance mutations.
Previous studies of HIV-1 covariation have used either the Pearson correlation for binomial random variables or mutual information [3][4][5][6][8][9][10].The correlation coefficient is overly sensitive to rare pairs of mutations because its statistical significance is based on a departure from equality between the diagonal and off-diagonal products of a 2 3 2 contingency table.In contrast, mutual information is insensitive to rare pairs of mutations, approaching a high level only for commonly occurring pairs of mutations.We therefore used the Jaccard similarity coefficient, which uses only those sequences in which at least one of a pair of mutations is present, and we assessed the significance of this coefficient using a distribution based on the underlying data.
We also used a conservative correction for multiple   Covariation between two mutations may result from the shared inheritance of the mutations from a founder virus, from a shared evolutionary pressure (e.g., an antiretroviral drug) that independently selects for each mutation, or from a functional dependency between the mutations.In our analysis, covariation was unlikely to result from shared inheritance because the most strongly covarying mutations occurred solely among treated HIV-1 isolates, consistent with the repeated selection of the correlated mutations in many different isolates as a result of selective drug pressure rather than the inheritance of the correlated mutations from a small number of ancestral viruses.
However, the possibility that many of the covarying residues resulted from similar selective pressures rather than from functional dependency cannot be excluded.For example, it is possible that some pairs of covarying protease amino acids result from the selective pressure of the same PI or possibly pair of PIs.Shared selective pressure is a possible explanation for why covarying mutations are not necessarily close to one another in tertiary structures (Figure S3) [4].An analysis of covariation that controls for treatment history would be better able to distinguish functional dependency from shared selective pressure.However, for most PIs and NRTI combinations, insufficient data are available for such an analysis.Identifying similar patterns of covariation in one or more independent lineages (e.g., other non-B subtypes) would also provide additional independent evidence for functional dependency.
Our examination of conditional dependency between mutation pairs, the temporal order in which mutations occur, and the relationship between these two types of data provided new insights into the evolution of protease and RT in persons receiving antiretroviral therapy.A strong positive relationship between the conditional dependency ratio of two mutations and the order in which the mutations occur would represent the most parsimonious mechanism for HIV-1 to develop multiple mutations (i.e., the mutation that occurs more often in a pair of mutations would be on average more likely to occur first).Nonetheless, we found that the positive relationship between conditional dependency and the order of mutation occurrence was stronger for covarying RT (r 2 ¼ 0.81) compared with protease (r 2 ¼ 0.56) mutation pairs.This suggests that the number of mutational steps required to develop multiple PI-resistance mutations may be greater on average than that required for developing the same number of multiple NRTI-resistance mutations.We also found that accessory NRTI-resistance mutations nearly always followed primary NRTI-resistance mutations (particularly the TAMs).In contrast, the commonly recognized accessory PI-resistance mutations were as likely to precede as to follow major PI-resistance mutations.This frequent precedence of accessory PI-resistance mutations results in part from the fact that many of the accessory PIresistance mutations are polymorphic and therefore present prior to the start of therapy.However, this alone does not explain the marked dependency of some major mutations on polymorphic accessory PI-resistance mutations that occur only at low levels in untreated persons.
The strong positive relationship between conditional probabilities and temporal data that we describe support the validity of previous research, which used cross-sectional data to infer mutational pathways [11] and causality [12,13].Our results also suggest that there is a complex process underlying the order in which major and accessory PIresistance mutations develop during PI therapy, and that the designation of major PI-resistance mutations as primary and accessory PI-resistance mutations as secondary often refers only to their roles in causing resistance and not to the order in which they develop.

Materials and Methods
Virus sequence data.Sequences included HIV-1 subtype B RT and protease sequences from published studies in the Stanford HIV Drug Resistance Database (http://hivdb.stanford.edu)[14].For patients with more than one sequence, only the latest sequence obtained while receiving treatment was analyzed.For each gene, separate analyses were done for the sequences from treatment-experienced and treatment-naive individuals.
RT positions 1-240 and protease positions 1-99 were analyzed.Mutations were defined as differences from the consensus wild-type subtype B amino acid reference sequence (http://hivdb.stanford.edu/pages/asi/releaseNotes/index.html).For each pair of mutations (X, Y), the numbers of sequences containing both mutations (X and Y), only one mutation (X or Y), or neither mutation (not X, not Y) were counted and used to populate a contingency table.Sequences containing mixtures at either of the two positions were excluded from analysis of that pair of positions.

aJ:
Jaccard similarity coefficients; J RAND : expected value of the Jaccard similarity coefficient assuming Mut X and Mut Y occur independently based on a permutation procedure described in Methods; J SE : the standard error of J; Z: (J À J RAND ) / J SE .Mutation pairs are ordered by their Z statistic.b Based on the number of persons with both mutations (Mut X and Mut Y) in whom an earlier sequence containing only Mut X or Mut Y: N X0!XY is the number in whom Mut X developed first; N 0Y!XY is the number in whom Mut Y developed first.c Major (Maj) indicates mutations in the protease substrate cleft or that directly reduce drug susceptibility.Accessory (Acc) indicates mutations that are commonly believed to reduce susceptibility only in the presence of a major mutation or to compensate for the decreased replication associated with enzymes containing major mutations.Miscellaneous (Misc) includes the remaining mutations, including some that are treatment-associated (see Methods) and others that are not.When both mutations belong to the same category, only one abbreviation is shown.doi:10.1371/journal.pcbi.0030087.t001

Figure 1 .
Figure 1.Multidimensional Scaling of 35 HIV-1 Protease Mutations Includes the 22 mutations obtained from the mutation pairs with the highest positive association (Table 1) in bold, and 13 additional clinically relevant protease inhibitor resistance mutations (L10F, V32I, L33F, I47V, I50V/L, F53L, I54L/M, Q58E, L76V, V82T, and N88S).The graph is a 2-D projection of the distances among the 35 mutations, in which the distance between any two mutations is measured by their Jaccard dissimilarity coefficient among persons who have received at least one protease inhibitor.doi:10.1371/journal.pcbi.0030087.g001

aJ:
Jaccard similarity coefficients; J RAND : expected value of the Jaccard similarity coefficient assuming Mut X and Mut Y occur independently based on a permutation procedure described in Methods; J SE : the standard error of J; Z: (J À J RAND ) / J SE .Mutation pairs are ordered by their Z statistic.b Based on the number of persons with both mutations (Mut X and Mut Y) in whom an earlier sequence containing only Mut X or Mut Y: N X0!XY is the number in whom Mut X developed first; N 0Y!XY is the number in whom Mut Y developed first.cI: Type I TAMs; II: Type II TAMs; Acc: accessory NRTI-resistance mutations; 151: Q151M-associated mutations; Miscellaneous (Misc) includes the remaining mutations, including some that are treatment associated (see Methods) and others that are not.When both mutations belong to the same category, only one abbreviation is shown.doi:10.1371/journal.pcbi.0030087.t002PLoS Computational Biology | www.ploscompbiol.orgMay 2007 | Volume 3 | Issue 5 | e87 0840 HIV-1 Amino Acid Covariation pairs of RTI mutations were significantly associated using a family-wise error rate of 0.01.

Figure 2 .
Figure 2. Multidimensional Scaling of 34 HIV-1 Reverse Transcriptase Mutations Includes the 23 mutations obtained from the mutation pairs with highest positive association (Table2) in bold, and 11 additional clinically relevant nucleoside RT inhibitor resistance mutations (K65R, A62V, T69ins, L74I/V, V75M, Y115F, M184V, and K219R/E/N).The graph is a 2-D projection of the distances among the 37 mutations, in which the distance between any two mutations is measured by their Jaccard dissimilarity coefficient among persons who have received at least one nucleoside RT inhibitor.doi:10.1371/journal.pcbi.0030087.g002

Table 1 .
Forty Highest Positively Correlated Protease Mutation Pairs and Ten Highest Negatively Correlated Protease Mutation Pairs from PI-Experienced Persons

Table 2 .
Forty Highest Positively Correlated RT Mutation Pairs and Ten Highest Negatively Correlated RT Mutation Pairs from RTI-Experienced Persons