Distinguishing Functional Amino Acid Covariation from Background Linkage Disequilibrium in HIV Protease and Reverse Transcriptase

Correlated amino acid mutation analysis has been widely used to infer functional interactions between different sites in a protein. However, this analysis can be confounded by important phylogenetic effects broadly classifiable as background linkage disequilibrium (BLD). We have systematically separated the covariation induced by selective interactions between amino acids from background LD, using synonymous (S) vs. amino acid (A) mutations. Covariation between two amino acid mutations, (A,A), can be affected by selective interactions between amino acids, whereas covariation within (A,S) pairs or (S,S) pairs cannot. Our analysis of the pol gene — including the protease and the reverse transcriptase genes — in HIV reveals that (A,A) covariation levels are enormously higher than for either (A,S) or (S,S), and thus cannot be attributed to phylogenetic effects. The magnitude of these effects suggests that a large portion of (A,A) covariation in the HIV pol gene results from selective interactions. Inspection of the most prominent (A,A) interactions in the HIV pol gene showed that they are known sites of independently identified drug resistance mutations, and physically cluster around the drug binding site. Moreover, the specific set of (A,A) interaction pairs was reproducible in different drug treatment studies, and vanished in untreated HIV samples. The (S,S) covariation curves measured a low but detectable level of background LD in HIV.

However, such covariation analysis can be confounded by important phylogenetic effects [13,14]. One major challenge for covariation analysis is distinguishing covariation that is genuinely due to selection pressure, from covariation that is simply due to coinheritance from a common ancestor. When a mutation first occurs in an individual chromosome, other mutations are already present in that chromosome, and initially this mutation will be inherited in 100% linkage with those other mutations. Such cooccurrence due to common ancestry is classified as background linkage disequilibrium (BLD) [20] (Fig. 1A). Over time, however, such linkage will be scrambled by events such as recombination and mutation, returning to equilibrium (no statistical association between them). For example, homologous recombination events between any pair of mutations will gradually scramble any linkage between the mutation pair at a rate that is proportional to the physical distance between them, the recombination rate, and the passage of time. This raises several questions. How to distinguish BLD from the covariation due to selection pressure? What fraction of covariation is BLD? How strong is BLD in HIV? The evidence from different studies has been ambiguous. On one hand, studies indicate that phylogenetic effects in HIV are strong. Phylogenetic analysis has successfully inferred HIV transmission history from HIV sequences [21,22]. On the other hand, HIV's high mutation rate [23,24], recombination rate [25][26][27], and short generation time [28][29][30] should reduce the phylogenetic effect significantly.
It should also be emphasized that phylogenetic analysis can be confounded by strong selection pressure. Whereas phylogenetic analysis interprets the presence of the same mutation in several individuals as evidence of common ancestry, selection pressure creates bias for ''convergent evolution'' [31] in which the same mutation can evolve independently many times due to positive selection. Such hidden biases are incompatible with the assumptions of classical phylogenetic analysis [32]. For example, one study has reported that selection pressure for drug resistance can cause incorrect phylogenetic inferences from HIV sequences (compared with the known transmission history) [33].
Thus, it is important to develop methods that can distinguish these two causes of covariation. Recently, analytical methods such as parametric bootstrap [34] and phylogeny-based shuffling [35] have been applied to estimate what fraction of covariation arises from phylogenetic effects. In this paper we take a different approach, based on comparing levels of covariation between different types of mutation pairs, such as pairs of amino acid mutations (A,A) vs. pairs of synonymous mutations (S,S). Whereas (A,A) pairs are subject to both phylogenetic effects and selection pressure on functional interactions between amino acid sites, (S,S) pairs cannot be subject to this type of selection pressure (since they leave the amino acid sequence unchanged) (Fig. 1). We have therefore used synonymous mutations to measure background LD [20], to systematically distinguish covariation from phylogenetic effects vs. other sources of covariation, independent of mathematical model assumptions. Throughout this paper we will use the term ''covariation'' to refer to observed statistical association (without implying any specific interpretation of its cause); ''background linkage disequilibrium'' to indicate the specific interpretation of co-inheritance from a common ancestor; and ''selective interactions'' to indicate the specific interpretation of selection pressure for co-occurrence of a given pair of amino acid mutations. The ''selective interactions'' include 3D structural interactions (both local and long-range) as well as phenotypic covariation due to shared selection pressure. It should also be noted that selection on nucleotides (e.g. constraints on the RNA structure in viruses [36,37]) rather than on amino acids can also cause BLD, as measured by (S,S) covariation, thus may contribute to the covariation of all three types of mutation pairs, (A,A), (A,S) and (S,S), in this study.

Metrics of LD in HIV and Its Comparison with Background LD
First, we performed standard analyses of Linkage Disequilibrium (LD) on a dataset of about 50,000 HIV-1 pol gene sequences of subtype B, covering a 1.4 kb region of the HIV protease and reverse transcriptase (RT) genes, mostly from patients under antiretroviral drug treatment [38]. Following the procedure of the Human Genome HapMap project [39], we applied a minimum frequency criteria to the data before measuring the LD. After applying the frequency cutoff of 2%, our dataset included 398 distinct single nucleotide mutations, each with 3260 observation counts on average. It should be noted that due to the very large size of this dataset and the high rate of mutation in HIV, we detected a very high density of mutations, including mutations at the majority of individual nucleotide sites, most with large numbers of observations. This provided a uniquely high-resolution mutation dataset for mapping LD. The density of mutations (observations per nucleotide) in this dataset is 100-fold higher than in the data from the Human Genome HapMap project [39].
Another crucial difference between HIV and classical examples of LD analysis (i.e. the human data), is that this region of the HIV genome experiences very strong positive selection pressure due to antiviral drug treatments [45,46]. To control for possible LD caused by amino acid selection pressure, we repeated this LD analysis strictly for pairs of synonymous mutations. Such mutations do not change the amino acid sequence and thus are not subject to amino acid selection pressure, yielding a measurement of ''background LD'' free of amino acid selection artifacts [20]. D9 and r metrics of the background LD decayed as a function of physical distance ( Fig. 2A, and Fig. S1A). However, the average background LD by these metrics was smaller than the average LD, across the whole one kb region. The average D9 for background LD decayed from 0.02 for adjacent mutations to 0.01 for distances of 0.4 kb or more, about two-fold lower than the standard LD curve over the same distance range ( Fig. 2A). The average r showed a similar pattern (Fig. S1A). This suggests that selection pressure plays an important role in shaping LD in HIV.

Comparing Covariation of (A,A), (A,S) and (S,S) in HIV
To examine this hypothesis further, we subdivided mutation pairs into three groups: pairs of synonymous mutations (S,S); pairs of amino acid mutations (A,A); and pairs consisting of one amino acid mutation and one synonymous mutation (A,S). (A,A) pairs experience both phylogenetic effects and possible selective interactions; that is, (A,A) pairs that together increase reproductive fitness may be selected for co-occurrence. By contrast, since . Furthermore, the average covariation of (A,A) was much higher than that of (A,S) and (S,S) (Fig. 2B). The average D9 of (A,A) gradually declined from 0.18 to 0.03 over about 1000 bases, while the average D9 of (A,S) and (A,A) started at less than 0.05 and rapidly dropped to 0.01 at around 300 bases. On average, (A,A) covariation levels were twoto five-fold higher than those of (A,S) and (S,S) across this range of distances. The conclusion also held for the frequency cutoff of 1% and 4% ( Fig. S2 and S3). In addition, the difference in distribution for covariation scores D9 of (A,A) vs. those of (A,S) and for (A,A) vs. (S,S) was statistically significant (both p-values less than 10 216 , Wilcoxon rank sum test -see Materials and Methods). Thus, a predominant fraction of (A,A) covariation does not appear to be attributable to background LD as measured by (S,S) covariation.
It is also striking that the (A,S) and (S,S) covariation (measured by D9 and r) behaved similarly, in contrast with (A,A) covariation. The average D9 of (A,S) and (S,S) both started under 0.05 and gradually decayed until they reached a flat of around 0.01 at 300 bases (Fig. 2B). The same pattern was repeated in the average r curve (Fig. S1B). However, it is also interesting that there appear to be slight differences between (A,S) and (S,S) at short distances (less than 200 bases). The average D9 value for (A,S) was significantly higher (up to 0.04) than (S,S) for adjacent mutations, but decayed more rapidly, so that this difference vanished beyond 300 bases. This higher value of (A,S) vs. (S,S) is consistent with the known strong positive selection for amino acid mutations in this region [45,46], since (A,S) pairs would be directly affected by such potential selective sweep events [47,48], whereas (S,S) pairs can only be affected indirectly (i.e. only by selective sweep for a third mutation that is a positively selected amino acid mutation).

Comparing Covariation of (A,A), (A,S) and (S,S) in the Stanford-Treated Dataset
To assess the reproducibility of these results, we repeated this analysis of (A,A), (A,S) and (S,S) covariance in a second, independent dataset, containing about 7,000 drug-treated HIV samples of subtype B covering either protease or RT (Stanford-Treated; see Materials and Methods). 73 amino acid mutations and 103 silent mutations (mutation frequency $5%; see Materials and Methods) were included in the analysis.
Although the average number of samples per site in Stanford-Treated was less than one tenth of the Specialty dataset, we found the same covariance pattern -the (A,A) covariation (D9) was much stronger than that of (A,S) and (S,S) (both p-values less than 10 27 , Wilcoxon rank sum test -see Materials and Methods), and the covariation levels of (A,S) and (S,S) were similar (pvalue = 0.89, Wilcoxon rank sum test -see Materials and Methods). The average D9 of (A,A) started at around 0.20 and declined to 0.07 over a scale of 800 bases; while for (A,S) and (S,S), the average D9 started less than 0.07 and then both fluctuated at around 0.05 (Fig. 3A). The average r showed a similar pattern, differing from D9 mainly in measurement scale (Fig. S4A). Overall, (A,A) covariation was about two-to four-fold higher than (A,S) and (S,S) covariation levels. Again, the (A,S) and (S,S) curves were largely indistinguishable within the range of sampling variance inherent in the dataset (Fig. 3A). These data demonstrate again that most (A,A) covariation in this region of HIV is not attributable to background LD as measured by (S,S) covariation, suggesting a dominant role for selective interactions due to selection pressure imposed by antiviral drug treatment.
To exclude the possibility that the consistent pattern between Specialty and Stanford-Treated datasets results from overlapping samples, we eliminated from the Specialty dataset all sequences with 98% or higher identity to samples in the Stanford dataset and re-analyzed the Specialty dataset. After the filtering, the (A,A) covariation level is still much higher than (A,S) and (S,S) across all distances (Fig. S5).

Comparing Covariation of (A,A), (A,S) and (S,S) in the Stanford-Untreated Dataset
To test the role of drug-induced selection via a negative control, we carried out the same analysis in a set of samples collected from untreated patients (Stanford-Untreated; see Materials and Methods). Previous studies have showed that comparison of these Treated vs. Untreated datasets can identify the effects of antiviral drug treatment [15,18]. The Untreated dataset contained about 4,500 drug-naive samples covering either protease or RT (see Materials and Methods). 42 amino acid mutations and 107 silent mutations (mutation frequency $5%) were included in the analysis.
Strikingly, the large difference in covariation between (A,A) and (A,S)/(S,S) disappeared in the untreated dataset. The average D9 of (A,A) fluctuated around the average D9 of (A,S) and (S,S), at approximately 0.07 (Fig. 3B). The same pattern was repeated in the average r curve (Fig. S4B). These data provide a clear, independent confirmation that drug-induced selection pressure is indeed the explanation for the surplus covariation of (A,A) pairs (relative to background LD measured by (S,S)) in the Specialty and Stanford-Treated datasets, both of which included drug-treated samples.

(A,A) Pairs Near the RT Active Site Show Strong Covariation
The (A,A) covariation decay curve (Fig. 2B) revealed a clear double-peak for pairs between 400-550 base distance in the Specialty dataset. Strikingly, a similar double-peak was observed at the same location in the Stanford-Treated (A,A) curve (Fig. 3A). We analyzed the two datasets separately to identify the mutation pairs responsible for these two peaks. These data revealed that the peaks were caused by the same set of mutation pairs in both datasets. One peak resulted from strong covariation between a cluster of mutations RT 41L and 43E with another cluster RT 208Y and 210W; while the other peak reflected strong covariation between a cluster RT 67N and 70R with the cluster RT 208Y, 218E and 219E/Q. Interestingly, in the three-dimensional protein structure, all these residues lie close to the reverse transcriptase active site (Fig. 4), less than 25 Å apart. Furthermore, mutations RT 41L, 67N, 70R, 210W and 219E/Q are known RT drug resistance mutation [49]. Thus every single one of the (A,A) covariation pairs observed in these peaks consisted of either one, or two known drug-resistance mutations. This analysis of the individual (A,A) covariation pairs provides independent confirmation that these specific residues are positively selected for drugresistance.  (Fig. 5). We used Fisher's exact test to calculate the covariation score h (see Materials and Methods), which detects statistically significant covariation even in cases with smaller counts. No minimum frequency criteria was applied. This allows us to comprehensively compare the covariation of different types of mutation pairs. In this analysis, we used the Specialty dataset, due to its much larger number of samples (about 50,000). For each pair of codon positions, the map displays the highest level of covariation measured for mutations at that pair (see Materials and Methods).
These data reveal several striking differences between (A,A), (A,S) and (S,S) covariation. First, the (A,A) map contains a large fraction (2.6%) of strong covariation effects (h.5 at 95% confidence; see Materials and Methods), compared with only a small fraction for (A,S) (0.3%) and (S,S) (0.3%). Second, whereas strong (A,A) covariation broadly distributed across the whole map (Fig. 5A), in the (A,S) and (S,S) maps covariation clustered close to the diagonal (Fig. 5B and 5C), i.e. for codon positions that are close in the sequence. These data suggest that background LD decays rapidly, within 100 nt (about 30 amino acids). The fact that (A,A) covariation extended more broadly, indicates that it arises from a different process than background LD. Third, a large fraction (30 out of 53) of the codon positions identified by (A,A) covariation are known drug-resistance mutation sites [49] in HIV protease (for the list of drug-resistance codons identified, see Materials and Methods), confirming again that these covariation effects involve drug-resistance selection.

Correlation of the Covariation Between Independent Datasets
We have shown that the level of (A,A) covariation is higher than (A,S) and (S,S) covariation in both drug-treated datasets (Specialty; Stanford-Treated). However, do these independent datasets display the same covariation effects? To answer this question, we compared the covariation measurements for each (A,A) pair from these two datasets. For each (A,A) pair, we plotted the covariation value observed in the Specialty dataset against the covariation value observed in the Stanford-Treated dataset (Fig. S6). Strikingly, the (A,A) pairs that covaried in the Specialty dataset  also covaried in the Stanford-Treated (Fig. S6A), and the covariation values in the two datasets showed strong quantitative agreement, yielding a high correlation coefficient of 0.83 (See Materials and Methods) between these two independent datasets. In contrast, for (A,S) and (S,S) pairs, the covariation detected in these two datasets was low, both giving the correlation coefficients of 0.35. This indicates that a single, consistent pattern of selective interactions is reproducibly discovered in the two independent datasets, but only for (A,A) covariation.
To assess whether drug treatment acts as the consistent amino acid selection in these two datasets, we compared the (A,A) covariation in the Treated dataset with that in the Untreated one. We found that the high consistence of (A,A) covariation between the Specialty and the Treated (correlation coefficient 0.83) disappeared in the comparison between the Untreated and the Treated, leaving a correlation coefficient of only 0.39 (Fig. S6D). This suggests that drug treatment (shared by the Specialty and the Treated datasets, but not the Untreated dataset) causes the nearly identical pattern of selective interactions found in these two independent datasets.

DISCUSSION
We have systematically separated the covariation induced by selective interactions from background LD, using silent (S) and amino acid (A) mutations. Selective interactions between amino acids can be detected by (A,A) pairs, but not by (A,S) or (S,S) pairs. Our analysis of the pol gene in HIV suggests that a large portion of (A,A) covariation in HIV results from selective interactions. Meanwhile, the (S,S) covariation curves suggest a low but detectable level of background LD in HIV. Although HIV has extremely high mutation and recombination rate, as well as short generation time, the (S,S) covariation metrics were still able to detect some BLD, decreasing as a function of physical distance (Fig. 2).
Several lines of evidence demonstrate the robustness of these conclusions. First, the same results were found by three different measurements of covariation: the widely used D9 and r metrics, and Fisher's exact test. Second, these results were reproduced in independent experimental studies (the Specialty and Stanford-Treated datasets). Third, the high level of consistency between independent (A,S) and (S,S) covariation curves suggests that the much higher level of covariation observed for (A,A) pairs cannot be attributed to background LD. Fourth, we also found direct evidence that the difference in covariation levels between (A,A) vs. (A,S)/(S,S) is due to selection, specifically, antiviral drug treatment, by comparing treated vs. untreated datasets. Fifth, the most prominent (A,A) interactions in the HIV pol gene have been independently identified as drug resistance mutations that physically cluster around the drug binding site. Finally, the specific set of (A,A) interaction pairs was reproducible in different drug treatment studies, and vanished in untreated HIV samples. Our result agrees with the 'observation of positive epistasis in HIV [50]. A previous study in plastid genomes also indicates that the significant covariation in plastid genomes is likely due to changes in the selective constraints of amino acids [51].
Could the surplus of the (A,A) covariation compared with that of (A,S) and (S,S) in the treated datasets (Specialty and Stanford-Treated) be an artifact of differences in the intrinsic mutation rates between silent and amino acid mutations (e.g. silent mutations are more likely to be transitions than transversions, thus evolving faster)? We directly tested this possibility by performing the same analysis in samples from untreated patients (Stanford-Untreated). Such an artifact should have also have been observed in the untreated dataset. Yet, the difference between (A,A) vs. (A,S)/(S,S) disappeared in the untreated dataset (Fig. 3), indicating that this difference was due specifically to drug-treatment. It should also be noted that in addition to drug treatment, there are other sources of selection, such as immune pressure. Like the drug-induced selection, this too only causes (A,A) but not (A,S) or (S,S) covariation. However, we didn't detect a significant difference between (A,A) vs. (A,S)/(S,S) in the untreated samples, suggesting our approach is not sensitive enough to detect weaker selection.
How might drug treatment cause the dramatic increase in covariation of amino acid mutation pairs observed in HIV? Several models are possible. 1) Drug treatment selects for mutations that directly cause drug resistance (called primary mutations), many of which may have secondary effects such as reducing protein stability and/or other aspects of viral fitness. These mutations can in turn induce selection pressure for mutations that compensate for these effects (called accessory mutations; e.g. a mutation that restores the protein stability). Such secondary selection effects will cause a pattern of covariation of primary mutations with their associated accessory mutations. By contrast, in the untreated samples, where such positive selection forces are presumably weaker, we did not detect significant evidence of selective interactions. 2) The covariation can be caused by shared selection pressure among amino acid mutations. If mutation X and Y are independently selected for under the same drug treatment, the two mutations are likely to covary. To distinguish the aforementioned two possibilities, we need to estimate the fitness of mutation X and Y separately and compare the sum with the fitness of the XY double mutation, which is beyond the scope of this study.
Our data also indicate that accurate measurement of background LD is useful to improve the accuracy of functional interaction prediction. Accurate measurement of the background LD would enable calculation of a threshold value above which the statistics will have a specific probability of resulting from causes other than background LD. Comparison of statistical values of covariation calculated from (A,A) pairs with the background (i.e. (S,S) covariation) allows identification of pairs of sites having a specific probability of interacting due to selection on amino acids. Such phylogenetic effects should be taken into consideration in covariation analysis of amino acid interactions.
In this paper, we have only analyzed the pol gene, which is known to have experienced strong drug selection. The same method should be applied in the other regions of the HIV genome. We expect the (A,S) and (S,S) covariation, while still consistent with each other, varies across the HIV genome due to different phylogeny across the genome. The comparison of (A,A) covariation with that of (A,S)/(S,S) across the HIV genome will provide a global view of the influence of selection on mutation covariation. For example, such comparison in the env gene will hopefully improve our understanding of the interplay between the host selection and phylogeny in that region. The same analysis could also be done in subtypes other than subtype B.

HIV-1 sequence data
The Specialty dataset contained 48,927 subtype B sequences, mostly from patients under antiretroviral drug treatment. Multiple sequence alignments and mutation detection were performed as previously described [38,52]. All these sequences covered the whole protease and part of the RT. The Treated and the Untreated datasets were downloaded from Stanford database (http://hivdb.stanford.edu/) [53], selecting only subtype B sequences. In the Treated dataset, there were 1795 protease sequences treated with protease inhibitor (this subset were used to calculate the covariation between mutations in protease) and 5121 reverse transcriptase sequences treated with either nucleoside reverse transcriptase inhibitor (NRI) or non-nucleoside reverse transcriptase inhibitor (NNRI) (this subset were used to calculate the covariation between mutations in RT), including 1320 samples that cover both protease and reverse transcriptase sequences (this subset were used to calculate the covariation between mutations in protease and those in RT). In each subset, we required a minimum mutation frequency of 5%. In the Untreated dataset, there were 2620 PI-naïve protease sequences and 1795 NRI/NNRI-naïve reverse transcriptase sequences, including 1208 samples that have both protease and reverse transcriptase treatment-naïve.

Measurements of covariation for individual mutation pairs
We used the Fisher exact test [54,55] to test for non-random associations between mutation a at position X and mutation b at position Y, by computing the p-value for the two-sided test using the 262 contingency table: N XaYb , N XaY0 , N X0Yb and N X0Y0 ?N-XaYb is the number of samples that have mutation a at position X and also mutation b at position Y; N XaY0 is the number of samples that have mutation a at position X and but no kind of mutation at position Y; N X0Yb is the number of samples that have no mutation at position X and have mutation b at position Y; N X0Y0 is the number of samples that have mutation at neither position. We computed the odds ratio, its confidence interval (95% two-sided) and p-value using the fisher.test function from the statistical software package R. Note: the maximum likelihood estimator for h is provided by (N XaYb ?N X0Y0 )/(N XaY0 ?N X0Yb ); for independent mutations X a and Y b , h = 1. We calculated D9 and r following the standard procedures [56][57][58], using p 1 = (N XaYb +N XaY0 )/N, q 1 = (N XaYb +N X0Yb )/N, x 11 = N XaYb /N, where N = N XaYb +N-XaY0 +N X0Yb +N X0Y0. . Finally, we used Wilcoxon rank sum test (wilcox.test function in the R package) to compare different types of mutation pairs with respect to their covariation score (e.g. D9). The pvalue is calculated for the null hypothesis that the covariation scores for the two types of mutation pairs are from the same distribution.

Average LD as a function of distance
Mutation pairs with negative LD were excluded. Mutation pairs were ranked by physical distance. We calculated smoothed curves using a sliding window, the window width of 2% or 4% of the total data, and an offset for neighboring windows of the 1/2 the window width.

Covariation map
For each pair of mutations, we used Fisher exact test to compute a p-value for statistically significant covariation, along with a lowerbound estimate for the strength of covariation h based on the 95% confidence interval. Only statistically significant mutation pairs (p,10 26 for a single pair, yielding a significance level of 0.01 after the Bonferroni correction) were included in our analysis. For a given codon position pair, the strongest covariation value h for any pair of mutations at the two positions (of the designated type: (A,A), (A,S), or (S,S)) was displayed in the map.

Comparing covariation between datasets
To test whether the covariation derived from two datasets, X and Y, was consistent, we plotted for every mutation pair the covariation measurement r in X vs. that in Y. We also calculated the correlation coefficient between the two datasets. Since correlation coefficient is very sensitive to outliers, the lowest correlation from 2000 bootstrap replicates (the R boot library) was taken as the correlation score between X and Y. Figure S1