Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada

Host immune selection pressure influences the development of mutations that allow for HIV escape. Mutation patterns induced in HIV by the human leukocyte antigen (HLA) are HLA-allele specific. As ethnic groups have distinct and characteristic HLA allele frequencies, we can expect divergent viral evolution within ethnicities. Here, we have sequenced and analyzed the HIV pol gene from 1248 subtype B infected, treatment-naïve individuals in Canada. Phylogenetic analysis showed no separation between pol sequences from five self-identified ethnic groups, yet fixation index (FST) values showed significant divergence between ethnicities. A total of 17 amino acid sites showed an ethnic-specific fixation pattern (0.015<FST <0.060, p<0.01), and 27 codons were inferred to be under positive selection (p<0.01), with each set of sites strongly associated with HLA sites (p = 1.78×10−6 and p = 1.91×10−7, respectively). Within the pol gene, eight sites under HLA selective pressure were correlated with ethnicity, indicating ‘adaptive divergence’ between the groups studied. Our findings highlight challenges in HIV vaccine design in ethnically diverse countries with subtype B epidemics.


Introduction
Immune-mediated selection pressure is one of the strongest forces driving HIV evolution, with human leukocyte antigen (HLA) proteins playing a critical role. HLA cell-surface proteins bind and display antigenic epitopes cleaved from viral proteins. Epitope display initiates the cytotoxic T lymphocyte (CTL) response and the destruction of HIV infected cells. However, HLA proteins are able to display only epitopes to which they bind tightly. Consequently, amino acid substitutions within HIV epitopes can interfere with the CTL response by decreasing the efficiency of epitope binding, disrupting the intracellular processing of epitopes or impairing recognition by T cells. Thus, the incredibly high mutation rate of HIV [1], combined with strong selective pressure, facilitate immune escape through the mutation of sequences that are targeted by the CTL response [2][3][4].
HLA alleles are extremely diverse, with each allele capable of binding different, but overlapping, sets of viral epitopes. Among populations which share common HLA alleles, HIV can evolve in parallel at HLA-associated sites due to the fixation of HLA-allele specific escape mutations [5][6][7]. Concordantly, the frequency of HLA-associated polymorphisms in circulating HIV isolates has been shown to reflect the prevalence of HLA alleles in different populations [8]. Links between HLA alleles and the HIV mutation patterns they generate have been established by multiple large scale association studies on cohorts for which both HLA allele data and viral sequences are available [5,6,[9][10][11][12]. In the largest study of its type, Brumme et al. mapped polymorphisms due to HLA immune escape across HIV genome sequences within a multicenter cohort of over 1500 HIV patients (International HIV Adaptation Collaborative, or IHAC) from the USA, Canada and Australia [6]. In a subsequent study, John et al. demonstrated that at an 8-11 mer resolution, HLA responses differed according to ethnicity, establishing that there were distinct inheritable patterns of HIV immune response [7]. In a diverse population in the USA, ethnic-specific selection patterns were observed in HIV because frequencies of HLA alleles resolved at a high level differed across the groups studied. Congruently, Kosakovsky Pond et al. found that the strength of selection varied at sites in HIV between two genetically distinct populations [13].
Similar to the USA, the Canadian HIV epidemic is ethnically heterogeneous. According to surveillance data reported in 2008 and for which ethnicity data was available, 44.3% of HIV cases were Caucasian, 33.3% Aboriginal, 11.6% African-Caribbean, 4.5% Asian, and 4.1% Latin-American [14]. Of particular note is the over-representation of Aboriginals in the Canadian HIV epidemic, estimated to account for 8% of prevalent infections [15] but only 4% of the population [16]. Population studies in the USA have shown that HLA allele frequencies differ significantly between the five major 'outbred' ethnic groups: African-Caribbean, Asians, Caucasians, Native Americans and Latin-Americans [7,17]. To gain insight into the forces driving the evolution of the HIV epidemic, we sought to investigate whether HIV sequences coming from different ethnic groups in Canada exhibited characteristic mutation patterns resulting from shared host-driven selective pressures.
Since HLA allele frequency data are currently not available for association studies in the Canadian population we studied, we used a recently developed method to compare host selection pressure between populations in the absence of HLA allele frequency data [18]. In order to examine the differences in selective pressure within different ethnic groups, we compared site-specific frequencies of amino acids in HIV pol sequences classified according to ethnicity. This method offers the additional advantage of not requiring phylogenetic separation of sequences for the populations studied [18]. We found divergent HIV sequence patterns among ethnic groups at 8 sites under positive selection that have been shown to mutate under HLA-associated immune pressure. Results are consistent with differential HIV-1 adaptation to HLA class I alleles among ethnic groups in Canada.

Epidemiological Characteristics of the Study Population
Long term infections are most likely to bear evolutionary imprints resulting from the host's cellular immune response and would therefore be the most relevant to the analysis. In order to maximize the probability that observed mutation patterns were due to HLA selective pressure within the subject under study, and not reflecting immune selection from the transmitting partner, we included only samples from long-term infections (older than 155 days), as determined by the capture enzyme immunoassay or BED-CEIA test [19]. Sequences from 1248 ethnicity-typed subtype B samples, from established infections, were included. Sequences were separated into five ethnic groups previously demonstrated to differ in HLA allele frequencies in North America [17] (Table 1)

Viral Divergence Cannot be Explained by Phylogenetic History
Because observed sequence patterns could be due to founder effects and phylogenetic clustering among ethnic groups, it was necessary to identify a priori phylogenetic clustering. A maximum likelihood (ML) tree containing 1272 sequences was first constructed with FastTree 2.1 [20] under the most appropriate model selected by jModelTest [21], which was the general time reversible model with among-site rate heterogeneity (GTR+C). Two apparent monophyletic clusters associated with common ethnicity were removed, and ML tree reconstruction was repeated on the remaining 1248 sequences. The APE package [22] implemented in R [23] was then used to sample subtrees randomly across the tree for further Bayesian phylogenetic analysis in BEAST [24]. Seven subtrees containing 88-161 sequences were sampled, covering 66.4% of the tree (829/1248 sequences). Subtrees were then tested for clustering of ethnicities in BaTS ( Figure 1; see Figure S1 for all other subtrees) [25].
Within each subtree, terminal nodes were annotated with our character of interest (ethnicity) and we tested whether the distribution of ethnicity on the phylogeny was non-random. Association indices and parsimony scores both indicated an absence of phylogenetic clustering by ethnic group in any of the subtrees (Table 2). Therefore, on the final tree, sequences from ethnic groups were no more clustered than expected by chance, and we could proceed with testing whether pol sequences showed evidence of ethnic-specific mutation patterns.

Seventeen Sites in Pol are Divergent Between Ethnicities
In spite of this lack of shared viral ancestry within ethnic groups, an analysis of molecular variance (AMOVA) [26] of 413 amino acid sites in the pol region demonstrated significant divergence in amino acid composition between ethnicities at 17 sites (p,0.01; Figure S2). Divergent sites were highly polymorphic, as compared to the rest of the pol sequence, and their average entropy was seven times higher (0.28 and 0.04; respectively, p,5610 220 ). The extent of population differentiation at these sites was measured through the fixation index (F ST [27]) that quantifies the proportion of the observed variation that is contained within the subpopulation ('S ', here, ethnic group) as compared to the total population ('T').
Measured in this way, population differentiation could either be indicative of divergence in amino acid composition between all groups, or could point to a deviation within a single ethnic group. Significant F ST values (at the 1% level) ranged from 1.5% to 6.0%, signifying that at sites that are divergent between groups, ethnic subgrouping accounted for up to 6% of the observed variation. At all but two of these 17 sites, the most common amino acid was conserved across ethnic groups; only the frequencies of amino acids differed between groups. Patterns of dominance were nevertheless observed at the remaining 15 sites. The highest F ST values were noted at sites where the most prevalent amino acid differed between ethnicities, PR93 (F ST = 0.060) and RT277 (F ST = 0.052, Figure 2; Figures S2,S3). When the AMOVA was performed on all 1272 sequences (including two clusters of sequences from Aboriginal patients), the same 17 sites were identified, with F ST values changing only slightly (data not shown).

Divergent Sites are Strongly Associated with HLA Sites and Sites Under Positive Selection
In order to understand the origin of this differentiation at viral sites relative to ethnicity, we further investigated the role of HLAdriven convergent selective pressures. Based on the literature, 78 sites known to be HLA-associated (HLA+) were mapped across the sequenced pol region [6]. In parallel, using the Single Ancestor Counting algorithm in HyPhy [28], 27 codons were inferred to be   Figure 3). Sites divergent between ethnicities were also strongly associated with both HLA+ (p = 1.91610 27, Fisher's exact test) and Pos+ sites (p = 1.78610 26 , Fisher's exact test, Figure 3). Eight codons were both divergent between groups, and inferred to be under positive selection: PR35, PR93, RT135, RT162, RT245, RT277, RT293 and RT297 ( Figure S2). According to the definition put forward by Perez-Sweeney et al. [18], these sites can be considered putatively adaptive divergent sites, i.e. selective pressure may differ at these codons between the groups studied. Of these, all except RT293 were also on the list of HLA-associated sites produced by Brumme et al. [6], suggesting that the observed selection pressure is the result of HLA alleles differentially distributed across ethnicities. Moreover, fifteen additional sites were HLA+ and Pos+, although not divergent between ethnicities. These sites appear to be under strong HLA induced selection pressure within the Canadian population, in response to HLA alleles shared among the different ethnic groups examined. Finally, six sites were divergent between groups, and HLA+, but not Pos+ ( Figure 3). These sites could either be false positives due to low frequency haplotypes (that is, not divergent), or false negatives (that is, adaptive) not detected by the codon model.

Strength of Selective Pressure Differs Between Populations Only at a Small Number of Sites
Finally, strength of site-specific selective pressure was directly compared between pairs of populations [13]. A level of significance a = 0.005 (Bonferroni; 56(521)/2 = 10) was chosen to correct for multiple comparisons. The strength of selective pressure differed only at a small number of sites (Table 3). All these sites had previously been detected as divergent by the F ST test (RT108, RT111), or were included in the list of HLA-associated sites (PR19, RT166, RT198), or both (PR35; Figure S2). Differences in strength of selection pressure on internal branches, indicating substitutions selective at both the individual and population level, occurred only at two sites in protease, PR19 and PR35. Recent adaptations, on branches leading to the tips of the phylogeny, are likely to be maladaptive and purified at the population level, and indeed were not detected by the codon model ( Figure S2).

Discussion
In this study, we characterized HIV sequence diversity and evolution in the ethnically heterogeneous Canadian epidemic in the absence of subject HLA data. We first demonstrated that HIV sequences from patients of different ethnic origins showed distinctive mutation patterns, and that these patterns were not associated with viral ancestry or founder effects. Further analysis confirmed that divergent sites were strongly associated with sites known to be under HLA selective pressure, although there was little evidence for different non-synonymous rates of mutation between pairs of populations. The five ethnic groups studied had previously been demonstrated to have characteristic HLA allele frequencies in different populations [17]; we therefore suggest that HIV is evolving in parallel at these sites within ethnic groups due to shared HLA alleles.
HLA allele frequency data for different ethnic groups is gathered from all published studies and stored in a centralized database [29]. Using this database and known associations between HIV mutations and specific HLA alleles [6], we sought to attribute our observations to HLA alleles more frequently observed in particular ethnic groups. Although HLA data for our study population were not directly available, we used data from comparable ethnic groups in the USA, with Amerindians being used as a proxy for Aboriginals. At most sites, our observations were consistent with available HLA allele frequency data for each ethnic group. For example, at site RT277, the escape mutant R was seen more frequently in Caucasians than in the other ethnic groups. This result is in agreement with a known association between RT277 E and the A*03 allele, found more frequently in Caucasians (up to 30%) than in any other group. Notably, the escape mutant R was the consensus in the IHAC cohort [6], as it would be here, while the susceptible K was more common in Mexico [9]. Broken down by ethnicity, K was the most common amino acid in all groups other than Caucasian. Similarly, at site PR93, we observed high frequencies of B*15 associated escape mutant L in Asians and Latin-Americans. This is consistent with high frequencies of B*15 among Asians (up to 40%). However, we did not find higher frequencies of B*15 among Latin-Americans as compared to other ethnicities in the database. Meanwhile, the high frequency of B*07 escape mutant RT162 C is congruent with the higher frequency of allele B*07 in Caucasians (up to 19%). At sites PR35, RT135, RT245 and RT297, patterns became more complicated to interpret because of high amino acid diversity. Nevertheless, the high frequency of B*57 escape mutant RT245 E among Asians and Latin-Americans is in accordance with the very high frequency of B*57 in these ethnic groups as compared to Caucasians (up to 20% as compared to up to 6% in Caucasians). The degree to which the B*35 escape mutant RT297 A was found among Aboriginals is consistent with the very high frequency of B*35 detected among Amerindians [30] (up to 45% as compared to up to 24% in other ethnic groups). In the context of an association between B*35 and faster disease progression [31], this latter finding is important and is the subject of an ongoing clinical trial in Manitoba (Canadian HIV Trials Network CTNPT 004). Our results are in agreement with reviews highlighting the unique HLA makeup of Amerindian populations, and their restriction in diversity of HLA types [32,33]. It is therefore very likely that the role of distinct patterns of HLA selection is particularly important in a country like Canada, where a single ethnic group, Aboriginals, is so disproportionately affected by the epidemic.
Although not included in the initial list of HLA sites used, our data showed that RT293 was divergent between ethnicities (F ST results) and under positive selection when all ethnicities were pooled together. This site has been identified to be under HLA immune pressure in a subsequent HIV/HLA genetic association study performed in Mexico [9]. In that study, the frequency of HLA alleles were distinct as compared to those found in the IHAC cohort and were suggested to reflect the unique Amerindian/ Caucasian HLA admixture. Because in our study, amino acid frequencies at this site differed between ethnicities and positive selection was detected, we suggest that selection pressure at this site is indeed HLA-driven in our population, possibly reflecting a similar influence of Aboriginal HLA alleles.
Understanding the forces shaping HIV genetic diversity is crucial to surveillance efforts and for the control of the HIV epidemic. It is clear that heritable factors such as HLA alleles that are associated with ethnicity are strong forces shaping the evolution of HIV [12,34]. The strength of the method used to characterize HIV evolution here is at least twofold. First, it allows for selection pressures to be compared across groups which do not necessarily phylogenetically separate from each other. Second, the method combines several approaches for detecting differential selection pressure, thus reducing the likelihood of type I and type II errors: (i) we identified putative ''adaptively divergent'' sites using population genetics and codon-based selection tests; (ii) we evaluated whether these sites corresponded to known HLA epitopes using Fisher's exact test; (iii) we estimated whether the strength of selective pressure at each site differed between pairs of populations.
Although not entirely discordant, the sites detected by the test to compare selective pressures differed from those detected by the F ST test. This may be due to a number of reasons. Most importantly, the former test compares only strength of selection between datasets. The model does not allow for variation in nucleotides frequencies at each codon position between pairs of populations. Indeed if different amino acids at a given site were selected for between the two populations, but with the same strength, these would not be highlighted as significant. The F ST test, on the other hand, detects different patterns of amino acid frequencies between populations but does not incorporate information on strength of selection. In addition, the test comparing selective pressures may be lacking in statistical power as (i) the required correction for multiple comparisons is overly conservative (Bonferroni) and (ii) only a small number of sequences were available, in particular from African-Caribbeans, below the 50 recommended in the HyPhy user manual. Yet sites with the lowest p values (but above 0.005) were highly consistent across comparisons (data not shown).
Our results support the conclusion of John et al's that ethnicity data may improve the resolution of HIV/HLA association studies [7]. For statistical power, such studies often group HLA alleles into supertypes. However, within supertypes, the selection effects of different alleles vary significantly. As the distribution of highly resolved HLA alleles is most marked within supertypes, HIV/ HLA associations can be masked by this grouping. In our study, different HLA-associated mutation patterns were identified between ethnic groups, even in the absence of HLA data. Thus, as an alternative to classifying HLA alleles to four-digit resolution, we propose that combining 2-digit resolution with ethnicity data may improve the strength of associations demonstrated between HLA alleles and HIV mutations, without prejudicing statistical power. Thorough analysis of cohorts for which HIV sequences as well as ethnicity and HLA allele data are available could be used to evaluate this hypothesis.
Although our findings are compelling, there are several limitations to our analysis. Most importantly, the number of immune-associated polymorphisms would have been greatly increased had HLA allele information been directly available for patients. It is a testament to the strength of HLA-mediated HIV adaptation that it can be detected with such a rough classification as declared ethnicity. The HLA alleles expressed across our ethnic groups are not population-specific. In particular, the Caucasian group is likely to be of admixed origin and to contain subpopulations with different HLA allele frequencies. Moreover, distinct HLA alleles can drive the selection of the same escape mutant [8], potentially obscuring a correlation between the two. In agreement, estimated F ST values were low, although significant. Nevertheless, despite the potentially imperfect grouping in our analysis, statistical associations were found between mutation patterns and ethnic groups, confirming that our analysis is robust.
Another limitation of this type of analysis is that HLA-associated mutation patterns in HIV may not necessarily reflect the HLA background of the patient currently infected. Although HIV evolves rapidly within a new patient to match his or her HLA type [35], usually during early infection [4], there is also evidence that HIV may revert to wild type [36] or, in the absence of a fitness cost, maintain escape mutants associated with the transmitter's HLA type. In the acute stage in particular, an adaptation profile specific to the newly infected individual may not yet have been generated. Instead, HLA-associated sites may reflect adaptation to a previous HLA type. To avoid this issue, we used HIV sequences only from patients classified as having established infections according to the BED assay, as over time HIV is more likely to adapt to the new patient's HLA type.
The dataset under study was generated though HIV surveillance for new diagnoses and presents certain limitations due to small subsets of data. The African-Caribbean population in particular, is greatly under-represented, reflecting the imbalance of subtype B infected ethnic groups in the epidemic. In Canada, the vast majority of HIV cases are among Caucasians and Aboriginals. Nevertheless, we felt that an inclusion of all major ethnic groups, despite their small numbers, would benefit the analysis, in so far as results are more widely applicable to similar ethnically-diverse western epidemics.
The data could also be biased by the unequal distribution of exposure categories across ethnic groups. Specifically, there is an over-representation of intravenous drug users and an underrepresentation of men who have sex with men among Aboriginals. Polymorphisms transmitted within single exposure category and thus single-ethnicity transmission clusters could artefactually appear ethnic-specific. While in some epidemics sequences cluster significantly by exposure category, overlap between exposure categories is seen in Canadian clusters [37] and so we do not expect this to be the case. There is otherwise a mix of exposure categories across the ethnic groups, and we took care to remove two clusters which may have introduced bias.
We have shown that HLA-associated mutation patterns differ across the ethnicities studied. These observations support ongoing research investigating whether differences in HLA allele frequencies could explain the susceptibility of Aboriginals to HIV. Furthermore, our study suggests that incorporating ethnicity information into future HLA association studies in Canada may be important. Finally, an understanding of ethnic specific host influence on viral evolution is critical for vaccine development, as an effective vaccine will require targeting regions that are reactive across groups.

Phylogenetic Analysis and Character Association
Phylogenetic interrelationships between 1272 sequences were reconstructed in FastTree 2.1 under a GTR+C model, as selected by jModelTest. Two clusters of sequences for which there was an association between ethnicity and phylogeny were removed from the dataset and the ML reconstruction was repeated on 1248 sequences. Using the APE package in R, subtrees were randomly selected for Bayesian phylogenetic analysis in BEAST. Sequence subsets were run in duplicate under a Bayesian Skyline Plot model with 10 breakpoints and linear splines. Convergence was assessed in Tracer after 100 million generations. After elimination of a burn-in period (10-20% of run), a posterior distribution of trees was generated for analysis in BaTS. Terminal nodes were annotated with our character of interest, ethnicity, and the non-random distribution of ethnicity was tested in each subtree.

Positive Selection and Population Divergence
Sites previously identified to be under HLA-mediated selection [6] were mapped along sequences in the alignment. Sites inferred to be under positive selection (p,0.01) were determined using HyPhy in each ethnic group dataset. In datasets exceeding 50 sequences, the SLAC algorithm was employed [40]. The SLAC algorithm calculates observed numbers of non-synonymous (N) and synonymous (S) mutations at each codon in an alignment, and compares these to expected numbers (E[N] and E[S]) in order to estimate selection pressure (dN = N/E[N], dS = S/E[S]). A low dN/dS ratio (,1) indicates purifying selection, while a high dN/ dS (.1) suggests diversifying positive selection pressure. In datasets ,50 sequences, both the FEL and REL algorithms were used, and only sites appearing in both lists were considered to be under positive selection (as recommended by the HyPhy user manual).
Site-specific entropy values for each amino acid within the alignment were calculated using the Los Alamos Entropy Tool (http://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_ one.html). Entropy is a measure of the variability at each position in an alignment; a site with high entropy is highly variable.
Next, population divergence between ethnic groups was measured at the amino acid level by calculating the Fixation Index (F ST ) using an analysis of molecular variance (AMOVA), as implemented in Arlequin v3.5.1.2 [41]. In order to calculate F ST between amino acid sequences, positional amino acid frequency data were generated in BioEdit and transformed into a format readable by Arlequin using an in-house Perl script (available upon request).
In order to determine whether sites that were divergent between ethnic groups were associated with sites under positive selection and HLA-associated sites, we used Fisher's exact test. All statistical analyses were carried out in SPSS [42].
Finally, we estimated whether the strength of selective pressure at each site differed between pairs of populations using two tests [13] implemented in HyPhy. The former compares selection pressure across two trees constructed for populations separately. However, recent non-synonymous changes (on external branches, leading to tips) reflect only adaptation at the host level and may be maladaptive at the population level. Hence the second test discriminates between substitutions occurring on internal branches from those occurring on external branches of the tree. Figure