Candidate Vaccine Sequences to Represent Intra- and Inter-Clade HIV-1 Variation

A likely key factor in the failure of an HIV-1 vaccine based on cytotoxic T lymphocytes (CTL) is the natural immunodominance of epitopes that fall in variable regions of the proteome, which both increases the chance of epitope sequence mismatch with the incoming challenge strain and replicates the pathogenesis of early CTL failure due to epitope escape mutation during natural infection. To identify potential vaccine sequences to focus the CTL response on highly conserved epitopes, the whole proteomes of HIV-1 clades A1, B, C, and D were assessed for Shannon entropy at each amino acid position. Highly conserved regions in Gag (cGag-1, Gag 148–214, and cGag-2, Gag 253–331), Env (cEnv, Env 521–606), and Nef (cNef, Nef 106–148) were identified across clades. Inter- and intra-clade variability of amino acids within the regions tended to overlap, suggesting that polyvalent representation of consensus sequences for the four clades would allow broad HIV-1 strain representation. These four conserved regions were rich in both known and predicted CTL epitopes presented by a breadth of HLA types, and screening of 54 persons with chronic HIV-1 infection revealed that these regions are commonly immunogenic in the context of natural infection. These data suggest that vaccine delivery of a 16-valent mixture of these regions could focus the CTL response against conserved epitopes that are broadly representative of circulating HIV-1 strains.


Introduction
Efforts to design a vaccine against Human Immunodeficiency Virus type 1 (HIV-1) have been disappointing. The principal empirical strategies that yielded successful vaccines against other viruses in the past have not provided protective immunity. The first unsuccessful attempts included strategies using inactivated whole virus or virus protein subunits, which would be expected to raise antibodies and HLA class II-restricted helper responses against HIV-1. When such approaches (including a phase III trial of the HIV-1 envelope-based "AIDSVAX") failed to produce protective humoral immunity, researchers turned to the idea that a vaccine eliciting HLA class I-restricted cytotoxic T lymphocyte (CTL) responses might provide protection against disease if not infection, given the increasingly clear protective role of CTL in the immunopathogenesis of HIV-1 infection.
Attempts to generate HIV-1-specific CTL responses with a vaccine have focused heavily upon vector development for immunogenicity, as safety concerns precluded the classical empiric approach of using live-attenuated HIV-1. Although numerous strategies ranging from naked plasmid DNA to replication-competent vaccinia vectors showed promise in animal models, to date only recombinant adenovirus serotype 5 (rAd5) has appeared to be reliably immunogenic for CTL responses in humans. Vaccination of HIV-1-uninfected persons with modified replication-incompetent rAd5 containing HIV-1 genes elicited CTL responses comparable to those raised by natural HIV-1 infection, as measured by ELISpot and intracellular interferon-γ assays [1]. Unfortunately, the first large efficacy trial of this approach was halted for futility at mid-enrollment, when interim safety analysis revealed that there was no difference in infection incidence or set-point viremia levels after infection between placebo and vaccine arms [2].
The cause of vaccine failure remains unknown, but there are at least two major types of possibilities (reviewed in [3,4]). First, the "immunogenicity" of the vaccine as reflected by IFN-γ-based assays using exogenously loaded small peptides as surrogates for cellular HIV-1 infection may not have reflected true functional immunogenicity for antiviral CTL, i.e. the capability to recognize the levels of epitope present on infected cells [5] or the phenotype, effector, or trafficking capability of the CTL raised. While this has yet to be investigated, the ability of the same rAd5 vector to elicit protective immunity against SIV in the rhesus macaque model suggests that immunogenicity from this vector may be adequate. Second, the CTL targeting elicited by the vaccine may have been faulty in at least two ways related to immunodominant targeting of variable epitopes. Recent functional studies of CTL antiviral activity have indicated that the impact of epitope variation has been greatly underestimated by prior peptide-based assays [5,6], so the epitopes targeted by the vaccine may not have matched the epitopes in the circulating strains. Another possibility is that targeting of variable epitopes may have replicated the natural misdirection of CTL immunodominance. In essence, the vaccine approach in the rAd5 trial delivered genes for most of the HIV-1 proteome (Gag, Pol, Nef), to mimic natural infection and whole viral immunologic exposure. Presumably, this would give the host similar choices for CTL targeting and result in immunodominance patterns comparable to early natural infection.
While other successful vaccines have targeted viruses where natural immunity is frequently protective if the host survives the initial infection, natural immunity against HIV-1 infection generally fails. Thus, mimicking naturally occurring immune responses with a vaccine may not suffice for HIV-1, in contrast to other viruses. Increasing data about early HIV-1 infection suggest that CTL targeting initially is misdirected towards variable epitopes [7,8], allowing early viral escape [9,10,11,12] and inadequate early immune containment, with resultant irreversible depletion of the CD4+ T helper cell pool [13,14,15] that dooms the efficacy of the CTL response in the long term despite eventual re-targeting against conserved epitopes. A vaccine might therefore offer the opportunity to alter the patterns of CTL immunodominance seen in natural infection by predetermining memory responses against epitopes that are highly conserved, rather than those with the highest affinity [16] or other immunodominant properties that do not limit escape. Thus, some vaccine developers have considered strategies to provide vaccine sequences that target conserved epitopes, in contrast to the current standard approach of delivering whole proteins (including variable regions). Here we identify conserved regions within the HIV-1 proteome to propose as vaccine sequences, with additional representation of intra- and inter-clade variation.

Ethics Statement
The portion of this work involving human subjects was performed under a protocol approved by the Office for Protection of Human Research Subjects (IRB) at the University of California Los Angeles. Written informed consent was received from each participant.

Determination of Shannon entropy across the HIV-1 proteome
Shannon entropy is a quantitative measurement of uncertainty in a data set such as a collection of protein sequences. Comparing the sequences against each other, the entropy of each amino acid position reflects the uncertainty (variability) of that position across all sequences. Lower entropy reflects more predictability at that position, e.g. the amino acid is always the same (no entropy), or few different amino acids observed rarely. Higher entropy reflects more uncertainty at that position, e.g. many different amino acids frequently occupy that position with no predominant amino acid.
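The definition above can be illustrated with a minimal sketch. This is not the Los Alamos tool; it assumes an already aligned set of equal-length sequences, treats every character (including gap symbols) as a distinct state, and uses the natural logarithm, all of which are assumptions made for illustration. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy of one alignment column (natural log; an assumption)."""
    counts = Counter(column)
    total = sum(counts.values())
    # H = sum over residues of p * ln(1/p); a fully conserved column gives 0.0
    return sum((n / total) * math.log(total / n) for n in counts.values())


def alignment_entropy(sequences):
    """Per-position entropy for aligned sequences of identical length."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [column_entropy([seq[i] for seq in sequences]) for i in range(length)]


# Toy alignment (not HIV-1 data): position 0 is invariant,
# position 1 is split evenly between two residues.
toy = ["AV", "AI", "AV", "AI"]
entropies = alignment_entropy(toy)
```

In this toy case the first position has zero entropy and the second has the maximum entropy possible for two equally frequent residues.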
To calculate Shannon entropy across the HIV-1 proteome, the sequences of each full-length protein within each clade were aligned against clade consensus protein sequences (Los Alamos HIV Sequence Database) using ClustalX 2.0.10 on an Apple Macintosh Pro running OS X 10.5.7, with manual editing. Shannon entropy at each amino acid position was then calculated using the Shannon Entropy online tool at the Los Alamos National Laboratory HIV Sequence Database: http://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_one.html. (Further explanation is given at the Los Alamos National Laboratory HIV Sequence Database: http://www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_readme.html).

Listing of known and predicted epitopes within regions of the HIV-1 proteome
Known and predicted CTL epitopes within regions of the HIV-1 proteome were listed using the Epitope Location Finder online tool at the Los Alamos National Laboratory HIV Immunology Database (using all available class I HLA types): http://www.hiv.lanl.gov/content/sequence/ELF/epitope_analyzer.html.

Analysis of amino acid variation within clades
The frequencies of amino acid polymorphisms at specific locations within each clade were assessed using the online QuickAlign tool from the Los Alamos National Laboratory HIV Sequence Database: http://www.hiv.lanl.gov/cgi-bin/QUICK_ALIGN/QuickAlign.cgi.

HIV-1-infected participants
Persons with chronic HIV-1 infection who were not on antiretroviral therapy were recruited through a University of California Los Angeles IRB-approved protocol. PBMC were isolated from freshly drawn whole blood by ficoll-hypaque gradient and viably cryopreserved. Plasma viremia data were obtained from medical records (or patient self-report in a few cases where medical records were not available).

Mapping of HIV-1-specific CTL responses
Standard interferon-γ ELISpot assays were performed using polyclonally expanded CD8+ T cells as previously described in detail [17]. In brief, PBMC were thawed and cultured with a CD3:CD4 bi-specific monoclonal antibody, resulting in expansion of CD8+ T cells (generally at least 95% CD3+ CD8+) that has been shown to mirror CTL responses in unexpanded CD8+ T cells [17,18]. After 14 days, the cells were utilized for standard ELISpot assays using overlapping peptides (15 amino acids sequentially overlapping by 11 amino acids) from the NIH AIDS Reference and Research Reagent Repository. Gag peptides included clade B consensus and DU422 sequences (catalog #8117 and #6869). Env peptides included clade B consensus or MN sequences (#9480 and #6451). Nef peptides included clade B consensus sequences (#5189). Pools of 16 or fewer peptides were utilized for a first round of screening, followed by 4×4 matrices to identify individual peptide candidates for a second round of screening, followed by confirmation of individual peptides in a third round of screening.
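The peptide layout and the matrix step can be sketched as follows. This is an illustrative reading only: the handling of the final peptide at the protein's C-terminus, the row and column pooling scheme for a 4 by 4 grid of up to 16 peptides, and all function names are assumptions, not details reported for the reagent sets or the assays used here.

```python
def overlapping_peptides(protein, length=15, overlap=11):
    """Tile a protein with peptides of `length` that overlap by `overlap` residues."""
    step = length - overlap
    if len(protein) <= length:
        return [protein]
    last_start = len(protein) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Assumption: add one final peptide flush with the C-terminus
        starts.append(last_start)
    return [protein[s:s + length] for s in starts]


def matrix_pools(peptides, size=4):
    """Arrange up to size*size peptides in a grid; return row pools and column pools."""
    if len(peptides) > size * size:
        raise ValueError("too many peptides for the matrix")
    rows = [peptides[r * size:(r + 1) * size] for r in range(size)]
    cols = [[row[c] for row in rows if c < len(row)] for c in range(size)]
    return rows, cols


def matrix_candidates(rows, cols, positive_rows, positive_cols):
    """Peptides lying at the intersection of a positive row pool and a positive column pool."""
    return [
        pep
        for r in positive_rows
        for pep in rows[r]
        if any(pep in cols[c] for c in positive_cols)
    ]


# Placeholder peptide names standing in for one first-round pool of 16
pool = [f"pep{i}" for i in range(16)]
rows, cols = matrix_pools(pool)
candidates = matrix_candidates(rows, cols, positive_rows=[1], positive_cols=[2])
```

Under this layout a single positive row and a single positive column point to one candidate peptide, which would then be retested individually.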

Results
Despite the overall variability of the HIV-1 clade B proteome, there are stretches of relatively conserved sequences
The plasticity of HIV-1 sequences is an obvious barrier for vaccine development. To assess sequence variability and conservation across the viral proteome, all available clade B complete HIV-1 protein sequences in the Los Alamos National Laboratory (LANL) HIV Sequence Database were assessed for Shannon entropy at each amino acid position. The entropy at each position and the mean entropy for each stretch of nine amino acids (corresponding to most potential epitopes) were plotted for each of the nine viral proteins (Figure 1). The plot revealed gross differences between proteins, such as generally higher variability in Env and relatively lower variability in Pol. All proteins contained spikes of variable codons scattered throughout, although there were stretches of lower variability with fewer spikes even in the generally more variable proteins Env and Nef. These plots therefore demonstrated that despite the overall plasticity of the HIV-1 proteome, there are sequence constraints that restrict variability in certain regions within several proteins. Four regions from highly expressed proteins (i.e. excluding Pol) that were at least 40 amino acids in length and generally conserved were selected for further examination (Figure 1 shaded regions). Pol was excluded due to its low level of expression from translation requiring a ribosomal frameshift along the gag-pol transcript [19], which can reduce the immunogenicity and antiviral efficacy of Pol-specific CTL [20,21].
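The nine-residue averaging described above amounts to a sliding-window mean over the per-position entropy values. The following is a minimal sketch under that assumption; the function name and the toy input are hypothetical, and the plotted figure may have been produced differently.

```python
def windowed_mean(values, window=9):
    """Mean of each consecutive stretch of `window` values (one result per stretch)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy entropy profile (not HIV-1 data): nine invariant positions followed by one variable position
profile = [0.0] * 9 + [0.9]
means = windowed_mean(profile)
```

A single variable position raises the mean of every nine-residue stretch that contains it, which is why isolated spikes appear broadened in a windowed plot.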

Sequence constraints fall in the same regions across different HIV-1 clades, and within selected conserved regions, inter-clade variability is limited to consistent sites
HIV-1 is observed in genetically distinct clades that reflect divergent pathways for evolution [22]; clades A (A1), B, C, and D account for the vast majority of circulating strains worldwide. To assess whether the patterns of sequence variability observed for clade B (Figure 1) apply to other clades, available clade A1, C, and D complete HIV-1 protein sequences in the LANL HIV Sequence Database were analyzed in parallel for Shannon entropy at each amino acid position. Comparisons of the clades revealed that the patterns of variability across the proteome were very similar in general (Figure 2 and data not shown), suggesting that different HIV-1 clades share certain areas of functional/structural constraint despite divergent evolution. Examining the four selected conserved regions, it was notable that differences between clade consensus sequences in these regions were limited to a few codon positions (Figure 3). This finding further suggested that inter-clade variability in these conserved regions is relatively limited, with a few wobble positions that have diverged between clades.

Within conserved regions of the HIV-1 proteome, much of the intra-clade diversity overlaps inter-clade sequence variation
Analysis of amino acid variation within each clade demonstrated that many of the positions that were more variable intra-clade overlapped positions where the consensus sequence differed inter-clade. Furthermore, at these varying positions within each clade, there often was a single dominant non-consensus polymorphism that corresponded to the consensus amino acid for another clade. For example, in region cGag-1, the consensus amino acid at position 159 was isoleucine for clades A1, C, and D, and valine for clade B. Within clades A1, C, and D, position 159 often varied from consensus, and the most common polymorphism was valine. Conversely for clade B, position 159 also frequently varied from consensus, and the most common polymorphism was isoleucine. These observations suggested that there is a high degree of overlap between intra- and inter-clade variability in these conserved regions, consistent with a recent study examining HIV-1 sequence variation [23]. Thus a mixture of clade consensus sequences of these conserved regions (to represent HIV-1 variability worldwide) would represent a large portion of circulating HIV-1 sequences within any clade (to represent HIV-1 variability within individuals).
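The comparison in the position 159 example can be sketched as a per-position tally of the consensus residue and the most common non-consensus residue in each clade. The residue counts below are invented for illustration (only the identities of the consensus and dominant variant follow the text), and the function name is hypothetical.

```python
from collections import Counter


def residue_profile(column):
    """Return (consensus residue, most common non-consensus residue or None) for one position."""
    ranked = Counter(column).most_common()
    consensus = ranked[0][0]
    variant = ranked[1][0] if len(ranked) > 1 else None
    return consensus, variant


# Invented residue tallies at a single position for two clades (not real frequencies)
clade_b_column = "V" * 7 + "I" * 3
clade_c_column = "I" * 8 + "V" * 2

b_consensus, b_variant = residue_profile(clade_b_column)
c_consensus, c_variant = residue_profile(clade_c_column)

# Intra-clade variation overlaps inter-clade variation when the dominant
# variant in one clade is the consensus residue of the other.
overlaps = b_variant == c_consensus and c_variant == b_consensus
```

When this check holds at the variable positions of a region, a mixture of clade consensus sequences also carries the most common within-clade variant at those positions.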

The selected conserved regions are rich in potential CTL epitopes
To assess whether these regions might be immunogenic if delivered in a CTL-based vaccine, the Los Alamos HIV Immunology Database was scanned for known and predicted CTL epitopes falling within the regions (Table 5). Each region contained many previously reported CTL epitopes associated with a variety of HLA types, as well as binding motifs for many additional HLA types. These findings suggested that these conserved regions contain many epitopes that can be presented by a broad range of HLA types, and therefore should be immunogenic for CTL responses from most persons.

The selected conserved regions are immunogenic in natural infection
To assess this prediction and test whether the selected conserved regions are indeed commonly immunogenic in actual HIV-1 infection, the CTL responses within 54 persons in the Los Angeles area with chronic infection (not receiving antiretroviral treatment) were examined. HIV-1-specific CTL responses were defined by standard interferon-γ ELISpot assays using proteome-spanning 15-mer peptides overlapping by 11 amino acids. All four regions were targeted by CTL within multiple persons (Table 6). The least often recognized was the Env region, where responses were seen in 4/54 persons (7.4%). The two Gag regions and the Nef region were targeted at similar frequencies: 17/54 (31.5%) for cGag-1, 17/54 (31.5%) for cGag-2, and 19/54 (35.2%) for cNef. Across the group, 38/54 (70.4%) of persons targeted at least one region, 17/54 (31.5%) targeted at least two regions, 2/54 (3.7%) targeted at least three regions, and none was observed to target all four regions. Although these subjects were enriched for HLA B*57 due to the bias of selecting untreated persons, the presence or absence of B*57 did not appear to predict responsiveness against these regions. Correlations between targeting of these conserved regions and viremia were not observed in this relatively small number of subjects (not shown). Overall, these data confirmed that these conserved regions can be immunogenic in the context of whole HIV-1 infection, suggesting that CTL responses could be focused preferentially on these regions by a vaccine containing only these regions.

Discussion
While a CTL-based vaccine may not prevent HIV-1 infection, it may offer the opportunity to attenuate disease from subsequent infection. To do so, however, the CTL responses generated by the vaccine must surpass the efficacy of those that would be raised in the natural course of infection. Simply priming CTL responses to have a head start versus HIV-1 infection and immune damage could be advantageous, but the recent failure of an apparently immunogenic CTL-based vaccine to have any impact on viremia set-point [2] suggests that this is not sufficient. Clearly, the interaction between CTL and HIV-1 during acute HIV-1 infection is crucial, because set-point viremia (which predicts the long-term rate of disease progression) is determined by the end of acute infection, and is maintained in large part by the CTL response [24,25,26]. Given the tendency of this response to target variable epitopes and the observation of rapid and frequent viral escape during acute infection (when massive depletion of crucial helper CD4+ T cells occurs), use of a vaccine to focus the response selectively against highly conserved epitopes may offer an avenue to address a key shortcoming of the natural immune response. The other key advantage of such targeting would be minimizing the chance that the epitope sequences of the vaccine would mismatch those of an incoming challenge strain of HIV-1.
Despite the overall plasticity of the HIV-1 proteome, viral variability is not limitless. Its protein sequences clearly are constrained in particular domains, which is not surprising given the many structure/function requirements for viral replication that are imposed on the small proteome. Shannon entropy analysis reveals stretches that are apparently highly constrained in variability and therefore may be useful for CTL focusing by a vaccine. While these regions were selected purely on the basis of entropy analysis, it is interesting that they closely correspond to distinct functional domains in Gag, Env, and Nef. The cGag-1 and cGag-2 sequences both fall in the p24 capsid; the former spans four α-helices that are absolutely required for viability, and the latter spans the so-called "major homology region" that is structurally conserved across other diverse retroviruses such as Feline Immunodeficiency Virus, Rous Sarcoma Virus, Murine Mammary Tumor Virus, Bovine Leukemia Virus, Friend Murine Leukemia Virus, and HTLV-1 [27]. The cEnv region falls in the ectodomain of gp41 and spans the functionally crucial fusion peptide and heptad repeats [28]. The cNef region falls in the "central conserved region" and includes the second alpha helix (key for interaction with the src kinase SH3) and the three adjacent beta strands that compose the central structured core domain [29], and thus contains a key binding site required for many cellular effects of Nef, and sequences that are a crucial structural scaffold. Within these generally conserved regions, the remaining variability tends to occur at a few amino acid positions, further reflecting the strict functional restrictions on sequence variation.
Much of the intra-clade variability overlaps the inter-clade variability of these sequences, suggesting that there are limited choices for amino acid substitutions in a few foci of "wobble." These conserved regions therefore could be utilized in a polyvalent vaccine. A mixture of clade A1, B, C, and D consensus sequences for each of the four regions would provide highly conserved epitopes, and include the most common variants of epitopes spanning the few "wobble" positions that vary intra- and inter-clade. Delivering these regions separately (in a mixture of 16 individual vectors), rather than concatenating them into a single construct, could be advantageous for preventing competition between epitopes and could yield a broader response [30,31,32]. While providing the four consensus versions of each region may help represent the breadth of polymorphisms in these overall conserved sequences, another benefit may be dilution of less common variants. An epitope sequence that is conserved in all four consensus sequences will be expressed at four-fold the level of an epitope variant that is present in only one consensus sequence.
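The arithmetic behind the 16-component mixture and the four-fold dosage effect can be sketched as follows. The sketch assumes that each vector is delivered and expressed in equal amounts, and the short strings are placeholders rather than HIV-1 sequences; the function name is hypothetical.

```python
from itertools import product

regions = ["cGag-1", "cGag-2", "cEnv", "cNef"]
clades = ["A1", "B", "C", "D"]

# One vector per region and clade consensus: 4 regions x 4 clades
components = [f"{region}/{clade}" for region, clade in product(regions, clades)]


def relative_dose(epitope, consensus_by_clade):
    """Number of clade consensus versions of a region that contain the epitope.

    Assumes equal amounts of each vector, so the count is proportional to dose.
    """
    return sum(epitope in seq for seq in consensus_by_clade.values())


# Placeholder strings (not HIV-1 sequences) with one variable "wobble" position
toy_region = {"A1": "AAIKK", "B": "AAVKK", "C": "AAIKK", "D": "AAIKK"}

shared = relative_dose("AA", toy_region)        # present in all four versions
variant_b = relative_dose("AVK", toy_region)    # present only in the clade B version
variant_acd = relative_dose("AIK", toy_region)  # present in three versions
```

Under these assumptions, an epitope shared by all four versions is represented four times for each single representation of a variant unique to one clade.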
Although such a polyvalent approach may be cumbersome logistically, there is ample precedent for polyvalent vaccines against other pathogens such as Streptococcus pneumoniae [33] and human papillomavirus [34]. Other groups also have devised differing strategies for including highly conserved epitopes in CTL-based vaccines. Rolland et al. identified stretches of 8 or more amino acids selected to be highly conserved across the entire M-group for use as a peptide-based vaccine [35]. This approach identified 46 peptides (listed as "first tier" peptides in: http://www.mullinslab.microbiol.washington.edu/HIV/Rolland2007/Rolland2007-supplement_files/SuppFigs.html#S1), but most of these do not correspond to known epitopes, and ten consist of eight amino acid sequences (unlikely to contain epitopes, which are usually nine amino acids). Thus, this approach may be too stringent to provide the number of epitopes required to cover a breadth of HLA types, and additionally faces the difficulty that peptide-based vaccines generally have had poor immunogenicity in humans. More similar to our proposed strategy, Letourneau et al. [36] screened for stretches of conserved sequences, identifying 14 candidate proteome regions that are relatively conserved. As a vaccine candidate, they concatenated these into a monovalent linear genetic construct, and alternated clade A-D consensus sequences for each consecutive region in an attempt to minimize clade bias. In contrast to our proposed polyvalent vaccine, this approach may not adequately account for intra- and inter-clade polymorphisms within the conserved sequences, because each stretch represents the sequence of only one clade. Concatenation of the 14 regions also could yield unwanted spurious epitopes at the 13 junctions, although testing of this strategy in mice transgenic for human A*02 suggested that it can be immunogenic for proper A*02-restricted CTL responses.
The utility of our proposed strategy depends on the hypothesis that vaccine delivery of the selected sequences will steer the early CTL response towards highly constrained epitopes that do not allow escape. While the determinants of effective CTL targeting remain hypothetical, increasing data suggest that highly conserved epitopes in highly expressed proteins are advantageous. Across large numbers of persons, targeting of Gag, a highly expressed and relatively conserved protein that should therefore yield epitopes that are highly expressed and conserved compared to epitopes from other proteins on average, has been shown to trend with better immune control [37,38,39,40]. Additional studies on cohorts large enough to power statistical dissections of individual epitope contributions (rather than protein targeting in general) to set-point viremia have begun to reveal epitopes that are associated with better immune containment. Recent data [41] correlated CTL targeting against six epitopes to a beneficial effect in lowering viremia in a cohort of chronically infected persons. Interestingly, when results of the STEP trial were analyzed, vaccine-induced targeting of any of these epitopes also was associated with lower viremia set-point in persons who had subsequent HIV-1 infection. Of note, four of these six epitopes fall in our proposed vaccine regions (which were designed before these associations were reported): two in cGag-2 (KRWIILGLNK, Gag 263-272, and DRFFKTLRA, Gag 298-306) and two in cNef (HTQGYFPDW, Nef 116-124, and LTFGWCFKLV, Nef 137-146). These data suggest that our sequences have been selected based on properties that yield advantageous epitopes (although not all advantageous epitopes necessarily fall in these regions).
Table 5. Known and potential epitopes in the conserved regions.
The optimal vector system and viral sequences to be delivered remain to be determined in HIV-1 vaccine development. Our described sequences are proposed for the latter, given the high degree of sequence conservation and immunogenicity in natural HIV-1 infection. About 70% of persons had detectable responses against at least one of the four regions, suggesting that a sizeable proportion of the general population have HLA types that would allow immunogenicity. A vaccine containing only these regions could have a higher response rate, because it is likely that in natural infection, potential responses against epitopes in these regions are masked by the immunodominance of other epitopes outside these regions, and possibly limited by "original antigenic sin." It is also unclear whether the four chosen regions are equally immunogenic; only about 7% of persons with chronic infection responded against cEnv, while cGag-1, cGag-2, and cNef had similar response rates of about 32 to 35%. This agreed with the smaller number of previously reported epitopes in cEnv compared to the other regions. The number of predicted epitopes in cEnv was similar, however, suggesting that there might be some intrinsic reason for reduced immunogenicity of Env, such as protein trafficking and accessibility to the class I antigen pathway.
In conclusion, we have identified highly conserved regions of the HIV-1 proteome that exhibit few variable amino acids. These regions appear to be reasonably immunogenic in natural infection. Representation of inter-clade variability within these regions allows coverage of much of the intra-clade variability at those positions. These data support the consideration of polyvalent mixtures of these sequences as vaccine inserts to pre-set memory CTL responses against highly conserved epitopes, thereby favorably altering the immunodominance pattern in subsequent natural infection.

Table 6 legend. A panel of 54 persons (from the Los Angeles area) with chronic HIV-1 infection and not receiving antiretroviral treatment was screened for HIV-1-specific CTL responses by IFN-γ ELISpot assays using overlapping peptides spanning each protein. Gag-specific responses were screened using peptides based on clade B consensus and strain DU422 sequences. Env-specific responses were screened using peptides based on clade B consensus or strain MN sequences. Nef-specific responses were screened using peptides based on clade B consensus sequence. The presence or absence of detectable viremia is indicated in the second column; these individuals were biased towards slow progressors due to recruitment selection for being untreated. Recognized peptides that fall entirely within each conserved region are indicated by NIH AIDS Reagent Repository catalog number. The minimal number of epitopes recognized by each person is listed in the last column (assuming that consecutive overlapping peptides contain a single epitope). doi:10.1371/journal.pone.0007388.t006