Dissecting the Dynamics of HIV-1 Protein Sequence Diversity

The rapid mutation of human immunodeficiency virus-type 1 (HIV-1) and the limited characterization of the composition and incidence of the variant population are major obstacles to the development of an effective HIV-1 vaccine. This issue was addressed by a comprehensive analysis of over 58,000 clade B HIV-1 protein sequences reported over at least 26 years. The sequences were aligned and the 2,874 overlapping nonamer amino acid positions of the viral proteome, each a possible core binding domain for human leukocyte antigen molecules and T-cell receptors, were quantitatively analyzed for four patterns of sequence motifs: (1) “index”, the most prevalent sequence; (2) “major” variant, the most common variant sequence; (3) “minor” variants, multiple different sequences, each with an incidence less than that of the major variant; and (4) “unique” variants, each observed only once in the alignment. The collective incidence of the major, minor, and unique variants at each nonamer position represented the total variant population for the position. Positions with more than 50% total variants contained correspondingly reduced incidences of index and major variant sequences and increased minor and unique variants. Highly diverse positions, with 80 to 98% variant nonamer sequences, were present in each protein, including 5% of Gag and 27% each of Env and Nef. The multitude of different variant nonamer sequences (i.e., nonatypes; up to 68%) at the highly diverse positions, represented by the major, multiple minor, and multiple unique variants, likely supports variant function both in immune escape and as altered peptide ligands with deleterious T-cell responses. The patterns of mutational change were consistent with the sequences of individual HXB2 and C1P viruses and can be considered applicable to all HIV-1 viruses. This characterization of HIV-1 protein mutation provides a foundation for the design of peptide-based vaccines and therapeutics.


Introduction
The quasispecies replication of RNA viruses has been recognized for over 30 years following the initial observation of the high proportion of mutants in a growing population of bacteriophage Qβ [1]. HIV-1 is now a classic example of this model of rapid genome evolution [2][3][4][5][6]. As a result of high rates of genetic mutation [7] and recombination [8], cell infection by HIV-1 is followed by immune escape of viruses with related and highly diverse genotypes [9,10]. Even a single founder virus, when introduced into cells, is quickly transformed into a quasispecies assortment of progeny viruses [11][12][13]. Given the vast array of different genotypes generated in every replication cycle, the challenge of designing a vaccine that would prevent the immune escape of the mutant progeny of infected cells is widely recognized [14][15][16][17]. A continuing goal is a greater understanding of HIV-1 diversity and an effective strategy to overcome this diversity. Towards this end, there is a need for more detailed and quantitative analysis of the extent of HIV-1 mutational changes, including the composition and incidence of the different variants of the viral proteome.
Herein, we describe a large-scale analysis of the diversity of HIV-1 sequences reported over at least 26 years. Clade B was selected for the developmental studies as it had the largest number of recorded HIV-1 sequences. Over 58,000 clade B sequences, both partial- and full-length and distributed among the nine proteins, were aligned, with over 1,000 sequences at most nonamer positions (Table S1). As the immune relevance of the sequences was a primary focus of the study, the analysis was conducted with 2,874 nonamer positions, overlapping eight residues (1-9, 2-10, etc.), where each nonamer at the positions is a potential viral human leukocyte antigen (HLA) binding core [18] and T-cell receptor (TCR) ligand [19]. As expected, there were major differences in the fractions of variant nonamer sequences of the proteins, with more conserved positions in Gag and Pol, and greater diversity in Env and Nef. However, despite the differences in the structure and function of the proteins, or the extent of their genomic diversity, the viral proteins shared characteristic patterns of change in the incidence of index, major, minor, and unique nonamer motifs with increased mutation. This quantitative characterization of the mutational changes of the clade B proteome provides data that are applicable to the design of vaccines and therapeutics that are least impacted by the diversity of the viral proteins. HIV-infected individuals, “elite controllers”, who maintain low levels of plasma virus [20][21][22], as well as animal model experiments [23][24][25][26][27] suggest that vaccines that generate CD8+ T cells, even against only a few epitopes, can result in long-term control of virus proliferation. We suggest that an effective immune response to HIV-1 in humans might be possible with a vaccine comprised solely of the limited set of highly conserved protein sequences, as described herein, that may provide memory responses not compromised by immune escape or the presence of competing altered peptide ligands.

Data Preparation, Selection and Alignment of HIV-1 Clade B Proteome
HIV-1 protein sequence records were retrieved from the NCBI Entrez Protein Database [28] in August 2008 by searching the NCBI taxonomy browser for HIV-1 (Taxonomy ID 11676). HIV-1 clade B records were extracted from the collected data by use of BLAST [29][30][31][32] searches (version 2.2.18; parameters: low complexity filter -off, expect -10, descriptions and alignments -100,000), using the HIV-1 clade B protein reference sequences from the Los Alamos HIV sequence database (www.hiv.lanl.gov) as queries (Table S2). The cutoff for the classification of each clade B protein was determined by manual inspection of the BLAST output. Possible misclassification of non-clade B HIV-1 sequences as clade B was not apparent, as the patterns of motif change were similar for each protein, particularly for the Gag, Pol, and Env proteins that form the major basis of phylogenetic relationships [33,34]. Additional phylogenetic study for classification of the sequences as clade B was not practical given the high viral diversity, the large number of sequences analysed, and the difficulty and subjectivity in the interpretation of the phylogenetic tree due to likely low bootstrap values. Duplicate sequences of each protein were removed to minimize bias that may result from collection of redundant protein sequences derived from identical HIV-1 isolates. Highly similar sequences were retained because arbitrary removal of such sequences would introduce additional bias. Further selection of one sequence per subject was deemed unnecessary as it would not give a true representation of HIV-1 diversity as a complex quasispecies of genetically related but distinct sequences in individual subjects. Thus, all nonredundant sequences, both full-length and partial, were aligned and analyzed. Partial sequences were included in the alignment because they provided additional data for the study of diversity. Multiple sequence alignment was difficult for some of the proteins because of the large number of diverse and partial sequences, and thus different approaches were explored. Sequence alignments of Vif, Vpr and Vpu were performed with PROMALS3D [35]. The large sequence datasets of Gag, Pol, Tat, Rev, Env, and Nef were first split into smaller and more manageable groups (about 200-500 sequences per group). These smaller groups were aligned by use of PROMALS3D or CLUSTAL W [35,36] and refined by use of RASCAL [37]. They were then merged into a full protein multiple sequence alignment, with conserved sites as anchors to facilitate reliable alignment of the highly diverse positions (Figure S1). All multiple sequence alignments were manually inspected and corrected for misalignments. Variation in the number of collected sequences between adjacent positions of the alignment resulted from the inclusion of partial sequences. Alignment positions with high fractions of gaps (insertions or deletions), 95% or more, were removed in order to minimize alignment errors. Because of its great diversity, only 9,661 of the 29,211 extracted Env protein sequences were successfully aligned and analyzed. The selection of these Env sequences was semi-selective; however, the large number of sequences aligned and analyzed helps to reduce sampling bias.
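Two of the filtering steps described above, removal of duplicate sequences and removal of alignment columns with 95% or more gaps, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the use of "-" as the gap character, and the toy sequences are assumptions made for illustration only.

```python
from typing import Dict, List

GAP = "-"  # assumed gap character in the alignment


def remove_duplicates(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only the first record for each distinct sequence string."""
    seen = set()
    unique = {}
    for name, seq in records.items():
        if seq not in seen:
            seen.add(seq)
            unique[name] = seq
    return unique


def drop_gappy_columns(alignment: List[str],
                       max_gap_fraction: float = 0.95) -> List[str]:
    """Remove alignment columns whose gap fraction is >= max_gap_fraction."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    keep = [
        col for col in range(n_cols)
        if sum(row[col] == GAP for row in alignment) / n_rows < max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


if __name__ == "__main__":
    # Hypothetical toy records; "b" duplicates "a" and is dropped.
    print(remove_duplicates({"a": "MGAR", "b": "MGAR", "c": "MGAK"}))
    # Column 2 is all gaps and is removed; column 4 (50% gaps) is kept.
    print(drop_gappy_columns(["M-A-", "M-AG", "M-A-", "M-AG"]))
```

Note that highly similar but non-identical sequences pass through `remove_duplicates` unchanged, in keeping with the decision above to retain them.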
A caveat is the unknown history of the many recorded viral sequences; for example, the source of the virus, infection history, possible treatment, or details of cultivation. In addition, the quality of the sequences from the database is unproven. Nevertheless, it is unlikely that analysis of another population of HIV-1 would differ significantly from the data herein. This assertion is supported by the sequences of individual infectious clade B viruses, HXB2 [38] and C1P [39], that have nonamer position patterns of conservation and variability consistent with those of the historically recorded viruses.

Nonamer Sequence Analysis Approach
HIV-1 clade B sequence alignment diversity was based on a sliding window approach of size nine, overlapping eight amino acids (1-9, 2-10, 3-11, etc.), that captured the entire nonamer repertoire of the proteome [40,41]. Nonamer sequences represent the possible core HLA and TCR binding domains, and provide a defined examination of the protein diversity in relation to cellular immunity. It can be assumed that each nonamer is a potential T-cell epitope because of the large array of HLAs with different binding specificities in the human population (HLA Informatics Group [http://www.anthonynolan.org.uk/HIG]) and given that an average of 0.1-5% of overlapping nonamers spanning a protein will bind to a particular HLA molecule [42]. Moreover, in this context, the potential immune relevance of a mutation is greatly expanded, whereby a single amino acid substitution affects nine overlapping nonamers spanning a region of 17 amino acids, except for mutations occurring in the first and last eight amino acids of the protein termini (Figure S2). These features of immune-relevant diversity would not be captured by variability analysis of sequence alignments with a sliding window of size one (single amino acid approach). Further, mutational entropy is completely different between these two approaches, with a maximum entropy of 39 for the nonamer approach versus 4.2 for the single amino acid approach.
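The sliding window and the footprint of a single substitution can be sketched in a few lines of Python. The function names are hypothetical, and the entropy ceilings are computed under the assumption of 20 equiprobable amino acids with base-2 logarithms (about 38.9 for nonamers and about 4.3 for single residues); the values quoted in the text may have been derived slightly differently.

```python
import math

WINDOW = 9  # nonamer window size


def nonamer_positions(sequence: str, window: int = WINDOW):
    """Return (start, end, peptide) tuples with 1-based inclusive coordinates."""
    return [
        (i + 1, i + window, sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


def windows_covering(residue: int, length: int, window: int = WINDOW):
    """1-based start positions of every window that contains `residue`."""
    first = max(1, residue - window + 1)
    last = min(residue, length - window + 1)
    return list(range(first, last + 1))


# Assumed theoretical maxima: 20 equiprobable residues, log base 2.
MAX_ENTROPY_NONAMER = math.log2(20 ** 9)
MAX_ENTROPY_SINGLE = math.log2(20)

if __name__ == "__main__":
    # A substitution at residue 10 of a 30-residue protein falls in nine
    # windows (starts 2..10), which together span residues 2..18.
    starts = windows_covering(10, 30)
    print(len(starts), starts[0], starts[-1] + WINDOW - 1)
    # Near a terminus fewer windows are affected.
    print(len(windows_covering(3, 30)))
```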

Shannon's Nonamer Entropy and Quantitative Analyses of Diversity Motifs
Shannon's nonamer entropy [31,43,44] was used as a general measure of HIV-1 protein sequence diversity. Each nonamer position in the protein alignments was quantitatively analysed for the incidence (% occurrence) of the different sequences present at the position, with arbitrary sorting for equal incidences (Figure 1). The first ranked sequence of each position was defined as the index, while the remaining ranked sequences with at least one amino acid difference from the index were defined as variants and organized as three motifs: (1) “major” variant, the most prevalent individual variant of the index; (2) “minor” variants, multiple different sequences, each present in two or more of the aligned viral sequences, and each with an incidence less than or equal (occasionally) to that of the major variant; and (3) “unique” variants, each unique to a single viral isolate. The combined incidence of the major, minor and unique variants at a nonamer position was inversely related to the corresponding incidence of the index sequence and represented the total variant population at the position. The incidence of distinct variant nonamer sequences at a given nonamer position (nonatypes) was also determined. Further, the correspondence of the index sequences to the sequences of the original clade B isolate, HXB2 [38], and an arbitrarily selected recent strain, C1P [39], was evaluated. HXB2 is considered the clade B prototype sequence [45], while C1P is an infectious virus cloned full-length from residual viral RNAs recovered from the plasma of a HAART-treated patient and sequenced by modern methodologies.
Table 1 footnotes. a Amino acid number at the start and end of the nonamer position in the protein alignment. The symbols & and + denote the mixed-variable and highly diverse nonamer positions, respectively (see Figure 3 for definitions). b Total number of protein sequences analysed at the respective nonamer position; the differences between nonamer positions were due to the inclusion of both partial and full-length sequences in the alignments. c Shannon nonamer entropy (H(x); see Figure 2 for details). d The index nonamer is the most prevalent sequence at the given aligned nonamer position. e Variants differ by one or more amino acids from the index sequence. f The major variant is the most common variant sequence at the position. g Minor variants are multiple different repeated sequences, each occurring more than once and with an incidence less than or occasionally equal to the major. h Unique variants are those that occur only once in the alignment. i Nonatypes are the distinct sequences among the variants. j HXB2 nonamer sequence and its incidence at the given nonamer position; amino acids identical to the index are denoted as “.”. k C1P nonamer sequence and its incidence at the given nonamer position; amino acids identical to the index are denoted as “.”. HXB2 and C1P nonamers not found at the aligned nonamer position are indicated with 0% incidence. doi:10.1371/journal.pone.0059994.t001
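The motif definitions above translate directly into a counting procedure. The following Python sketch is one possible reading, not the authors' code: the function name is hypothetical, a top-ranked variant seen only once is treated as unique rather than major, and nonatypes are expressed as the number of distinct variant sequences divided by all sequences at the position. Both of these choices are assumptions.

```python
from collections import Counter
from typing import Dict, List


def classify_motifs(nonamers: List[str]) -> Dict[str, float]:
    """Percent incidence of each motif at one aligned nonamer position."""
    total = len(nonamers)
    if total == 0:
        raise ValueError("no sequences at this position")
    # most_common() keeps first-seen order for ties, i.e. arbitrary sorting.
    ranked = Counter(nonamers).most_common()
    _, index_n = ranked[0]
    variants = ranked[1:]

    # Assumption: a "major" variant must occur more than once.
    major_n = 0
    rest = variants
    if variants and variants[0][1] > 1:
        major_n = variants[0][1]
        rest = variants[1:]

    minor_n = sum(n for _, n in rest if n > 1)
    unique_n = sum(n for _, n in rest if n == 1)

    def pct(n: int) -> float:
        return 100.0 * n / total

    return {
        "index": pct(index_n),
        "major": pct(major_n),
        "minor": pct(minor_n),
        "unique": pct(unique_n),
        "total_variants": pct(total - index_n),
        # Assumption: distinct variant sequences over all sequences.
        "nonatypes": pct(len(variants)),
    }


if __name__ == "__main__":
    # Hypothetical position with 20 sequences.
    position = (["AAAAAAAAA"] * 10 + ["AAAAAAAAC"] * 4 +
                ["AAAAAAAAD"] * 2 + ["AAAAAAAAE"] * 2 +
                ["AAAAAAAAF", "AAAAAAAAG"])
    print(classify_motifs(position))
```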
Variant nonamers that contained gaps (-) or any one of the unresolved characters, including B (asparagine or aspartic acid), J (leucine or isoleucine), X (unspecified or unknown amino acid), and Z (glutamine or glutamic acid), were excluded from the quantification. All data analyses were performed with the 2,874 aligned nonamer positions of the proteome that had more than 100 sequences (out of 3,133 total aligned proteome nonamer positions). However, data for positions with fewer than 100 sequences are shown in some figures and tables to maintain the protein sequence structure.
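The exclusion rule and the sequence-count threshold can be expressed as a small filter. This is a sketch under stated assumptions: the helper names are hypothetical, and whether the 100-sequence threshold was applied before or after removing nonamers with gaps or unresolved characters is not stated in the text, so the two checks are kept separate.

```python
from typing import List

EXCLUDED_CHARACTERS = set("-BJXZ")  # gap plus the unresolved residue codes
MIN_SEQUENCES = 100                 # positions need more than this many


def is_quantifiable(nonamer: str) -> bool:
    """True when the nonamer has no gap or unresolved character."""
    return not (set(nonamer.upper()) & EXCLUDED_CHARACTERS)


def filter_position(nonamers: List[str]) -> List[str]:
    """Drop nonamers that cannot be quantified."""
    return [n for n in nonamers if is_quantifiable(n)]


def has_enough_data(nonamers: List[str],
                    min_sequences: int = MIN_SEQUENCES) -> bool:
    """True when the position has more than `min_sequences` sequences."""
    return len(nonamers) > min_sequences


if __name__ == "__main__":
    # Hypothetical nonamers: the second has a gap, the third an X.
    print(filter_position(["ACDEFGHIK", "ACD-FGHIK", "ACDEFGHXK"]))
```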

Shannon Entropy Overview of HIV-1 Clade B Proteome Sequence Diversity
Shannon entropy [31,43,44] of the aligned nonamer positions was used as a measure of overall sequence diversity of the HIV-1 clade B proteome (Figure 2A). The remarkable complexity of the viral protein sequence structure was evident from the extensive presence of high entropy nonamer positions, with a large fraction of variant sequences in each of the HIV-1 proteins, including Gag and Pol, the most conserved. The maximum evolutionary entropy (9.2) of HIV-1 nonamer sequences far exceeded that of ~6 for avian influenza [32], ~4 for dengue [31], and ~2 for West Nile virus [30]. Each of the HIV-1 proteins was found to contain positions with greater than 50% variant sequences, and some with variants as high as 98% of the aligned nonamer sequences. A severe discordance in the relationship between entropy and the incidence of variants of the nonamer positions was also evident (Figure 2B). While high entropy is a general indication of high diversity, several positions of lower entropy (<2.0) contained high incidences of variants, some comparable to those of higher entropy regions (>4.0). In general, only positions with a variant incidence of less than 20% had a relatively linear relationship with entropy. In contrast, positions with more than 20% variants displayed a wide range and non-linear rise in entropy that increased greatly at positions with an incidence of variants ≥80%. The wide range of entropy can be attributed to the composition and incidence of the variant motifs, which comprised many different sequences.
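The discordance between entropy and variant incidence follows from the entropy formula itself, which depends on how the variant fraction is divided among distinct sequences. A minimal Python sketch, assuming the standard Shannon formula with base-2 logarithms (the details of Figure 2 are not reproduced here) and using short placeholder labels instead of real nonamers, compares two hypothetical positions with the same total variant incidence:

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(nonamers: List[str]) -> float:
    """H = -sum(p * log2(p)) over the distinct sequences at a position."""
    total = len(nonamers)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(nonamers).values()
    )


def variant_incidence(nonamers: List[str]) -> float:
    """Percent of sequences that differ from the most prevalent one."""
    index_n = Counter(nonamers).most_common(1)[0][1]
    return 100.0 * (len(nonamers) - index_n) / len(nonamers)


# Hypothetical position with two repeated variants besides the index.
few_variants = ["INDEX"] * 40 + ["VAR_A"] * 30 + ["VAR_B"] * 30
# Hypothetical position where every variant is unique.
many_variants = ["INDEX"] * 40 + [f"UNIQ_{i}" for i in range(60)]

if __name__ == "__main__":
    for name, pos in [("few", few_variants), ("many", many_variants)]:
        print(name, variant_incidence(pos), round(shannon_entropy(pos), 2))
```

Both positions carry 60% variants, yet the first has an entropy below 2 and the second above 4, because the second distributes its variants over many distinct sequences.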

Dissecting HIV-1 Clade B Proteome Sequence Diversity: Scope of the Analysis
Dissecting protein sequence diversity of HIV-1 clade B included quantitative analyses of diversity motifs of each of the 2,874 aligned nonamer positions of the proteome (Table S3). An example of these data is shown with 27 aligned overlapping nonamer positions of Env 114-148 (Table 1). Each of the 27 aligned nonamer positions contained about 1,000 to 3,900 sequences. The first five positions (Env 114-122 to 118-126) were relatively conserved, with entropies of about 1.0 and index nonamers with incidences of 85% to 89%. These conserved sequences were identical to the corresponding sequences of HXB2 and C1P. The remaining small fraction (11 to 15%) of the aligned sequences at these conserved positions were variants of the index nonamers, represented by about equal fractions of the major and minor variants (5-7%) and a minimal fraction, ~1 to ~2%, of unique sequences. The incidence of the different variant nonamer sequences (nonatypes) at these positions was about 2%, represented by the single major variant, different minor variants, and unique variants. In contrast, the remaining nonamer positions, Env 119-127 to 140-148, included positions of high diversity, with entropy as high as about 9.2, as few as 2% index sequences, 98% as variants of the index, and nonatypes as high as 36%, based almost totally on different minor and unique variants. As an indication of the high variability of these positions in the HIV-1 population, none of the approximately 2,000 to 4,000 sequences of the aligned population contained nonamers that were identical to those of C1P, and only a small fraction (1% or less) corresponded to HXB2 nonamers.
Figure 3 legend (partial). The nonamer positions of the proteome and the individual proteins were defined as highly conserved (black, index incidence ≥90%), mixed-variable (white, index incidence <90% and >20%), and highly diverse (grey, index incidence ≤20%). Highly conserved positions were only 9% of total proteome nonamer positions, ranging from 0% for Vpr and Tat to 19% for Pol. Highly diverse positions comprised 14% of the proteome, ranging from 1% for Pol to 27% each for Env and Nef. Mixed-variable positions comprised 77% of the proteome, ranging from 69% for Env to 88% for Vpr. doi:10.1371/journal.pone.0059994.g003

Distribution of Conserved and Variable Nonamer Positions in HIV-1 Clade B Proteome
Nearly all nonamer positions of the aligned viral proteins contained variants with one or more mutations. Only two Pol positions with more than 100 sequences in the alignment (Pol 956-964, 957-965) were evolutionarily completely conserved among the recorded sequences that were analyzed in this study. Highly conserved positions with index sequences in 90% or more of the aligned viruses represented only 9% of the proteome, mostly in Pol, Gag, and even Env (Figure 3). The remaining small fraction of variant nonamers, 10% or less, of these highly conserved positions generally contained one amino acid mismatch to the index (median average; data not shown). Highly diverse positions, with 80% or more total variants, represented 14% of the proteome and 23 to 27% of the Env, Nef, Vpu and Rev sequences. The variant sequences of these highly diverse positions differed greatly from the index sequences, many with multiple amino acid mismatches (median average of 3), contributed largely by the unique variants. Overall, a large fraction (91%) of the proteome nonamer positions contained more than 10% variants of the index sequence (i.e., mixed-variable and highly diverse nonamer positions).
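The three position classes can be assigned from the index incidence alone. The sketch below assumes the thresholds as read from the Figure 3 legend (index incidence of at least 90% for highly conserved, at most 20% for highly diverse, mixed-variable otherwise); the function names and the sample incidences are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def classify_position(index_incidence: float) -> str:
    """Assign a conservation class from the index incidence in percent."""
    if index_incidence >= 90.0:
        return "highly conserved"
    if index_incidence <= 20.0:
        return "highly diverse"
    return "mixed-variable"


def class_fractions(index_incidences: List[float]) -> Dict[str, float]:
    """Percent of positions falling in each class."""
    counts = Counter(classify_position(v) for v in index_incidences)
    total = len(index_incidences)
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical index incidences for four nonamer positions.
    print(class_fractions([95.0, 50.0, 50.0, 10.0]))
```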

Dynamics of Variant Motifs with Increased Mutation
A singular finding was the distinctive dynamics of the individual variant motifs of the HIV-1 clade B proteome in relation to increased mutations (Figure 4A). The extreme range in the conservation decrease represented by incidences of index sequences, 100% to 2%, was inversely matched by the increase of total variant incidences, 0% to 98%. Although these inversely related incidences of the index and total variants appear to be in a continuum, drawn as a line (Figure 4A), the data depicted were the actual values of the 2,874 nonamer positions examined. Each of the three variant motifs exhibited a characteristic distribution of incidence change with increased total variants. The major variant had a distinctive pyramidal incidence pattern based on the fact that, by definition, the major variant could not exceed the index sequence. Thus, at nonamer positions with greater than 50% total variants, there was a corresponding reduced incidence of both the index sequence and the major variant. However, although the mean incidence of the major variant was only about 10%, there were many positions with over 50% total variants that contained almost equal incidences of the index and major variant. It is apparent that index epitopes of highly diverse proteomic positions have little quantitative dominance as an immune target over the major variant. The minor variants, each with an individual incidence less than or occasionally equal to that of the major variant, were collectively the predominant variant motif for the majority of positions, particularly those with greater than 50% total variants. It is evident that the minor variants are a highly dynamic population resulting from fitness selection in the continued mutation of the index sequences and major variants. The multiple different sequences of the minor variants present an additional great variety of potential epitope sequences as immune escape variants or altered peptide ligands. The remaining population of variant nonamer sequences comprised unique variants, each representing a single viral genotype within the recorded population. Unique variants were distinctive for their presence at almost every nonamer position, including the highly conserved, but with low incidences (average of 3%), despite the increase of other variant sequences. Notably, at positions of greater than about 60% total variants, the incidences of unique variants for some positions were dramatically increased in each of the proteins, particularly in Nef, with a maximum incidence of 53%. The increased observation of unique sequences represented positions of hyper-variability, with index incidence of less than or equal to 10% and the remaining 90% or more variants as mostly different minor and unique nonamer sequences. The presence of these unique variants in every protein and at nearly all nonamer positions (Table S3), even those highly conserved, indicates that they are not artifacts of sequencing or the result of truncated partial sequences (e.g., due to a premature stop codon). Nonatypes, a measure of distinct variant sequences (Figure 4A), comprised a large fraction (up to 68%) of the highly diverse positions, composed largely of different minor and unique variants.
The incidence distributions of the index and variant motifs for the proteome are depicted by use of violin plots (Figure 4B). Overall, the index nonamer was the principal motif, but with a mean incidence of only about 52%. The predominant mutation of the index was the minor variant motif, with an incidence distribution inversely related to that of the index. The majority of the major and unique variants were of low incidence, as also shown in Figure 4A, but with a fraction of high incidence unique variants, particularly at positions of ≥80% total variant incidence (i.e., highly diverse positions). The nonatypes, composed of the single major variant, the different minor variants, and each of the singular unique variants at a nonamer position, were also chiefly of low incidence, except at highly diverse positions due to the great increase in unique variant incidences. The increased incidence of different variants at highly diverse positions provides an extreme repertoire of possible immune escape variants for a quasispecies population.
An additional salient finding was that the dynamics in the changes of the incidence patterns of the variant motifs were repeated for each of the individual proteins, as most evident with the large number of nonamer positions of Gag, Pol, Env and Nef (Figure 5). The only distinct differences between the individual proteins were the fraction of nonamer positions with low incidences of variants (Gag and Pol), compared to those with high incidences of variants (Env and Nef) (Figure 6). Thus, despite the differences in the structures and functions of the proteins, the motifs and incidences of the variants of the index sequences were basically the same for each protein, with similar dynamics in the patterns of change with increased mutation.

Correspondence of the Index Sequences to those of HXB2 and C1P Proteins
The correspondence of HIV-1 clade B index nonamer sequences herein defined, concatenated as full-length proteins, was compared to Nef (Figure 7) and the other proteins (Figure S3) of the individual HXB2 and C1P viruses. There was a high single amino acid identity (>82%) between the historical index and the individual sequences of these viruses (Table S4A), with differences that included a small fraction of indels (insertions/deletions), besides mutations (Figure 7). In contrast, the nonamer sequence identity was much less than that of single amino acids, with only 61% of the HXB2 and 49% of the C1P nonamers identical to the index sequences (Table S4B). As expected, the match between the index and the individual HXB2 and C1P sequences occurred mainly at the highly conserved nonamer positions, and almost none at the highly diverse positions (Table S3), highlighting the consistency of sequence changes in the viral population. Based on this, approximately half of the overlapping nonamer sequences of a given HIV-1 strain can be expected to differ from the historical index sequences of this study.
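The gap between single amino acid identity and nonamer identity is a direct consequence of the overlapping windows, as a short Python sketch shows. It assumes two gap-free sequences of equal length (indels, which the comparison above did include, are not handled), and the function names and toy sequences are hypothetical.

```python
def residue_identity(a: str, b: str) -> float:
    """Percent of aligned residues that are identical."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def nonamer_identity(a: str, b: str, window: int = 9) -> float:
    """Percent of overlapping windows that are identical in both sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    n_windows = len(a) - window + 1
    matches = sum(a[i:i + window] == b[i:i + window] for i in range(n_windows))
    return 100.0 * matches / n_windows


if __name__ == "__main__":
    reference = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical 20-residue sequence
    strain = "ACDEFGHIKAMNPQRSTVWY"     # one substitution at residue 10
    print(residue_identity(reference, strain))
    print(nonamer_identity(reference, strain))
```

In this toy case one substitution leaves 19 of 20 residues identical but only 3 of the 12 nonamers, since nine windows contain the changed residue.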

Index Switching
Index switching, another relevant finding, results from similar incidences of the index sequences and major variants at some mixed-variable and several highly diverse nonamer positions, where the incidences of the major variants (average of 8%) were almost indistinguishable from those of the index sequences (average of 12%). Consequently, there were a significant number of nonamer positions in the proteome where a fitness increase of one or more amino acid mutations was sufficient to change the incidence rank of a given variant nonatype to that of the index, as an alternative to the index sequence expected from the preceding position (Figure 8). Index-switching positions were readily revealed by amino acid mutations that did not follow the consecutive concatenation of the index sequences, as shown by the red labeled residues in Figures 7 and S3. Index switching may also occur with the other variant motifs and likely represents a subset of a larger phenomenon, motif switching, that may be common in a quasispecies population, where the members of the motifs dynamically swap ranking as mutations accumulate. This is particularly likely at the hyper-variable nonamer positions, where the difference in the incidences of the different nonatypes is almost negligible.
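Because consecutive index nonamers overlap by eight residues, a switch shows up as a break in that overlap when the index sequences are concatenated. The following Python sketch illustrates the idea; it is not the procedure used for Figures 7, 8 and S3, and the function name and toy index nonamers are hypothetical.

```python
from typing import List, Tuple


def concatenate_index(index_nonamers: List[str]) -> Tuple[str, List[int]]:
    """Chain consecutive index nonamers and report overlap breaks.

    Returns the concatenated sequence (first nonamer plus the last residue
    of each following nonamer) and the 0-based list indices where a nonamer
    does not share its first eight residues with the last eight residues of
    the preceding nonamer.
    """
    sequence = index_nonamers[0]
    breaks = []
    for i in range(1, len(index_nonamers)):
        previous, current = index_nonamers[i - 1], index_nonamers[i]
        if previous[1:] != current[:-1]:
            breaks.append(i)
        sequence += current[-1]
    return sequence, breaks


if __name__ == "__main__":
    # Hypothetical index nonamers; the third carries R instead of K.
    indexes = ["ACDEFGHIK", "CDEFGHIKL", "DEFGHIRLM", "EFGHIKLMN", "FGHIKLMNP"]
    print(concatenate_index(indexes))
```

In this toy example the chain breaks on entering and on leaving the switched position, so both transitions are reported.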

Discussion
This study has elucidated the HIV-1 variant nonamer sequence structure and incidence with increased mutation of the clade B virus proteome. The virus samples, collected over at least 26 years (1983 to 2008, based on records with available isolation date), likely included many differences in mode of HIV transmission, stage of infection, treatment, co-infection, etc. The results thus provide a compendium of the possible spectrum of sequence variants of HIV-1 under many different conditions and a model of the collective evolution of multiple HIV-1 populations. It was apparent that virtually all nonamer positions were capable of generating multiple variants of index sequences for the cooperative fitness of protein structure; only two Pol positions were completely conserved. The remarkable plasticity of the virus was demonstrated both by incidences of the three variant motifs at the different nonamer positions, and by the presence of numerous variant nonatypes (distinct sequences), especially at the highly diverse positions. Such regions of high diversity were present in each protein, Gag and Pol included, but most commonly in Rev, Vpu, Env, and Nef, where 23 to 27% of the aligned nonamer positions contained 80% or more of sequences that were variants of the index sequence. The incidence of nonatype sequences at the highly diverse positions was remarkable; for example, about 70% in Nef. Importantly, the reliability of this collective evolutionary diversity was supported by the consistency with the individual patterns of mutational change of HXB2 and C1P viruses.
Despite the many differences in structure and function of the HIV-1 proteins, the change of incidence of the major, minor and unique variants with increased sequence change was essentially the same for each protein, differing only in the relative fractions of conserved and diverse sequences. An interpretation of these data is that the three variant motifs represent inherent patterns in the organization of a vast number of variable protein sequences that facilitate HIV-1 fitness selection. By definition, as the fraction of total variants of the nonamer positions increased, the minor variants became the predominant form, particularly as they replaced both the index and corresponding major variant at positions with more than 50% total variants. The large fraction of nonatypes at highly diverse positions is in keeping with the quasispecies model, in which progeny viruses contain virtually any sequence change consistent with the cooperative fitness of protein sequences [2]. This pattern of sequence change was hypothesized by Eigen [46] to have resulted from a combination of Darwin's selective evolution of the dominant replicator and the presence of a “clan” of variant molecules that also have maximum reproductive fitness. The extreme sequence plasticity could provide a spectrum of viral variants that allow the virus population to efficiently exploit changes in selection pressure and facilitate the long-term evolutionary stability and versatility of the virus [46].
The extensive repertoire of variants in the mixed-variable and highly diverse regions of the virus proteome likely supports a combination of mechanisms for loss of host immune control of the virus. In addition to escape from T-cell immunity, the multitude of mutated sequences likely contributes to the general collapse of the early immunity to the virus by altered peptide ligand inhibition of T-cell responses and T-cell exhaustion [47][48][49][50][51][52][53][54][55][56]. Index switching may be an important mechanism that facilitates this loss of immune control. A small incidence increase of a particular amino acid mutation in the mixed-variable and the highly diverse positions can result in a variant nonatype replacing the prevalent, index sequence of a quasispecies population.
Several methodologies for vaccine development that were designed to overcome HIV-1 sequence diversity have been reported, the majority as strategies based upon the use of centralized sequences to limit the differences between viral strains and the vaccine [15]. The design of these vaccines has generally been based on phylogenetic relationship or the consensus, most common sequence, of the viral population [57][58][59][60][61]. However, a limitation of these approaches as a vaccine strategy is that, as shown herein, the centralized or consensus sequence is not an indicator of conservation; the average incidence of the prevalent, index sequence was about 50%, and at the highly diverse positions was as low as 2%. A recent modification of the centralized approach was to prepare sets of 'mosaic' proteins, assembled from fragments of natural sequences that are compressed into a small number of native-like proteins [62][63][64]. However, the full-length mosaic vaccines do not overcome the low incidence (average 12%) of the index sequences at the highly diverse positions and have the possibility of unnatural nonamer sequences as epitopes at the junctions of the mosaic fragment. A reported alternative was the use of concatenated mosaic sequences selected from the conserved regions of Gag, Pol and Env [64]; however, these also contained highly diverse sequences, including a sequence of Pol that matched a minor variant of incidence 3% in our dataset. Further, these mosaic constructs are also subject to unnatural junctional epitopes.
Figure 8 legend (partial). [...] 310 to 318, is shown with the index sequence (41%) containing aa E at the 9th position. The major variant (38%) contains the variable aa D at the corresponding position. A minor variant (11%) has a mutation at aa position 7 relative to the index. A total of 37 nonatypes had individual incidences less than or equal to 10%. The dominance of aa E in the index is maintained for the next few nonamer positions. At position 314 to 322, index switching is observed, where the sequence with aa D is now the index (41%) and the one with E is the major variant (35%). This relationship continued till position 316-324, with reversal of sequence ranks to the original state at position 317-325, where the sequence with aa E is the index. At an index switching position, the index is alternative to that expected, relative to the preceding position. C. Concatenated index, formed by linking the overlapping index sequences of the nine nonamer positions. At position 318 (coloured), the aa D, which did not follow the concatenation of the index, is shown below the sequence. doi:10.1371/journal.pone.0059994.g008
Given that all nonamer positions are capable of generating variants, barring the two completely conserved Pol positions, our rationale for a new vaccine strategy is based on the selection of peptide sequences conserved in over 90% of the recorded virus population as a means of establishing selective T-cell-based immunity. The goal is the selective development of immune responses that are restricted to highly conserved sequences, with a greatly reduced incidence of mutations that could function as altered peptide ligands. These conserved T-cell epitopes would not be subject to competition in TCR activation by a large or even greater fraction of variant sequences that could function in promiscuous HLA binding but act as antagonists to the TCR. The importance of T-cell responses in the control of HIV-1 infection has been revealed by findings on the role of T cells in the reduced viremia of elite controllers [20][21][22]. In addition, rhesus macaque animal model experiments [23][24][25][26][27] suggested that CD8+ T cells, even against only a few epitopes, have the potential to limit virus replication following infection and to produce long-term immune control under conditions that limit HIV-1 immune escape. The immunity provided by peptide-based vaccines composed solely of the limited set of highly conserved index sequences, many of which are immunogenic as observed by matches to reported T-cell epitopes (data not shown), may be considered a possible strategy for the development of an HIV-1 vaccine.
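The selection criterion described above, index sequences conserved in over 90% of the recorded population, can be expressed as a simple filter over the sliding nonamer windows of an alignment. The sketch below is illustrative only and assumes pre-aligned sequences of equal length with no special handling of gaps; the function name `conserved_positions`, its parameters, and the toy sequences are hypothetical and are not taken from the study.

```python
from collections import Counter

def conserved_positions(seqs, k=9, threshold=0.90):
    """Return (start, index k-mer, incidence) for every window whose index
    sequence has an incidence strictly above `threshold`."""
    selected = []
    for start in range(len(seqs[0]) - k + 1):
        counts = Counter(s[start:start + k] for s in seqs)
        kmer, n = counts.most_common(1)[0]
        incidence = n / len(seqs)
        if incidence > threshold:
            selected.append((start, kmer, incidence))
    return selected

# Hypothetical alignment of 20 sequences, 10 residues long (two nonamer windows).
# The last residue varies in 3 of the 20 sequences.
toy = ["MKVLAAGIVG"] * 17 + ["MKVLAAGIVA"] * 3

candidates = conserved_positions(toy)
```

Here the first window is fully conserved and is retained, whereas the second window has an index incidence of 85% and is excluded by the 90% threshold.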

Figure 1. Definitions of HIV-1 clade B nonamer sequence motifs. The different sequence motifs of the aligned HIV-1 clade B isolates were identified as shown with 20 sequences of a model nonamer position. The "Index" nonamer is the most prevalent sequence, present in 8 of the 20 isolates. The "Major" variant is the most common variant of the index (5/20). "Minor" variants are multiple different repeated sequences, each with an incidence less than, or occasionally equal to, that of the major variant. "Unique" variants are those represented by a single aligned sequence. "Nonatypes" (boxed) are the distinct variant sequences at a given nonamer position; in this example, one major, two minor, and three unique. doi:10.1371/journal.pone.0059994.g001
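The motif definitions in Figure 1 can be reproduced for a single nonamer position with a short counting routine. The sketch below is a simplified illustration rather than the method used in this study: the nonamer strings are hypothetical, the function name `classify_motifs` is invented for this example, and the handling of ties (the first-ranked variant is taken as the major variant) is an assumption.

```python
from collections import Counter

def classify_motifs(nonamers):
    """Summarise motif incidences (%) at one aligned nonamer position."""
    total = len(nonamers)
    ranked = Counter(nonamers).most_common()  # sorted by count, highest first
    index_seq, index_n = ranked[0]
    variants = ranked[1:]                     # every distinct non-index sequence
    major_n = variants[0][1] if variants else 0
    # Simplification: after the top-ranked variant, repeated sequences are
    # counted as "minor" and single occurrences as "unique".
    minor_n = sum(n for _, n in variants[1:] if n > 1)
    unique_n = sum(n for _, n in variants[1:] if n == 1)

    def pct(n):
        return 100 * n / total

    return {
        "index": index_seq,
        "index_pct": pct(index_n),
        "major_pct": pct(major_n),
        "minor_pct": pct(minor_n),
        "unique_pct": pct(unique_n),
        "total_variants_pct": pct(total - index_n),
        "nonatypes": len(variants),           # number of distinct variants
    }

# Hypothetical model position with 20 sequences, mirroring the counts in Figure 1:
# index 8, major 5, two minor variants (2 each), three unique variants.
position = (["SLYNTVATL"] * 8 + ["SLFNTVATL"] * 5 +
            ["SLYNTIATL"] * 2 + ["SLYNAVATL"] * 2 +
            ["SLYNTVAVL", "SIYNTVATL", "SLYNTVATM"])

summary = classify_motifs(position)
```

With these counts the index accounts for 40% of the sequences and the total variants for 60%, distributed over six nonatypes.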

Figure 2. HIV-1 clade B proteome nonamer sequence entropy and total variants. A. Entropy (in black) and incidence of the total variants of the index (in red) were measured for each aligned nonamer (nine amino acids) position (1-9, 2-10, etc.) of the proteins. The entropy values indicate the level of variability at the corresponding nonamer positions, with zero representing completely conserved sites (0% total variants incidence) and a maximum of about 9 at the extremely variant sites (~98% total variants incidence). B. Relationship of entropy and the incidence of total variants for the proteome nonamer positions. doi:10.1371/journal.pone.0059994.g002
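The two quantities plotted in Figure 2 can be sketched for a single nonamer position as follows. This is a minimal illustration under stated assumptions: Shannon entropy is computed with base-2 logarithms and without any correction for sample size, which may differ from the procedure actually used for the figure; the function names and the toy nonamers are hypothetical.

```python
import math
from collections import Counter

def nonamer_entropy(nonamers):
    """Shannon entropy (bits, assumed base 2) of the sequences at one position."""
    total = len(nonamers)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(nonamers).values())

def total_variants_pct(nonamers):
    """Percentage of sequences that differ from the most prevalent (index) one."""
    total = len(nonamers)
    index_n = Counter(nonamers).most_common(1)[0][1]
    return 100 * (total - index_n) / total

# Hypothetical positions: one completely conserved, one with four equally
# frequent nonamers.
conserved = ["GPGHKARVL"] * 8
diverse = ["GPGHKARVL", "GPSHKARVL", "GPGHKARIL", "GAGHKARVL"] * 2

h_conserved = nonamer_entropy(conserved)
h_diverse = nonamer_entropy(diverse)
v_diverse = total_variants_pct(diverse)
```

Under the base-2 assumption, a completely conserved position gives an entropy of zero, and four equally frequent nonamers give 2 bits with 75% total variants; by the same arithmetic, an entropy near 9 would correspond to roughly 512 equally frequent sequences.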

Figure 4. Dynamics of diversity motifs of HIV-1 clade B proteome. A. Motif incidence in relation to total variants incidence: index sequence (orange), total variants (black), major variant (blue), minor variants (pink), unique variants (green), and nonatypes (yellow). B. Violin plot of the frequency distribution of the indicated proteome sequence motifs. The width of the plot (x-axis) represents the frequency distribution of a given incidence of the indicated motif. "x" represents the arithmetic mean incidence value. doi:10.1371/journal.pone.0059994.g004

Figure 5. Dynamics of diversity motifs of HIV-1 clade B proteins. The color key for each sequence motif is described in Figure 4. doi:10.1371/journal.pone.0059994.g005

Figure 6. Frequency distribution violin plots of the diversity motif incidences of HIV-1 clade B proteins. The legend for the violin plot is described in Figure 4. doi:10.1371/journal.pone.0059994.g006

Figure 7. Comparison of HIV-1 clade B Nef concatenated index sequence with the HXB2 and C1P sequences. The numbers before and after the concatenated index sequence represent amino acid positions of the comparison; the comparison is shown in blocks of 60 amino acids. Identity of the index amino acids with those of HXB2 (green) and/or C1P (blue) is represented by "."; those that differ are shown by the respective amino acid. Amino acid mutations of the aligned viruses that did not follow the concatenation of the index are shown in red. The corresponding amino acids of HXB2 and C1P sequences are also shown at these positions, without representing identical residues by ".". The green dashes represent amino acid deletions in HXB2. doi:10.1371/journal.pone.0059994.g007

Figure 8. Index switching. A. Gag protein alignment region of amino acid (aa) positions 310 to 326 is shown for the first and last five sequences of the aligned viruses. The aa position 318 (coloured) is involved in index switching and includes two aa, the prevalent E observed in 59% of the sequences, with its variant D in 41%. B. The nine aligned overlapping nonamer positions (310-318, 311-319, etc.) represent the sliding windows of the alignment region in A. Each nonamer position is shown with the index sequence, two of the variant nonatype sequences and the total number of remaining variant nonatypes of incidence equal to or below 10%. The first nonamer position,

Table 1. A sample of the quantitative analysis of HIV-1 clade B Env protein diversity*.
*All percentages are shown to the nearest whole number.