New Findings in Cleavage Sites Variability across Groups, Subtypes and Recombinants of Human Immunodeficiency Virus Type 1

Background: Polymorphisms at cleavage sites (CS) can influence the processing of Gag and Pol proteins by the viral protease (PR), restore viral fitness and influence the virological outcome of specific antiretroviral drugs. However, data on HIV-1 variant-associated CS variability are scarce.
Methods: In this descriptive research, we examine the effect of HIV-1 variants on CS conservation using all 9,028 gag and 3,906 pol HIV-1 sequences deposited in GenBank, focusing on the 110 residues (10 per site) involved at 11 CS: P17/P24, P24/P2, P2/P7, P7/P1, P1/P6gag, NC/TFP, TFP/P6pol, P6pol/PR, PR/RTp51, RTp51/RTp66 and RTp66/IN. CS consensus amino acid sequences across HIV-1 groups (M, O, N, P), the 9 group M subtypes and 51 circulating recombinant forms (CRF) were inferred from our alignments and compared to the HIV-1 consensus-of-consensuses sequence provided by GenBank.
Results: In all HIV-1 variants, the most conserved CS were PR/RTp51, RTp51/RTp66, P24/P2 and RTp66/IN, and the least conserved were P2/P7 and P6pol/PR. Conservation was significantly lower in subtypes vs. recombinants in P2/P7 and TFP/P6pol and higher in P17/P24. We found a significantly higher conservation rate in Group M than in non-M groups of HIV-1. The late processing sites in the Gag (P7/P1) and GagPol (PR/RTp51) precursors were significantly more conserved than the first CS (P2/P7) in the 4 HIV-1 groups. Here we show 52 highly conserved residues across HIV-1 variants in 11 CS, and the amino acid consensus sequence in each HIV-1 group and HIV-1 group M variant for each of the 11 CS.
Conclusions: This is the first study to describe the CS conservation level across all HIV-1 variants and 11 sites in one of the largest available HIV-1 sequence datasets. These results could support the future design of both novel antiretroviral agents acting as maturation inhibitors and vaccines targeting CS.


Introduction
The human immunodeficiency virus type 1 (HIV-1) Gag proteins are essential for the virus, as they have a structural and functional role in the viral cycle. They coordinate viral trafficking, membrane binding, assembly, cofactor packaging, budding, and viral modulation. Gag proteins are generated during viral maturation, an essential step of the viral life cycle in which the viral protease (PR) cleaves the Gag precursor (Pr55gag) and the GagPol precursor (Pr160GagPol) at specific cleavage sites (CS), yielding mature infectious viral particles [1,2]. The Gag precursor is cleaved within the virion into three main structural Gag proteins: matrix (P17 or MA), capsid (P24 or CA) and nucleocapsid (P7 or NC), flanked by two spacer segments (P1 and P2) with regulatory functions [3]. Gag P6, a sixth protein of the Gag precursor, plays an essential role in the release of the virus from infected cell membranes [3]. During translation of the Gag precursor, an occasional ribosomal frameshift leads to the production of a GagPol precursor protein, the abundance of which is approximately 5% that of the Gag precursor [4]. The GagPol precursor contains the main structural proteins P17, P24 and P7, a transframe protein (TFP), P6pol and the three viral replication enzymes: PR, reverse transcriptase (RT) and integrase (IN) [3]. PR is activated concomitantly with viral budding. As PR is only active as a dimer, it is thought that autoprocessing is initiated by dimerization of two PR domains that are embedded in the GagPol precursor [5]. Maturation triggers a second assembly event that generates a condensed conical capsid core, which organizes the viral RNA genome and viral proteins to facilitate viral replication in the next round of infection [6].
Processing of both HIV-1 Gag and GagPol polyproteins by the viral PR is highly specific, temporally regulated, and essential for the production of infectious HIV-1 particles. Each of the 11 proteolytic cleavage reactions proceeds at a different rate [6], which is determined by the context surrounding each CS [7]. However, the precise mechanisms governing the rates of the cleavage events are still not fully understood [7].
The physical consequence of Gag cleavage is a morphological rearrangement of the non-infectious immature particle into a mature infectious particle. For this reason, amino acid substitutions in Gag proteins, including those in CS, could influence processing [2,8], morphogenesis, budding [9], the replicative capacity or fitness of the virus [3,10] and the virological outcome of specific regimens, particularly those based on protease inhibitors (PI) [5,11-20]. In fact, several Gag substrate mutations, including some in CS, can confer PI resistance in the absence and/or presence of PR mutations [17-20]. The fundamental role of proteolytic maturation in the generation of infectious particles makes inhibition of this process an attractive target for therapeutic intervention. Thus, a new class of potential antiretroviral drugs targeting individual Gag CS has entered development [21].
Whether or not the regulation of processing differs across HIV-1 variants remains unclear. It is well known that HIV-1 shows a high genetic diversity due to its high replication rate, the error-prone RT and the recombination events between HIV-1 variants occurring during viral replication after co-infection and/or superinfection events [22-24]. A large number of HIV-1 variants have been described based on viral sequence homology, and HIV-1 has been divided into four groups: M (main), O (outlier), N (non-M, non-O) and P [23]. HIV-1 Group M is subdivided into 9 subtypes (A-D, F-H, J, K), at least 58 circulating recombinant forms (CRF) (http://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html), each designated by a number and the genetic subtypes present in its genome, and multiple unique recombinant forms (URF), which are widely spread throughout the world and have recombination breakpoints different from those found in CRFs. At least 20% of the 34 million infected humans carry an inter-subtype URF or CRF [25], and new inter-subtype recombinants show increasing prevalence and complexity in the pandemic, including in some European countries [26]. Genetic variability in PR and CS provides the potential to modulate PR activity and susceptibility to PI [20]. For instance, CS polymorphisms in certain HIV-1 group M variants can influence the virological outcome of a first-line LPV/r single-drug regimen [19].
Despite the high biological relevance of CS during HIV-1 maturation, and the importance of knowing CS conservation for the future design of both novel antiretroviral agents acting as maturation inhibitors and vaccines targeting CS, data on HIV-1 variant-associated CS variability are scarce. Previous reports analyzed only a limited number of HIV-1 variants and site sequences [3,27,28]. Thus, the goal of our descriptive analysis was to determine, for the first time, the amino acid conservation rate of each individual protease CS located within Gag or Pol for all HIV-1 groups, Group M subtypes and recombinants circulating in the HIV/AIDS pandemic. For this purpose we used a large dataset of HIV-1 sequences routinely deposited at Los Alamos National Center for Biotechnology Information or GenBank. We also defined the consensus sequence at each CS in all HIV-1 variants, identifying the highly conserved amino acid residues in each CS.

Sequence Data
All available HIV-1 gag/pol sequences were retrieved from GenBank (http://www.ncbi.nlm.nih.gov/). The 12,934 gag/pol sequences spanned nucleotides 790 to 2,292 in gag and nucleotides 2,253 to 5,096 in pol (the latter comprising 2,844 nucleotides), encoding the proteins shown in Table 1. These sequences belonged to 4 groups (M, O, N, P), 9 Group M subtypes (A, with sub-subtypes A1 and A2; B; C; D; F, with sub-subtypes F1 and F2; G; H; J and K), 51 of the 58 CRF currently described that had sequences available in GenBank, and URF (Figure 1). For the subsequent analysis, we grouped the closely related CRF sharing the same parental subtypes and very similar recombination patterns into 12 recombinant families (Figure 2), as previously recommended [23]. All gag/pol nucleotide sequences were retrieved in FASTA format, including the subtype B HXB2 reference sequence. The MEGA program (Molecular Evolutionary Genetics Analysis; Arizona State University, Tempe) version 5.05 (http://www.megasoftware.net/) [29] was used to perform the nucleotide alignments and to translate them into amino acids. After performing the alignments, we determined the residues and their location in Gag and Pol proteins (Table 1), identifying their nucleotides and amino acids and numbering them according to the HXB2 subtype B reference strain (GenBank accession number K03455). We then identified the residues and the location of the 11 cleavage sites (CS) within the Gag and GagPol precursors according to the HXB2 sequence: P17/P24, P24/P2, P2/P7, P7/P1, P1/P6gag, P7/TFP, TFP/P6pol, P6pol/PR, PR/RTp51, RTp51/RTp66 and RTp66/IN.
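As an illustration of the last step, the following is a minimal sketch of how the 10 residues flanking a cleavage site could be extracted from an aligned protein sequence once the scissile bond has been located on the reference. The function name `cs_window`, the toy sequence and the bond position are hypothetical and are not taken from Table 1.

```python
def cs_window(protein: str, last_upstream: int, flank: int = 5) -> str:
    """Return the residues flanking a cleavage site.

    `last_upstream` is the 1-based position of the residue immediately
    before the scissile bond; the window covers `flank` residues on
    each side of that bond (10 residues for the default flank of 5).
    """
    start = last_upstream - flank   # 0-based index of the first residue kept
    end = last_upstream + flank     # exclusive end of the slice
    if start < 0 or end > len(protein):
        raise ValueError("window extends beyond the sequence")
    return protein[start:end]


# Hypothetical toy sequence; the bond is placed after residue 7.
toy = "AAASQNYPIVQNLQGG"
window = cs_window(toy, last_upstream=7)
upstream, downstream = window[:5], window[5:]
print(f"{upstream}/{downstream}")  # prints ASQNY/PIVQN
```

In practice the bond position would come from the HXB2 numbering, so the same coordinates can be reused for every sequence in the alignment.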

Inferred Consensus Sequences
A consensus sequence is the sequence carrying the most frequent residue, either nucleotide or amino acid, at each position in a multiple sequence alignment. We collected all Gag and Pol consensus sequences available in GenBank (http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html).
The HIV-1 Group M variants with inferred consensus sequences in GenBank are indicated in Figure 2; these were calculated as explained at http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html#consensus. Using our amino acid alignment, composed of 12,934 sequences, we determined new consensus sequences for each HIV-1 group and each Group M subtype, CRF and URF in the 11 CS (Figures 3 and 4). Then, we manually compared our inferred variant-associated consensus sequences at each CS with the ones provided by GenBank, when available, and recorded the discrepancies.
We also retrieved the consensus-of-consensuses sequence provided by GenBank in order to generate an alignment of individual gag and pol consensus sequences, which was used to analyze the conservation rate across sites and HIV-1 variants (Figure 5).
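The consensus definition given above (the most frequent residue at each alignment position) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `consensus` and the example windows are hypothetical, and ties are resolved by whichever residue is seen first, which may differ from the rule applied by the database.

```python
from collections import Counter


def consensus(aligned: list[str]) -> str:
    """Most frequent residue at each column of equal-length aligned strings."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be non-empty and aligned to one length")
    # zip(*aligned) iterates over alignment columns; most_common(1) picks
    # the most frequent residue (first seen wins in case of a tie).
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


# Hypothetical 10-residue CS windows from one variant (not real data).
windows = ["ASQNYPIVQN", "ASQNYPIVQN", "VSQNYPIVQN", "ASQNFPIVQN"]
print(consensus(windows))  # prints ASQNYPIVQN
```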

Amino Acid Conservation Rate at CS Across HIV-1 Variants
All gag and pol sequences from GenBank were grouped according to HIV-1 variant. For each CS, we manually compared all downloaded sequences of a given variant with the consensus-of-consensuses sequence provided by GenBank, measuring conservation as the number of coincident amino acids among the 10 residues of the site. The exact percentage of conserved amino acid residues for each HIV-1 variant and site was then calculated by counting the coincident residues at each of the 10 positions across all sequences ascribed to that variant.
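The calculation described here can be sketched as follows, assuming the rate is the number of residues matching the reference divided by the total number of residues compared (10 positions times the number of sequences). The function name `conservation_rate` and the example data are hypothetical.

```python
def conservation_rate(windows: list[str], reference: str) -> float:
    """Percentage of residues matching the reference across all windows."""
    matches = 0
    total = 0
    for window in windows:
        if len(window) != len(reference):
            raise ValueError("window and reference lengths differ")
        # Count positions where the sequence agrees with the reference.
        matches += sum(a == b for a, b in zip(window, reference))
        total += len(reference)
    if total == 0:
        raise ValueError("no sequences supplied")
    return 100.0 * matches / total


# Hypothetical reference window and variant sequences (not real data).
reference = "ASQNYPIVQN"
windows = ["ASQNYPIVQN", "VSQNYPIVQN", "ASQNFPIVQN", "ASQNYPIVQN"]
rate = conservation_rate(windows, reference)
print(f"{rate:.1f}%")  # prints 95.0% (38 of 40 residues match)
```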

Data Analysis
Differences in rates were assessed using chi-square analysis. Statistical analyses were performed using Epi Info v6.0 (Centers for Disease Control and Prevention, Atlanta, GA, USA). Significance was set at p<0.05.

Gag/Pol HIV-1 Sequences Used for the Analysis and Variants Distribution
Figure 1 shows the distribution of Group M variants in our retrieved sequence dataset, which included a total of 7,913/3,269 Gag/Pol sequences from 9 HIV-1 group M subtypes (A, with sub-subtypes A1 and A2; B; C; D; F, with sub-subtypes F1 and F2; G; H; J and K), 1,060/583 Gag/Pol sequences ascribed to 51 CRF and 12/11 Gag/Pol URF sequences.
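The sequence counts reported for this dataset can be cross-checked with a short tally. The sketch below uses only figures quoted in the text (Group M subtypes, CRF and URF counts, and the 43/43 non-M sequences mentioned in the Discussion); the variable names are arbitrary.

```python
# (Gag, Pol) sequence counts as reported in the text.
counts = {
    "subtypes": (7913, 3269),
    "CRF": (1060, 583),
    "URF": (12, 11),
}
non_m = (43, 43)  # Groups O, N and P combined

# Group M totals per gene, then overall totals including non-M groups.
group_m = tuple(sum(pair[i] for pair in counts.values()) for i in range(2))
overall = tuple(m + n for m, n in zip(group_m, non_m))

# Share of recombinants (CRF + URF) among Group M sequences, per gene.
recombinant_share = tuple(
    round(100 * (counts["CRF"][i] + counts["URF"][i]) / group_m[i], 1)
    for i in range(2)
)

print(group_m)            # prints (8985, 3863)
print(overall)            # prints (9028, 3906)
print(sum(overall))       # prints 12934
print(recombinant_share)  # prints (11.9, 15.4)
```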
In order to simplify the analysis, we grouped all the sequences from the 51 CRFs into 12 CRF families according to their similar recombination patterns (Figure 2). The downloaded sequences for each subtype and CRF family are detailed in Figure 5. Despite the large difference between the numbers of retrieved Gag (8,985) and Pol (3,863) sequences, the distribution of HIV-1 Group M subtypes and CRF families was similar for both genes (Figure 1). Recombinants accounted for 11.9% of gag and 15.4% of pol sequences. Among subtypes, subtype B was the most represented in both the gag and pol coding regions (43%/74.3%), followed by subtype C (27.4%/16.7%), sub-subtype A1 (21.6%/4.3%) and subtype D (5.6%/1.9%). There were no gag sequences from sub-subtype F2 and subtypes J and K available in our dataset. Within the recombinants, family 01 (69.7%/44.4%) was the most represented, followed by families BC (7.2%), AG (5.9%), cpx (5.6%) and BF (4.5%) in gag and by families AG (16.8%), cpx (12.5%), BF (9.4%) and BG (4.3%) in pol, among others. URF sequences represented less than 0.3% of downloaded sequences (12 gag and 11 pol sequences).
[Figure legend] Changes are only indicated when they appeared at a specific position in at least 50% of the sequences downloaded from GenBank, in order to compare them with the GenBank consensus-of-consensuses sequence. Asterisks indicate the HIV-1 variants shown in Figure 2 with no consensus sequence available in GenBank. Black represents highly conserved amino acid residues, present in more than 99% of the 9,028 Gag and 3,906 GagPol HIV-1 sequences with respect to the consensus-of-consensuses sequence. When two residues within the analyzed sequences from each HIV-1 variant

HIV-1 Variant-specific gag/pol Consensus Sequences Available at GenBank
Figure 2 shows the specific subtypes and recombinants with gag and pol consensus sequences available in GenBank; each consensus carries the most frequent residue, either nucleotide or amino acid, at each position in a multiple sequence alignment. Table 2 summarizes the amino acids involved in each of the 11 CS (10 amino acids per site) in the HXB2 isolate, as well as the consensus-of-consensuses sequence for each CS, defined by GenBank after the alignment of 28 gag/24 pol individual consensus sequences, corresponding to 8/7 of the 9 subtypes in Group M and to 11/10 of the 58 CRF described (Figure 2). The consensus-of-consensuses sequence was taken as the reference for the analysis of amino acid conservation across variants in the 110 residues (10 amino acids in each of the 11 CS), as described in Methods.

New Inferred Consensus Sequence in HIV-1 Groups, Subtypes and Recombinant vs. that Provided by GenBank
Since GenBank did not define gag and pol consensus sequences for all HIV-1 subtypes and CRF, we deduced our own consensus sequences for all HIV-1 variants using our alignment of 9,028 Gag and 3,906 Pol HIV-1 sequences. We determined the rate at which amino acid residues among the retrieved sequences coincided with the consensus-of-consensuses at the corresponding site. For the first time, we inferred the consensus sequence at each site for the different HIV-1 groups and for all subtypes, sub-subtypes and recombinants within Group M. Figures 3 and 4 show the HIV-1 variants that carry amino acid differences in CS with respect to the corresponding consensus-of-consensuses sequence from GenBank. We identified the positions at which our inferred consensus sequence presented the same amino acid residue as the consensus-of-consensuses provided by GenBank. All discrepancies between our inferred variant-specific CS consensus sequences and the consensus-of-consensuses provided by GenBank were also identified (see Table S1).
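A comparison of this kind, between an inferred variant consensus and a reference consensus at one site, can be sketched as below. The function name `discrepancies` and the example sequences are hypothetical, and labelling the 10 positions P5 to P5' (counting outwards from the scissile bond) is an assumption made here for readability rather than a convention stated in the text.

```python
def discrepancies(inferred: str, reference: str) -> list[tuple[str, str, str]]:
    """Return (position label, reference residue, inferred residue) mismatches."""
    if len(inferred) != len(reference) or len(reference) % 2:
        raise ValueError("sequences must have the same, even length")
    half = len(reference) // 2
    # Labels run P5..P1 upstream of the bond and P1'..P5' downstream.
    labels = [f"P{half - i}" for i in range(half)]
    labels += [f"P{i + 1}'" for i in range(half)]
    return [
        (label, ref, inf)
        for label, ref, inf in zip(labels, reference, inferred)
        if ref != inf
    ]


# Hypothetical sequences (not real data).
reference = "ASQNYPIVQN"
inferred = "VSQNFPIVQN"
print(discrepancies(inferred, reference))
# prints [('P5', 'A', 'V'), ('P1', 'Y', 'F')]
```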

Observed Differences in CS Conservation Rates Across HIV-1 Variants and Sites
We evaluated the percentage of conserved residues in the retrieved sequences for each HIV-1 variant and site with respect to the GenBank consensus-of-consensuses amino acid sequence, as explained in Methods. We established a color code to indicate the different levels of conservation, together with the exact amino acid conservation rate in each CS and variant (Figure 5). Interestingly, despite the structural and functional roles of these proteins in the viral cycle, we observed different conservation rates across sites and HIV-1 variants.
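Differences in conservation rates between two sets of sequences were assessed with the chi-square test. The following is a minimal sketch of a Pearson chi-square test on a 2x2 table of conserved versus non-conserved residues. The counts are hypothetical, the function name `chi_square_2x2` is arbitrary, and no continuity correction is applied, which may differ from the procedure used in Epi Info.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: conserved / non-conserved residues in two groups.
chi2, p_value = chi_square_2x2(950, 50, 900, 100)
print(f"chi2 = {chi2:.2f}, significant at 0.05: {p_value < 0.05}")
# prints chi2 = 18.02, significant at 0.05: True
```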

Discussion
Analysis of HIV-1 genomes provides useful biological information on the structure and function of viral proteins [31]. Correct core formation is essential for the production of infectious HIV particles and is known to depend on accurate proteolytic processing of Gag. Thus, mutations that disrupt the cleavage of individual sites or alter the order in which sites are cleaved result in aberrant particles with significantly reduced infectivity [6]. Although other publications previously reported that certain CS were more conserved than others, they analyzed only a very limited number of HIV-1 variants and site sequences [3,27,28]. Thus, to our knowledge, our study is the first to evaluate the conservation rate in 11 CS within Gag and GagPol precursors and to define the consensus sequences at each site using a large sequence dataset including all Group M subtypes and most CRF. Furthermore, it is the first study that includes sequences from Groups N, O and P, identifying completely conserved residues at CS present in all 4 groups. We showed the conservation rate in each HIV-1 variant and CS, finding different conservation rates across sites in the 4 HIV-1 groups and in Group M variants, including a complete panel of recombinants, whose prevalence and complexity are increasing in the pandemic [23]. In fact, the different clade distribution of the gag and pol sequences retrieved from GenBank for this study could be explained by the large number of recombinants circulating in the pandemic, which carry different clades in different genes of the viral genome.

New Findings on CS Variability Across HIV-1 Variants
Only a limited number of studies have previously evaluated the natural variation within gag and CS [3,5,12,28,32,33]. However, these have mainly focused on subtypes B and/or C, and in most cases they analyzed a smaller dataset or a limited number of CS. Furthermore, the majority of the studies used HXB2 subtype B [5,32,33], pNL-4-3 subtype B [12] or the Group M most recent common ancestor sequence [3] as the reference sequence for conservation analysis. Only one used the consensus-of-consensuses sequence provided by GenBank as a reference for comparisons [28]. None inferred a consensus sequence for each analyzed HIV-1 variant and site. Other studies included either recombinants or non-M Group sequences. Despite the wide disparity between the number of sequences downloaded from GenBank for Group M (8,985/3,863 gag/pol sequences) and for the remaining groups (43/43 gag/pol sequences) or certain subtypes (H, J, K), sub-subtypes (F2) or CRF, the available data permitted a comparison of conservation rates at CS, and we were able to define variant-specific differences in the consensus sequence at each CS for each HIV-1 group, subtype, CRF and URF (see Figure 5). Our data show that the degree of conservation differs between individual amino acid positions at CS and reveal significant discrepancies across specific HIV-1 variants and CS, thus improving the available GenBank data on variant-specific consensus sequences. By using a large dataset of 12,934 sequences from all HIV-1 variants, our study revealed that CS3 (P2/P7), CS5 (P1/P6gag), CS7 (TFP/P6pol) and CS8 (P6pol/PR) were the least conserved processing sites across all HIV-1 variants. This finding is in agreement with a previous publication that used a smaller dataset of 32 subtype C, 34 subtype B and 18 other subtype sequences [3]. Additional studies are necessary to understand the higher variability in these CS, which have important roles in the viral cycle.
In more detail, CS3 is the first processing site in the Gag and GagPol precursors and is critical for RNA dimer maturation [34]; CS7 is involved in the activation of the viral PR and in the timing and specificity of GagPol cleavage [35]; CS5 is responsible for the synthesis of protein P6gag, which is required for the release of mature, infectious virions [36]; and CS8 is essential for PR autoprocessing and is probably involved in the correct dimerization of PR [37].

Structural Constraints on CS Variation
Complex interactions of the substrate amino acids within the active site of the viral PR are required for efficient Gag and GagPol cleavage by the PR. HIV-1 PR is only functional in dimeric form and a single monomer is embedded within each precursor. Two individual monomers in different GagPol chains must, therefore, come together to form an embedded dimeric PR, which ultimately cleaves itself into a mature form [37]. HIV-1 maturation requires the recognition by PR of the asymmetric, three-dimensional conformation of the Gag substrate, rather than a particular peptide sequence [38] and, afterwards, PR mediates the cleavage of the HIV-1 structural Gag and GagPol polyproteins by interacting with specific CS [6,39]. Each substrate has a unique structure that differs in amino acid composition [3]. It is thought that these small differences in substrate structure impact affinity for the viral PR and contribute to the highly regulated and ordered stepwise process of maturation in which the individual cleavages occur at different times and rates [3,30]. Additional determinants beyond amino acid sequences and local secondary structures of CS are involved in Gag and GagPol processing [7]. As Gag is conserved, there are constraints on the viability of viral strains with multiple mutations due to the fact that combined mutations are likely to destabilize multiprotein structural interactions that are critical for viral function [40]. Thus, amino acid sequence conservation indicates that the specific amino acids are required to maintain basic structure and function, although other authors have suggested an important role of RNA structure in HIV-1 conservation [33,41]. 
It is known that the physicochemical and structural properties of certain HIV-1 proteins with functional roles in the viral cycle, such as gp41, can be strongly conserved despite substantial sequence diversity, apparently indicating a delicate balance between evolutionary pressures and the conservation of protein structure [42]. Protein structure, specifically α-helix domains, has been associated with conservation in HIV-1 [33], and the α-helix is a stable structural element in proteins [43].
Our study reveals which CS amino acid sequences may be most important for maintaining viral processing by PR, and the level of tolerance to amino acid change in each HIV-1 variant. Moreover, the significantly higher conservation observed in the late vs. the first CS in the Gag and GagPol precursors (flanking the PR) would suggest stronger structural constraints in the last steps of viral processing. Although the aim of our study is purely descriptive, we strongly believe that it can serve as a working tool for research into the CS structure required for efficient cleavage across HIV-1 variants and for the design of maturation inhibitors and vaccines targeting CS. Understanding HIV-1 gag and pol coevolution [44,45] and the influence of naturally occurring variant-specific polymorphisms at PR [46] on the cleavage process is also crucial for a better interpretation of the biological significance of amino acid changes in CS in the context of a specific HIV-1 variant. Lastly, whether or not the variant-specific residues located in each CS modulate the replicative capacity of the corresponding variant, as was observed for specific natural polymorphisms in the PR of some non-B variants [47], requires further investigation.
It has been suggested that sequences around the CS in Gag are as conserved as functional motifs and sequences targeted by RT inhibitors, and are more conserved than non-functional motifs [28]. These authors suggested that the amino acid sequences overlapping the CS are immunogenic and, consequently, that a vaccine targeting CS could be used for the majority of the world population [28]. Thus, our data on CS conservation across HIV-1 variants could help to identify potential targets for a vaccine against HIV that is effective for all groups, subtypes and recombinants. Moreover, since mutations within CS have been associated with PI exposure and maturation inhibitor resistance [5,32], our results could potentially provide a better understanding of the role of gag in antiretroviral resistance and in the development of future maturation inhibitors [4].

Conclusion
This descriptive study is the first to determine the degree of CS conservation across most HIV-1 variants and sites in a large dataset of 12,934 sequences, inferring the amino acid consensus sequences at 11 CS in all Group M subtypes and most CRF and URF, as well as in Groups O, N and P. Our results provide new findings that can contribute to a better understanding of viral evolution, the processing of Gag and GagPol precursors and gag structure-function relationships, among others. Our descriptive research could help other researchers in the design of both novel antiretroviral agents acting as maturation inhibitors and vaccines targeting CS. The biological significance of the HIV-1 variant-associated variability found at each processing site in our study needs further investigation.

Table S1. HIV-1 variants showing differences with the CS consensus-of-consensuses sequence inferred by GenBank. (DOC)