Molecular Characterization of HIV-1 Subtype C gp-120 Regions Potentially Involved in Virus Adaptive Mechanisms

The role of variable regions of HIV-1 gp120 in immune escape of HIV has been investigated. However, there is scant information on how conserved gp120 regions contribute to virus escaping. Here we have studied how molecular sequence characteristics of conserved C3, C4 and V3 regions of clade C HIV-1 gp120 that are involved in HIV entry and are target of the immune response, are modulated during the disease course. We found an increase of “shifting” putative N-glycosylation sites (PNGSs) in the α2 helix (in C3) and in C4 and an increase of sites under positive selection pressure in the α2 helix during the chronic stage of disease. These sites are close to CD4 and to co-receptor binding sites. We also found a negative correlation between electric charges of C3 and V4 during the late stage of disease counteracted by a positive correlation of electric charges of α2 helix and V5 during the same stage. These data allow us to hypothesize possible mechanisms of virus escape involving constant and variable regions of gp120. In particular, new mutations, including new PNGSs occurring near the CD4 and CCR5 binding sites could potentially affect receptor binding affinity and shield the virus from the immune response.


Introduction
Human immunodeficiency virus type 1 (HIV-1) gp120 envelope protein is characterized by a remarkable genetic variability [1][2][3], due to extensive replication of the virus and to a high mutation rate of the reverse transcriptase [4]. This variability is largely responsible for the ability of the virus to escape the host immune response, in particular neutralizing activity of antibodies [5][6][7][8]. The role of the variable regions of gp120 (V1 to V5) in promoting virus escape has been reported by several groups, including ours [9][10][11][12][13][14][15]. These studies have shown that both amino acid sequence length and number of putative N-glycosylation sites (PNGSs) in the V1, V2 and V4 variable regions of the protein change during the course of the disease, generating HIV variants resistant to neutralizing antibodies. Similar mechanisms of escape have been observed in different HIV-1 subtypes of the M group [10,[15][16][17][18], although subtype-specific differences may be present [9,19].
In HIV-1 subtype C, the most widespread HIV clade worldwide and the prevalent one in Southern Africa [20,21], we and others have shown that during the chronic stage of the disease the V1 and V4 variable regions of gp120 undergo a significant increase of their aminoacid sequence length and new PNGSs are generated [9,20], while the V5 undergoes changes in its net electrical charge during the course of the disease. These changes may play a role in glycan packing and immune evasion [9,14,22].
Few studies have been conducted to date that have investigated the modulation of molecular sequence characteristics of the constant regions of gp120 during the course of the disease, even though these regions are responsible for HIV entry into the cell, and are important targets of the immune response, including neutralizing antibodies. The V3 domain, known to be variable in clade B and D viruses, is relatively conserved in HIV-1 subtypes A, C, G, E and H [Los Alamos website. Available:http://www. hiv.lanl.gov/content/sequence/HIV/COMPENDIUM/1999/7/ V3LoopEvolution.pdf. Accessed 2014 March 31,and 9,23,24]. It plays a critical role in virus entry, since it binds both CCR5 and CXCR4 receptors and is a target of neutralizing antibodies [9,23,25,26]. The C3-V4 region has been reported to be a major target of the early autologous neutralizing response in HIV-1 subtype C infection and the N-terminal portion of C3, which contains the a2 helix region, may play a role in virus escape processes [6,24,27,28]. It has also been reported that some residues in C3 bind the CD4 receptor, in association with amino acids in the C4 region [29][30][31][32].
In the present study, we have therefore investigated how amino acid characteristics of the constant C3, C4 and V3 regions of HIV-1 subtype C (i.e. sequence length, number of PNGSs, number of sites under positive selection pressure and electrical charges) are modulated during the course of the disease. We used a novel integrative bioinformatics sequence analysis which allowed us to hypothesize a model to explain how conformational changes in these regions may regulate virus escape from the immune response.

Ethics Statement
The study was approved by the Committee for Research on Human Subjects (Medical) at the University of Witwatersrand and by the Ministry of Health and Social Welfare Scientific Ethical Committee of Swaziland. All participants provided written informed consent to participate in the study.
Sample collection, determination of disease stage and description of the cohorts A total of 72 plasma samples were collected from in the time period 2005 to 2007 from HIV-positive individuals, living in South Africa and Swaziland, all naïve for antiretroviral therapy (ART); 24 samples were obtained from the 2006 Swaziland HIV National Serosurvey [33,34], 24 samples from the Chris Hani Baragwanath Hospital (CHBH) in Soweto, Johannesburg, South-Africa, in the framework of the activities included in the AVIP project (European Commission website. Available: http://ec.europa.eu/ research/health/infectious-diseases/poverty-diseases/projects/84_ en.htm. Accessed 2014 March 31) and 24 samples from the HIV/ AIDS National Referral Laboratory (NRL) at the Government Hospital in Mbabane, Swaziland, in the framework of activities linked to the Italian AIDS Programme. Ethical clearance for these studies was obtained from local Ethical Committees.
The 72 patients were classified for disease stage into 3 groups, based on Avidity Index (AI) assay data [34,35] and CD4+ T cell number, as previously described [9]. Briefly, by AI assay we identified those HIV infections that occurred within 6 months from blood drawing, independently of their CD4+ T cell number. A number of 24 patients were classified as recently infected (ES = Early disease Stage). All these patients were from Swaziland, collected in the frame of the 2006 Swaziland National Serosurvey [33,34]. Patients that by AI were identified having an established infection were further classified into patients at chronic or late disease stage, according to their CD4+ T cell number. Specifically, 24 patients with a CD4+ T cell count .200/ml were classified in the chronic disease stage group (CS = Chronic disease Stage), whereas 24 patients with a CD4+ T cell count #200/ml were included in the late disease stage group (LS = Late disease Stage). CD4+ T cell number median values were 544,5/ml (values ranging from 418 to 811) and 122/ml (values ranging from 7 to 199) for CS and LS groups, respectively. The 24 patients included in the CS group were from South Africa. Samples from 6 out of these 24 patients were collected in 2005. The remaining 18 samples were collected in 2006. The 24 patients included in the LS group were from Swaziland. Samples from 13 out of these 24 patients were collected in 2006. The remaining 11 samples were collected in 2007.

Viral RNA extraction and sequence amplification
Viral RNA was extracted from 0.5 ml of plasma using Qiamp viral RNA mini Kit (Qiagen), after treatment with Heparinase (Sigma).
The V1 to V5 coding region in the env gene was amplified by RT-PCR using SuperScriptTM One-Step RT-PCR System (Invitrogen) and specifically designed outer primers for clade C. The following primers were used: AC-Env Outer For = 59CAGATGCATGAGGATATAATCA39; ED12 m Outer Rev = 59AGTGCTTCCTTGCTGCTCCCAA39. The RT-PCR mix was composed of 0.5 to 1 mg of RNA and an RT/Taq buffer mixture containing dATP, dCTP, dTTP, dGTP 0.4 mM (Roche), MgSO 4 2.4 mM (Invitrogen), 1 ml RT/Taq-platinum (Invitrogen), RNase inhibitor 40 U/ml (Invitrogen), and primers 10 mM (MWB-Biotech); RT-PCR reaction was carried out using a thermal cycler (Eppendorf) and the following program: 45uC for 309 for retrotranscription and 94uC for 29 for the RT denaturation; the resulting cDNA was amplified as follows: 94uC for 150, 50uC for 300 and 68uC for 19300for a total of 40 cycles and a final extension step at 68uC for 79.
The resulting PCR product was again amplified in a nested PCR, using specifically designed primers in order to obtain nucleotide sequences corresponding to the V3-C3 and the V4-C4-V5 Env regions, respectively. Primers specific for each region were the following: for the V3-C3 region: V3C3ForA = 59CTGTTAAATGGTAGCCTAGC39 and V3C3RevA = 59GCAATAGAAAAATTCTCC39. If amplification failed, a nested PCR was repeated using another couple of inner primers, i.e. V3C3ForB = 59CACAGTACAATGTACACATG39 and V3C3RevB = 59RCAATAGAAAAATTCTCCTC39; for the V4-C4-V5 region: V5-ARev59 = TATAATT-CACTTCTCCARTTGTC. If amplification failed, a nested PCR was repeated using another couple of inner primers: V4-BFor59 = TTTAATTGTRGAGGAGAATTTTTCTATTG39, V5-BRev59 TATTTATATAATTCACTTCTCCAATTGTC39. Each nested PCR reaction was conducted as follows: 1-5 ml aliquot of the amplified product was added to a reaction mixture containing dNTPS 200 mM (Roche), MgCl 2 2.5 mM (Invitrogen), the couple of 20 mM primers corresponding to the region that had to be amplified (MWB-Biotech) and AmpliTaq Gold DNA polymerase 2.5 U/ml (Applied Biosystems). Amplifications were carried out specifically for each region, as following: V3-C3: 96uC for 79 (1st cycle), then 150at 94uC, 300 at 50uC, 300 at 72uC, for 40 cycles; V4-C4-V5 region: 96uC for 79 (1st cycle), then 150 at 94uC, 300 at 44uC, 300 at 72uC, for a total of 40 cycles and a final extension step of 79 at 68uC.

DNA purification and sequencing
Amplified DNA of each region was purified using Qiaquick PCR purification kit (Qiagen), according to the manufacturer's protocol. The DNA samples were then quantified and checked for purity by measuring absorbance at 260 nm and 280 nm. The V3-C3 and V4-C4-V5 amplified regions were then sequenced using BigDye (Applied Biosystem), with the same primers as the nested PCR. Sequencing was performed on uncloned PCR products to identify the prevalent viral quasispecies.
The electropherogram analysis was edited using Chromas Pro (Technelysium website. Available: www.technelysium.com.au/ ChromasPro.html. Accessed 2014 March 31). All sequences were aligned using ClustalW by Bioedit [36] and corrected for the multiple alignment by manual editing. The nucleotide sequences were translated into amino acid sequences using GeneRunner (www.generunner.net) and further codon-aligned using BioEdit [36].

Phylogenetic analysis
Phylogenetic analysis was based on the V3-V5-coding region from 72 HIV-1 variants isolated from the 72 patients (one variant/ patient) included in the cohorts. It was carried out using the PAUP software (version 4.0) (http://paup.csit.fsu.edu/downl.html) with the K81 model of substitution and using both Neighbor-Joining (NJ) and Maximum Likelihood (ML) treebuilding methods. The evolutionary model was chosen as the best fitting nucleotide substitution model in accordance with the results of the Hierarchical Likelihood Ratio Test (HLRT) implemented in the MODELTEST software (version 3.6) [37].
The parameters for the nucleotide substitution model were estimated by the ML method using a NJ tree (Jukes-Cantor distance) as the base tree [38].
The statistical robustness and reliability of the branching order within phylogenetic trees were confirmed by either a bootstrap analysis using 1000 replicates, for the NJ tree, or the zero branch length tests for the ML tree. All calculations were performed with PAUP software (version 4.0).

Analysis of electric charges
To determine the electric charge of each region, the algebraic sum of all charged residues, both positive and negative, was considered as total electric charge (Q tot ). The sum of all positive charged residues was considered as total positive charge (Q pos ) and the sum of all negative charged residues as total negative charge (Q neg ).
An original Matlab (the MathWorks Inc. USA) program was written to yield the distribution of Q tot , Q pos and Q neg directly from the *.fas files with the aligned sequences. Charge distribu-tions were analyzed with the Wilcoxon's rank sum test, after a preliminary checking of the non-Gaussian nature of the distributions using the Kolmogorov-Smirnov test, as previously described [39]. Finally, the charge correlation coefficient was calculated with Matlab.

Codon specific dN/dS ratio
The CODEML program implemented in the PAML 3.14 software package (UCL website. Available: http://abacus.gene. ucl.ac.uk/software/paml.html. Accessed 2014 March 31) was used to investigate the adaptive evolution of the V3, C3 and C4 regions of HIV-1 env gene [40]. For each region, sequence alignment was performed prior to testing which amino acids were under positive selection. Six models of codon substitution, M0 (one-ratio), M1a (nearly neutral), M2a (positive selection), M3 (discrete), M7 (beta), and M8 (beta and omega) were applied in this analysis [41]. Since these models are nested, we used codon-substitution models to fit the model to the data using the likelihood ratio test (LRT) [42].
The ''Bayes empirical Bayes'' (BEB) approach implemented in M2a and M8 was used to determine the positively selected sites by calculating the posterior probabilities (P) of the different classes for each site [43].
The strength of positive selection (i) for each amino acid site was calculated as previously described [29], by defining the Weighted where v ki represents the k-th v value for the i-th site, and the weight f ki is the associated posterior probability, according to model M8 (10 values of v are chosen for the discrete beta distribution, and another value of v is not constrained by the beta distribution and is allowed to be greater than 1). For each pair of data set, each one associated to a disease stage, we tested whether the strength of positive selection was significantly different (H1), as opposed to being equivalent (H0), by using a paired Wilcoxon rank sum test with a continuity correction applied to the normal approximation for the P values [44]. Only shared sites having a weighted mean v i value greater than 1 in the two data sets being compared were included in the test.

Three-dimensional mapping and contact maps
The core structure used for mapping the positive selected residues corresponds to the X-ray structure of CD4-bound JR-FL gp120 (Protein Data Bank [PDB] ID: 2B4C) [45]. This structure was chosen because it provides the spatial representation of V3, and is therefore useful for the identification of the PS residues in V3 (besides those in C3 and C4) in the appropriate structural context. Moreover, the chosen structure derives from a strain that has the same co-receptor usage as the majority of the isolates hereby characterized. Three-dimensional images were generated using the pyMOL v. 1.3 software (PyMOL website. Available: http://www.pymol.org/ Accessed 2014 March 31).
For generation of the contact maps, we referred to the same gp120 structure used for 3D mapping [45]. The contact map is a distance matrix that identifies, in a known 3D structure (e.g., taken from Protein Data Bank), the proximal contact between two areas. The contact distances between C-alpha carbons were used to identify interactions between regions or differences between the clades. A 10 Å cut-off distance between the C-alpha carbons was considered as a measure for contact [27]. Contact maps were generated with CMView v0.9.6, after loading the appropriate molecular structure by specifying its PDB identification number. The software allows to highlight relevant regions (e.g. C3 or C4 of Env) or single residues of the loaded structure.

Statistical analysis
The Wilcoxon rank sum test was used to evaluate the statistical significance of the difference between parameters analysed in the study (if not explicitly stated otherwise).
Furthermore, to measure existing correlation between sequence length and glycosylation status, the Spearman's correlation test was used. All analyses were conducted using the Stata 8.2 software.

Phylogenetic analysis
We amplified and sequenced the V3-V5-coding regions of HIV variants isolated from 72 HIV-infected individuals living in South Africa and Swaziland, naïve for ART during the years 2005 to 2007. A phylogenetic analysis based on the V3-V5 sequence ( Fig. 1) revealed that all patients were infected by HIV-1 subtype C variants, as also previously described [9]. In the figure virus sequences from patients have been highlighted with different colors according to year of collection of sample: 2005 in blue, 2006 in red and 2007 in green. Since we wanted to study if and how sequence molecular characteristics changed during the evolution of the disease, we had to verify that the natural Env sequence evolution through the three year period would not have influenced the molecular analyses we performed. As it is evident in figure 1, the different years are well intermixed in the figure, indicating that the sampling year was not a main causal factor for the observed differences in the sequence characteristics of the Env regions we have considered for our studies. Furthermore, the phylogenetic analysis in figure 1 shows an intermixing of all the sequences, independently of the patient region of origin, indicating that a possible, if any, difference in the ethnical background of the two populations from Swaziland and South Africa does not constitute a bias for the further analysis we performed.

Amino acid sequence length and PNGS distribution
Lengths of the amino acid sequences of the gp120 V3, C3 and C4 regions from the HIV-1 infecting virus variant obtained from each one of the 72 patients, as well as the predicted tropism of the variant, are reported in Table 1. As shown in the Table, the median length of amino acid sequences of the V3, C3 and C4 regions did not significantly vary with the disease stage. The V3 sequence length was equal to 35 amino acids in virus variants from 66 out of 72 samples, 34 amino acids in 5 variants, and 37 amino acids in 1 variant. In the variants with a V3 sequence of either 34 or 37 amino acids, no correlation with disease stage was found. In fact, of the five variants with a V3 sequence of 34 amino acids, 2 were from ES, 2 from CS and 1 from LS patients. The only variant with a V3 sequence of 37 amino acids was obtained from an LS patient. Regarding the predicted viral tropism, the three sample groups presented a similar pattern with 20 out of 24 variants being R5 tropic and 4 being X4 tropic in the ES group; 19 R5 and 5 X4 tropic in the CS group and 18 R5 tropic and 6 X4 tropic in the LS group.
Amino acid sequence length of the C3 region ranged from 50 to 54 amino acids. No significant differences in sequence length among the disease stages were observed (Table 1), although a tendency to a broader length variability was detected in the ES group where 15 variants had a C3 region of 52 amino acids, 4 of 53, 2 of 51, 2 of 50 and 1 of 54. In the CS and LS groups, the C3 region was more conserved with 21 variants bearing a C3 region of 52 amino acids in both disease stages. The remaining 3 variants in the CS group had a C3 region of 51 amino acids, whereas in the LS group, 2 had a C3 region of 53 amino acids and 1 of 51 amino acids. The C4 region length was of 41 amino acids in all 72 variants. Table 2 reports the mean and median number of PNGSs for each region and for each disease stage. Overall, both the median and the mean number of PNGSs of each region did not show statistically significant variations among the three disease stages. A qualitative analysis of the distribution of PNGSs among the different groups of patients is reported in figure 2. A number of PNGSs, in particular, PNGS N301 (in V3), N332, N339 and N356 (in C3), and N442 and N448 (in C4) are conserved in variants obtained from patients at all the disease stages with a frequency higher than 70%, with the exception of PNGS 339 (numbering according to HXB2 sequence) in the C3 region that is present in virus variants obtained from patients at the LS stage where it is represented with a frequency of about 60%. For this reason these PNGSs have been named ''fixed''. Besides these, other PNGSs were found with a lower frequency in all the regions. Most of these PNGSs are represented with a frequency that is often ,20% and are not constantly present in virus variants obtained from patients at all disease stages. Therefore, they have been named ''shifting'' PNGSs, in agreement with previous reports [9,14,46,47]. The majority of the shifting PNGSs are present in the C3 region in variants obtained during the chronic stage of the disease, in particular in the a2 helix (amino acids 335 to 352) and in the Nterminal and in C-terminal portions of the C3 region. The PNGS at position 362 was found with a frequency of 33.3% in the variants from the CS group (encircled in the figure), whereas it is present at much lower frequency in variants from the ES and LS groups. A shifting PNGS between N442 and N448 in the C4 region was detected only during the late stage of the disease (encircled in the figure). Of note, we also found a low frequency of expression, at all disease stages, of the glycosylated residue 295, which has been described to be highly conserved in other HIV-1 subtypes [47] that comprehends the last amino acid in the C2 region and the first two in the V3 region. Since this PNGS is located between the C2 and the V3 regions, it has not been reported in figure 2.

Analysis of positive selection
For each disease stage, we evaluated which amino acid residue in the constant V3, C3 and C4 regions was under positive selection pressure, by calculating the dN/dS ratio for each position. Table 3 lists the sites that have a mean dN/dS ratio .1 in variants from at least one of the disease stages. The number of amino acid positions under positive selection pressure is considerably higher in the C3 region at the chronic disease stage. In fact, in this region, ten, twelve and three residues appear to be under positive selection in the virus variants from ES, CS and LS groups, respectively. Nine out of twelve residues under positive selection pressure in the variants from CS group cluster in the N-terminal portion of the C3 domain, within the a2 helix (amino acids 335 to 352). Within the V3 domain, only three sites are under positive selection pressure (amino acids 300, 309 and 322) in variants from the CS group, two in variants from the ES group (amino acids 300 and 322) and two in variants from the LS group (amino acids 300 and 309). In the C4 domain only one site is under positive selection pressure (amino acid 429) in virus variants obtained at all disease stages. In this region, an additional amino acid at position 446 appears to be under positive selection pressure only in the LS group.
We have then calculated the strength of the positive selection pressure for each one of the identified residues. No statistically significant difference between the set of weighted mean values (v i ) among the disease groups was found for both the V3 and C4 regions (data not shown). Conversely, for the C3 domain, a significant difference (p = 0.00016) was found between the set of weighted v i of the ES and CS groups (Fig. 3), with sequences from variants of the ES group displaying residues with higher v i values than the ones calculated for the CS group.

Molecular visualization of sites under positive selection pressure
In order to characterize the positions of the residues under positive selection in the gp120 protein, we mapped them on the three-dimensional structure of the CD4-bound gp120 derived from the HIV-1 JR-FL strain [45]. Figures 4, 5 and 6 show the positions of the residues under positive selection in the C3, C4 and V3 regions (spheres in purple), respectively. Sites in C3 from variants obtained from the ES and CS groups are more numerous than those found in the variants from the LS group (Fig. 4). These sites are located in the a2 helix region (in blue), near the CD4binding site and close to residues critical for the binding to CCR5 (spheres in red). Residues within the C3 region that are critical for the binding to CD4 (amino acids 366 and 368, not shown in the figure, since hidden by the CD4 structure) are fully conserved in virus variants obtained at all the three disease stages, as also confirmed by the low Shannon Entropy values (data not shown). In contrast, the number of sites under positive selection pressure in the C4 region (Fig. 5) is small in the virus variants from all the three disease groups. Again, these residues are placed near the CD4-binding site or close to residues critical for CCR5 binding. Finally, the few sites under positive selection in the V3 region are all proximal to the highly conserved residues critical for CCR5 binding (Fig. 6).

Electric charge analysis and interaction between regions
In order to investigate whether the electric charges of C3 and C4 regions of gp120 (that are the principal components of the  Table 1. Amino acid sequence length of C3, C4 and V3 regions and estimated tropism of HIV-1 clade C variants from patients at different disease stages. CD4 binding site) can have an impact on the conformation of the CD4 binding site, we calculated the electric charge of each one of these regions and correlated it with the electric charges of all the other variable regions. After computing all the total electric charge (Q tot ) correlations, we found statistically significant correlations between C3 and V4 and between C3 and V5 regions. These correlations are shown in figures 7 and 8. A negative correlation was present in the C3-V4 interaction (increased Q tot in V4 region and decreased Q tot in C3 region) at all stages of disease (Fig. 7). This correlation was statistically significant in the LS group of patients (panel c, p = 0.0090). The electric charge analysis of the C3-V5 interaction showed a strong tendency to a positive correlation (increased Q tot in both C3 and V5) in the virus variants from LS group of patients (panel f, p = 0.0526). This correlation was statistically significant when the interaction between electric charges of a2 helix (in C3 region) and V5 was considered ( Fig. 8 panel f, p = 0.0192). The correlations of the electric charges between C3 and V4, C3 and V5, a2 helix and V4, and a2 helix and V5 were then further characterised by means of contact maps, which allowed us to identify close spatial contacts between two regions. Figure 9 shows the contact maps for C3-V4 and C3-V5 (Panel A), and a2 helix-V4 and a2 helix-V5 (panel B) in the crystallographic structure of the JR-FL gp120 bound to CD4 [45]. As shown in panel A (top), C3 and V4, and C3 and V5 regions strongly interact (light blue area). In particular, amino acids in C3 at positions 360, 362 and 364, which have been found under positive selection pressure in the virus variants obtained during the chronic disease stage, interact with N-terminal residues of V4 region (panel A, left). In the interaction between C3 and V5 (panel A, right) a single residue at position 360 of the C3 region, under positive selection pressure in the variants from the chronic disease stage, was found to interact with residues in the C-terminus of the V5 region.
In the panel B of figure 9, the interactions of a2 helix with V4 and V5 regions are reported. Residues 335, 336, 337, 343, 346, 347 and 350 in a2 helix, which have been found under positive selection pressure in the chronic disease stage, interact with residues in the C-terminal and N-terminal of V4 (left). Finally, a single residue in the a2 helix, at position 349, interacts with one residue in the C-terminal of the V5 region (right).

Discussion
The role of gp120 variable regions in HIV escape from immune response has been previously investigated (9)(10)(11)(12)(13)(14)(15)(16)(17). However, the role of constant regions of gp120 has not been thoroughly studied to date. In the present work we focused on those gp120 constant regions of HIV-1 clade C which are key in the binding to CD4 receptor and coreceptors and are involved in the processes of HIV-1 cell entry and in host immune response. Amino acid sequence lengths, PNGSs distribution, residues under positive selection pressure and electric charges within the V3, C3 and C4 conserved domains of HIV-1 clade C variants obtained from individuals at different stages of disease, were investigated. The aim of the study was to evaluate if and how these sequence characteristics in these constant regions were modulated during the course of the disease.
The study was performed on only patients naïve for ART to avoid influence of antiretroviral therapy on variants selection. It therefore provides a clearer picture of mechanisms of HIV immune escaping in patients who are not treated with antiretroviral drugs, a condition that, unfortunately, is still common in many poor developing countries where HIV is highly endemic.
We observed that the V3 sequence length is well conserved in all the disease stages (35 amino acids, on average), as previously reported [Los Alamos website. Available: http://www.hiv. lanl.gov/content/sequence/HIV/COMPENDIUM/1999/7/ V3LoopEvolution.pdf. Accessed 2014 March 31, and 9, 23, 24], since this length possibly ensures the optimal conformation for the binding to CCR5 [48]. Virus tropism, predicted on the basis of the V3 sequence, does not switch from R5 to X4 at the later disease stage, as it has been instead described for the HIV-1 B subtype [25,[48][49][50][51], although a greater, but not significant, number of X4 strains have been observed at the late stage, when compared to early and chronic stages.
The amino acid length of the C3 region ranged from 50 to 54 residues. In line with previous studies, we confirmed that the C3 domain in subtype C variants is relatively variable, in particular in its a2 helix portion, although sequence length variability among disease stages was not statistically significant. The a2 helix subdomain is more exposed to the solvent and to neutralizing antibodies in clade C than in clade B virus and thus exhibits a higher sequence entropy at the polar face [24]. Conversely, the C4 region was constant in its amino acid sequence length (41 aa) at all the disease stages. This may be due to the fact that, although     located together with the C3 region in the outer domain of gp120 protein, C4 is buried into a hydrophobic pocket and therefore is less exposed to the selection pressure of the host immune response. The number of conserved PNGSs in the V3, C3 and C4 regions was constant at all the disease stages. All conserved PNGSs in these regions are found in all M group viruses with a frequency . 70% [52]. A single fixed PNGS in V3, at position 301, involved in coreceptor binding, was observed in the N-terminal portion [53]. No shifting PNGSs in V3 were observed at all disease stages, in agreement with previous data indicating that in the V3 domain the presence of shifting PNGSs is very rare, since appearance of new PNGSs may interfere with co-receptor binding [10]. In addition, we found a low frequency of expression of the glycosylated residue 295, located at the confluence of C2 and V3, within the target epitope of the 2G12 monoclonal antibody. Absence of the 295 residue in the clade C variants explains the resistance of subtype C strains to the neutralising effects of this monoclonal antibody [54]. In the C3 domain, shifting PNGS were mainly observed within the a2 helix and the C-terminal portions, downstream of the conserved PNGS N356. These amino acid positions are important targets of neutralizing antibodies and the generation of new PNGSs could provide a shield against the immune system response. Interestingly, we have identified a PNGS at position 362 with a frequency of 33.3% in the CS group and with a much lower frequency (8.3%) in the ES and LS groups. As observed by Sterjovski et al., this residue, adjacent to the CD4-binding site, is associated with an enhanced virus fusogenic capacity and entry [55]. Since an increase in length of the amino acid sequence of V1 and V4 regions during the chronic stage of the disease may result in a reduced affinity to the CD4 receptor [9,56], we can speculate that the PNGS 362 in C3 may be selected during the chronic stage to compensate the reduced affinity to CD4.
In the C4 region, two conserved PNGSs at positions 442 and 448 and a few shifting PNGSs between these two residues, have been identified. The N442 residue was suggested to be involved in glycan shielding [57,58], whereas the N448 residue is known to be critical for the binding to CCR3 co-receptor, together with residue N356 in C3 [59]. The N448 residue has also been reported to protect the V3 loop from neutralizing antibodies and to promote processing and/or presentation of T-helper epitopes for antiviral T-cell response [60,61]. Since PNGSs in the C4 region are positioned close to the pocket of the CD4-binding site, the shifting PNGSs lying in that region may shield the virus from both neutralizing antibodies and the T-helper mediated immune response [14].
In an attempt to characterize other sites that could be involved in virus escaping, we identified those sites in V3, C3 and C4 that are under positive selection pressure at the different disease stages and mapped them onto the three-dimensional structure of the CD4-bound gp120 from the JR-FL HIV-1 virus [45]. Most of these sites under positive selection pressure were found within the C3 region and, in particular, in the a2 helix portion during early and chronic disease stages. We also detected three sites in the V3 region under positive selection pressure during the chronic disease stage, and two sites in both the early and late disease stages. The presence of sites under positive selection pressure in the a2 helix region, as well as in the V3 region, indicates a strong diversifying selection of these regions, in a subtype-specific manner [19,29,30,62].  The presence of sites under positive selection pressure in C3 region of HIV-1 clade C subtype during the chronic disease stage has been described in literature [63]. However, we also found a strong reduction of sites under positive selection pressure in the C3 region during the late disease stage. This is the first time, to our knowledge, that such an evidence is reported. Lastly, in the C4 region close to the CD4 binding site, two sites were found under positive selection pressure during the late stage of the disease and only one in the early and chronic disease stages. Of note, some of the sites under positive selection pressure were also found close to critical residues for binding to either CD4 or CCR5.
The strength of positive selection was found to be significantly different among the disease stages only in the C3 region. In particular, it was found to be significantly higher at the early stage of the disease, when compared to the one at the chronic stage. Taken together these data lead us to speculate that during the early disease stage, when the immune response may be less developed compared to the chronic stage, more degrees of freedom for the three-dimensional structure of the viral particle surface are allowed. Conversely, during the chronic disease stage, the greater pressure generated by a more developed immune response may lead to the selection of more non-synonymous substitutions that are compatible with an increasingly constrained three-dimensional conformation of the gp120 molecule in terms of viral fitness, since fewer degrees of freedom are allowed. The greater number of positive selected sites in the C3 region during the chronic disease stage, in comparison to those observed during the early stage, may be seen as a compensatory effect by which the virus counteracts the lower strength of positive selection in the CS stage with an increased number of positive selected sites in this stage. This effect may, in turn, benefit a higher resistance of the virus to the immune response, in particular to the action of neutralising antibodies.
Since no significant differences in the number and the strength of positive selection were observed in both V3 and C4 regions at any disease stage, we propose that the C3 region could be a more critical target for positive selection than C4 and V3, as already suggested by Moore et al., using HIV-1 clade C chimera viruses [6]. In this regard, data in literature indicate that the a2 helix in HIV-1 subtype C is more exposed to the solvent than in HIV-1 subtype B and that the residues with dN/dS .1 are associated with resistance to neutralization [24]. Electrical charge analysis revealed a significant strong negative correlation between the total electric charges of C3 and V4 in virus variants during the late stage of disease coupled with a strong positive correlation between the total electric charges of C3 (in particular the a2 helix) and V5 during the late stage of the disease. Contact maps analysis showed that the N-terminus of the a2 helix and the C-terminus of the V4 region are in close proximity. This may be related to the charge negative correlation between C3 and V4 (and a2 helix and V4), seen in the variants at late disease stage, in agreement with previously published data [24]. Similarly, C3, in particular the a2 helix, strongly interacts with V5, with a positive electric charge interaction. Overall, data on electric charge and interaction among regions suggest that during the course of the disease changes in electric charge and strong physical interactions can affect the conformation of the regions of gp120. These changes may be instrumental in the virus escape mechanism. In fact, the negative value of the charge cross-correlation between C3 and V4 in the late disease stage may indicate that an attractive force is mediated by such changes, thus maintaining a more compact and closed structure of the two regions. Contact maps indicate that C3 and V4 are in close proximity and show a spatial preference for charged residues, suggesting that the negative correlation between C3 and V4 can have a physical explanation in terms of electrostatic interactions between these two regions. It could be assumed that the negative C3-V4 correlation is the hallmark of an electrostatic attraction between these two regions. This, in turn, induces a conformational change in both regions towards a more closed conformation that may protect the C3-V4 epitope from neutralizing antibodies. The absence of a significant C3-V4 negative correlation during the chronic disease stage could be compensated by the presence of the PNGS at position 362 in the C-terminal region of C3, whose frequency is increased during chronic disease stage. The PNGS at position 362 is near the CD4 binding site, and strongly interacts with the V4 region, as shown by the contact maps. The presence of the glycan residue at position 362 could, therefore, act as a shield against neutralizing antibodies. In late disease, the electric charge-mediated attraction between C3 and V4, which favors a more closed conformation, could be counter-balanced by the repulsive forces between C3 (in particular the a2 helix) and V5, shown here. These repulsive forces could be responsible for a more open conformation of these two regions. Since the orientation of V5 regulates the opening of the CD4 binding pocket [55 and Cenci and D'Avenio, personal communication], these data support the hypothesis that, when the immune system becomes weaker with disease progression, a more open conformation in these two regions may increase the fitness for the binding to the CD4 cell receptor, as critical residues for CD4 binding have been shown to be present in C3, in addition to those present in C4 [30,31]. HIV-1 gp120 protein model is from JR-FL strain [45]. Sites under selection pressure are indicated as spheres in purple. Labeling of each positively selected site is centered on the acarbon of the amino acid. CD4 receptor is in green. Residues involved in CCR5 binding are indicated as spheres in red. The a2 helix region is indicated in blue. Amino acid residue 300 in V3 is partially hidden by residue 441 of the C4 region, to which it is spatially close. Amino acid numbering is according to HXB2 sequence. ES = Early disease Stage CS = Chronic disease Stage LS = Late disease Stage. doi:10.1371/journal.pone.0095183.g006 It is also noteworthy that some sites under positive selection pressure are found to be involved in the interactions between C3 and V4 (amino acids 360, 362 and 364 in C3) and between C3 and V5 (amino acids 360 and 362 in C3), hinting at a possible role in conformational changes driven by the immune pressure. Similarly, the contact map between the a2 helix and the V4 region indicates that the sites under positive selection pressure within the a2 helix (residues 335, 336, 337, 343, 346, 347 and 350) strongly interact with C terminal residues in V4 and with one residue at V4 N-terminal. It is plausible that the a2 helix structure, interacting with V4 through electric charge rearrangements, could relocate the neutralizing epitope in the C3-V4 region, thereby eluding the host's neutralizing antibodies.

Conclusions
We have identified, for the first time and in a large sample of virus variants, several distinguishing features in V3, C3 and C4 regions of HIV clade C gp120 protein. We have described the variability of these features during the course of the disease and speculated on possible implications in virus infectivity and resistance to the host's immune response. In particular, we described the presence of sites in C3, C4 and V3 regions that are under positive selection pressure and are proximal to residues important for the binding to CD4 receptor or co-receptors, and mapped them in critical areas of contact between these regions. These sites may be instrumental to charge-associated conformational changes that may play a role in the evolution of HIV.
These findings may contribute significantly to the search for new immunogens suitable for developing a sterilising vaccine against HIV, as well as to the design of novel antiviral drugs.
Accession codes. Sequences included in this paper have been deposited in GenBank under the accession numbers JN121046 to JN121117. Figure 9. Contact maps between C3-V4, C3-V5, a2 helix-V4 and a2 helix-V5 of gp120. Contact maps are generated from the crystallographic structure of the CD4-bound JR-FL2 gp120 [45]. Regions are indicated in violet and in pink. The boxed area in light blue is the contact area. Solid lines split the region in the N-terminal and C-terminal portions. In the contact areas, only amino acids that have been found under positive selection are shown (indicated by arrows and circled). Amino acid numbering is according to HXB2 sequence. Panel A) left: C3 region is reported horizontally (in violet), V4 region vertically (in pink). Panel A) right: C3 region is reported horizontally (in violet), V5 region vertically (in pink). Panel B) left: a2 helix is reported horizontally (in violet), V4 region vertically (in pink). Panel B) right: a2 helix is reported horizontally (in violet), V5 region vertically (in pink). doi:10.1371/journal.pone.0095183.g009