Geospatial HIV-1 subtype C gp120 sequence diversity and its predicted impact on broadly neutralizing antibody sensitivity

Evolving diversity in globally circulating HIV-1 subtypes presents a formidable challenge in defining and developing neutralizing antibodies for prevention and treatment. HIV-1 subtype C is responsible for majority of global HIV-1 infections. In the present study, we examined the diversity in genetic signatures and attributes that differentiate region-specific HIV-1 subtype C gp120 sequences associated with virus neutralization outcomes to key bnAbs having distinct epitope specificities. A total of 1814 full length HIV-1 subtype C gp120 sequence from 37 countries were retrieved from Los Alamos National Laboratory HIV database (www.hiv.lanl.gov). The amino acid sequences were assessed for their phylogenetic association, variable loop lengths and prevalence of potential N-linked glycosylation sites (pNLGS). Responses of these sequences to bnAbs were predicted with a machine learning algorithm ‘bNAb-ReP’ and compared with those reported in the CATNAP database. Subtype C sequences from Asian countries including India differed phylogenetically when compared with that from African countries. Variable loop lengths and charges within Indian and African clusters were also found to be distinct from each other, specifically for V1, V2 and V4 loops. Pairwise analyses at each of the 25 pNLG sites indicated distinct country specific profiles. Highly significant differences (p<0.001***) were observed in prevalence of four pNLGS (N130, N295, N392 and N448) between South Africa and India, having most disease burden associated with subtype C. Our findings highlight that distinctly evolving clusters within global intra-subtype C gp120 sequences are likely to influence the disparate region-specific sensitivity of circulating HIV-1 subtype C to bnAbs.


Introduction
The extraordinary diversity of env targeting neutralizing antibodies is a barrier to achieving the desired vaccine-induced and antibody-mediated protection. Evolving antigenic diversity in global and region-specific circulating HIV-1 subtypes is complex which not only poses significant roadblocks to developing preventive vaccine but also poses a challenge in neutralizing antibody mediated prophylaxis and treatment. Broadly neutralizing antibodies (bnAbs) act solely on the HIV-1 envelope glycoprotein (Env) for neutralizing genetically distinct HIV-1 subtypes. Till date, a number of bnAbs have been discovered from elite neutralizers and a few of them have been found to be able to prevent acquisition as well as significantly reduce plasma viral loads, when tested both in animal models and humans [1][2][3], thus justifying their importance as products for prevention and treatment. Some of the bnAbs that are currently being evaluated through human clinical trials will provide additional possibilities for prevention [4,5], such as their extent, in addition to virus neutralization, in persistent viral clearance [6,7] and eliminating HIV-1 infected cells [2,8,9]. While some single bnAbs have been found to show significant breadth across subtypes, combination of bnAbs with distinct epitope-specificity is believed to provide the most effective response against globally diverse HIV-1 subtypes and also would likely prevent the development of antibody-escape variants [2]. The trimeric Env glycoproteins on the virus surface are the most diverse of all proteins encoded by HIV-1; which differs by greater than 20% of amino acids between matched subtypes [10][11][12][13][14] and which continues to diversify at a population level [14][15][16][17]. This has been substantiated by observation that several epitopes that are targeted by different bnAbs have been found to vary over time [15]. While bnAbs target both surface gp120 and membrane proximal external region (MPER) of gp41, gp120 exhibits extraordinary sequence divergence compared to gp41 MPER and variation in this region is believed to represent distinct genetic subtypes, or clades, which are prevalent in distinct geographic regions [18,19]. Although bnAbs isolated from individuals infected with one particular subtype are generally effective at neutralizing viruses belonging to other subtypes, antibody potency is often found to be correlated with matched subtypes as described elsewhere [14,[20][21][22][23][24]. Moreover, diversity has been found to have an impact even within matched subtypes, as demonstrated by the fact that the subtype-matched neutralization advantage was more apparent in regions with distinct viral diversities [14]. HIV-1 subtype C accounts for approximately half of the global infections [25], which predominates in India and South Africa. Yet, robust, comparative env sequence diversity and evolution analysis, critical for bnAb based intervention strategies in these regions is severely lacking [26]. While recent development of bnAbs has considerably improved our knowledge on conserved epitopes that they target, and Env structure associated with broad and potent virus neutralizing antibodies, a greater understanding of the antigenic diversity of global HIV-1 subtype C Env would facilitate understanding potential bnAb combination that could overcome the intra-clade C diversity. In the present study, we examined the variation in signature sequences, loop length, N-linked glycosylation and key epitopes within the existing globally circulating HIV-1 subtype C gp120 targeted by potent bnAbs and predicted their impact on virus neutralization.

Evidence of phylogenetic divergences of globally circulating HIV-1 subtype C gp120 sequences
Previous studies have demonstrated that HIV-1 subtype C is the most abundant globally circulating subtype responsible for approximately 46.6% of all HIV infections [27]. In the present study, we retrieved from HIV database (www.hiv.lanl.gov), a total of 23750 sequences covering the complete gp120 gene (HXB2 coordinates: 6225-7758). To mitigate the intra-individual quasispecies bias, we applied 'one sequence/individual filter' retaining 1927 full length sequences. As shown in Table 1, there is uneven distribution of gp120 sequences from different countries. Of the 37 countries analyzed, South Africa (ZA) was found to contribute in the database (www.hiv.lanl.gov) more than 1000 sequences, while Malawi (MW), Botswana (BW) and Zambia (ZM) contributed between 100 to 1000 sequences. Interestingly, complete gp120 sequences were found from only 84 unique individuals from India (IN), while all other countries contributed approximately 50 or less sequences. Upon removal of identical sequences as well as those harbouring internal stop codons, a total of 1814 full length HIV-1C gp120 protein sequences were retained for further analysis. We next analyzed the phylogenetic properties of the subtype C gp120 sequences that uniquely represent globally circulating subtype C across different geographies. Average number of amino acid differences per site between sequences grouped by source country were estimated by MEGA [28]. Sequences from Asian Countries (IN, China (CN) and Nepal (NP)) were observed to be closer to each other (0.183-0.208) compared to African countries (ZA and MW: 0.207-0.234). This trend continued in the maximum likelihood tree constructed as indicated in Fig 1A, wherein sequences from Asian countries (IN, CN and NP) were observed to form a unique sub-cluster. To further validate these observations statistically, a subset of sequences (N = 251) forming the aforementioned node was reanalysed with 1000 ultrafast bootstrap replicates and SH-aLRT. As indicated in Fig 1B, sequences from Asian countries clustered together with 97% bootstrap and 92.1% SH-aLRT support. They also clustered distinctly from African sequences supported by 84.5% SH-aLRT and 79% ultrafast bootstrap replicates.

Geography based subtype C gp120 loop length and charge variation
Changes in the lengths of variable loops within gp120 has been previously documented to be an evasion mechanism of HIV-1 to escape neutralizing antibody driven humoral responses [29][30][31]. Therefore, we next assessed lengths of variable regions between sequences from different countries. To prevent low sampling bias, we included sequences from six countries in this analysis that could contribute more than 50 sequences each (Fig 2A)

Comparison of abundance of potential N linked glycosylation sites (pNLGs)
HIV-1 gp120 is a heavily glycosylated protein with host derived N-linked glycans making up 50% of its total mass. These glycans play an important role in ensuring viral infectivity as well as evading neutralizing antibodies [32] emphasizing the importance of their assessment. In the present study, proportion of pNLGs were compared between sequences from 14 countries (Table 1). A median of 25 pNLGs were observed in the sequences (range: 11 to 33, 10-90 percentile: 22-28). There were no overall significant differences between the average number of pNLGs between countries (Kruskal Wallis test, p>0.05). Out of a total of 3003 pairwise analyses of each pNLG position for all possible two country combinations performed with Fisher's exact test, 342 combinations were found to be statistically significant (Fisher's exact test, p<0.01) involving 31 of the 33 pNLG sites. N88, N156, N160, N197, N276, N 301 and N386 were observed to be highly conserved across all countries with >70% abundance. N301 (abundance: 89-100%) was observed to be the most conserved site which plays a critical role in various envelope functions including membrane fusion and thus is highly conserved [33].

Variation in Shannon entropy indicate significant intra-subtype C diversity
Shannon entropy is a measure of variability wherein higher entropy indicates higher variability. Entropy values were predicted with Entropy-One tool available through LANL HIV database. To assess the entropy differences associated with bnAb contact sites, we plotted entropy data for Subtype C overall (all sequences), South Africa, India, Malawi, Botswana, Zambia and Tanzania ( Fig 4A). To prevent bias because of hypervariable nature of certain gp120 domains, the residue positions present in hypervariable regions were removed from the subsequent comparisons. While no difference was observed in entropy profiles at key bnAb sites between overall subtype C entropy and that observed in South Africa (Mann-Whitney test, p = 0.0759), corresponding entropy profile in India was significantly different both from overall subtype C (p<0.0001) and South Africa (p<0.0001). Difference between entropy profiles of South Africa and India were not significantly different in pairwise comparison with Malawi, Botswana, Zambia and Tanzania (Mann-Whitney test, p > 0.05). To further assess the entropy differences, Indian sequence 'query' data set (N = 83) was compared against South African sequence 'background' data set (N = 910) through Entropy-TWO tool on LANL HIV database which generated Shannon entropy values with statistical confidence measured through Monte-Carlo randomization with 100 replacements. Overall, 133 amino acid positions across gp120 were detected to have differential entropy between India and South Africa, of which 83 sites had higher entropy in South Africa while 50 sites had higher entropy in India ( Fig 4B). As

Variation in abundances of epitopes associated with resistance and susceptibility to bnAbs
Next, we examined the abundance of HIV-1 subtype C resistance phenotype as defined in the CATNAP database of key bnAbs having varied epitope specificities in gp120 across different countries. We assessed sequence datasets of different counties (www.hiv.lanl.gov) and calculated the frequency of key amino acid residues associated with the sensitivity and resistance to individual bnAbs. We examined the select key bnAbs targeting CD4bs (VRC01, VRC07, 3BNC117 and N6), V1V2 region (PG9, PG16, PGT145, PGDM1400 and CAP256.VRC26.25) and V3 supersite (PGT121, PGT128 and 10-1074). As shown in Fig 5, we found considerable variation in the abundance of amino acid residues that form epitopes associated with neutralization resistance to bnAbs with unique specificity. Variation in abundances of the following residues that form epitopes to bnAbs with unique specificities were observed in geographically divergent globally circulating HIV-1 subtype C (Fig 5). Furthermore, we performed Fisher's test for pairwise comparisons for resistance associated residue abundance at each of the contact sites for . For CD4bs directed bnAbs: variation in the sensitivity to global HIV-1 subtype C to VRC01 was found to be majorly associated with N234 and S364. Similarly, for V1V2 directed bnAbs: K130, I161, Q170, Y173, S291, T297, N332 and N340 associated with significant variation in PGT145 sensitivity; K130, I161,I165, V169 and N332 (most significant) associated with significant variation in PGDM1400 sensitivity; K130, I161, R166, V169, Q170 and N332 (most significant) associated with significant variation in PG9 sensitivity; K130, I161 and K171 associated with significant variation in PG16 sensitivity; I165. R166, V169 and N332 associated with significant variation in CAP256.VRC26.25 sensitivity was observed. Finally, for V3 directed bnAbs: K155, I165, R252, N289, E293, I307, H330, N332, S334, A336 and N448 associated with significant variation in PGT121 sensitivity; R252, N295, N300, H330, S334, A336, Q344, and T415 associated with significant variation in PGT128 sensitivity; K155, N156, I165, N230, T240, R252, K282, I307, A316, H330, S334, A336, Q344 and N448 associated with significant variation in 10-1074 sensitivity were observed. While the present study focuses on gp120 sequences which are majorly relevant to the sensitivity to gp120 targeting bnAbs, changes in sequences in gp41 associated with resistance to bnAbs targeting gp120 have also been reported [26,[35][36][37][38]. Upon analysis of 1768 sequences representing different countries, we found high abundance of residues in gp41 associated with resistance to bnAbs particularly those target V1V3 and V3 region (S1 Fig). Taken together, our study indicates the existence of variation in the abundance of key bnAb contact sites across global circulating HIV-1 subtype C. This observation highlights that the choice of bnAb combination for effective neutralization coverage against diverse region-specific circulating HIV-1 subtype C would likely vary significantly.

Evidence of accumulation of bnAb resistance phenotype in globally circulating HIV-1 subtype C over time
With abundance of bnAb neutralization data against specific envelope sequences obtained in vitro as well as through clinical trials, several machine learning-based algorithms are increasingly becoming available that can predict probable sensitivity to bnAbs on the basis of gp120 sequences. In the present study we employed one such recently published algorithm bNAb-ReP [39] to predict sensitivity of 1466 gp120 sequences selected in the present study from countries India, China, South Africa, Malawi, Zambia and Tanzania to following bnAbs: 3BNC117, VRC01, VRC07, PGT145, CAP256:VRC26.25, PGDM1400, PG9, PG16, PGT121, PGT128 and 10-1074. The prediction data were plotted along with country-wise in vitro data available through CATNAP database for a total of 283 sequences as indicated in Table 2. As indicated in Fig 6, probability values greater than 0.5 point towards sensitivity to bnAbs while those lower than 0.5 indicate probable resistance. For VRC01, while CATNAP database indicated unequivocal sensitivity of majority of sequences from all 6 countries, bNAb-ReP predicted a significant fraction of sequences from African countries to be resistant. Predictions for VRC26.25, 10-1074, PGT121 and PGDM1400 matched those reported in the CATNAP database. Similar to VRC01, predictions for 3BNC117, PGT128, PG9, PG16 and VRC07 did not match the data from CAT-NAP and indicated probable resistance in many of the sequences from all 6 countries. To assess if bnAb sensitivity differed over time, prediction data for each of the 11 bnAbs was plotted against three periods based on the reporting date of the sequences (Fig 7). The three periods con- Despite this decrease, most sequences were predicted to be susceptible to PGDM1400 and VRC26.25. However, susceptibility to 3BNC117, VRC01, VRC07, PGT145, PG9 as well as PG16 was predicted to have reduced significantly over the period of time assessed.

Discussion
Given that subtype C accounts for approximately half of the global HIV burden, limited information on the intra-clade C env diversity and its association with variation in their neutralization phenotypes exists. In the present study, we compared the genetic attributes of the globally circulating HIV-1 subtype C gp120 sequences that differentiate the region-specific intra-clade HIV-1 subtype C neutralization diversity. For this, we used the available information in the HIV database (www.hiv.lanl.gov) and established algorithms [39] to suitably predict the association between genetic features that potentially affect the HIV-1 intra-clade C neutralization diversity. In spite of the disparity between the number of existing region-specific unique HIV-1 subtype C sequences in the database, region-specific distinct genetic clustering was observed by phylogenetic analysis of gp120 amino acid sequences. It is to be noted that our analysis was based on one sequence per individual to avoid any sampling bias on an individual level s. The region-specific subtype C gp120 divergence could possibly be due to diversity in the population level across geography [14]. Indeed, studies have shown that HIV-1 can selectively incorporate broad range of biologically active host proteins in the process of viral egress, which can potentially exhibit altered pathogenicity and neutralization phenotypes [40]. This also indicates possible association between intra clade C genetic diversity with ethnically distinct population. The evolutionary genetic drift within subtype C that we observed from phylogenetic analysis is likely due to differential host characteristics which include immune response and differential genetic bottlenecks. For example, gp120 sequences from Asian countries were found to demonstrate monophyletic clustering compared to that observed with those obtained from African countries. A number of studies [29][30][31] have demonstrated the role of loop length in the hypervariable regions, (with particular reference to V1V2 region), charge and N-linked glycosylation on altered neutralization phenotype. In the present study, we observed that while there is an existence of region-specific variation in V1V2 loop length, V4 loop length variation between subtype C sequences was found to be most profound across geographic boundaries. In addition, we also observed that the average gp120 length of African countries was found to be smaller compared to other regions. Our data demonstrates subtle but significant variations in these attributes thereby indicating that a common set of bnAbs are not likely to be equally effective against the HIV-1 subtype C circulating globally. The above conclusion was further substantiated by our finding of significant variation in the entropy profiles (variation in sites/ positions associated with different bnAbs) between the geographically distinct gp120 sequences. With limited data available in CATNAP, indeed we found evidence of variation in susceptibility of region-specific clade C to different bnAbs, which substantiate our observation. Interestingly, as reported elsewhere [16,41,42], our data also predicted potential likelihood of accumulation of resistance phenotype overtime to existing bnAbs. This observation indicate that it is necessary for continuous surveillance of evolving viruses in the context of subtype C along towards prioritizing bnAb combinations that will optimally dissect and overcome the evolving genetic diversity. Interestingly, a similar accumulation in ART resistance has been documented and studies that concurrently evaluate the interplay of these two evolutionary patterns, heretofore considered to be mutually exclusive, may highlight novel and synergistic therapeutic strategies. Furthermore, the recently concluded phase 2b HVTN 703/HPTN 081 Antibody Mediated Prevention (AMP) trial study also highlighted the discrepancy between the expected bnAb (VRC01) sensitivity of the viruses currently circulating in the population versus the sensitivity predicted in vitro through historically sampled viruses and the indispensability of continual virus surveillance [43][44][45].

Conclusion
The differences in HIV-1 clade C gp120 sequences observed herein indicate disparate and distinctly evolving clusters within clade C with differential predicted responses to bnAbs. Elucidation of neutralization diversity of subtype C particularly in context of evolution of gp120 over time will be essential for selecting appropriate bnAb combination for effective prophylaxis and treatment and also in informing rational vaccine design. Our study highlights that towards developing HIV-1 bnAbs as products for prevention and treatment, continued surveillance of the evolution of genetic features with particular reference to env gene that are targets of neutralizing antibodies of globally circulating HIV-1 remains crucial for the identification and prioritization of combination of bnAbs that would best provide maximal geography and population-specific neutralization coverage.

Retrieval of gp120 sequences
Sequences for the gp120 gene were retrieved from manually curated LANL HIV database (www.hiv.lanl.gov). Briefly, HIV-1 subtype C nucleotide sequences fully covering the genomic region 6225-7758 (as per HXB2 numbering) were retrieved and subsequently filtered with a one sequence per individual filter criterion. Sequence entries without any information regarding the sample source country were excluded. Multiple sequence alignment for the amino acid sequences along with HXB2 sequence (GenBank: K03455.1) was produced with Gene cutter ("Gene Cutter," LANL). Gene Cutter clips the coding regions from unaligned nucleotide sequences and produces amino acid alignments based on Hmmer v 2.32 algorithm with a training set of the full-length genome alignment. Alignments were manually curated using Bioedit v7.2.5 [46]. Sequences with internal stop codons were discarded.

Phylogenetic analysis
The number of amino acid differences per site were estimated by averaging over all sequence pairs between different countries using Molecular Evolutionary Genetics Analysis software (MEGA v.10) [28]. The rate variation among sites was modelled with a gamma distribution (shape parameter = 1). This analysis involved 1814 amino acid sequences. Phylogenetic trees were generated for the amino acid alignments with iqtree under 'HIVb' model with estimated Ƴ parameters and number of invariable sites [47]. Robustness of the tree topology was further assessed by SH-aLRT as well as 1000 ultrafast bootstrap replicates implemented in iqtree. A subtree consisting of 251 sequences were again constructed as mentioned previously.

Estimation of the variable loop properties and potential N linked glycosylation sites
Variable loop regions for V1 (131-157: HXB2 numbering), V2 (158-196), V3 (296-331), V4 (386-417) and V5 (460-469) were retrieved from amino acid alignments with Bioedit v7.2.5. Each of the loop datasets were then processed with custom bash/awk scripts to generate length statistics. The length distributions were further assessed and compared by Kruskal Wallis test followed by Dunn's multiple comparison test. Cumulative variable loop charge values were predicted for each of the sequences with custom bash scripts wherein, Lysine (K), Arginine (R) and Histidine (H) residues were assigned +1 values each while Aspartic acid (D) and Glutamic acid (E) were assigned -1 values each. Potential N-linked glycosylation sites were predicted in amino acid sequence datasets with N-GlycoSite tool hosted at the HIV-LANL database [48]. Prevalence of each of the pNLG sites under study were calculated using custom bash/awk scripts and were further assessed statistically using Fisher's exact test. Country-wise pNLGs abundance heatmap was plotted using 'pheatmap' package in R.
The temporal prediction data was stratified into approximately three decades as follows

Statistical analyses and data presentation
Statistical analyses for variable loop length distributions were performed using GraphPad Prism version 5.01 for Windows, GraphPad Software, San Diego California USA. Statistical Comparison of Fisher's test for pNLG sites as well as abundance of bnAb resistance associated residues was performed through R statistical computing software (v3.4.0) and R studio v1.0.143 [52,53]. Phylogenetic trees were visualised and edited with the R package 'Graphlan' [54]. Variable entropy positions were plotted on prefusion gp120 envelope model derived from PDB:5U7O in Chimera v1.14 [55]. Plots depicting variable region characteristics, entropy differences and bNAb sensitivity predictions were prepared using 'ggplot2' package in R [56]. Trend analysis for predicted bnAb sensitivity was performed by Jonckheere-Terpstra test implemented in R statistical software.