Few amino acid signatures distinguish HIV-1 subtype B pandemic and non-pandemic strains

The Human Immunodeficiency Virus Type I (HIV-1) subtype B comprises approximately 10% of all HIV infections in the world. The HIV-1 subtype B epidemic comprehends a pandemic variant (named BPANDEMIC) disseminated worldwide and non-pandemic variants (named BCAR) that are mostly restricted to the Caribbean. The goal of this work was the identification of amino acid signatures (AAs) characteristic to the BCAR and BPANDEMIC variants. To this end, we analyzed HIV-1 subtype B full-length (n = 486) and partial (n = 814) genomic sequences from the Americas classified within the BCAR and BPANDEMIC clades and reconstructed the sequences of their most recent common ancestors (MRCA). Analysis of contemporary HIV-1 sequences revealed 13 AAs between BCAR and BPANDEMIC variants (four on Gag, three on Pol, three on Rev, and one in Vif, Vpu, and Tat) of which only two (one on Gag and one on Pol) were traced to the MRCA. All AAs correspond to polymorphic sites located outside essential functional proteins domains, except the AAs in Tat. The absence of stringent AAs inherited from their ancestors between modern BCAR and BPANDEMIC variants support that ecological factors, rather than viral determinants, were the main driving force behind the successful spread of the BPANDEMIC strain.


Introduction
The Human Immunodeficiency Virus Type I (HIV-1) is one of the most important human pathogens that have emerged in the 20 th century and exhibits an extraordinary degree of genetic variability, organized in groups (M, N, O, and P), subtypes (A-D, F-H, and J-L), subsubtypes, and many recombinant forms [1]. HIV-1 subtype B comprises approximately 10% of all HIV infections in the world, being one of the most globally disseminated HIV-1 variants and the most prevalent subtype in the Americas, Europe, Oceania, as well as some Asian countries [2].
The HIV-1 subtype B shares a common ancestor with subtype D that was present in Kinshasa, capital of Democratic Republic of Congo (DRC), by the early 1940s [3]. The currently a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 accepted hypothesis about the emergence and worldwide spread of subtype B is punctuated by two major founder events. The first event occurred when the HIV-1 subtype B ancestor strain moved from the DRC into the Caribbean around the middle 1960s, establishing the first HIV epidemic outside the African region [4][5][6][7]. The second major event took place when one subtype B strain spread from the Caribbean to the U.S. around the late 1960s and got access to high-risk transmission networks globally connected that fueled the establishment of a pandemic clade (B PANDEMIC ) responsible for most of the infections of this subtype worldwide [8,9]. In contrast, other non-pandemic subtype B lineages (B CAR ) spread and circulates at high prevalence in several Caribbean islands and the Northern South American region, but were not successfully disseminated worldwide [4,6,[8][9][10][11].
The introduction of the B PANDEMIC ancestor in highly connected transmission networks in the U.S. during the very early phase of the epidemic probably accounts for the successful dissemination of this viral variant worldwide [4,12,13]. Differences in viral fitness, however, may also have shaped the uneven geographic distribution of distinct HIV-1 subtype B lineages. Viral transmissibility correlated with the plasma viremia during chronic infection [14], and some studies found that plasma viremia within subtype B is highly heritable, thus indicating that this trait depends strongly on the virus genotype [15][16][17][18]. Notably, a significant trend for higher viral loads among subjects infected with B PANDEMIC relative to B CAR strains was recently described in French Guiana [11] which may have played a role in the differential transmissibility of the two viral strains. Studies of molecular signatures in non-pandemic subtype B lineages, however, have been limited so far to the analysis of the env gene of B CAR strains circulating in Trinidad and Tobago [19,20].
The objective of this work is to identify amino acid signatures (AAs) that can distinguish B CAR and B PANDEMIC strains by the analysis of full-length (FL) and partial genome subtype B sequences representative of different Caribbean islands and American countries and by reconstructing the sequences of the most recent common ancestors (MRCA) of B CAR and B PAN-DEMIC strains. These analyses may provide crucial information to understand the potential relevance of viral genetic determinants on the epidemic spread of different subtype B variants.

HIV-1 subtypes B and D datasets
HIV-1 subtype B FL genome sequences from North America (n = 330), South America (n = 151), and the Caribbean (n = 25), as well as Sub-Saharan African subtype D FL genome sequences (n = 18) that were available at Los Alamos HIV Database (http://www.hiv.lanl.gov) by November 2018, were downloaded (S1 and S2 Tables). We also downloaded HIV-1 subtype B sequences from Caribbean islands with high prevalence of B CAR strains and that covered selected genomic regions of gag (HXB2 coordinates: 1,264 to 2,148), pol (HXB2 coordinates: 2,253 to 3,272), and env (HXB2 coordinates: 6,450 to 8,480) and HIV-1 pol sequences (HXB2 coordinates: 2,253 to 3,272) from drug-naïve individuals of Caribbean origin (S1 Table). Only one sequence for patient was selected.

Clade assignment of HIV-1 subtype B sequences
The HIV-1 subtype B sequences were aligned with the HIV Align online tool [21] and then manually curated. The presence of putative intra-subtype recombinant sequences was evaluated using the RDP4 software [22], being deemed as recombinant those sequences selected as such by three or more of the algorithms. The remaining FL and partial subtype B genome sequences were classified as either B CAR or B PANDEMIC based on their placement on a maximum likelihood (ML) phylogenetic tree inferred with the PhyML program [23] under the best nucleotide substitution model, selected using an online web server [24]. The heuristic tree search was performed using the SPR branch-swapping algorithm, and the reliability of the obtained topology was estimated with the approximate likelihood-ratio test [25] based on the Shimodaira-Hasegawa-like procedure. The ML phylogenetic trees were visualized using the FigTree v1.4.4 program [26].

Reconstruction of ancestral subtype B sequences
To reduce computation time while retaining most viral diversity information, we generate a "non-redundant" subtype B FL genome subset by removing very closely related B PANDEMIC sequences. To achieve this goal, B PANDEMIC sequences with identity above 91.5% were grouped with the CD-HIT program [27] using an online web server [28], and only one sequence per cluster (the oldest one) was selected. To map amino acid changes fixed during the evolution of subtype B, FL genome sequences of the MRCA of B CAR and B PANDEMIC strains were then reconstructed using a Bayesian Markov Chain Monte Carlo (MCMC) approach, as implemented in BEAST v1.8 [29,30] with BEAGLE [31] to improve run-time. The Bayesian timetree was reconstructed using the GTR+I+Г4 nucleotide substitution model [32], a relaxed uncorrelated lognormal molecular clock model [33], and a Bayesian Skyline coalescent tree prior [34] with non-informative default priors. MCMC chains were run for 100 × 10 6 generations, and convergence and uncertainty of parameter estimates were assessed by calculating the Effective Sample Size (ESS) and 95% Highest Probability Density (HPD) values, respectively, after excluding the initial 10% of each run with Tracer v1.7 [35]. The convergence of parameters was considered when ESS � 200. After the exclusion of sequences corresponding to the burn-in phase, the remaining ones were utilized to generate an FL consensus sequence for each MRCA using the Seaview version 4 program [36].

Amino acid signature (AAs) analyses
Nucleotide sequences were translated, and the software VESPA (Viral signature pattern analysis) [37] was used to identify positions in which the most common amino acid differs between B CAR and B PANDEMIC datasets as well as between subtype B and subtype D datasets. These positions were then selected, and for each a Chi-square test, as implement in R version 3.5.0 [38], was used to evaluate the statistical significance of their different amino acidic compositions. AAs were defined by positions in which both the most common amino acid was different, and the overall amino acid composition was significantly different between viral clades. For specific genomic regions corresponding to the structural gag, pol, and env genes, the number of B CAR sequences was expanded, and the process to identify AAs between B CAR and B PANDEMIC datasets previously detailed was applied. The false discovery rate (FDR) method was used to correct for multiple hypothesis testing and to reduce false positives. Statistical significance was defined as FDR < 0.05.

Phenotypic prediction
We determine the frequency of genetic polymorphisms in accessory (Vif, Vpr, Nef) and regulatory (Rev) HIV-1 proteins of B PANDEMIC and FL/expanded B CAR datasets that were previously associated with slow HIV-1 disease progression and differential function [39][40][41][42][43][44][45][46][47][48][49][50]. The Geno2Pheno algorithm was used to predict the chemokine receptor tropism of the B PANDEMIC and expanded B CAR env dataset sequences based on their V3 region [51]. V3 was studied through the 11/25 rule (R or K at position 11 and/or K at position 25) [52][53][54], and the combination of positively charged amino acids at position 25 and an increase in total net charge [55]. The frequency of surveillance drug-resistance mutations (SDRMs) was also explored in B CAR and B PANDEMIC pol sequences retrieved from drug-naïve individuals of Caribbean origin using the Calibrated Population Resistance (CPR) tool (http://cpr.stanford.edu/cpr.cgi) [56]. A Chisquare test, as implement in R version 3.5.0 [38] was used to evaluate the significance of the results in both cases. Statistical significance was defined as p-values < 0.05.

Classification of HIV-1 B CAR and B PANDEMIC FL sequences
From the 506 HIV-1 subtype B FL genome sequences of American origin initially selected, 28 sequences (6%) were identified as putative intra-subtype B recombinants and removed from the final dataset (S1 Table). The ML phylogenetic analysis of the remaining 478 subtype B FL genome American sequences revealed that most Caribbean sequences (82%) branched as basal strains and were classified as B CAR strains, while most sequences from North (97%) and South (99%) America branched in a well-supported (SH-aLRT = 0.98) monophyletic sub-clade and were thus classified as B PANDEMIC strains (Fig 1 and S1 Table). Despite the low number of subtype B FL genome sequences available from the Americas, the estimated relative prevalence of the B CAR lineages in different countries was entirely consistent with previous estimates [6,8,9] based on much larger pol sequence datasets. Similarly, classification of additional subtype B Caribbean, covering specific regions of gag (n = 495), pol (n = 775), and env (n = 529) genes produced country ratios of B CAR/ B PANDEMIC sequences akin to their counterparts based on the FL genome (S1 Fig and S1 Table).

AAs in B CAR and B PANDEMIC modern sequences
In order to identify AAs of different subtype B clades, we compared the FL genome B PANDEMIC sequences (n = 450, sampled between 1978 and 2015) of American origin with FL genome (n = 18, sampled between 1983 and 2011), Gag (n = 28), Pol (n = 197), Env (n = 59) and Rev (n = 59, given the superposition of its CDS with Env) B CAR sequences of Caribbean origin (Table 1). Twenty-eight sequences were originally classified as B CAR , but 10 were removed for subsequent analyses because were sampled outside the Caribbean region. The same reasoning was used in the expanded dataset, where only B CAR sequences of Caribbean origin were considered. The analysis of translated FL genome sequences identified nine positions that displayed compositions significantly distinct between B CAR and B PANDEMIC datasets, covering structural (one in P6, one in PR and one in RT), regulatory (one in Tat and three in Rev) and accessory (one in Vif and one in Vpu) proteins (Table 2). Expanded B CAR datasets comprising partial regions of Gag, Pol, Env, and Rev encompass four out of the nine AAs previously identified. Three positions (one in PR, one in RT, and one in Rev) had their results endorsed by the additional sequences, while statistical significance in the P6 position was lost. Analyses of the expanded B CAR datasets also identified additional AAs not detected in the FL genome dataset (Table 2), four in Gag (three in P24, positions 27, 120 and 148; and one in P7, position 12) and another in the RT (position 211).
By combining FL and partial genome B CAR datasets, 13 positions were considered as signature positions differentiating B CAR e B PANDEMIC clades: four located on Gag (three in P24, and one in P7); three on Pol (one in the PR and two in the RT); three in Rev; while Vif, Vpu, and Tat each contributed with one position ( Table 2). No signature position was identified in Vpr, Env, or Nef. None of the 13 AAs that distinguished contemporary B CAR and B PANDEMIC sequences were found to be invariant sites, with the exception of position Rev 102 in the B CAR lineages. Furthermore, we also observed that for most AA positions, the most common amino acid found in a given subtype B clade corresponds to the second most frequent amino acid in the other clade ( Table 2). The exceptions were position 207 in RT that displayed E 48 /A 25 as the most frequent amino acids in B CAR strains and Q 82 /E 9 in B PANDEMIC ones, and position 57 in Rev that displayed A 53 /E 32 as the most frequent amino acids in B CAR strains and G 40 /E 27 in B PANDEMIC ones.

AAs in B CAR and B PANDEMIC MRCA sequences
When the reconstructed MRCA sequences of B CAR and B PANDEMIC clades were compared, 21 amino acidic positions differentiated both ancestors ( Table 3). Eight of them were located in  (Table 2). Thus, of the 21 amino acidic positions that differentiated the B CAR and B PANDEMIC ancestors only two continue to distinguish the contemporary descendant sequences. Furthermore, this analysis suggests that most (11/13) AAs that distinguished contemporary B CAR and B PANDEMIC sequences were probably

PLOS ONE
Few amino acid signatures distinguish HIV-1 subtype B pandemic and non-pandemic strains not inherited from their ancestors. It is interesting to note, however, that the most common amino acid in 11 (including the two inherited from ancestors) out of 13 positions associated with the AAs was the same in B CAR and subtype D sequences; while subtype D and B PANDEMIC sequences coincide in only one position.

Discussion
The current work suggests that the hypothesis that viral genetic determinants shaped the remarkable differences in the geographic dissemination pattern of the HIV-1 B PANDEMIC and B CAR strains is highly unlikely. Among over 3,000 positions analyzed across nine genes coded by the HIV-1 genome, we detected only 13 AAs distinguishing the B CAR and B PANDEMIC clades. All AAs that did differentiate the B CAR and B PANDEMIC clades correspond to sites with  Tat Rev GP120 GP41 GP41 GP41 Nef   Position  84  15  91 180 12  26  30  42  35  49 122 207 47  66  74  56  363  133  182  209 10 The

PLOS ONE
Few amino acid signatures distinguish HIV-1 subtype B pandemic and non-pandemic strains

PLOS ONE
Few amino acid signatures distinguish HIV-1 subtype B pandemic and non-pandemic strains distinct degrees of polymorphism and not to invariant (or highly conserved) sites. Furthermore, for 11 out of 13 AAs positions, the most common amino acid found in a given subtype B variant corresponds to the second most frequent amino acid in the other variant. Our study also suggests that 11 out of 13 AAs that distinguished contemporary B CAR and B PANDEMIC sequences were probably not inherited from their ancestors and that most (19/21) amino acid differences inferred between the B CAR and B PANDEMIC ancestors evolved into polymorphic sites with quite comparable compositions in modern descendant sequences. Finally, nearly all AAs identified were located outside functionally relevant protein domains. Our data supports that B CAR and B PANDEMIC strains are probably not distinguished by functionally relevant AAs in structural genes. Analyzes of both FL and expanded partial env sequences failed to detect AAs in this variable genomic region. Furthermore, no significant differences in the frequency of CCR5/CXCR4 tropic variants were detected between B CAR and B PANDEMIC sequences, supporting not great variation in the chemokine receptor usage between pandemic and non-pandemic subtype B strains. These results are fully consistent with a previous study that demonstrate that the env V3 consensus sequence of B CAR strains from Trinidad and Tobago differs by few amino acids from the B PANDEMIC V3 consensus and that no phenotypic features, including syncytium induction, neutralization profiles, and chemokine receptor usage, distinguish both subtype B lineages [19]. Furthermore, all AAs in Gag and Pol that distinguishing B CAR and B PANDEMIC sequences were located outside known conserved protein functional domains.
By contrasting, a few interesting differences between B CAR and B PANDEMIC strains were observed in non-structural genes. The single AAs in position 23 of Tat is located in a cysteinerich domain and has been reported as one of the three major sites of p53-derived restriction of Tat mediated by PKR phosphorylation [67]. The presence of an N residue in that position, that is the most prevalent amino acid in subtypes A, C, D and B CAR (44%) strains, but not in B PAN-DEMIC ones (22%), have been associated with increased Tat transactivation, probably through

PLOS ONE
Few amino acid signatures distinguish HIV-1 subtype B pandemic and non-pandemic strains enhanced P-TEFb binding [67,70]. We also detect much higher frequency of the naturally occurring variation Vpr R77Q in B CAR (83%) respect to B PANDEMIC (48%) sequences. That mutation, that also predominates in subtypes A, C, D, G, and H, reduces apoptosis and CD4 T-cell depletion in ex vivo-infected cells and is much more prevalent in subtype B-infected LTNPs individuals (75-90%) than in subjects with progressive HIV disease (33-42%) [45]. The similar genetic composition of B CAR and several pandemic HIV-1 clades (subtypes A, C and D) at positions Tat23 and Vpr77 argued against the hypothesis that differences at such positions resulted in a more restricted dispersion of B CAR compared with the B PANDEMIC strains.
Despite the very small size (n = 18) of the B CAR FL genome dataset here used, some evidences support that the B CAR genetic variability was not severely underestimated in this dataset and that the B CAR consensus sequence obtained was probably not biased by the low number of FL genomic sequences available. First, the most common amino acid recovered in most positions from extended datasets was fully coincident with the one detected in the FL dataset. Expanded and FL datasets converged in 99.7% (297/298) of Gag amino acid positions, 99.1% (336/339) of Pol positions, and 97.8% (683/698) of Env positions analyzed. Second, the degree of polymorphism of the 13 AAs positions that distinguished B CAR and B PANDEMIC sequences was roughly comparable in FL and expanded B CAR datasets. The paucity of FL B CAR strains, however, might have restricted our ability to detect some additional AAs between B CAR and B PANDEMIC sequences. By increasing the number of B CAR sequences we failed to recover new AAs between B CAR and B PANDEMIC env sequences, but we detected four additional AAs in Gag and one in Pol, increasing the overall number of AAs from three to eight in those genomic regions.
The low number of FL B CAR sequences used might have also introduced some bias on the reconstruction of the MRCA sequences. According to our analysis, of the 13 AAs detected in modern B CAR and B PANDEMIC sequences, only two matched with divergent sites between the B CAR and B PANDEMIC ancestors. This observation support that most genetic differences between the B CAR and B PANDEMIC ancestors evolved toward positions with similar amino acid composition in modern subtype B sequences and that most AAs in modern B CAR and B PAN-DEMIC sequences arose during subsequent evolution and diversification of subtype B lineages. An inspection of the amino acid composition at those 13 positions in the related subtype D clade, however, supports a different scenario. We observed that B CAR and subtype D sequences displayed the same prevalent amino acid in most (11/13) AAs positions, consistent with genetic identity inherited from the common B/D ancestor. In sharp contrast, the B PANDEMIC and subtype D sequences displayed the same prevalent amino acid in only one AAs position. Therefore, our reconstruction of the MRCA sequences may have underestimated the number of B CAR /B PANDEMIC AAs inherited from the ancestors.
In summary, albeit some mutations fixed in the HIV-1 B PANDEMIC ancestral strain could potentially have some phenotypic impact on viral transmissibility, the absence of stringent AAs distinguishing modern B CAR and B PANDEMIC variants and the similar amino acid composition between B CAR and other group M subtype pandemic variants at key sites indicates that viral genetic determinants were probably not the main factor shaping the divergent pattern of geographic spread of B CAR and B PANDEMIC variants. The successful dissemination of B CAR strains in some Caribbean countries that exhibit the highest HIV-1 prevalence rates outside of Africa [6,11] also argues against the hypothesis of a reduced B CAR viral fitness. These results support that stochastic events leading to the introduction of B PANDEMIC ancestor into globally connected populations were the most probable driving force behind its pandemic dissemination and substantiate the crucial need for continued molecular surveillance of HIV-1 transmission on key populations worldwide.  Table. Predicted co-receptor usage by B CAR and B PANDEMIC env sequences. The table summarizes the predicted usage of chemokines receptors CCR5 and CXCR4 based on different criteria: 1) the Geno2Pheno algorithm, which classifies the sequences between R5 variants or X4 and R5X4 dual-tropic variants; 2) the 11/25 Rule, which asses the presence of arginine (R) or lysine (K) at position 11 of env V3 sequences and/or K at position 25; 3) the combination of R at position 25 of V3 and a net charge of � 5. (PDF) S4 Table. Prevalence of transmitted drug-resistance mutations in B CAR and B PANDEMIC PR/RT sequences. The table summarizes the surveillance drug-resistance mutations (SDRM) identified in PR/RT B CAR and B PANDEMIC sequences retrieved from drug naïve subjects. PI (protease inhibitor), NRTI (nucleoside analog reverse-transcriptase inhibitors), NNRTI (nonnucleoside analog reverse-transcriptase inhibitor). The p-values obtained in chi-squared tests are listed in the last column. (PDF) S1 File. (PDF)