Heterogeneous Evolution of HIV-1 CRF01_AE in Men Who Have Sex with Men (MSM) and Other Populations in China

Introduction The HIV epidemic in men who have sex with men (MSM) continues to grow in most countries. However, the phylodynamic and virological differences among HIV-1 strains circulating in MSM and other populations are not well characterized. Methods Nearly full-length genomes (NFLGs) of the HIV-1 CRF01_AE were obtained from the Los Alamos HIV database. Phylogenetic analyses were conducted using the NFLG, gag, pol and env genes, using the maximum likelihood method. Selection pressure analyses at the codon level were performed for each gene in the phylogenetic clusters using PAML. Results Sequences isolated from MSM in China clustered in Clusters 1 (92.5%) and 2 (85.71%). The major risk factor for Cluster 3 was heterosexual transmission (62.16%). The ratio of non-synonymous to synonymous substitutions in the env gene (0.7–0.75) was higher than the gag (0.26–0.34) or pol (0.21–0.26) genes. In env gene, Cluster 1 (4.56×10-3subs/site/year) and 2 (6.01×10-3subs/site/year) had higher evolutionary rates than Cluster 3 (1.14×10-3subs/site/year). Positive selection affected 4.2–6.58% of the amino acid sites in the env gene. Two sites (HXB2:136 and 316) evolved similarly in Clusters 1 and 2, but not Cluster 3. Conclusion The HIV-1 CRF01_AE in MSM is evolving differently than in other populations.


Introduction
The HIV epidemic in men who have sex with men (MSM) continues to grow in most countries [1]. More than half of new HIV infections occur among MSM in both the United States of America and the United Kingdom [2,3]. In China, the proportion of MSM among those newly diagnosed with HIV increased to 29.4% in 2011 [4]. The drivers of the HIV epidemic in MSM are complex; they include increased high-risk behaviors, high risk of transmission through receptive anal intercourse, and a high prevalence within the network of possible sexual contacts [5]. There is an unmet need for studies focusing on the phylodynamics and virology of HIV transmission and acquisition risks for MSM and transmission dynamics within the MSM networks.
The molecular epidemiology of HIV infections in MSM in China was first studied in a small cohort (n = 45) in [2005][2006]. In this cohort, the predominant subtype was the US-European origin subtype B virus (71.1%), followed by the CRF01_AE (24.4%), and CRF07_BC (4.4%) subtypes [6]. Four serial cross-sectional surveys in MSM, from 2005 to 2009 suggested that Non-B subtypes increased rapidly in recent years, in particular, CRF01_AE increased from 3.7% in 2005 to 50% in 2009 [7]. A nationwide molecular epidemiological survey in MSM showed that CRF01_AE accounted for 62.1% of infections in China as a whole between 2009 and 2011 [5]. In summary, CRF01_AE has become the predominant strain of the virus in MSM in China.
Recent studies have confirmed that CRF01_AE was introduced from Southeast Asia in the 1990s and has expanded rapidly in China [8]. The CRF01_AE epidemic in China is comprised of multiple genetically distinct clusters that have different risk factors and are epidemic in different geographic regions. However, the evolutionary history of the clusters has not been well characterized.
Here, we have conducted a large-scale phylogenetic analysis of nearly full-length genomes (NFLG) of CRF01_AE strains to infer their evolutionary relationship. The substitution rates of the clusters were estimated using the Bayesian Markov Chain Monte Carlo (MCMC) method. We have also estimated the non-synonymous to synonymous substitution rate ratios (dN/dS ratio), and identified the positive selection sites for each cluster. These studies provide novel insights into the evolution of CRF01_AE in MSM, and will likely contribute to improving HIV-1 surveillance and vaccine development.

Sequence Data
All of the available NFLG sequences for CRF01_AE were obtained from the Los Alamos HIV database (http://www.hiv.lanl.gov/), on April 18, 2015. Identical sequences in the dataset are represented by the oldest sequence in the group. The dataset included 685 sequences. An initial alignment of the sequences was performed using Gene Cutter from the Los Alamos HIV sequence database (http://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html). The accession numbers for the sequences used in this study are summarized in S1 File.

Phylogenetic Tree Analysis
Phylogenetic analysis was performed for the NFLGs, gag, pol and env genes using the maximum likelihood (ML) method in RAxML [9]. Two hundred bootstrap replicates were performed using the GTR-GAMMA, the GTR model of nucleotide substitution with the Gamma model of rate heterogeneity. The tree was color-coded using FigTree (ver.1.4.2) (http://www. tree.bio.ed.ac.uk/software/figtree/).

Evolutionary Rate
The substitution rates of the different clusters were estimated using the BEAST software and implementing an MCMC method [10]. The GTR+I+Г4 nucleotide substitution model and coalescent Bayesian skyline model were incorporated in the MCMC method [11]. A relaxed molecular clock model with uncorrelated lognormal distribution was used to infer the timescaled maximum clade credibility phylogenies [12]. Multiple independent MCMC runs were performed and assessed for consistency. The MCMC analyses were combined to give a total chain length of 0.5-4x10 7 steps with sampling every 5,000 steps. The first 10% of the states of each chain were discarded as burn-in. Ten thousand trees were then sampled to estimate the evolutionary rate using LogCombiner v1.8.0. Convergence of relevant parameters was assessed by effective sample sizes over 200 in Tracer v1.5 (http://tree.bio.ed.ac.uk/software/tracer/).

Selection Pressure Analysis
To examine the selection pressure placed on each cluster, we estimated the ratio of non-synonymous (dN) to synonymous (dS) substitutions for each cluster, using the HyPhy package [13]. Selection pressure analyses at the codon level for each gene in the different clusters were conducted using the CODEML program in the PAML 4.4 software package to apply site-specific models for detecting positive selection [14]. Two selective models that allow for positive selection (2a and 8; ω>1) were compared with two null models (1a and 7, respectively) that do not allow for positive selection. The likelihood ratio test was used to determine whether there were significant differences between the null model and the alternative model by calculating twice the loglikelihood difference following a χ 2 distribution, with the number of degrees of freedom [15].

Phylogenetic analysis
Three phylogenetic trees were constructed from the CRF01_AE NFLGs (HXB2 nucleotide sequence numbering 796 to 8,905 nucleotides [nt]) using the ML approach with bootstrapping analyses to assess clade robustness: (1) env fragments (HXB2 6,789 to 8794 nt); (2) gag fragments (HXB2 796 to 2,216 nt); and (3) pol fragments (HXB2 2,085 to 5,094 nt). The results for the NFLGs and env fragments are shown in Fig 1. The results for pol and gag fragments are shown in S1 Fig. Three clusters (numbered 1, 2, and 3) were observed in CRF01_AE sequences isolated from Chinese patients. MSM was the predominant risk factor for patients in Cluster 1 (92.5%) and 2 (85.71%). In contrast, heterosexuality was the major risk factor for patients in cluster 3 (62.16%).

Evolutionary analysis
To evaluate the evolutionary changes that characterized the different clusters, we calculated the dN/dS ratios and the evolutionary rates of each cluster ( Table 2). The dN/dS ratio represents  the magnitude of the selective pressure. A higher selective pressure indicates that the gene (or site) is under stronger positive selective pressure for amino acid substitution [16]. The dN/dS ratio for the env gene (0.7-0.75) was higher than the gag (0.26-0.34) or pol (0.21-0.26) genes, indicating greater selective pressure was exerted on env. Within the env genes, Cluster 1 (4.56×10 -3 subs/site/year) and Cluster 2 (6.01×10 -3 subs/site/year) had higher evolutionary rates than Cluster 3 (1.14×10 -3 subs/site/year; Table 2).

Site-by-site Analyses
Positive selection usually affects only a few residues in a protein, therefore we used the site-specific model in the PAML package to identify the positively selected sites (PSS) [17]. Two selection models (M2a and M8) fit the data significantly better than the null models that did not incorporate selection (M1a and M7). The models indicated that 4.2%-6.58% of the amino acid sites in the env gene appear to be under positive selection (dN/dS ratio: 3.78-5.81) ( Table 3).
We then compared the similarities and differences in the PSS in the three clusters. Four PSS found in clusters 1 and 2 were not under positive selection in cluster 3. Of these, three (HXB2: 721, 775 and 816) were located in gp41, which suggested that the CRF01_AE gp41 protein is evolving and adapting in MSM. Seven PSS were identified in the three clusters that were evolving differently. Of these, two (HXB2:136 and 316) were evolving similarly in Clusters 1 and 2 and were significantly different from Cluster 3 (Fig 2). Threonine (T) was frequently used in site 136 (HXB2) in Clusters 1 (42.5%) and 2 (92.8%), while praline (P) was frequently used in Cluster 3 (52.9%). Lysine (K) was frequently used in site 316 (HXB2) in Clusters 1 (50%) and 2 (96.4%), while asparagine (N) was frequently used in Cluster 3 (47%, Table 4).

Discussion
In this study, we carried out a large-scale sequence analysis of HIV-1 CRF01_AE. The CRF01_AE sequences isolated from MSM in China formed two clusters similar to previous studies [5]. These findings suggest that MSM have their own group, and that the HIV-1 subtypes circulating in MSM have unique evolutionary characteristics. We also observed distinct differences in the geographical distribution of the clusters: Cluster 1 was found more frequently in LiaoNing, while Cluster 2 was more concentrated in Beijing. As only a small fraction of sequences is included in the database, our study's dataset is just a small sample of the viruses that circulate worldwide. This is a potential weakness of our study. Sample bias could be one of the reasons causing the distinct geographical differences among clusters. The selective pressure is stronger on env than gag or pol, and more sites under positive selection were identified in env. The env gene is associated with viral transmission and host cell tropism; it is also the primary target of the host immune response. Thus, many studies have evaluated its contribution to viral replication and HIV-1 pathogenesis [18]. The selective pressure exerted during transmission enhances env entry efficiency and HIV-1 viral fitness, which might help explain the growing epidemic in MSM.
The envelope protein initially forms as a precursor (gp160) that is leaved by a cellular protease to produce the surface subunit gp120 and the transmembrane subunit gp41. The gp120 protein is comprised of five variable (V1 to V5) and five conserved constant (C1 to C5) domains [19]. In our study, many of the PSS were found in the variable domains. However, there were marked differences between clusters. Some PSS were only observed in Clusters 1 and 2 (HBX2: 32, 721, 775, and 816). Sites 136 and 316 (HBX2) evolved differently and are located in the V1 loop and V3 loop respectively. The two sites circulating in MSM preferentially use threonine (T) and Lysine (K). V1/V2 loop is generally exposed on the envelope and is one of the first targets of the early immune response [18]. Previous studies suggested that more compact or shorter V1/V2s reduce number of N-linked glycosylation sites and increase the number of quasispecies replicating in the plasma of donors at the time of transmission [20]. V1 and V2 also conceal the CD4 binding site; thus, deletion of the V1 and V2 regions increases viral sensitivity to neutralizing antibodies. The V3 loop is exposed and engages the coreceptor (CCR5 or CXCR4), which then mediates membrane fusion. Further studies are needed to better understand how the 136 and 316 sites (HBX2) affect viral transmission. Broad and potent HIV-1 neutralizing antibodies (bNAbs) are the goal of many HIV-1 vaccine programs [21]. The four most vulnerable sites on the env glycoprotein are the CD4 binding site (CD4bs), glycan dependent epitopes in V1V2 and near the base of V3/C3, and linear epitopes in the membrane proximal external region (MPER) of gp41 [21]. These sites are widely recognized to differ between HIV-1 subtypes. Here, we demonstrated that they can also differ between transmissions, which add to the difficulty of developing an effective vaccine.
In summary, we conducted a large-scale sequence analysis of the HIV-1 CRF01_AE. The CRF01_AE sequences isolated in MSM in China formed two clusters and the highest rates of evolution were observed in the env gene. In addition, the amino acids mutations at PSS differed between the clusters and are likely associated with virus budding and antigen recognition. These results further our knowledge of CRF01_AE evolution across transmissions, and will likely help improve HIV-1 surveillance and vaccine development.