High prevalence and diversity of HIV-1 non-B genetic forms due to immigration in southern Spain: A phylogeographic approach

Phylogenetic studies are a valuable tool to understand viral transmission patterns and the role of immigration in HIV-1 spread. We analyzed the spatio-temporal relationship of different HIV-1 non-B subtype variants over time using phylogenetic analysis techniques. We collected 693 pol (PR+RT) sequences that were sampled from 2005 to 2012 from naïve patients in different hospitals in southern Spain. We used REGA v3.0 to classify them into subtypes and recombinant forms, which were confirmed by phylogenetic analysis through maximum likelihood (ML) using RAxML. For the main HIV-1 non-B variants, publicly available, genetically similar sequences were sought using HIV-BLAST. The presence of HIV-1 lineages circulating in our study population was established using ML and Bayesian inference (BEAST v1.7.5) and transmission networks were identified. We detected 165 (23.4%) patients infected with HIV-1 non-B variants: 104 (63%) with recombinant viruses in pol: CRF02_AG (71, 43%), CRF14_BG (8, 4.8%), CRF06_cpx (5, 3%) and nine other recombinant forms (11, 6.7%) and unique recombinants (9, 5.5%). The rest (61, 37%) were infected with non-recombinant subtypes: A1 (30, 18.2%), C (7, [4.2%]), D (3, [1.8%]), F1 (9, 5.5%) and G (12, 7.3%). Most patients infected with HIV-1 non-B variants were men (63%, p < 0.001) aged over 35 (73.5%, p < 0.001), heterosexuals (92.2%, p < 0.001), from Africa (59.5%, p < 0.001) and living in the El Ejido area (62.4%, p<0.001). We found lineages of epidemiological relevance (mainly within Subtype A1), imported primarily through female sex workers from East Europe. We detected 11 transmission clusters of HIV-1 non-B Subtypes, which included patients born in Spain in half of them. We present the phylogenetic profiles of the HIV-1 non-B variants detected in southern Spain, and explore their putative geographical origins. Our data reveals a high HIV-1 genetic diversity likely due to the import of viral lineages that circulate in other countries. The highly immigrated El Ejido area acts as a gateway through which different subtypes are introduced into other regions, hence the importance of setting up epidemiological control measures to prevent future outbreaks.


Introduction
The high evolutionary rate and recombination capacity of human immunodeficiency virus type 1 (HIV-1) determine the existence of an array of subtypes and recombinant forms circulating worldwide [1][2][3][4]. HIV-1 non-subtype B ("non-B") variants cause around 90% of infections worldwide, and largely predominate in African or Eastern European countries with generalized HIV-1 epidemics. Subtypes C and A, and circulating recombinant forms (CRF) CRF01_AE and CRF02_AG, are responsible alone for 70% of the world's infections [5]. Nowadays the proportion of infections by HIV-1 non-B variants in Spain lies at 12-15%, depending on the study and technique used to characterize variants [6,7]. Nonetheless, the predominance of HIV-1 subtype B in developed countries (where antiretroviral therapy is more widespread), implies that this is the most widely studied subtype from the genetic, biological and therapeutic viewpoints. The full biological meaning of the genetic variability of HIV-1 is still not completely understood. However, several major differences between the biological properties of certain genetic subtypes in have been described; e.g., virulence, tropism and transmissibility [8,9], use of chemokine co-receptors [10], disease progression [11], susceptibility to some antiretroviral drugs [12,13], sensitivity to viral load quantification methods [14,15] and detection [16]. These findings evidence the importance of epidemiological information about different subtypes.
Eastern Andalusia is located in south-eastern Spain, and includes the provinces of Almería, Granada and Jaén. Given its location and geographic closeness to the African continent, this region has received a notable foreign migratory influx in the last decade. Andalusia is the fourth Spanish Autonomous Community in number of foreign population, only surpassed by Catalonia, Madrid and the Valencian Community. The main source of immigration in Eastern Andalusia stems from its intensive farming practices, mainly in the El Ejido area (located in the province of Almería), where one in every four citizens is an immigrant.
Phylogenetic analyses, in conjunction with geographical data, can assess the existing relationship between migratory events and spread of HIV-1 on a local scale [17][18][19][20], and to study HIV-1 transmission networks locally [21][22][23][24]. As in previous studies [25], our center collects the HIV-1 pol gene sequences linked to the patients' clinical data to monitor baseline drug resistance in naïve individuals in Eastern Andalusia. Our aims were to describe the molecular epidemiology and evolutionary history of non-B forms in Eastern Andalusia over the 2005-2012 period, and to explore their putative geographical origin prior to their arrival to our region.

Study population
During the study period (2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012), 693 pol gene sequences of patients newly diagnosed with HIV-1 in different Eastern Andalusian hospitals were collected from routine drug resistance analyses. These hospitals were distributed in 3 provinces: Granada (which included its capital city of Granada and Motril), Jaén, and Almería (including its capital city of Almería and El Ejido). The pol sequences (protease (PR), codons 4-99; reverse transcriptase (RT), codons 38-247) obtained by the Trugene1 HIV Genotyping kit (Siemens, NAD), were linked to demographic (risk group, age, gender, country of origin, sampling year, and attending hospital), clinical (CD4+ T-cell count) and virological (plasma viral load) information. Demographic information was voluntarily collected during clinical interviews. This study was approved by the San Cecilio Hospital's Ethics Committee, and no consent information was required as patient information remained anonymous and was de-identified prior to analyses.

HIV-1 pol sequencing and subtype assignment
All the sequences were trimmed to 883 nucleotides (nt) and aligned using ClustalW [26]. The viral subtype was studied with the REGA v3.0 subtyping tool (http://dbpartners.stanford. edu:8080/RegaSubtyping/stanford-hiv/typingtool/), and was confirmed by phylogenetic analysis through maximum likelihood (ML) using the randomized Accelerated Maximum Likelihood (RAxML) program, accessible on the CIPRES Science Gateway [27]. The general timereversible (GTR) model with a gamma-distributed heterogeneity rate across sites was employed, applying 1000 bootstrap iterations. A representative dataset of HIV-1 group M sequences, including non-recombinant subtypes (A-K) and recombinant forms (at least four representative sequences of each non-recombinant subtype and the CRF currently available from the analysis) were downloaded from the Los Alamos HIV sequence database (http:// www.hiv.lanl.gov) was used as a reference dataset (S1 Table).
The assignment to any subtype/CRF was considered definitive if the query sequence was included with the reference sequences corresponding to that viral variant in a monophyletic cluster supported by high bootstrap values (>70%) [28]. Any genetic form not associated with reference subtypes/CRFs was classified as a unique recombinant form (URF), whose recombination pattern was further studied by a Bootscan analysis using the SimPlot v3.5.1 software [29]. The bootscanning method in SimPlot consists of a sliding-window phylogenetic bootstrap analysis of the query sequence aligned against a set of reference strains to reveal breakpoints. The Neighbor-Joining algorithm was selected, with the Kimura 2-parameter substitution model. We employed a window size of 200nt moving in 10nt increments. We used a minimum cutoff for the bootstrap value of 70% to reliably assign each of the breakpoint segments to a parental variant.
We have submitted to GenBank the major groups of HIV-1 non-B variants under accession numbers MF628109 to MF628250. These were defined as those found in at least five patients. With the aim of protecting the identity of patients infected with rare genetic forms of HIV-1, and for similar scientific and ethical reasons as explained in other HIV cohorts [30][31][32], we decided not to submit to GenBank those sequences corresponding to the less frequent variants.

Inference of the putative geographical origins of the HIV-1 non-B variants circulating in Andalusia
To further characterize the relationships among the major groups of HIV-1 non-B variants, we interrogated GenBank for genetically related sequences to our major subtypes/recombinant forms using HIV-BLAST (http://www.hiv.lanl.gov/content/sequence/BASIC_BLAST/basic_ blast.html). The 10 most closely related GenBank sequences to each of our study sequences, were downloaded and included in each dataset. We also included all the pol sequences (start: 2293 and end: 3290, HXB2 coordinates), available in the HIV Los Alamos database sampled in Spain for each dataset: subtype A1 (n = 60), subtype C (n = 52), subtype F (n = 143), subtype G (n = 64), CRF14_BG (n = 25), and CRF02_AG (n = 265). Since very few sequences for CRF06_cpx were available in public databases (http://www.hiv.lanl.gov/content/sequence/ HIV/mainpage.html), we included them all (n = 110).
All these individual sequence datasets were put together (n = 970) and a global phylogenetic analysis was performed using RAxML (GTR + Gamma model) and 1000 bootstrap iterations for this analysis. The phylogenetic relatedness between the sequences was studied, and a 70% bootstrap value was taken as a significantly reliable value [28]. Thresholds for low genetic distance, which are commonly used as a proxy for divergence time, were not applied to the cluster definition in the ML trees since these clusters were further confirmed and analyzed using a time-stamped Bayesian phylogenetic analysis with BEAST, as described below. International non-B lineages (defined as phylogenetic associations of at least one sequence from our cohort clustered with sequences from different countries), and 'Andalusian clusters' (monophyletic associations of sequences in our cohort alone), were identified in the global ML tree.
A Bayesian Markov Chain Monte Carlo (MCMC) approach was applied to each of the individual HIV-1 non-B subtype/CRFs datasets described above, which included the most genetically similar sequences found with HIV-1 BLAST, as implemented in BEAST v1.7.5 [33]. The Shapiro-Rambaut- Drummond-2006 (SRD06) substitution model was used, together with a relaxed uncorrelated lognormal clock (UCLN) [34] and a demographic non parametric model, Bayesian Skyline Plot (BSP) [35]. This model combination was chosen because it best fits the analysis of the HIV-1 pol data run in the majority of studies [36]. The MCMC was run for 250 million states sampling every 50000. The evolutionary rate (μ, nucleotide substitutions per site per year, subst./site/year) for the different HIV-1 non-B subtypes/CRFs (S2 Table), and the most recent common ancestors (MRCA) of the different HIV-1 non-B clusters, were estimated. Only traces with an effective sample size (ESS) > 200 for all the parameters, after excluding an initial 10% burnin, were accepted as visualised in TRACER, v1.6 (http://tree.bio. ed.ac.uk/software/tracer/).
Maximum Clades Credibility (MCC) trees were constructed in each case to summarise the posterior tree distributions. In these MCC trees, the more epidemiologically relevant clusters and lineages, previously identified in the global ML tree, were studied; and a node support cutoff (posterior probability (pp) above 0.9) was applied for their confirmation. Trees were viewed and edited in FigTree, v. 1.4.0 (http://tree.bio.ed.ac.uk/software/figtree).

Analysis of the antiretroviral drug resistance mutations
Drug resistance mutations were identified in the pol sequences using the HIVseq program, which is available in the HIV Drug Resistance Database of Stanford University (https://hivdb. stanford.edu/hivseq/by-sequences/), and also using the WHO surveillance drug resistance mutation list (last updated in 2009 by Bennett and colleagues) [37].

Statistical analyses
A multivariate logistic regression analysis was performed to determine the predictive effect of the demographic, clinical and virological characteristics on the adscription to each subtype/ CRF. The statistical significance of these characteristics, compared to the total proportion of infected patients, was studied by a hypothesis contrast using a z-test. The statistical analysis was performed with SPSS 22.0.
The multivariate logistic regression analyses demonstrated a higher risk of carrying HIV-1 subtype A for females (OR = 6.17, p = 0.026) and non Africans (OR = 0.08, p = 0.008; S3 Table). The other HIV-1 non-B genetic forms showed no predictive effect of the demographic, clinical and virological characteristics (data not shown).
Twenty-three patients were infected with unusual HIV-1 non-B variants (i.e., those variants found in four patients or fewer). Of them, 10 (44.4%) were observed in Spanish patients ( Table 2). The recombination patterns for the different URFs obtained according to the Bootscan analysis are presented in

Analysis of the putative geographical origins of the main HIV-1 non-B genetic forms found in Eastern Andalusia
In order to characterize the phylogenetic relationship of the patients infected with the most frequently found HIV-1 non-B variants (those found in ! 5 patients), the global ML tree (Fig  4) revealed the existence of 13 international lineages in Eastern Andalusia (Table 3) and 11 Andalusian clusters ( Table 4) that involved patients in our cohort.
The Bayesian analyses (Figs 5 and 6) showed that most of these Andalusian clusters originated in the first decade of this century, and mainly included patients sampled in El Ejido. The low CD4 count of the patients included in most of these transmission networks suggests a late HIV diagnosis in a high proportion of patients (Table 4). In order to provide more information about the scale of the trees shown, we provide in the S4 Table the distribution of patristic (uncorrected) pairwise genetic distances between sequences included in each of the ML and Bayesian trees generated in this article.

Non-recombinant subtypes
Thirty (18.2%) patients were infected with HIV-1 subtype A1. The viral sequences were genetically similar according to HIV-BLAST to 21 GenBank sequences from Bulgaria, the Democratic Republic of Congo, Croatia and Greece with 13, 6, 1 and 1 cases, respectively. The ML analysis (Fig 4) detected a large international lineage that involved sequences from Eastern Europe (lineage L1.A1 in Table 3) and grouped 21 patients from our cohort: 16 women born abroad (Eastern Europe (n = 14), the Dominican Republic (n = 1) and Lithuania (n = 1)) and 5 Spanish men. This lineage also included 23 GenBank sequences, also originating from Eastern Europe: Bulgaria, n = 10, Russia, n = 5, Poland, n = 1 and the Ukraine, n = 1. Within this lineage, we found two clusters (A.1 and A.2), formed exclusively by Spanish men and female sex workers born in Russia, all being patients sampled in Eastern Andalusia. The A.1 local cluster involved 4 sequences from Spanish patients living in the capital of Granada, its origin was estimated to be 2008.5 (95%CI: 2006.6-2010.3), and the sequences presented the resistance mutation K103N in the RT gene. This cluster was also phylogenetically related to viruses that circulate in Eastern Europe. Unlike most of the HIV-1 non-B clusters, patients in the A.1 cluster showed a high CD4 count (mean = 590, range = 534-701). Moreover, the Bayesian phylogenetic tree revealed short internode branches, which may indicate short times between infections. Finally, five subtype A1 sequences from our cohort corresponded to patients from Africa: Mali, n = 2, Equatorial Guinea, n = 1 and Spain, n = 2, not clustered in transmission cluster. Seven (4.2%) sequences corresponded to HIV-1 subtype C, and showed high genetic similarity to 23 GenBank sequences sampled in South Africa (n = 12), Brazil (n = 6) and Bulgaria (n = 4). We thus found two main ways of subtype C entrance to our area: South Africa and Brazil: a Brazilian male patient from our cohort grouped with 6 GenBank sequences from Brazil; and a South African male patient grouped with GenBank sequences from South Africa (n = 2) and Somalia (n = 1) (Fig 4). Within this subtype, we also found a single cluster (C.1) formed by patients from Brazil (n = 1) and Romania (n = 2). Nine (5.5%) sequences corresponded to HIV-1 subtype F1 and showed a high genetic similarity to 21 GenBank sequences from Brazil (n = 14), Bulgaria (n = 4) and the Democratic Republic of Congo (n = 3). We found only one Andalusian F1 cluster: a sequence pair (cluster F.1), that originated in 2010.2 (95%CI: 2010-2011) and was formed by two male injection drug users sampled in Jaen and who were of Brazilian and Spanish origins. This sequence pair was included among GenBank sequences from Brazil in the ML tree. However, we found 2 international lineages: L1.F, which grouped two Romanian heterosexual patients from our cohort with GenBank sequences sampled in Eastern Europe, mainly Romania (n = 6) and Bulgaria (n = 2). The second F1 subtype lineage (L3.F) included Spanish men who have sex with men (MSM) sampled in North Spain, and also a MSM from our cohort. Twelve (7.3%) patients of our cohort were infected with HIV-1 subtype G, who came from different western and central African countries: Mali (n = 1), Nigeria (n = 6), Ghana (n = 3) and Guinea-Bissau (n = 1). They presented high genetic similarity to 5 GenBank sequences from the Republic of Congo (n = 4) and Bulgaria (n = 1). None of these sequences was epidemiologically related according to our data. We found only one Nigerian patient whose sequence grouped with another one of the same country of origin (lineage L1.G).

HIV-1 recombinant forms
Eight (4.8%) patients in our cohort were infected with the recombinant CRF14_BG form, which in the pol analyses typically forms a monophyletic cluster within the subtype G crown. These eight patients came from Spain (n = 4), Guinea (n = 2), and Guinea-Bissau (n = 2). We also found a single small Andalusian cluster (cluster 14BG.1), which originated in 2004.4 (95% CI:2003. , and was formed by two Spanish patients. Finally, two patients from Guinea and Guinea Bissau grouped with sequences from Equatorial Guinea (lineages L1.14BG and L2.14BG).
We found 71 (43%) patients, mainly from western African countries (77.5%), infected with CRF02_AG. Of these, 11 (14%) were grouped into five small Andalusian Cluster: 4 clusters of two patients and one with three patients. We detected 5 different lineages (L1.02AG-L5.02AG) of viruses sampled in other countries, with patients from our cohort who came mainly from Western Africa.
To study the phylogenetic profile of variant CRF06_cpx, we used all the sequences available in Los Alamos HIV given their small number, n = 110 (see Fig 4). We found 5 patients in our cohort (3%) to be infected with variant CRF06_cpx, who came from different western African countries: Nigeria (n = 3), Ghana (n = 1) and Senegal (n = 1). These sequences grouped with GenBank sequences from the neighboring Western African countries of Burkina Faso, Togo and Nigeria. However, we found no significant association among the patients infected with this genetic form, and the CRF06_cpx sequences sampled in our cohort were interspersed in the tree.

Discussion
In Eastern Andalusia, most HIV-1 non-B subtype genetic forms were found among immigrant heterosexual population, mainly African males or Eastern European females. These patients were living preferentially in El Ejido, an area that potentially acts as a gateway for diverse HIV- High prevalence and diversity of HIV-1 non-B genetic forms due to immigration in southern Spain 1 variants to enter the Eastern Andalusian region. These findings are explained by the fact the El Ejido's economy is mainly based on greenhouse farming, for which a large industry has emerged in recent years thanks to immigrant labor, made up of people mainly from Africa.
The prevalence of HIV-1 non-B variants in eastern Andalusia is similar to that reported in a study performed in the nearby Western areas of Andalusia (%23%) [49], but is still much higher than that found elsewhere in Spain [6,7]. An increased prevalence has been noted for HIV-1 non-B variants and their genetic diversity in Eastern Andalusia in recent years: 22% of autochthonous patients were infected with HIV-1 non-B forms between 2005 and 2012, as opposed to the 12.8% reported in former studies conducted between 1997 and 2001 [50]. We also detected 12 different CRFs and nine URFs, a variability that is probably related to the increased migration rate reported in southern Spain in the last decade [49,51].
The least frequent HIV-1 non-B variants were detected often among Spanish patients (43%, [10/23]), and most of the clusters formed by these variants included at least one Spanish patient (55%, [6/11]). These data suggest that although these HIV-1 non-B variants seem to be Table 3

Country of origin of the patients sampled in our cohort (n)
Node support (BT) High prevalence and diversity of HIV-1 non-B genetic forms due to immigration in southern Spain due to imported cases in most cases, they have also gradually penetrated the autochthonous population in recent years. The phylogenetic and epidemiological study of the HIV-1 non-B variants in our region showed that these variants account for high proportion of infections among migrant patients, and that these viruses were genetically close to those circulating in these subjects' countries of origin. This indicates that many patients were infected before they arrived in Spain. These sequences sampled in other countries, and available in public databases, act as a control to avoid overestimating the local transmission clusters that include patients who are most probably unrelated in epidemiological terms.

Articles with related sequences
As previously shown in a national study [7], CRF02_AG was the most frequent HIV non-B variant in our population (43%). Nonetheless, the small proportion of their phylogenetic association is surprising (14%, [11/71]). This clustering rate was much higher for other HIV-1 non-B subtypes, such as subtype A1 (27.3%, [9/33]), where we discovered an international lineage (L1.A1) that mostly included a particularly vulnerable group of Russian female sex workers and potentially their local customers.
According to our analysis, it would appear that most of the non-B cases detected in Eastern Andalusia were generally imported cases as most were identified in immigrant populations. Our analyses suggest that many of these cases form part of international HIV-1 lineages that originated in Eastern Europe, South America and sub-Saharan Africa. However, we also High prevalence and diversity of HIV-1 non-B genetic forms due to immigration in southern Spain  identified 11 intra-region clusters, which might suggest the local dissemination of some non-B variants, particularly those which involve autochthonous Spanish subjects (6/11) and recent emergence times according to the phylogenetic reconstruction. On the other hand, clusters formed by foreign subjects with old common ancestors most likely reflect imported infections. The methods used herein involve a number of sampling limitations that affect this and many other similar studies. Since we relied on a BLAST search to identify the genetically closest sequences (from both Spain and abroad) that could form part of the same transmission networks as our sequences, we depended on the sequences deposited in databases. Unfortunately, this availability is sometimes very low, particularly for non-B variants. Therefore, we cannot rule out that close and more informative sequences were not captured as they have not been sampled. This was the reason why we added all the sequences available in HIV Los Alamos collected in Spain. We also demonstrated the presence of 13 different lineages of viruses that circulated in our region, which grouped with other patients from different Spanish cohorts, mainly foreign patients.
Fortunately, very few sequences included in transmission clusters persented resistance to first-line antiretroviral drugs. This information agrees with the common conception that viruses with resistance mutations present a biological disadvantage against wild strains, which weakens their transmission efficacy. Likewise, the drug resistance mutations detected affect mainly reverse transcriptase inhibitor drugs. We detected transmitted resistance mutations in four of the five patients grouped in Cluster A.4, which would cause high level resistance to nevirapine and efavirenz. We decided to study only the resistance mutations present in transmission clusters, which would have a stronger epidemiological impact. Further detailed information will be provided in future works.
The constant epidemiological surveillance in our population, for which phylogenetic analysis tools are used, is a particularly important measure to study past outbreaks of genetic HIV-1 non-B subtype variants, and to prevent future ones. Likewise, as transmission cluster size seems to predict its expansion in time [52], we could expect some transmission chains of HIV-1 non-subtype variants to become larger in size in forthcoming years, and more Spanish individuals to be included. We herein detected the presence of one patient from our cohort related to a fast spreading cluster among Spanish MSM infected with subtype F in Galicia (NW Spain) [43], a transmission cluster which, as Delgado et al. suggest, would probably be closely linked to viruses that circulate in Eastern Europe [42]. These authors [53] have also described a subtype A cluster that is being transmitted among individuals in different areas of Spain. Finally, Patiño et al. [54] have warned about the novel appearance of variant CRF19_cpx among Spanish MSM individuals.
Adequate knowledge about the characteristics of local epidemics, the study of risk groups and the prevalence of different viral subtypes are all fundamental aspects to successfully design HIV-1 prevention campaigns. In the present study, we demonstrate that phylogenetic studies which combine demographic, clinical and geographical data from different HIV-1 non-B subtypes in Eastern Andalusia provide very useful information to epidemiologically monitor and control HIV-1 spread and its origin in imported cases. Its use will help to reinforce and implement efficient actions to prevent HIV-1 from spreading between autochthonous and migrant populations.
Supporting information S1 Table. HIV-1 reference sequence dataset used in the phylogenetic analysis.