CRF19 is a recombinant form of HIV-1 subtypes D, A1 and G, which was first sampled in Cuba in 1999, but was already present there in 1980s. CRF19 was reported almost uniquely in Cuba, where it accounts for ∼25% of new HIV-positive patients and causes rapid progression to AIDS (∼3 years).
We analyzed a large data set comprising ∼350 pol and env sequences sampled in Cuba over the last 15 years and ∼350 from Los Alamos database. This data set contained both CRF19 (∼315), and A1, D and G sequences. We performed and combined analyses for the three A1, G and D regions, using fast maximum likelihood approaches, including: (1) phylogeny reconstruction, (2) spatio-temporal analysis of the virus spread, and ancestral character reconstruction for (3) transmission mode and (4) drug resistance mutations (DRMs). We verified these results with a Bayesian approach. This allowed us to acquire new insights on the CRF19 origin and transmission patterns. We showed that CRF19 recombined between 1966 and 1977, most likely in Cuban community stationed in Congo region. We further investigated CRF19 spread on the Cuban province level, and discovered that the epidemic started in 1970s, most probably in Villa Clara, that it was at first carried by heterosexual transmissions, and then quickly spread in the 1980s within the “men having sex with men” (MSM) community, with multiple transmissions back to heterosexuals. The analysis of the transmission patterns of common DRMs found very few resistance transmission clusters.
Our results show a very early introduction of CRF19 in Cuba, which could explain its local epidemiological success. Ignited by a major founder event, the epidemic then followed a similar pattern as other subtypes and CRFs in Cuba. The reason for the short time to AIDS remains to be understood and requires specific surveillance, in Cuba and elsewhere.
CRF19 is a recombinant form of HIV-1, which causes rapid progression to AIDS (∼3 years versus 5–10 years for other subtypes and CRFs). CRF19 is reported almost uniquely in Cuba, where it is highly prevalent (∼25%) among newly detected HIV-1 patients. In this study, we found that CRF19 most likely recombined around the 1970s in the Cuban community that was stationed in Democratic Republic of the Congo and Angola at that time. It was introduced very early into the Cuban province of Villa Clara, from where it had several introductions to La Habana in the 1980s and then further spread to other Cuban provinces. The CRF19 epidemic most probably started with heterosexual transmissions, followed in the 1980s by multiple introductions into “men having sex with men” (MSM) community, followed by multiple transmissions back to heterosexuals (often females). The early introduction of CRF19 into Cuba most likely explains its success, not observed in other parts of the world. However, importantly, its rapid progression to AIDS makes it crucial to survey CRF19 sub-epidemics not only in Cuba, but also in other parts of the world having regular exchanges with Cuba.
Citation: Zhukova A, Voznica J, Dávila Felipe M, To T-H, Pérez L, Martínez Y, et al. (2021) Cuban history of CRF19 recombinant subtype of HIV-1. PLoS Pathog 17(8): e1009786. https://doi.org/10.1371/journal.ppat.1009786
Editor: David T. Evans, University of Wisconsin, UNITED STATES
Received: February 12, 2021; Accepted: July 6, 2021; Published: August 9, 2021
Copyright: © 2021 Zhukova et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Multiple sequence alignments and metadata used in this study are available on github.com/evolbioinfo/HIV1-CRF19_Cuba. The repository also contains the phylogenetic and time-trees reconstructed in this study and the BEAST configuration file. The sequences used in this study were uploaded to GenBank, their accession numbers are listed in S1 File.
Funding: OG and VK were supported by the Campus France collaborative programme ‘Partenariat Hubert Curien franco-cubain Carlos J. Finlay’. JV was supported by Ecole Normale Superieure Paris-Saclay and by ED Frontieres de l’Innovation en Recherche et Education, Programme Bettencourt. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
For HIV-1, more than 100 recombining subtypes have been detected . They have varying epidemiological features, such as pathogenesis and transmission patterns . CRF19 is a recombinant form of HIV-1 subtypes D, A1 and G. It was first detected in Cuba in 1999 when four epidemiologically unrelated individuals (diagnosed in 1997) were sampled . Casado et al.  proposed an African origin for CRF19 explained by its homology to an AG intersubtype recombinant from Cameroon and a subtype D virus from Gabon. However, while CRF19 has maintained a very low prevalence in the African continent through all these years, the virus has spread successfully in Cuba, placing it among the most prevalent circulating subtypes (24% of all sampled sequences by 2017). While CRF19 identification and first sequences were obtained from patients diagnosed in 1997 , a retrospective analysis performed by Perez et al.  exposed that this recombinant was already infecting Cuban patients who were diagnosed between 1986–1990, which evidences an early introduction in the outbreak of the epidemic. Potential central African ancestry of CRF19 could be explained by presence of numerous Cuban military and civilian personnel in several sub-Saharan African countries (the Democratic Republic of the Congo (DRC), Angola, Ethiopia) from 1960s to 1980s . However, CRF19 was never reported in sub-Saharan Africa.
In Cuba, CRF19 was found to cause particularly rapid progression to AIDS (<3 years after seroconversion, versus 5 to 10 years for other subtypes and CRFs ) and is associated with drug-resistance  in many newly infected patients. Several subsequent introductions related to Cuban samples were then reported in Spain (with one limited outbreak in southern Spain, with mild pathogenicity) [8–10] and Tunisia . Understanding the origin, transmission and drug resistance acquisition of such aggressive subtype is therefore key to impede its further spread in other regions of the world.
Delatorre and Bello  performed a maximum likelihood (ML) phylogeographic study of CRF19 and D sequences (as CRF19 is subtype D in the pol fragment analysed) from Los Alamos HIV Sequence database. Their data set contained 160 Cuban CRF19 sequences sampled between 1999 and 2011, 3 European CRF19 sequences and 2000 subtype D sequences from Africa (1996–2011). They discovered that all Cuban CRF19 sequences branched in a highly supported monophyletic sub-cluster that was nested within subtype D pol sequences of central African origin. This indicates a common recombined ancestor from which CRF19 spread further. The 3 CRF19 isolates in Europe branched within the Cuban clade, indicating a transmission from Cuba to Europe. They also performed a Bayesian analysis to estimate the time of the most recent common ancestor (MRCA) of the Cuban clade: The median of the estimated date was 1987 (confidence interval (CI) 1983–1991).
The goal of this study is to better understand the geographic origin and transmission patterns of CRF19, especially regarding transmission mode and drug resistance, using the new data from Cuba and international databases. New CRF19 and D-infected patients were detected and sequenced in Cuba and worldwide since the study of Delatorre and Bello . Moreover, additional information can be extracted by performing similar studies based on the parts of CRF19 that correspond to subtypes A1 and G. Importantly, with more fine-grained geographic data available for some of the Cuban sequences (e.g. provinces of diagnostics) the CRF19 spread within the country can now be elucidated. Similarly, as for many newly sampled Cuban sequences the information on the risk group is available, it can be used to reconstruct the transmission mode scenarios to learn how CRF19 spread between and within individual communities. Lastly, as transmitted drug resistance (TDR) is becoming predominant in several countries (e.g. UK ) and HIV transmission might be further impacted by drug resistance, we also investigated drug resistance mutation spread of CRF19.
We performed maximum likelihood phylogenetic analyses using fast and easy-to-use software, to investigate the spread of the virus, its mode of transmission and drug resistance patterns. Our new Cuban data set contains rich metadata (diagnostics dates, sampling dates, risk groups, province of diagnostics, etc.), from which we obtained spatio-temporal predictions for worldwide and local CRF19 spread as well as its mode of transmission among risk groups in Cuba. The results are similar to those obtained with BEAST , except for the transmission mode where the ML approach was more flexible and made it possible to account for additional information. The article is organized as follows: we first describe the data set and the insights on CRF19 origin and transmission patterns that we acquired in this study. We then detail the phylogenetic methods used for the temporal, phylogeographic, transmission mode and drug resistance analyses, which could be used for other HIV-1 CRFs or subtypes, and other viruses.
We combined the data from two sources: a Cuban data set (from now on referred as CU) and the 2018 HIV-1 compedium genome alignment from the Los Alamos HIV database (LA) , sampled between 1983 and 2018. LA contained 3884 sequences of all M-group HIV subtypes and recombinants that were curated. We chose this smaller but curated data set in order to avoid noisy and erroneous data (partial sequences, erroneous date, subtype or country annotations for non-curated sequences, etc.) that could perturb our analyses. CU contained 1674 partial pol and 164 partial env sequences (of various subtypes) recently sampled in Cuba, which we subtyped and aligned against LA as described in Materials and methods. We created three data sets out of the combined LA+CU alignment for CRF19 analyses (described in more detail in Tables 1 and 2):
- D+CRF19—the D-part of CRF19 (C-part of gag, PR, RT and nef, breakpoints from ) for 386 sequences of subtypes D and CRF19 (276 sequences), from CU and LA.
- A1+CRF19—the A1-part of CRF19 (N-part of gag, Integrase, env) for 218 sequences of subtypes A1 and CRF19 (31), from CU and LA.
- G+CRF19—the G-part of CRF19 (vif, vpr, vpu and C-part of env) for 84 sequences of subtypes G and CRF19 (5), from LA (as these parts of the HIV genome were not present in CU).
Number of sequences of different subtypes and from different sources. LA data contained 6 sequences from Cuba: 3 of subtype G and 3 CRF19.
For each data set, we reconstructed phylogenetic trees as described in Materials and methods and found that all the CRF19 sequences formed a unique cluster (see colourstrips in Fig 1). For A1+CRF19 and D+CRF19 data sets the CRF19 cluster was placed within the A1 and D trees, while for G+CRF19 data set it formed a sister-group to the G subtree (the G+CRF19 data set contained few sequences and did not include G sequences branching before the split with the CRF19 clade). There were several mis-annotated sequences in CU data set. For example CU1437–17 was placed within the D sequences in the D+CRF19 tree, while annotated in CU as CRF19: As it only contained the D-part of the sequence (and therefore could not be distinguished between CRF19 and D during the non-phylogeny-based subtyping) the annotation was most probably erroneous. In a similar manner, the CRF19 clusters of A1+CRF19 and D+CRF19 trees included a few A1 and D sequences respectively: Those sequences only contained the A1 or D fragments and therefore were likely mistyped as well.
reconstructed for the three data sets: A1+CRF19 (top left), G+CRF19 (top right), and D+CRF19 (bottom centre). Nodes and branches are coloured by location (ACR performed by PastML, MPPA+F81), as explained in the legend in bottom left: Cuba is coloured red, African countries are brown apart from the the DRC (violet), European countries are green, Asian countries are blue, American countries (Brazil and the USA) are orange, Australia is yellow. Colourstrips around the trees show sample subtypes: A1 (light-blue), D (yellow), G (light-orange), CRF19 (red), and A1-A3 recombinant (blue, note that this sequence was used in the A1+CRF19 data set as its part that aligns with the A1-part of CRF19 is fully A1). CRF19 sequences form a cluster for all data sets. Subtype misannotations lead to several D- and A1-annotated sequences within the corresponding CRF19 clusters, and one CRF19-annotated sequence (CU1437–17) outside of the CRF19 cluster in the D+CRF19 tree. Dates (with CIs, reconstructed by LSD2) and countries (with marginal probabilities, reconstructed by PastML) of the CRF19 cluster MRCAs and their non-CRF19 (i.e. A1, D or G) parent nodes are shown. The dates of non-recombinant Cuban clusters are also shown.
The dates obtained for the MRCA of the CRF19 cluster and for its parent (i.e. the most recent D/A1/G-subtype ancestor of the CRF19 cluster) by least-squares dating (LSD2 ), estimated on the three data sets are shown in Fig 1 and in Table 3. For all the data sets the CRF19 MRCA date is estimated in the 1970s. For the D-subtype dataset we confirmed this inference with a Bayesian analysis (see S1 Appendix. for details), which estimated 1972. The most recent A1/D/G-subtype ancestor dates are 1969/1959/1958 respectively.
Note that as the three data sets include different CRF19 sequences, the CRF19 cluster MRCA dates do not have to agree, but they all correspond to the time after CRF19 recombination. Similarly, the A1/D/G-subtype ancestor dates correspond to the time before recombination and do not have to be the same for the three data sets, but have to be more ancient than all of the CRF19 MRCA dates. Based on our estimations, the dates of origin inferred by using different parts of the CRF19 are all consistent and go in the same direction: we obtain a confident estimation of the date of recombination, occurring between 1966 (lower CI of the date when in A1+CRF19 data set it was still A1) and 1977 (upper CI of CRF19 MRCA in the D+CRF19 data set).
We also dated the non-recombinant A1/D/G-subtype Cuban sequences in our data sets. They were estimated to be introduced to Cuba later: 1982 for G, 1996 for A1 and in the 2010s for D (see Fig 1), supporting the hypothesis that the recombination occurred in Africa, as co-infection and thus high-level circulation of the 3 subtypes was needed for recombination.
The country predictions (and their marginal probabilities) for the MRCA of the CRF19 cluster and of its parent (i.e. the last A1/D/G-subtype ancestor) estimated on the three data sets are shown in Fig 1. The results suggest that CRF19 recombined in the Cuban community, as the CRF19 MRCAs in all data sets and the A1 ancestor have Cuba as the country reconstruction. The reconstructions also suggest that the D-strain ancestor has a DRC origin, while the G-strain reconstruction hesitates between Cuba and the DRC (with a slight preference for the DRC). Overall, these results fit well with the hypothesis of a recombination within the Cuban community in the DRC around 1970s [1966–1977].
We further performed a province-level analysis of the CRF19 subtree from the D+CRF19 data set. The root of CRF19 subtree (dated in 1974) was unresolved with two most probable provinces being Villa Clara (0.64 as marginal probability) and La Habana (0.24). This prediction is not surprising and is consistent with respect to the previous dating and country predictions: the CRF19 subtree most likely originated in the Cuban community in sub-Saharan Africa. From there the epidemic spread to Villa Clara very early (marginal probability of 0.92, date 1977, CI: 1973 – 1979), before HIV discovery in the 1980s and likely before its introduction in a number of developed countries by the very end of 1970s (e.g. USA, UK ). Then CRF19 gave rise to multiple introductions from Villa Clara to La Habana in the 1980s and 1990s and then further spread to other provinces (Fig 2).
ACR and visualisation are done with PastML  (MPPA+F81). Branches are coloured by provinces (colour code is explained below), gray thinner branches correspond to a province change. The root is unresolved between Villa Clara and La Habana. The oldest resolved node and its date are indicated and correspond to Villa Clara (marginal probability 0.92), the dates and marginal probabilities of the main introductions from Villa Clara to La Habana are also marked.
ACR for the transmission mode (heterosexual (HET) or a transmission between MSM) reconstructed the most likely state of the origin of the epidemic to HET (marginal probability of 0.89 vs 0.11 for MSM), but suggested that in the 1980s it quickly spread to and within the MSM community and had multiple transmissions back to HET (often female): Out of 312 transmissions in our tree, 53 were within the HET community, 74 from HET to MSM, 155 within the MSM community, and 30 from MSM to HET (including 14 to females), see Fig 3 for more details.
ACR is done with PastML  (MPPA+F81), the visualisation is performed with iTOL . Nodes and branches are coloured by transmission mode (green for HET, beige for MSM), gray thinner branches correspond to a state change, gray nodes have unresolved states (MSM or HET). The colourstrip around the tree shows the gender of the sampled individuals (red for female, blue for male), and we see several transmissions from the MSM community to female HET individuals.
We also performed a transmission mode analysis with a Bayesian diffusion in discrete space approach  and obtained a different result: The origin of the epidemic was inferred as MSM with a posterior probability of 0.99 (see S2 Fig.). However, as ACR analysis used more information (risk group at the moment of diagnostics and more sequences, see Materials and methods) to make the inference, its result is likely more credible. Moreover, ACR result is in a better agreement with the epidemiological data [21, 22].
We analysed the presence/absence of 66 surveillance DRMs, which were present in at least one of the CRF19 sequences (see Table 4 and S1 Table). Out of 313 CRF19 sequences, 172 (54.95%) had a treatment-experienced status, 107 (34.19%) were treatment-naive, and for 34 (10.86%) their treatment status was unknown. 169 sequences had at least one DRM (prevalence of 53.99%), out of which 21 were treatment-naive (prevalence of 19.63% among treatment-naive), they therefore correspond to TDR (see Table 4 for DRM-specific statistics). These treatment-naive prevalence values are very high and for several antiretroviral regiments (ARVs)—EFV (14.02%), NVP (43.77%), D4T (12.15%), AZT (11.21%), ABC (10.28%), and 3TC (10.28%)—exceeding 10%, the level at which WHO guidelines suggests the first-line ARVs should be reconsidered .
We then further analysed the DMRs with prevalence of at least 10%: RT:M184V, RT:K103N, RT:Y181C, RT:G190A, RT:M41L, RT:L210W, RT:T215Y, RT:D67N, RT:K70R. For most of them hardly any onward transmissions of drug resistance between treatment-naive individuals were detected (the largest TDR cluster contained 2 patients), which is probably related to undersampling (CU data set contained about 4–9% of the estimated number of people living with HIV in Cuba by 2018, according to UNIAIDS: unaids.org/en/regionscountries/countries/cuba), and relatively recent introduction of the corresponding ARVs. The latter is supported by the fact that for RT:M41L some naive-to-naive TDR was present (see Fig 4): RT:M41L is associated with resistance to AZT , the first ARV accepted in Cuba (used since 1987). To asses the robustness of this analysis, we also inferred presence/absence of RT:M41L with a Bayesian diffusion in discrete space approach  and obtained a similar result (see S1 Fig.).
ACR is done with PastML  (MPPA+F81), the visualisation is performed with iTOL . Nodes and branches are coloured by RT:M41L (violet for resistant (the mutation is present), light-green for sensitive). The colourstrip around the tree shows the treatment status of the sampled individuals (dark-green for treatment-naïve, lilac for treatment-experienced).
CRF19 is a predominant circulating recombinant form of HIV-1 in Cuba, with 24.1% of new HIV-positive patients being infected with it . We investigated its origin and transmission patterns using new sequence data from Cuba, using both fast ML-based software programs and the Bayesian approach implemented in BEAST.
Based on our date estimations, in 1966 CRF19 still did not exist as in the A1+CRF19 data set it was still A1 (lower CI), but by 1977 it already recombined as in the D+CRF19 data set this date is the upper CI of the CRF19 MRCA. This estimate is older than the one obtained by Delatorre and Bello  (CI: 1983–1991) but is not contradictory as Delatorre and Bello estimated the date of the CRF19 MRCA on a data set that contained less CRF19 sequences and hence should be regarded as the upper bound of the recombination date. Our result leaves both a recombination in Cuba and a recombination prior to introduction to Cuba possible (as Cubans were present in sub-Saharan Africa and especially DR Congo since the 1960s ). Recombination in Cuba was proposed by Delatorre and Bello for other prevalent HIV-1 recombinant forms (CRF20_BG, CRF23_BG, CRF24_BG), which probably emerged between 1996–1998 from a single CRF_BG ancestor dated in 1991. However, for CRF19 it seems unlikely that the recombination event took place in Cuba considering that (1) we traced MRCA for CRF19 back to 1977 or before, (2) the first Cuban HIV-1 patients were not diagnosed until 1986, (3) the non-recombinant A1/D/G-subtype Cuban sequences in our data sets were estimated to be introduced to Cuba after 1977 (1982 for G, 1996 for A1 and in the 2010s for D), while (4) recombination had to occur in a scenario of co-circulation of subtypes A, D, and G. Moreover, countries from sub-Saharan Africa were proposed to be sites of origin for other AG and AD recombinants within the same period of time [26, 27]. All together, we can conclude that CRF19 was among the very first CRFs/subtypes to be introduced in Cuba, by the mid of 1970s. Other subtypes have undergone the same journey, from Central Africa to the Caribbean, for example subtype B, which was present in Haiti since the end of the 1960s . However, while subtype B rapidly reached the USA and then Europe and the rest of the world, CRF19 stayed in Cuba, where it represents a major issue due to its prevalence and short time to AIDS.
To our knowledge, this is the first study which attempts to predict the origin of HIV-1 epidemic (and CRF19) by Cuban province employing sequence data. The phylogeography analysis detected that La Habana and Villa Clara played a major role in epidemic spread. We estimated that CRF19 was initially introduced to Villa Clara in the late 1970s (marginal probability of 0.92), had multiple introductions to La Habana in 1980s and 1990s and spread from there to other provinces. In agreement with our results, Dhanjal et al.  employed a series of graph algorithms to predict the evolution of the HIV/AIDS network in Cuba between the years 1986 and 2004. These authors suggested that at the end of 1986 epidemic was mostly located in La Habana, with 31 detections, and pointed out that Villa Clara and Sancti Spiritus were among the first provinces showing an increase in the detection rate. A sudden growth in detection in La Habana was expected given the high population density of this city and the difficulties to control the epidemic. In addition, an analysis performed by Miranda et al.  showed that, until 2010, Villa Clara remained among the provinces with a high incidence rate.
Our analysis on the transmission mode recreates some of the Cuban HIV-1 epidemic features such as a rapid spread among MSM community and multiple introductions to HET population (16% of all transmissions with MSM source, 50% of which were to females), indicating potential importance of bisexual individuals in the epidemic spread. According to epidemiological data, Cuban epidemic started as a mainly HET epidemic, after 1989 detection prevailed in homo/bi-sexuals, however, by 2002, a sudden rise in HET group occurred, ultimately showing a tendency to parity [21, 22]. In our reconstruction, CRF19 epidemic in Cuba followed a similar pattern and started in HET group, which agrees with the Pérez et al.  finding that among 28 individuals diagnosed with HIV between 1986 and 1990, 25 were HET and were infected with a high diversity of genetic forms.
Lastly, while confirming the high incidence of pre-treatment drug resistance (∼50%), we did not detect almost any onward transmission of drug resistance between treatment-naive individuals, apart from several small TDR clusters for RT:M41L, a DRM which is associated with resistance to AZT, the first ARV accepted in Cuba (in 1987). One of the reasons why we could not trace almost any naive-to-naive TDR might be because the sequences employed in this study were obtained as a result of Drug Resistance Surveillance in Cuba, and not the result of contact tracing studies. Several studies on the prevalence of TDR and a national survey from 2017 [25, 30] also showed that, despite a high drug resistance incidence (30% pre-treatment drug resistance), no clusters of TDR have been identified in Cuba to date for any HIV-1 subtype. However, epidemiological data combined with a small genetic distance among sequences from different patients has evidenced that, indeed, transmission of resistant HIV-1 genetic variants among naive individuals occurs in Cuba. This was almost invisible in our data, most likely due to their scarcity (<10% of infected individuals).
Considering all this, we suggest that CRF19 recombined in the Cuban community present in sub-Saharan Africa since the 1960s, most probably in the DRC, as a consequence of the co-circulation of subtypes A, D and G. It was introduced in Cuba very early, in the 1970s, and started to spread from a single transmission event, first in Villa Clara and then in the 1980s in La Habana. This early introduction to Cuba, which is earlier than the introduction of A, D and G subtypes, for example, could explain the local success of CRF19, which had no particular success in other parts of the world. This local success would thus result from a founder effect rather than some transmission advantage of CRF19, as supported by the relatively stable incidence of CRF19 through time: 18.3% in 2007 (sampled in 2003), 27.5% in 2012 (diagnosed in 2009–2010), 24.1% in 2019 (sampled in 2017). However, importantly, the fact that the time to AIDS with CRF19 is substantially lower than with other subtypes and CRFs (3 years versus 5–10 years ), makes it crucial to survey CRF19 sub-epidemics not only in Cuba but also in other parts of the world having regular exchanges with Cuba.
Materials and methods
Alignment and subtyping
LA dataset contained 3884 aligned and subtyped sequences of all M-group HIV-1 subtypes and recombinants. CU contained 1674 partial pol and 164 partial env sequences (of various subtypes) recently sampled in Cuba. We subtyped (pure subtypes and recombination positions) and aligned them against LA using jpHMM  (for detailed options see Analysis pipelines). When several sequences were present for the same patient in CU, we kept only the first one (in terms of sampling date). Most of the sequences in the CU data set were pre-annotated with a subtype; we replaced these annotations with jpHMM predictions in cases when they were contradictory (e.g. not matching jpHMM breakpoints for CRF19). All together, we obtained a large MSA of 5707 sequences, including both LA and CU, from which we extracted our D+CRF19, A1+CRF19 and G+CRF19 data sets (see Table 1 for details).
LA alignment contained country and sampling date annotations. The sequences from CU were also annotated with the province and date of diagnostics (which sometimes preceded sampling by several years), patient gender, risk groups (HET or MSM) and treatment status (treated or treatment-naive). For the DRM analysis we extracted the presence/absence of the SDRMs and their association with antiretroviral regiments with Sierra web service  (for detailed options see Analysis pipelines)).
For each data set we reconstructed a phylogenetic tree using RAxML-NG (v0.9.0, evolutionary model GTR+G6+FO+IO, for detailed options see Analysis pipelines)  and rooted it with an outgroup (5 LA sequences of a different subtype, B for D+CRF19, A6 for A1+CRF19, A for G+CRF19, which were then removed).
We then dated each tree with LSD2  (v126.96.36.199: github.com/tothuhien/lsd2/tree/v188.8.131.52, under strict molecular clock with outlier removal, for detailed options see Analysis pipelines) using tip sampling dates, and constraints based on the diagnostics dates. To assess the robustness of the temporal predictions, we additionally performed the dating of the CRF19 subtree by BEAST v1.10.4 [34, 35] with an uncorrelated lognormal relaxed molecular clock model  (see S1 Appendix.): the results (see Table 3) were fully compatible with those obtained by LSD2, however much longer to obtain (∼3 sec on 1-core machine for LSD2 vs 2, 36 days on a 96-core machine for the combined analysis (see S1 Appendix.) in BEAST).
As a phylogenetic tree is a proxy for a transmission tree with incomplete sampling, its branches correspond to patients and its internal nodes correspond to virus transmissions from a donor patient (parent branch and one of the child branches) to a recipient patient (the other child branch). As the recipient patient must have been infected (internal node) before being diagnosed, we put constraints for the dates of the internal nodes that had two tip children, at the dates of diagnostics of their recipients. As one cannot distinguish between the recipient and the donor branches in the tree (see Fig 5 for more details), we picked the more recent diagnostics date between the two and imposed the transmission node date to be before this diagnostics date. The temporal constraints on internal nodes are incorporated in LSD2 using the active-set method in the same way as it was used in LSD  to ensure the temporal precedence on ancestral nodes. This additional information, combined with the use of a strict molecular clock model, likely explain the narrower LSD2 confidence intervals, compared to BEAST 95% Highest Probability Density (HPD) ones.
An example internal node that has two tip children, A and B, and their sampling and diagnostics dates are shown on the left. As the internal node corresponds to a transmission, three scenarios are possible: (middle left) A is the donor, therefore the transmission must have happened before the recipient (B) got diagnosed; (middle right) B is the donor, therefore the transmission must have happened before the recipient (A) got diagnosed; (right) the donor was another individual that was not sampled and does not appear in the tree, therefore that transmission must have happened before both A and B got diagnosed. In practice, however, we cannot know which of the three scenarios is the correct one and therefore have to pick the more recent diagnostics date (B, Sep 2014) as the upper-bound constraint for the internal node.
Ancestral character reconstruction (ACR)
We analysed geographic origin and transmission patterns of the CRF19 epidemic worldwide and in Cuba using ACR with PastML tool (v1.9.20, for detailed options see Analysis pipelines) . PastML infers an ancestral scenario for a given character (e.g. location) on a rooted phylogenetic tree, by assigning the likely ancestral character states to every node in the tree, based on the character annotations provided for some of the tree nodes (typically tips). PastML implements a very fast (the analysis of transmission mode and DRMs took <1 minute) ACR in maximum likelihood framework, under an s-state (where s is the number of possible character states, e.g. countries) F81-like  model, and can handle very large phylogenies (dozens of thousands of tips), while being robust to mild sampling bias and model violations . With marginal posterior probabilities approximation (MPPA) method a unique state is predicted in tree nodes with low uncertainty, whereas several states are predicted if they have similar marginal probabilities.
For phylogeographic analysis we reconstructed ancestral scenarios for country annotation on the full trees reconstructed from the three data sets. Additionally, we reconstructed ancestral characters for Cuban provinces on the CRF19 subtree of the D+CRF19 tree (which included most of the CRF19 sequences in our data sets). As for the province-level analysis the annotations were specified at the date of diagnostics, which often preceded sampling, we created additional internal nodes in the tree, corresponding to diagnostics, named them, and associated the input province annotations for PastML with them instead of tree tips (see Fig 6). To add the diagnostics nodes, we processed each tip T whose patient’s diagnostics date diag(T) preceded the sampling date d(T)>diag(T). If the diagnostics happened after the last transmission (i.e. its parent node P, where d(P)<diag(T)), we split the tip branch into two parts with a one-child internal node D at the time of diagnostics (d(D) = diag(T)): The branch length of D became dist(D) = diag(T)−d(P), while the branch length of T became dist(T) = d(T)−diag(T). Otherwise (when the diagnostics happened before the last transmission diag(T)<d(P), which meant that the parent branch corresponded to the same (donor) patient as the tip), we processed the parent node (or recursively the grandparent node, etc.) instead, in a similar manner.
To reconstruct the ancestral states for characters annotated at diagnostics dates (such as provinces of Cuba), we added additional nodes to the time-scaled trees at times of diagnostics, named those nodes (on the middle panel, the diagnostics nodes, e.g. a, correspond to the tips with the same names but capitalised, e.g. A), and associated the input annotations for PastML with them. We then reconstructed ancestral and tip characters with PastML based on those annotations (see text for calculation details).
PastML allows for input annotations to be associated with any node in the tree (not only tips), and infers the states of the other nodes (including unannotated tips) during ACR. The computation of the Up and Down-likelihoods for a one-child node D being in a state i is a corner case of the formulas described in the “Materials and Methods” section of the original PastML publication : Down(D, i) = ∑j PC(i → j/dist(T))Down(T, j), Up(D, i) = ∑j PC(i → j/dist(D))Up(P, j). If an internal node’s state is provided in the input annotation file, then the Down- and Up-likelihoods of it being in any other state are reset to 0. The rest of the formulas from  are directly applicable.
To further understand the Cuban CRF19 epidemic, we also reconstructed ancestral scenarios for transmission mode (i.e. MSM or HET). The patients in the CU data set were annotated with their risk group (MSM or HET), and we used these data as a proxy for their potential HIV transmission mode at the moment of diagnostic and sampling: We associated the same annotations to both the corresponding single-child internal nodes (representing diagnostics) and to the tips (representing sampling). Consistently, the transmission mode is the same for the donor and the recipient at the moment of transmission (internal nodes of the tree), but can change with time for the same person. Note, moreover, that the MSM/HET status is uncertain, as some MSM might be bisexual or declared as HET .
Lastly, we analysed DRM presence/absence on the CRF19 subtree of the D+CRF19 tree as the D-part of CRF19 includes the parts of the virus genome (PR, RT) that may contain DRMs. We analysed the DRMs that had a prevalence >10%: RT:M184V, RT:K103N, RT:Y181C, RT:G190A, RT:M41L, RT:L210W, RT:T215Y, RT:D67N, RT:K70R. To reconstruct the ancestral character states for each DRM, we used a procedure visualised in Fig 7. We checked the ARVs that can provoke the DRM and the dates of their introduction to Cuba (e.g. RT:M41L is associated with resistance to AZT, which was the first ARV for HIV and was used in Cuba since 1987, to D4T and potentially to DDI, both of which were used in Cuba since 2001). We then cut the time-scaled tree (with named internal nodes) at the earliest of these dates (e.g. 1987), therefore obtaining the pre-treatment-introduction tree and a forest of post-treatment-introduction subtrees. For the trees in the forest we added additional one-child root nodes (as parents of the corresponding tree roots, at distances that corresponded to the differences between the root dates and the ARV introduction date), which we marked as sensitive in the PastML input annotation file. We performed the ACR with PastML on the forest (functionality that we added to PastML since version 1.9.19): the forest likelihood is calculated by multiplying the likelihoods of all its trees. For the new roots their Up-likelihood was set to 1 for the sensitive state and to 0 for the resistant state, and the Down-likelihood calculated as described above. Once the forest ACR was performed we combined it with the all-sensitive annotations for the pre-treatment-introduction tree nodes.
To reconstruct the ancestral character states, resistant (violet) or sensitive (light-green), for a DRM (e.g. RT:M41L), we cut the time-scaled tree at the date of introduction to Cuba of the first ARV (as, for example, AZT for the DRM RT:M41L, used in Cuba since 1987) that can provoke this DRM (left panel), obtaining the pre-treatment-introduction tree (upper part of the tree) and a forest of post-treatment-introduction subtrees (bottom part). For the trees in the forest we then marked their roots as sensitive (middle left panel). We performed the ACR with PastML on the forest (middle right panel) and combined the results with the the all-sensitive annotation for the pre-treatment-introduction tree nodes (right panel).
To assess the robustness of our ACR analyses, we performed the Bayesian diffusion in discrete space analysis  of the mode of transmission and presence/absence of RT:M41L DRM on the CRF19 subtree by BEAST v1.10.4 [34, 35] (see S1 Appendix.). The results were compatible for drug resistance, and showed the advantage of using rich metadata for the mode of transmission: To predict transmission mode PastML used additional information (risk group at the moment of diagnostics, set to the same value as at sampling), which we could not input into BEAST as the states at diagnostics could only be specified on a dated tree. PastML inferred HET at the origin of the epidemic (marginal probability of 0.89), which is in a better agreement with the epidemiological data [21, 22] than the BEAST result (MSM with posterior probability of 0.99).
Snakemake  pipelines and ad hoc Python3 scripts used for the analyses described above are available on github.com/evolbioinfo/HIV1-CRF19_Cuba. Along with the subtyping, tree reconstruction, dating and ACR tools mentioned above, we used goalign (v0.3.1, github.com/evolbioinfo/goalign) for basic sequence alignment manipulations (counting the number of sequences and the alignment length, removing positions containing only gaps, etc.), gotree (v0.3.0b, github.com/evolbioinfo/gotree) and ETE3  for basic tree manipulations (format conversion, pruning, etc.), and hdx-python-country Python3 package (v2.3.5, github.com/OCHA-DAP/hdx-python-country) for cleaning and enriching location metadata.
S1 Appendix. Bayesian analyses.
The detailed description of the Bayesian analyses performed in this study.
S1 Fig. Bayesian diffusion in discrete space analysis for the presence of RT:M41L DRM in the CRF19-infected individuals.
Analysis is done with BEAST [20, 34, 35], the visualisation of the maximum clade credibility (MCC) tree is performed with iTOL . Nodes and branches are coloured by RT:M41L (violet for resistant (the mutation is present), light-green for sensitive). The colourstrip around the tree shows the treatment status of the sampled individuals (dark-green for treatment-naïve, lilac for treatment-experienced).
S2 Fig. Bayesian diffusion in discrete space analysis for the transmission mode of CRF19.
Analysis is done with BEAST [20, 34, 35], the visualisation of the maximum clade credibility (MCC) tree is performed with iTOL . Nodes and branches are coloured by transmission mode (green for HET, beige for MSM). The colourstrip around the tree shows the gender of the sampled individuals (red for female, blue for male), and we see several transmissions from the MSM community to female HET individuals.
S1 File. Metadata.
Metadata table for the LA and CU sequences used in this study (tab-separated format). It contains the following columns:
- id—identifier of the sequence used in these analyses (corresponds to the tree tips);
- POL Accession Number—GenBank  accession number for the pol part of the sequence;
- ENV Accession Number—GenBank accession number for the env part of the sequence;
- source—source of the data: LA or CU;
- province_of_diagnostics—province of diagnostics for CU sequences, abroad for LA;
- country_code—ISO2 code for the country of sampling;
- country—country of sampling;
- intregion—generalized location of sampling, includes the following values: Australia, Cuba, Eastern Africa, Eastern Asia, Middle Africa, Northern America, Northern Europe, Russia, South America, Southern Africa, Southern Asia, Southern Europe, Western Africa, Western Asia, Western Europe;
- subregion—generalized location of sampling, includes the following values: Asia, Australia and New Zealand, Cuba, Europe, Northern America, Russia, South America, Sub-Saharan Africa;
- region—generalized location of sampling, includes the following values: Africa, Americas, Asia, Cuba, Europe, Oceania;
- sample_date—date of sampling;
- diagnostics_date—date of diagnostics;
- gender—gender of the infected individual: F or M;
- risk_group—declared risk group: Bisexual, HT (heterosexual), or MSM (men-who-have-sex-with-men);
- treated—treatment status: naive or treated;
- subtype_annotated—pre-annotated HIV-1 subtype;
- subtype_jpHMM—HIV-1 subtype detected by jpHMM;
- subtype_consensus—HIV-1 subtype annotations used in this study: pre-annotated ones when compatible with jpHMM (e.g. matching jpHMM breakpoints for CRF19), otherwise jpHMM prediction;
- subtype_tree—HIV-1 subtype detected by the phylogenetic trees in this study (same as consensus for the majority of the sequences, see Phylogenetic trees);
- subtype_CRF_19_D—whether the sequence is compatible with CRF19 in its D part according to jpHMM: D if yes, empty otherwise;
- subtype_CRF_19_G—whether the sequence is compatible with CRF19 in its G part according to jpHMM: G if yes, empty otherwise;
- subtype_CRF_19_A1—whether the sequence is compatible with CRF19 in its A1 part according to jpHMM: A1 if yes, empty otherwise.
The authors would like to thank Dr. Gonzalo Bello for his helpful advice on phylogeography of CRF19.
- 1. Foley BT, Korber BTM, Leitner TK, Apetrei C, Hahn B, Mizrachi I, et al. HIV Sequence Compendium 2018. Los Alamos, NM (United States): Los Alamos National Laboratory (LANL); 2018. Available from: http://www.osti.gov/servlets/purl/1458915/.
- 2. Lau KA, Wong JJL. Current trends of HIV recombination worldwide. Infectious Disease Reports. 2013;5(SUPPL.1):15–20.
- 3. Casado G, Thomson MM, Sierra M, Nájera R. Identification of a novel HIV-1 circulating ADG intersubtype recombinant form (CRF19_cpx) in Cuba. Journal of acquired immune deficiency syndromes (1999). 2005;40(5):532–7.
- 4. Pérez L, Thomson MM, Bleda MJ, Aragonés C, González Z, Pérez J, et al. HIV type 1 molecular epidemiology in Cuba: High genetic diversity, frequent mosaicism, and recent expansion of BG intersubtype recombinant forms. AIDS Research and Human Retroviruses. 2006;22(8):724–733. pmid:16910827
- 5. Cuevas MT, Ruibal I, Villahermosa ML, Díaz H, Delgado E, Parga EVd, et al. High HIV-1 genetic diversity in Cuba. AIDS. 2002;16(12):1643–1653. pmid:12172086
- 6. Kouri V, Khouri R, Alemán Y, Abrahantes Y, Vercauteren J, Pineda-Peña AC, et al. CRF19_cpx is an Evolutionary fit HIV-1 Variant Strongly Associated With Rapid Progression to AIDS in Cuba. EBioMedicine. 2015;2(3):244–54. pmid:26137563
- 7. Kouri V, Alemán Y, Pérez L, Pérez J, Fonseca C, Correa C, et al. High frequency of antiviral drug resistance and non-b subtypes in HIV-1 patients failing antiviral therapy in Cuba. Journal of the International AIDS Society. 2014;17:19754. pmid:25397499
- 8. González-Domenech CM, Viciana I, Delaye L, Mayorga ML, Palacios R, de la Torre J, et al. Emergence as an outbreak of the HIV-1 CRF19_cpx variant in treatment-naïve patients in southern Spain. PLOS ONE. 2018;13(1):e0190544. pmid:29309418
- 9. Pérez-Parra S, Álvarez M, Fernandez-Caballero JA, Pérez AB, Santos J, Bisbal O, et al. Continued propagation of the CRF19_cpx variant among HIV-positive MSM patients in Spain. Journal of Antimicrobial Chemotherapy. 2018;73(4):1031–1038. pmid:29325134
- 10. Patiño Galindo JA, Torres-Puente M, Gimeno C, Ortega E, Navarro D, Galindo MJ, et al. Expansion of the CRF19_cpx Variant in Spain. Journal of Clinical Virology. 2015;69:146–149. pmid:26209397
- 11. El Moussi A, Thomson MM, Delgado E, Cuevas MT, Nasr M, Abid S, et al. Genetic Diversity of HIV-1 in Tunisia. AIDS Research and Human Retroviruses. 2017;33(1):77–81. pmid:27473255
- 12. Delatorre E, Bello G. Phylodynamics of the HIV-1 epidemic in Cuba. PloS one. 2013;8(9):e72448.
- 13. Mourad R, Chevennet F, Dunn DT, Fearnhill E, Delpech V, Asboe D, et al. A phylotype-based analysis highlights the role of drug-naive HIV-positive individuals in the transmission of antiretroviral resistance in the UK. AIDS (London, England). 2015;29(15):1917–25. pmid:26355570
- 14. Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4(5):e88.
- 15. Foley BT, Korber BTM, Leitner TK, Apetrei C, Hahn B, Mizrachi I, et al. HIV Sequence Compendium 2018. Los Alamos, NM (United States): Los Alamos National Laboratory (LANL); 2018.
- 16. To TH, Jung M, Lycett S, Gascuel O. Fast Dating Using Least-Squares Criteria and Algorithms. Systematic biology. 2016;65(1):82–97.
- 17. Penrose RHL. Knowledge of the Geographical Spread and Prevalence of HIV/AIDS 1. In: The Penrose Inquiry. Final Report. Edinburgh: APS Group Scotland; 2015. p. 415–448. Available from: http://www.penroseinquiry.org.uk/finalreport/text/354876_chapter_9.html.
- 18. Ishikawa SA, Zhukova A, Iwasaki W, Gascuel O. A Fast Likelihood Method to Reconstruct and Visualize Ancestral Scenarios. Molecular Biology and Evolution. 2019;. pmid:31127303
- 19. Letunic I, Bork P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics. 2007;23(1):127–128.
- 20. Lemey P, Rambaut A, Drummond AJ, Suchard MA. Bayesian phylogeography finds its roots. PLoS Computational Biology. 2009;5(9):1000520.
- 21. Perez Avila J, Pena Torres R, Pena Tores R, Joane Fiol J, Lantero Abreu M, Arazoza Rodriguez H. HIV control in Cuba. Biomedicine and Pharmacotherapy. 1996;50(5):216–219.
- 22. Miranda Gómez O, Fariñas Reinoso AT, Coutín Marie G, Nápoles Pérez M, Lara Fernández H, Bueno Marrero LE. Panorama of HIV-infection in Cuba, 1986-2007; 2009. 2. Available from: http://scielo.sld.cu/scielo.php?script=sci_arttext&pid=S1561-30032009000200004&lng=es&nrm=iso&tlng=es.
- 23. World Health Organization. Updated recommendations on first-line and second-line antiretroviral regimens and post-exposure prophylaxis and recommendations on early infant diagnosis of HIV. World Health Organization; 2018. Available from: https://www.who.int/publications/i/item/WHO-CDS-HIV-18.51.
- 24. Boyer PL, Das K, Arnold E, Hughes SH. Analysis of the Zidovudine Resistance Mutations T215Y, M41L, and L210W in HIV-1 Reverse Transcriptase. Antimicrobial Agents and Chemotherapy. 2015;59(12):7184–7196.
- 25. Machado LY, Blanco M, López LS, Díaz HM, Dubed M, Valdés N, et al. National survey of pre-treatment HIV drug resistance in Cuban patients. PLOS ONE. 2019;14(9):e0221879. pmid:31479466
- 26. Carr JK, Salminen MO, Albert J, Sanders-Buell E, Gotte D, Birx DL, et al. Full genome sequences of human immunodeficiency virus type 1 subtypes G and A/G intersubtype recombinants. Virology. 1998;247(1):22–31. pmid:9683568
- 27. Thomson MM, Casado G, Posada D, Sierra M, Nájera R. Identification of a novel HIV-1 complex circulating recombinant form (CRF18_cpx) of Central African origin in Cuba. AIDS. 2005;19(11):1155–1163.
- 28. Gilbert MTP, Rambaut A, Wlasiuk G, Spira TJ, Pitchenik AE, Worobey M. The emergence of HIV/AIDS in the Americas and beyond. Proceedings of the National Academy of Sciences of the United States of America. 2007;104(47):18566–18570.
- 29. Dhanjal C, Clemencon S, De Arazoza H, Rossi F, Tran VC. The Evolution of the Cuban HIV/AIDS Network; 2011. Available from: http://arxiv.org/abs/1109.2499.
- 30. Pérez L, Kourí V, Alemán Y, Abrahantes Y, Correa C, Aragonés C, et al. Antiretroviral drug resistance in HIV-1 therapy-naive patients in Cuba. Infection, Genetics and Evolution. 2013;16:144–150. pmid:23416260
- 31. Schultz AK, Zhang M, Leitner T, Kuiken C, Korber B, Morgenstern B, et al. A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes. BMC Bioinformatics. 2006;7(1):265. pmid:16716226
- 32. Liu TF, Shafer RW. Web Resources for HIV Type 1 Genotypic-Resistance Test Interpretation. Clinical Infectious Diseases. 2006;42(11):1608–1618.
- 33. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019. pmid:31070718
- 34. Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution. 2018;4(1). pmid:29942656
- 35. Ayres DL, Darling A, Zwickl DJ, Beerli P, Holder MT, Lewis PO, et al. BEAGLE: An application programming interface and high-performance computing library for statistical phylogenetics. Systematic Biology. 2012;61(1):170–173. pmid:21963610
- 36. Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution. 1981;17(6):368–376.
- 37. Jung M, Leye N, Vidal N, Fargette D, Diop H, Toure Kane C, et al. The Origin and Evolutionary History of HIV-1 Subtype C in Senegal. PLoS ONE. 2012;7(3):e33579. pmid:22470456
- 38. Köster J, Rahmann S. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2522.
- 39. Huerta-Cepas J, Serra F, Bork P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution. 2016;33(6):1635–1638.
- 40. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic Acids Research. 2012;41(D1):D36—-D42. pmid:23193287