Reconstruction of the spatial and temporal dynamics of hepatitis B virus genotype D in the Americas

Hepatitis B virus (HBV) genotype D (HBV/D) is globally widespread, and ten subgenotypes (D1 to D10) showing distinct geographic distributions have been described to date. The evolutionary history of HBV/D and its subgenotypes, for which few complete genome sequences are available, in the Americas is not well understood. The main objective of the current study was to determine the full-length genomic sequences of HBV/D isolates from Brazil and frequency, origin and spread of HBV/D subgenotypes in the Americas. Complete HBV/D genomes isolated from 39 Brazilian patients infected with subgenotypes D1 (n = 1), D2 (n = 10), D3 (n = 27), and D4 (n = 1) were sequenced and analyzed together with reference sequences using the Bayesian coalescent and phylogeographic framework. A search for HBV/D sequences available in GenBank revealed 209 complete and 926 partial genomes from American countries (Argentina, Brazil, Canada, Chile, Colombia, Cuba, Haiti, Martinique, Mexico, USA and Venezuela), with the major circulating subgenotypes identified as D1 (26%), D2 (17%), D3 (36%), D4 (21%), and D7 (1%) within the continent. The detailed evolutionary history of HBV/D in the Americas was investigated by using different evolutionary time scales. Spatiotemporal reconstruction analyses using short-term substitution rates suggested times of the most recent common ancestor for the American HBV/D subgenotypes coincident with mass migratory movements to Americas during the 19th and 20th centuries. In particular, significant linkages between Argentina and Syria (D1), Brazil and Central/Eastern Europe (D2), USA and India (D2), and Brazil and Southern Europe (D3) were estimated, consistent with historical and epidemiological data.


Introduction
Despite the availability of an effective vaccine and potent antiviral treatment, hepatitis B virus (HBV) infection remains a major public health issue, affecting 257 million people worldwide [1,2]. HBV contains a partially double-stranded DNA genome~3,200 nucleotides (nt) in length. Based on genomic sequence divergence of > 8%, HBV isolates have been classified into eight genotypes (HBV/A to H) [3][4][5][6][7], further to which two putative genotypes, I and J, have been proposed [8,9]. The significant diversity within specific HBV genotypes has led to further classification into numerous subgenotypes [10,11]. HBV genotypes and subgenotypes have distinct geographic distributions and are associated with differences in disease progression, response to antiviral therapy, and clinical outcome [12][13][14][15]. In particular, HBV/D is distributed worldwide with predominance in Southeastern Europe, the Mediterranean Basin, the Middle East, and the Indian sub-continent [16]. It has the shortest genome (3,182 nt) and is characterized by a 33 nt deletion at the beginning of the pre-S1 region. Ten subgenotypes (D1 to D10) have been described to date [11,17], with further corrections and novel classifications of HBV/D subgenotypes reported elsewhere [11,16,18,19]. Relative to HBV/A, HBV/D is associated with poorer clinical outcomes for cirrhosis and hepatocellular carcinoma and lower response to interferon alpha [14].
The origin and evolutionary history of HBV are still controversial and a wide range of HBV substitution rates have been estimated depending on the calibration approach used for the calculation [20,21]. Studies using external calibrations based on human migrations find slower substitution rates [22,23], while rates estimated using tip-dating analyses of heterochronous HBV sequences tend to be faster [24,25]. Accordingly, previous reports on the times of origin and divergence of HBV/D and its subgenotypes have suggested different estimates [23,24,[26][27][28][29]. In the Americas, HBV/D is found across the continent [16,[30][31][32][33][34][35], although the detailed evolutionary history and phylogeography of this genotype have not yet been examined. In addition, few HBV/D complete genome sequences from Brazil are available in the databanks, limiting the contribution of Brazilian isolates to phylogenetic and phylogeographic studies. The main objectives of this study were to determine the full-length genomic sequences of HBV/D isolates from Brazil, examine the proportion and distribution of HBV/D subgenotypes in American countries, and reconstruct the spatial and temporal diversification of HBV/D in the Americas.

Ethics statement
The study protocol was approved by the Brazilian Ethics Committee for Medical Research (CONEP nº 9604/2004) and the Ethics Committee of Oswaldo Cruz Institute (nº 1.358.935), and all patients signed informed consent before participation. All experiments were performed in accordance with the relevant guidelines and regulations.

Serum samples
Thirty-nine retrospective HBsAg-positive serum samples (from 2003 to 2013) previously characterized as HBV/D-containing strains, collected from different geographical regions of Brazil (North East, n = 3; Central-West, n = 16; South East, n = 4; and South, n = 16), were selected for HBV complete genome amplification.

Viral DNA extraction and PCR amplification
HBV complete genome sequences was attempted with a protocol modified from Günther et al., 1995 [36] using 4 μL of the nucleic acid template, primers P1 (5'-CCGGAAAGCTTGA GCTCTTCTTTTTCACCTCTGCCTAATCA-3') and P2 (5'-CCGGAAAGCTTGAGCTCTTCA AAAAGTTGCATGGTGCTGG-3'), and the following PCR profile: denaturation at 94˚C for 4 min followed by 10 cycles at 94˚C for 40 s, 55˚C for 1 min, and 72˚C for 3 min; 10 cycles of 94˚C for 40 s, 60˚C for 1 min, and 72˚C for 5 min; 10 cycles of 94˚C for 40 s, 62˚C for 1 min, and 72˚C for 7 min; 10 cycles of 94˚C for 40 s, 62˚C for 1 min, and 72˚C for 9 min; and a final extension step at 72˚C for 10 min. The PCR assay was performed using Platinum Taq DNA polymerase and supplied reagents (Invitrogen, Carlsbad, CA) in accordance with product instructions.

Nucleotide sequencing
PCR products were purified using the Wizard SV Gel and PCR Clean-Up System (Promega, Madison, WI). HBV full-length genome sequences were determined via direct sequencing using a BigDye Terminator Kit v3.1 (Applied Biosystems, Foster City, CA), and sequencing reactions analyzed on an ABI3730xl automated sequencer (Applied Biosystems). Inter and intragenotypic recombination were investigated using SimPlot v3. 5

Phylogenetic analysis
Multiple sequence alignment was performed using the Muscle program with 66 HBV complete genome sequences (39 sequences determined in this study and 27 reference sequences for HBV genotypes A to J) and subsequently subjected to Maximum Likelihood (ML) phylogenetic analysis. The ML phylogenetic tree was inferred with the online version of the PhyML program [40] under the GTR + I + G nucleotide substitution model selected with SMS (Smart Model Selection in PhyML) [41]. A heuristic tree search was performed with the aid of the SPR branch-swapping algorithm and the reliability of phylogenies estimated with the approximate likelihood-ratio test [42] based on a Shimodaira-Hasegawa-like procedure (SH-aLRT).

HBV/D sequence dataset from the Americas
To identify the HBV/D subgenotypes circulating in the Americas and determine their proportions in individual countries, a search for HBV/D sequences from the continent available in GenBank by December 2018 was performed using the following queries (for each American country): Sequences without a specified subgenotype but more than 600 nt in length were submitted for HBV subgenotyping via phylogenetic analysis and added to the dataset.

Bayesian phylogeographic analyses
A total of 421 full-length genome sequences from HBV/D subgenotypes D1, D2, D3 and D4 available in GenBank in December 2018 with known country of origin and collection date were grouped into four datasets for each subgenotype and selected for phylogeographic analyses (GenBank accession numbers available in S1 Table). In addition, 29 complete genomes sequenced in this study were included in the analyses. The number of sequences specified by subgenotype and location is shown in Table 1. To avoid bias from overrepresented countries, the online tool CD-HIT Suite (http://weizhongli-lab.org/cdhit_suite/cgi-bin/index.cgi?cmd= cd-hit-est) was used to group sequences with high similarity, and only one representative sequence of each group, prioritizing the one with the oldest collection date, maintained in the dataset.
In order to investigate the phylogenetic signal of the datasets, we carried out a likelihood mapping analysis of 10,000 random sets of four sequences (quartets) using IQ-tree v1.6.11 software [43]. In addition, the temporal structure of the datasets was assessed by conducting a regression of root-to-tip genetic distances against year of sampling using TempEst v1.

Bayesian reconstruction of time-scaled phylogeny and phylogeographic analyses
To estimate the time and epicenter of diversification of the HBV subgenotypes circulating in the Americas, Bayesian MCMC analyses were conducted on 450 (421 from GenBank and 29 from this study) complete genome sequences grouped into four datasets for subgenotypes D1, D2, D3 and D4. Ten D3 genomes sequenced here were excluded from analyses after using the online tool CD-HIT Suite to avoid bias from overrepresentation. Subgenotype D7 was not Polynesia 18 a Sequences generated in this study (n = 29) and from Genbank (n = 421) (accession numbers available in S1 Table). substitution rates using uninformative priors ranged from 7.17 × 10 −5 to 3.27 × 10 −4 s/s/y ( Table 2). The coefficient of rate variation for all subgenotypes was significantly higher than zero, supporting the use of a relaxed molecular clock model. Data from spatiotemporal reconstruction analyses using such substitution rates suggest that subgenotype D1 possibly originated in Syria (posterior state probability (PSP) = 0.98). From Syria, D1 migrated to Cuba (PSP = 0.58) and Brazil (PSP = 0.68), but it was not possible to determine the time of introduction, since only one D1 sequence was available from these countries. In Argentina, the most probable location for the origin of D1 was also Syria (PSP = 0.86) and its tMRCA calculated as 1982 (95% HPD: 1920-2006) ( Table 2; Fig 3A). Calculation of the BF using SPREAD software revealed significant epidemiological relationships (BF > 3) between Brazil/Lebanon and Argentina/Syria (Fig 4A).
The phylogeography of subgenotype D2 suggests that the most probable epicenter was Central/Eastern Europe (Estonia, Russia, Poland and Serbia) (PSP = 0.56). The majority of the Brazilian and Argentine sequences formed a single cluster (posterior probability (PP) = 0.99) whose tMRCA was dated back to 1969 (95% HPD: 1941-1984), with Central/Eastern Europe as the most probable source location (PSP = 0.81). On the other hand, sequences from USA grouped with Indian sequences, and at least two different introductions of this subgenotype seem to have occurred in USA. The main time of introduction was estimated between 1966 and 1978 from India (PSP = 0.99) ( Table 2; Fig 3B). BF calculation revealed strong epidemiological relationships between Brazil/Central/Eastern Europe, Brazil/Argentina, and USA/India (Fig 4B).
Unexpectedly, spatiotemporal reconstruction of subgenotype D4 highlighted Martinique (PSP = 0.33) as the most probable place of origin. All D4 sequences branched in country-specific monophyletic groups that probably arose between the middle 1800s and middle 1900s ( Table 2; Fig 3D).
Additionally, we tested alternative evolutionary time scales inferred by using informative priors based on two previously published substitution rates for HBV: (i) 1.18 x 10 −5 s/s/y (95% HPD interval: 8.04 x 10 −6 -1.51 x 10 −5 s/s/y) calculated on the basis of heterochronous sampling calibration using ancient sequences [49], and (ii) 3.7 x 10 −6 s/s/y (95% HPD interval: 2.2 x 10 −6 -5.5 x 10 −6 ) estimated by means of external calibrations based on human migrations [22]. As shown in Table 3, time-scale reconstructions under both substitution rates informative priors suggested, in the vast majority of cases, tMRCA estimates for HBV/D subgenotypes in the Americas dating back to the pre-Columbian era (before 1492).

Discussion
To our knowledge, the present study is the first to reconstruct the spatial and temporal dynamics of HBV/D in the Americas using a Bayesian framework. To this end, we conducted a survey of HBV/D sequences and assessed the distribution of five among the 10 HBV/D subgenotypes (D1, D2, D3, D4 and D7) throughout the continent (Argentina, Brazil, Canada, Chile, Colombia, Cuba, Haiti, Martinique, Mexico, USA, and Venezuela). Interestingly, Cuba was the only country in which all five subgenotypes were identified, which may be attributable to interactions of the island with different countries over the years [34]. Brazil is the largest country in the Southern Hemisphere corresponding to almost half of the area of South America. In Brazil, HBV/D was identified countrywide [52][53][54], with the highest rates in the Southern region where an intense flow of European immigrants had occurred [55,56]. Here, we introduced 39 HBV/D full-length sequences corresponding to 19% (39/209) of the complete HBV genomes from the Americas deposited in GenBank by December 2018. These new genomes represent all HBV/D subgenotypes (D1, D2, D3 and D4) circulating in Brazil as well as the first sequenced Brazilian D1 genome.
Recent findings from two studies using ancient samples pointed out that HBV has been infecting humans for at least 7,000 years [49,57]. Owing to the lack of agreement concerning the HBV substitution rate, the times of origin and divergence of HBV genotypes and subgenotypes are largely uncertain [21]. Accordingly, the results obtained from the phylodynamics and phylogeography of HBV should always be carefully analyzed in combination with historical and epidemiological knowledge. In this study, we reconstructed the evolutionary history of HBV/D in the Americas by using substitution rates estimated from the sampling dates of the sequences or by using rates previously estimated for HBV as informative priors for calibration of time-scale. The time-scale reconstructions based on ancient heterochronous sequences [49] and external calibration [22] (long-term substitution rates) recovered much older tMRCAs than those based on substitution rates directly estimated from modern heterochronous sequences (short-term substitution rates). Almost all tMRCAs obtained with long-term substitution rates precede the European discovery of the Americas (~500 years ago), which is incompatible with epidemiological and  historical data, since only HBV/F and HBV/H are thought to be genotypes originating in the Americas [3,23,[58][59][60]. On the other hand, spatiotemporal reconstruction using short-term substitution rates provided an epidemiologically realistic scenario, suggesting tMRCAs for the American HBV/D subgenotypes coincident with mass migratory movements to Americas during the 19 th and 20 th centuries. Therefore, our results corroborate the concept that tip-dating analyses of modern heterochronous HBV sequences are probably more appropriate in dating recent dispersal events [21,22]. Conversely, the use of deep calibrations is likely to be effective to estimate events that are distant in time, such as the origin of viral genotypes/subgenotypes.
Our phylogeographic reconstruction showed that Syria is the most likely location for the origin of subgenotype D1. Syrian D1 sequences were not available when previous studies suggested that this subgenotype originated in Turkey [24,26]. In fact, Syria and Turkey are geographically close and have historical links that tie the two neighboring countries together. The  phylogeographic data suggest that the Cuban D1 originated from Syria. According to Paredes (2000) [61], between the second half of the 19 th century and the first half of the 20 th century several thousand of people, mainly from Syria, Lebanon, Palestine, Turkey, and Egypt had migrated to Cuba. The introduction of D1 in Brazil was also suggested to have occurred from Syria. This proposal is reinforced by the fact that Brazil has the largest Arab colony outside the Middle East. It is estimated that about 7% of the Brazilian population is of Arab origin, the majority from Syria and Lebanon, and the migratory movement from these countries to Brazil was initially documented from the second half of the 19 th century [62,63]. Phylogeographic analysis of D1 sequences from Argentina suggested a single introduction from Syria during the early 1980s (95% HPD: 1920-2006). Similar to Brazil, Argentina received a large number of Arab immigrants until the mid-20 th century, mainly originating from Syria and Lebanon [64]. Spatiotemporal reconstruction of subgenotype D2 suggested that this lineage originates in Central/Eastern Europe (Estonia, Poland, Russia, and Serbia), in agreement with previous reports [24], highlighting Russia as the most probable location of origin. The clade containing most of the Brazilian and Argentine sequences had a tMRCA around the second half of the 20 th century (95% HPD: 1941-1984), with Central/Eastern Europe as the most probable place of origin. Although Latin America received the majority of immigrants from Southern Europe, there was also a considerable flow of migrants from Central, East and Southeastern European countries [65]. Between 1910 and 1929, more than 1.5 million immigrants from Central and Eastern Europe (mainly Poland, Romania, Russia, Lithuania, and Latvia) entered Brazil for employment in agriculture [66]. On the other hand, India was estimated to be the most likely location for origin of subgenotype D2 in USA. Notably, HBV/D is very prevalent in India, where at least five circulating subgenotypes have been identified (D1 to D5) [67][68][69] and the second largest group of immigrants in USA is from India [70].
Dispersal pathways of subgenotype D3 in the Americas were complex including different geographic regions and multiple introductions. Phylogeographic analysis suggested Southern Europe (Italy and Spain) (PSP = 0.54) as the root location of D3, and Brazil as the origin place of most dispersal pathways occurring from the late 18th century in the continent. Previous studies based on epidemiological and historical data have suggested that subgenotype D3 was transported to Brazil by European immigrants [55,56]. Of note, mass European emigration to the Americas took place from the early 19th to the mid-20th century, especially to the United States, Argentina, Canada, Brazil, Cuba, and Uruguay. About 13 million Europeans went to Latin America, most of them were Italians, Spaniards and Portuguese [65]. However, the lack of complete genome sequences representing all variations of European D3 may have limited our analysis, leading to indications of Brazil as the source location of most dispersal pathways in the continent.
The spatiotemporal reconstruction of subgenotype D4 suggested that Martinique is the most likely location of origin of this subgenotype. This result is incompatible with epidemiological and historical data of HBV in the Americas. In the American countries, D4 has been identified mainly in African descendant populations [31,33,34,71,72], with the exception of D4 strains found in individuals living in the Southwestern region of the Canadian Arctic [73]. Brazil and the Caribbean received a large number of African slaves between the 16 th and 19 th centuries [74,75]. Some earlier studies suggest an African origin for subgenotype D4 [31, 71], but none to date have performed Bayesian phylogeographic analysis. Notably, D4 has been described in Rwanda [76], Somalia [77], Kenya [78] and Ghana [79]. However, only one complete African D4 genome sequence is available (GenBank accession number KF922432), which may have limited our analysis leading to estimates of Martinique as the place of origin of this subgenotype. Based on the collective evidence, our hypothesis is that the D4 sequences from Martinique represent sequences from ancestral African populations that have introduced this subgenotype in the Americas.
In conclusion, this study highlights the differential distribution patterns and evolutionary dynamics of HBV/D in the Americas, supporting the utility of recently advanced phylogenetic tools in reconstructing the evolutionary history of HBV, and provides novel full-length HBV/ D genomic sequences, increasing the contribution of Brazilian isolates to ongoing phylogenetic and phylogeographic studies.
Supporting information S1 Table