High Prevalence of Hepatitis C Virus Genotype 1b Infection in a Small Town of Argentina. Phylogenetic and Bayesian Coalescent Analysis

Previous studies in Argentina have documented a general prevalence of Hepatitis C Virus (HCV) infection close to 2%. In addition, a high prevalence of HCV has been recently reported in different Argentinean small rural communities. In this work, we performed a study aimed at analyzing the origins and diversification patterns of an HCV outbreak in Wheelwright, a small rural town located in Santa Fe province (Argentina). A total of 89 out of 1814 blood samples collected from people living in Wheelwright, were positive for HCV infection. The highest prevalence (4.9%) was observed in people older than 50 years, with the highest level for the group aged between 70–79 years (22%). The RFLP analyses showed that 91% of the positive samples belonged to the HCV-1b genotype. The E1/E2 and NS5B genes were sequenced, and their phylogenetic analysis showed that the HCV-1b sequences from Wheelwright were monophyletic. Bayesian coalescent-based methods were used to estimate substitution rates and time of the most recent common ancestor (tMRCA). The mean estimated substitution rates and the tMRCA for E1/E2 with and without HVR1 and NS5B were 7.41E-03 s/s/y and 61 years, 5.05E-03 s/s/y and 58 years and 3.24E-03 s/s/y and 53 years, respectively. In summary, the tMRCA values, the demographic model with constant population size, and the fact that the highest prevalence of infection was observed in elder people support the hypothesis that the HCV-1b introduction in Wheelwright initially occurred at least five decades ago and that the early epidemic was characterized by a fast rate of virus transmission. The epidemic seems to have been controlled later on down to the standard transmission rates observed elsewhere.


Introduction
More than 170 million people are infected with Hepatitis C Virus (HCV) worldwide. Chronically infected patients who develop chronic hepatitis may progress to liver cirrhosis and have an increased risk of developing hepatocellular carcinoma (HCC) [1,2].
The HCV genome is a positive sense, single-stranded RNA molecule of around 9.6-kb. It encodes a single polyprotein that is proteolytically processed by a combination of cellular and viral proteases into structural and nonstructural proteins [1,3,4].
Based mainly on phylogenetic analyses, six major lineages, namely genotypes 1 to 6, have been identified. These groups are further subdivided into several subtypes [5,6]. HCV genotypes 1a, 1b and 3a are distributed worldwide as a result of HCV transmission through blood transfusion, use of inadequately sterilized medical equipment and intravenous drug use [7]. However, a non-negligible proportion of HCV infections have an ''undefined'' route of transmission.
Previous studies in Argentina have documented a general prevalence of HCV infection close to 2% (Consenso Argentino de Hepatitis C, 2007). In addition, a high prevalence of HCV has been recently reported in different Argentinean small rural communities [8][9][10][11][12]. In this work, we performed a study aimed at analyzing the origins and diversification patterns of an HCV outbreak in Wheelwright, a small rural town founded in 1900, of approximately 5800 inhabitants, located in Santa Fe province (Argentina).

Epidemiological Analysis
The prevalence of HCV infections was studied during 2004 in a total of 1814 individuals, which represented approximately 31% of Wheelwright's population. Out of these 1814 volunteers (median age 40618.6 years), 716 (39%) were male and 1098 (61%) were women. One hundred and seven individuals (5.9%) were reactive for antibodies to HCV by EIA, and 72 out of these were PCR positive for HCV RNA. The 35 EIA positive-PCR negative samples were positive in a second EIA, and only 17 out of 35 were LIA positive. Consequently, eighteen out of 107 (17%) should be considered false positives for EIA screening method. Altogether, these results indicate an overall HCV prevalence of 4.9% (89/ 1814) ( Table 1).
The highest prevalence of HCV infections was observed in the group of people older than 50 years, with the highest levels found in individuals between 60-69 years (17%) and between 70-79 years old (22%) ( Table 2).

Phylogenetic Analyses
The RFLP analyses indicated that genotype 1b was, by far, the most prevalent among the samples from Wheelwright. Of the 64 samples genotype 1b determined by RFLP, 55 were PCR amplified in the two regions (E1/E2 and NS5B). These 55 sequences were used to describe the origin and diversification of the virus during the infection process in Wheelwright. Two sequences were only amplified in one of the two regions and were not included in the analysis, while the remaining seven could not be amplified.
The phylogenetic analyses were performed on both the separate, and concatenated, sequences from the E1/E2 and NS5B genes. In the separate analyses, the data from both E1/E2 and NS5B supported the monophyletic nature of the genotype 1b sequence from Wheelwright but with low bootstrap supports (data not shown). However, using concatenated sequences, all the methods used [maximum likelihood (ML), distance Neighbour-Joining, (NJ) and parsimony (P)] supported the monophyletic nature of the Wheelwright clade with good bootstrap values. The Wheelwright clade had a bootstrap value of 85 in the ML phylogenetic tree (Figure 1), and a value of 81 in the NJ one (supplemental Figure S1). The Parsimony analyses gave eight equally good trees (length = 18022). The bootstrap support for the Wheelwright clade in the Parsimony analyses was 63 (supplemental Figure S2).

Molecular Evolutionary Rate and Divergence Times
The E1/E2 (both with and without HVR1) and NS5B partial sequence genes from the Wheelwright samples were used to estimate both the substitution rate and the tMRCA. Approximate marginal likelihoods of the different demographic models were calculated with enforced strict and relaxed molecular clocks (Table 3, 4 and 5).
The Bayes Factor (BF) analysis favored the relaxed exponential molecular clock and the constant population size over the other models for both the E1/E2 and NS5B genes. However, these models were not significant with respect to the relaxed exponential molecular clock with Bayesian Skyline Plot (BSP), and the relaxed lognormal molecular clock with expansion growth model for E1/ E2 with HVR1. Likewise, the relaxed lognormal molecular clock with expansion growth model was not significant for E1/E2 without HVR1. Based on these results, the relaxed exponential molecular clock and the constant population size (i.e., the models that use the fewest parameters) were selected to avoid overparameterization.
In correspondence with the parametric model selected by the BF analysis, the BSP model (which estimates past population dynamics through time independently of any fixed parametric model of demographic history) indicated that the population size was maintained over time for the three data sets, under the different molecular clock models. However, the BSP model gives a more recent tMRCA and faster substitution rates than the other demographic models.

Discussion
Previous studies have documented a general prevalence of HCV of close to 2% in Argentina (Consenso Argentino de Hepatitis C, 2007). There are few previous reports of outbreaks of high prevalence of HCV in Argentina [8][9][10][11]. In the present work, a high prevalence of 4.9% (89/1814) was observed, mostly in the group of people older than 50 years, with the highest value for the group of people aged between 70   and 79 (22%). In a similar epidemiological study, Picchio et al. reported a prevalence of 5.7% for a small rural community of Argentina, geographically close to Wheelwright [8]. In our investigation, as in other studies [8][9][10][13][14][15][16], the number of HCV-infected people coincides with the second pattern previously described by Wasley et al. where most infections are found among elder people [17]. A similar study performed in a small Sicilian town has been recently reported [14]. The authors also observed that intravenous drug addiction and sexual promiscuity are almost absent in this population, and the most probable HCV transmission route is the iatrogenic one. This type of transmission may also be the cause of the outbreak in Wheelwright.
The combination of phylogenetic analysis, molecular clock and demographic models forms a powerful framework to construct and test hypotheses about viral epidemics. In the phylogenetic analysis of the E1/E2 and NS5B regions extensively used for epidemiological studies [18][19][20], the trees obtained by different methods (ML, NJ and P) showed similar topologies, grouping the 55 HCV-1b-infected patients from Wheelwright in a single cluster. Taking the results of this study into account, it is reasonable to assume that all the genotype 1b Wheelwright sequences share a common ancestor, and that a single source of infection is responsible for the HCV epidemic spread.
Recently, a Bayesian relaxed-clock method that allows the implementation of sophisticated calibration methods has been published [21]. Using these methods, the divergence times and substitution rate were calculated for genotype 1b in Wheelwright. The exponential relaxed clock model and the constant population  size performed better for the E1/E2 and NS5B genes with Wheelwright samples. Using this model, the estimated mean substitution rate from the NS5B region was 3.24E-03 s/s/y. This value is higher than the mean substitution rate values observed by others [22][23][24][25][26]. Part of this difference could be attributed to the different setups used in the different studies (e.g.: chimpanzees, a cohort of people infected with a common ancestor, intra-patients studies).
Higher substitution rates were determined using the E1/E2 region, both with, and without, the HVR1 (7.41E-03 s/s/y and 5.05E-03 s/s/y, respectively). These higher values may be attributed to the fast evolving characteristics of the genomic region that included three hypervariable regions.
The tMRCA based on the NS5B analyses was 53 years. Likewise, similar evolutionary time scales were obtained for the E1/E2 region without HVR1 (58 years), and the E1/E2 region with HVR1 (61 years). Although the tMRCA will not be the same as the date of introduction, especially in a population at constant size, our analysis allow us to speculate that the possible introduction and transmission events in Wheelwright started at least 50 years ago.
The molecular clock analyses give broadly similar results regardless of clock model and tree prior, with the exception of the BSP tree prior, which gives an intriguingly recent tMRCA, and faster substitution rates than the other analyses. The natural course of HCV infection comprises clinically silent periods in the most of the cases. Due to the inconspicuously nature of HCV infection, clinical manifestations of hepatic illness are often observed 20 to 30 years post-infection. However, in our case most of the hepatic illness were detected in patients older than 50 years and thus the results of the tMRCA from BSP model are at odds with epidemiological external data and could be confidently dismissed. It is possible that the underlying reason for the results could be attributed to the fact that the BSP model should be used only when the data are strongly informative about population history [27]. BSP places the least amount of constraint upon the data; in contrast, the parametric models possibly require less informative data given that they incorporate stronger priors on the analysis.
In summary, the HCV infection prevalence in Wheelwright is 4.9%. The phylogenetic analysis indicated a monophyletic origin for the HCV-1b epidemic. The tMRCA of the Wheelwright clade, the demographic model with constant population size, and the fact that the highest rate of infection was observed in elder people support the hypothesis that the HCV-1b introduction in Wheelwright initially occurred at least five decades ago, but were subsequently controlled, limiting further spread of the virus.

Studied Population
In 2004, the population of Wheelwright were encouraged to be tested for HCV infection. Blood samples were collected from 1814 volunteers, representing 31% of the population. These individuals were invited to undergo serological testing for HCV and complete a questionnaire aimed at identifying potential risk factors for HCV infection, such as surgery, injections, dental treatment, transfusions, out-of-hospital vaccines, accidental blood contact, job (present and past), intravenous drugs, tattooing, piercing, acupuncture, sexual abuse, jail and high-risk sexual behaviour such as multiple partners or men that have sex with men. Written informed consent to participate in this study was obtained from all patients. The ethics committee of the Universidad de Buenos Aires, Facultad de Farmacia y Bioquímica, approved this study protocol # Exp 701283.

Serology: HCV Antibodies and RNA Detection
All samples were tested for Anti-HCV antibodies (Anti-HCV-Ab) by ELISA (EIA II-Cobas Core, Roche). The presence of HCV RNA was investigated by a home-made nested RT-PCR with a detection limit of 100 UI/ml in all positive Anti-HCV-Ab samples. A second blood sample was obtained from Anti-HCV-Ab positive/ PCR negative patients, and both ELISA and PCR were repeated.
To further assess the presence of Anti-HCV-Ab in samples that were ELISA positive and PCR negative, an immunoblot confirmation assay was performed (INNO-LIA HCV AbIII Update, Innogenetics). Only EIA(+)/PCR(+) individuals and EIA(+)/ LIA(+) were considered HCV positive in the seroprevalence study.

HCV Genotyping
The HCV genotype was determined by RFLP analysis of the 59UTR region as described elsewhere [28]. Statistics Prevalence ratios and 95% confidence intervals were calculated to estimate the degree of association of risk with HCV transmission.

RNA Extraction, cDNA Synthesis and DNA Amplification for Phylogenetic Analysis
RNA was extracted from 100 ml serum using a commercial reagent (Trizol, Invitrogen). Reverse transcription (RT) reactions were carried out in 20 ml reactions. An aliquot of 9 ml of the eluted RNA and 20 ng random primers (Biodynamics) were incubated at 80uC for 5 min and 3 min at 220uC. Then, 2 ml of dNTP (10 mM), 100 units of M-MLV Reverse Transcriptase (Promega) and 20 units of Recombinant RNasin Ribonuclease Inhibitor (Promega) were added to the reactions and the mixtures were incubated at 37uC for 90 min and then kept at 80uC for 5 min to inactivate the enzyme.
The E1/E2 and NS5B genes were amplified by a hemi-nested PCR using Taq  The amplified DNAs were purified from agarose gels using a commercial kit (QIAquik Gel Extraction Kit protocol, QIAGEN) and the amplicons were sequenced in both senses using the internal PCR primers.

Phylogenetic Analysis
A partial region of the E1/E2 containing the C-terminal E1 and HVR1, HVR2 and HVR3 of the E2 (positions 1467 to 2024) and NS5B (positions 8202 to 8543) genes were sequenced from 55 samples from Wheelwright. These sequences were compared with the genotype 1b reference sequence (HCV-J, D90208) [29].
A strict requirement for dating an outbreak is to have a set of monophyletic strains (i.e., a group of sequences derived from a common ancestor). Current methods of phylogenetic analysis provide stringent tests of monophyly given that an adequate sample of sequences are used in the analyses. Essentially, the larger the portion of viral diversity represented in the dataset, the stronger the monophyly test for any group of sequences included in the dataset. Given that the outbreak in Wheelwright is largely dominated by 1b sequences, we downloaded 232 HCV-1b sequences from the Los Alamos sequence database for inclusion in the phylogenetic analyses. Tree rooting is also required to establish the monophyletic nature of a group of taxonomic units. Here, we used sequences from all the non-1b subtypes to root the tree by the ''outgroup'' criterion. The Wheelwright, reference 1b and outgroup sequences (n = 321) were aligned using the Clustal X program with default parameters [30].
Phylogenetic trees were constructed using distance (NJ), maximum likelihood (ML) and parsimony. The distance analyses were performed with the PAUP* v4.0b10 program [31]; the likelihood analyses with the PhyML v2.4.4 program [32] and the parsimony trees with TNT v1.1 program [33]. The ML and NJ trees were built using the GTR model, with a proportion of invariable sites and C-distributed rates across sites. Evolutionary models were inferred according to the Akaike Information Criterion (AIC) statistics [34] obtained with the Modeltest 3.7 program [35]. In the parsimony analysis we used multiple random addition sequences followed by tree bisection reconnection (RAS + TBR). Subsequently, several cycles of ratchet (R), tree-fusing (TF), tree-drifting (TD) and sectorial searches (SS) were performed until no improvements were observed in the tree lengths [36]. The robustness of the reconstructed phylogenies was evaluated by bootstrap analysis. The phylogenetic trees were analyzed using the Dendroscope v2.2 program [37].

Molecular Evolutionary Rate and Divergence Times
Bayesian coalescent-based methods were used to estimate the substitution rate and the tMRCA, using sequences from the E1/ E2 and NS5B genes. For the sequences of the E1/E2 gene, some modifications were necessary. The E2 gene contains 81 nucleotides corresponding to a hypervariable region (HVR1), which evolves faster than the surrounding sequences. Therefore, we performed independent analyses both with and without the inclusion of HVR1. The estimates of the rate of nucleotide substitutions per site per year (s/s/y) and the tMRCA (in years) were obtained by means of the Bayesian Markov Chain Monte Carlo (MCMC) techniques implemented in the BEAST v1.4.8 program [38]. Both strict and relaxed (uncorrelated lognormal and uncorrelated exponential) molecular clocks were enforced [38]. Five demographic models were applied as coalescent priors: constant population size, exponential growth, expansion growth, logistic growth and Bayesian skyline plot (BSP) [39].
The dates of isolation of the virus were considered as contemporaneous (isochronous). Since this is not sufficient to reliably estimate an evolutionary rate or the tMRCA, the tMRCA was inferred using a normal distribution with a mean value of 50 years and standard deviation (stdev) of 20 years as prior distribution, based on our prior belief about the history of the Wheelwright epidemic. To estimate the evolutionary rate we used a lognormal distribution as prior, with a lognormal mean of 26.603 s/s/y and a lognormal stdev of 0.711. The mean evolutionary rates were calculated from previously published data for the NS5B gene [22][23][24]40]. In addition, since the expected rate for the E2 gene is considered in the prior distributions used for the NS5B, the evolutionary rate for the E2 gene was estimated using the same distribution lognormal values as for the NS5B gene.
These analyses were performed using the General Time Reversible substitution model [41] with gamma-distributed rates across sites and a proportion of sites assumed to be invariable (GTR+G+I). The length and number of MCMC chains were chosen so that the effective sample sizes (ESS) were above 100, indicating that the parameter space was sufficiently explored. The convergence of the parameters to a stationary distribution was assessed with the TRACER program [42], and the statistical uncertainties were summarized from the 95% highest probability density (HPD) intervals. Model comparisons were performed by a Bayes Factor analysis [43]. Figure S1 Neighbour-Joining phylogeny of the concatenated analysis obtained with the PAUP* software. A separation between sequences from the outbreak (red branches, n = 55) and those from other sequences can be observed in the shaded area to the right of the figure. Genotypes 1b (references sequences, n = 232) are represented by orange branches and the genotypes No-1b are represented by black branches (n = 34). Group of Wheelwright is detailed to the left of the figure with bootstrap supports equal to 81% from the same analysis. Branch lengths are proportional to the number of nucleotide substitutions. Found at: doi:10.1371/journal.pone.0008751.s001 (1.74 MB TIF) Figure S2 Parsimony phylogeny of the concatenated analysis obtained with the TNT software. The strict consensus tree of parsimony analysis obtained from eight most parsimony trees. A separation between sequences from the outbreak (red branches, n = 55) and those from other sequences can be observed in the shaded area to the right of the figure in the strict consensus tree. Genotypes 1b (references sequences, n = 232) are represented by orange branches and the genotypes No-1b are represented by black branches (n = 34). Group of Wheelwright is detailed to the left of the figure with a bootstrap supports equal 63% from the same analysis.