The Phylogeographic and Spatiotemporal Spread of HCV in Pakistani Population

Hepatitis C Virus (HCV) is the most prevalent human pathogen in Pakistan and is the major cause of liver cirrhosis and hepatocellular carcinoma in infected patients. It has shifted from being hypo-endemic to being hyper-endemic. There was no information about the origin and evolution of the local variants. Here we use newly developed phyloinformatic methods of sequence analysis to conduct the first comprehensive investigation of the evolutionary and biogeographic history in unprecedented detail and breadth. Considering evolutionary rate and molecular-clock hypothesis in context, we reconstructed the spatiotemporal spread of HCV in the whole territory of its circulation using a combination of Bayesian MCMC methods utilizing all sequences available in GenBank. Comparative analysis were performed and were addressed. Whole genome and individual gene analysis have shown that sub-types 1a, 1b and 3a are recognized as epidemic strains and are distributed globally. Here we confirm that the origin of HCV 3a genotypes is in South Asia and HCV has evolved in the region to become stably adapted to the host environment.


Introduction
Hepatitis C virus (HCV) is a pandemic human virus. It's a leading cause of chronic liver diseases including hepatocellular carcinoma and liver cirrhosis. Worldwide about 200 million people are infected with the virus. Approximately, 3-4 million people are diagnosed with new infections annually [1]. In Pakistan, almost 10 million people i.e. 6% of its population are reported to be infected with the virus [2]. HCV is a blood-borne pathogen and the main cause of virus transmission is a blood transfusion. Other risk factors include medical and dental procedures with unsterilized equipment, hemodialysis, organ transplants from HCV-infected patients, needle injuries, piercing and tattooing in unhygienic circumstances and intravenous drug usage.
HCV is a member of family Flaviviridae representative of genus Hepacivirus. The virus is enveloped and carry a~9.6kb RNA single-stranded genome of positive polarity. It contains a single open reading frame (ORF) that encodes 11 proteins as shown in Fig 1A and Fig 1B. HCV RNA replication is extremely efficient with a production rate of 10 12 virions per day in a single individual [3] and a high error rate of approximately 1 in 100 000 bp copied [4,5] which causes high genetic heterogeneity in the viral genome leading to 7 genotypes and more than 70 subtypes characterized so far [6]. These genotypes refer to the genetically distinct groups of HCV isolates which arise during evolution. At genotype level the divergence in the nucleotide sequence exhibited by the virus is 30% whereas at subtype level it's 20% [7,8]. One of the most important features of HCV genotype is its phylogenetic severalty, each genotype have close similarities to the individual branch of the phylogenetic tree of HCV. The subtypes are also important in view of epidemiology as they are reported to be associated with a pattern of spread or certain groups [9]. Interestingly, the seven genotypes not only diverge in a phylogenetic tree, they also vary in geographical distribution and their genetic diversity. HCV phylogeny resembles a shape of star with six arms [10,11]. Similarly, the sub-types are significant in studying the effects, causes and patterns of the HCV [9].
Finally, HCV exists as closely related species known as quasispecies within an individual. Full length sequences show 90.8%-99% similarity between quasispecies [12]. The occurrence of various genotypes correlates to their location. Genotype 1 and Genotype 2 are prevalent in West and Central Africa. In South Asia most of the infections are due to Genotype 3. In Africa and the Middle East, Genotype-4 is more common whereas, Genotype 5 and 6 have been found locally in South Africa and South-East Asia respectively [13,14]. Asia and Africa have the highest prevalence rate in the world whereas Europe, Australia and North America are the lowest prevalence regions [15]. Sub-types 1a, 1b and 3a are recognized as epidemic strains and are distributed globally having low genetic diversity. Due to this scientist have concluded that the genetic diversity of endemic strains is greater than epidemic strains i.e. the low heterogeneity of epidemic strains make them highly stable. Various studies show that genotype 3 is most prevalent among the Pakistani population affecting about 78.96% of diagnosed patients. The most common subtype is 3a attributing 58.01% followed by genotype 3b [16]. Researchers have successfully revealed route of virus acquisition specific to a genotype or subtype [17][18][19]. A significant amount of data has been collected related to HCV genotypes. However, no comprehensive study regarding the origin and evolution of the local variants has been carried out till now. Keeping in mind the importance of geographical occurrence, clinical outcome and HCV strain relation, the current study has been designed. We have attempted to trace the origin, genetic diversity and molecular evolution of HCV infected population in Pakistan using coalescence theory and newly developed phyloinformatic methods. The coalescence theory's hypothesis says that all individuals have a common ancestor and the population that has evolved independently tends to genetically diverge over time. Analysis of HCV sequences using this theory and molecular clock hypothesis enables us to predict HCV population dynamics since the time of its discovery which can be used to estimate viral epidemic behavior. The outcome of the study will help to develop better therapeutic interventions to overcome the disease.

Data Collection
A total of 1,17,000 sequences (Partial/Complete genome) of HCV were retrieved from publically available database (GenBank) since inception to date into Geneious v 8.1.5 (http://www. geneious.com, [20]). These sequences belonged to different countries of the world and were renamed as: Accession number: Country: Date: Subtype Following inclusion criteria was fixed for sequences included in the study (a) sequences should be already published in peer reviewed journals; (b) there should be no ambiguity regarding the subtype consignment of each sequence; (c) all sequences should be non-recombinants; (d) country of the origin of sequence should be known and clearly established; (e) isolation date or publication date is clearly established in the original publication. Sequences with incomplete information were deleted.

Study Population
For this study we collected HCV complete genome sequences of genotype 3a from Pakistan and worldwide and also sequences of HCV genotype 1a, 1b, 3a, 3b from Pakistan and worldwide. Whole genome sequences were edited manually to obtain individual genes of each genotype. The study was conducted using NS5B gene. Other genes were not considered for either being highly conserved or too variable that they could not be aligned unambiguously. As a large number of sequences were available and including all of them would have considerably increased computational burden, two representative sequences were selected from each country for each genotype. This was done by constructing Neighbor joining trees using Jukes and Cantor Model (JC; [21]) of individual genes.

Evolutionary Model Selection
J-Model Test v2.1.6 [22,23] was applied and simplest evolutionary model was selected that adequately fit the data matrix according to the Akaike Information Criterion (AIC). For complete genome sequences the best fit model is General Time Reversible GTR + I + G (012345).

Phylogenetic Reconstructing Using Maximum Likelihood
Multiple alignments of NS5B region sequences were created and manually edited to form a data matrix. For complete genome two sets of nucleotide sequences were prepared for phylogenetic analysis; One set contained HCV genotype 3a consisting of 23 sequences and the other set contained all the genotypes and subtypes of HCV worldwide consisting of 102 sequences. Maximum likelihood phylogenetic trees were inferred using PhyML [23] program plugin available in Geneious v8.1.5 program (http://www.geneious.com, [20]). Boot strap analysis was performed using 1000 replicates.

Bayesian Coalescent Analysis (Dated Phylogeny)
Bayesian coalescent analysis was performed using Bayesian Evolutionary Analysis Sampling Trees program (BEASTv1.8.2) [24]. Simplest evolutionary model was selected using J-Model Test v2.1.6 [22,23]. Since we had no prior reason to assume a particular model of change in viral population size overtime, we used recently developed Bayesian Skyline plot to estimate epidemic history [25]. The sequences from Pakistani population in our database were sampled over a limited time frame and therefore, did not contain sufficient information about evolutionary rate. Therefore, we employed informative prior distributions for the NS5B evolutionary rates. We used 5.0×10-4 ± 1.7×10-5 substitutions/site/year for NS5B [26][27][28]. For complete genome analysis we used 1.289 × 10 −3 ± 1.47 × 10 −5 substitution per site per year evolutionary rate.
As recommended, un-correlated log-normal model for relaxed molecular clock was applied to all analysis. This model takes into account the changes in evolutionary rate among lineages [25]. MCMC chains were run for at least 50 million generations or till effective sampling size (ESS) value became >200 and sampled every 15000 steps [29]. For complete genome analysis MCMC run contained at least 140,798,000 states and sampled once every 10,000 steps. The posterior samples were analyzed using Tracer v1.5 program (available from http://beast.bio.ed. ac.uk/Tracer). Uncertainty in the estimates was indicated by 95% highest posterior density (95% HPD) intervals. Posterior distributions of the analysis were calculated after a burn-in of 10% using Tree annotator v1.5.2 (http://beast.bio.ed.ac.uk) and trees were visualized and formatted using FigTree (available from http://tree.bio.ed.ac.uk/software/figtree/).

Phylogenetic Analysis Using Parsimony & Other Methods
PAUP (v. 4.0) a Geneious plugin was used to infer maximum likelihood (ML) phylogenetic trees [30]. These trees are generated with the best-fitting nucleotide substitution model General Time Reversibility (GTR) using estimated proportion of invariable sites. Best trees of HCV complete genome of genotype 3a were constructed from 10 iterations of the likelihood ratchets by using Heuristic maximum likelihood tree search parameter. Bootstrap value set for the analysis was 1,000. Best tree generated from each iteration was compared and consensus tree was performed by PAUP program.

Phylogenetic Analysis of HCV Genotype 1a
The ML phylogenetic analysis of NS5B gene S1 Table showed that HCV genotypes 1a origin in Pakistan is polyphyletic and has occurred due to multiple outbreaks at different points in time. It is also observed that Pakistani sequences are more closely related to each other than to sequences from other countries. NS5B gene phylogeny is shown in Fig 2A. The phylogenetic tree clearly formed six clades. It is seen in clade II that Pakistani sequences recurrently clustered together. The Bayesian time scaled phylogenetic tree of NS5B gene agreed closely with the ML tree with minor exceptions. The estimated date for most recent common ancestor (MRCA) was found to be 1931 (95% HPD: 1382-173). The divergence date of individual clades can be seen on the tree in the Fig 2B. Phylogenetic Analysis of HCV Genotype 1b ML and Bayesian dated phylogenetic analysis were performed on NS5B gene of genotype 1b (S1 Table). The resulting ML phylogeny is shown in Fig 3A. The phylogenetic tree consists of four clades containing Pakistani sequences. These sequences have clustered with sequences from New Zealand, Egypt and Russia. The tree showed a polyphyletic origin of HCV genotype 1b in Pakistan. The dated phylogeny were similar to ML trees Fig 3B. The divergence date of MRCA for was found to be 1923 (95% HPD 1347-283). Clade I and IV showed strong statistical support whereas, clades II and III were not well supported by posterior probability.

Phylogenetic Analysis of HCV Genotype 3a
ML and dated phylogeny of NS5B gene of genotype 3a S1 Table showed   isolated were found to be closely related to many countries including Portugal, Thailand, Vietnam, India, Argentina, UK, Brazil, Russia, Afghanistan and Mexico however, many of the clades were not well supported by bootstrap value. The ML and dated phylogeny of NS5B gene was very diverse and depicted the inter country dispersal and high prevalence of HCV genotype 3a over the years. The whole tree dated back to 1800. The divergence date of only the large clusters are shown in Fig 4B. The best probability support was seen in the clades clustering Pakistani sequences together confirming the inter country dispersal of the virus.

Phylogenetic Analysis of HCV Complete Genome 3a
The ML phylogenetic analysis of complete genome of HCV genotype 3a (S1 Table) showed that its origin in Pakistan is monophyletic. It has been observed that Pakistani sequences are more closely related to each other than to sequences from other countries. The gene phylogeny is shown in Fig 5A. Pakistani sequences form a single clade in the phylogenetic tree and share a closer common ancestor. The tree contains an outgroup sequence from Canada of subtype 4a. The Bayesian time scaled phylogenetic tree of complete genome agreed with the ML tree. The tree confirmed the monophyletic origin of genotype 3a in the country. The estimated date for MRCA was found to be 1992 (95% HPD: 1897, 1273). The divergence date of individual clades can be seen on the tree in Fig 5B. PAUP analysis of complete genome also showed that HCV Tracing the Origin and Evolution of HCV Variants in Pakistani Population genotype 3a origin in Pakistan is monophyletic. The gene phylogeny is shown in Fig 5C. Pakistani sequences share a closer common ancestor and forms a single clade in the phylogenetic tree.
The ML phylogenetic analysis of complete genome of genotype 1, 2, 3, 4, 5 and 6 (S1 Table) showed that genotype 3a origin in Pakistan is polyphyletic. Pakistani sequences forms two clades in the phylogenetic tree. The gene phylogeny is shown in Fig 5D. The tree contains an outgroup sequence from Rodent Hepacivirus (KC411777). The tree explains the evolution of HCV genome. Multiple clades are present which explains its diversification. Pakistani sequences are closely related to the Italian and Indian sequences in the tree.

Phylogenetic Analysis of HCV Genotype 3b
The phylogenetic history and evolutionary pattern of HCV genotype 3b was studied using NS5B gene (S1 Table). The ML and dated phylogeny of genotype 3b was quiet similar to that of genotype 3a. The trees depicted a polyphyletic origin of the virus though, recurrent clustering of Pakistani sequences together confirmed the frequent inter country dispersal of the virus as seen in Fig 6A. The Pakistani isolates were also seen to group together with Singapore, Vietnam, Indonesia and Canada. The Bayesian phylogeny dated back to 1986 (95% HPD 1746-

Discussion
The current study was carried out to reconstruct the time scaled phylogenies to infer the origin and divergence of HCV genotypes in Pakistan. We applied the ML and Bayesian coalescent based analysis to provide a clear depiction of the problem. We performed analysis of genotype 1a, 1b, 3a and 3b. Majority of infections caused by HCV in the world involves only a few HCV subtypes namely 1a, 1b, 2b, 3a and 4a [31]. These subtypes have disseminated all over the world during the 20 th century mainly because of unsafe blood transfusions, injections and intravenous drug abuse [26,32]. Individual gene analysis using NS5B regions revealed that genotype 3a is the most diverse genotype in Pakistan and thus has been present in the local population for an extensive period of time [33].
For majority of the genotypes analyzed, two types of patterns of HCV transmission in Pakistan were observed. The first pattern reflected the ongoing and sustained transmission or the viral lineage within the country that might be due to a shared transmission route. The other pattern observed represented the individual Pakistani isolates of different HCV subtypes to be more closely related to non-Pakistani isolates. This pattern most likely represents the multiple and sporadic migration of HCV strains. The viral introduction in the country may arise via immigration, travel and tourism [34][35][36] or through network of injecting drug users [37]. Similar patterns were found in others studies of HCV and HIV [38].
The results revealed that HCV genotype 3a is the most diverse genotype in Pakistan. Similarly, when Pakistan HCV genotype 3a complete genome sequences were analyzed with HCV complete genome sequences of genotype 1, 2, 3, 4, 5 and 6 also reveals the diversity in the Pakistani HCV genotype 3a.
The phylogenetic analysis of genotype 1a depicts that the origin of the virus in the country is polyphyletic. The origin of the genotype was found to be in Ghana which confirms the findings of earlier studies [39]. The plausible reason for the spread of the genotype within the country is via unsafe medical practices. Another study concluded that genotype 1a infected patients got their infection during surgeries. Genotype 1a is most prevalent in Iran, England, Australia and Scotland [40,41].
Pakistani sequences were found to be closely related to Iran, Italy and Switzerland. A large number of Pakistani population reside in Italy and Switzerland. Whereas, people of specific religious sect cross the Iranian border in large number legally and illegally every year. Our results also indicated that HCV genotype dispersed from the developed countries to under developed countries including Pakistan. The worldwide spread of the virus was initiated and sustained during and after World War II. However, it may also be due to the wide spread of IDU in the 1930s [42] in the western countries and is relatively new to the Asian countries such as the earliest date of divergence for 1a in Pakistan was found to be 2001. These findings coincides with [39] conclusions. The HCV genotype 1b phylogeny of NS5B region also shows the polyphyletic origin of the virus in Pakistan. The Pakistani isolates are more closely related to Russia, Egypt and New Zealand. Genotype 1b is predominant in Russia [43,44]. Till date there are thousands of illegal Russian nationals living in Pakistan. They frequently cross the Pakistani border illegally without proper medical screenings. Genotype 1b is not associated with IDU rather its transmission route is more frequently associated to be iatrogenic one. 75% of the patients infected with genotype 1b got infected during surgeries. This may also indicate the insignificant spread of the genotype within the country as now, more safe medical practice is carried out throughout Pakistan. The earliest divergence date for Pakistan was found to be 1925 while the other being as recent as 2010. The origin of 1b was found in West Africa which coincides with the earlier studies [6,45].
Genotype 3 is the dominant genotype in Pakistan after genotype 1 [46,47]. It is reported in many studies that genotype 3a has originated in the Indian subcontinent about 300 years ago [48,49]. Our study also agrees with previous results. The ML and Bayesian phylogenies were well resolved and highly diverse. The high genetic variability as seen in the Pakistani isolates genotype 3a sequences, confirms that 3a has been present in country since long. Previous report [48] suggest that genotype 3 has most probable origin in the northern part of India. Since, India and Pakistan was Indian subcontinent for decades and still share a common boundary due to which they have a large common human ancestry. The individual phylogenetic trees of NS5B region shows that Pakistani sequences tends to cluster together forming a distinct Pakistan specific cluster of genotype 3a. The Bayesian molecular clock analysis provided an estimate of the data of most recent common ancestor (MRCA) of genotype 3a, which is 1992. It was estimated that subtype 3a has a common ancestor dated back to 1992 [95% HPD: 1897, 1273]. The estimated common ancestor of Pakistan HCV genome subtype 3a dated to 2006 [95% HPD: 1988, 1899].
HCV sequences from Pakistan were found to be closely related to Indian sequences in addition to Portugal, Thailand, Vietnam, India, Argentina, UK, Brazil, Russia, Afghanistan and Mexico, Uzbekistan and Switzerland. The earliest divergence date of HCV genotype 3a in Pakistan was found to be 1933. Indian subcontinent has a history of British rule between 1858 and 1947. It is speculated that HCV genotype 3a has disseminated to United Kingdom from the Indian subcontinent during and after the British rule. The time also overlaps with the era of unsafe blood transfusions and rapid industrializations. The spread may also be due to the mass migration of people during World War II. Rehman et al., [33] reported that HCV genotype population increased in Pakistan from 19th century to 1970 by thousand folds. Genotype 3a is very common among IDUs. Moreover, in Pakistan correlation of thalassemia and HCV with genotype 3a has been observed in patients who received blood transfusion several times in their life time. In a recent study by [9,50] HCV 3a was revealed as the most common genotype in Pakistan. It is also indicated that the epidemic spread of HCV genotype 3a occurred first in Pakistan than in other countries around 1920s and 1950s similar to our results. This spread is also considered as a reason for increased incidence of hepatocellular carcinoma.
Thus, it is proposed that HCV genotype 3a has spread worldwide during 1960s due to widespread use of heroin [16,51]. It is common practice among Pakistani IDUs to use and re-use needles and syringes with other IDUs. Genotype 3a is less commonly observed in people undergone surgeries or other medical or dental procedures.
HCV genotype 3b is the next most common genotype in Pakistani population. The E1 and NS5B gene analysis to trace the origin of the virus in Pakistan reveals close relationship of Pakistani and Indian sequences. Pakistani sequences were also found to be closely related to other countries including Singapore, Vietnam and Indonesia. The analysis shows dispersal of genotype 3b from Pakistan to China in early 19th century. Pakistan and China have always been on friendly terms and crossing of borders for tourism, study and business is very common. The close clustering of Pakistani sequences in the resulting phylogenies depicts the presence of genotype 3b in Pakistan since a long time. The proportion of genotype 3b in Pakistan has gradually decreased. This can be due to the 100% sustained virological response (SVR) rates recorded among patients of 3a and 3b genotypes [52]. The most common mode of genotype 3b transmission is the same as of genotype 3a i.e. via use and re use of contaminated syringes in IDUs. It has been reported that 60% of the patients infected with genotype 3b were drug addicts using intravenous drug usage.
The growing field of computational phylogenetics is very useful in molecular evolutionary studies. By inferring spatial and temporal processes from viral gene sequences, new insights could be obtained of the epidemiology and spatial diffusion of viruses.

Conclusions
The goal of this project was to trace and study the origin, the genetic diversity and molecular evolution of HCV population in Pakistan. Our study concludes the origin of HCV of genotype 1a, 1b, 1c, 3a and 3b is polyphyletic. The origin of genotype 3a is Indo-Pak and is highly divergent. The rapid evolution and enormous sequence diversity found in HCV sequences are a cause of a significant hurdle in the development of effective vaccines and novel therapeutic interventions. We tried to gain an insight of the evolutionary aspects of HCV genotypes available in Pakistan. The origin of genotype 3a is Sub-continent Asia which includes Indo-Pak and is highly divergent.
Phylogenetics is very beneficial in molecular evolutionary studies. By applying the spatial and temporal processes from viral genome sequences, new perceptions can be obtained of the epidemiology and spatial diffusion of viruses. The medical fraternity in Pakistan needs to carry on the research in order to provide improved understanding of the clinical trajectory of the disease. Research should also be done to sequence more HCV genomes in Pakistan to compare the nucleotide variation among other countries and target those areas in the genome that are essential for the viral replication like NS3-4A inhibitors and NS5B polymerase inhibitors. Novel therapies should be developed that can be targeted to high-risk populations.
Supporting Information S1