Similar Prevalence of Low-Abundance Drug-Resistant Variants in Treatment-Naive Patients with Genotype 1a and 1b Hepatitis C Virus Infections as Determined by Ultradeep Pyrosequencing

Background and Objectives Hepatitis C virus (HCV) variants that confer resistance to direct-acting-antiviral agents (DAA) have been detected by standard sequencing technology in genotype (G) 1 viruses from DAA-naive patients. It has recently been shown that virological response rates are higher and breakthrough rates are lower in G1b infected patients than in G1a infected patients treated with certain classes of HCV DAAs. It is not known whether this corresponds to a difference in the composition of G1a and G1b HCV quasispecies in regards to the proportion of naturally occurring DAA-resistant variants before treatment. Methods We used ultradeep pyrosequencing to determine the prevalence of low-abundance (<25% of the sequence reads) DAA-resistant variants in 191 NS3 and 116 NS5B isolates from 208 DAA-naive G1-infected patients. Results A total of 3.5 million high-quality reads of ≥200 nucleotides were generated. The median coverage depth was 4150x and 4470x per NS3 and NS5B amplicon, respectively. Both G1a and G1b populations showed Shannon entropy distributions, with no difference between G1a and G1b in NS3 or NS5B region at the nucleotide level. A higher number of substitutions that confer resistance to protease inhibitors were observed in G1a isolates (mainly at amino acid 80 of the NS3 region). The prevalence of amino acid substitutions that confer resistance to NS5B non-nucleoside inhibitors was similar in G1a and G1b isolates. The NS5B S282T variant, which confers resistance to the polymerase inhibitors mericitabine and sofosbuvir, was not detected in any sample. Conclusion The quasispecies genetic diversity and prevalence of DAA-resistant variants was similar in G1a and G1b isolates and in both NS3 and NS5B regions, suggesting that this is not a determinant for the higher level of DAA resistance observed across G1a HCV infected patients upon treatment.


Introduction
Advances in the knowledge of the structure and function of hepatitis C virus (HCV) proteins and the development of robust methods for studying HCV replication in vitro have resulted in the development of direct acting antiviral agents (DAAs) that target essential proteins, primarily the NS3/4A serine protease, the RNA dependent RNA polymerase (RdRp, NS5B) and NS5A. Three agents that inhibit the NS3/4A serine protease are now approved for clinical use (telaprevir, boceprevir and simeprevir). These protease inhibitors (PIs) are potent inhibitors of HCV replication in vivo; however, resistance develops quickly even when these drugs are administered in combination with peginterferon/ ribavirin. [1][2][3] Indeed, amino acid substitutions within the NS3 protease region have been identified in vitro and in vivo and correlated with reduced susceptibility to all PIs, including those in development ( Table 1).
HCV NS5B RdRp, lacking a proof reading function, misincorporates nucleotides at a rate of 1 per 10 000 bases copied. [19] The low fidelity of replication is compounded by a high replication rate that can result in the production of up to 10 12 virions per day. [20] As a result, HCV exists in any given patient as a diverse collection of closely related variants termed quasispecies. [21] Although the original definition of a quasispecies requires an effectively infinite population size, population geneticists have nonetheless found quasispecies theory to be useful for finite viral populations with high mutation rates, and have generally accepted the use of the term quasispecies when applied to HCV and several other viral infections [22].
Recent studies on protease inhibitors telaprevir and boceprevir have shown higher virological response rates, and lower breakthrough rates, in G1b-infected patients than in G1a-infected patients. [1,2,23,24] Similar results have been described for patients enrolled in Palm I NNI-based interferon-free regimen. [25,26] However, it is not known whether this corresponds to a different distribution of variants in G1a and G1b quasispecies in regards to the proportion that contain naturally occurring DAAresistance mutations before treatment; it is also unknown what the possible implications for the response to DAA treatment could be.
In this study, we sought to investigate the overall genetic diversity and the potential presence of low-abundant DAAresistant variants (,25% of the sequence reads) in the NS3 and NS5B genes of HCV G1a and G1b viruses in isolates from .200 DAA-naïve patients using ultradeep pyrosequencing (UDPS) methodology. We characterized, on a quantitative scale, the abundance (defined as the frequency of a particular amino acid substitution in an isolate's quasispecies as low as 0.5%) and prevalence (defined as the proportion of patients which carries a particular amino acid substitution) of more than 30 established DAA-resistant substitutions and polymorphisms in the NS3 and NS5B regions. We also compared the genetic variability of G1a and G1b isolates to determine whether differences in variability might be responsible for the lower virological responses to HCV PIs and Palm I NNI -DAA-containing therapy in patients infected with G1a compared with G1b viruses.

Clinical isolates
Samples from 208 G1 DAA-naive patients, most of whom were enrolled in Roche-sponsored global clinical trials (PV18369,  [27][28][29][30][31][32], with a median viral load of 3.6610 6 IU/mL (range 7.6610 4 -5.9610 7 IU/mL) were studied. All studies were global and conducted in full conformance with the principles of the Declaration of Helsinki and Good Clinical Practice. Protocols and all amendments were reviewed and approved by local ethics committees and regulatory authorities. Written informed consent was obtained from all patients before any study-related activities occurred. The authors were not involved in any of the original sample collections and samples were de-identified prior to being accessed by the authors. HCV RNA was extracted from 400 ml to 800 ml of serum from HCV-infected patients using the ZR Whole-blood Total RNA kit (Zymo Research, Irving, CA, USA) and following the manufacturer's instructions.

Population-based dideoxy-terminator (Sanger) sequencing
Population sequencing spanning the NS5B and NS3/4A coding regions was performed by Sanger sequencing using primers covering both DNA strands using an ABI 3730 xl DNA Analyzer. Chromatograms were analyzed using Sequencher (Gene Codes Corporation, Ann Arbor, MI, USA) and Vector NTI (Life Technologies, Carlsbad, CA, USA) software. Population sequence of each sample was then used for the 454 reads analysis (see section 454 sequence analysis).

HCV amplification for 454 sequencing
Reverse transcription (RT) was performed using the Transcriptor High Fidelity cDNA Synthesis Kit (Roche Diagnostics, Indianapolis, IN, USA) following the manufacturer's instructions, using random hexamers as primers for initiation and 1 to 9.15 mL of HCV RNA (depending on sample availability) per reaction. The RT cycling conditions were as follow: 5 minutes (min) at 25uC, 30 min at 50uC, 5 min at 85uC. For clinical isolates obtained from patients with an HCV viral load ,3.4610 5 IU/ mL, cDNAs were pooled (between 2 to 6 RT reactions, depending on sample availability) and concentrated using the Zymo Research DNA Clean & Concentrator-5 (Zymo Research, Irvine, CA, USA) following the manufacturer instruction.
Amplification of NS5B region (generating a 985 nucleotide fragment for G1a samples and a 916 nucleotide fragment for G1b samples) and NS3 region (generating a 750 nucleotide fragment for G1a samples and an 806 nucleotide fragment for G1b samples) was carried out in duplicate using the FastStart High Fidelity PCR System (Roche Diagnostics, Indianapolis, IN, USA) as follows: 4 mL cDNA (or 2 ml concentrated cDNA) was included in a 50 mL reaction mixture containing 5 mL Buffer #2 (1X final with 1.8 mM MgCl 2 ), 1 mL dNTPs (0.2 mM final each), 1 mL each forward and reverse primers (0.4 mM final each) and 0.5 mL enzyme (2.5 Units). PCR reactions were performed as follow: 3 min at 94uC, then 30 cycles of 30 seconds (sec) at 94uC, 30 sec at 52uC, and 1 min at 72uC, followed by a final 7 min extension step at 72uC. The duplicate products of this first round PCR (PCR1) were pooled and then amplified in a nested PCR (PCR2) using 454-fusion primers containing custom 7-mer bar codes (designed to be able to pool amplified DNA from different patient samples for the UDPS reactions) and the same cycling conditions as PCR1. The PCR2 reaction produced 3 smaller overlapping amplicons (308-368 nucleotides) encompassing NS5B amino acids 244 to 496 and 2 smaller overlapping amplicons (244-356 nucleotides), encompassing NS3 amino acids 31 to 175 and 30 to 190 for G1a and G1b isolates, respectively. The PCR2 products were double purified using AMPure beads (Beckman Coulter, Brea, CA, USA), quantified using Quant-iT PicoGreen dsDNA Reagent (Life Technologies, Carlsbad, CA, USA), pooled at equimolar concentrations and pyrosequenced using the 454/Roche GS FLX titanium platform producing reads of 331 bp on average. All amplification primers are given in Tables S1 and S2 in File S1.

sequence analysis
Standard flowgram format (SFF) files, the raw output from UDPS, were processed to generate paired files containing FASTA sequence reads and Phred-equivalent quality scores for each sequence library. To reduce sequence artifacts, reads shorter than 200 nucleotides and reads containing one or more bases with a quality score of ,10 (.10% probability of sequence error) or a mean quality score ,25 (.0.3% probability of sequence error per base per sequence read) were excluded. Sequenced reads were demultiplexed using the 59 primer and barcode sequences, resulting in the assignment of each read to a patient sample and primerpair, and each UDPS read was aligned to the population-based sequence of each sample using the MosaikAligner program (http://bioinformatics.bc.edu). For the calling of nucleotide mutations and amino acid substitutions, each amplicon was compared to the corresponding subtype reference sequence, H77 for G1a (GenBank accession number M67463) and Con 1 for G1b (GenBank accession number AJ238799). High-abundance variants were defined as mutations present in $25% of the sequence reads (as is conventionally defined for the Sanger sequencing detection threshold), and low-abundance variants were defined as mutations present in ,25% of the sequence reads.
All sequence data have been deposited at the NCBI Sequence Read Archive under study accession SRP040802.
The technical error rate was estimated for each UDPS run by amplifying and sequencing two HCV reference plasmids, G1a H77 and G1b Con1 using the same PCR protocol described for the clinical isolates. Each UDPS read was aligned to the corresponding plasmid sequence using the MosaikAligner program (http://bioinformatics.bc.edu) and the number of mismatches was counted. Homopolymeric regions were defined as regions with three or more identical consecutive nucleotides and their flanking nucleotides [33].

Distribution of NS3 and NS5B quasispecies variants
The number and frequency of distinct virus variants (sometimes referred to as ''haplotypes'') in each sample was estimated using the computational method ShoRAH (Short Reads Assemby into Haplotypes), which incorporates sequence error calculations to avoid overestimating the number of distinct genetic variants in a sample. [34] After removing virus variants present at frequency lower than 0.5%, the intra-sample quasispecies Shannon entropy, S QS(nt) , was calculated: [21].
where p i is the frequency of sequence variant i, n is the total number of variants and N is the average number of reads per amplicon. Each distinct variant identified by ShoRAH was translated to its corresponding amino acid sequences. Variants with same amino acid sequences were pooled and the Shannon entropy for the amino acid sequences (S QS(aa) ) was calculated in a manner analogous to S QS(nt) . Shannon entropy was chosen to characterize the mutant spectra of viral quasispecies in our samples, because it is a simple normalized quantitative measure of variability (randomness) in a sequence dataset which incorporates both the frequency and the number of possible sequence variants into a single metric. In the context of quasispecies analysis, a higher Shannon entropy (e.g. 0.25) typically implies that a virus population consists of a large number of distinct sequence variants (e.g. ,15 or more) occurring at low to moderate frequency, whereas a relatively low Shannon entropy (e.g. 0.02) often indicates a more ordered virus population which may contain just one to a few (e.g. ,3) sequence variants occurring at significant frequency [21].

Statistical inference
The Mann-Whitney-Wilcoxon test was used to assess the statistical significance of differences between the median Shannon entropy between subtypes and genes. The Fisher's exact test was used to determine if the levels of amino-acid conservation in the NS3 and NS5B regions differed between G1a and G1b isolates by using the fisher.test function for count data, as implemented in the R statistical package.

NS3 and NS5B drug-resistant amino acid substitutions included in the study
Study-defined NS3 and NS5B DAA-resistant amino acid substitutions are listed in Tables 1 and 2.

UDPS coverage and technical error rate according to gene and genotype
UDPS yielded a total of 3.5 million high quality sequence reads of 200 or more nucleotides from 136 G1a and 55 G1b NS3 samples and from 77 G1a and 39 G1b NS5B samples. Overall, there was a median of 8,399 reads (IQR: 6,943-11,151) for the 191 NS3 samples and a median of 14,043 reads (IQR: 11347-16142) for the 116 NS5B samples. A total of 34 HCV plasmid controls were run (10 G1a and 8 G1b NS3 controls, and 10 G1a and 6 G1b NS5B controls) and a total of 339,769 sequenced reads with .200 bases were generated over 11 plates for the 34 controls: 105,210 for G1a NS3 and 80,462 for G1a NS5B; and 75,182 for G1b NS3 and 78,915 for G1b NS5B.
The median mismatch error rate (or technical error rate), determined using G1a H77 and G1b Con1 plasmid DNA, was 7.0610 24 overall: 4.0610 24 in non-homopolymeric regions and 1.4610 23 in homopolymeric regions. The median mismatch error rate was the same for the NS3 and NS5B genes and for G1a and G1b plasmids. Assuming that errors occurred in a Poisson distribution and that samples contained one thousand viral templates, and adding the published error rate of the Roche Transcriptor High Fidelity reverse transcriptase (1.98610 25 ), the likelihood that technical artefact would cause a mutation to be detected at a level of 0.5% or higher would be 7.7610 25 in nonhomopolymeric regions and 1.5610 22 in homopolymeric regions. The likelihood that technical artefact could cause a mutation to be detected at a level of 1.0% or higher would be ,1.0610 210 in non-homopolymeric regions and 2.5610 26 in homopolymeric regions. Using the overall median error rate of 7.2610 24 and a Bonferroni correction for testing 13 PI-resistant variants, 3 NIresistant variants, and 18-NNI-resistant variants, the likelihood of detecting any PI-, NI-, or NNI-resistance mutation at a level of 0.5% or higher in a sample would be 1.2610 22 , 2.7610 23 , and 1.6610 22 , respectively. To prevent false positives, a conservative cut-off ($0.5%) for variant detection was adopted as the minimum threshold in calling high-and low-abundance mutations.

Distributions of quasispecies diversity among G1a and G1b isolates
To evaluate inter-patient genetic variation across the NS3 and NS5B quasispecies obtained by UDPS, the histograms of Shannon entropy of quasispecies (see Materials and Methods for definition) were plotted for all samples combined (G1a and G1b), or separately by subtype, in nucleotides ( Figure 1A) and in amino acids ( Figure 1B).
The median Shannon entropy for the NS3 nucleotide sequences was similar for the 136 G1a  Figures 1A & B), consistent with the notion that sequence diversity in HCV patient samples can vary among samples.
The NI-resistant variants S282T and L320I/F were not present in any read from G1a or G1b samples.

Discussion
Resistance profiles of the first four approved DAAs agents (the protease inhibitors telaprevir, boceprevir and simeprevir and NS5B polymerase inhibitor Sofosbuvir) have been well characterized and clinical resistance resulting in treatment failure has been reported for the first three. [1][2][3] It has recently been shown that virological response rates are higher and breakthrough rates are lower in G1b infected patients than in G1a infected patients treated with certain classes of DAAs such as PIs and Palm I NNIs [1,2,[23][24][25][26].
As a result of a lack of a proof reading function in the HCV polymerase resulting in a high rate of spontaneous mutation, variants emerge constantly and, with a DAA selection pressure combined with a replicative advantage, may quickly become dominant in the population. It is not known if a higher level of genetic variability exists in G1a than G1b, or if elevated variability may be associated with treatment response to DAA's. A thorough characterization of the genetic diversity and prevalence of DAAresistant variants in treatment-naive HCV infected patients will therefore shed light on their potential clinical relevance and impact.
In this study, we tested the hypothesis of whether intrinsic genetic variability across G1 subtypes is directly associated with the differential response rates between G1a-and G1b-infected patients treated with DAAs. For this, we performed an extensive UDPS analysis that included 191 NS3 and 116 NS5B isolates from 208 HCV-infected DAA-naïve patients infected by a G1a or G1b virus with viral load of the patients spanning almost three orders of magnitude (see Materials and Methods). To our knowledge, this is the largest UDPS study on HCV diversity in treatment-naïve patients up to now.
We observed a higher prevalence of high-abundance variants containing a PI-resistant variant/polymorphism in G1a than G1b isolates. However, this higher prevalence is mainly driven by a polymorphism at position 80 that, in the case of simeprevir, preexistence of Q80K polymorphism has been shown to affect the sustained virologic response (SVR) rate. [3] Amino acid substitution at Q80 have not been identified in telaprevir or boceprevir clinical trials [1,2] and have no significant effects on the activity of telaprevir or boceprevir in in vitro experiments. [35] Lowabundance PI-resistant variants were also identified in both G1a and G1b isolates, however no significant difference between G1a and G1b was observed. The resistance mutations V36M and R155K in G1b isolates and V170A in G1a isolates have higher genetic barriers and require two nucleotide changes from wildtype (Table 7). Consistent with this phenomenon, we found no pre-existing V36M and R155K mutations among G1b isolates and no V170A mutations among G1a isolates in the set of high-and low-abundance mutations from this study. The impact of low-level pre-existing PI-resistant variants on treatment outcome is yet to be fully determined. Recently published studies using UDPS methods and following a small number of PI-treated patients have shown that, in some (but not all) patients, pre-existing low-level PI-resistant variants could have affected their response to treatment [36] or re-treatment [37]; the presence of PI-resistant variants at baseline did not necessarily prevent a patient from responding to treatment and PI-resistant variants at baseline were not selectively enriched upon treatment. [29] Our findings support the notion that the relationships between pre-existing PI-resistant variants and treatment outcome are likely to be complex and may depend on virus genotype, genetic barriers of a particular resistant variant as well as the prevalence of the variant, among other factors.
In contrast to the NS3 region, no significant difference between G1a and G1b isolates was observed in the prevalence of highabundance drug-resistant variants in the allosteric binding sites of the NS5B region. The results obtained here are also in agreement with previous studies that showed an overall 10 to 20% prevalence of variants at known drug resistance sites in the NS5B sequences from treatment-naive individuals. [38,39] Low-abundance drugresistant variants in the allosteric binding sites of the NS5B region were detected in ,20 to 30% of the isolates, with no difference between G1a and G1b isolates.
The impact of NNI resistance variants abundance on drug susceptibility was demonstrated in an in vitro phenotypic study, showing that NNI-resistance variants present .25% decrease significantly the NNI potency. On the other hand, NNI-resistance variants present,25% did not necessarily have an impact on NNI potency [38].
This study also looked at the potential pre-existence of Sofosbuvir or Mericitabine resistance mutations. The NS5B nucleos(t)ide inhibitor-resistant variant S282T, was not found. To date, no S282T variant, has been found in treatment-naive GT 1 isolates above the assay detection limit of sensitive sequencing technology such as UDPS, as shown in this study (n = 116) and in a recent study. [40] A novel combination of NS5B substitutions (L159F/L320F) conferring a low-level resistance to mericitabine has been described recently. [9] Variants with L320F substitution were not observed, corroborating the absence of resistance to these compounds at low level in the treatment-naïve individuals included in this study. It is noteworthy that the only NS5B NIresistant substitution detected in this study, V321I, was lowabundance and involves a low genetic barrier of a single transition.
On the other hand, the NI-resistant substitution S282T and the NI-resistant combination L159F/L320F, which were not observed, have higher genetic barriers and require a transversion and two nucleotide changes, respectively. For the NS5B allosteric inhibitors, the observed resistance variants were single transitions, single transversions or a combination of single transition and single transversion ( Table 7), with the majority being single transitions. The quasispecies diversity of each isolate in each region and the inter-patient variation was also determined by the Shannon entropy at the nucleotide and amino acid levels. The only difference was seen between G1a and G1b isolates when comparing the Shannon entropy calculated for the NS5B amino acid sequences (0.19 vs 0.13; p,0.001; Mann-Whitney Test). When comparing the Shannon entropy between two patient groups, a statistical difference at the amino acid level was not always found to be correlated with a statistical difference at the nucleotide level (or vice versa), possibly due to codon degeneracy and other factors. For example, in a retrospective investigation of ten patients chronically infected with HCV G3a and treated with peginterferon/ribavirin, Moreau et al reported that, at the baseline time point, the treatment failure group was found to have a higher Shannon entropy at the amino acid level (but not at the nucleotide level) than the sustained virological responders group [41].
In summary, no significant difference in median S QS(nt) levels was observed in this study that could differentiate G1a and G1b quasispecies in either the NS3 or NS5B region. Similar results have been reported in a recent smaller UDPS study, based on 8 G1a and 6 G1b samples. [36] It will be of future interest to investigate whether correlations can be established between Shannon entropy within the NS5B region at the amino acid level and treatment failure rates across clinical trials of G1a and G1b patients treated with NS5B NIs or NNIs.
Despite the high genetic variability of HCV, only a third of the amino acid positions in NS3 (,23-33%) and NS5B (,29%) were found to contain high-abundance substitutions among all isolates in this study. This apparent discrepancy can be attributed, at least partly, to the fact that synonymous substitution rates in HCV are typically .10-fold higher than the non-synonymous substitution rates in the NS3 and NS5 regions [42], leading to many more nucleotide changes which do not alter amino acid identities. Additional constraints, such as the requirement to preserve the 3D structures of proteins and essential base-pairing in RNA, may have further reduced the number of ''neutral'' sites, where sequence change can be tolerated with no significant effects on virus fitness.
High-throughput sequencing technology such as UDPS and Illumina deep sequencing has provided powerful new tools to characterize the genomes of pathogens such as HCV, and has been increasingly utilized to detect and quantify rare variants, Table 7. Contribution of each mutational category (transition and transversion) to the development of drug-resistant substitutions (variants) in the NS3 and NS5B proteins reported in this study.

C316N (4)
The genetic barrier score (GBS) was calculated according to the model described by Van  which are below the detection limits of Sanger-based techniques such as clonal sequencing [36,[43][44][45]. However, the sensitive detection of rare variants requires the differentiation between actual variants and technical errors originating from library preparation and sequencing processes. By sequencing the NS3 and NS5B regions of two HCV reference plasmids (G1a H77 and G1b Con1), the error rates (either by HCV region or by HCV subtype) were found to be consistent with those described in the literature. [46][47][48] Subsequently, by combining the specific error rates of our UDPS methodology and a Poisson distribution, we were able to establish a conservative cut-off ($0.5%) which minimized false positive results in our HCV variants analysis.
In conclusion, the study reported here provide a rich source of data on the abundance and prevalence of HCV variants in treatment-naive G1 patients, and provide insights into the possible factors contributing to the observed differences between G1a-and G1b-infected patients in virological response and breakthrough rates. Using the largest collection of UDPS data of HCV isolates from treatment-naïve patients to date, we found that, at the genetic level, the variability was similar in G1a and G1b isolates and in both NS3 and NS5B regions. We observed no clear difference in entropy between G1a and 1b HCV, at least in the HCV regions studied here; the number of naturally occurring high-abundance and low-abundance drug-resistant variants in NS5B was similar in G1a and G1b. However, a non-significant but higher prevalence of PI-resistant low-abundance variants and a significantly higher prevalence of high-abundance PI-resistant variants were observed in G1a than G1b NS3 samples. Importantly, this large-scale study strongly supports the elimination of higher genetic variability in G1a as a major contributing factor to the observed differences in virological response and breakthrough rates between G1a-and G1b-infected patients treated with PIs. Instead, a natural prediction emerging from our results is that factors unrelated to intrinsic genetic variability, such as random mutagenesis and a low genetic barrier to resistance, are more likely to play major roles in determining a patient's response to DAA-based therapy. These factors may be considered preferentially in the development and clinical testing of future DAA-based therapy, which has the potential of becoming interferon-free and all-oral regimens that will cure most patients regardless of HCV genotypes, subtypes and prior treatment status [49].

Supporting Information
File S1 Lists of primers and barcodes used to generate the 454 amplicons. (DOCX)