Quantifying the Diversification of Hepatitis C Virus (HCV) during Primary Infection: Estimates of the In Vivo Mutation Rate

Hepatitis C virus (HCV) is present in the host with multiple variants generated by its error prone RNA-dependent RNA polymerase. Little is known about the initial viral diversification and the viral life cycle processes that influence diversity. We studied the diversification of HCV during acute infection in 17 plasma donors, with frequent sampling early in infection. To analyze these data, we developed a new stochastic model of the HCV life cycle. We found that the accumulation of mutations is surprisingly slow: at 30 days, the viral population on average is still 46% identical to its transmitted viral genome. Fitting the model to the sequence data, we estimate the median in vivo viral mutation rate is 2.5×10−5 mutations per nucleotide per genome replication (range 1.6–6.2×10−5), about 5-fold lower than previous estimates. To confirm these results we analyzed the frequency of stop codons (N = 10) among all possible non-sense mutation targets (M = 898,335), and found a mutation rate of 2.8–3.2×10−5, consistent with the estimate from the dynamical model. The slow accumulation of mutations is consistent with slow turnover of infected cells and replication complexes within infected cells. This slow turnover is also inferred from the viral load kinetics. Our estimated mutation rate, which is similar to that of other RNA viruses (e.g., HIV and influenza), is also compatible with the accumulation of substitutions seen in HCV at the population level. Our model identifies the relevant processes (long-lived cells and slow turnover of replication complexes) and parameters involved in determining the rate of HCV diversification.

We also varied the fraction of mutations that are considered to be lethal, and that lead to nonproductive RNA synthesis, , ( Figure S1C). For our baseline results we used =0. 4. However, if we take this fraction to be 0, 0.2, or 0.8, this does not affect the viral load profile during primary infection (left panel). This may be expected as the RNAs with lethal mutations are a very small fraction of the total number of RNAs produced, since they correspond to 40% of the ~10 -5 mutated RNAs. On the other hand, the effect of different  on the expected accumulation of mutations is large (right panel). If the fraction of lethal mutations is less than the original 0.4, then the model predicts more mutations than observed in the data, and to fit the latter we would need even smaller mutation rates than those estimated in the baseline scenario (Table 1). If the fraction of lethal mutations is larger than our original assumption of =0.4, say =0.8 ( Figure S1C), then the model predicts a slower accumulation of mutations and to fit the data we would need a larger value for the mutation rate. Still, even in this extreme case, the median estimated mutation is only 5.5×10 -5 per nucleotide per replication cycle, about twice our estimate for =0.4. However, the value of =0.4 used here is already high, corresponding to the maximum reported in the literature to date [1].

Estimating the Mutation Rate from Stop Codon Frequencies
Classical genetics shows that for a population in mutation-selection balance the frequency, f, of single point mutations with selection disadvantage s is given by f=/s, where  is the mutation rate. For lethal mutations, s=1, and the frequency of these mutations is simply f=, since they must have been generated at the last replication cycle. Cuevas et al. [2] proposed using this approach to estimate the mutation rate of HCV, using non-sense mutations, i.e. stop codons (UAA, UAG, UGA) as a proxy for lethal mutations.
There are 18 codons that upon mutation may generate a stop codon. These are: UUA, UUG, UCA, UCG, UAU, UAC, UGU, UGC, UGG, CAA, CAG, CGA, AAA, AAG, AGA, GAA, GAG, GGA. Following the notation in [2], in each codon we underline the nucleotide that is the mutation target, for a total of 19 non-sense mutation targets -NSMT (note that UGG has two targets). A priori, each of these targets can mutate to any of the other 3 nucleotides, but in most cases only the mutation to one of those 3 will generate a stop codon (for example, UGU -> UGA). In a few cases the underlined nucleotide can mutate to 2 of the (N), and the parental NSMT of each stop codon, such that we have N 1 stop codons generated by mutation of non-boldface, underlined nucleotides and N 2 stop codons originated by mutation of boldface, underlined nucleotides.
Cuevas et al. [2] then used the following reasoning. Let us assume that there are M 1 codons (underlined), for which a single nucleotide substitution leads to a stop codon; M 2 codons (underlined, boldface), for which two possible mutations lead to stop codons, with M= M 1 + M 2 ; and that the mutation rate per nucleotide per replication cycle is . Then in one replication cycle, we can write By counting each stop codon appropriately and the total number of NSMT, one can use this formula to directly estimate the mutation rate, as we do in the text.
It is important to note that each of the two expressions alone would allow us to estimate  (i.e., This expression for  has a simple interpretation. We divide the actual number of stop codons observed (N 1 +N 2 ) by the total number of possible mutations leading to stop codons (M 1 +2M 2 ). The factor 3 corrects for the fact that for each stop codon mutation observed, it is equally likely that the original nucleotide mutated to a non-stop codon nucleotide, which we did not count [3]. The two expressions that we derived for  give very similar estimates as shown in the main text, and we can show that in the limit of infinite data they are the same. Further, as we show below, the second expression is statistically a more efficient estimator, because it has a smaller sampling variance.
Since both  1 and  2 are independent estimates of , we can use the weighted average . The use of this optimal weight for the weighted average of the two estimates  1 and  2 leads to our second expression above for estimating This result can be trivially generalized to an arbitrary number of independent binomial processes with rates proportional to a single parameter,  to be estimated: when the rates are small, the minimal variance estimator is given by the total number of observed events divided by the expected number per unit  We note that both expressions for  are still an approximation, because they assume that all types of substitutions are equally likely, which is not correct. For example, transitions are preferred over transversions, and indeed we found that this bias is 18 to 1 (=18) in our data set, when corrected for available sites [4]. One way to implement this correction in the last formula for  is to count the possible