Next-Generation Sequencing of HIV-1 RNA Genomes: Determination of Error Rates and Minimizing Artificial Recombination

Next-generation sequencing (NGS) is a valuable tool for the detection and quantification of HIV-1 variants in vivo. However, these technologies require detailed characterization and control of artificially induced errors to be applicable for accurate haplotype reconstruction. To investigate the occurrence of substitutions, insertions, and deletions at the individual steps of RT-PCR and NGS, 454 pyrosequencing was performed on amplified and non-amplified HIV-1 genomes. Artificial recombination was explored by mixing five different HIV-1 clonal strains (5-virus-mix) and applying different RT-PCR conditions followed by 454 pyrosequencing. Error rates ranged from 0.04–0.66% and were similar in amplified and non-amplified samples. Discrepancies were observed between forward and reverse reads, indicating that most errors were introduced during the pyrosequencing step. Using the 5-virus-mix, non-optimized, standard RT-PCR conditions introduced artificial recombinants in a fraction of at least 30% of the reads that subsequently led to an underestimation of true haplotype frequencies. We minimized the fraction of recombinants down to 0.9–2.6% by optimized, artifact-reducing RT-PCR conditions. This approach enabled correct haplotype reconstruction and frequency estimations consistent with reference data obtained by single genome amplification. RT-PCR conditions are crucial for correct frequency estimation and analysis of haplotypes in heterogeneous virus populations. We developed an RT-PCR procedure to generate NGS data useful for reliable haplotype reconstruction and quantification.


Introduction
Human immunodeficiency virus type 1 (HIV-1) is a highly diverse virus, not only on a global scale, but also within individual HIV-1 infected subjects [1]. The genetic variants constituting the viral population are called haplotypes, and these haplotypes form a viral quasispecies [2]. It has been shown that low-abundant haplotypes are already present in patients shortly after infection [3][4][5][6]. Numerous studies have shown that minority drug-resistant HIV-1 variants can be clinically relevant and lead to therapy failure, especially in the context of pre-existing minority variants harbouring resistance mutations to non-nucleoside reverse transcriptase inhibitors (NNRTI) [7][8][9][10]. Viral diversity has major implications on pathogenesis, drug resistance, and vaccine development. Since next-generation sequencing (NGS) platforms are widely available, virus populations can be studied much faster compared to the classical methodology of single genome sequencing. However, these technologies require rigorous estimation of error rates and identification of error sources, especially when viral haplotypes are quantified (reviewed in [11]). For instance, several studies have investigated the accuracy of the pyrosequencing technology, and it is well known that homopolymeric regions are the main source of insertion-deletion (indel) errors [12,13]. Moreover, the PCR polymerase can also contribute to this effect [14]. PCR artifacts are well known and addressed by optimizing PCR conditions and using high fidelity DNA polymerases [15]. Recently, primer identifiers have been described to circumvent some of the remaining PCR artifacts [16].
So far, not much attention has been drawn to the cDNA synthesis that is required as first step when RNA, rather than DNA, is the source for genetic analyses. RTs are error-prone enzymes [17], and misincorporations during cDNA synthesis are difficult to avoid and almost impossible to distinguish from real variations, especially in heterogeneous viruses such as HIV-1.
In vitro recombination has almost exclusively been studied on DNA templates and numerous improved PCR conditions have been described [18][19][20][21][22][23][24][25][26][27][28]. Amplifying a heterogeneous DNA sample can lead to artificial chimeras and therefore to an overestimation of genetic variation [18,24,25]. PCR-mediated chimeras are mainly created by prematurely terminated template extensions during PCR and subsequent false priming of these short sequences to a non-homologous sequence in the following cycles [21,23]. A previous study has shown that PCR-induced recombinants can account for up to 30% of the final PCR product [19]. Several factors can influence PCR-induced in vitro recombination, including template amount and polymerase processivity [20][21][22], but in vitro recombination induced by reverse transcription is poorly studied. So far, only Fang and co-workers studied HIV-1 cDNA synthesis-induced in vitro recombination and showed that a 2.5fold higher in vitro recombination rate can be observed in RT-PCR compared to DNA PCR when a long 4.5 kb fragment is amplified, probably due to prematurely terminated cDNA synthesis or RNA molecules degraded prior to the RT reaction [29].
Minimizing in vitro recombinants is particularly important when studying the intra-patient diversity of viruses like HIV-1. Besides a high mutation rate, this virus has the natural ability to recombine, which is one of several options of HIV-1 to circumvent selection pressures and to adapt to a new host [30,31].
Here, we estimated the error rates and characterized possible error sources for the 454 pyrosequencing technology at all stages of the procedure. We established an optimized, artifact-reducing RT-PCR protocol to reverse transcribe, amplify, and pyrosequence HIV-1 RNA genomes enabling accurate haplotype analysis based on entire sequence reads.

Substitution and Insertion/Deletion Rates and their Sources
To estimate the error rates of the different steps in the procedure of 454 pyrosequencing, the protease gene of the virus strain HIV-1 JR-CSF was amplified and 454 pyrosequenced following three different experimental procedures. In the first procedure, the plasmid pYK-JRCSF, containing the full-length sequence of HIV-1 JR-CSF , was digested using restriction enzymes flanking the protease gene. Adaptors were ligated to the protease gene to obtain a fragment for direct 454 pyrosequencing. We refer to this sample as ''NGS'' (figure 1A). It is used to evaluate the substitution and indel (insertions and deletions) rates of the emulsion PCR and the pyrosequencing procedure. In the second set-up, the exact same plasmid preparation was used to amplify the protease gene using fusion primers that consist of a HIV-1 specific region, a multiplex identifier and either the A or B sequence required for 454 pyrosequencing. This sample is named ''PCR-NGS'', as only one, the inner, PCR was done to obtain the amplicon ( figure 1A). This experiment was performed to estimate the substitution and indel rates of PCR, emulsion PCR, and pyrosequencing. In the third set-up, again the same plasmid preparation was used to produce the virus stock HIV-1 JR-CSF from which viral RNA was isolated and reverse transcribed followed by outer and inner PCRs. This sample is named ''RT-2PCR-NGS'' (figure 1A). This set-up was used to estimate the substitution and indel rates of the complete procedure that is commonly applied to pyrosequence HIV-1 from patients' plasma samples (RT, outer PCR, inner PCR, emulsion PCR, and pyrosequencing).
All three experimental procedures were set up in duplicates and pooled before pyrosequencing. Reads were aligned to the HIV-1 JR-CSF reference sequence, forward and reverse reads were analyzed separately (see Materials and Methods). Every difference between a read and the reference was counted as an error. Table 1 depicts the average substitution and indel rates per nucleotide for each sample. The substitution rates per nucleotide varied between 0.08-0.16%, not showing clear patterns in regard to either the different experimental procedures nor to forward and reverse reads. In contrast, indel rates varied considerably. In comparison, deletion rates were 2.7-5.5 -fold lower in reverse reads than in forward reads obtained from PCR-NGS and RT-2PCR-NGS samples and approximately twofold higher in reverse reads of NGS samples (table 1). Insertion rates varied less in forward and reverse reads of PCR-NGS and RT-2PCR-NGS samples, but they were .3-fold higher in reverse reads than in forward reads of NGS samples. The analysis of substitution and indel rates per position in forward and reverse reads revealed that these errors occurred mainly in the context of homopolymers (figure 1B). The longest homopolymer (six guanines) is located at position 18-23 (figure 1B). It mainly caused artificial deletions in forward reads and insertions in reverse reads, explaining the differences in average error rates (table 1).

Characterization of Molecular HIV-1 Clones
293T cells were transfected separately with five different HIV-1 full-length plasmids to obtain molecular HIV-1 clones. As a control, each of them was 454 pyrosequenced separately to estimate the substitution and indel rates and to exclude the presence of any recombinants in these virus stocks. Between approximately 5,000 and 34,000 reads were analyzed per sample. Within the 271 base pairs (bp) long analyzed region of the viral protease gene, the mean substitution rates were on average 0.09% per nucleotide and evenly distributed among the amplicons, except the high substitution rates within the six-G homopolymer region of the amplicons (figure 2). The insertion and deletion rates were on average 0.1% and 0.2% per nucleotide, respectively. Recombinants were not observed in any of these molecular clones.
Sequences were similar to the published sequences of each molecular HIV-1 clone except for HIV-1 89.6 . Here, our HIV-1 89.6 strain consists of GAG instead of AGT at positions 2360-2 and of G instead of A at position 2371 (based on HIV-1 89.6, GenBank accession number U39362). The largest genetic distance was between HIV-1 NL4-3 and HIV-1 89.6 consisting of 13 mismatches, the lowest genetic distance was 6 mismatches (seen in 4/10 possible pairs of the 5 HIV-1 strains) enabling the investigation of in vitro recombination (figure 2, orange bars).

In vitro Recombination Frequency is Influenced by RT-PCR Conditions
PCR conditions can influence the formation of artificial chimeras [21,22,27]. To test the effect of in vitro recombination during RT-PCR, a 5-virus-mix was generated consisting of HIV-1 HXB2 , HIV-1 NL4-3 , HIV-1 JR-CSF , HIV-1 89.6 , and HIV-1 YU2 . The molecular HIV-1 clones were mixed in approximate same amounts. In each of these and the following experiments, the exact same volume of the 5-virus-mix, the equivalent of 100,000 HIV-1 RNA copies, was used. After viral RNA was extracted, it was reverse transcribed using three different RT enzymes: Transcriptor High Fidelity RT, M-MuLV RT, RNase H 2 , and SuperScript III RT. Each cDNA synthesis was performed in duplicate. cDNA was amplified by outer and inner PCRs and pyrosequenced (table 2). A total of 1,649-100,263 reads were obtained per sample and artificial recombination was estimated using Recco. We found 30.6-37.1% of all reads to be artificial recombinants when the cDNA was amplified using standard PCR conditions and no adjustment of input copy numbers for nested PCR was performed (table 3, PR1-2).
To reduce the high in vitro recombination frequency, PCR cycling conditions were optimized by increasing the elongation time to minimize the occurrence of prematurely terminated extension events [32], increasing dNTP and oligonucleotide concentrations, and omitting the final extension step [27] (table 2). Furthermore, after the first PCR, amplicons were quantified and 10 5 DNA copies were transferred to the second round of amplification. With these modifications, the artificial recombination rate was reduced to 0.9-2.6% (table 3, PR3-8), as measured in six independent samples. The choice of the reverse transcriptase did not influence the artificial recombination rate.

Characteristics of False Haplotypes Induced by in vitro Recombination
In the haplotypes reconstructed by ShoRAH, we observed the original strains together with other sequences. The latter can be subdivided in two classes: 1. recombinants of the strains (in vitro recombinants) and 2. viral variants harbouring artificial substitutions and/or indels (here called erroneous haplotypes). Applying standard RT-PCR conditions (samples PR1 and PR2), 53.6 and 43.9% of all reconstructed haplotypes, respectively, were classified as in vitro recombinants (table 3). These rates were substantially higher than the estimates of in vitro recombinants by Recco, which can be explained by the different algorithms and subsequent procedures used as explained in materials and methods. Figure 3 shows the 23 in vitro recombinants found at frequencies $1% of the viral population in samples PR1 and PR2, 17 of which could be found in both samples. Two recombination events per chimera can be clearly assigned to four in vitro recombinants (m, q, r and t, figure 3); for the remaining 19 in vitro recombinants, one recombination event can be clearly identified, although the occurrence of more than one recombination event cannot be excluded. For samples PR3-8, where optimized RT-PCR condi- A) The molecular full-length HIV-1 clone pYK-JRCSF was used to generate three different samples for the determination of error rates and error sources during the different steps of sample preparation. The blue (left) pathway indicates the procedure NGS, i.e., no amplification step was performed before emulsion PCR and pyrosequencing. The green (right) pathway shows the procedure PCR-NGS, i.e., the target was amplified once prior to 454 emulsion PCR/ pyrosequencing. The orange (middle) pathway depicts the commonly used procedure RT-2PCR-NGS to reverse transcribe, amplify and sequence HIV-1 RNA genomes reflecting the errors that will occur using patients' plasma samples to analyze HIV-1 haplotypes. Detailed description of each step is given in the materials and methods section. B) Error rates per positions are shown for forward reads (left) and reverse reads (right). For each duplicate, one example is shown (always sample a as presented table 1). doi:10.1371/journal.pone.0074249.g001 tions were used, including limited input copy number, no in vitro recombinants were found at frequencies $1%. None of the 472 clones analyzed using single genome amplification was an in vitro recombinant. These results are consistent with the estimated recombination rates by Recco (table 3).

Haplotype Reconstruction by ShoRAH Reveals Different Frequencies of True Haplotypes in Different Methodological Settings
ShoRAH was applied for haplotype reconstruction from 454 pyrosequencing data obtained from the 5-virus-mix containing HIV-1 HXB2 , HIV-1 NL4-3 , HIV-1 JR-CSF , HIV-1 89.6 , and HIV-1 YU2 . The true haplotypes were successfully reconstructed in the following frequencies in these six samples obtained using optimized amplification conditions (PR3-8, table 3 (table 3).
Haplotype frequency analysis based on 454 pyrosequencing data obtained using standard amplification conditions showed that less than 30% of all reads belong to the true haplotypes (PR1 and PR2, table 3). This leads to a substantial underestimation of the frequency of each true haplotype as compared to the proportions obtained by single genome amplification or 454 pyrosequencing from samples amplified applying optimized conditions (table 3).

Discussion
Characterizing the diversity and evolutionary dynamics of virus populations within infected hosts is of great importance. For instance, it provides insights into virus escape mechanisms and development of drug resistance. Haplotype determination can be of therapeutic relevance, because pre-existing minority drugresistant variants present in a patient can increase the risk of therapy failure as shown for HIV-1 in a recent meta-analysis [8]. NGS technologies enable the fast acquisition of thousands to millions of sequences from one sample, making it a powerful tool to study diverse virus populations. However, the analysis can be hampered by several experimental errors, occurring both during library preparation and sequencing (reviewed in [11]). For data analysis, two major in vitro artifacts have to be considered: 1) substitution and indel errors and 2) in vitro recombinants. Both can lead to wrong estimates of diversity within a virus population. This artificial diversity is difficult to distinguish from the real diversity especially for HIV-1, as the mutation rate is high and recombination also occurs frequently in vivo [33].
We designed a control experiment to estimate the substitution and indel error rates at each different amplification step of 454 pyrosequencing and of the sequencing technology itself. We used the same HIV-1 full-length plasmid, obtained from a single bacterial clone, and processed it with different techniques to estimate the error rate at each step of the pyrosequencing technology. The indel rates did not differ substantially between the different steps and occurred predominantly in homopolymeric regions clearly showing that these errors were generated during the pyrosequencing step [12,34].
The substitution rate was .10-fold higher as previously described for non-pre-amplified genomic DNA fragments [12]. This might reflect the substitutions introduced by bacteria during the numerous duplications of the transfected plasmid. We expected that the substitution rate is higher in the samples generated by RT-PCR followed by outer and inner PCRs than in samples amplified only once prior to NGS. Interestingly, the substitution rates were only marginally higher in the PCR-NGS approach. Again, it might be possible that the overall diversity of plasmids obtained from bacteria after numerous duplications leads to a substantial amount of plasmids, which will not result in the generation of intact virus particles in transfected 293T cells. Thus, the transfection and the harvest of cell-free supernatant would display a bottleneck resulting in a less heterogeneous virus population compared to the plasmid ''population''. In summary, our RT-2PCR-NGS experimental procedure showed an average substitution rate of ,0.1%, which is consistent with previous studies [13,[35][36][37][38][39].
In vitro recombination represents a more severe problem especially for haplotype analysis of amplified viral RNA genomes. In our set-up, mixing five diverse virus strains, using standard  amplification procedures and applying a very strict analysis by Recco, the in vitro recombination frequency reached up to 37%. Analysis by ShoRAH and manual inspection revealed an in vitro recombination frequency of up to 53.6%. These numbers may still underestimate the real in vitro recombination frequency, since up to 27.6% of false haplotypes were not clearly classifiable in these samples. Artificial chimeras inflate viral diversity estimates and, on the other hand, lead to wrong frequency estimates of the true haplotypes. Optimizing the amplification conditions and limiting the input DNA copy numbers in the second, outer PCR reduced the in vitro recombination rate to 0.9-2.6%. It has been previously shown that the input DNA copy number is a critical factor in the generation of artificial recombinants [19]. Despite these optimized conditions, our in vitro recombination rates were still higher than previously reported rates of 0.11-0.89% also using optimized PCR conditions [25,28,33,38]. One major dissimilarity between our experimental approach and those methods was the template used. We performed a RT-PCR starting with HIV-1 RNA whereas the RT step was omitted in the other approaches, i.e., they started with viral DNA. It is known that prematurely terminated amplicons during PCR are mainly responsible for in vitro recombination in PCR reactions [20,22,[25][26][27]36,40,41]. However, in RT-PCR procedures, prematurely terminated cDNA fragments during cDNA synthesis can additionally serve as primers in subsequent amplifications [29]. Fang et al. showed that the in vitro recombination rate was approximately 3-fold higher when RT-PCR was compared to PCR alone (6.49% vs 2.65%, respectively) [29]. An in vitro recombination rate of 1.56%, i.e., more comparable to ours, has been recently published applying RT-PCR to generate amplicons for subsequent NGS procedures [42]. In a recent publication, Jabara et al. showed that by using degenerated primers for cDNA synthesis, PCR artifacts can be excluded from further analysis [16]. This approach is useful to exclude polymerase-induced misincorporations and PCR-induced recombination; however, it cannot identify RT-induced errors and recombinants.     It must be noted that our approach to amplify an almost equal mixture of five divergent virus strains is an extreme example of a heterogeneous population that enhances the likelihood of detectable recombination events. If the population is homogeneous, one will not find artificial chimeras to such an extent, but this does not mean that in vitro recombination does not occur. Rather, it means that one cannot identify any recombination event, because such an event can only be seen when it occurs between two different DNA molecules that can be distinguished by their genetic dissimilarity. In fact, a PCR-generated recombinant can only be the third most frequent haplotype if it is a chimera of the two most abundant ones [22]. On the other hand, the HIV-1 population of an infected individual is expected to show the classical quasispecies profile, with one dominant master sequence and a large number of lowabundant haplotypes. Thus, in vitro recombination between different haplotypes will be less evident. Nevertheless, it is important to know the possible in vitro recombination frequency within a sample, for example, when investigating superinfections.
The RT step in our experiments did not particularly affect the recombination rates. One reason could be that rather a short amplicon (290 bp) was amplified. Amplicon length can influence recombination rates. Two studies reported a low PCR-induced recombination rate (below 1%) within a short amplicon (120-265 bp) [26,28]. Amplification of a longer amplicon (1.5 kb) led to higher recombination rates [43]. Fang et al. showed that in vitro recombination rates were higher in RT-PCR than PCR alone when amplifying a long molecule of over 4 kb [29]. The reasons might be lower efficiencies of reverse transcriptases to generate long amplicons, or that RNA is less stable than DNA, i.e., degraded RNA can lead to incomplete cDNAs. Amplicon sizes are increasing as a result of enhanced NGS read lengths; consequently, RT-PCR artifacts might become more abundant.
In summary, we have developed an optimized RT-PCR protocol suitable to amplify and sequence HIV-1 RNA genomes via 454 pyrosequencing exhibiting low error rates. We show that abundant in vitro recombinants influence haplotype reconstruction and lead to artificially high diversity as well as to a bias in quantification of true haplotypes present in the viral population. Thus, it is crucial to estimate and minimize the in vitro recombination frequency as well as to consider PCR-and RTinduced artificial errors in any subsequent analysis in terms of characteristics and frequencies of variants, especially those of RNA sources.

Pyrosequencing
DNA was measured using the Quant-iT TM PicoGreenH dsDNA Assay (Invitrogen) according to the manufacturer's protocol. Equimolar DNA amounts were pooled for emulsion PCR. Pyrosequencing was performed with the GS FLX System using the GS FLX Titanium MV emPCR Kit (Lib-A) or the GS Junior System using the GS Junior Titanium emPCR Kit (Lib-A) (Roche-454 Life Sciences).

Single Genome Amplification
The frequencies of the five virus strains in the 5-virus-mix sample used for the standard PCR protocol were investigated using single genome amplification. cDNAs generated with the three RT enzymes Transcriptor RT, Transcriptor High Fidelity cDNA Synthesis Kit, M-MuLV RT, RNase H 2 , was diluted in such a way that each amplicon was derived from one cDNA copy (a maximum of 1 of 5 PCR reactions were positive). qPCR was performed in a real-time cycler ABI7500 (Applied Biosystems) as follows: 94uC-59, 506(94uC-1599, 55uC-3099, 72uC-4599) followed by a melt curve analysis in a total volume of 20 ml containing 5 nM (ROX, Invitrogen) 1.

Data Analysis
The analyses of error rates for samples NGS, PCR-NGS and RT-2PCR-NGS were done separately for forward reads and reverse reads and restricted to the region nt 2281-2569 (based on HIV-1 HXB2 ) that overlaps in all three experimental set ups. Substitutions, insertions and deletions were analyzed by aligning reads to the HIV-1 JR-CSF reference sequence after removal of reads with gaps of .10 nt. The alignments were computed with needle (implementing the Needleman-Wunsch algorithm), from the software suite EMBOSS [44] and the differences were counted as errors. Analysis of PR1-8 was done on the region nt 2279-2549 (based on HIV-1 HXB2 ). Haplotype reconstruction was performed using the software ShoRAH (Short Read Assembly into Haplotypes) [45], a tool developed to correct sequencing errors in order to reconstruct the true local variants present in the virus population. ShoRAH clusters the reads (it groups them according to their similarity) and removes the intra-cluster variation to eliminate sequencing errors. The recombination analysis was performed using the software Recco [46]. Each read was included in a multiple sequence alignment with the five reference sequences (the viral strains in the 5-virus-mix) computed with muscle [47] and then passed to Recco. This software computes the number of ''savings'', i.e., number of mismatches saved when explaining the read by a recombination event between two viral strains and mutations, rather than by a single strain and mutations only. For each sample, a histogram was produced counting the number of reads with a given number of savings. The proportion of recombinant reads was estimated on each sample by comparing the results obtained on the real reads to those results of a set of simulated reads obtained from the viral strains by substitutions only. In this analysis, reads were defined as recombinant when Recco assigns a savings value higher than two, which was the maximum value observed on the simulated reads. It is important to say that this analysis tends to underestimate the number of recombinant reads. In fact, even on datasets where all reads are recombinant, Recco reports saving values of two or less for a substantial fraction of the reads.