A Comprehensive Deep Sequencing Strategy for Full-Length Genomes of Influenza A

Driven by the impact of influenza A viruses on human and animal health, much research is conducted on this pathogen. To support this research, we designed an all influenza A-embracing reverse transcription-PCR (RT-PCR) for the generation of DNA from influenza A virus negative strand RNA genome segments for full-length genome deep sequencing on a Genome Sequencer FLX instrument. For high reliability, the RT-PCRs are designed such that every genome segment is divided into two amplicons and for the most variable segments redundancy is included. Moreover, to minimize the risk of contamination of diagnostic real-time PCRs by sequencing amplicons, RT-PCR does not generate amplicons that are amenable to RT-qPCR detection. With the presented protocol we were able to generate virtually all amplicons (99.3% success rate) from isolates representing all so far known 16 hemagglutinin and 9 neuraminidase subtypes and from an additional 2009 pandemic influenza A H1N1 virus. Three isolates were sequenced to analyze the suitability of the DNA for sequencing. Moreover, we provide a short R script that disambiguates the sequences of the primers used. We show that using unambiguous primer sequences for read trimming prior to assembly with the genome sequencer assembler software results in higher quality of the final genome sequences. Using the disambiguated primer sequences, high quality full-length sequences for the three isolates used for sequencing trials could be established from the raw data in de novo assemblies.


Introduction
Influenza A virus is one of the major threats in modern health care. Besides influenza viruses that cause seasonal epidemics, pandemic viruses such as the 2009 H1N1 swine origin influenza A virus have a significant impact on human and animal health. Therefore, intensive research is carried out tackling different aspects of influenza A virus properties, very often trying to correlate phenotypes with genotypes.
The segmented negative strand RNA genome of influenza A virus comprises 8 segments ranging in size approximately from 850 nt to 2350 nt. Novel viruses arise due to antigenic drift and antigenic shift. Antigenic drift is caused by continuous mutation of the genomes due to the high error rate of the viral RNAdependent RNA-polymerase. Reassortments of genome segments during infection of a single host with different influenza A viruses result in antigenic shift (reviewed in [1]). Consequently, a high variability of the genome composition leading to diverse virus mixtures may be expected. Therefore, the complete characterization of influenza A viruses is only possible by precise determination of full-length genome sequences by deep sequencing. Also for diagnostics, there is the need for comprehensive tools in this field. This demand has been highlighted once more by the sudden and unexpected emergence and fast spread of the pandemic H1N1 swine origin influenza A virus in 2009.
The existing DNA sequencing capacity that is now available with the different next generation sequencing technologies [2] more than suffices to accurately sequence any DNA sample at any desired sequence depth. On the contrary, adequate protocols for correspondingly satisfying preparation of DNA in sufficient amount and quality for sequencing of complete virus genomes are not at hand.
The most apparent problem in full-length virus genome sequencing is to target the sequencing process away from host nucleic acids to viral nucleic acids. Preparing a simple shotgun sequencing DNA library, the most comprehensive approach, will most likely result in a huge amount of host specific instead of fulllength viral sequences. The same holds true when working with libraries prepared from cDNA synthesized from RNA with random priming [3,4]. So far, different techniques to focus the sequencing have been described [5,6,7,8]. First of all, the most obvious is to enrich virions from cell culture or from host tissue or fluids by centrifugation before extracting the genomic DNA/RNA [5]. This may be a very time-consuming and laborious procedure with sometimes uncertain outcome. Secondly, it is possible to enrich for the portion of nucleic acids that is the aim of the sequencing effort if these can be caught by probing the complete DNA/RNA. This procedure was engaged, for example, in the first official RNA sequencing protocol for the Genome Sequencer FLX (GS FLX) [7]. Here, the influenza A genome segments were enriched by hybridization to biotinylated probes directed against the conserved ends of the genome segments. Subsequently, the probes were bound to streptavidin beads, the samples washed and finally the captured RNA was eluted. A similar approach was used by Monger and colleagues [8] who depleted host nucleic acids by probing total RNA with labeled host nucleic acid. Another approach to target the sequencing to viral nucleic acid was published by Potgieter and colleagues [6]. In this case, a protocol was designed to enrich for double stranded RNA (dsRNA) exclusively representing the bluetongue virus or other dsRNAvirus genomes. Nevertheless, as with the first option, targeting the sequencing effort to a subpopulation of the RNA extracted from a sample also is laborious and the risk of failure is high, as the outcome might not be sufficient in quality or amount to allow for sequencing.
Consequently, it is straightforward to target the sequencing by simply targeting the preparation of DNA for sequencing by specific PCR primers. In the case of highly pathogenic avian influenza A H5N1, it was possible to target and simultaneously speed up the sequencing by a sensitive and reliable reverse transcription-PCR (RT-PCR) [9]. Enrichment of virus nucleic acids by the different available techniques mentioned above was not necessary. Here, we describe a method that fulfills the demand for easy and reliable, but in addition at the same time broad and comprehensive sequencing of influenza A viruses that goes far beyond the previously reported specialized H5N1 sequencing approach.

Viruses
The virus strains used in this study (see table 1) were obtained from the virus collection of the National Reference Laboratory for Avian Influenza at the Friedrich-Loeffler-Institut, Greifswald, Insel Riems, Germany.

RNA extraction
Total RNA was isolated from allantoic fluid with Trizol LS (Invitrogen, Darmstadt, Germany) in combination with RNeasy Mini spin columns (Qiagen, Hilden, Germany). First, 250 ml of allantoic fluid were mixed with 750 ml Trizol LS and processed as recommended by the manufacturer until phase separation. Instead of RNA precipitation, the RNA containing aqueous phase was then mixed with ethanol (final concentration $40% (vol/vol)) and loaded onto an RNeasy Mini spin column. Subsequently, the RNeasy Mini protocol was followed until elution of RNA. Alternatively, total RNA was isolated from allantoic fluid (140 ml) with a QIAamp viral RNA minikit (Qiagen, Hilden, Germany), according to the manufacturer's instructions.
PCR, library preparation, sequencing, and sequence assembly DNA was generated either in a one-step or in a two-step RT-PCR. For the one-step RT-PCR protocol, cDNA was generated from total RNA diluted 10-fold in RNA-safe buffer (50 ng/ml carrier RNA, 0.05% Tween 20, 0.05% sodium azide in RNasefree water) [10] by RT-PCR with a SuperScript III one-step RT-PCR system with Platinum Taq high-fidelity polymerase (Invitrogen). For every complete influenza A genome, 16 PCRs were set up with the primer combinations listed in table 2. Five microliters of the dilute template and 1 ml of each primer (10 mM; final concentration in the PCR mixture, 0.4 mM) were mixed and denatured for 2 min at 95uC. Immediately after denaturation, the samples were frozen in liquid nitrogen, and subsequently, 18 ml of the PCR master mixture (12.5 ml 26 reaction mixture, 1 ml Superscript III reverse transcriptase-Platinum Taq high-fidelity polymerase enzyme mixture, 4.5 ml RNase and DNase-free distilled water) was added. The RT-PCR was performed in a model 2720 thermal cycler (Applied Biosystems, Darmstadt, Germany). The cycling parameters were 30 min at 50uC; 2 min at 94uC; 40 cycles of 15 s at 94uC, 30 s at 50uC, and 3 min at 68uC; and 1 cycle of a hold for 5 min at 68uC. Alternatively, twostep RT-PCR was performed with SuperScript VILO cDNA Synthesis Kit (Invitrogen) and AccuPrime Taq DNA Polymerase High Fidelity (Invitrogen). For the two-step protocol, 2 separate reverse transcriptions were performed, one for the 59 parts and one for the 39 parts of the segments. For this purpose, 5 ml total RNA were mixed with 4 ml 56 VILO reaction mix, 2 ml 106 Superscript enzyme mix, and 7 ml RNase and DNase-free distilled water. To this mix containing random hexamer primers, for the 59 portion of the segments 2 ml of general primer pan-IVA-1F-M13F (10 mM) or for the 39 portion of the segments 2 ml of RNase and DNase-free distilled water were added. Reverse transcription was then performed in a model 2720 thermal cycler (temperature profile: 10 min at 25uC, 5 min at 42uC, 30 min at 55uC, 5 min at 60uC, 5 min at 85uC, and finally 15 s at 10uC). Immediately after reverse transcription, PCR was set up. To this end, 2.5 ml of the cDNA were used as template and combined with 2.5 ml 106 AccuPrime PCR Buffer 1, 0.1 ml AccuPrime Taq High Fidelity, 17.9 ml RNase and DNase-free distilled water, and 2 ml of a primer mix containing primers (each 5 mM) for amplification of the respective portion of the genome segment according to table 2. The reaction mixes were incubated initially 1 min at 94uC; 40 cycles of 15 s at 94uC, 30 s at 50uC, and 3 min at 68uC; and 1 cycle of a hold for 5 min at 68uC. The PCR products both from the one-step and the two-step protocol were purified with Ampure Beads (Beckman Coulter Genomics, Bernried, Germany) according to the manufacturer's instructions. The purified DNA was quantified with a model 1000 spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE) in the double-stranded DNA mode and was then pooled in equimolar amounts. This DNA pool (800 ng to 3 mg) was transformed to DNA libraries by using a GS DNA library preparation kit (Roche, Mannheim, Germany) using the protocol of Wiley and colleagues [11] with the modifications of Leifer and colleagues [12]. For bead bound clonal amplification of the DNA libraries, the DNA libraries were subjected to duplicate emulsion PCRs (emPCR) with the GS emPCR kit I (Roche), according to the manufacturer's instructions, with 2 copies per bead. After bead recovery and enrichment, the beads were sequenced using a GS LR70 sequencing kit (Roche) and the appropriate instrument run protocol. The resulting sequencing reads were sorted according to the genome segments to which they related and were subsequently assembled into one contig (i.e., a set of overlapping sequencing reads) per segment with the GS FLX sequence assembly software newbler (version 2.3; Roche). During the assembly, the primer sequences were trimmed off the raw data.
In addition, to check whether or not the sequences were complete, we performed a mapping along appropriate reference sequences using the GS FLX reference mapper software (version 2.3; Roche).

PCR Design
Primers were designed on the basis of alignments of all influenza A full-length segment sequences that included the conserved ends and regardless of the host of the virus isolate available at the time of primer design. Positioning of the primers for reverse transcription and amplification was done following three key criteria: (i) None of the 8 genome segments should be reversely transcribed and amplified as a single amplicon but for better performance be divided in a 59-and a 39-part. (ii) In order to reduce the risk of PCR failure caused by non-matching primers due to influenza A virus evolution, we made use of the conserved regions at both the 59 and the 39 segment ends as primer binding sites. (iii) To be also compatible with diagnostic settings, the two amplicons to be generated for each segment should, if possible, not lead to the generation of DNA traceable in the real-time PCR assays in use in diagnostic laboratories for influenza A diagnostics. This prevents interference with diagnostics by PCR products that might be released at any step during the sequencing library preparation process. The RT-PCR design resulting from these considerations is depicted schematically in figure 1. As can be seen in figure 1 (and table 2), in some instances, we use redundant inner segment specific primers. This redundancy is necessary due to the high variability in the sequences of the respective genome segments.

Generation of sequencing amplicons
The functionality of our PCR protocol was tested with 17 different influenza A isolates that comprise all so far known 16 hemagglutinin (HA) and 9 neuraminidase (NA) types. In addition, a 2009 pandemic H1N1 influenza A virus was included in the panel. Table 3 summarizes the results of the PCRs for all segments from all isolates that were probed. Altogether, 270 out of 272 PCRs (99.3%) that were performed were successful, i.e. with only very few exceptions all amplicons from all isolates could be generated. In no case the PCRs for both the 59-and the 39-fragments for one segment failed. Therefore, even in the case of a PCR failure we were able to gain sequence information for the respective genome segment. Pan-IVA-S1-1254R Pan-IVA-S1-1395R GCA Sequencing the PCR products Most importantly, we were interested in the applicability of the amplicons for sequencing with regard to purity, i.e. absence of non-specifically amplified host sequences, and with respect to completeness of the influenza A sequences. Therefore, we sequenced the DNA fragment pools confected from the PCR products of three isolates.

Primer disambiguation for sequence analysis
In general, in assemblies of data obtained from sequencing PCR products it is necessary to trim off the primer sequences before assembly in order not to distort the final sequences. This is even more important when amplicons have been generated with primers including ambiguities, i.e. incompletely specified bases [13], as is the case with the primers used here. Failure to trim off the primer sequences will result in final sequences which include these ambiguities and hence deviate from the true sequence.
The powerful 454 assembler newbler used in this study cannot handle nucleotide sequence ambiguity codes in trim information [14]. If there is no clearly dominating base at a certain position, the assembler is not able to determine the consensus base at that position. This either leads to low quality basecalls, single base insertions or deletions, or the software splits the sequences at these ambiguous positions leading to a fragmentation of the final genome sequences. Therefore, it is necessary to disambiguate the primer Figure 1. Schematic of the primer localizations. PCR products used for sequencing are designed such that every segment is amplified in two amplicons. Moreover, no amplicon containing the complete region necessary for detection in a diagnostic real-time RT-PCR is generated. This reduces the risk of compromising nucleic acid diagnostics by sequencing amplicons. For some segments, redundancy is introduced for more reliability. doi:10.1371/journal.pone.0019075.g001 sequences to allow for correct trimming. To this end, a short R [15] program (see supplemental File S1) was written. This program makes use of the seqinr package [16]. Table 4 summarizes the data obtained in de novo assemblies with or without primer disambiguation. From the numbers shown there it is evident that disambiguation of the primers enhances final data quality. Due to the impact of primer disambiguation, for all further analyses, disambiguated primer sequences were used.

Analysis of data quality, quantity, and usability
For analyses of the quality and quantity of the obtained raw sequences, we first performed a mapping of the reads along the known reference sequences. This allowed us to assess the purity, completeness, and fidelity. Moreover, the raw data were assembled without any advance sequence information in order to test the usability for de novo assembly.
With regard to completeness, we were particularly curious whether the complete sequences for segments 5 and 7 could be generated. This was interesting because we used very short PCR products (,300 bp) for the 39 and 59 part of these segments, respectively. Such short DNA fragments are depleted during preparation of the sequencing libraries. Nevertheless, mapping of the sequence reads against the appropriate reference sequences showed that all segments had been completely sequenced. In all cases, only the sequences of the binding sites of the general forward and reverse primers, respectively, were missing from the final sequences due to primer trimming.
Besides the assessment of DNA purity and sequence completeness, mapping also provided information about the fidelity of the newly generated sequences. The results of the sequence comparisons are compiled in table 5. Except for a few deviations in the regions of the general primers, i.e. outside the coding regions, we rarely detected sequence deviations in the sequences of all 3 isolates. With regard to the deviations in the primer regions it is necessary to note that these occur due to primer trimming. Primer trimming leads to a diminished sequence depth in the trimmed regions causing a diminished sequence quality. With regard to those deviations which were found in the coding regions, we detected 3 differences between newly established and the reference sequences for all segments of isolate A/shearwater/West Australia/2576/79 (H15N9). Namely, we found one deviation in the HA gene (a677g), one base substitution in the NA (a866t), and another difference in the NP gene (a1459c) which is a primer derived substitution. In the sequences for isolate A/turkey/Ontario/6118/68 (H8N4), we detected 1 difference within the open reading frames. This n499g was located in segment 1 and obviously is not a deviation from the reference sequence. The figures for the pandemic H1N1 virus were much the same as for the other 2 isolates. Here, we also found a single deviation from the reference sequences, namely an a837g substitution in segment 2. In summary, we conclude that the sequence fidelity achievable with the established protocol is excellent. With an overall error rate of only 0.032% (including those deviations that are located in the general primer regions) the concordance of the newly determined with the beforehand known reference sequences from all 3 isolates that were sequenced is very high. It is important to note that the deviations may be acquired by the virus during repeated cell or egg culture passages.
De novo assembly. In order to test if the quality of the raw data and the overlaps of the PCR amplicons are sufficient for a de novo assembly, we assembled the raw sequence data without advance sequence information. To this end, we performed an assembly of the raw sequencing reads using the 454 assembler software newbler. As published before for sequencing of highly pathogenic avian influenza A H5N1 viruses [9], complete assembly into a single contig (a set of overlapping sequencing reads) per segment was only possible after allocating the sequencing reads to the segments. The assemblies of the data resulted in one contig per segment for all 3 isolates. Only the sequence for the segment 7 of the H8N4 was truncated. Here, the sequence for nucleotides 1-76 was missing in the de novo assembled sequence, caused by the depletion of the small fragment representing this region. Hence, the coding sequence for the H8N4 genome was 99.59% and those for the two other isolates were 100% complete. The main characteristics of all de novo assemblies are summarized in table 6. As for mapping, the fidelity of the de novo assembled sequences was very high. Here, the total error rate was 0.030%, again including those mutations that were detected in the general primer binding regions which resulted in a relatively low quality due to the trimming during the assembly.
The results of the comparisons of our de novo assembled unrevised sequences with the available reference sequences (table 7) predominantly showed the same differences which were detected by mapping (see table 5). Remarkably, four of the deviations detected by mapping located in the general primer binding regions were not detected in the assemblies. This provides an additional hint that these deviations are artifacts. Moreover, with regard to the coding regions, we did not detect the a866t substitution in the H15N9 NA sequence. On the other hand, three single nucleotide deletions were found in the de novo assembled sequences. In particular, we found one deletion in segment 7 of the pandemic H1N1, one deletion in segment 6 of the H8N4 virus, and one deletion in segment 5 of the assembly for the H15N9 virus's sequence. From the detailed assembly characteristics stored by the Table 4. Summary of the impact of primer disambiguation on sequence quality obtained by de novo assembly.

Ambiguous primers
Unambiguous primers assembler software, all these deletions could be identified as undercalls, i.e. as artifacts in the assembly. Taken together, despite the differences that are described here, the overall consistency of the unrevised sequences with the reference sequences is 99.97%.

Discussion
In several reports the possibility to benefit from shotgun high throughput sequencing for applications in virus research has been shown (see for instance [5,6,7,8]). However, none of the presented protocols is useful for targeted full-length genome deep sequencing. This is due to limitations in the sample preparation caused by failure to focus or laborious protocols used to target the sequencing process to the aim of the work. Our previously reported protocol for sequencing highly pathogenic influenza A H5N1 [9] is very sensitive, accurate, and useful for the specialized goal. However, its spectrum is too narrow to be useful in influenza A virus research in general or in situations like the 2009 pandemic of the swine origin influenza A H1N1. Therefore, we set out to improve the method making use of the benefits of our first protocol but adding usability and broadening the spectrum of the method to influenza A viruses regardless of the subtype.
The presented results of the sequencing trials done with 3 isolates clearly show that we reached our goal to broaden the spectrum without losing the specificity that is necessary to target the sequencing. With a maximum of only 3.7% of the reads representing host nucleic acids, we obviously successfully direct sequencing to the influenza A genome. Nevertheless, the scale of our approach is very good. This is highlighted by a success rate of about 99.3% for all PCRs as tested with 17 different isolates that represent all 16 HA and 9 NA types, respectively, and one 2009 pandemic H1N1 virus. Therefore, we conclude that focusing the sequencing must not necessarily diminish comprehensiveness. Unlike the protocol for subtype independent sequencing of influenza A viruses initially provided by 454 life sciences for use with the GS FLX [7], we neither need delicate techniques for enrichment of the target nucleic acid nor for preparation of cDNA carrying the adapters necessary for GS FLX sequencing.
By introduction of the protocol of Wiley and colleagues [11] and omission of the gel purification initially included in the sequencing protocol for highly pathogenic influenza A H5N1 [9], we gain the possibility to fully automate sample preparation. Only quantification of the PCR products and preparation of the DNA pools has to be done manually. This permits to raise the throughput which is possible since the introduction of the MID adapters for the GS FLX. These MID adapters provide a recognizable sequence tag at the beginning of each read. These tags allow to group the sequence reads belonging to different DNA libraries. Taken together, this makes sequencing less laborious and more economic.
Another benefit of the new protocol presented here is the design of the PCRs in order not to compromise diagnostic qPCR assays in use in influenza diagnostic laboratories. With the exception of the two small fragments generated for sequencing the M and NP segments, none of the sequencing amplicons is detectable in the respective qPCR assays (data not shown). Therefore, laboratories can be protected from sequencing amplicons that may interfere with real-time PCR diagnostics. Skipping the two small fragments  Sequencing RNA directly from specimens without enriching for virus nucleic acids results in high proportions of host sequence reads. In two different studies in which new viruses were identified only 0.014% and 0.091% of the reads, respectively, represented virus sequences [3,4]. This underlines the necessity to direct the sequencing process. The aforementioned different approaches for focusing the library preparation and hence the sequencing process to the virus nucleic acids are less restricted than our approach [5,6,7,8] but, nevertheless, strong constraints exist: The protocol of Potgieter and co-workers [6] only works with dsRNA, excluding its use for influenza sequencing. The authors do not provide information about the amount of contaminating host reads. With the random cDNA library preparation protocol currently provided by Roche/454 Life Sciences paired with the enrichment protocol of Simons and Hutchison [7], we reached a host sequence contamination level of 87.5%. Moreover, this protocol relies on pre-existing knowledge as does our method. Monger and colleagues [8] report that after de-enrichment of host nucleic acids by hybridization still 84% of the sequencing reads represented host sequences. Due to all these inconveniences, we chose RT-PCR for the targeting. On the one hand, RT-PCR in general is a very robust and sensitive technique which enables us to sequence whole genomes from samples with a broad range of virus content. On the other hand, the focused sequencing results in high amounts of target sequence data. In turn, this renders sequencing with any depth that may be desired economically possible. The huge amount of data enables quasispecies analyses by sequencing every genome position hundreds or even thousands of times. The possibility to benefit from deep sequencing data for quasispecies analysis has already been shown before with African horse sickness virus and classical swine fever virus [6,12]. With our here presented protocol, this kind of analyses is made possible for all subtypes of influenza A viruses.
Altogether, we were able to markedly improve our previously reported protocol [9]. With the high quality of the final sequence data that are produced with our technique, we conclude that the presented method is suitable for routinely sequencing complete influenza A virus genomes with a high throughput. In addition, due to the great sequence depth that can be achieved even quasispecies analysis at the full genome level is made possible.

Supporting Information
File S1 This file contains the R-code that disambiguates nucleotide sequences and writes a new file containing these disambiguated sequences. The script relies on the seqinr package [16]. (PDF)