The Feasibility of Using High Resolution Genome Sequencing of Influenza A Viruses to Detect Mixed Infections and Quasispecies

Background The rapidly expanding availability of de novo sequencing technologies can greatly facilitate efforts to monitor the relatively high mutation rates of influenza A viruses and the detection of quasispecies. Both the mutation rates and the lineages of influenza A viruses are likely to play an important role in the natural history of these viruses and the emergence of phenotypically and antigenically distinct strains. Methodology and Principal Findings We evaluated quasispecies and mixed infections by de novo sequencing the whole genomes of 10 virus isolates, including eight avian influenza viruses grown in embryonated chicken eggs (six waterfowl isolates - five H3N2 and one H4N6; an H7N3 turkey isolate; and a bald eagle isolate with H1N1/H2N1 mixed infection), and two tissue cultured H3N2 swine influenza viruses. Two waterfowl cloacal swabs were included in the analysis. Full-length sequences of all segments were obtained with 20 to 787-X coverage for the ten viruses and one cloacal swab. The second cloacal swab yielded 15 influenza reads of ∼230 bases, sufficient for bioinformatic inference of mixed infections or quasispecies. Genomic subpopulations or quasispecies of viruses were identified in four egg grown avian influenza isolates and one cell cultured swine virus. A bald eagle isolate and the second cloacal swab showed evidence of mixed infections with two (H1 and H2) and three (H1, H3, and H4) HA subtypes, respectively. Multiple sequence differences were identified between cloacal swab and the virus recovered using embryonated chicken eggs. Conclusions We describe a new approach to comprehensively identify mixed infections and quasispecies in low passage influenza A isolates and cloacal swabs and add to the understanding of the ecology of influenza A virus populations.


Introduction
Influenza A virus is an enveloped RNA virus belonging to the family Orthomyxoviridae with a genome spanning 13.5-kilobases and consisting of eight single stranded RNA segments. The individual RNA segments range in length from 890-2341 nucleotides and encode 11 proteins [1]. There are 144 possible combinations of two surface glycoproteins, hemagglutinnin and neuraminidase, that determine the antigenic properties and subtype classification of the virus.
Influenza A viruses are zoonotic and as a group of viruses, they possess a wide host range including humans, at least 105 bird species, pigs, horses, dogs, cats, ferrets, mink and marine mammals. In the United States alone, more than 200,000 hospitalizations and 36,000 deaths annually are due to complications from seasonal influenza in humans. Globally, it is estimated that influenza causes 300,000 to 500,000 human deaths annually [2]. Multiple cases of human infections with H1N1, H5N1, H7N7, and H9N2 avian influenza viruses (AIV) have been reported since 1997, raising concerns over potential zoonosis of AIV [3].
From April 2009 a pandemic caused by a novel H1N1 virus has been ongoing. As of August 2009, there have been more than 182,000 laboratory confirmed cases of pandemic influenza H1N1, 1799 deaths, in 177 countries and territories have been reported to WHO (http://www.who.int/csr/don/2009_08_21/en/index. html). Therefore, enhanced surveillance of avian and swine influenza A viruses is necessary to provide an understanding of the ecology and evolution.
Influenza viruses have a high error rate during the transcription of their genomes because of the low fidelity of RNA polymerase [4]. The high error rate produce quasispecies, a phenomenon where many different viral genotypes co-circulate in the host, with each virus subtype potentially associated with varying levels of fitness for that host [5]. As defined by Domingo et al, ''viral quasispecies are closely related (but nonidentical) mutants and recombinant viral genomes subjected to continuous genetic variation, competition, and selection'' [6]. This high error rate in replication operates as a double-edged swordimproving the ability of the virus to rapidly adapt to a new host via genetic changes that aid in replication and transmission efficiency while leading to the production of defective subtypes that have reduced fitness for the current host. Some or most of these quasispecies or mixed subtypes may be missed during viral culture because a ''host'' (chicken embryo or cell culture) adaptation pressure [7].
The frequency of infection with multiple subtypes of the virus in wild birds or swine populations that may contribute significantly to the emergence of new viruses with altered host specificities is not known. Complete genome sequencing of influenza A viruses by the current method (RT-PCR followed by classical dye terminator chemistries) is time and resource demanding. For example, in the recent large-scale influenza sequencing project, 95 overlapping one-step RT-PCR were performed per sample to obtain the complete viral genome sequence [1]. Newly developed sequencing-by-synthesis technology has simplified the world of genomics by circumventing the need for individual segment amplification, cloning, and shotgun library preparation [8]. Pyrosequencing approach is useful for the identification of previously undetected and/or uncultured viruses [9] and has been applied in the detection of antiviral resistance markers [10,11,12,13,14,15,16,17,18,19], detection of human virulence signatures in H5N1 [20], diagnosis [21] and sequencing of the full genome of high pathogenic H5N1 [22]. We used this de novo approach to sequence the entire genome of 10 virus isolates (eight avian influenza viruses and two swine influenza viruses) and two primary cloacal swabs. We tested the hypothesis that quasispecies and mixed infection among avian and swine influenza viruses can be identified by the new next-generation pyrosequencing.

Pyrosequencing using Genome Sequencer FLX platform
Twelve samples including 10 virus isolates (eight avian influenza viruses and two swine influenza viruses propagated in embryonated chicken eggs and in MDCK cells with trypsin, respectively) and two cloacal swabs, were processed for pyrosequencing. Complete genomes (.99% Open Reading Frame) were obtained for all eight segments of each virus isolate and a cloacal swab (Table 1). For these 11 samples, the mean influenza sequence reads per small PicoTiterPlate (PTP) region was 7075, with an average read length of 232 bases. A complete ORF region (100% genome length) was obtained for five avian H3 isolates (Table 1). Both swine virus isolates yielded a clean full-length genome. For the other five virus isolates and one cloacal swab [cloacal swab of A/green-winged teal/Minnesota/ Sg-00131/2007(H3N2)], 2-41 nucleotides were missing in some segments at the 39 end but rarely at the 59 end. Overall, ,13400 bases (.99% of the total genome size) were covered for eleven samples with 20-787 X coverage depth. A representative coverage depth map is shown in Figure 1 Comparison of sequences of cloacal swab and virus recovered using embryonated chicken eggs system reveals extensive variability Complete genome sequences were obtained from the cloacal swab of A/green-winged teal/Minnesota/Sg-00131/2007(H3N2) and virus recovered using egg system. A comparison of sequences from each segment revealed 80-91% nucleotide identities (PB2, 90%; PB1, 87%; PA, 90%; HA, 80%; NP, 83%; NA, 82%; M, 91%; and NS, 86%) suggesting extensive variability or existence of quasispecies in the cloacal swab. In the second cloacal swab-virus isolate pair, complete sequences were obtained from the virus isolate, whereas only 15 influenza sequence reads were obtained from the cloacal swab. These 15 reads of ,230 bp included sequences of PB2 (two reads), PB1 (four reads), PA (three reads), HA (three reads; one read each for H1, H3, and H4), NP (one read), and NS (two reads). Four sequences (one each for PB1, PA, H4, and NP) had 100% identity with the virus recovered using embryonated chicken eggs, A/mallard/Minnesota/Sg-00133/2007(H4N6).

Comparison of Sanger sequences with pyrosequencing to resolve polymorphisms
Whole genome sequences using standard dye terminator chemistry were also available for four H3N2 viruses. Comparison of these sequences against the pyrosequencing data revealed eight single nucleotide mismatches ( Evidence of quasispecies in cloacal swab and virus recovered using eggs and cell culture systems A series of genomic subpopulations or quasispecies as identified by single nucleotide polymorphisms (SNP) at specific nucleotide positions was identified in five virus isolates (four egg grown avian influenza viruses and one cell cultured swine influenza virus) and a cloacal swab. All the above samples are H3N2 subtype and all quasispecies populations observed in this study originated from mutations in NP, PB1, PA, M, and NS genes ( Table 3). One example of the evidence of quasispecies in codon 715 of the M gene of A/mallard/South Dakota/Sg-00125/2007(H3N2) is shown in Figure 2A. Figure 2B shows codon 441 of the M gene of the same isolate with computational complexity of a false deletion that needed to be corrected as T/G quasispecies by manual curation. There was good agreement between polymorphic loci identified in chromatograms generated using Sanger sequencing and pyrosequencing methods. When single nucleotide polymorphisms at a particular base were present, mixed peaks were observed in Sanger's chromatogram, whereas variant populations were identified in the assembled pyrosequencing reads ( Figure 2B).

Identification of mixed infections
A bald eagle isolate, A/bald eagle/Virginia/Sg-00154/ 2008(H1N1/H2N1) that was originally typed by sequencing segments of HA and NA as H2N1 showed evidence of mixed subtypes by whole genome analysis. Analysis of the HA sequences from the 454 data revealed that this isolate carried both H1 ( Figure 3A) and H2   ( Figure 3B) subtypes, suggesting that this was a mixed infection. In addition, NA sequences of this isolate revealed two different lineages of N1 and .96% identities with each other ( Figure 3C). As described above, from 15 influenza reads (,230 bases each) that were realized for cloacal swab of A/mallard/Minnesota/Sg-00133/2007(H4N6), there was evidence of mixed infection with H1, H3, and H4 subtypes.

Discussion
Complete genome sequencing of influenza A viruses is essential to determine the genetic basis of pathogenicity, antiviral resistance, and understanding the evolution of viruses in a variety of hosts and environments. Previous studies on sequence-based detection of      antiviral resistance and diagnostics routinely used amplification of short portion of NA or HA genes followed by pyrosequencing. Hoper et al. [22] developed the pyrosequencing protocol for complete genome sequencing of H5N1 avian influenza using locus specific PCR products. In other words, all H5N1 segments were amplified with specific primers prior to sequencing. We reasoned that segment specific amplifications would lose information regarding mixed infections or quasispecies, if present in the sample. We used a preanalytical enrichment of influenza A virus genomes from several sample types including primary samples (cloacal swabs), chicken embryo grown avian and cell cultured swine influenza viruses. Enrichment was followed by de novo sequencing to enable an unbiased realization of all possible sequences in the sample. The protocol for cDNA library generation we describe is independent of locus specific amplification primers and can be used for sequencing any unknown type of influenza A viruses. This approach is consistent with the metagenomics approaches that have helped elucidate microbial (and viral) population structures from complex matrices such as marine water, soil, feces, respiratory secretions, serum and plasma [9,23]. Application of GS De novo Assembler or GS Reference Mapper software for our 454 sequence analysis failed to identify full-length contigs. GS assembler yielded several short contigs and GS Reference Mapper produced a few false insertion/deletions ( Figure 2B). We, therefore, developed an algorithm that combines three software packages (GS De nova Assembler, GS Reference Mapper and Sequencher) to efficiently assemble the genomes and detect quasispecies. The length of genome covered was ,13400 bases (.99% of total genome size) with high confidence at 20 to 787-X coverage. This depth coverage was sufficient for the identification of quasispecies and mixed infections.
Presence of mixed infection and quasispecies in influenza viruses has also been demonstrated by others using RT-PCR of a short segment of HA from cloacal samples [7], serial limiting dilutions of virus isolates followed by RT-PCR [24], an RT-PCR/electrospray ionization mass spectrometry (RT-PCR/ESI-MS) [25], or by serological analysis [7]. In the present study, we used pyrosequencing for the identification of mixed populations of viruses as either a viral quasispecies or co-infections with multiple strains. Finding H1, H3, and H4 in one cloacal sample [cloacal swab of A/mallard/Minnesota/Sg-00133/2007(H4N6)] indicates there was the possibility of a mixed infection in the bird from which the cloacal swab was collected compared to a clean single H4 subtype that was recovered in egg grown virus. This is in agreement with the study of Wang et al. [7,26] who reported up to five HA subtypes in a cloacal swab sample whereas only one HA-NA combination was recovered in isolates using embryonated eggs. If multiple strains of AIV are present in the cloacal swab, one subtype commonly outgrows the others in the aberrant host system (such as embryonated chicken eggs) while the other strains remain undetected [7,24]. In our study, the H4 subtype may have out-competed the other two subtypes in culture. Alternately, the H4 population might be the only live virus in the sample.
A possible rationale for the relatively few influenza reads (15 reads) observed in one of the cloacal samples could be due to insufficient RNA in the original sample or RNA losses during processing for pyrosequencing. In the other cloacal swab of A/green-winged teal/Minnesota/Sg-00131/2007(H3N2), complete sequences were obtained and these sequences had 80-91% nucleic acid identities with the virus recovered using embryonated egg system. This result indicates that there was a mixed population of viruses in this cloacal swab but the H3N2 subtype possibly became the predominant subtype by out-competing other virus subpopulations in the embryonated egg system. More studies with larger numbers of matched-pair samples need to be performed to completely resolve this phenomenon.
Complete genome sequences of A/bald eagle/Virginia/Sg-00154/ 2008(H1N1/H2N1) showed two virus lineages (H1N1 and H2N1). Using RT-PCR based HA and NA typing, this virus was identified as H2N1. In general, unambiguous indexing of mixed subtype infections would require sequential limiting dilution, PCR, cloning, and sequencing of several clones. To our knowledge, this is the first report of full genome sequencing of all eight segments from a mixed infection representing two lineages of the virus.
In our analysis of 12 samples, quasispecies were identified from five samples (four egg grown waterfowl isolates and one cell cultured swine influenza virus). All these viruses were H3N2 and identified quasispecies originated from mutations in NP, PB1, PA, M, and NS genes but not in HA, NA or PB2 genes. The four waterfowl isolates used in our study were recovered at the same study site and on the same day. This result concurs with the study of Dugan et al., [24] in which quasispecies were identified among H4N6 isolates that were recovered at the same study site, from the same species (mallard), and on the same day.
Inasmuch as the mutation rate for type A influenza viruses is estimated at one nucleotide change per 10,000 nucleotide during replication and most infections are caused by as many as 10 to 1000 virions which likely possess varying numbers of nucleotide differences in their genomes, one can expect that each influenza A virion is possibly a quasispecies. However, we identified relatively few quasispecies -probably because the currently available sequence analysis software do not allow robust quasispecies analysis and extensive manual curation is necessary. We believe that with the help of improved bioinformatic tools we would detect more quasispecies populations in our sample sets.
The method described in the current study does not require virus propagation, sequence information and circumvents the need for cloning and library construction prior to sequencing. Thus the currently described method is simple and less time consuming compared to Sanger sequencing. Despite these obvious advantages the cost of equipment is high and requires extensive bioinformatic expertise for assembling and analysis of the contigs.
In conclusion, using an unambiguous genome sequencing approach, we present evidence of quasispecies and mixed infections among influenza A viruses that could help shape our understanding of the ecology and evolution of these viruses. Future studies should be undertaken to -1) strengthen the interpretation of culture and sequence data generated by current influenza A virus surveillance networks; 2) establish novel influenza sequencebased evolutionary analyses; and 3) provide an improved understanding of influenza subtype stability and transmission in a wide array of mammals and birds. . The sample was incubated at room temperature for 30 min. with gentle shaking in the orbital shaker and then placed on a magnetic stand for 3 min. The supernatant was removed by aspiration with a pipette and the coated beads were washed four times with 1X B&W buffer. The captured RNA was eluted from the beads by incubating at 65u C for 5 min with 40-mL of 10 mM EDTA, pH 8.2, in 99% formamide. The tube was placed on the magnetic stand for 3 min. and the supernatant, containing enriched RNA, was aspirated with a pipette.

Fragmentation of enriched RNA and cDNA synthesis
The enriched viral RNA was fragmented into a size range compatible with sequencing on the Genome Sequencer FLX. Five micro liters of 5X RNA Fragmentation Buffer (200 mM Tris-acetate, pH 8.1, 500 mM Potassium acetate, 150 mM Magnesium acetate) was added to 20-mL of enriched viral RNA. The samples were mixed thoroughly by pipeting, incubated for 2 min at 82uC, and then immediately transferred to ice to stop the fragmentation reaction. The reaction volume was increased to 50-mL by adding RNase free water, purified with RNAClean (Agencourt) as per the manufacturer's instructions and eluted with 20-mL of RNase free water.
The fragmented RNA sample was reverse transcribed in 20-mL final volume using random hexamer (59-phosphate-NNNNNNN-39) and Superscript First-Strand Synthesis System for RT-PCR (Invitrogen) as per the manufacturer's instructions. Each reaction consisted of 7-mL of fragmented RNA and 2-mL of 500-mM primers. After reverse transcription, the RNA was removed by hydrolysis by adding 20-ml of Denaturation Solution (0.5 M NaOH, 0.25 M EDTA) and incubating at 65uC for 20 minutes. The mixture was neutralized by adding 20-ml of 0.5 M HCI in 0.5 M Tris-HCl, pH 8.0. The resultant sscDNA was recovered with RNAClean (Agencourt) as per the manufacturer's instructions and eluted from the beads with 20-mL of RNase free water.

Adapter Ligation
For clonal amplification and sequencing on the Genome Sequencer FLX, the sscDNA required the addition of adaptors to each terminus. The adaptors have been designed to enforce directional ligation to the sscDNA, such that one will be uniquely ligated to the 59-end (sscDNA Adaptor A) and the other to the 39-end (sscDNA Adaptor B) of the sscDNA. Each adaptor is comprised of two complimentary oligonucleotides that are annealed together as described. The 39-end adaptor consists of ''sscDNA Oligo B'' (59-biotin-GCCTTGCCAGCCCGCTCAGNNNNNN-phosphate-39) and ''sscDNA Oligo B-prime'' (59-phosphate-CTGAGC-GGGCTGGCAAGG-dideoxyC-39) which, after annealing, results in ''sscDNA Adaptor B'' with a 39-random overhang of six nucleotides. Similarly, the 59-end adaptor consists of ''sscDNA Oligo A-prime'' (59-NNN NNN CTG ATG GCG CGA GGG AGG dideoxyC-30) and ''sscDNA Oligo A'' (59-GCCTCCCTCG-CGCCATCAG-39) which form ''sscDNA Adaptor A'' with a six nucleotide 59-end overhang. The adapter ligation reaction was carried out using T4 DNA ligase (New England Biolabs) in a total volume of 30-mL, containing 3 mL of 10X ligase buffer, 1-mL of (1.67 mM final conc.) adapter A, 1-mL of (6.67 mM final conc.) adapter B, 5-mL (2000 cohesive end units) of T4 DNA ligase, 15-mL of sscDNA and 5-mL of water. The reaction mixture was incubated at RT for 2 hrs; the ligated sscDNA was recovered with Dynabeads MyOne Streptavidin C1 (20-mL beads per sample) and eluted by incubating at 65uC for 5 min with 40-mL of 10 mM EDTA, pH 8.2, in 99% formamide. The final sscDNA was purified with two rounds of RNAClean (Agencourt) and eluted in 20-mL of nuclease free water.
The final adapted sscDNA was amplified using Advantage 2 PCR Kit (Clontech) in a total volume of 50-mL containing 5-mL of 10X Advantage 2 buffer, 2-mL of 50X dNTP mix (10 mM each), 10-mL (10 mM) Primer A (59-GCC TCC CTCGCG CCA-39), 10-mL (10 mM) Primer B (59-GCC TTG CCA GCC CGC-39), 1-mL of 50X Advantage polymerase mix, 10-mL of sscDNA, and 12-mL of nuclease free water. The PCR conditions used were: 96u C for 4 min; 30 cycles of 94u C for 30 s and 64u C for 30 s; 68u C for 3 min; hold at 14u C. The PCR product was purified with two rounds of AMPure (Agencourt) as per the manufacturer's instructions. The double stranded DNA library was eluted with 20-mL of water and quantified with the Quant-iT Picogreen dsDNA Assay Kit (Invitrogen). Emulsion PCR amplification was carried out using either primer A or primer B or both for bidirectional sequencing. The sequencing reactions were carried out in small regions of the PicoTiterPlate (1-4 regions/sample) on the Genome Sequencer FLX (GS FLX) platform.

Sequencing data analysis
Data analyses were performed on the Linux servers or Windows work station at the Minnesota Supercomputing Institute. All the sequencing reads were blasted against influenza genome in NCBI blast version.2.2.16. The 'non influenza' sequences were filtered out and only influenza reads were assembled in GS De nova Assembler Version 2.0.00.20 and mapped in GS Reference Mapper Version 2.0.00.20. The influenza contigs obtained using the above software were reassembled in Sequencher Version 4.8 (Genecodes).

Quasispecies identification
All the influenza reads were run in GS De novo Assembler with three sets of parameters: minimum overlap (MOL) of 40 nucleotides and 90% identity, MOL of 100 and 100% identity, and MOL of 200 and 100% identity. The larger contigs (.500 bases) obtained by the above method were BLAST analyzed using NCBI resources and the most closely related sequences, referred to as reference sequences, for each segment were downloaded. All the influenza reads were mapped with reference sequences in GS Reference Mapper. The contigs obtained from GS Assembler and the consensus sequences obtained from GS Mapper were reassembled in Sequencher 4.8. The new contigs were then examined for ambiguous bases (e.g. R, Y, K etc.) and particular base positions were manually examined for the presence of more than one kind of base (quasispecies) in GS Reference Mapper.