Characterization of Quasispecies of Pandemic 2009 Influenza A Virus (A/H1N1/2009) by De Novo Sequencing Using a Next-Generation DNA Sequencer

Pandemic 2009 influenza A virus (A/H1N1/2009) has emerged globally. In this study, we performed a comprehensive detection of potential pathogens by de novo sequencing using a next-generation DNA sequencer on total RNAs extracted from an autopsy lung of a patient who died of viral pneumonia with A/H1N1/2009. Among a total of 9.4×106 40-mer short reads, more than 98% appeared to be human, while 0.85% were identified as A/H1N1/2009 (A/Nagano/RC1-L/2009(H1N1)). Suspected bacterial reads such as Streptococcus pneumoniae and other oral bacteria flora were very low at 0.005%, and a significant bacterial infection was not histologically observed. De novo assembly and read mapping analysis of A/Nagano/RC1-L/2009(H1N1) showed more than ×200 coverage on average, and revealed nucleotide heterogeneity on hemagglutinin as quasispecies, specifically at two amino acids (Gly172Glu and Gly239Asn of HA) located on the Sa and Ca2 antigenic sites, respectively. Gly239 and Asn239 on antigenic site Ca2 appeared to be minor amino acids compared with the highly distributed Asp239 in H1N1 HAs. This study demonstrated that de novo sequencing can comprehensively detect pathogens, and such in-depth investigation facilitates the identification of influenza A viral heterogeneity. To better characterize the A/H1N1/2009 virus, unbiased comprehensive techniques will be indispensable for the primary investigations of emerging infectious diseases.


Introduction
In April 2009, an H1N1 triple-reassortant swine influenza virus (A/H1N1/2009) was detected in humans with febrile respiratory illness in North America [1], and the virus has rapidly spread worldwide by human-to-human transmission. According to the disease outbreak news from the World Health Organization, at least 14,711 people died from A/H1N1/2009 between April 2009 and January 2010 (http://www.who.int/csr/don/en/). Fatal cases from A/H1N1/2009 viral infection were summarized in a report by Gill et al. [2].
The genome of influenza A virus (family Orthomyxoviridae) consists of 8 single-stranded negative sense RNA molecules spanning approximately 13.5 kb. The segments range in length from 890 to 2341 nucleotides (nt) and encode a total of 11 proteins [3]. Genetic diversity in influenza virus results from a high mutation rate associated with replication using a low-fidelity RNA polymerase and the reshuffling (reassortment) of segments among coinfecting strains. Multiple-reassortant influenza viruses from avian, human, and swine origins emerged as major pandemic influenza viruses (i.e., 1918 H1N1, 1957 H2N2, and 1968 H3N2) causing significant mortality in humans in the 20 th century [4]. Such an ''antigenic shift'' by multiple reassortant drives the emergence of pandemic influenza viruses, with their severity and clinical outcome always unpredictable [5].
Influenza A virus can evade antibodies specific to its attachment protein, hemagglutinin (HA), by the accumulation of amino acid substitutions in HA epitopes [6]. This ''antigenic drift'' in HA epitopes [7] affects recognition by antibodies that neutralize viral infectivity by blocking the interaction of HA with sialic acid residues on host-cell membranes. The H1 subtype HA has four antigenic sites recognized by monoclonal antibodies with high neutralizing activity, designated Sa, Sb, Ca, and Cb [8]. In addition, 8 continuous B cell/antibody epitopes for human H1N1 HA proteins have been experimentally defined by the Immune Epitope Database and Analysis Resource (IEDB: http://www. immuneepitope.org/) [9]. Immune epitope analysis of HA epitopes in A/H1N1/2009 is also summarized in the Influenza Research Database (http://www.fludb.org/brc/homeExtraPage. do?decorator=influenza&extraPage=separate) [10].
To better predict a future pandemic of influenza A virus, the characterization of possible antigenic drift will be indispensable. Igarashi et al. and Shen et al. reported that a structural comparison of HAs could predict probable future antigenic changes during the evolution of A/H1N1/2009 in the human population [11,12].
In addition to this prediction, extensive investigations on viral quasispecies will be required to reveal the actual appearance of those antigenic changes. Nakamura et al. demonstrated the direct detection of potential pathogens, including influenza virus, using de novo pyrosequencing [13], but the detection appeared to have insufficient redundant sequencing reads to reveal the genetic variation of the viruses. Ramakrishnan et al. demonstrated the discrimination of quasispecies in mixed HA subtype infections of influenza A virus using the same pyrosequencing approach [14]. However, it was shown that the influenza viral RNA sample should be enriched through sequence-specific oligonucleotide capturing prior to pyrosequencing, indicating that such enrichment might represent a biased result.
Here, we performed de novo sequencing using total RNAs extracted from an autopsy lung of a patient infected with A/ H1N1/2009, and detected potential pathogens such as bacteria in addition to A/H1N1/2009. Extensive DNA sequencing using the Illumina Genome Analyzer II (GA II) revealed that quasispecies for the HA sequence were generated in single patient. Such heterogeneity demonstrated the antigenic drift of HA, implying the existence of a mechanism to escape pre-existing human immunity to the virus.

Summary of sequencing reads and detection of potential pathogens
To determine the influenza A virus sequence and other potential pathogens, we performed de novo sequencing of doublestranded cDNA from total RNA extracted from the autopsy lung of a patient infected with the A/H1N1/2009 virus (A/Nagano/ RC1/2009(H1N1)) in August 2009 in Japan. The patient was found to be positive for A/H1N1/2009 by real-time reverse transcriptase-polymerase chain reaction (RT-PCR); histopathological findings were also reported [15]. GA II produced 9.4610 6 40-mer reads from the cDNA library (Fig. 1B). To exclude the human-derived read sequences, we performed an analysis pipeline as follows (Fig. 1A). Initially, all 9,475,890 reads were aligned to a reference sequence of human genomic DNA, followed by quality trimming to remove low-quality reads and excluding reads with similarities to ambiguous human sequences, resulting in 9,309,538 reads (98.24%) with a possible human source (Fig. 1B). The remaining 166,352 reads (1.75%) were further analyzed using a BLAST search against non-redundant databases, revealing 80,827 (0.85%), 469 (0.005%), and 85,056 (0.90%) reads as influenza A virus, bacteria, and no hits, respectively (Fig. 1B).
Regarding the bacterial hits, species classification was determined based on the results of a BLASTN search against the nt database (Fig. 1C). The most abundant bacterium was Propionibacterium acnes, but our other sequencing trials for clinical specimens suggest that this species is always detected (data not shown). Therefore, the presence of P. acnes could be the result of contamination at some point from the autopsy to the preparation of the cDNA library. In addition to P. acnes, Escherichia coli and Acinetobacter baumannii were frequently detected as possible contaminants. Suspected bacterial pathogens were identified as Streptococcus pneumoniae and Porphyromonas gingivalis. Specific PCR using 16S-rDNA and the lytA gene was performed for further verification of the presence of S. pneumoniae (data not shown).
Although S. pneumoniae was not sufficiently abundant to conclude a coinfection with A/H1N1/2009, the severity of the A/H1N1/ 2009 infection could be correlated with S. pneumoniae, as reported by Palacios et al [16]. The other detected bacteria, such as Streptococcus sp., generally constitute the normal human oral flora.
de novo assembly of the A/H1N1/2009 virus Whole 40-mer short reads, including human-derived reads, were assembled using Euler-SR or the Velvet de novo assembler. The resultant contigs generated using Euler-SR had longer extended sequences than those generated using Velvet (data not shown); thus, all further analyses were performed using the contigs generated using Euler-SR (Texts S1 and S2). All contigs showed high similarity to the sequences of A/H1N1/2009 (Table 1). Among the 8 segments, almost the whole lengths of segments 2 (2321 nt), 3 (2231 nt), 4 (1765 nt), 5 (1562 nt), 7 (1026 nt), and 8 (892 nt) were correctly assembled as single contigs of 2204, 2198, 1761, 1514, 1019, and 834 nt, respectively (Table 1 and Fig. 2). Segments 1 and 6 were divided into several contigs, but were correctly aligned to the known sequences (Table 1 and Fig. 2).

Read mapping analysis of the A/H1N1/2009 virus
To obtain whole sequences and identify single nucleotide polymorphisms (SNPs) for the 8 segments, the 40-mer short reads were aligned to the sequence of A/Tronto/T0106/2009(H1N1), which was found to be the most similar to the A/H1N1/2009 virus using a BLASTN search. Figure 2 shows dot plot images of the coverage at every nucleotide of the segments. Read coverage was observed at ,6200 on average for all segments, indicating a sufficient redundancy to identify the viral sequences and SNPs. The obtained viral sequences, designated as A/Nagano/RC1-L/ 2009(H1N1), were consistent with those from A/Nagano/RC1/ 2009(H1N1) passaged in the Madin-Darby canine kidney (MDCK) cell line, except for 3 possible heterogeneous nucleotides in HA.
The coverage plot curves were not flat throughout the segments. Intriguingly, both ends of segment 1 (encoding PB2), the 39-end of segment 3 (encoding PA), and approximately 700 nt of segment 8 (encoding NS) showed significant abundant coverage greater than 61000.

Genetic population analysis of the A/H1N1/2009 virus
To identify heterogeneous populations, alignment results were screened using MapView software (Fig. 3B). Three potential heterogeneous genetic populations were found in segment 4 (encoding HA) at the 515, 715, and 716 nt positions (Fig. 3A), but not in other segments. The read alignments shown in Fig. 3B indicate that either the GGT or AAT sequence appeared at the 715-717 nt position, but not the GAT or AGT sequence. In addition, the read coverage implied that the major (GG; HA-Major) or minor (AA; HA-Minor) nucleotides were detected at the frequencies of 75% and 25%, respectively. To validate these variations, HA-Major-or HA-Minor-specific quantitative RT-PCR (qRT-PCR) was performed for the preparation of specific PCR products between the 434 and 738 nt in the HA coding sequence (Fig. 3C). qRT-PCR demonstrated that the expression of HA was ,40,000-fold greater than that of human b-actin, and the expression ratio of HA-Major/HA-Minor was 4.05, suggesting that it corresponds to the read mapping shown in Fig. 3A. Furthermore, HA-Major and HA-Minor sequences were verified by Sanger DNA sequencing of the specific PCR products (Fig. 3D). Taken together, these results suggest the following amino acid substitutions of HA: one nucleotide alteration causes Gly 172 Glu and the other two alterations cause Gly 239 Asn (Fig. 3D).

Epitope analysis of heterogeneous HA
To elucidate whether the Gly 172 Glu and Gly 239 Asn amino acid substitutions in the HA sequence could be associated with antigenic drift, they were compared to known potential epitopes [8,9]. Representative HA amino acid sequences of the H1N1 influenza A virus were aligned with the heterogeneous HA-Major and HA-Minor sequences. The Gly 172 Glu substitution (corresponding to Gly158 in the mature HA lacking a signal peptide) was located on the Sa antigenic site (Fig. 4A).
The HA Gly 172 Glu substitution is likely to be rare thus far because a BLASTP search against the non-redundant nr database revealed only two identical hits, A/Bayern/62/2009(H1N1) in Germany and A/Catalonia/S1207/2009(H1N1) in Spain (data not shown). One intriguing hit was to A/Pennsylvania/14/ 2009(H1N1) isolated in the US, whose HA sequence has an Xaa amino acid at position 172 due to the presence of the heterogeneous nucleotide R (A or G) (Fig. 5A). This deposited sequence completely coincides with our observation, suggesting that two variants of HAs are likely to coexist in the human lung, further implying that such heterogeneous populations might frequently be generated during infection.
Furthermore, HA Gly 239 Asn was located on the Ca2 antigenic site that contributes to binding with the host's sialic acid receptor [17]. Asp239 (corresponding to Asp225 in the mature HA lacking a signal peptide) was frequently distributed in H1N1 HAs (Fig. 4B), but Gly239 and Asn239 were found to be minor amino acids among HAs; a BLASTN search found 18 and 5 hit entries on the nt database, respectively. As was observed for Gly172, Xaa239 was found in the nt database (Fig. 5B), suggesting that HA heterogeneity of both minor amino acids may affect its binding affinity to the sialic acid receptor.

Discussion
In this study, we demonstrated the detection of potential pathogens using a next-generation DNA sequencer. We speculated that, in addition to influenza A virus, additional potential pathogens such as S. pneumoniae could contribute to the severity and fatality of infection with the A/H1N1/2009 virus [2,16]. In this case, the amount of bacteria detected was small ( Fig. 1B and  1C), and they were considered to be the result of contamination during the course of the experiment, from autopsy to short read sequencing. The clinical outcome of the patient and histopathological examination suggest that this was a case of influenza viral pneumonia rather than bacterial infection [15], although S. pneumoniae coinfection has been reported to play a crucial role in the severity of influenza virus infection in some cases [16,18].
In the present autopsy lung sample, very significant viral copies (,40,000-fold greater than b-actin) were determined using qRT-PCR, but this was not always observed in autopsy samples from other patients (less than b-actin) (data not shown). Such abundant read sequencing enabled us to obtain almost full-coverage contig sequences for the viral segments using de novo assembly, suggesting the importance of this result in terms of being able to evaluate uncharacterized emerging infectious agents using an unbiased sequencing technique at the outbreak of a pandemic. Indeed, this study demonstrated that whole contigs can be identified as A/ H1N1/2009, but not seasonal H1N1 or other subtypes ( Table 1).
The read coverage profile generated by mapping was very indicative for segment 1 encoding PB2 (Fig. 2). Both ends were highly redundant with up to 63000 coverage. The coverage is reflected by the amounts of vRNA, cRNA, and mRNA of influenza A virus, implying that the coverage bias may detect a more stable region as it is dependent on the expression level or manner of replication.
Contrary to the viral sequences obtained for A/Nagano/RC1/ 2009(H1N1) isolated from passaging in MDCK cells, de novo sequencing revealed the presence of A/Nagano/RC1-L/ 2009(H1N1) HA quasispecies in the autopsy sample (Fig. 3). Despite the fact that immunity to A/H1N1/2009 viruses is supposed to be limited among the general human population [19], we detected the amino acid substitution Gly 172 Glu in the HA Sa antigenic site in A/Nagano/RC1-L/2009(H1N1), and this appears to be a very rare event among A/H1N1/2009 viruses to date.
We also observed another substitution of Gly 239 Asn in the HA Ca2 antigenic site of A/Nagano/RC1-L/2009(H1N1). This antigenic site plays a crucial role in conferring specificity to galactose of the human a2-6 sialylated glycan receptor [20]. Interestingly, Asp239 (corresponding to Asp225 in the mature HA that lacks a signal peptide) is highly prevalent in known H1N1 HAs, indicating that both Gly239 and Asn239 appear to be very minor amino acids among all HA sequences.
Thus far, amino acid substitutions in the HAs of A/H1N1/2009 have been identified compared with seasonal H1N1 HAs. Homology-based structural investigations [17,21] suggest that A/ H1N1/2009 HA has the necessary residues to provide optimal contacts for high affinity binding to a2-6 sialylated glycans present in the human upper airway [22], while it apparently shows minimal contact with a2-3 sialylated glycans present in the human lower respiratory tract [23]. Indeed, the recombinant A/H1N1/ 2009 HA has been characterized to exhibit lower binding to the alveolar tissue of the lower respiratory tract [17]. However, we previously detected abundant viral nucleoprotein of A/Nagano/ . To obtain the consensus sequences for the respective 8 segments, 40-mer short reads were aligned to the complete segment sequences of A/Tronto/T0106/ 2009(H1N1) (gb|CY045951.1 -.8). Short read sequencing was performed using total RNA including human RNA, and also vRNA, cRNA, and mRNA from influenza A virus; thus, coverage bias was detected throughout the segments, but the average coverage (AC) is likely to be similar at approximately 6200 or more. The horizontal red arrows show the location of the contigs obtained by de novo assembly, as shown in Table 1. doi:10.1371/journal.pone.0010256.g002 RC1-L/2009(H1N1) in pneumocytes expressing a2-3 sialylated glycans in autopsy lung tissue sections [15], suggesting that the above substitutions could affect the binding affinity to both types of sialylated glycans. Very suggestive reports predicted the possible future antigenic drift of A/H1N1/2009 viruses from viral sequence and structural comparative analyses [11,12]. Prior to the initiation of the current study (September 2009), Igarashi et al. predicted possible substitutions and these included the two amino acid substitutions presented here (Gly 172 Glu and Asp 239 Gly) [11]. Furthermore, Shen et al. suggested that host-driven antigenic drift based on evolutionary trends appeared to favor Asp239 (corresponding to Asp225 in the mature HA) in swine HAs and the 1918 pandemic, while Asp204 (corresponding to Asp190 in the mature HA) was favored in seasonal H1N1 HAs [12]. These predictions are very attractive and our experiments demonstrated one of them a posteriori. Furthermore, recent study has shown that receptorbinding avidity can influence antigenic drift [24]. HA antigenic sites Sa is the membrane proximal region, therefore, the identified variations on both Sa and Ca2 might contribute to the alteration of antigenicity and receptor-binding avidity by synergistic effect. The newly identified Asn239 substitution could be a probable candidate for further investigation of the manner of viral binding to sialic acid on the host receptors.
In conclusion, this study demonstrated that de novo sequencing can comprehensively detect pathogens, and such in-depth

Ethics Statement
The study protocol was approved by the institutional medical ethical committee, National Institute of Infectious Diseases, Japan (Approval No.236), and the study was conducted according to the Declaration of Helsinki Principles. In the autopsy case, a written consent for autopsy was obtained from relatives.
Total RNA and cDNA preparation from autopsy human lung Information for the patient was previously reported [15]. Briefly, in August 2009, a 33-year-old male patient with chronic heart failure due to dilated cardiomyopathy, mild diabetes mellitus, atopic dermatitis, asthma, and obesity (BMI: 38) died from respiratory failure and multiple organ dysfunction syndrome.
A diagnosis of pandemic influenza A virus (A/H1N1/2009) infection was determined using RT-PCR testing in a clinical laboratory. Total RNA was prepared from a 5-mm cube of the autopsy lung tissue using ISOGEN (NipponGene, Japan), followed by Ambion TurboDNase treatment (Ambion, Austin, TX USA). Double-stranded cDNA was prepared from 5 mg of total RNA using the random priming method with SuperScript Choice System for cDNA synthesis (Invitrogen, Carlsbad, CA, USA). cDNA was purified using a QIAquick PCR Purification kit (QIAGEN, Hilden, Germany).
Short-read DNA sequencing using the Illumina Genome Analyzer II An approximately 300-bp length cDNA library was prepared using a genomic DNA sample prep kit (Illumina, San Diego, CA, USA), and DNA clusters were generated on a slide using a Cluster Generation kit (ver. 2) on an Illumina cluster station (Illumina), according to the manufacturer's instructions. To obtain ,1.0610 7 clusters for one lane, the general procedure as described in the standard recipe (Illumina) was performed as follows: template hybridization, isothermal amplification, linearization, blocking, denaturation, and hybridization of the sequencing primer (Illumina). All sequencing runs for 40 mers were performed with GA II using the Illumina Sequencing kit (ver. 3). Fluorescent images were analyzed using the Illumina base-calling pipeline 1.4.0 to obtain FASTQ formatted sequence data.

Homology search analysis
The obtained DNA sequences were investigated using a BLAST search as shown in Fig. 1A. The results of the BLASTN search were analyzed and visualized using MEGAN v3.7.4 [25] with the following parameters: minimum support, 5; minimum score, 35.0.
de novo assembly of short reads Prior to de novo assembly, all obtained 40-mer reads were trimmed based on the phred quality value obtained using the Euler-SR qualitytrimmer command [26]. Such trimmed read sequences were assembled using Velvet v0.7.55 [27] or Euler-SR v1.0 [26] with the default parameters (Velvet: hash length, 25; Euler-SR: vertex size, 25).

Read mapping
To obtain consensus sequences for the respective 8 segments of influenza A virus, 40-mer short reads were aligned to A/Tronto/ T0106/2009(H1N1) sequences (gb|CY045951.1 -.8) as reference sequences with Maq software (ver. 0.7.1) [28] using the easyrun Perl-command. The consensus sequences were extracted as a ''cns.fq'' file for each segment, and deposited in the DNA Data Bank of Japan (DDBJ; accession numbers: AB538386 to AB538393 for the 8 segments of A/Nagano/RC1-L/ 2009(H1N1), and AB538394 for segment 4 encoding the HA-Minor sequence). Read coverage at every nucleotide was obtained using Maq software (ver. 0.7.1) with the pileup command. Read alignment for the validation of SNPs was performed using the MapView graphical alignment viewer [29].

Virus isolation
The A/H1N1/2009 virus was isolated from MDCK cells passaged once with trypsin.

Supporting Information
Text S1 Fastq file of the 40-mer short reads with similarity to influenza A virus extracted from whole obtained short reads.