Universal Oligonucleotide Microarray for Sub-Typing of Influenza A Virus

A universal microchip was developed for genotyping Influenza A viruses. It contains two sets of oligonucleotide probes allowing viruses to be classified by the subtypes of hemagglutinin (H1–H13, H15, H16) and neuraminidase (N1–N9). Additional sets of probes are used to detect H1N1 swine influenza viruses. Selection of probes was done in two steps. Initially, amino acid sequences specific to each subtype were identified, and then the most specific and representative oligonucleotide probes were selected. Overall, between 19 and 24 probes were used to identify each subtype of hemagglutinin (HA) and neuraminidase (NA). Genotyping included preparation of fluorescently labeled PCR amplicons of influenza virus cDNA and their hybridization to microarrays of specific oligonucleotide probes. For 36 of the 40 samples tested, the HA and NA subtypes of Influenza A virus were identified unambiguously.


Introduction
Influenza A virus circulating among humans and domestic and wild animals presents a significant pandemic threat due to the rapid evolution of strains with new antigenic properties [1]. Examples include the recent emergence of H1N1 swine influenza virus that proved pathogenic for humans, as well as H5N1 bird influenza virus that has the capacity to become pathogenic for domestic birds and humans.
Since protection against influenza is determined primarily by HA and NA antigens, it is important to classify new isolates by subtypes of these proteins. Presently, 16 sub-types of HA and 9 sub-types of NA are known [2,3]. The traditional scheme of subtyping influenza viruses involves virus isolation, analysis using standard immunochemical techniques or real-time PCR amplification, and partial or complete nucleotide sequencing of HA and NA genes [4,5,6,7].
Recently, microchip hybridization has become a popular diagnostic tool [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. Two types of microchips are used for this purpose: sequencing and hybridization microchips. In the first case, microchip and microfluidics technology is used to determine the nucleotide sequences of HA and NA needed for the classification of viruses [14,19,20]. More commonly, hybridization microchips use sets of immobilized oligonucleotide probes that form complexes with virus-specific cDNA tagged with, for instance, a fluorescent label [9,13,17,21,22]. These studies used specialized microchips to determine a limited number of influenza virus types [9,13], as well as microchips for sub-typing the majority of HA and NA variants [20]. Some protocols use PCR amplification with primers specific to select sub-types of HA and NA [22]. These protocols increase specificity at the cost of complicating the overall procedure. The efficiency of any microchip hybridization protocol is determined by the procedure used to identify specific probes that distinguish between genotypes. The task of probe selection is relatively straightforward when sequences differ significantly, as is the case for the genotyping of higher taxa. For instance, it is very easy to discriminate herpes and pox-viruses [23].
The task becomes more complicated if genomic structures are similar. Probe selection becomes even more complicated when the number of known sequences is very large, reaching several hundred or even thousands. In this case, the strategy of probe selection becomes critical.
There are several published approaches and algorithms [24,25,26,27,28,29,30,31] for the analysis of DNA sequences and selection of specific probes. Directly searching for probes based on traditional computational methods is labor-intensive and often requires much time. Probe selection could be simplified by first analyzing protein sequences, which are more conserved. This approach was discussed in several communications [26,31] but was not fully implemented because of the differences in amino acid and nucleotide sequences of the microorganisms. This approach could have a better chance of success for analyzing closely related DNA sequences such as cDNA of Influenza A and B viruses.
Considering the short length of HA and NA genes (1700 and 1400 nucleotides, respectively), finding a set of probes specific to each sub-type of HA and NA is not simple. First, there is a substantial similarity between sub-types of these proteins. Second, the cDNA structure of different sub-types of HA depends on the sub-type of NA present in the virus, and vice versa. This complicates the selection of probes specific to the entire sub-type regardless of the matching NA sub-type (e.g., H1N1, H1N2, … H1N9). In this study, we have explored the selection of probes based on amino acid sequences to create a universal microchip that could determine all sub-types of HA and NA of Influenza A virus.

Results and Discussion
The selection of microarray probes was performed in two steps using custom software developed in our lab. During the first phase, amino acid sequences typical of the sub-type of interest, but absent from other sub-types, were identified. During the second phase, oligonucleotide probes were derived from these conserved peptides and their properties optimized. The rationale for this approach is that microorganism properties are determined mostly by amino acid sequences, rather than nucleotide sequences. In addition, amino acid sequences are more conserved, which simplifies the search for similar sequences.
The process that will be illustrated using the example of the NA gene included the following steps (Fig. 1):
- Selection of nucleic sequences from GenBank and their translation into polypeptides
- Selection of peptides common for the NA sub-type of interest
- Removal of peptides present in other sub-types of NA
- Reverse translation of these peptides into nucleic acid probes
- Selection of the most representative probes
- Removal of oligonucleotide probes that can bind other NA sub-types after one or two mutations
- Selection of probes forming duplexes with target DNA with optimal melting temperatures

Selection of conserved peptide sequences
Genomic sequence databases contain a number of incomplete sequences that complicate selection of universal probes. Therefore only protein sequences longer than 200 amino acids were analyzed. Conserved stretches of seven amino acids were identified among peptides from each sub-type of NA, regardless of a matching HA sub-type. This peptide length was chosen because using a longer stretch would significantly reduce the number of oligonucleotide probes typical for each sub-type, and shorter peptides would produce less specific probes. Because it was not possible to find peptides common to all proteins of the same subtype (due to sequence variability and because some genes were not completely sequenced), the most common peptides that were present in more than 90%-95% of the proteins were selected.
Next, peptides unique for the serotype of NA were selected by removing peptides also found in other sub-types. Therefore, nine sets of peptides that were 7 amino acids in length identifying each sub-type of NA were selected.
The example below presents the amino acid sequence of the NA of A/California/UR06-0552/2007(H1N1) isolate. Peptides shown in bold italics are typical for 90% of N1 subtype neuraminidase, while absent from all other sub-types (N2-N9). There was a total of 89 such 7-mer peptides. MNPNQKIITIGSISIAIGIISLMLQIGNIISIWASHSIQTGS
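The two peptide-filtering steps described in this section can be sketched in Python as follows. This is only a minimal illustration and not the authors' software; the function names, the default threshold of 0.9, and the assumption that a peptide is rejected if it occurs in any protein of another sub-type are choices made for the example.

```python
from collections import Counter


def kmers(protein, k=7):
    """Return the set of distinct k-mer peptides in one protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def conserved_peptides(proteins, k=7, min_fraction=0.9, min_length=200):
    """Peptides found in at least min_fraction of proteins longer than min_length."""
    usable = [p for p in proteins if len(p) > min_length]
    if not usable:
        return set()
    counts = Counter()
    for p in usable:
        counts.update(kmers(p, k))  # a peptide is counted once per protein
    return {pep for pep, n in counts.items() if n / len(usable) >= min_fraction}


def subtype_specific_peptides(by_subtype, target, **kwargs):
    """Conserved peptides of `target` that occur in no protein of another sub-type.

    `by_subtype` is a hypothetical mapping such as {"N1": [...], "N2": [...]}.
    """
    k = kwargs.get("k", 7)
    candidates = conserved_peptides(by_subtype[target], **kwargs)
    for subtype, proteins in by_subtype.items():
        if subtype == target:
            continue
        for p in proteins:
            candidates -= kmers(p, k)  # drop anything shared with another sub-type
    return candidates
```

With real data, `by_subtype` would hold the translated GenBank sequences grouped by NA sub-type, and the call `subtype_specific_peptides(by_subtype, "N1")` would return the N1-specific 7-mers.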

Selection of oligonucleotide probes
The conserved 7-mer peptides were reverse translated into the respective sets of 21-mer oligonucleotides. If peptides partially overlapped with a shift of one or two amino acids, then the corresponding oligonucleotide probes would produce largely redundant information about the sub-type of NA. In these cases only one such set of probes was selected. Let us consider peptide segment LAGNSSLC, represented by two 7-mer peptides: LAGNSSL and AGNSSLC. For these peptides, sets of 51 and 43 oligonucleotide probes were generated (Table 1). Selection of one of these two sets was made based on the following criteria:
1. Out of several sets of oligonucleotide probes sorted by the number of times each one was found in the set of analyzed sequences, the set in which the first probe was most represented was selected.
2. If the representation of the first probe was equal, then selection was based on the first two probes.
3. Again, if the representation was equal, the set with the minimal number of probes was selected.
If all parameters in the above steps were equal, then the set was selected randomly; frequently, the first set was chosen. In the above example (Table 1), peptide LAGNSSL, coded by 51 oligonucleotide sequences, met these criteria. Out of these oligonucleotide sequences, probe number 1 (TTAGCGGGCAATTCATCTCTT) was present in 1043 sequences of NA genes, while probe number 2 (TTGGCCGGCAATTCATCTCTT) was present in 708 out of the total 2696 analyzed NA sequences of this sub-type. For other probes, these numbers were significantly lower. If the continuous conserved peptide was long, as for REPFISCSHLECRTFFLTQGALLNDKHSNGTVKDRSP (see amino acid sequence of A/California/UR06-0552/2007[H1N1] isolate), several overlapping peptides shifted 3-7 amino acids relative to each other were selected to maximize the use of independent sequence information.
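The three tie-breaking criteria for choosing between overlapping peptides can be expressed as a single sort key. The sketch below is an interpretation rather than the authors' code: it assumes that "selection based on the first two probes" means comparing the combined count of the two most represented probes, and the names `ranking_key` and `choose_probe_set` and the counts in the usage example are hypothetical.

```python
def ranking_key(probe_counts):
    """Build a comparison key for one non-empty probe set.

    `probe_counts` maps each probe sequence to the number of database
    sequences in which it was found.
    """
    counts = sorted(probe_counts.values(), reverse=True)
    first = counts[0]              # criterion 1: most represented probe
    first_two = sum(counts[:2])    # criterion 2 (assumed): top two combined
    return (first, first_two, -len(counts))  # criterion 3: fewer probes wins


def choose_probe_set(candidate_sets):
    """Return the name of the preferred set; on a full tie the first one wins."""
    return max(candidate_sets, key=lambda name: ranking_key(candidate_sets[name]))


# Toy counts for illustration only; they are not taken from Table 1.
example = {
    "set_a": {"probe_1": 10, "probe_2": 7, "probe_3": 1},
    "set_b": {"probe_4": 10, "probe_5": 5},
}
print(choose_probe_set(example))
```

Here both sets tie on the first criterion, so the second criterion decides in favour of `set_a`.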
Principles used for the selection of probes for other sub-types of neuraminidase (N2-N9) were similar to the above example. Evidently, the number of oligonucleotide probes for the entire set of peptides can be very high, reaching several thousands. Many represent only one viral cDNA sequence (Table 1). Therefore, the next step was selection of the sub-set of an optimal size (20-24 probes), maximizing the chances of successful determination of NA sub-types. The simplest solution would be to choose only highly represented probes. However, the downside of this approach is that the procedure would be biased for probes specific to variants of NA from viruses containing the most prevalent sub-type of HA. Therefore we have used a special procedure to ensure more uniform representation of sequences.

Reducing representation bias
The existing nucleotide databases predominantly contain sequences of NA gene of sub-types N1 and N2, while there is significantly less information about other sub-types. Similarly, the HA gene is mostly represented by sub-types H1, H3, and H5. In addition, within each sub-type of NA, sequences with different matching sub-types of HA are also not represented equally (Tables S1 and S2). For instance, the N1 genotype is predominantly matched with sub-types H1 (1140 out of 2696 sequences or 42% of the database) and H5 (1295 or 48%), while some other HA subtypes were under-represented (H4, H10, H12) or absent altogether (H8, H13, H14, H15). The correlation between the structure of the NA gene and the type of HA is determined by coevolution of Influenza A viruses: changes in one gene are accompanied by changes in other genes [32,33].
To minimize the representation bias, the following approach was used (Fig. 2). First, the probe (Probe_1-1) most represented in all sequences (Seq_0) was selected from the entire set of probes (Probe_Total). Next, sequences containing Probe_1-1 were removed from Seq_0, producing a smaller set Seq_1. Next, Probe_1-2, the probe most represented in Seq_1, was identified, and the sequences containing it were removed from Seq_1. This iterative operation was repeated to generate Probe_1-3, Probe_1-4, Probe_1-5, and Probe_1-6.
During the next phase, the first six probes (Probe_1-1 to Probe_1-6) were removed from the entire pool of probes (Probe_Total), and the remaining probes were subjected to the same selection procedure using the entire starting DNA database Seq_0 to generate an additional set of probes, Probe_2-1 to Probe_2-6. The entire procedure was repeated three more times to create three additional sets of six probes. The structures of the resulting 30 probes are presented in Table S3. If the number of sequences was small (as in the case of N4 neuraminidase represented by 31 cDNA sequences), the first-generation probes Probe_1-1 to Probe_1-3 completely covered the entire DNA database. In this case, to find more probes (20-24), iterations were repeated more than five times. Table 2 shows that with an increased number of iterations, the number of highly represented probes drops from 1257 to 1107 (Probe_1-1 and Probe_5-1), while the number of poorly represented probes increases from 33 to 111 (Probe_1-6 and Probe_5-6).
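The iterative procedure amounts to a repeated greedy covering of the sequence database. The sketch below is one possible reading of it, not the authors' implementation; the function names are hypothetical, probes and sequences are plain strings on the same strand, and "represented" is taken to mean an exact substring match.

```python
def greedy_round(probes, sequences, per_round=6):
    """One generation: pick probes that cover the still-uncovered sequences."""
    remaining = list(sequences)   # starts as Seq_0
    available = list(probes)
    chosen = []
    while remaining and available and len(chosen) < per_round:
        # Probe represented in the largest number of remaining sequences.
        best = max(available, key=lambda p: sum(p in s for s in remaining))
        if not any(best in s for s in remaining):
            break  # no probe matches what is left
        chosen.append(best)
        available.remove(best)
        # Remove the sequences containing the chosen probe (Seq_1, Seq_2, ...).
        remaining = [s for s in remaining if best not in s]
    return chosen


def select_probes(probes, sequences, rounds=5, per_round=6):
    """Repeat the round on the full database, excluding earlier picks."""
    pool = list(probes)           # Probe_Total
    selected = []
    for _ in range(rounds):
        picked = greedy_round(pool, sequences, per_round)
        if not picked:
            break
        selected.extend(picked)
        pool = [p for p in pool if p not in picked]
    return selected
```

With the defaults, five rounds of six probes give up to 30 probes; for a small database such as N4, `rounds` would be increased because each round stops as soon as every sequence is covered.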
In addition, there are other ways to select probes. For instance, to include the least represented sequences (such as H8N1 and H10N1), the above method could be modified to exclude from Seq_0 those sequences that match more probes than a pre-set limit (N_thresh). By varying this parameter, one can change the share of probes specific to N1 neuraminidase represented by a small number of DNA sequences. At low N_thresh, the higher number of DNA sequences from the entire set will match at least N_thresh probes. In contrast, at high N_thresh, probe selection will be done essentially as in the first method. In the present example, for all remaining NA sub-types, sets of 30 probes were selected.

Minimizing cross-subtype hybridization and probe optimization
To reduce possible false positive hybridization results, oligonucleotide probes were tested for matches to other sub-types after introduction of one or two nucleotide changes. Table S3 shows the primary set of probes for the determination of N1 neuraminidase. Some of these probes demonstrated potential matches with other sub-types (N2, N3, … N9) with one or two mutations. Therefore, they were excluded from the set.
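A check of this kind can be sketched as a sliding-window comparison that counts mismatches between a probe and every same-length stretch of a sequence from another sub-type. The code below is an illustrative assumption about how such a test might look: it considers substitutions only (no insertions or deletions), scans a single strand, and uses hypothetical function names.

```python
def min_mismatches(probe, sequence):
    """Smallest number of mismatches between the probe and any window.

    Returns len(probe) if the sequence is shorter than the probe.
    """
    n = len(probe)
    best = n
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        distance = sum(a != b for a, b in zip(probe, window))
        best = min(best, distance)
    return best


def is_cross_reactive(probe, other_subtype_sequences, max_mismatches=2):
    """True if the probe matches any foreign sequence with <= max_mismatches."""
    return any(min_mismatches(probe, s) <= max_mismatches
               for s in other_subtype_sequences)


def filter_specific(probes, other_subtype_sequences, max_mismatches=2):
    """Keep only probes that need three or more changes to match another sub-type."""
    return [p for p in probes
            if not is_cross_reactive(p, other_subtype_sequences, max_mismatches)]
```

For a probe set designed for N1, `other_subtype_sequences` would contain the N2-N9 sequences.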
It should be noted that even some perfect probes, as well as probes with one or two mutations, that recognize N1 neuraminidase genes can be found in sequences of N2, albeit very rarely (Table 2). Detailed analysis shows that the only sequence of N2 neuraminidase that binds these N1-specific probes is virus isolate Influenza A/mallard duck/Minnesota/1979, classified as H3N2. This sequence does not have any probes (perfect or mismatched by 1 or 2 nucleotides) specific to N2 neuraminidase. On the other hand, the sequence contains four perfect probes specific to N1. A similar situation exists for isolates of Influenza A/Equine/New Market/79 (published as belonging to N3 sub-type but matching N8 based on the oligonucleotide probe pattern), A/ruddy turnstone/Delaware Bay/262/2006(H7N5) (N5 based on traditional typing, but N3 based on oligonucleotide probes), and A/Anas plathyrhynchos/Spain/1252/2007(H6N4) (N4 vs. N5).
It appears likely that these four examples represent erroneous typing by conventional methods. Without access to the stocks of these viruses, it was impossible to directly test this hypothesis. However, independent phylogenetic analysis supports this interpretation. Fig. 3 shows the phylogenetic tree of a number of DNA sequences of N4 and N5 neuraminidase in which the A/Anas plathyrhynchos/Spain/1252/2007 isolate is located within the N5 clade.
Finally, to generate probes that would hybridize with their targets with uniform efficiency, their lengths were modified to adjust the melting temperature of probe-DNA complexes. The range of melting temperatures was not very restrictive (10°C), and probe length adjustment included shortening or lengthening of a probe by 1-2 nucleotides within constant parts of codons.
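The text does not state how melting temperatures were calculated, so the following sketch substitutes a widely used basic approximation, Tm = 64.9 + 41 × (G + C − 16.4) / N, purely for illustration. The target temperature of 55°C, the function names, and the choice to vary only the 3′ end of the probe are assumptions; the restriction to constant parts of codons is not modeled.

```python
def basic_tm(oligo):
    """Approximate melting temperature (deg C) from length and GC count."""
    gc = sum(base in "GC" for base in oligo.upper())
    return 64.9 + 41.0 * (gc - 16.4) / len(oligo)


def adjust_length(template, start, length=21, target_tm=55.0, max_shift=2):
    """Trim or extend a probe by up to max_shift bases to approach target_tm.

    The probe is read from `template` starting at `start`; only its end
    position is varied. Returns None if no candidate fits in the template.
    """
    best = None
    for delta in range(-max_shift, max_shift + 1):
        end = start + length + delta
        if end > len(template):
            continue
        candidate = template[start:end]
        diff = abs(basic_tm(candidate) - target_tm)
        if best is None or diff < best[0]:
            best = (diff, candidate)
    return None if best is None else best[1]
```

A thermodynamic nearest-neighbor model would give more realistic values; the simple formula is used here only to keep the example self-contained.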

In silico testing of oligonucleotide probes
Length-modified probes were tested for matching sequences of other sub-types. Results showed that such modifications did not lead to the loss of specificity. By using the above procedure, probes specific to all nine sub-types of neuraminidase were selected (Table S3). The number of probes specific to each sub-type varied between 19 and 24; a total of 191 probes were selected. A similar procedure was used for selection of probes to HA genes, with the exception of H14 for which there were only two sequences in GenBank (Table S4).
It is noteworthy that both amino acid sequences and primary DNA structures of the HA gene of H1N1 swine influenza virus substantially differed from other variants of H1 hemagglutinin. Therefore, none of the probes selected based on GenBank sequences deposited prior to March of 2008 matched the DNA of H1N1 swine influenza. This allowed us to design a set of probes for selective determination of H1 hemagglutinin of swine H1N1 influenza virus. Structures of these probes are shown in Table S4 (H1_swine). There are no significant differences in neuraminidase of this virus, and probes identified for N1 sub-type (Table S3) are suitable for genotyping the NA gene of swine H1N1 virus.
It is important that the probes can also genotype sub-types of HA and NA of viruses deposited in GenBank after March 2008, i.e., DNA sequences that were not part of the set used to generate the probes (set 1). Table S5 shows results of in silico testing for N1 neuraminidase in sequences deposited from March 2009 to March 2010 (set 2). To minimize bias, only sequences of 1000 nucleotides or longer were used. The results show that the number of probes specific to N1 depends on the type of accompanying hemagglutinin. DNA of well-represented subtypes (H1N1 and H5N1) can be determined on average by 4-5 perfect probes, for both sets 1 and 2. For other sub-types of HA that are less represented, the average number of perfect probes varies from 2.3 to 5.6 and in most cases does not depend on the analyzed set of sequences. Similar conclusions can be made for other sub-types of NA and HA. This finding suggests that the probes developed in this work could be used for the genotyping of NA and HA sub-types of newly isolated strains of Influenza A. On the other hand, this result does not mean that the probe sets listed in Tables S3 and S4 are rigidly determined and cannot be expanded or modified. It may be useful to perform similar analyses in the future as the number of known sequences expands, closing gaps in the representation of some sub-types that exist today.
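The kind of summary reported for this in silico test, the average number of perfectly matching probes per sequence grouped by the accompanying HA sub-type, can be computed along the following lines. This is a hedged sketch: the record format `(ha_subtype, na_sequence)` and the function names are hypothetical, and a perfect match is taken to be an exact substring on the same strand.

```python
from collections import defaultdict


def perfect_probe_count(probes, sequence):
    """Number of probes that occur in the sequence without mismatches."""
    return sum(p in sequence for p in probes)


def mean_perfect_by_ha(probes, records, min_length=1000):
    """Average perfect-probe count per HA sub-type.

    `records` is an iterable of (ha_subtype, na_sequence) pairs; sequences
    shorter than min_length nucleotides are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # ha -> [sum of hits, sequence count]
    for ha, seq in records:
        if len(seq) < min_length:
            continue
        entry = totals[ha]
        entry[0] += perfect_probe_count(probes, seq)
        entry[1] += 1
    return {ha: hits / n for ha, (hits, n) in totals.items()}
```

Running the same function on the design set and on later GenBank deposits would allow the two sets to be compared sub-type by sub-type.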

Microarray hybridization
The probes developed in this work were used to create a microchip containing two sub-arrays. One 16×9 spot sub-array contained NA-specific probes, while another 16×24 spot sub-array contained HA-specific probes (Fig. 4a and 4b). Probes contained a 3′-amino hexyl linker to enable immobilization on glass slides modified with phenyl diisothiocyanate [34]. Microchip edges were marked with (dT)8 oligonucleotide with a 3′-amino hexyl linker and 5′-bound fluorescent dye TAMRA.
Microarrays were hybridized with amplicons prepared from cDNA of Influenza A virus. Because probes were selected from information regarding full-length sequences of NA and HA, it was important to generate PCR amplicons of appropriate length. To do this, multiplex PCR was used to amplify full-length HA and NA genes that were then used to generate fluorescently labeled DNA for hybridization with microarrays.
Multiplex PCR amplification of HA and NA genes of Influenza A virus was based on primers NA_F1, NA_R2, HA_F1, and HA_R2 described in [5] that were trimmed at the 5′-end. To increase specificity and the yield of fluorescently labeled DNA for hybridization, additional internal primers were selected to split each amplicon into two segments (see Materials and Methods).
Because of the significant sequence diversity, selection of internal PCR primers was performed in two steps. First, a conserved 7-mer peptide was chosen and reverse translated into a degenerate nucleotide sequence corresponding to a set of 21-base long oligonucleotides coding for this peptide. Amplification was conducted asymmetrically to generate predominantly single-stranded DNA for subsequent microchip hybridization.
The microchip created in this work was tested using samples representing different sub-types of hemagglutinin and neuraminidase (Table 3). A total of 40 samples of hemagglutinin genes (H1, H2, H3, H4, H5, H6, H7, H9, H10, H11, and H13) and 40 samples of neuraminidase (N1, N2, N3, N6, N7, N8) were tested. The hybridization of fluorescently labeled amplicons was performed at 55-60°C for 2 hours. Microchip slides were then washed, scanned, and analyzed using ScanArray software. To assess results of microarray genotyping, two parameters were used:
1. Average fluorescence of a spot (Ym), equal to the total fluorescence of all spots specific to a sub-type divided by the number of such spots. This parameter was chosen because the analyzed DNA could not only form perfect duplexes with probes, but also duplexes with one or two mismatches, providing additional information about the sub-type of test samples. For instance, for the A/Orenburg/IIV2974/2009(H1N1) isolate, 6 out of 22 probes can form a perfect duplex. For the A/Kyoto/07K303/2008(H1N1) isolate, the number of perfect duplexes is lower (2), but in this case there are also 7 duplexes with 1 mismatch. At the same time, the chances of forming stable complexes between these probes and other sub-types are substantially lower; the rules of probe selection allowed only three or more mismatches.
2. The fraction of positive spots, i.e., the number of spots specific to a sub-type with fluorescence exceeding the average fluorescence of all spots used for genotyping. Even though this parameter is linked to the first one, it identifies false positive results caused by high levels of fluorescence. Therefore, both parameters must be used for the evaluation of results. If both identify the same subtype, this increases the reliability of this conclusion.
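The two parameters can be computed from a list of spot intensities as sketched below. This is an illustration under stated assumptions rather than the analysis actually used: the input format `(subtype, fluorescence)`, the function names, and the rule for combining the two parameters into a "+" or "+/-" call are hypothetical, and measurement error is ignored.

```python
def subtype_scores(spots):
    """Return {subtype: (Ym, positive_fraction)} for a list of spots.

    `spots` is a list of (subtype, fluorescence) pairs covering all spots
    used for genotyping.
    """
    overall_mean = sum(f for _, f in spots) / len(spots)
    by_subtype = {}
    for subtype, fluorescence in spots:
        by_subtype.setdefault(subtype, []).append(fluorescence)
    scores = {}
    for subtype, values in by_subtype.items():
        ym = sum(values) / len(values)  # average fluorescence of a spot
        # Share of this sub-type's spots brighter than the overall average.
        positive_fraction = sum(v > overall_mean for v in values) / len(values)
        scores[subtype] = (ym, positive_fraction)
    return scores


def call_subtype(scores):
    """Call the sub-type by Ym; mark '+' only if both parameters agree."""
    best_by_ym = max(scores, key=lambda s: scores[s][0])
    best_by_fraction = max(scores, key=lambda s: scores[s][1])
    flag = "+" if best_by_ym == best_by_fraction else "+/-"
    return best_by_ym, flag
```

In this reading, a call supported by both parameters is the more reliable one, while a call supported by average fluorescence alone is flagged as ambiguous.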
The error of calculation was determined based on variations in fluorescence intensity of marker spots located in the left and in the right columns of the microchip (Fig. 4). This error is integral and includes all errors related to fabrication, handling, and processing of the microchips. Fig. 5 shows examples of hybridization results for amplicons of HA and NA of Influenza A viruses. It is clear that sub-types of the isolates can be identified unambiguously. For 36 out of 40 virus samples tested in this experiment, subtypes of HA and NA were clearly identified (Table 4). For one of the samples (A/mallard/Crimea/2027/2006), the subtype of NA was unknown but was clearly identified as N7. For another sample, A/duck/Primorie/3691/02, which had no NA data in GenBank but was claimed to be H12N2, the microchip identified the NA as N5.
For two samples of HA from A/Chicken/Novosibirsk region/326/05 and A/duck/Primorie/3691/02 isolates and the NA from isolate A/herring gull/Mongolia/454/08, results were ambiguous, confirming the sub-type by only one of the two parameters described above. We also could not determine the sub-type of NA for A/Wisconsin/67/05×Pr/8 X-161 and A/pintail/Kamchatka/411/2007 isolates, as well as the HA sub-type for isolates A/Mallard/Mongolia/2307/2006 and A/mallard/Crimea/2027/2006.
Interestingly, variations in temperature during hybridization do not affect hybridization patterns and relative distributions of intensities across the microchip. A temperature increase above 60°C reduces the number of non-specific spots but leads to the overall decrease of fluorescence, complicating the analysis and negatively impacting the overall performance of the test. Similarly, hybridization at temperatures below 50°C increases binding with non-specific oligonucleotide probes and leads to smaller differences between positive and negative spots. At the same time, the 10°C difference between Tm of different probes did not impact the performance of respective spots.
In conclusion, this communication describes a new method for the development of oligonucleotide hybridization probes that enables the creation of universal microchips for sub-typing HA and NA genes of Influenza A viruses. Evaluation of these probes, both in silico by matching them with database sequences and using them to experimentally identify new virus isolates, demonstrates high reliability of the proposed probe set for sub-typing HA and NA genes of Influenza A virus.

Materials and Methods
The selection of probes for sub-typing NA and HA genes was performed using custom software consisting of a set of Microsoft Basic macros that implement the algorithms described above. Phylogenetic analysis was performed with version 4.0 of MEGA software using the neighbor-joining clustering algorithm.
Oligonucleotide probes and PCR primers were synthesized using ASM-800 DNA synthesizer (Biosset Ltd, Novosibirsk, Russia) using standard protocols. Microarrays were printed on glass phenyl isothiocyanate slides using Odyssey Calligrapher (BioRad) spotter with a 360 µm pin. Each slide contained four identical sub-arrays suitable for simultaneous hybridization of four samples. Each subarray contained 9 and 16 lines of spots specific for NA and HA, respectively. The layout of the probes is shown in Fig. 4.
cDNA of samples 22-40 was produced from genomic RNA isolated from 140 µl of tissue culture fluid (virus titer 10^5-10^7 PFU/ml) using QIAquick Viral RNA Kit (QIAGEN, Valencia, CA). Reverse transcription was performed as follows. The mixture containing 1 µl of GACTAATACGACTCACTATAGGGAGCAAAAGCAGG primer (20 µM), 9 µl of RNA extracted from 80 µl of virus, and 2 µl of the dNTP (10 mM) mixture was heated for 10 min at 65°C and chilled on ice. Next, 4 µl of 5× cDNA buffer, 1 µl of water, and 1 µl of ThermoScript (Invitrogen, Carlsbad, CA) reverse transcriptase were added. The mixture was centrifuged and incubated for 5 min at 55°C and 55 min at 60°C. The amplicon lengths were ~800 and ~600 bp for neuraminidase and ~1,100 and ~700 bp for hemagglutinin. To generate predominantly single-stranded DNA for subsequent hybridization, asymmetric PCR was performed [12]. The PCR mixture included (per 50 µl) Qiagen PCR buffer (1.5 mM MgCl2, KCl, (NH4)2SO4, Tris-HCl, pH 8.7), 1.5 mM MgCl2, 200 µM dATP, dCTP, and dGTP, 70 µM dTTP, 20 µM dUTP-Cy3, 80 nM of forward primer, 1 µM of reverse primer, 1.25 U of Qiagen Hot-Start Taq polymerase, and 1 µl of full-length amplicon template. PCR was performed in a BioRad iCycler under the following conditions: 15 min at 95°C followed by 34 cycles consisting of 30 sec at 94°C, 30 sec at 52-55°C, and 80 sec at 72°C. The amplification was monitored by electrophoresis in 1% agarose gel with EtBr staining. Fluorescently labeled amplicons were purified from the excess of dNTPs using Qiagen gel filtration columns and were further dried in a vacuum centrifuge at 60°C.
The dried fluorescently labeled amplicons were reconstituted in 10 µl of water, supplemented with 10 µl of 2× hybridization buffer, heated for 2 min at 97°C, and chilled on ice before hybridization. 1× hybridization buffer contained 6× SSC, 5× Denhardt solution, and 0.1% Tween 20. Hybridization was performed in an ArrayIt chamber (Sunnyvale, CA) at 55-60°C for 2 hours. Next, slides were sequentially washed in 6× SSC, 2× SSC with 0.2% SDS, 2× SSC, and 1× SSC. Slides were scanned using ScanArray Express 2.0 (Perkin Elmer) at 543 nm (Cy3) and 633 nm (Cy5). The images were analyzed with ScanArray Express software (Perkin Elmer).

Supporting Information
Table S1 The number of isolates of Influenza A viruses with different sub-types of neuraminidase analyzed in this work.
(XLS)
Table S5 Effectiveness of N1 neuraminidase sub-typing by using oligonucleotide probes. Probes specific for N1 neuraminidase (see Table 5) and cDNA of neuraminidase longer than 1,000 nucleotides were used.
(XLS)
*One of the isolates tests as N5.
**Sub-type of NA was unknown, but was determined to be N7.
(+): sub-type call was made by using two parameters.
(+/-): sub-type call was made by one parameter (average fluorescence of a spot).