The Evolutionary Pattern of Glycosylation Sites in Influenza Virus (H5N1) Hemagglutinin and Neuraminidase

Two glycoproteins, hemagglutinin (HA) and neuraminidase (NA), on the surface of influenza viruses play crucial roles in transfaunation, membrane fusion and the release of progeny virions. To explore the distribution of N-glycosylation sites (glycosites) in these two glycoproteins, we collected and aligned the amino acid sequences of all the HA and NA subtypes. Two glycosites were located at HA0 cleavage sites and fusion peptides and were strikingly conserved in all HA subtypes, while the remaining glycosites were unique to their subtypes. Two to four conserved glycosites were found in the stalk domain of NA, but these are affected by the deletion of specific stalk domain sequences. Another highly conserved glycosite appeared at the top center of tetrameric global domain, while the others glycosites were distributed around the global domain. Here we present a detailed investigation of the distribution and the evolutionary pattern of the glycosites in the envelope glycoproteins of IVs, and further focus on the H5N1 virus and conclude that the glycosites in H5N1 have become more complicated in HA and less influential in NA in the last five years.


Introduction
Influenza A viruses (IVs), which belong to the orthomyxoviridae family, consist of eight negative RNA strands. Hemagglutinin (HA) and neuraminidase (NA) are two glycoproteins that are encoded by the IV genome, expressed from segments 4 and 6, respectively. The selection due to various host immune systems and anti-flu drugs accelerate the mutation rates of viral proteins, especially for these two membrane proteins [1,2]. There are 17 HA subtypes and 10 NA subtypes, designated H1-H17 and N1-N10, respectively. Over 118 combinations of IVs can be isolated from wild birds, which are also the natural reservoir of these viruses (except the H17N10 virus, which, until recently, was isolated only from bat) [3][4][5]. The species jumping ability of IVs can result in the infections of poultry and mammals, such as chicken, swine, equine or whale species, with different virulence levels [6][7][8][9]. The H1N1, H2N2 and H3N2 viruses have been responsible for tens of millions deaths during the deadly history of human influenza epidemics. Furthermore, the H5N1, H7N7, H7N2, H7N3 and H9N2 viruses have been isolated from sporadic human infections and deaths [10][11][12][13]. It is worth noting that the H5N1 virus is the most severe for human and avian species, with sudden onset and high mortality. The mortality rate in hundreds of patients who were hospitalized for H5N1 infections was roughly 59.05%, much higher than the mortality rates of the Spanish Flu or the 2009 influenza pandemic (H1N1) [11,14,15].
As a requirement for infection, the homotrimeric HAs play a key role in binding to the host sialic acid (SA) receptors and membrane fusion. The nascent HA of all subtypes consists of conserved structures, including the signal peptide, the cytoplasm domain, the transmembrane domain and the extracellular domain [16]. The mature HA monomer can be cleaved by proteases into the global HA1 and stalk HA2 subunits [17]. When IVs are located in the host digestive tract or respiratory tract cell, the receptor binding domains (RBDs) at the tip of HA1 bind to the SAa2-3Gal or SAa2-6Gal receptors, which are essential for endocytosis [18,19]. HA unfold and expose the interior HA2 subunits in the acid environment, then the fusion peptides in HA2 insert themselves into the host membrane (viral membrane fusion) [20,21].
Homotetrameric NA is a type II membrane protein, whereas HA is a type I membrane protein. The nascent NA consists of four parts: the cytoplasm tail (in amino-terminus), the transmembrane domain, the stalk domain and the global domain [22]. Different subtypes of NA are composed of 450,480 amino acids, displaying low sequence similarity. Although there is variable homology among the various NA subtype sequences, especially in the N1 and N2 subtypes with the deletion of 4,30 amino acids in the stalk domain [23], NA subtypes display stable topologies: a six-bladed b-propeller fold makes an enzymatic activity domain that functions in the release of progeny virions [24,25].
HA and NA have a distribution ratio of 4:1 on the influenza viral envelop and maintain the basic functions of host recognition, infection and viral diffusion [26]. Various studies have reported that some of the factors that influence HA include the number of the basic residues in the HA0 cleavage site, the mutation of key residues in the RBD, the changing of antigenic sites or Nglycosylation sites (glycosites) and the variation of the topology of N-glycan structures [15,27]. Meanwhile, the factors that influence NA include deletion of the stalk domain, the mutations drugresistance, as well as the changing of antigen sites or glycosites and the variation of the topology of N-glycan structures [28]. As a kind of glycan-binding protein (GBP), HA works with NA, which functions as an exoglycosidase, cooperatively. Only by achieving a dynamic equilibrium between attaching to the host and releasing progeny virions can the IVs gain a long-term mechanism for infection and diffusion.
The existence of N-glycosylation is necessary for viral membrane glycoproteins. The biosynthesis and modification of nascent secretory or membrane proteins occurs in the endoplasmic reticulum and Golgi, N-linked glycans encode crucial information for the folding, maturation, transport or degradation of proteins [29]. To escape both the host's humoral and cellular immune systems, the potential glycosites in viral envelope proteins can provide the identical glycans as those of the host's cells to mask the antigenic sites [28,30]. Additionally, glycosylation also impacts the sensitivity of HA to temperature, the protection of cleavage sites and the stalk domain, and even the receptor-binding preferences [31][32][33]. As the ideal model for the influence of N-glycosylation in pathogen-host interaction, the present studies show that the envelope glycoproteins of IVs appear to only have N-glycosylation, with no O-glycosylation and GPI-anchors [34]; hence, the glycosites discussed in this paper only pertain to N-glycosylation site. Figure 1. The N-J trees of two glycoproteins in IVs with the corresponding distribution chart of glycosites. The phylogenetic trees of HAs and NAs were constructed using three to ten representative amino acid sequences in each subtype (File S1). The distribution charts of glycosites, colored according to the statistics of conservation in each HA or NA subtype (File S2), are shown in various strips. The red, green and blue color represent the levels of conservation of ''.95%'', ''5%,95%'' and '',5%'', respectively. The conserved cysteines are shown in yellow strips. (A) The N-J tree of HA subtypes with the corresponding distribution chart of glycosites. (B) The N-J tree of NA subtypes with the corresponding distribution chart of glycosites. doi:10.1371/journal.pone.0049224.g001 The Glycosylation Sites in Influenza Virus PLOS ONE | www.plosone.org Current studies have analyzed the evolutionary dynamics of Nglycosylation sites of select subtypes or HA/NA as a whole [35][36][37]. Although most influenza evolution can be accounted for by genetic drift, there is also evidence of adaptive evolution of mutations which are under positive selection [38,39]. We have extended previous studies of basic similarity alignment scoring of similarity to analyze the position-specific glycosites that are under selective pressure in IVs [40]. Here we present a detailed investigation of the distribution and the evolutionary pattern of the glycosites in the envelope glycoproteins of IVs, especially in the H5N1 virus.

Materials and Methods
Protein sequence data from all subtypes of HA and NA The amino acid sequences of HA and NA were obtained from the NCBI (National Center for Biotechnology Information) Influenza Resource (http://www.ncbi.nlm.nih.gov/genomes/ FLU, accessed 15th March 2012) [5]. To fully understand the distributed regularity of the glycosites in each subtype, we downloaded the 29 sets of HA and NA with the customized definition ''.{serotype} {strain} {segname}'' using the following combinations: H1Nx,H17Nx and HyN1,HyN10 (where x and y represent ''any'' by default), in addition to the HA and NA from the H5N1 virus.
All the sequence alignments of the various HAs and NAs were performed using Clustal 2.0. Repetitive, incomplete and mixed sequences were removed. For the purpose of convenience, one representative sequence with the generally longest length from each subtype was chosen for further description (Table S1 and S2).
To assess the relationship between the distributed regularity of the glycosites in the envelope glycoproteins and the evolutionary position of various HAs and NAs, firstly, the preliminary trees for each HA and NA subtype were constructed using all available sequences; secondly ten representative sequences from discrepant clades, with consideration of host source and areas, were picked up (excluding H14, H17 and N10 due to limited records). Finally, the phylogenetic trees of HA and NA were constructed by these hundreds of sequences (File S1). The phylogenetic trees of the HA and NA were constructed by MEGA5.05 with N-J methods and a p-distance model with a bootstrap value of 1000 [41,42].

Prediction and statistical analysis of potential glycosites
As is known, N-X-T/S (X cannot be a proline) is the glycosite motif. Although some researchers have concluded that not all the potential glycosites would be glycosylated in mature glycoproteins, Kelley WM et al. argued that the 14-sugar glycans would be transferred to all the glycosites in nascent proteins and be truncated and modified in the subsequent process [42].
Due to the fixed pattern of glycosite, many programs and software provide the prediction of glycosites [43], such as the ''NetNGlyc 1.0 Server'' (www.cbs.dtu.dk/services/NetNGlyc). By submitting the alignment files to a prediction server, we obtained a series scores for the potential glycosites. In consideration of experimental error, those occasional glycosites, which may result from the genomic sequencing or translated by different genetic codes, were excluded. For example, the 429NLS in the HA of A/ duck/Hunan/3315/2006(H5N1) only appears in H5 set once, and it is also embedded inside of HA trimer [44]. Statistics of the position, pattern and the levels of conservation in various subtypes were used for further discussion (Table S1, File S2).

The comparison of 3D structures of HA and NA
The available structures of HA and NA would help us to explore the distinctive function of various glycosites. Therefore, we collected the structures of various HA or NA subtypes from the PDB (Protein Data Bank, http://www.rcsb.org/pdb/home/ home.do). To obtain the whole multimer, these monomer structures were processed with a VMD 1.9 transformation matrix using a Tcl script [45]. The code for each representative structure of the various subtypes of HA and NA and the corresponding source of IVs are shown in Table S2.
Most coordinate files of envelope glycoproteins are obtained from X-ray crystallography or NMR. Furthermore, complete larger glycans are too exible to yield sufficient electron density [46]. Isolation and purification of membrane glycoproteins by particular enzyme treatments lead to the lack of partial domains

Co-evolution of glycosites in the H5N1 virus
The H5N1 virus can be divided into 10 clades, according to the evolutionary position of the HAs defined by the WHO (World Health Organization) in 2008. These clades are numbered 0 to 9 [48,49]. Clades 0,2 are responsible for all the human infections, resulting in 359 deaths since 2003 [11]. Many clades have not been reported for the last four years, while the dominant H5N1 viruses are concentrated in clade 0,2. Clade 2.1, clade 2.2 and clade 2.3 have further evolved into third-or fourth-order clades, and these newly formed viruses have become geographic strains.
For the purpose of investigating the co-evolution relationship between the glycosite patterns of HA and NA, we reconstructed a phylogeny tree of HAs and determined ten clades. When discrete monophyletic groups with a common node meet a bootstrap value of $60 at the clade-defining node and average percentage pairwise nucleotide distances between and within clades of .1.5% and ,1.5%, respectively, they could split into the second-, third-, or even fourth-order clades [50]. The statistics of the glycosites of each HA, as well as the corresponding NA, were recorded according to their clades (Files S3, S4 and Figure S1).

Newly emerging Glycosites of HA appeared in distinctive subclades
As shown in Figure 1A, the current 17 HA subtypes were concentrated in two evolutionary groups. One large group, represented by H1 and H5, contained H2, H6, H8, H9, H11, H12, H13 and H16; Another large group was represented by H3, and contained H4, H7, H10, H14 and H15. Although each subtype of HA had a lower similarity to each other and the distribution of the glycosites differed in sequential numbering, we found that the sequence alignment of all HA subtypes indicated that partial glycosites appeared in similar domains. Furthermore, the structure alignment of available HAs also showed that various HA subtypes had a highly conserved structure, and the distribution of glycosites was also regular ( Figure S2).
By analyzing the position of glycosites and their conservative rates, we conclude that two types of glycosites appear in HAs: one with a high level of conservation in all HA subtypes and another with various conservative rates in different HA subtypes. Two highly conserved glycosites are located near the HA0 cleavage site (e.g., the 27NNST in H1 or the 30NGT in H8) and the fusion peptide of the HA2 (e.g., the 498NGT in H1 or 500NGS in H5) respectively in all subtypes, and these two glycosites play necessary role in viral life cycle for protecting the HA0 cleavage sites and fusion peptide [34,51]. In addition, another highly conserved glycosite appears at C-terminal part of the HA1 sequence, which is near the connection of the global and stalk domains, except in H7 and H15 sequences [52]. Three to ten characteristic glycosites were distinctive in each subtype. Their conservative rates were influenced by different internal evolution branches, ranked from 0.5% (e.g., 292NGS with 2.85% in H3) to 100% (e.g., 38NGT with 99.94% in H3), distributed mainly in the global domain. It is worth mentioning that some highly conserved glycosites were near to cysteines (Figure 1). Kozlov et al. hypothesized that ERp53, which is involved in the formation of disulfide bonds during the folding of nascent proteins, would form a complex with calreticulin/calnexin, which depends on precursor N-glycans [53].
One most obvious characteristics of glycosites was that the increasing samples or cross-species reports would lead to more glycosites. There are many factors contributing to the existence of numerous glycosites in H1, H3 and H5 subtypes. Because the RNA virus is rapidly mutating, long-term cross-species infections have resulted in the accumulation of adaptive mutations and glycosites. Due to the selection pressure of different host immune systems, a disadvantageous mutation would be eliminated while a virus with the gains and losses of glycosites would be conserved.
For example, the HAs from avian H1N1 or H3N2 virus had less glycosites than those of the seasonal human H1N1 or H3N2 virus (e.g., 286NAS in A/Memphis/28/1983 (H1N1). Figure 2A and File 5S).
Not all the glycosites shared the same numbering in one subtype. For this reason, we chose a representative sequence for each subtype (Table S1). Most glycosites were influenced by the deletion or insertion of the mutants in the upstream sequence, which is also known as ''antigenic drift'' [54]. In the HAs of the H1N1 or H5N1 viruses, some nearby glycosites were determined by ''antigenic shift'' [55]. In the H1 subtypes, the 144NVT glycosite from the seasonal H1N1 flu, which were isolated from 1940s to 1980s (e.g., A/Memphis/1/1984 (H1N1)), was replaced by the 142NHT glycosite after the 1990s (e.g., A/California/04/ 2007 (H1N1)). Additionally, no such glycosylation appeared in those H1N1 viruses isolated from swine H1N1 flu (e.g., A/swine/ Guangdong/1604/2010 (H1N1)), Spanish flu (e.g., A/South Carolina/1/1918 (H1N1)) and the 2009 influenza pandemic (H1N1) (e.g., A/Mexico city/CIA1/2009 (H1N1)). It has been partially shown that vaccination of mice with the 1918 influenza strain protected against subsequent lethal infection by the 2009 virus; however, the 1918 strain did not protect against the seasonal H1N1 flu [56]. It is interesting to note that the migration of the 179NKS to 177NLS tracks with these effects. Almost all of the 179NKS glycosites identified were centralized in swine-origin influenza viruses (S-OIVs, e.g., A/Swine/Guandong/1604/2010 (H1N1)) and parts of the 2009 influeza pandemic H1N1 virus (e.g., A/Mexico City/014/2009 (H1N1)), while the 177NLS glycosite was mainly found in the swine or human H1N2 viruses (e.g., A/ New York/481/2003 (H1N2)) and most seasonal flu viruses (e.g., A/California/04/2007 (H1N1)). These date further confirm that the HA of the 2009 influenza pandemic (H1N1) originated from S-OIV, not previous seasonal virus [57]. Similar positional conversion of the glycosites in H5N1 caused by ''antigenic drift'' was the ''179NYT/181NNT''. The conserved 181NNT glycosites have been replaced by the 179NYT in parts of clade 7 viruses since 2005. It reminds us that although these glycosites were adjacent, their N-glycans would shield different antigenic sites, which would provide some suggestions for the development of influenza vaccine.

The glycosylation of NA is affected by the deletion of the stalk domain
It can be observed from Figure 1B that N10 is highly divergent from other subtypes [4]. The remainder of the current 9 NA subtypes are concentrated into two evolutionary groups: one group was represented by N2 and contains N3, N6, N7, and N9; and another group contains N1, N4, N5, and N8.
The glycosites of NA can be divided into two types according to the distributed region: two to four highly conserved glycosites are located in the stalk domain in each subtype; two conserved glycosites and most middle-low conserved glycosites are mainly located in the global domain, which are near the tip of NA, the connection of the global and stalk domains, or the antigenic sites. The glycosite of 146N is conservative in all NA subtypes (e.g., 146NDT in A/Boston/20/2008(H3N2), 144NGT in A/duck/ Taiwan/4201/99(H7N7) or 146NGT in /Viet Nam/1203/ 2004(H5N1)). It has been shown that the N-glycan at this glycosite affects NA enzymatic activity, causing a 20-fold decrease in activity [58]. Similar to the description of HAs, a large number of H1N1, H3N2, H5N1, H7N2 and H9N2 viruses have accumulated numerous glycosites in N1 and N2, especially in the global domain, mainly participating in immune evasion. Moreover, one conserved glycosite, 12NTT (Conservative rate: 93%, e.g., A/ turkey/Italy/3807/2004 (H7N3)), located in the transmembrane domain in N3 and N10, requires further investigation [59].
The significance of the conserved glycosites in the stalk domain was providing the N-glycans to avoid the cleavage by host enzyme (e.g., Trypsin) [60,61]. The variance of the glycosites was closely related to the deletion of the stalk domain. Although the threedimensional structure of stalk domain has not been determined  yet, it is speculated that the presence of an a-helix motif in the uncrystallized structure has also been provided by cryoelectron microscopy [62]. Wagner et al. believed that a longer stem domain would enhance the replication capacity of the virus, while the deletion of the stem domain would decrease the enzymatic activity of NA [63]. Various subtypes had stalk domain deletions of 3 to 24 residues, except for N4, N8, N9 and N10. The numbers of deletions were also distinctive across different combinations of IVs and even within one subtype, such as the N2 subtype. Generally, no deletions were found in the NA of the H3N2 virus; deletion of 3 residues and one corresponding glycosite with the pattern ''E-R-61N-3-64T-V-H'' (meaning 3 residues missing between 61N and 64T, e.g., A/chicken/Zhejiang/611/ 2011 (H9N2)), appeared in most of the NA subtypes of the H9N2 virus. A similar deletion of 20 residues and two glycosites (I-E-60R-20-80N-I-I) appeared in most NA subtypes of the H6N2 virus (e.g., A/duck/Fujian/3193/2005 (H6N2)). The deletion of 16 residues and two glycosites as ''C-E-55P-16-72T-T-E'' were a distinctive part of the H7N2 virus (e.g., A/unknown/New York/19501-5/2006 (H7N2)). Parts of the H5N2 virus were characterized by the deletion of 20 residues and two glycosites (''R-N-62I-20-83G-Y-R'', e.g., A/chicken/Ibaraki/3/2005 (H5N2)). Moreover, in some avian H2N2 viruses, the deletion of 22 residues resulted in the loss of two glycosites; however, a new glycosite appeared in the newly created sequences (''P-A-47N-22-70N-T-V'', e.g., A/chicken/New York/Sg-00300/1997 (H2N2)). In all, the diversity of deletions in the stalk domain indicates that the glycosylation pattern of HA and NA has a complex relationship.

The co-evolutionary glycosite pattern in the H5N1 virus
Globally, the researchers have been paying attention to the highly pathogenic avian influenza (HPAI) since the first human death caused directly by avian H5N1 virus in 1997. It is generally considered that the HPAI viruses were characterized by polybasic residues in the HA0 cleavage site in HA and the deletion of the stalk domain in NA [64]. However, these characteristics had been reported even before 1990s. The earliest H5N1 virus, ''A/ chicken/Scotland/1959 (H5N1)'', had four continuous basic residues in the HA0 cleavage site (compared to 5,6 continuous basic residues in most common H5N1 HAs); the stalked deletion of NA also existed in A/turkey/Ontario/84/1983 (H5N1). The records of H5N1 virus have increased rapidly since 2003. Since then, a number of new clades and subclades have emerged and resulted in various new glycosites ( Figure 3A and 3B). The percentage of unconserved glycosites in HA also increased and diversified rapidly after 2003, meanwhile the percentage of unconserved glycosites in NA reduced gradually, regardless of those highly-conserved glycosites ( Figure 3C and 3D).
Both twelve glycosites were found in the HA and NA of the H5N1 virus, shown in the Tables 1 and 2 and Files S3 and S4.
Most H5N1 viruses were grouped into clade 0 before it appeared in Hong Kong again in 2003. Various patterns of the glycosites in HA and NA had co-existed in these original viruses. These original NAs contained the known glycosites, including four highly conserved glycosites in the stalk domain and seven in the global domain (except the occasional 341NGT which only appeared in clade 1 and Thailand records during 2004,2010, e.g., A/chicken/Thailand/CU-354/2008 (H5N1)). In contrast, six highly conserved glycosites together with 170NST exist widely in clade 0. Since 2003, the WHO has recorded a three-wave epidemic of H5N1, which resulted in hundreds of deaths and huge economic losses [11]. Until recently, the glycosite patterns were highly conserved in all the avian clades except for clade 7 (e.g., clade 3, 4, 5, 6, 8 and 9), as shown in Figure 4 and Figure S2.
Most currently recorded H5N1 viruses were concentrated in the fourth-order clades. During this decade, there have been increasing human infections and new glycosite patterns of HA and NA. It has been reported that the H5N1 virus of clade 2.2 was involved in the outbreak that occurred among the migratory bird population near Qinghai Lake in 2005. Since then clade 2.2 spread westward. This resulted in a number of deaths of wild birds in Europe [65]. In 2006, the H5N1 virus appeared in Africa for the first time, followed by hundreds of mortally infected humans in Egypt, Nigeria and Djibouti [66].  Figure 5D, File S3).
The 181NVT glycosite, which is located at the apical b-folding of HA, is conserved in all human IVs (e.g., A/Anhui/1/2007 (H5N1)). Previous statistics have indicated that this glycosite has a lower level of conservation in avian IVs (e.g., A/chicken/ Vietnam/NCVD-093/2008 (H5N1)). The results of the molecular dynamics simulation indicated that the a2-3-sialoglycans adopted a straight-like and outward topology structure while the a2-6sialoglycans were fishhook-like and inward; therefore, we inferred that the that the deficiency of the glycosite would benefit the binding of SAa2-3Gal sialoglycans [68]. Actually, all the viruses that had deficiencies of the 181N glycosite were isolated from the avian host, which were concentrated in clades 7.1, 7.2 and 2.2.1.1 ( Figure 6A, File S3).
Interestingly, the deletion of the stalk domain in H5N1 NA is variable. As the first human death reported in 1997, most H5N1 viruses that belonged to clade 0 and isolated in Hong Kong were characterized by ''N-Q-S-I-54I-18-73N-F-Y'', which remained a glycosite: 50NQS. Although previous studies conjectured that the H6N1 virus (e.g., A/Teal/Hong Kong/W312/97 (H6N1)) was the donor of the NA gene in 1997 HPAI virus [69,70]; however, similar motif could be found even in 1983. Since 2000, the most common pattern of ''A-E-48P-20-69I-S-N'' with four glycosites missing has dominated in the H5N1 NAs ( Figure 6B, File S4).

Discussion
There is one kind of glycoprotein that participates in the recognition and membrane fusion in most virus envelopes, such as the spike (S) protein in the SARS virus, the gp160 in HIV and HA in IVs [71,72]. These glycoproteins can bind to one specific glycan structure which is known as the lectin or GBP. Other viral glycoproteins, such as HN in the Newcastle Disease Virus or NA in IVs, function as the exoglycosidase in the release of virus particles [73].
The IVs have an innate capacity for high mutation rates because they are RNA viruses. Humans are under continuous attacks by newly emerging IVs which constantly undergo ''antigenic drift'' and ''antigenic shift''. Accumulation of substantial sequences and 3D coordinates of IV proteins have provided the ideal tools for the investigation of how mutations affect transfaunation, vaccine design and drug-resistance [74,75].
N-glycosylation not only influences the folding and secretion of glycoproteins such as HA and NA but also provides the same glycans (similar to the host's own glycans) to escape the host's immune system. As the key modification of biological significance in viral glycoproteins, we found the glycosites in 17 known HA subtypes and 10 known NA subtypes that have complex characteristics. In general, two highly conserved glycosites near the HA0 cleavage site and fusion peptide site may maintain the basic function of HA; these sites were seldom absent in the HA subtypes. More glycosites were identified in the global domain or in the connection of global and stalk domains, along with longterm and large-scale epidemics in several of the subtypes. The distributional regularity of the glycosites in the NA subtypes is also complex; two to four glycosites located in the stalk domain are highly conserved in various subtypes, and are affected by the deletion of the stalk domain. Another highly conserved glycosite was found at the tip of tetramer NA in all subtypes. Other glycosites were found to be mainly concentrated in the global domain, which surrounds the antigenic sites.
The HA or NA subtypes exhibited low similarities of amino acid sequences in all subtypes while maintaining identical structures, which revealed that the functions of various HA or NA subtypes remain conserved. Notably, some of the glycosites near the cysteines were found to also be conserved. The cysteines take the main role in stabilizing the tertiary structure; the conservation verifies that cysteines and N-glycans played an important role in the protein folding and quality control.
We have further investigated the H5N1 virus to elaborate the collaborative relationship of glycosites in HA and NA. Five highly conserved glycosites in HA had existed before the H5N1 virus first crossed species barriers and infected humans, as well as two additional glycosites, 181NVT and 170NST. Since 2003, the H5N1 virus has exhibited a rapid evolutionary dynamic. Under the selection pressure of different hosts and antigenic drift, the glycosite pattern of current H5N1 viruses in different geographical locations has been distinctive: the HA in clade 2.2.1.1, isolated from Egypt, lacked the 181N glycosite but had added the 88NVS glycosite. In addition, the HA in clade 2.3.2.1, isolated from China or Vietnam, lacked the 170N glycosite but had added the 152N glycosite. In contrast, as the glycosites in HA became more diverse in the H5N1 virus, the glycosylation in NA was impaired by a decreasing number of glycosites. The NAs of current H5N1 viruses lack a stalk domain and the four corresponding glycosites; except the four high conserved glycosites in the global domain, others glycosites have rarely been identified.
The envelope glycoproteins, which play a crucial role in virus recognition, invasion and spread. The analysis of the glycosites in HAs and NAs has provided basic information for vaccine design, host selection and changing virulence. However, infection is a complex process; the alternation of glycosites and glycan shapes may affect the functions of glycoproteins. In addition, there are other mutations also worthy of further consideration, such as the E627K in the PB2 protein that enhances the avian viral replication capacity in mammalian cells [76,77]. Drug resistance is also related to viral proteins, such as the M2 protein [78]. Thus, to prevent the next influenza pandemics, more research needs to be done. Figure S1 The N-J trees of H5N1 HA. The phylogenetic tree was inferred from protein sequences by the Neighbor-Joining method and rooted using A/chicken/Scotland/1959. Estimates of the statistical significance of the phylogenies were calculated by performing 1,000 bootstrap replicates. The clades classified by the WHO are shown as colored bars. (TIF) Figure S2 The superposition of crystal structures from various HA and NA subtypes. The red, green, blue and yellow color represent the H1, H3, H5 and H7 in HAs or N1, N2, N8 and N9 in NA respectively. File S5 The analytic alignment file containing the 5973 protein sequences of H3 HA. As is shown in the output file, the red, green and blue color represent the levels of conservation of ''.95%'', ''5%,95%'' and '',5%'', respectively. (ZIP)

Supporting Information
File S6 The N-J trees of H3 HA. The phylogenetic tree was inferred from H3 HA sequences by the N-J method and rooted using A/equine/Miami/1/1963(H3N8). Estimates of the statistical significance of the phylogenies were calculated by performing 1,000 bootstrap replicates. 633 representative H3 HA sequences were chosed in consideration of the age, hosts, areas, as well as the known HA sequences from H3N1 and H3N3,H3N7 viruses. Those unconserved glycosites are labeled (conservative rate ,95%).