Evolutionary dynamics of the H7N9 avian influenza virus based on large-scale sequence analysis

Since 2013, epidemics caused by novel H7N9 avian influenza A viruses (AIVs) have become a considerable public health issue. This study investigated the evolution of these viruses at the population level. Compared to H7 and N9 before 2013, there were 18 and 24 substitutions in the majority of novel H7N9 AIVs, respectively. Nine of these in HA and six in NA were rare before 2013, and four of these in HA and two in NA displayed host tropism. S136(128)N and A143(135)V are located on the receptor binding sites of the HA1 subunit and might be important factors in determining the host species of novel H7N9 AIV. On an overall scale, the evolution of H7 and N9, both in terms of time distribution and host species, is under negative selection. However, both in HA and NA, several sites were under positive selection. In both the overall epidemics and the human-derived H7N9 AIVs, eight positive selection sites were identified in HA1, with some located within the known antigen epitopes or the receptor binding site(RBS) domain. This may induce variations in H7N9 AIV with positive selection. It is necessary to strengthen the surveillance of novel H7N9 AIVs, both in human and bird population to determine whether a new virus has emerged through selection pressure and to prevent future epidemics from occurring.


Introduction
Influenza A viruses (IAVs) belong to the family Orthomyxoviridae. Based on the antigenic properties of the two surface glycoproteins hemagglutinin (HA) and neuraminidase (NA), IAVs are clustered into 18 HA (H1-H18) and 11 NA (N1-N11) subtypes [1]. Apart from H17, H18, N10, and N11, which were restrictively identified from bat samples in the form of H17N10 and H18N11, all of the other subtypes of the virus can circulate in avian species [2]. Occasionally, avian influenza A viruses (AIVs) can be transmitted from avian species to mammals, which may lead to the development of human pandemic strains by direct or indirect transmission [3,4]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 In the spring of 2013, the first human infection with a novel AIV of H7N9 subtype (referred to as novel H7N9 AIV) emerged in Eastern China, causing severe human infections and deaths [5]. Subsequent outbreaks of H7N9 influenza in humans have occurred in the winter and spring of every year since then, and China has now witnessed six H7N9 epidemics [6][7][8]. According to statistics from the World Health Organization, there have been a total of 1,567 laboratory-confirmed cases of human H7N9 AIV infection, including 615 deaths, as of December 6, 2018 (http://www.fao.org/ag/againfo/programmes/en/empres/H7N9/situation_ update.html). H7N9 has thus become a major public health issue. Previous studies have described the novel H7N9 AIV as a triple-reassortant virus of H7, N9, and H9N2 influenza viruses. Phylogenetic characteristics, variations involving key amino acids, virulence, and pathogenicity to human or animals regarding one or more strains isolated from human or bird have been well described [5][6][7][8][9][10][11], and numerous sequences have been submitted to public databases. These efforts have greatly facilitated research studies on the evolutionary dynamics of H7N9 AIVs based on large-scale sequence analysis.
Several viral proteins of IAVs are known to be responsible for immune response, host adaptation, or interspecies transmission, of which the membrane protein HA is the major determinant [12,13], while the balance between HA receptor-binding affinity and NA receptordestroying activity is critical for the efficient growth of IAVs. NA also contributes to influenza virus species specificity and determines the virulence and drug resistance of AIVs [14]. In this study, we further clarified the molecular basis of host tropism and virulence characteristics of novel H7N9 AIVs and provide evidence on the surveillance and early warning signs of an H7N9 epidemic. We used relevant influenza virus bioinformatics databases to investigate the evolutionary dynamics of HA and NA, screen the key amino acid sites related to virulence and infectious characteristics, and predict the possible evolution and mutation trends of novel H7N9 AIVs in the future.

Sequence distribution
As of August 29, 2018, there were 1,952 HA sequences of novel H7N9 AIVs in the two databases, of which 1,125 were derived from human and 827 from avian species. There were 1,918 NA sequences of novel H7N9 AIVs, of which 1,096 were derived from human and 822 from avian species. The main epidemic area of novel H7N9 is China.
For H7 and N9 before the H7N9 epidemics, 2,159 HA sequences were obtained from the two databases, of which 84 were derived from human and other mammals. A total of 758 NA sequences were obtained and except for one whale-derived isolate, all sequences were derived from birds. Because only a small number of sequences were derived from human and other mammals, H7 and N9 before the H7N9 epidemics were not further differentiated based on host.
Compared to both HAs from before 2013 and from the avian-derived novel H7N9 AIVs, human-derived novel H7N9 significantly differed from them at site 136 (corresponding to site 118 of mature HA1 of H7 or site 128 of mature HA1 of H3), 143 (corresponding to site 125 of mature HA1 of H7 or site 135 of mature HA1 of H3), 396, and 499 (p < 0.05). The majority of human-derived AIVs at these sites are N, V, A, and R, respectively, but those derived from avian species are S, A, E, and S, respectively. However, the substitution Q235(226)L did not show any host specificity, and the compositions of L occupied 92.27% of human-derived and 90.33% of avian-derived isolates ( Table 1).

Spatial structure analyses of HA1 subunits
Homology modeling for the tertiary structure demonstrated that sites 136 and 143 are both located on the surface of the HA1 subunit. Site 136 is located within the head of the HA1 subunit and adjacent to the 120-loop of RBS domain, whereas site 143 is situated within the left arm (or left side) of the RBS. A substitution from serine (S) of avian-derived H7 [which is represented by A/chicken/ Wuxi/0405005G/2013(H7N9) with 100% identity to the avian-derived majority] to asparagine (N) of human-derived one [which is represented by A/Quzhou/1/2015(H7N9) with 99% identity to human-derived majority] at site 136 results in a more prominent bump in the 120-loop of RBS domain. Moreover, the larger side chain of N also provides a wider contact area for the binding of RBS to receptors located in host epithelial cells. This change in spatial structure can be achieved by modifying only a single site, i.e., S136N, of the transitional HA1 (Fig 1a-1c and S1 File).
For site 143, a substitution from alanine (A) to valine (V) leads to a greater change in the tertiary structure of the RBS domain. However, this conformational change is apparently not caused by a single amino acid residue, A143V, but seems to involve a series of amino acid Evolutionary dynamics of H7N9 AIVs residue substitutions. These result in a smaller and shallower pocket in the RBS domain (Fig 1d-1i). This means that it is more difficult for the viruses to bind to receptors on the surface of the host epithelial cells. On the contrary, it will be easier for the viruses to dissociate from the receptors. For viruses that have infected host cells and successfully replicated in large quantities, this conformation will be beneficial for the release of viruses from surfaces of host cells, and thus for the rapid proliferation of viruses. This may partly explain why novel H7N9 AIV has higher virulence to human and bird hosts than previous H7 subtypes AIVs. Evolutionary dynamics of H7N9 AIVs

Positive selection sites in H7 and N9
Matrices, including 105, 66, 105, and 112 H7 sequences, and 81, 61, 107, and 120 N9 sequences were selected and used in the analysis of selection pressure according to time distribution and host species (S2 File).
Analysis of single amino acid sites one-by-one revealed that two sites of HA1 subunits belonging to the H7 subtype AIVs before 2013 were affected by positive selection pressure, namely, 284(275) and 334(325). None of these sites formed a de facto substitution. Similarly, among the 12 positive selection sites of NA belonging to the N9 subtype AIVs before 2013, only variations at site 50 and 287(282) may be caused by positive selection pressure. It is particularly worth noting that since the emergence of the H7N9 epidemic in China in 2013, many positive selection sites within H7 have appeared in both the overall epidemics and the humanderived H7N9 AIVs. Most of these even located within the HA1 subunit. For the overall epidemics and human-derived AIVs, there were both eight positive selection sites, some of them were located within known antigen epitopes or the RBS domain (Table 3).
Other positive selection sites within both H7 and N9 and in terms of time distribution and host species are listed in Table 3.

Discussion
In this study, based on large-scale bioinformatics analyses for H7 and N9 nucleotide sequences belonging to novel H7N9 AIVs, H7, and N9 subtypes IAVs before 2013, the origin and evolution of novel H7N9 were discussed, and the future evolution of novel H7N9 was also reasonably predicted.
Human infections and deaths caused by novel H7N9 AIVs are much higher than any previous H7 subtype influenza [15,16]. Compared with early H7, the HA of the novel H7N9 AIVs appeared to have as many as 18 substitutions. Many of these substitutions are located in known antigenic epitopes or the RBS domain of HA1 and were rare in previous H7 subtype AIVs. This means that such substitutions are likely to play a pivotal role in the emergence of novel H7N9 AIV and the outbreak of epidemics; one or more of these substitutions work together might be the molecular basis of virulence enhancement. As for NA, its role is mainly catalytic and it is also important for drug resistance. However, compared with N9, most of the substitutions within NA of novel H7N9 AIVs are rarely located on known drug resistance sites or enzyme activity centers. There are also many sites that are not located at known key sites such as 245(236) and 279(270) in the HA1 subunit and all substitutions involving HA2. There are two explanations for such substitutions. One is that although some amino acid sites are related to virulence and pathogenicity of the virus, these have not been confirmed to date. For example, because HA2 is not a membrane surface subunit of HA, its role in virulence and pathogenicity has been neglected for a long time. In fact, as non-surface proteins, PB1, PB2, and M are all related to the pathogenicity and host tropism of AIVs [16][17][18], and new antigenic epitopes of novel H7N9 AIVs also have been frequently reported [19]. Second, some viral characteristics do not function through a single substitution, but probably through the synergy of multiple substitutions [16,20]. Therefore, the significance of these substitutions in virulence and pathogenicity deserves further investigation.
As typical AIVs, spillover infections of H7 or N9 from birds to humans should have a corresponding molecular basis. This study found that there are four significant substitutions between avian-derived and human-derived H7 HAs, namely, 136(128), 143(135), 396, and 499. Two significant substitutions in N9 NAs, namely 247(242) and 327(322), also display host specificity. Whether these substitutions are more conducive to cross-species transmission of the virus require further investigation, although homology modelling has revealed that 136 (128) and 143(135) are indeed located within the RBS domain of HA1 [21,22] and could alter the spatial structure of this domain alone or in cooperation with other sites. These substitutions might be important to determine which host species can be infected by these H7N9 AIVs and play a decisive role in H7N9 spillover infections from bird to human. So far, there is no report on the substitution of these two sites and thus these have yet to be confirmed. However, a Q235(226)L substitution, which corresponds to position 226 in H3N2 subtype HA and its substitution is considered to be the key factor causing cross-species transmission of AIVs [23][24][25], emerged in both majorities of human-and avian-derived novel H7N9 AIVs, and the differences were significant (p < 0.05) compared to H7 before 2013, but within novel H7N9 epidemics, these did not show any host specificity as reported in some studies [5,18]. This suggests that Q235(226)L may be utilized in determining the virulence and pathogenicity of novel H7N9 AIVs, but it does not necessarily determine the host type of H7N9.
Analysis of selection pressure may suggest whether a mutation is supported or hindered by natural selection, the direction of virus evolution, and possible amino acid sites involved [26,27]. On an overall scale, the evolution of H7 and N9, both in terms of time distribution and host species, is under negative selection pressure, which is consistent with previous research results on other subtypes IAVs [27][28][29][30]. Examination of single amino acid sites in AIVs before 2013 or in novel H7N9 AIVs or in avian-derived or human-derived H7N9 AIVs has revealed multiple sites on H7 and N9 that are under positive selection pressure. Only very few positive selection sites formed a de facto substitution. This indicates that in the evolutionary process of the novel H7N9 AIVs, purifying selection still played the leading role.   [31,32]. This suggests that H7N9 may rapidly vary under the positive selection pressure. It is necessary to strengthen the surveillance of novel H7N9 AIVs, especially those isolated from human, to determine whether a new virus has emerged through selection pressure and to prevent future epidemics from occurring. Selection pressure may be manifested by the way of host immune system response to the virus. Although the H7N9 vaccine has not been used in humans, with the increase in natural infections in the population, the immune background of human population gradually accumulates and the immune barrier is gradually improving [26,33]. This inference is in line with reality; in China, especially in East China, there are relatively high proportions of antibodies against H7 subtype AIVs in human, particularly, in the poultry workers [34,35]. In addition, as H7N9 vaccines have been widely used to prevent and control avian influenza in poultry farms [8,36], the immune pressure among poultry will increase over time, and the viruses will further evolve. Thus, it is also of great significance to strengthen the continuous surveillance of H7N9 AIV poultry to better understand the mutational pattern of the virus, develop prevention and control measures, and screening vaccine strains.

Conclusions
Compared with early H7 and N9, the HA and NA of the novel H7N9 AIVs appeared to have as many as 22 and 26 substitutions, respectively. Four sites within H7, i.e., S136(128)N, A143 (135)V, E396A, and S499R, and two sites within N9, i.e., S247(242)P and N327(322)S, showed significant host specificity, which may play a role in determining host tropism. To elucidate the mechanism of the novel H7N9 epidemic occurrence and transmission, further experiments are needed to confirm these findings. This study also found that in the early evolution of H7 and N9, positive selection pressure played a limited role, and purifying selection largely contributed to its pathogenicity. However, since the outbreak of H7N9, the virus has been subjected to positive selection pressure at many sites, especially in human population. This suggests that H7N9 may rapidly vary under the positive selection pressure. It is necessary to strengthen the surveillance of novel H7N9 AIVs both in human and bird populations to determine whether a new virus has emerged through selection and to prevent future epidemics.

Sequences preparation
HA and NA nucleotide sequences were downloaded from two databases, the Influenza Virus Resource of NCBI (http://www.ncbi.nlm.nih.gov/genomes/FLU/aboutdatabase.html) and the Global Initiative on Sharing Avian Influenza Data (GISAID, http://platform.gisaid.org/epi3/ frontend), on August 29, 2018. When downloading from GISAID, the option of only GISAID uploaded isolates was chosen. For novel H7N9 AIVs, matrices of both HA and NA were downloaded as one of the following: from all host animals, from human only, and from avian only. The collection date (not the release date or submission date) was set as 2013/03/01-. For the H7 and N9 before the H7N9 epidemics, sequence matrices were downloaded as the formula of H7Nx and HxN9, and the collection date (not the release date or submission date) was set as -2012/12/31. After removing repetitive sequences manually according to the isolates' name and removing poor quality sequences that were filled with consecutive letters "n", all of the matrices were aligned by MAFFT v7.0 (http://mafft.cbrc.jp) and trimmed by MEGA 5.1 (https:// www.megasoftware.net/). Only the codon of each sequence remained in the matrices.

Analyses on variation and polymorphism of amino acid site
Using the MegAlign module in the Lasergene v7.1 software (https://www.dnastar.com), nucleotide sequences in each matrix were translated into amino acid sequences, and the amino acid majority at each site was then calculated. The amino acid majorities of HA and NA belonging to the novel H7N9 AIVs and H7Nx or HxN9 that were established before 2013 were again compared using MegaAlign. When there were different majorities at the same sites, their polymorphisms of amino acids were counted by MS-word by using the function of search-replacement. The composition ratios of amino acid polymorphisms were tested by chi-square analysis with SPSS 13.0. The difference was significant when p < 0.05. The known functional sites such as antigen epitopes or receptor binding site (RBS) were based on H3N2 IAVs and other reports regarding H7, N9, and H7N9 IAVs [37][38][39][40][41].

Spatial structure analyses on HA1 subunits
As mentioned above, HA is the major determinant for host immune response, host adaptation, or interspecies transmission. The globular head of HA, namely HA1, has the most important domain for binding to receptors on host epithelial cells triggering viral infection, as well as for antigen-antibody reactions [12,13]. The variations in the amino acids of the protein primary structure at key sites might lead to changes in protein tertiary structure, which allows ease or difficulty for viruses to bind to specific ligands such as antibodies or receptors located on the surfaces of host-specific epithelial cells. Amino acid sequences of HA1 most similar to humanor avian-derived majorities were obtained by BLASTp (https://blast.ncbi.nlm.nih.gov/Blast. cgi), while the transitional form of HA1 was obtained by manually modifying the specific sites on the avian-derived majority ones. Homology modelling was done by Automated Mode via the ExPASy web server (https://swissmodel.expasy.org) [42,43].

Selection pressure analysis of protein-encoded genes
For a protein-encoded nucleotide sequence, if the nonsynonymous substitution caused by a codon mutation is significantly greater than the synonymous substitution caused by this mutation, then this amino acid site is considered subject to a positive selection pressure; if the situation is reversed, then it is considered to be subjected to negative selection pressure. Positive selection pressure may be beneficial to the survival of new mutations, while negative selection pressure may lead to the elimination of new mutations. Selection pressure can be expressed by formula ɷ = dN/dS, which is the ratio of nonsynonymous changes per nonsynonymous site (dN) to synonymous changes per synonymous site (dS). This ratio is used to infer the type of evolutionary pressure acting on a protein: when ɷ>1, positive Darwinian selection is inferred, when ɷ = 1.0, strict neutrality is inferred, and when ɷ<1, purifying selection is inferred [44].
The overall selection pressures within each matrix of HA or NA were calculated by using the Datamonkey online server (http://www.datamonkey.org/). HKY85 was selected as the phylogeny test model and single likelihood ancestor counting (SLAC) served as the algorithm. When prior trees were needed, neighbor-joining statistical method, Kimura-2 parameters model, and the bootstrap test with 1,000 replicates were adopted to construct the trees [28,45]. Positive selection pressure analyses for each amino acid site one-by-one were calculated by MEGA5.10. Prior trees were also constructed by the methods mentioned above. p≦0.20 was determined as a statistically significant difference [46][47][48]. Due to the greater accumulation of deleterious mutations that have not yet purged at the population level, estimates of selection pressure can be biased when analyzing intensively sampled populations. To avoid bias, repeatedly submitted sequences were strictly eliminated. Based on the phylogenetic tree and homogeneous identities, sequences shared more than 99% identity were randomly left one for subsequent analysis, and the rest were removed.