Evolutionary Trends of A(H1N1) Influenza Virus Hemagglutinin Since 1918

The Pandemic (H1N1) 2009 is spreading to numerous countries and causing many human deaths. Although the symptoms in humans are mild at present, fears are that further mutations in the virus could lead to a potentially more dangerous outbreak in subsequent months. As the primary immunity-eliciting antigen, hemagglutinin (HA) is the major agent for host-driven antigenic drift in A(H3N2) virus. However, whether and how the evolution of HA is influenced by existing immunity is poorly understood for A(H1N1). Here, by analyzing hundreds of A(H1N1) HA sequences since 1918, we show the first evidence that host selections are indeed present in A(H1N1) HAs. Among a subgroup of human A(H1N1) HAs between 1918∼2008, we found strong diversifying (positive) selection at HA1 156 and 190. We also analyzed the evolutionary trends at HA1 190 and 225 that are critical determinants for receptor-binding specificity of A(H1N1) HA. Different A(H1N1) viruses appeared to favor one of these two sites in host-driven antigenic drift: epidemic A(H1N1) HAs favor HA1 190 while the 1918 pandemic and swine HAs favor HA1 225. Thus, our results highlight the urgency to understand the interplay between antigenic drift and receptor binding in HA evolution, and provide molecular signatures for monitoring future antigenically drifted 2009 pandemic and seasonal A(H1N1) influenza viruses.

The same 1918 pandemic A(H1N1) influenza virus was also spread to swine during 1918,1919, and became the so-called ''classical'' swine influenza [8,25,26,27], first isolated in North American in 1930 [26] and in Europe in 1976 [28,29]. In 1979, a novel lineage of avian-like A(H1N1) influenza virus, believed to have derived from closely related Eurasia avian influenza viruses, emerged in swine in Europe [30] and replaced the classical swine A(H1N1) virus in this region [31,32,33]. These two classes of swine A(H1N1) viruses displayed different evolutionary trajectories [34]. In1998, a new triple-reassortant A(H3N2) virus, derived from North American avian, classical swine A(H1N1) and human A(H3N2) viruses, caused outbreaks in North American swine [35,36]. Mixing of the triple-reassortant H3N2 with established swine lineages gave rise to H1N1 and H1N2 reassortant swine viruses [37,38]. Since 2007, human infection caused by A(H1N1) swine virus has become a health concern in the United States [7].
The 2009 A(H1N1) influenza virus has its origin as a reassortant from a Eurasian avian-like swine A(H1N1) virus and a triplereassortant virus circulating in North American swine [1,2,3,4,5,6,7,8]. As such, the 2009 A(H1N1) virus contains NA and M from Eurasian avian-like swine A(H1N1) virus, and the remaining genes from the triple-reassortant virus -PB2 and PA (avian virus), PB1 (human A(H3N2)), and HA, NP and NS (classical swine A(H1N1)) [1,2,3,4,5,6,7,8]. In a sense, we are continuingly living in a pandemic that started in 1918 [27]. Thus, it is not surprising for the similarly mild first waves of the 1918 and 2009 pandemics. Notably, the second wave of the ''Spanish'' influenza in the fall of 1918 became much more lethal, peaked within one month of the initial introductions in many communities [11]. This makes influenza virologists and healthcare officials fear that further mutations in the 2009 A(H1N1) virus could also lead to a potentially more dangerous second wave in subsequent months. Thus, in-depth studies on the 1918 pandemic strains as well as their post-pandemic decedents should provide critical new insights into the evolution of A(H1N1) in general, and the pandemic potential of the 2009 A(H1N1) in particular.
HA is one of the two major glycoproteins on the surface of influenza virus. It is the primary antigen that elicits host immune response, and is also responsible for binding to sialic-acid receptors and for mediating viral entry into host cells [39]. The hallmarks of highly pathogenic influenza viruses among human population include easy human-to-human transmission as a result of high affinity of HA for human-like a(2,6) receptors, and significant difference in sequence and antigenicity of HA with existing seasonal and vaccine strains [1,2,39]. It has been demonstrated on 1918 A(H1N1) HA that HA 1 D190 and D225 are key determinants for effective binding to human-like a(2,6) receptors and consequently high infectivity of the virus among human population [11,40,41,42]. A single mutation D225G reduced the binding affinity for a(2,6) receptors [41,42] and the infectivity of the virus [40], while a double variant D190E/D225G rendered the HA non-binding to a(2,6) receptors [41,42] and the virus noninfectious [40].
In A(H3N2) virus, HA is the major agent for host-driven antigenic drift [43,44]. However, it is unclear whether or not and, if yes, how human immunity imposes selection on A(H1N1) HA. In order to address this critical issue, we undertook a systematic computational analysis of the evolution of H1 HA in the region of HA 1 , which is the primary target for host immunity selection [43].
Recent years have witnessed an explosive expansion of available computational methods for phylogenetic analysis of selective pressure, including a variety of methods that look for different types of positive selection such as diversifying selection, toggling selection and directional selection [45,46,47,48,49,50,51,52,53,54] implemented in software packages such as HyPhy [55], MrBayes [56,57] and PAML [45]. Here we used PAML 4.0 [45] for calculation of heterogeneous selection pressure at each codon and HyPhy [55] for directional selection in 335 non-egg-adapted and 32 egg-adapted human A(H1N1) HA sequences. These sequences were from A(H1N1) viruses isolated all around the globe between 1918,2009. In addition, we also analyzed 42 classical swine A(H1N1) HA sequences for their close relationship to the 2009 A(H1N1) HA.
In PAML 4.0 [45], a number of models are available: the branch models allow the v ratio to vary among branches in the phylogenetic tree and can be used to detect positive selection on particular branches [46,58]; the site models allow the v ratio to vary among sites and can be used to detect positive selection at particular sites [59,60]; the branch-site models allow the v ratio to vary both among sites and among branches [61] and can be used to detect positive selection that affects only a few sites in a few branches.
In this analysis, a large dataset composed of over 300 sequences was used to ensure sufficient representative sequences for the total time span of 91 years, which made it impractical for the use of branch-site models in our calculations. However, by separating the sequences into distinct subgroups based on their phylogenetic relationship and applying the site models in PAML 4.0 [45], we successfully detected the branch and the specific sites therein that were under host-driven positive selection. Our study revealed differential evolutionary trends of A(H1N1) HA since 1918, which provided molecular signatures for monitoring future antigenically drifted 2009 pandemic and seasonal A(H1N1) influenza viruses.

Results and Discussion
Phylogenetic Analysis of Human A(H1N1) HA Sequences Since 1918 It is known that egg-adapted influenza viruses tend to have nonnatural host-associated modifications at certain sites of HA sequences [62,63,64]. To eliminate the effects of such modifications in our analysis, we selected only 333 HA sequences of A(H1N1) viruses between 1918,2009 (as of July 10, 2009) with a well-documented record that they had never been passaged in chicken eggs at any stage. Furthermore, intragenic recombination may give rise to false positives in subsequent detection of positively selected codons [65], thus the Recombination Detection Program (RDP3) [66] was used to make sure that all HA sequences used in this study were free of recombination, agreeing with previous observations that intragenic recombination is rare for HA [67]. The nucleotide sequences of 333 A(H1N1) HAs in the region of HA 1 including the signal peptide, were analyzed by the ClustalW method [68]. The phylogeny tree suggested that these HA sequences belong to two major groups: the majority of HA sequences from 1918 to 2008 formed group I, and those of the 2009 A(H1N1) together with a strain isolated in 2007 formed group II (Fig. S1). The separation of the 2009 A(H1N1) HAs from HAs of established human A(H1N1) viruses between 1918,2008, including the 1918 pandemic and the seasonal A(H1N1) viruses, was consistent with the proposed swine origin of HAs in these viruses [1,2,3,4,5,6,7,8]. The low sequence identity (,73%) between the 2009 A(H1N1) HA with seasonal and vaccine A(H1N1) HAs might explain why people were in general immunologically naïve to the former [8,69]. In fact, there did not exist cross-reactivity between the 2009 and seasonal A(H1N1) viruses [8], nor did the vaccination with recent (2005,2009) annual vaccines provide immune protection against the 2009 A(H1N1) virus [69].

Evidence for Host-Driven Antigenic Drift in Human A(H1N1) HAs
In order to understand whether host-driven antigenic drift is imposed on the evolution of HA 1 of A(H1N1) virus, we used likelihood ratio tests (LRT) in the software package PAML 4.0 [45] to identify the presence or absence of positive selection. In this context, positive selection referred to a significant excess of aminoacid altering (non-synonymous) substitutions over silent (synonymous) substitutions in nucleotide sequences. Large LRT values (or small p-values) between alternative models and null models, such as M2a vs. M1a, M8 vs. M7, or M8 vs. M8a, led to the rejection of the null models.
Since HA sequences of group I was further divided into five subgroups (Fig. S1), the PAML calculation was carried out on each of these five subgroups and on group II (Table 1). For group I-i that included three 1918 pandemic A(H1N1) HAs, in order to increase the sample size, we also included two partial sequences, A/London/1/1918 and A/London/1/1919 [11]. Except for the subgroup I-v, all other subgroups of group I had very low LRT values and large p-values (Table 1, 2), indicating predominantly neutral or purifying selection. These results were consistent with the overall low prevalence of A(H1N1) virus during the period of 1979,2006 [70], and agreed well with a previous study that focused on 1995,2005 A(H1N1) isolates where no positive selection was detected [71]. In sharp contrast, group I-v 2006,2008 had v.10 and LRT.60, which provided strong evidence for positive selection (Table 1, (Table 1, 2). Given the largely nonexistence of human immunity against the 2009 A(H1N1), the lack of positive selection among group II was expected. However, with more mild infections rapidly propagating among human population in the Table 1. The values of log-likelihood (l), d N /d S , and parameter estimates in CODEML analysis of human A(H1N1) Has.

Identification of Positively Selected Codons in Human A(H1N1) HAs
In order to understand how H1 HA sequences were positively selected by human existing immunity, the CODEML [72] program in PAML 4.0 was used on subgroup I-v in which about 0.6% codons were found to be under positive selection (Table 1, 2). Both M2a and M8 models identified HA 1 156 and 190 with greater than 95% posterior probabilities to be under positive selection (Table 3). In previous studies, the antigenic structure of H1 HA (A/PuertoRico/8/1934) had been determined to include five distinct antigenic sites on the globular domain: Sa, Sb, Ca1, Ca2 and Cb [73,74] (Fig. 1a). Both of these positively selected codons were located on the site Sb (Fig. 1b). The focus of positive selection on the Sb antigenic site was consistent with a crossreactivity analysis of various epidemic H1N1 strains using monoclonal antibodies that it was under much higher pressure for mutations [74].

Positive Selection of Egg-Adapted Human A(H1N1) HAs during 1933,1979
To compensate for the lack of non-egg-adapted human A(H1N1) HAs for the period of 1933,1978, we separately collected a total of 32 different egg-adapted A(H1N1) HA sequences between 1933,1979 that were free of sequence ambiguity (Fig. S2). These sequences as a group were analyzed by PAML 4.0, as well as two subgroups that covered the periods of 1947,1957 (12 sequences) and 1948,1979 (17 sequences) ( Table 4,  We further employed the CODEML in PAML 4.0 to analyze the positively selected codons in each group. The results were shown in Table 5 where highlighted in bold were the codons not known to be possible egg-adapted mutations (HA 1 138, 144, 163,  189, 190, 225, and 226) [62,63,64]. For the entire group 1933,1979, HA 1 77, 225 and 227 were found to be under positive selection with greater than 95% posterior probability in model M8 (Fig. 1b). They were located in the antigenic sites Cb (HA 1 77) and Ca2 (HA 1 225 and 227), respectively (Fig. 1b). In addition, for the subgroup 1948,1979, HA 1 225 was found to be under positive selection with greater than 99% posterior probability in both models M2a and M8. However, given the fact that these HAs were from egg-adapted A(H1N1) viruses in which HA 1 225 was one of the most frequently changed site [62,63,64], and the predominant residue at this site (Table 6), G225, was commonly found in swine and avian A(H1N1) HAs, it was possible that the changes at HA 1 225 was due to positive selection imposed by adaptation in eggs. At posterior probability of 90%, HA 1 138 and 189 were positively selected as well, however, both sites were involved in egg-adapted substitutions [62,63,64]. In sharp contrast, however, HA 1 143, 166 and 264 in the subgroup  1 We used the degree of freedom of 2 for these LRT tests that is expected to be too conservative. 2 The p-values were calculated from x 2 distribution using degree of freedom of 1 that was then divided by a factor of 2 for the mixture distribution, as suggested by the author of PAML 4.0. doi:10.1371/journal.pone.0007789.t002 Table 3. Codons under positive selection in HA 1 of human A(H1N1) influenza viruses.   1947,1957 were found to be under positive selection (Table 5), none of which was among the previously identified egg-adapted mutations. Antigenically, these codons were located in the antigenic sites Ca2, Sa and Cb, respectively (Fig. 1b). For their relatively distant location from the receptor-binding site, HA 1 143, 166 and 264 are probably mutations driven by existing human immunity for antibody escape. Thus, there appeared to have different evolutionary patterns for the subgroup 1947,1957 circulating in the 1950s and the subgroup 1948,1979 circulating mostly in the 1970s. The former subgroup was subjected to positive selection pressure at HA 1 143, 166 and 264 (Table 5), and had a much larger variability at HA 1 190, with 25% being non-D190 (Table 6). In marked contrast, the latter subgroup was probably not under host-driven positive selection in humans and had highly conserved HA 1 190 (94.1% being D190).

Directional Evolution of Human A(H1N1) HAs
In order to test whether directional evolution of protein sequences existed in the evolution of human A(H1N1) HAs, we employed a maximum likelihood method developed by Kosakovsky Pond and colleagues [49]. In each subgroup, we used the oldest HA sequence as the root. In agreement with CODEML analysis reported in previous sections, among all non-egg adapted human A(H1N1) HAs, directional evolution was only identified in the subgroup I-v, at sites HA 1 143,156,158,190,193 and 197 (Table 8,9). HA 1 143 belonged to the antigenic site Ca2 of A(H1N1) HA, whilst all other sites were located in the antigenic site Sb (Fig. 1b). Among these sites, HA 1 156, 190 and 193 were identified by CODEML in PAML 4.0 to be under positive selection with 99.7%, 100%, and 80.9% posterior probability in model M8, respectively (Table 3). In previous structural studies, residue HA 1 190 in 1934 human A(H1N1) HA and HA 1 190 and 193 in 1930 swine A(H1N1) HA were found to directly interact with bound human-like a(2,6)-receptors [78]. Thus, it remains to be investigated the impacts of directional evolution at HA 1 190 and 193 on receptor binding and antigenic drift.  We also performed directional evolution study on egg-adapted human A(H1N1) HA sequences, and found that in both the entire group 1933,1979 and the subgroup 1948,1979, multiple favored mutations of D225RG and G225RD were detected ( Table 8, 9). Given its involvement in egg-adaptation, the directional evolution at HA 1 225 may be the consequence of egg-adaptation. In contrast, no residues in the subgroup 1947,1957 were identified to be under directional selection.  (Table 6). For the 575 epidemic HAs, HA 1 190 was highly variable (17.0% sequences did not have D190), while HA 1 225 was more conserved (only 1.8% sequences did not have D225) (Table 6). Among all the deviations (a total of 107 cases) from the ideal D190/D225 combination for human A(H1N1) viruses, two predominant ones were N190/D225 (69.2%) and V190/D225 (19.6%). At present, we don't know the exact effects of these mutations, or in combination with other concurring mutations at or around the receptor-binding site, on binding to human receptors. Further experiments are needed to clarify these issues. However, in previous studies, a single mutation D190N of A(H1N1) HA was shown to result in a lower binding affinity for human-like a(2,6) receptors, and a higher binding affinity for avian-like a(2,3) receptors [64].

Evolution of Human and Swine
The five HA sequences retrieved from victims of 1918 ''Spanish'' A(H1N1) influenza virus shared 98.9% to 99.8% sequence identity [11]. Among them, there were two nonsynonymous substitutions of D225G, one in A/New York/1/ 1918 and the other one in A/London/1/1919 ( Table 6). The HAs harboring the mutation D225G had reduced binding affinity for human receptors [11,40,41].
In the 73 HA sequences from the 2009 pandemic A(H1N1), D190 was strictly conserved, while D225 was 94.5% conserved ( Table 6). At HA 1 225, the deviations were 1.1% for G225 and 3.3% for E225. Thus, the complete conservation at HA 1 190 and the nearly complete conservation at HA 1 225 were consistent to the importance of these residues in allowing for binding to humanlike a(2,6) receptors [40,41], supporting the substantially higher human-to-human transmissibility of the 2009 A(H1N1) virus than seasonal A(H1N1) viruses [5,8].
Therefore, there were two distinct evolutionary trends in hostdriven antigenic drift of human A(H1N1) HAs at residues in the receptor-binding site: the 1918 pandemic HAs underwent antigenic drift at HA 1 225, while the epidemic HAs undertook antigenic drift at HA 1 190. In the absence of selection, the 2009 A(H1N1) viruses were highly conserved at both HA 1 190 and 225, which was distinct from those two host-selected evolutionary trends (Table 6). With gradually established immunity among human population, we wondered how the 2009 A(H1N1) virus would undergo antigenic drift in the months to come. Thus, we also looked at the conservation at HA 1 190 and 225 in 42 swine A(H1N1) HA sequences (Table 6). Surprisingly, among these sequences, D190 was conserved at 97.6%, while D225 and G225 were observed at 66.6% and 28.6%, respectively. The similarly high variability of HA 1 225 in swine A(H1N1) HAs with that of 1918 pandemic HAs was consistent with the relative antigenic   stasis of swine A(H1N1) until 1998 [8] and agreed well with the suggestion that the introduction of the 2009 pandemic A(H1N1) virus into humans be a single event or multiple events of similar viruses [1,2,3,4,5,6,7,8].
The deviations from the ideal D190/D225 combination in A(H1N1) HAs might result in reduced binding to human receptors [11,41,42,64,79]. However, two possibilities, which are not mutually exclusive, may explain the fact that mutations are frequently observed at these two sites: one is that other concurring mutations at or around the receptor-binding site may sufficiently maintain the receptor binding affinity so that the overall binding affinity is largely unaffected; the second is that the gain in evading antibody neutralization far overweighs the reduction in receptor binding. Due to the overlapping locations of the ever-changing antigenic sites and the more-conserved receptor-binding site of HA, there is a constant dilemma of whether or not a residue at the receptor-binding site should change. Although the involvement of residues in antigenic drift that are critical for receptor binding was also observed in HAs of other types and subtypes including influenza B virus HA [80], H3 [43,44]and H5 HA [44,81], the interplay between these two opposing forces in HA evolution is still very poorly understood. Although previous studies on A(H3N2) HAs suggested covariation of antigenicity and receptor-binding specificity as a possible mechanism for the antigenic differences observed in viruses propagated in different cells [82], questions such as how residues involved in receptor binding are actively utilized for antigenic drift in influenza evolution in the same hosts need to be urgently addressed in order for us to comprehend the powerful strategies that the virus employs for recurring influenza infections.

Implications for the 2009 Pandemic
By analyzing hundreds of A(H1N1) HA sequences between 1918,2009, our study revealed positive selection in the subgroup I-v of A(H1N1) HAs. The positively selected codons were located at HA 1 156 and 190 in the Sb antigenic site [83]. It was surprising that HA 1 190, which is critical for receptor-binding specificity of A(H1N1) HAs, was also under positive selection. Through further analysis of HA 1 190, together with HA 1 225, the other critical determinant for receptor-binding specificity of A(H1N1), we found that the epidemic HAs and the 1918 pandemic and swine HAs favored one of these two sites for antigenic drift. Whether the 2009 pandemic A(H1N1) HA will adopt any of these two trends, or use a novel mechanism that does not involve HA 1 190 and 225, will unfold in the coming months. If the latter is to be used, the 2009 A(H1N1) viruses may maintain their intrinsic high transmissibility, which, together with mutations in other genes such as NS1 and PB1-F2 with signatures of elevated pathogenicity [1,2], may suffice a new disastrous pandemic in the near future.

Phylogenetic Analysis of A(H1N1) HAs
We obtained all available HA sequences (over 1,000) of nonegg-adapted A(H1N1) viruses for the period of 1918,2009 (as of July 10, 2009) from GISAID/Epifludb. We then removed the sequences with one or more ambiguous nucleotide sequences within the HA 1 region and deleted identical sequences. This gave us a dataset of 652 HA sequences that included three 1918 pandemic HAs, 575 epidemic HAs from 1979,2008 that collectively formed group I, and 73 pandemic HAs from 2009 and one HA from 2007 that belonged to group II. To facilitate the speed of computing, we further removed closely related sequences and obtained a dataset of 333 HA sequences. The program RDP3 (http://darwin.uvigo.es/rdp/rdp.html) [66] was used to make sure that no recombination was present in any of these HA sequences. The ClustalW method [68] with the MEGALIGN program of DNASTAR package (www.dnastar.com) was used for phylogenetic analysis of H1 HA sequences in the region of HA 1 (Fig. S1).
Due to the historic use of eggs for amplification of influenza viruses before sequencing, there presented a vacuum in sequence for non-egg-adapted A(H1N1) viruses between 1919 and 1979. In order to gain insights into the evolution of A(H1N1) viruses for this period, we separately collected a total of 32 different egg-adapted A(H1N1) HA sequences between 1933,1979 that were free of sequence ambiguity (Fig. S2). These sequences were similarly analyzed while keeping in mind of the possible egg-adapted mutations at HA 1 138, 144, 163, 189, 190, 225, and 226 [62,63,64].
In order to compare the evolution of swine A(H1N1) HA sequences, we also retrieved 42 unique swine H1 HA sequences for the period of 1990,2009 that were free of ambiguous nucleotide sequences (Fig. S3). The reason that we focused on 1990,2009 was that previous studies suggested that swine A(H1N1) viruses be antigenically stable for the period of 1930 to 1990s [84].

Analysis of Positive Selection by PAML 4.0
The site-specific models implemented in the CODEML program in PAML 4.0 [45] was used to calculate heterogeneous selection pressure at amino-acid positions [45,54,72,85]. The models used in this study were M0, M1a, M2a, M7 and M8. M1a (nearly neutral), M7 (beta) and M8a (beta and v = 1) were null models that did not support v.1. In contrast, the alternative models M2a (positive selection) and M8 (beta and v), compared to M1a and M7 respectively, each had an additional class that allowed v.1. Likelihood ratio tests (LRT) comparing M2a versus M1a, M8 versus M7, and M8 versus M8a provided test for the existence of positive selection. In the test, twice the log likelihood difference, 2Dl = 2(l 1 2l 0 ), was calculated where l 1 and l 0 were the log likelihoods for the alternative model and null model, respectively. A larger value of LRT over those of x 2 distribution led to rejection of the null models [72]. In order to calculate the codon-substitution models for heterogeneous selection pressure at each codon, the Bayes Empirical Bayes (BEB) analysis implemented in CODEML [72] was used, which has been shown to yield robust results even for small datasets. For all calculations, multiple runs, each with different initial parameter values, were performed to ensure optimization and convergence.

Directional Evolution of Protein Sequences Using HyPhy
Each group of A(H1N1) HA sequences aligned by the ClustalW method (Fig. S1, S2, S3) was input to the PhyML program [86] to generate an unrooted phylogenetic tree, which was then rooted using the Treeview software [87] by selecting the oldest sequence in each group as the root/ancestor. This rooted phylogenetic tree was used for directional evolution of protein sequences [49] implemented in the HyPhy [55] software package.