Influenza A H1N1 Pandemic Strain Evolution – Divergence and the Potential for Antigenic Drift Variants

The emergence of a novel A(H1N1) strain in 2009 was the first influenza pandemic of the genomic age, and unprecedented surveillance of the virus provides the opportunity to better understand the evolution of influenza. We examined changes in the nucleotide coding regions and the amino acid sequences of the hemagglutinin (HA), neuraminidase (NA), and nucleoprotein (NP) segments of the A(H1N1)pdm09 strain using publicly available data. We calculated the nucleotide and amino acid hamming distance from the vaccine strain A/California/07/2009 for each sequence. We also estimated Pepitope–a measure of antigenic diversity based on changes in the epitope regions–for each isolate. Finally, we compared our results to A(H3N2) strains collected over the same period. Our analysis found that the mean hamming distance for the HA protein of the A(H1N1)pdm09 strain increased from 3.6 (standard deviation [SD]: 1.3) in 2009 to 11.7 (SD: 1.0) in 2013, while the mean hamming distance in the coding region increased from 7.4 (SD: 2.2) in 2009 to 28.3 (SD: 2.1) in 2013. These trends are broadly similar to the rate of mutation in H3N2 over the same time period. However, in contrast to H3N2 strains, the rate of mutation accumulation has slowed in recent years. Our results are notable because, over the course of the study, mutation rates in H3N2 similar to that seen with A(H1N1)pdm09 led to the emergence of two antigenic drift variants. However, while there has been an H1N1 epidemic in North America this season, evidence to date indicates the vaccine is still effective, suggesting the epidemic is not due to the emergence of an antigenic drift variant. Our results suggest that more research is needed to understand how viral mutations are related to vaccine effectiveness so that future vaccine choices and development can be more predictive.


Introduction
In April 2009, a novel human influenza A(H1N1) virus was identified. This virus rapidly spread around the globe causing significant morbidity and mortality in 2009/2010. This virus was of swine origin [1,2] and contained a novel combination of gene segments not previously reported in a human influenza virus isolate [3]. Except for the elderly, the vast majority of individuals around the world did not have protective immunity against the virus and were thus susceptible to infection [4]. This relatively low immunological pressure has presumably contributed to the fact that there has been only limited antigenic change in the virus.
The primary target of the immune response to influenza is generally the hemagglutinin (HA), a glycoprotein found on the surface of the virus. Mutations in the HA protein enable the virus to escape the neutralizing antibody response induced by vaccination or infection. Changes in the major antigenic epitopes are believed to be primarily responsible for immune escape [5], though changes outside these regions may also influence HA antigenic structure and antibody binding strength. More generally, evidence from equine and human challenge studies [6] suggest that reinfection probability increases as the number of amino acid differences between the primary infection/vaccine strain and the challenge strain increase. Studies at the household level found reinfection with human A(H3N2) occurred when the number of amino acid mutations was between 9 and 22 [7]. In vitro studies of the A(H1N1)pdm09 strain have shown that only one or two amino acid changes can reduce the ability of human sera to bind viruses of this strain [8].
Between April 2009 and April 2010, the Centers for Disease Control and Prevention (CDC) estimate that there were ,61 million clinical cases of influenza in the US [9], and a further ,80 million people were vaccinated against the virus [10]. Prior infection or vaccination precludes infection with a similar strain of influenza because the HA proteins displayed on the surface of the virus are targeted by existing antibodies. Immunity exerts pressure on the virus to evolve rapidly, a process of antigenic change well described in prior influenza epidemics. However, despite the potential for antigenic changes in the virus that may presage the emergence of an antigenic drift variant, no quantification of the magnitude of changes in the HA gene of the A(H1N1)pdm09 strain has been done on a global level. While geographically limited assessments have shown changes in the sequence of the HA gene [11][12][13][14], a global perspective is necessary because new strains can spread around the globe in months or even weeks. In this report we explore the evolution of the A(H1N1)pdm09 strain since April 2009 at both the RNA and protein levels, altogether constituting .9,000 sequences of A(H1N1)pdm09.

Methods
Both the nucleotide and amino acid sequences of the coding regions and the sequences of the hemagglutinin (HA), neuraminidase (NA), and nucleoprotein (NP) segment coding regions were obtained from the National Centre for Biotechnology Information (NCBI) influenza virus resource [15]. Full-length sequences were selected for all A/H1N1 samples collected from humans from 1/ 1/2009 through 12/31/2013. Multiple sequence alignment was calculated using MAFFT [16,17] with the FFT-NS-2 progressive alignment algorithm. The multiple sequence alignment was viewed with ClustalX [18].
Sequences were then compared base-pair by base-pair (nucleotides) and amino-acid by amino-acid (proteins) with the vaccine strain (A/California/07/2009). While other options for measuring pairwise distances are possible, we used the simplest metric, called the Hamming distance. This metric assigns a zero or one depending on whether two nucleotides or amino acids are identical and has been widely used to cluster different influenza strains [19]. We then defined the distance between two sequences as the sum of the pairwise distances between their composite nucleotides or amino acids. Divergence from the vaccine strain was then calculated as the percentage of the sequence that was identical to the vaccine strain.
Percentage divergence was used to identify the pandemic strains using a relatedness criterion. After examining the way that the different isolates clustered (see Figures S1, S2, S3, S4, S5 and S6), strains with a similarity greater than a specific percentage were considered pandemic strains and all subsequent analysis was on these remaining strains. For HA and NA sequences, strains with a relatedness greater than 90% to the vaccine strain were considered A(H1N1)pdm09 strains, while for NP sequences we used 94% as the cutoff. The pandemic strains were then sorted by collection date. Strains with only the year of collection were excluded. Strains that had year and month but not day were sorted at the end of each month.
We then plotted the hamming distance of both the nucleotide coding regions and the amino acids and calculated the rate of mutation accumulation as the linear trend of the fit of the data. Strains were also separated into two seasons per year, from April to September and from October to March, and differences in the mean hamming distance between seasons were tested for statistical significance using a two-tailed student's T-test. Linear trend analysis and significance tests were done in R [20].

Epitope Analysis
Antibodies bind influenza virus primarily at the epitope regions of the hemagglutinin protein [21]. Although other residues can affect the geometry at the surface, and so can be under selective pressure, they are not available for presentation to antibodies. Thus, these epitopes are likely to be the predominant sites of selection and increased change in those sites is suggestive of immune escape. In addition, there have been suggestions of a linear correlation between vaccine efficacy and the antigenic distance of a strain at the epitopes from the vaccine strain [22].
Despite the importance of the epitope regions there is no consensus on the epitope regions for A(H1N1)pdm09. We thus examined three different possible models suggested in the literature. The first was done by Deem et al. [5], which mapped five epitope regions (A-E) from H3 onto a pandemic strain (A/ California/04/2009). The second one we used was proposed by Huang et al. [23] and uses entropy and a likelihood ratio to define a set of 41 natural epitopes that are a subset of the five epitope regions defined by Deem et al. [5]. The third is the set of five antigenic regions (Ca1, Ca2, Cb, Sa, Sb) defined from laboratory studies on influenza virus A/PR/8/34 [24]. For each set of epitope regions, we calculated the hamming distance for each region as well as P epitope , a measure of antigenic distance [25] defined as, which can be used to estimate the likely efficacy of a vaccine [22,25]. We also analyzed the rate of non-synonymous to synonymous (dN/dS) changes in the coding region of the HA gene across all the isolates using the vaccine strain as the basis for comparison, focusing on the differences in the rates between epitope (using the first definition) and non-epitope residues.

H3N2
We also conducted a similar analysis comparing changes in the HA gene between A(H3N2) strains and the H3N2 vaccine strains. Full-length H3N2 sequences were also downloaded from the NCBI influenza virus resource [15] for the period 1/1/2009 through 12/31/2013. Multiple sequence alignment was again calculated using MAFFT [16,17] with the FFT-NS-2 progressive alignment algorithm, and the multiple sequence alignment was viewed with ClustalX [18]. Finally, base-pair by base-pair (nucleotides) and amino-acid by amino-acid (proteins) comparison was done with the vaccine strain for each season as noted by the WHO (http://www.who.int/influenza/vaccines/virus/ recommendations/en/). Thus, strains collected prior to April 2010, were compared to the A/Brisbane/10/2007 strain. Strains collected between April 2010 and October 2012 were compared to A/Perth/16/2009, and strains collected after October 2012 were compared to vaccine strain A/Victoria/361/2011.

Results
We calculated the hamming distance for both the coding region and the protein of the HA, NA, and NP gene segments for all available fully sequenced strains of A(H1N1)pdm09 in the NCBI influenza virus resource from April 1999 to December 2013. The total number of HA sequences was 9,076 (includes one that was dated March 30, 2009 but not the vaccine strain), though sampling was not equal across the years, with the vast majority (75%) sequenced between April 2009 and March 2010 (Table 1). There were fewer fully sequenced NA and NP isolates, only 7,232 and 4,406, respectively. Despite these limitations, clear trends were observed in the rate that the HA, NA, and NP genes and proteins accumulated mutations.
Between April 2009 and December 2013, the coding region of the hemagglutinin segment of the influenza H1N1 pandemic Table 1. Mean hamming distance (standard deviation) of the hemagglutinin, neuraminidase and nucleoprotein coding regions and proteins, by season.  (Table 1). Each year's increase was statistically significant (p,0.01) both compared to the prior season as well as the initial season. However, while the following seasons were statistically different from the initial season (p,0.01), the mean hamming distance has not significantly changed since March 2012. This is also reflected in the mutation accumulation rate, which was 6.54 (SE: 0.04) per annum, or a rate of 3.8610 23 nucleotide substitutions per site per year, for the HA gene between April 2009 and March 2012.
The hemagglutinin protein of the influenza H1N1 pandemic strain has also been accumulating mutations at a faster rate than the NA and NP proteins ( Figure 2). We estimated that the HA protein has been accumulating mutations at a rate of approxi-  (Table 1). Each year's increase was also statistically significant (p,0.01) both compared to the prior season as well as the initial season. However, while again the 2012-2013 seasons were statistically different from the 2009-2010 season (p, 0.01), the mean hamming distance of the HA protein from the vaccine has not significantly changed since March 2012.
While the last two seasons have not seen significant changes in the mean hamming distance, this belies differences in the pattern of mutations between seasons. For instance, the mutation D97N fluctuates in frequency, though never reaching 50%, through several seasons before becoming dominant in 10/2012-3/2013 season. On the other hand mutations S69T, S143G, A197T, N260D, and V520A all became the consensus mutation in the 10/2011-3/2012 influenza season, appearing in ,70-80% of sequences, but by the next year they all became much less common and the dominant amino acid found is the wild type ( Table 2). A number of other mutations -P83S, S203T, and I321V -were found in most sequences by the winter of 2009 and have not waned in frequency. While, some mutations, S185T, E374K, and S451N, continually increase in frequency each season, other mutations (K163Q, K283E, A256T, and E499K) all became the dominant sequence in 10/2013-3/2013 or later for the first time after persisting at a low frequency for a number of seasons. All the mutations described here were originally seen in at least one sequence in 2009-2010, though this is not surprising as nearly 70% of the amino acids have at least one mutation in one isolate in the 10/2009-3/2010 season.

Epitopes
Because of the uncertainty regarding the location of the epitope regions of the hemagglutinin protein, we examined mutations using three different definitions for these regions: (1) a set of epitopes defined by matching the epitopes to H3N2 [5]; (2) a subset of the first set that are natural epitopes [23]; and (3) a set of laboratory confirmed sites for prior seasonal H1N1 strains [24]. In the first set, which encompasses the largest number of residues, Using only the subset of those residues which have been defined as natural epitopes we observed fewer mutations in these residues, with the majority appearing in epitope D ( Figure S8). However, with a lower denominator, average P epitope calculated for these residues is 0.08 and 0.14 for the 10/2012-3/2013 and the 4/2013-9/2013 seasons, respectively, and a maximum value of 0.3. Lastly, for the laboratory confirmed epitopes we observed that there were approximately 3 mutations in these residues on average in recent seasons, primarily in the Ca1, Sa, and Sb regions ( Figure S9). This resulted in a P epitope average value of 0.11 in both the 10/2012-3/2013 and the 4/2013-9/2013 seasons, and a maximum value of 0.42. Analysis of the non-synonymous to synonymous mutations in the epitope regions compared to the rest of the gene found that dN/dS outside the epitope regions was fairly high in the first couple seasons but has been approximately one in the last several seasons. Conversely, within the epitope regions dN/dS has generally been above unity ( Figure S10).

H3N2
Annual influenza epidemics in the United States in the 2010-2011, 2011-2012, and 2012-2013 seasons were predominated by H3N2 influenza strains (http://www.cdc.gov/flu/weekly/ pastreports.htm). We calculated the hamming distance for both the coding region and the protein of the HA gene segments for all available fully sequenced isolates of A(H3N2) in the NCBI influenza virus resource from January 1999 to December 2013. The total number of HA sequences was 3,220, and sampling was approximately equal across years. We then compared the number of mutations that differed between collected strains in each season with the recommended vaccine strain for that season, and compared this to the evolution of A(H1N1)pdm09 over the period of the study. This data indicates that H3N2 mutation rates similar to that seen with A(H1N1)pdm09 led to the emergence of two antigenic drift variants but no A(H1N1)pdm09 drift variants emerged during the same timeframe ( Figure 4).

Discussion
Since the emergence of the A(H1N1)pdm09 virus in 2009, only a limited number of genetic or antigenic changes in the virus HA gene/protein have been documented. Based on the detected sequences, which remained antigenically homogeneous and closely related to the vaccine virus, the recommended virus strain for inclusion in the seasonal influenza vaccine remained an A/ California/07/2009-like virus for the 2013-2014 northern hemisphere winter influenza season [26] as well as for the upcoming 2014 southern hemisphere influenza season [27] and the 2014-2015 northern hemisphere winter influenza season. However, over the past several seasons there have been a number of reports of virus isolates containing amino acid changes in the HA protein that have the potential to alter the antigenic properties of the virus [8,28]. In this report we observe that the HA protein has accumulated mutations both in total and within the epitope regions that make the potential for vaccine escape highly probable. This has important implications for evolutionary, epidemiological, and clinical aspects of the virus.
From an evolutionary perspective, the HA gene has been accumulating mutations more rapidly than the NA and NP genes, however, the rate of nucleotide substitution and amino acid substitution is lower than prior estimates [29]. While the faster mutational drift of the HA gene is similar to past experience with other H1N1 strains as well as with H3N2 strains, the slower rate of   Figure S11). This suggests that the two viruses may be subject to different selective pressures on their corresponding HA proteins. Historically, for a new epidemic to occur, the HA protein of the virus has to mutate enough to become antigenically distinct to a significant percentage of individuals [30,31]. Prior studies suggest that the probability that this will occur increases when the number of amino acid substitutions in the HA protein exceeds 10 [6,7] or the number of amino acid substitutions in the epitope regions exceeds 4 [29]. This is also the pattern that we see with the A(H3N2) data. Clinically the 2013-2014 season has so far been marked by an influenza epidemic predominated by A(H1N1)pdm09 [32]. This is consistent with the appearance of prior epidemics given that the number of mutations in the HA protein (particularly in the epitope regions) was similar in magnitude to prior H1N1/H3N2 epidemics that were the result of antigenic drift. However, while the estimated P epitope scores suggested that the vaccine may be only moderately effective [22,33], the evidence to date indicates that the vaccine for the 2013-2014 season has been as effective as prior seasons [34,35]. The plateau in the hamming distance and the efficacy of the vaccine suggests that an antigenic drift variant has not emerged this season, despite an increase in the number of cases consistent with an epidemic. These results could be explained by several different reasons. The first is that potentially the vaccine does not provide long-lasting immunity as a natural infection would and individuals vaccinated in prior years are susceptible if they did not get a vaccine this year. A second possibility is that because the A(H1N1)pdm09 strain was a novel strain almost everyone was susceptible, but many individuals may not have been infected during the initial wave of infection leaving a large pool of susceptible individuals that has been augmented with births of naïve children. Third, the mid-season results could just be due to sampling bias and more sequences/studies may suggest an alternative narrative. The first two cases suggest that improved vaccination coverage would have contributed to fewer cases of influenza this season.
We, as yet, cannot predict how influenza mutations will accumulate or how these specific mutations will contribute to influenza epidemics. For example, over the course of the study, numerous genetic 'outliers' were sampled without a new epidemic occurring. In fact there were three samples in which more than 40 amino acids differed from the vaccine strain identified prior to the 2012-2013 seasons. Why did these strains not start a new epidemic? Excluding sequencing errors, one possibility is that they could have been less transmissible relative to the dominant strain and thus could not seed a new epidemic. Alternatively, as an epidemic increases and there are more infected individuals, the probability of genetic outliers appearing increases. However, as they are outliers, the probability that they are transmitted is less precisely because they are outliers (regardless of fitness -though mutations generally reduce fitness, further reducing the likelihood an outlier is selected). However, as the epidemic wanes the likelihood of a genetic outlier appearing is less, but if one is generated, the probability that it will spread is increased. This suggests that variability (i.e. genetic diversity) in sampling is likely to increase as the number of susceptible individuals wanes and the seed of a new epidemic is likely to occur from these 'outliers'. Better predictions of how outliers are related to future epidemics could lead to an increase in the efficiency of selecting future vaccine strains. Figure 3. Divergence at the A(H1N1)pdm09 epitopes. Changes in the major antigenic epitopes are believed to be primarily responsible for immune escape. We examined the total number of mutations in these regions combined. However, there is disagreement as to the amino acid locations encoding the epitope regions, thus we used three potential descriptions of the epitope regions of the influenza A(H1N1)pdm09 HA protein. The first (A) was based on the A(H3N2) strain's epitopes, the second (B) was a set of natural epitopes that is a subset of the first set of epitopes, while the third (C) is a set of laboratory confirmed epitopes for prior H1N1 strains. All three show divergence (i.e. an increase in the number of hamming distance) in the epitope regions, particularly the first and third definitions. Despite the important epidemiological and clinical implications of this work, it is not without limitations. First of all, estimates of antigenic drift and vaccine effectiveness are based in large part on changes in the epitope regions of the HA protein, however, there is no consensus on the exact location of these epitopes. In addition, the most recent H3N2 epidemic was due in large part to changes in the structure of the epitopes that occurred outside the clearly defined epitope region. Second, the samples we used were not randomly selected, but were drawn from available sequences. These sequences are largely from individuals that were hospitalized in western countries and so likely represent only a fraction of the potential diversity. Regions outside of the west may play a large role in the evolution of influenza. For instance, while the most significant outliers from 2013 were from Kenya, African isolates account for only a small fraction of the total number in the database. Better geographic surveillance would increase the potential for identifying antigenic drift in the virus and improve the capacity to make vaccine strain choices. Despite these limitations, the extremely large number of samples heralds a new era in genomic surveillance and promises to increase our knowledge and understanding as to how influenza evolves. It also suggests a need for tools to be developed that allow quick and easy interpretation of newly sequenced isolates within the context of other sequences so that decisions on surveillance and interventions can be optimally provided. Crucially it also suggests more research is needed to understand how viral mutations are related to vaccine effectiveness so that future vaccine choices can be more predictive.

Conclusion
The vast number of A(H1N1)pdm09 sequences provides a means of understanding the evolution of influenza and potentially predicting new epidemics. Data of this sort can be used to develop theories and predictions as to how future viruses may evolve and provide data for vaccine optimization. Ideally, computational and in vitro methods could be used to generate vaccine strains that would be predictive rather than reactive, but a better understanding of influenza intra-host diversity and transmission is required to start developing such techniques. While the future evolutionary paths of the A(H1N1)pdm09 strain are not fully known and subject to as yet undetermined ecological and environmental effects due to interactions with other strains and pathogens, the number of mutations in the HA protein suggest that there is a high probability of an antigenic drift variant in the A(H1N1)pdm09 strain occurring in the near future, and surveillance should be geared to look for such changes.  one was based on the A(H3N2) strain's epitopes. A-E refers to the different epitopes, while F is the P epitope calculation measuring the proportion of amino acid differences in the dominant epitope, for each strain.

Supporting Information
(GIF) Figure S8 Divergence at the A(H1N1)pdm09 epitopes, definition 2. We used three potential descriptions of the epitope regions of the influenza A(H1N1)pdm09 HA protein. The present one is a set of natural epitopes that is a subset of the first set of epitopes. A-E refers to the different epitopes, while F is the P epitope calculation measuring the proportion of amino acid differences in the dominant epitope, for each strain.
(GIF) Figure S9 Divergence at the A(H1N1)pdm09 epitopes, definition 3. We used three potential descriptions of the epitope regions of the influenza A(H1N1)pdm09 HA protein. The present one is a set of laboratory confirmed epitopes for prior H1N1 strains. A-E refers to the different epitopes, Ca1, Ca2, Cb, Sa, Sb, while F is the P epitope calculation measuring the proportion of amino acid differences in the dominant epitope, for each strain.