Large-Scale Sequence Analysis of Hemagglutinin of Influenza A Virus Identifies Conserved Regions Suitable for Targeting an Anti-Viral Response

Background: Influenza A viral surface protein, hemagglutinin, is the major target of the neutralizing antibody response and hence a main constituent of all vaccine formulations. Because of its marked evolutionary variability, however, vaccines have to be reformulated to include the hemagglutinin protein from each newly emerging viral strain. With the constant fear of a pandemic, there is a critical need for anti-viral strategies that can provide wider protection against any Influenza A pathogen. An anti-viral approach directed against the conserved regions of the hemagglutinin protein has the potential to protect against any current or new Influenza A virus and to provide a solution to this ever-present threat to public health.

Methodology/Principal Findings: Influenza A human hemagglutinin protein sequences available in the NCBI database, corresponding to the H1, H2, H3 and H5 subtypes, were used to identify highly invariable regions of the protein. Nine such regions were identified and analyzed for structural properties such as surface exposure, hydrophilicity and residue type to evaluate their suitability for targeting an anti-peptide antibody/anti-viral response.

Conclusion/Significance: This study has identified nine conserved regions in the hemagglutinin protein, five of which have structural characteristics suitable for an anti-viral/anti-peptide response. This is a critical step in the design of efficient anti-peptide antibodies as novel anti-viral agents against any Influenza A pathogen. In addition, these anti-peptide antibodies will provide broadly cross-reactive immunological reagents and aid the rapid development of vaccines against new and emerging Influenza A strains.


Introduction
The recent outbreak of swine-origin influenza A (H1N1) that began in April 2009 in Mexico caused immediate international concern. By June 2009, the virus had already spread to 70 countries and a global pandemic was declared by WHO [1]. Since then the virus has continued to spread to 168 countries and has infected approximately 209,438 people worldwide [1]. Over the past decade, influenza epidemics have been mild; nevertheless, because of historic precedents, influenza A virus is regarded as a major and unpredictable threat to public health [2,3]. Prior to the outbreak of H1N1, H5N1 influenza virus infection in humans in South Asia had caused a significant number of cases of severe disease and death and had led to global concern about the potential of this virus to evolve to pandemic proportions [4]. These current and recurring events of Influenza A fatalities around the world highlight this ever-present threat to global public health.
The inability to provide lasting protection to humans against influenza A virus is due, in part, to the rapid evolution of the viral surface glycoprotein, hemagglutinin (HA), which leads to a change in its antigenic structure. Hemagglutinin plays a major role in determining host specificity since it is responsible for viral binding to host cell receptors and penetration of host membranes [5,6,7]. Influenza A hemagglutinin exists as 16 related subtypes in birds [8,9]. Three subtypes, H1, H2 and H3, are found in viruses known to have caused human pandemics, and several subtypes are known to infect other mammals, e.g. pigs and horses. During repeated rounds of infection, selection, and re-infection, influenza viruses undergo host-specific adaptations. The regions involved in host-virus interactions, including the receptor-binding site, are likely to resist changes, but the antigenic sites are subject to drift due to immune surveillance. In addition, some regions may evolve for other reasons, e.g. to facilitate post-translational modification or to facilitate protein folding and maintenance of secondary/tertiary structures [5]. It is reasonable to hypothesize that regions of the hemagglutinin protein that are phylogenetically information-rich would be good candidates for involvement in virus-host interactions and for additional viral functions. This would be especially true for regions shared by more than one subtype. In this work, we attempt to identify such information-rich regions on the HA1 subunit of the HA protein, where the majority of the amino-acid variation is located. This subunit is also highly exposed and, hence, a target of neutralizing antibody responses [10]. Currently, it is not possible to modulate the B-cell response to specific protein regions, and hence the current vaccines, which are composed mainly of HA protein or inactivated virus, have to be reformulated as the virus mutates and changes.
Due to constant evolution of influenza A viruses, there is an urgent need for the development of new vaccine strategies and anti-viral therapies based on conserved regions, which can provide wider protection against any new Influenza A virus.
This study focuses on analysis of Influenza A human HA1 protein sequences available in the NCBI database, corresponding to H1, H2, H3 and H5 subtypes. These sequences were used to identify nine regions that were conserved across subtypes. These conserved sites were further analyzed in terms of secondary structure, hydrophilicity and solvent-accessible surface to determine their suitability for targeting anti-peptide antibodies/antiviral therapies.
This work will be critical in the development of new anti-viral therapies such as peptide antibodies or a peptide vaccine that will be effective against all current and possibly future Influenza A strains. Further, due to evolutionary variability of the HA protein, development of vaccines against emerging strains is delayed by the non-availability of reagents/antibodies that recognize the new viral strain. This type of information will also be useful in designing broadly cross-reactive, immunological reagents/peptide antibodies to quantify the new vaccine product until viral-specific reagents become available.

Analysis of HA Sequences
All full-length HA sequences were downloaded from the NCBI Flu database (as of July 2009). Separate files were generated for all H1, H2, H3 and H5 sequences. Computing a simultaneous multiple alignment across all subtypes for the HA segment is challenging for several reasons, including 1) the high sequence variability between subtypes, 2) the large number of sequences that need to be aligned and 3) the need to correct for biased representation of the subtypes in the sequence databases. Instead, we aligned the sequences in stages, using a sensitive profile-profile alignment stage to align sequences across subtypes. To achieve this, we first aligned the HA sequences within each subtype using MUSCLE 3.6 [11] after filtering out frame-shifted and partial sequences (based on an initial MUSCLE alignment and manual curation). We also excluded subtypes that were poorly represented in the NCBI database to eliminate any further bias in the results. The profiles for each multiple alignment were then aligned to the other profiles using the profile-profile aligner COMPASS [12]. Because the profile for H2 had the most significant alignments to the other profiles, as measured by COMPASS E-values, it was used to create a seeded multiple alignment for the profiles, similar to a PSI-BLAST [13] alignment. The profile multiple alignment was then expanded, based on the multiple alignment for each profile, into a multiple alignment for all the sequences across subtypes. In the final alignment, the sequence of 1HGJ (H3 HA structure) was included to follow the H3 numbering.
Computing a conservation score for each column of the final multiple alignment was complicated by the uneven representation of the subtypes in the alignment. Since weighted entropy scores require ad-hoc weighting schemes for the various sequences in the alignment, we instead relied on a simpler correlation-based score: for each of the subtypes H1, H3 and H5, the dot product of the profile column with the corresponding profile column of H2 was used to estimate the correlation of the profiles. The correlation with H2 was chosen because H2 was the seed profile for the multiple alignment. The correlation values were rounded to two decimal places and hence some correlation values were not exactly 1.0 but very close to 1.0. The final conservation score was then taken to be the minimum of these correlation scores, as a conservative estimate of conservation. The above alignment results and their correlation scores are shown in Table S1.
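The scoring scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: it assumes that a profile column is the vector of amino-acid frequencies in that column (gaps ignored) and that the correlation is the plain dot product of two such vectors, with no further normalization, since none is described here. The function names and the toy columns are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column):
    """Amino-acid frequencies for one alignment column (assumption: gaps are ignored)."""
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def column_correlation(profile_a, profile_b):
    """Dot product of two profile columns, rounded to two decimal places."""
    return round(sum(profile_a[aa] * profile_b[aa] for aa in AMINO_ACIDS), 2)

def conservation_score(h2_column, other_columns):
    """Minimum correlation between the H2 column and each other subtype's column."""
    h2 = column_profile(h2_column)
    return min(
        column_correlation(h2, column_profile(col))
        for col in other_columns.values()
    )

# Toy column (hypothetical residues, not taken from Table S1).
toy_h2 = "CCCC"
toy_others = {"H1": "CCCC", "H3": "CCCS", "H5": "CCCC"}
score = conservation_score(toy_h2, toy_others)
print(score)  # 0.75, limited by the H3 column
```

Taking the minimum means that a single divergent subtype is enough to lower the score of a column, which is what makes the estimate conservative.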
We concentrated on the HA1 subunit and manually identified regions of 6 residues or more in which more than 50% of the residues had a final conservation score of 0.9-1.0, and selected them as conserved regions. We chose 6 residues as the minimum length required for a peptide antibody response.
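The regions in this study were selected manually. Purely as an illustration of the stated criteria, the sketch below shows one way such a selection could be automated; it is an assumption, not the procedure that was followed. It slides a 6-residue window over the per-residue conservation scores, merges overlapping windows in which more than half of the residues score at least 0.9, and trims non-conserved residues from the ends of each merged stretch. Positions are 0-based list indices rather than H3 numbering, and the function name and toy scores are hypothetical.

```python
def find_conserved_regions(scores, min_len=6, threshold=0.9, min_fraction=0.5):
    """Return (start, end) index pairs, inclusive, of candidate conserved regions."""
    n = len(scores)
    covered = [False] * n

    # Mark every position that lies in at least one qualifying window.
    for start in range(n - min_len + 1):
        window = scores[start:start + min_len]
        n_conserved = sum(1 for s in window if s >= threshold)
        if n_conserved / min_len > min_fraction:
            for i in range(start, start + min_len):
                covered[i] = True

    # Merge consecutive covered positions into stretches.
    stretches, start = [], None
    for i, flag in enumerate(covered):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            stretches.append((start, i - 1))
            start = None
    if start is not None:
        stretches.append((start, n - 1))

    # Trim non-conserved ends and re-check the length and fraction criteria.
    regions = []
    for s, e in stretches:
        while s <= e and scores[s] < threshold:
            s += 1
        while e >= s and scores[e] < threshold:
            e -= 1
        length = e - s + 1
        if length >= min_len:
            n_conserved = sum(1 for x in scores[s:e + 1] if x >= threshold)
            if n_conserved / length > min_fraction:
                regions.append((s, e))
    return regions

# Toy scores (hypothetical, not taken from Table S1).
toy_scores = [0.2, 0.3, 1.0, 0.95, 0.9, 1.0, 0.4, 0.92, 0.1, 0.2, 0.3, 0.1]
print(find_conserved_regions(toy_scores))  # [(2, 7)]
```

An automated scan of this kind would still need manual review, because the criteria leave open how neighbouring stretches should be merged or split.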

Analysis of HA Structure
All of the structure analysis was done using Accelrys Discovery Studio 2.0. The H3 structure (1HGJ) was downloaded from the PDB. A ribbon model of the HA monomer was generated to depict the nine identified conserved regions. Each region was enlarged to show its detailed three-dimensional (3-D) structure. The H1 structure (PDB: 1RVZ) was used for comparison with the H3 structure.
The secondary structure was determined both with the DSSP program [14] and from the PDB file. Since the results of the two methods were similar, only the results from the DSSP program were tabulated. To calculate the solvent-accessible surface, each of the nine regions was selected separately and calculations were performed by the method of Connolly, using a probe radius of 1.40 Å [15,16]. Each of the nine regions was also analyzed for its hydrophobicity value based on the scale of Kyte and Doolittle [17]. The sum of the hydrophobicity values of the individual amino-acid residues was tabulated. A negative value reflects low hydrophobicity, i.e. a hydrophilic nature, for the region.
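The summed hydrophobicity value is straightforward to reproduce. The sketch below uses the published Kyte and Doolittle hydropathy scale and adds up the values of the residues in a peptide; the function name and the example peptides are hypothetical and are not sequences from this study.

```python
# Kyte and Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def total_hydropathy(peptide):
    """Sum of Kyte-Doolittle values over all residues; negative means hydrophilic."""
    return sum(KYTE_DOOLITTLE[res] for res in peptide.upper())

# Hypothetical example peptides (not regions identified in this study).
print(round(total_hydropathy("GDKE"), 1))  # -11.3, hydrophilic
print(round(total_hydropathy("AVL"), 1))   # 9.8, hydrophobic
```

Because the value is a sum rather than an average, longer regions can reach larger absolute values, which should be kept in mind when regions of different length are compared.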

Sequence Analysis
A comprehensive HA protein sequence analysis was performed using sequences of the H1, H2, H3 and H5 subtypes of Influenza A that were available in the NCBI database, to identify all of the invariable regions. The analysis was performed in two steps: first, individual profiles of each subtype were generated; each profile was then aligned to the H2 profile to obtain the most significant alignments. The correlation plots of H2 with the other subtypes show that the H2-H5 and H2-H1 pairs were clearly more similar than the H2-H3 pair (Figure 1). The H2-H5 and H2-H1 pairs had correlation scores in the range 0.8-1.0 for 66% and 56% of the amino acid residues in HA1, respectively. However, a correlation score of 0.8-1.0 was observed for only 33% of the residues in the H2-H3 pair. Similar results have been reported previously by Nobusawa et al. [18], who showed that the HA1 subunit of the H1 subtype exhibited 58.7% and 55.7% identity to the HA1 subunits of the H2 and H5 subtypes, respectively, and the least identity to that of the H3 subtype (35.2%).
The sequence alignment data were classified at two levels: residues with a conservation score of 0.9-1.0 were considered conserved residues, and regions of 6 residues or longer in which more than 50% of the residues had a conservation score of 0.9-1.0 were grouped as conserved regions. Using these criteria, we expected to miss some important single amino-acid positions and conformational epitopes but to identify regions suitable for anti-viral/peptide therapies. These data clearly identified the nine most conserved regions within all subtypes, as shown in Tables 1 and 2. The conservation score of each variable residue within an identified conserved region is given in brackets if the variation resulted from a similar residue that would lead to no change in charge/hydrophobicity. The final conservation at each position in the selected conserved regions was higher than it appears in Tables 1 and 2, as the variation between subtypes in the selected regions often resulted in similar substitutions.
We concluded that, despite the large differences in the subtypes, there were regions of low variability that could be prime targets for anti-peptide/anti-viral responses.

Structure Analysis
Targeting of anti-viral agents/antibodies to specific regions of the protein is complex and incompletely understood. Currently, there is no effective method of predicting the epitope structure of the pathogen and directing the antibody response to pre-defined regions using protein as an antigen. However, there is evidence that short peptides with pre-determined sequence specificity can be used to raise anti-peptide antibodies that recognize the peptide-specific region of the protein [19,20], and that monoclonal antibodies raised against synthetic peptides can cross-react with the intact protein molecule [21]. The vast majority of the literature suggests that not all regions of a protein cross-react with synthetic peptide antibodies and that a number of structural parameters, such as surface exposure [22,23], hydrophilicity and residue type [17], influence this propensity. Therefore, the above parameters were evaluated at each of the nine identified sites to predict their accessibility to anti-peptide antibodies.
The structure of HA of influenza A virus (subtype H3) strain A/Aichi/2/68 was determined by Wilson et al., 1981 [24] and, subsequently, studies of variant viruses enabled mapping of antigenic sites on the protein (reviewed in ref. 10). More recently, structures of the influenza A virus HA for three additional subtypes were solved: H1 [25,26], H5 [27] and H9 [28]. We chose the H3 structure (PDB: 1HGJ) for our analysis to adhere to the numbering used in sequence analysis.
The positions of the nine identified regions were mapped on the 3-D structure of the HA monomer and are shown in Figure 2A as a ribbon model. Each identified region was also enlarged to depict the details of the 3-D structure (Figure 2B). All of the identified sites 1-9 were colored in red (Figure 2B) and their secondary structure details, hydrophobicity values and solvent-accessible surface areas were tabulated (Table 3). Site 1 was present at the N-terminus of HA1. This site contains the conserved cysteine residue that forms a disulphide linkage with HA2 [10]. The site exists as a loop on the HA monomer with a large solvent-accessible surface area. Site 2 and site 7 had a few residues that formed part of the known antibody binding sites E and C, respectively [10]. Site 2 was mostly α-helical while site 7 had mostly a β-structure. Both these sites had small solvent-accessible surface areas and may be inaccessible/buried in the monomer. Site 3 was composed of both an α-helical region and a loop with a large solvent-accessible surface area, similar to site 1. This region was highly hydrophilic and has been shown to be immunodominant in eliciting antibodies against a synthetic peptide [29,30]. However, this region does not form part of a known antibody binding site, indicating that in solution this region may acquire some conformation that is accessible only to peptide antibodies.
Sites 4 and 6 were mainly present as a β-sheet and appeared to be hidden in the monomer. Site 8 and site 9 were both found at the C-terminus of HA1 and were mainly loops with a large solvent-accessible surface area and high hydrophilicity. Site 9 has also been shown to elicit a neutralizing antibody response against multiple HA subtypes [31].
This work did not identify the receptor binding site separately, but some of the conserved binding site residues were located within sites 3, 4 and 5. This lack of specific identification could be because the receptor binding site is comprised of residues that are not adjacent in the protein 3-D structure and several of the residues are not conserved [18].
A broadly cross-reactive antibody response is more probable if the nine identified sites adopt conformational structures in the H1, H2 and H5 subtypes similar to those in the H3 subtype. To test this, we evaluated the structure of the identified sites using the H1 structure (PDB: 1RVZ). We chose the H1 structure alone because of the high degree of sequence similarity between the H1, H2 and H5 subtypes. In addition to sequence similarity, the nine sites had a high degree of structure conservation, with similar solvent-accessible surface and secondary structure (Table 3). Some small differences in the secondary structure were seen at sites 3, 6 and 7. The hydrophobic/hydrophilic nature of the sites also remained unchanged except for site 9, which had increased hydrophobicity in the H1 structure. To identify sites that had a high potential of generating an anti-peptide response, the sites were ranked based on their solvent-accessible surface and hydrophobicity. Hydrophilic sites were selected and ranked first according to their solvent-accessible surface, followed by the hydrophobic sites (Table 3).
Five (sites 1, 3, 5, 8 and 9) of the nine identified invariable regions had a high level of structure conservation between subtypes, suitable secondary structure and a sufficiently large solvent-accessible surface to target an anti-peptide/anti-viral response.
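The ranking rule can be written as a two-level sort. The sketch below is an illustration under stated assumptions: a site is treated as hydrophilic when its summed hydrophobicity value is negative, hydrophilic sites are placed first in order of decreasing solvent-accessible surface, and hydrophobic sites follow, also ordered by decreasing surface (the ordering within the hydrophobic group is an assumption). The site labels and numbers are placeholders, not values from Table 3.

```python
def rank_sites(sites):
    """Hydrophilic sites first, each group ordered by decreasing accessible surface."""
    return sorted(
        sites,
        key=lambda site: (site["hydropathy"] >= 0, -site["sasa"]),
    )

# Placeholder values for illustration only (not taken from Table 3).
toy_sites = [
    {"name": "A", "hydropathy": -5.0, "sasa": 300.0},
    {"name": "B", "hydropathy": 2.0, "sasa": 900.0},
    {"name": "C", "hydropathy": -1.0, "sasa": 700.0},
    {"name": "D", "hydropathy": 4.0, "sasa": 100.0},
]
ranked_names = [site["name"] for site in rank_sites(toy_sites)]
print(ranked_names)  # ['C', 'A', 'B', 'D']
```

In this toy case site B has the largest surface but is ranked third, because hydrophilicity takes precedence over surface area.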

Discussion
The hemagglutinin protein of influenza A virus is the major surface antigen against which neutralizing antibodies are produced and, hence, a major constituent of all vaccine formulations. However, this protein undergoes rapid evolutionary variation that leads to a change in its antigenic structure, and vaccines have to be reformulated to include the hemagglutinin protein from each newly emerging viral strain. Thus, there is a critical need for an anti-viral measure that protects against any emerging Influenza A pathogen. One possible solution is an anti-peptide antibody directed against a conserved region of the hemagglutinin protein that is represented in all current and possibly future viral strains. Such an anti-viral response will also dramatically speed up the development of vaccines against new viral strains until viral-specific reagents are developed.
T-cells, which mediate cellular immune responses, can target more conserved peptides from internal proteins and have the potential to provide wider protection. The roles of both CD4+ T-cells and CD8+ T-cells in influenza infection have been extensively studied [32,33]. CD4+ T-cells and CD8+ T-cells recognize antigens presented by MHC class II and class I molecules, respectively. Most of the conserved internal protein sequences belong to class I epitopes, and the CD8+ response, in particular, is being considered in novel vaccine therapies [33,34]. However, unlike the antibody response, the T-cell response is not sterilizing and cannot prevent infection of the host cells. Therefore, antibody-based vaccine approaches, although currently strain-specific, are the primary means of resistance to and recovery from influenza infection.
We conducted a comprehensive HA sequence analysis of major subtypes of Influenza A virus to identify regions that were conserved between subtypes. A combination of protein sequence alignment, and the correlation score with the subtype that gave most significant alignment with other subtypes was used to identify invariant regions. This method not only gave the most significant alignment result, but also eliminated differences arising from the variable number of available sequences of different subtypes and such an approach can be useful for other viruses with high rates of mutation.
Computational searches for conserved residues have been done extensively in the case of human immunodeficiency virus (HIV-1), and these results have indicated that the most highly variable regions correlate with B-cell epitopes that are responsible for viral immune escape [35]. Because it is not known how to induce or modulate the immune responses to target conserved regions, such studies have not been used for the development of vaccines/anti-viral agents. An anti-viral therapy for Influenza A virus that is based on regions that are highly conserved between all Influenza A subtypes can have a significant strategic advantage only if the above regions can be selectively targeted. There are many uncertainties when using whole protein as an antigen, such as localizing the antibody binding sites, directing the antibody response to certain sites and defining how amino-acid changes alter antibody binding in the context of a complex structure. The generation of anti-peptide antibodies with predefined sequence specificity is a promising alternative approach, provided that antibodies that recognize and neutralize the native protein can be generated. A large number of experimental reports on HIV-1 indicate that many such anti-peptide antibodies can recognize and, in some cases, neutralize some HIV-1 viruses [36,37]. The success of binding of anti-peptide antibodies to native protein has been attributed to many structural properties, including secondary structure, hydrophilicity and surface exposure of protein binding sites; however, the contribution of any individual parameter, or of a combination of parameters, that can assure an antigenic response is currently unknown [22,23,38,39]. Although this structural correlation is similar to that of the antibody response when the whole protein is the immunogen, the immune response is not directed against the complex protein structure and is, hence, easier to predict and control.
In this context, peptides spanning the five identified conserved regions with the structural parameters required for an antibody response can be tested, with or without small variations, for a broadly cross-reactive anti-viral response. The conserved regions adopt similar conformational structures in different subtypes, which further validates the results obtained by sequence analysis and suggests that they have the potential to protect against heterologous viral strains. This in silico approach will also need validation by experimental methods and in animal models.
This study establishes the identity and structural features of all conserved regions of the hemagglutinin protein of Influenza A viruses. These regions/sequences can be selected as the first step in the development of new peptide vaccine strategies and for the generation of anti-peptide antibodies as novel anti-virals and cross-reactive immunological reagents, and may provide a solution for neutralization of any existing or emerging Influenza A pathogen.

Supporting Information
Table S1. Correlation score of H2 with other subtypes at each amino acid residue in the HA1 domain. The data obtained from the MUSCLE and COMPASS alignments were tabulated. The H3 sequence (PDB: 1HGJ) was added to follow the H3 numbering. The consensus residue at each amino acid position, the frequency of the consensus residue, the correlation of H2 with other subtypes and the final conservation score at each amino acid position are shown. Found at: doi:10.1371/journal.pone.0009268.s001 (0.10 MB XLS)