Future Sequon Finder - A novel approach for predicting future N-linked glycosylation sequon locations on viral surface proteins

  • Shane P. Bryan,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Medicine, Division of Nephrology, University of Rochester Medical Center, Rochester, New York, United States of America

  • Martin S. Zand

    Roles Conceptualization, Methodology, Project administration, Resources, Writing – review & editing

    Martin_Zand@urmc.rochester.edu

    Affiliations Department of Medicine, Division of Nephrology, University of Rochester Medical Center, Rochester, New York, United States of America, Clinical and Translational Science Institute, University of Rochester Medical Center, Rochester, New York, United States of America

Abstract

Influenza viruses are known to evade host immune responses by shielding vulnerable surface protein epitopes via N-linked glycosylation. A program titled Future Sequon Finder was developed to predict the locations in which glycan binding sites are most likely to emerge in future influenza hemagglutinin proteins. The predictive modeling approach considers how closely sites in currently circulating strains resemble glycosylation sequons at the nucleic acid level, the surface accessibility of those sites, and the mutation frequency of amino acids at those sites that would need to change to form a glycosylation sequon. The efficacy of this model is tested using historic human H1N1 and H3N2 influenza strains along with swine H1N1 strains. Through this analysis, it is revealed that glycosylation addition events in influenza hemagglutinin proteins are typically the result of single nucleotide mutation events. It is also demonstrated that site-specific mutation frequency and surface accessibility are powerful predictors of which sites will become glycosylated in human influenza viruses when considered with the genetic composition of the sites in question. Having been designed to incorporate these factors, the program successfully predicted almost every historic sequon addition event (28/30 in human IFVs, 14/15 in swine IFVs). For human strains, it also ranked the correct near-sequons highly among falsely predicted sequons based on site-specific mutation frequency. After demonstrating the model’s power with historical data, the program was used to predict future HA glycosylation sequon locations based on currently circulating human influenza viruses.

Introduction

Glycosylation of viral proteins that bind to cell surface ligands is a key evolutionary adaptation, as it alters protein antigenicity and shields target sites from immune responses [1–6]. Knowledge of actual or potential N-linked glycosylation sites is important for identifying epitopes on viral surface proteins that serve as ideal target sites for antibodies [7], vaccine engineering [8–10], or immune epitope focusing [11–13]. The highly conserved process of N-linked glycosylation involves the attachment of a polysaccharide to an asparagine (N) residue facilitated by oligosaccharyltransferase, which recognizes a specific amino acid (AA) sequence known as an N-linked glycosylation sequon [14]. N-linked glycosylation sequons are defined by the AA sequence ‘N-!P-S/T’ where the first AA is N, the second AA is anything but proline (P), and the third AA is either serine (S) or threonine (T) [15]. Viruses have exploited protein glycosylation mechanisms in host cells to modify their proteins, aiding in viral protein folding, transportation, host receptor binding, and antibody evasion [16]. N-linked glycan trees are anchored at their N residue and can pivot to shield a large protein surface area, sterically blocking antibodies from binding nearby epitopes [17,18]. This effect, known as ‘glycan shielding’, must be considered when developing monoclonal antibody therapies, vaccination strategies, and antigenic escape predictions.
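As an illustration of the N-!P-S/T definition, the following minimal Python sketch scans an AA sequence for sequons. It is not taken from FSF, which is written in Java; the function name `find_sequons` and the example peptide used to exercise it are hypothetical.

```python
import re

# N, then any residue except P, then S or T. The lookahead lets overlapping
# triads (for example "NNTT" contains both NNT and NTT) be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(protein: str) -> list[tuple[int, str]]:
    """Return (1-based position of the N residue, triad) for every sequon."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein.upper())]
```

For the made-up peptide "MKNGSANPTNNTT", this reports NGS at position 3 and the overlapping NNT and NTT at positions 10 and 11, while NPT is skipped because of the proline.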

Influenza hemagglutinin (HA) proteins enable the virus to bind and enter host cells via interaction with sialic acid receptors and are a key target for antibody-mediated immune responses in the host [19]. As influenza virus (IFV) strains circulate, human populations develop immune responses capable of neutralizing closely related IFVs via HA interference, resulting in population immunity that slows the spread of infection [20]. To circumvent this population immunity, IFVs frequently modify their HA proteins [21–24]. These modifications include increases in receptor binding avidity, distal mutations altering HA conformation and epitope accessibility, mutations in epitopes directly altering their structure, and mutations resulting in sequon addition events (SAEs) that allow new glycans to bind and shield vulnerable epitopes [7]. IFV’s use of the latter strategy is seldom considered when predicting escape mutations despite the fact that it has occurred dozens of times in the recorded history of human IFVs [25]. Given IFV’s propensity for shielding epitopes with glycans, predictive software that identifies future glycosylation sites could improve models for antigenic escape. Here we describe an effective computational method for predicting future glycosylation sites on viral surface proteins. The method identifies which sites on viral proteins are most likely to mutate into N-linked glycosylation sequons based on the region’s nucleotide sequence, its surface accessibility, and the historical mutation frequency of residues within the region that would need to change to generate a glycosylation sequon. The method’s efficacy is first tested on historic influenza HA proteins and later applied to currently circulating strains.

Materials and methods

Program overview

The method for predicting sequon emergence, Future Sequon Finder (FSF) [26], was implemented in Java using JDK 21. The general workflow of the program is outlined in Fig 1. FSF accepts a nucleotide sequence as input, optionally accompanied by the surface accessibility of each AA in the protein, obtained from the external surface accessibility prediction software GetArea [27], and by site-specific AA mutation frequencies, obtained from the sequence alignment and analysis tool MEGA using the Le and Gascuel model [28,29]. The nucleotide input is translated into an AA sequence, which is cross-referenced with the GetArea results. FSF then identifies the existing N-linked glycosylation sequons, along with lists of sites that are most likely to become glycosylation sequons as a result of genetic drift. Probable future glycosylation sites, or “near-sequons,” are sorted into three categories based on the nucleic and amino acid edit distance from a glycosylation sequon and changes in the physical properties of the AA sequence that would emerge from sequon transformation. The categories are as follows:

Fig 1. General workflow for generating outputs with FSF.

The primary analytic pipeline shown in blue is the most basic analytic path in FSF, requiring only an input nucleic acid sequence to generate lists of existing and near sequons. This diagram illustrates a case in which the mutation frequency (green) and surface accessibility (orange) paths are used with the primary path to generate more robust and informative results. When using all pathways, the FSF output will contain a set of binned lists of: all near-sequons and true sequons, all near-sequons and true sequons with exposed surfaces, all near-sequons sorted by mutation frequency (not shown here), and all near-sequons with solvent-exposed surfaces sorted by mutation frequency.

https://doi.org/10.1371/journal.pone.0328174.g001

  • Near-sequon - A trio of amino acids in which one AA must be substituted to generate a glycosylation sequon (AA edit distance = 1)
  • Very near-sequon - A trio of amino acids in which one nucleotide must be substituted to generate a glycosylation sequon (nucleotide edit distance = 1)
  • Ultra near-sequon - A very near-sequon wherein the mismatched AA has the same charge, polarity, and hydrophobicity as the correct AA
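The first two categories can be sketched in Python as follows. This is an illustrative reading of the definitions above, not FSF's Java implementation: the function names are hypothetical, the standard genetic code is assumed, a stop codon is treated as never satisfying the sequon pattern, and the ultra near-sequon category is omitted because the charge, polarity, and hydrophobicity assignments are not listed in this section.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt: str) -> str:
    """Translate an in-frame nucleotide string into one-letter AAs."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt) - 2, 3))


def is_sequon(triad: str) -> bool:
    """N-!P-S/T; a stop codon ('*') is assumed to break the pattern."""
    return triad[0] == "N" and triad[1] not in "P*" and triad[2] in "ST"


def classify(nt9: str) -> str:
    """Classify the AA triad encoded by nine nucleotides."""
    nt9 = nt9.upper()
    triad = translate(nt9)
    if is_sequon(triad):
        return "sequon"
    # Very near-sequon: some single nucleotide substitution yields a sequon.
    for i, base in product(range(9), BASES):
        if base != nt9[i] and is_sequon(translate(nt9[:i] + base + nt9[i + 1:])):
            return "very near-sequon"
    # Near-sequon: exactly one AA position violates the pattern.
    mismatches = sum([triad[0] != "N", triad[1] in "P*", triad[2] not in "ST"])
    return "near-sequon" if mismatches == 1 else "not a near-sequon"
```

For example, AAT-ACT-AAA (NTK) is a very near-sequon because AAA (K) becomes ACA (T) with one substitution, whereas AAT-ACT-GAT (NTD) is only a near-sequon because no single substitution turns GAT (D) into a serine or threonine codon.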

If surface accessibility data is included with the sequence input, all near-sequons will be sorted by the surface exposure ratio of the first AA in the sequon triad. All near-sequons (including ‘very’ and ‘ultra’ near-sequons) with exposure ratios exceeding 15% are tagged as near-sequons with exposed surfaces, meaning they are adequately exposed for glycan attachment. While glycans are covalently bound to proteins before folding, we assume that a sequon in an unexposed region would either remain unglycosylated, thus providing no selective advantage, or become glycosylated and disrupt proper folding of the protein. If the user has chosen to incorporate site-specific mutation frequency data, near-sequons in each category with exposed surfaces will be sorted again based on the relative mutation frequency of the site where their mismatched AA is located. We assumed as a premise that if a near-sequon was only one AA away from becoming a true sequon, and that AA position was subject to frequent observable mutations in the history of related IFVs, that site was more likely to mutate again and generate a functional sequon. After each category is sorted, the rankings of all near-sequons are displayed to the user with those whose mismatched AA is most likely to mutate ranked at the top of the list.
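The filtering and ranking step described above can be sketched as follows. The record type, field names, and function name are hypothetical and chosen only for illustration; the 15% cutoff is the one stated in the text.

```python
from dataclasses import dataclass

EXPOSURE_CUTOFF = 15.0  # percent sidechain exposure, as stated in the text


@dataclass
class NearSequon:
    position: int         # position of the first AA in the triad
    triad: str
    category: str         # e.g. "near", "very", "ultra"
    exposure_pct: float   # exposure ratio of the first AA in the triad
    mutation_rate: float  # relative rate at the mismatched AA site


def rank_exposed(candidates, cutoff=EXPOSURE_CUTOFF):
    """Keep candidates whose exposure exceeds the cutoff, then group them by
    category with the most mutable mismatched site listed first."""
    exposed = [c for c in candidates if c.exposure_pct > cutoff]
    ranked = {}
    for c in sorted(exposed, key=lambda c: c.mutation_rate, reverse=True):
        ranked.setdefault(c.category, []).append(c)
    return ranked
```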

Sidechain exposure ratio analysis

To implement the surface accessibility path in FSF, the nucleotide sequences of interest were first translated into AAs. 3D protein models were then generated from the AA sequences and stored as PDB files. ColabFold v1.5.5 (AlphaFold2 using MMseqs2) was used in this analysis to generate all 3D protein structures [30,31]. After the PDB files were generated, the sidechain exposure ratios were determined using GetArea, chosen due to its accessibility and ease of use [27]. A probe radius of 1.4 Å, the approximate radius of a water molecule, was used for all predictions.

FSF was designed to receive GetArea outputs, assigning a sidechain exposure ratio to each AA found in both the GetArea result and the translated nucleotide sequence input by the user. All AA/position identifier pairings from the GetArea result are compared to those same pairings generated from the converted nucleotide sequence input. If any mismatches between the GetArea sequence and the user input sequence are detected, an error is thrown indicating that the sidechain exposure ratios could not be aligned. Once sidechain exposure ratios have been aligned to the converted AA sequence, FSF iterates through all categories of near-sequons previously determined from the nucleic acid sequence. The program categorizes near-sequons based on whether the sidechain exposure ratio of the first AA in the triad (N, or the residue destined to become N) exceeds 15%. A cutoff of 15% was chosen given the observation that existing sequons seldom support glycans when the asparagine sidechain exposure ratios are less than 20% in HA proteins as found for multiple influenza strain HAs by Altman et al. [25]. The additional 5% leniency was included to account for minor errors in 3D protein structure generation and conformational changes that may occur during the transition from near-sequon to sequon.
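The alignment check described above might look like the following sketch. It assumes the GetArea output has already been parsed into (one-letter AA, 1-based position, exposure ratio) tuples; that parsed form and the function name `align_exposure` are assumptions, not the format FSF actually consumes.

```python
def align_exposure(protein, getarea_rows):
    """Map 1-based AA positions to sidechain exposure ratios.

    Raises ValueError if any residue reported in the exposure data does not
    match the translated sequence at the same position.
    """
    ratios = {}
    for aa, pos, ratio in getarea_rows:
        if not 1 <= pos <= len(protein) or protein[pos - 1] != aa:
            raise ValueError(
                f"Exposure data could not be aligned at position {pos} "
                f"(exposure file has {aa})"
            )
        ratios[pos] = ratio
    return ratios
```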

Site-specific mutation rate analysis

To implement the site-specific mutation frequency path in FSF, a list of nucleotide sequences relating to and predating the sequence of interest must first be obtained. These sequences provide evolutionary context for the sequence of interest, showing which regions are most likely to change over time. We excluded sequences with the following characteristics: uncertain nucleotides (any character other than A, T, G, or C), any AA insertions (HA sequences encoding more than 566 AAs), any repeat sequences sharing the same name, and any sequences from strains succeeding our strain of interest. Sequences were curated to ensure they began with the correct start codon, eliminating reading frame ambiguity for translation. Sequences that did not begin with a start codon were eliminated. The remaining nucleotide sequences were converted into AAs, exported in a FASTA file, and then aligned using the multiple sequence alignment tool, MUSCLE [32]. The aligned sequences were saved as a .meg file and then imported into the Molecular Evolutionary Genetics Analysis (MEGA) tool [28]. Site-specific mutation rates were calculated in MEGA using the Le and Gascuel model with 8 gamma distribution categories [29].
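The exclusion criteria above can be expressed as a simple filter. This is a sketch under stated assumptions rather than the curation code actually used: records are taken to be (name, year, nucleotide sequence) tuples, the encoded length is counted up to the first in-frame stop codon, the first copy of a repeated name is kept, and "succeeding" is read as a collection year later than that of the strain of interest.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def encoded_length(seq):
    """Number of codons before the first in-frame stop codon."""
    n = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            break
        n += 1
    return n


def curate(records, cutoff_year, max_aa=566):
    """Apply the exclusion criteria to (name, year, sequence) records."""
    seen, kept = set(), []
    for name, year, seq in records:
        seq = seq.upper()
        if year > cutoff_year:            # strain succeeds the strain of interest
            continue
        if set(seq) - set("ATGC"):        # uncertain nucleotides
            continue
        if not seq.startswith("ATG"):     # no start codon, ambiguous frame
            continue
        if encoded_length(seq) > max_aa:  # AA insertions
            continue
        if name in seen:                  # repeated strain name
            continue
        seen.add(name)
        kept.append((name, year, seq))
    return kept
```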

The MEGA output was then imported into FSF, which assigned relevant site-specific mutation values to each predetermined near-sequon based on the AA position within that near-sequon that would need to mutate to generate a functional N-linked glycosylation sequon. For example, if near-sequon "NTK" was identified with AAs in positions 1, 2, and 3 respectively, FSF would recognize that the "K" AA in position 3 must change to generate a sequon. FSF will then associate the site-specific mutation frequency for AA site 3 with that near-sequon. Near-sequons are then ranked within their classification categories based on their associated mutation frequencies from most mutagenic to least.
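The "NTK" example can be traced with a short sketch. The function names and the rate values used to exercise it are hypothetical; only the logic of locating the single mismatched position and looking up its rate follows the description above.

```python
def mismatched_site(triad, first_pos):
    """Return the AA position that must change for the triad to match
    N-!P-S/T. Expects exactly one violating position."""
    checks = [triad[0] == "N", triad[1] != "P", triad[2] in "ST"]
    bad = [i for i, ok in enumerate(checks) if not ok]
    if len(bad) != 1:
        raise ValueError(f"{triad} is not one AA away from a sequon")
    return first_pos + bad[0]


def associated_rate(triad, first_pos, site_rates):
    """Look up the site-specific mutation rate for the mismatched position."""
    return site_rates[mismatched_site(triad, first_pos)]
```

With "NTK" starting at position 1, `mismatched_site` returns 3, so the rate attached to the near-sequon is the one for AA site 3.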

Results

Predicting historic IFV sequon addition events

To test FSF’s efficacy, we used a set of historic influenza strains where SAEs were observed in closely related subsequent strains, as described by Altman et al. [25]. For this analysis, 20 strains of H1N1 IFV and 18 strains of H3N2 IFV relevant to SAEs were selected. Most sequences were selected from S3 Fig of Altman et al., who previously determined the phylogenetic relationships among prominent historic IFV strains associated with SAEs [25]. Exceptions to this include A/NewYorkCity/2/1918, A/Tonga/14/1984, A/Memphis/1/1984, A/Colorado/14/2015, A/Virginia/22/2016, and A/Netherlands/10100/2024 for H1N1, and A/Wyoming/11/2014, A/Germany/13247/2022, A/Bangkok/P3993/2023, A/Netherlands/10098/2024 for H3N2 which were added to fill temporal and phylogenetic gaps in the dataset. The added strains were selected from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) [33] or GISAID’s EpiFlu database [34]. To ensure these strains were representative of common glycosylation sequon and near-sequon patterns, they were compared to other circulating strains from the same time period, avoiding rare variants. The strains, sequences, and their sources can be found in S1 Appendix.

Initially, FSF was designed to only consider the genetic composition of HA sequences. Following the primary program path in Fig 1, FSF was able to analyze the selected 20 H1N1 and 18 H3N2 strains with their nucleotide sequences. The nucleic acid sequences for the HA proteins of all 38 selected strains were obtained from the BV-BRC [33] and GISAID’s EpiFlu database [34]. Sites that would become glycosylation sequons at some point in the IFV timeline were tracked over time. Their status as existing sequons, near-sequons, very near-sequons, and ultra near-sequons was recorded (Table 1). We focused on sequon alterations in the H1 head domain, where most N-linked glycosylation changes occur. Sequons and near-sequons outside of this region were omitted to decrease the total number of near-sequons in the FSF outputs.

Table 1. Temporally organized FSF results in the context of historically glycosylated HA regions. Historic H1N1 and H3N2 IFV strains have been temporally organized to illustrate trends in glycosylation patterns over time, as well as trends in near-sequon occurrence at these critical sites. The letters in each AA site column are the single-letter identifiers for the AA found at the given position in the given strain. These letters also represent the first AA of the sequon (or near-sequon) triad (the "N" position of N-!P-S/T). Characterized antigenic sites are indicated in the top row of the table. Near-sequons identified by FSF are indicated by varying shades of orange, with darker shades representing sites that more closely resemble glycosylation sequons. The emboldened letters in the darkest shade of orange represent true N-linked glycosylation sequons. (Top) Temporally organized H1N1 IFVs. (Bottom) Temporally organized H3N2 IFVs.

https://doi.org/10.1371/journal.pone.0328174.t001

Considering the potential genetic disparity between IFV strains from consecutive years, a phylogenetic analysis of the same IFV HA sequences was conducted. This strategy avoids pairing distantly related sequences and captures the impact of genetic drift, the primary means by which IFVs adapt to escape herd immunity [7,35], on sequon emergence. Phylogenetic trees containing all selected strains for H1N1 and H3N2 were constructed and are displayed in Fig 3. These trees were used to determine the branch on which each SAE occurred. Assuming parsimony, there were 19 SAEs in the H1N1 lineage and 13 SAEs in the H3N2 lineage that ultimately rose to prominence, persisting for 2 years and successfully spreading to multiple countries. It is worth noting that other SAEs have occurred in recorded human H1N1 and H3N2 HA sequences; however, these glycosylation sequons appeared in single countries or disappeared within 1–2 years. This analysis only considers SAEs that rose to prominence as previously defined. Preexisting strains closely related to the SAE strain were selected and analyzed with FSF. Table 2 displays the results of this phylogenetic-based analysis. To better visualize the disparity between HA sequences within IFV groups over time, AA edit distance heatmaps were generated using randomly selected HA strains from the H1N1 and H3N2 master lists (Fig 2).

Fig 2. Mapping edit distance between IFV HAs over time.

To better understand the roles of antigenic shift and antigenic drift in H1N1 and H3N2 HA evolution, edit distances between HA AA sequences were calculated and visualized in RStudio. All HA sequences were randomly selected from the master lists comprised of 32,000 H1N1 sequences and 40,000 H3N2 sequences, with a maximum of 5 sequences drawn per year to prevent overrepresentation of modern strains. (A) H1N1 HA protein AA edit distances from 1918–2024. (B) H3N2 HA protein AA edit distances from 1968–2024.

https://doi.org/10.1371/journal.pone.0328174.g002

Fig 3. Phylogenetic trees containing selected H1N1 and H3N2 IFV HAs.

HA nucleotide sequences for the 20 H1N1 and 17 H3N2 IFVs chosen for historical analysis were aligned and used to generate maximum likelihood phylogenetic trees in MEGA [28]. Bootstrap values (500 replicates) are displayed at each branch node as proportions. (A) Maximum likelihood tree of selected H1N1 IFVs. (B) Maximum likelihood tree of selected H3N2 IFVs. A/equine/Uruguay/1/1963 was used as an outgroup to root the trees.

https://doi.org/10.1371/journal.pone.0328174.g003

Table 2. FSF site designations of critical sites from closely related IFV strains circulating before SAEs. All AA sites listed in this table became N-linked glycosylation sequons in the strains specified in the "SAE Strain" column. Phylogenetically related IFV strains circulating before the SAE have been selected and the FSF designation of the site that would become glycosylated is presented in the rightmost column. Related strains were selected if they predated the strain in which the glycosylation sequon arose and were the closest preexisting relative on the generated phylogenetic tree shown in Fig 3.

https://doi.org/10.1371/journal.pone.0328174.t002

Through the phylogenetic analysis of historic IFV strains with FSF, we found that N-linked glycosylation sites on HA proteins almost always arise due to single nucleotide mutations. Of the observed 30 SAEs in the historic IFV selection, 28 arose due to single nucleotide mutations. Of the two that did not, one appeared to be the result of a deletion that occurred in A/Denver/JY2/1957 (compared to A/Malaysia/JY2/1954) at AA position 144, followed by two single nucleotide changes, one in AA 142 and the other in AA 145, which occupied position 144 following the deletion event. The other appeared to be due to two single nucleotide mutations in A/WilsonSmith/1933 (compared to A/NewYorkCity/2/1918), one in AA codon 286 and the other in AA 288. In both cases, these mutations may have arisen simultaneously to generate glycosylation sequons. Alternatively, it could be that more closely related strains containing near-sequons at these sites were never sequenced given the limited number of HA sequences available from the early 20th century. The notion that N-linked glycosylation sequons almost always arise due to single nucleotide mutations coincides with the finding that antigenic diversity in IFV HAs is primarily driven by the gradual accumulation of single base pair mutations [36,37]. Though unsurprising, this finding allows for more accurate sequon predictions given that sites requiring more than a single base mutation can be eliminated from consideration. While it is clear that the ultra and very near-sequon categories have a significantly different proportion of SAEs than the near-sequon category, χ²(2, N = 711) = 16.736, p = 0.0002, the difference between the ultra and very near-sequon categories is less apparent, χ²(1, N = 274) = 2.6353, p = 0.1045. More data is needed to determine if this difference is statistically significant.
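The two p-values can be checked directly from the reported test statistics, reading them as chi-square statistics with 2 and 1 degrees of freedom respectively. The sketch below uses the closed-form upper-tail probabilities for those two cases; the function name `chi2_sf` is hypothetical and the contingency tables themselves are not reconstructed here.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise NotImplementedError("closed form given only for df = 1 and df = 2")


p_three_categories = chi2_sf(16.736, 2)  # near vs. very near vs. ultra near
p_very_vs_ultra = chi2_sf(2.6353, 1)     # very near vs. ultra near
```

Rounded to four decimal places, these evaluate to 0.0002 and 0.1045, in agreement with the values reported above.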

Side chain surface exposure and mutation frequency analysis improve prediction

While FSF had proven to be effective at tagging sites as near-sequons in strains circulating before SAEs, it would also tag many sites that would never become sequons. To eliminate some of these false positives, the AA-specific side chain surface exposure analysis method was implemented. The sidechain exposure ratios for each IFV strain were determined using AlphaFold2 and GetArea. HAs were analyzed in their monomeric forms using GetArea to determine sidechain exposure. As a result, the number of exposed sites is likely overestimated compared to trimerized HAs, given that fewer regions of the HA are exposed in its oligomeric state. Although glycosylation precedes oligomerization, glycans that sterically hinder HA trimerization would likely be selected against. Measuring the surface accessibility of AAs in conformationally closed, trimerized HAs would likely reduce the number of exposed sequons and improve prediction accuracy, though this analysis has not yet been performed.

The site-specific mutation frequency analysis method was implemented to further organize the remaining near-sequons with exposed sidechains. To determine mutation frequencies, HA nucleotide sequences were obtained from BV-BRC and GISAID (32,000 H1N1 sequences, 40,000 H3N2 sequences) to form H1N1 and H3N2 master lists. The mutation frequencies were calculated separately for the analysis of every strain preceding an SAE using only strains that predated or coincided with the analysis strain. For example, site-specific mutation frequencies for A/Udorn/307/1972 (H3N2) were calculated only using strains from the H3N2 master set that circulated before 1972. Due to computational restrictions, if a single year had more than 500 strains, 500 were randomly selected to be used in the analysis. For years with strain numbers below the cutoff, all strains were used. The number of IFV HA genomes obtained for each year is shown in Fig 4. At this stage, FSF was using all three pathways outlined in Fig 1 to generate the historic results shown in Table 3. Strain comparisons matched those used in the phylogenetic analysis.
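The per-year cap can be sketched as below. The function name, the (name, year) record shape, and the fixed random seed are assumptions made for illustration; strains are kept if their year is no later than that of the analysis strain, following the "predated or coincided" wording.

```python
import random
from collections import defaultdict


def subsample_by_year(records, analysis_year, cap=500, seed=0):
    """Keep at most `cap` randomly chosen strains per year, using only
    strains from years up to and including `analysis_year`."""
    by_year = defaultdict(list)
    for name, year in records:
        if year <= analysis_year:
            by_year[year].append(name)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    picked = []
    for year in sorted(by_year):
        names = by_year[year]
        picked.extend(rng.sample(names, cap) if len(names) > cap else names)
    return picked
```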

Fig 4. Influenza genomes obtained by year.

These histograms show the total number of sequences in the H1N1 and H3N2 master lists acquired from the BV-BRC [33] and GISAID’s EpiFlu database [34] for each year. The dotted line at Y = 500 represents the aforementioned cutoff applied when calculating site-specific mutation frequencies.

https://doi.org/10.1371/journal.pone.0328174.g004

Predicting sequon emergence in swine H1N1 strains

We next examined the predictive capacity of FSF by testing its ability to predict the emergence of glycosylation sequons in swine H1N1 strains. Six thousand swine H1N1 sequences collected from 1931–2022 were obtained from the BV-BRC [33] and pre-processed using a custom Java program designed to identify SAE strains. Once identified, a maximum-likelihood phylogenetic tree containing 15 SAE strains and 700 other H1N1 swine strains (a maximum of 25 strains selected from each year, 1931–2022) was generated, and close relatives to the SAE strains were identified. Those relatives predating the SAE were analyzed with FSF with their accompanying sidechain exposure and mutation frequency data as described for the human influenza strains. Unlike our analysis of human IFVs above, sequons did not need to meet the definition of prominence for their SAE strains to be included in this analysis. The only swine SAE strains to be excluded from the analysis were those with no close ancestors, defined as strains on isolated branches of the phylogenetic tree. The results of this analysis are displayed in Table 4.

In the case of the swine influenza dataset, near-sequon designations, along with sidechain exposure ratios, were effective predictive tools. Fourteen of 15 SAEs were tagged as near-sequons with exposed surfaces, with 12 of those 14 being very near-sequons. Site-specific mutation frequency proved less effective in this dataset, with 50% of very near-sequons being ranked in the bottom half of the very near-sequon pool, compared to 15% in the human H1N1 dataset and 23% in the human H3N2 dataset. There are many differences between IFV circulation in swine and human populations that may account for this disparity. Swine herds are often tightly regulated and have limited exposure to other herds, in contrast to humans who frequently encounter other individuals worldwide [38]. This may explain why SAEs in the swine dataset rarely rise to prominence compared to those in the human dataset. The isolation of swine herds may also drive local adaptation of swine IFV, where mutation patterns are shaped by specific environmental pressures unique to each herd. As a result, mutation frequencies observed in one swine population may not align with those in others, which could explain the reduced effectiveness of the mutation frequency analysis in this dataset.

Table 3. Historic FSF predictions with side-chain exposure and site-specific mutation frequency data. The same closely related H1N1 and H3N2 IFV strains analyzed in Table 2 were reanalyzed with side-chain exposure and site-specific mutation frequency data. The surface exposure of the first AA in each near-sequon is expressed as a percentage from the GetArea results. The relative mutation rate of the mismatched AA position is provided, where a value of 1 represents the average mutation rate for AAs across the HA protein. The site-specific mutation gamma category placement (out of 8) is displayed, with AAs assigned to category 8 being the most likely to mutate. The certainty with which MEGA assigned sites to their respective gamma categories is also shown as a proportion. Rankings of near-sequons within the same HA are based on relative AA mutation rates. Near-sequons with surface exposure ratios ≤15% are excluded from the rankings.

https://doi.org/10.1371/journal.pone.0328174.t003

Table 4. Historic Swine H1N1 FSF predictions. This table presents FSF results for swine IFVs closely related to SAE strains, incorporating side-chain exposure and site-specific mutation frequency data. The layout of this table is identical to Table 3. Consistent with the human IFV analysis, near-sequons with surface exposure ratios ≤15% are excluded from the rankings.

https://doi.org/10.1371/journal.pone.0328174.t004

Predicting future IFV sequon addition events

Following the predictive success FSF displayed with the historic IFV datasets, we used it to generate predictions from currently circulating strains. Ten currently circulating IFV strains were selected for each of H1N1 and H3N2 from GISAID’s EpiFlu database. These strains were selected semi-randomly, with each strain obtained from a different nation to cover a wider range of the geographic diversity of circulating influenza. The selected strains include:

  • H1N1 - A/Netherlands/10114/2024, A/Curacao/10079/2024, A/Aragon/102/2024, A/Bayern/33/2024, A/SriLanka/32/2024, A/Bangkok/P479/2024, A/Timis/563634/2024, A/Norway/01013/2024, A/NorthCarolina/14611/2024, A/UnitedKingdom/14762/2024
  • H3N2 - A/Curacao/10082/2024, A/Canberra/14/2024, A/Bangkok/P458/2024, A/Hungary/1/2024, A/Netherlands/10110/2024, A/Norway/00732/2024, A/Victoria/12/2024, A/Philippines/2/2024, A/Yekaterinburg/2/2024, A/Oklahoma/14622/2024

Each chosen strain was analyzed using all FSF pathways. Sidechain exposure ratios were calculated for each strain individually and mutation frequencies were calculated with the same HA sequence dataset and ‘pick 500’ approach used for the historic strain analysis. For H1N1, mutation frequencies were calculated using strains from all years (1918-2024). Mutation frequencies were recalculated with strains from 2009-2024 to determine if there was a noteworthy difference between the resultant ranking of near-sequons given the large phylogenetic distance between modern H1N1 strains and those circulating pre-2009. For H3N2 viruses, the mutation frequency was calculated once using strains from all years (1968-2024). All accurate SAE predictions in the historic human dataset were predicted as very near-sequons or ultra near-sequons, so it is logical to prioritize these categories when predicting future SAEs. Consequently, exposed very near-sequons found in the circulating IFV strains were assigned an overall rank based on the relative mutation rate of their mismatched AA site. The frequency of the very near-sequon sites in the analyzed IFVs was reported but not factored into the overall ranking. Final rankings were determined by site-specific mutation frequency alone. The summarized results of these future sequon predictions are displayed in Table 5.
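The way per-strain results are combined into an overall ranking can be sketched as follows. The input shape, field names, and function name are hypothetical. The sketch assumes each run yields (site, exposure percentage, mutation rate) tuples for its exposed very near-sequons, and that a site's mutation rate is the same in every run because it comes from the shared sequence dataset.

```python
from collections import defaultdict


def summarize_runs(runs):
    """Combine per-strain lists of (site, exposure_pct, mutation_rate).

    Frequency and mean exposure are reported for each site, but the overall
    order depends on the site-specific mutation rate alone.
    """
    by_site = defaultdict(list)
    for run in runs:
        for site, exposure, rate in run:
            by_site[site].append((exposure, rate))
    summary = [
        {
            "site": site,
            "frequency": len(hits) / len(runs),
            "mean_exposure": sum(e for e, _ in hits) / len(hits),
            "mutation_rate": hits[0][1],
        }
        for site, hits in by_site.items()
    ]
    summary.sort(key=lambda s: s["mutation_rate"], reverse=True)
    return summary
```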

Table 5. Future H1N1 and H3N2 N-linked glycosylation sequon locations predicted by FSF. Each table summarizes the top site predictions from 10 individual FSF runs using currently circulating IFV HAs. Overall near-sequon rankings were determined by site-specific mutation rates, with higher rates corresponding to higher overall rankings. Near-sequon frequencies among the analyzed strains are displayed as proportions, alongside the average side-chain exposure ratios of the first amino acids (AAs) in the near-sequon triads. The mutation rates for the mismatched AA positions and their gamma category assignments are also presented, along with the near-sequon distinction of the site and its ranking among other near-sequons based on mutation rates. (Top) Future H1N1 sequon locations predicted by FSF using a mutation frequency calculation based on a random selection of H1N1 HA sequences per year for all years from 1918-2024. (Middle) Future H1N1 sequon locations predicted by FSF using a mutation frequency calculation based on a random selection of H1N1 HA sequences per year for all years from 2009-2024. (Bottom) Future H3N2 sequon locations predicted by FSF using a mutation frequency calculation based on a random selection of H3N2 HA sequences per year for all years from 1968-2024.

https://doi.org/10.1371/journal.pone.0328174.t005

Discussion

The temporal analysis of historic H3N2 IFVs supports the notion that the HA glycosylation patterns of currently circulating strains are most closely related to the HAs of strains circulating in the previous year (Figs 2 and 3). The temporal analysis of H1N1 IFVs revealed a similar pattern with some exceptions, given that strains closely related to distant IFVs occasionally reemerged and replaced the predominant strains. The most dramatic example was the emergence of A(H1N1)pdm09 type IFVs from swine reservoirs in 2009, whose HA proteins and glycosylation patterns were more closely related to those from the 1976 swine outbreak than any circulating in the prior decade. This demonstrates the importance of monitoring for the re-emergence of IFVs propagating in human or animal reservoirs at low levels. However, our results suggest that for both H1N1 and H3N2, the HA sequences of prominent circulating strains are the best FSF inputs for predicting sequons that will arise in the near future.

In both the temporal and phylogenetic analyses of historic IFVs, falsely predicted sequons greatly outnumbered accurate predictions. Consequently, the reduction and sorting of near-sequon pools became a priority, leading to the implementation of the two optional pathways shown in Fig 1. The side-chain exposure ratio analysis of individual AA sites greatly narrowed the near-sequon pools: a typical implementation of the surface accessibility pathway reduces HA near-sequon pools by 35–45% (∼50 near-sequons reduced to ∼30). In the analysis of historic human and swine IFVs, no near-sequon that would subsequently become a sequon was excluded by the 15% side-chain exposure cutoff, though some came close (within 2%). As a precaution, it may be beneficial to lower this cutoff slightly in future analyses.
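The exposure filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the FSF source code: the `NearSequon` record, the function names, and the example positions and ratios are hypothetical, and the sketch assumes that exposure ratios are stored as fractions (0.15 for the 15% cutoff).

```python
from typing import List, NamedTuple


class NearSequon(NamedTuple):
    position: int          # HA position of the first AA in the triad (hypothetical values)
    exposure_ratio: float  # side-chain exposure ratio of that AA, as a fraction


def filter_by_exposure(sites: List[NearSequon], cutoff: float = 0.15) -> List[NearSequon]:
    """Keep near-sequons whose first AA meets the exposure cutoff."""
    return [s for s in sites if s.exposure_ratio >= cutoff]


def narrowly_retained(sites: List[NearSequon], cutoff: float = 0.15,
                      margin: float = 0.02) -> List[NearSequon]:
    """Sites that pass the cutoff by less than `margin`.

    These are the sites that a slightly stricter cutoff would have discarded.
    """
    return [s for s in sites if cutoff <= s.exposure_ratio < cutoff + margin]


# Made-up pool of three near-sequons, for illustration only.
pool = [NearSequon(10, 0.05), NearSequon(20, 0.16), NearSequon(30, 0.40)]
kept = filter_by_exposure(pool)
print([s.position for s in kept])                     # positions that survive the filter
print([s.position for s in narrowly_retained(pool)])  # survivors close to the cutoff
```

Listing the narrowly retained sites separately makes it easier to judge how sensitive a given pool is to the choice of cutoff.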

To explain the lack of N-linked glycosylation in unexposed regions, we propose that SAEs with unexposed asparagine residues either remain unglycosylated or become glycosylated before protein folding, introducing detrimental conformational changes. Given the co-translational nature of glycan addition via oligosaccharyltransferase, a region that is unexposed post-folding may be exposed at the time of glycosylation. Adding a bulky oligosaccharide moiety to such a site, however, would likely result in significant conformational changes during folding, rendering the HA protein ineffective. If the SAE site folds correctly into its unexposed state before glycan binding, glycosylation will not occur. As a result, the mutation will provide no selective advantage via epitope shielding and is less likely to rise to prominence. Sequons may still emerge in these locations, but their immediate biological relevance would be less significant given that they either wouldn’t host an N-linked glycan, or the new glycan would introduce detrimental conformational changes.

In conjunction with the solvent accessibility results, the mutation frequency analysis proved a powerful predictor of which near-sequons would become true sequons in the human IFV dataset. In particular, ranking very near-sequons by the mutation frequencies of their mismatched AAs placed almost all correct near-sequons at the top of their respective lists. Of the 25 correctly predicted very near-sequons, 24 were ranked in the top 10, 19 in the top 5, and 17 in the top 3 based on site-specific mutation frequency. It is also worth noting that all but one of the correctly predicted near-sequons had associated site-specific mutation rates exceeding 1.0, meaning the rates of the sites that would ultimately mutate and generate sequons were almost always greater than the average AA mutation rate across the HA. Interestingly, the mutation frequency analysis proved effective even when calculated from a minimal number of preexisting sequences, especially in the case of H1N1 viruses in the early to mid-20th century. H1N1 mutation analyses for 1934 and 1935 successfully placed all correct very near-sequons in the top 7, despite the calculations being based on only 59 and 64 preexisting H1N1 strains, respectively. It is unclear whether the abundance of available strains from the early 2000s onwards significantly improves the power of mutation frequency analysis as a predictive tool, as there have been few SAEs in H1N1 and H3N2 since the turn of the century. While it seems reasonable that larger sequence datasets could improve the predictive power of mutation analyses, more data are needed to confirm this.
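The ranking step can be illustrated with a short Python sketch. It is a simplified illustration rather than the FSF implementation: the function name, the input dictionaries, and all positions and rates are hypothetical, and it assumes that each near-sequon has a single mismatched AA position and that rates are relative values scaled so that 1.0 is the average across the HA.

```python
from typing import Dict, List, Tuple


def rank_near_sequons(mismatch_pos_by_site: Dict[int, int],
                      relative_rate: Dict[int, float]) -> List[Tuple[int, int, float, bool]]:
    """Rank near-sequons by the relative mutation rate of their mismatched AA.

    Returns (rank, site, rate, above_average) tuples, highest rate first.
    """
    scored = [(site, relative_rate[pos]) for site, pos in mismatch_pos_by_site.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(rank, site, rate, rate > 1.0)
            for rank, (site, rate) in enumerate(scored, start=1)]


# Made-up inputs: near-sequon start position -> position of its mismatched AA,
# and mismatched AA position -> relative mutation rate.
mismatches = {100: 102, 200: 200, 300: 302}
rates = {102: 0.4, 200: 2.5, 302: 1.2}

for rank, site, rate, above_avg in rank_near_sequons(mismatches, rates):
    print(rank, site, rate, above_avg)
```

The `above_average` flag mirrors the observation that sites which went on to form sequons almost always had rates above 1.0.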

There are several limitations to the FSF method worth addressing. FSF excels at predicting SAE locations when the analyzed strain is closely related to the SAE strain, and the SAE occurs in a region with a high mutation rate. It isn’t guaranteed that next year’s predominant IFVs will be most closely related to this year’s, particularly in the case of H1N1. If a historic strain distantly related to circulating strains were to reemerge, FSF analyses using circulating strains would be less accurate. This is also true in the event of novel strain emergence from animal reservoirs. Additionally, FSF only predicts the emergence of sequons and does not predict whether those sequons will become glycosylated. The side-chain exposure analysis used by FSF may help provide some insight into this, as unexposed sequons will seldom be glycosylated. There are, however, other factors influencing whether a glycan will bind to a sequon. The local secondary structure, the hydrophobicity of the region, and the conformational flexibility of the region have all been shown to influence sequon occupancy in a structural analysis of glycoproteins [39]. Additional factors include the identity and charge of amino acids flanking the tripeptide sequon [40], as well as the sequon’s proximity to the C-terminus [41]. Fortunately, there are numerous existing programs designed to determine which sequons in a protein are likely to become glycosylated [42–46]. Running the most probable near-sequon mutants identified by FSF through these programs may further enhance glycosylation site predictions.

The analysis of currently circulating H1N1 and H3N2 IFV strains revealed sites that may become glycosylated in the future, assuming future SAE strains are most closely related to those circulating presently. In the historical analysis of human IFVs, every correct near-sequon prediction was tagged as a very near-sequon. Consequently, only very near-sequons (or ultra near-sequons) were considered serious candidates for SAEs in the near future. All near-sequons appearing in the lists were the highest-ranking sites predicted by FSF. It is important to track these near-sequons over time as a near-sequon present in strains this year may not persist in strains next year, and vice versa. For example, position 142 in the H1N1 HA is not included in the list of predicted sites given that it is not a single nucleotide change away from becoming a sequon, but the mismatched AA in this near-sequon (position 144) currently has the highest mutation rate in the entire H1 domain of the H1N1 HA protein and may become a sequon in time. For this reason, an analysis of presently circulating IFVs should be conducted every few years to catch the new very near-sequons that emerge. That being said, should a new N-linked glycosylation sequon emerge soon in either H1N1 or H3N2, it is likely included in Table 5.
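The single-nucleotide criterion mentioned for position 142 can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the FSF source code: it assumes that a site counts as one step from a sequon when one nucleotide substitution in its three codons yields N-X-S/T with X other than proline, it uses the standard genetic code, and the function names and example codons are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt: str) -> str:
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))


def is_sequon(triad: str) -> bool:
    """True for N-X-S/T where X is neither proline nor a stop codon."""
    return triad[0] == "N" and triad[1] not in "P*" and triad[2] in "ST"


def single_nt_routes(nt9: str):
    """List (index, new_base, new_triad) for each single substitution
    that turns a three-codon window into a sequon."""
    if is_sequon(translate(nt9)):
        return []  # already a sequon, so nothing to gain
    routes = []
    for i, old in enumerate(nt9):
        for new in BASES:
            if new == old:
                continue
            triad = translate(nt9[:i] + new + nt9[i + 1:])
            if is_sequon(triad):
                routes.append((i, new, triad))
    return routes


# Hypothetical windows: N-G-A is one substitution away, K-G-A needs two.
print(single_nt_routes("AATGGAGCT"))
print(single_nt_routes("AAAGGAGCT"))
```

A window that returns an empty list, like the second example, would need at least two substitutions, which matches the reason given for leaving position 142 out of the predicted sites.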

The historic analysis of IFVs with FSF demonstrated that it is often possible to predict N-linked glycosylation sequon locations before they emerge in H1N1 and H3N2 HA proteins. The predictive capacity also seems to extend to swine H1N1 viruses, though the mutation frequency analysis was less effective. Whether this predictive capacity extends to HA proteins in other IFV types, such as influenza B viruses or those circulating in avian reservoirs, should be tested. Glycosylation changes on the IFV neuraminidase surface protein are also likely to be predictable. This predictive capacity may even extend to other viral surface proteins such as the heavily glycosylated envelope glycoprotein of the human immunodeficiency virus (HIV), which is known to alter glycosylation patterns to take advantage of glycan shielding [47], or the SARS-CoV-2 spike protein which has also been shown to evade antibodies via an N-linked glycan shield that protects approximately 40% of the underlying protein surface from antibody recognition [48]. FSF serves as proof of concept, and the predictive power of sequon prediction programs like it can undoubtedly be improved. With a larger dataset of recorded SAEs in viral surface proteins, machine learning could be implemented to reveal other useful parameters for sequon prediction or assign weights to existing parameters to enhance prediction accuracy. One promising parameter not explored in this study is the frequency of near-sequons in IFVs circulating worldwide at a given time. Near-sequon frequencies were reported in Table 5, though it is unclear how these may impact the likelihood of a given sequon emerging. Once refined, sequon prediction programs like FSF could work in synergy with existing algorithms. This integration would enhance antigenic escape modeling and guide the development of effective vaccines and monoclonal antibody therapies.

Supporting information

S1 File. Future Sequon Finder source code.

https://doi.org/10.1371/journal.pone.0328174.s002

(ZIP)

S2 File. Influenza HA-based phylogenetic trees.

https://doi.org/10.1371/journal.pone.0328174.s003

(ZIP)

S3 File. Influenza HA protein structure files.

https://doi.org/10.1371/journal.pone.0328174.s004

(ZIP)

Acknowledgments

The authors would like to thank Dr. Daven Presgraves and Dr. Christopher Anderson of the University of Rochester for their helpful suggestions regarding phylogenetic analysis techniques and influenza sequence data organization, respectively.

References

  1. Newby ML, Allen JD, Crispin M. Influence of glycosylation on the immunogenicity and antigenicity of viral immunogens. Biotechnol Adv. 2024;70:108283. pmid:37972669
  2. Zhang X-L, Qu H. The role of glycosylation in infectious diseases. Adv Exp Med Biol. 2021;1325:219–37. pmid:34495538
  3. Reis CA, Tauber R, Blanchard V. Glycosylation is a key in SARS-CoV-2 infection. J Mol Med (Berl). 2021;99(8):1023–31. pmid:34023935
  4. Chawla H, Fadda E, Crispin M. Principles of SARS-CoV-2 glycosylation. Curr Opin Struct Biol. 2022;75:102402. pmid:35717706
  5. Idris F, Muharram SH, Diah S. Glycosylation of dengue virus glycoproteins and their interactions with carbohydrate receptors: possible targets for antiviral therapy. Arch Virol. 2016;161(7):1751–60. pmid:27068162
  6. Gorzkiewicz M, Cramer J, Xu HC, Lang PA. The role of glycosylation patterns of viral glycoproteins and cell entry receptors in arenavirus infection. Biomed Pharmacother. 2023;166:115196. pmid:37586116
  7. Doud MB, Lee JM, Bloom JD. How single mutations affect viral escape from broad and narrow antibodies to H1 influenza hemagglutinin. Nat Commun. 2018;9(1):1386. pmid:29643370
  8. Schön K, Lepenies B, Goyette-Desjardins G. Impact of protein glycosylation on the design of viral vaccines. Adv Biochem Eng Biotechnol. 2021;175:319–54. pmid:32935143
  9. Ozdilek A, Avci FY. Glycosylation as a key parameter in the design of nucleic acid vaccines. Curr Opin Struct Biol. 2022;73:102348. pmid:35255387
  10. Hariharan V, Kane RS. Glycosylation as a tool for rational vaccine design. Biotechnol Bioeng. 2020;117(8):2556–70. pmid:32330286
  11. Nilchan N, Kraivong R, Luangaram P, Phungsom A, Tantiwatcharakunthon M, Traewachiwiphak S, et al. An engineered N-glycosylated dengue envelope protein domain III facilitates epitope-directed selection of potently neutralizing and minimally enhancing antibodies. ACS Infect Dis. 2024;10(8):2690–704. pmid:38943594
  12. Martina CE, Crowe JE Jr, Meiler J. Glycan masking in vaccine design: targets, immunogens and applications. Front Immunol. 2023;14:1126034. pmid:37033915
  13. Carnell GW, Billmeier M, Vishwanath S, Suau Sans M, Wein H, George CL, et al. Glycan masking of a non-neutralising epitope enhances neutralising antibodies targeting the RBD of SARS-CoV-2 and its variants. Front Immunol. 2023;14:1118523. pmid:36911730
  14. Chuang G-Y, Boyington JC, Joyce MG, Zhu J, Nabel GJ, Kwong PD, et al. Computational prediction of N-linked glycosylation incorporating structural properties and patterns. Bioinformatics. 2012;28(17):2249–55. pmid:22782545
  15. Kornfeld R, Kornfeld S. Assembly of asparagine-linked oligosaccharides. Annu Rev Biochem. 1985;54:631–64. pmid:3896128
  16. Feng T, Zhang J, Chen Z, Pan W, Chen Z, Yan Y, et al. Glycosylation of viral proteins: implication in virus-host interaction and virulence. Virulence. 2022;13(1):670–83. pmid:35436420
  17. Lavie M, Hanoulle X, Dubuisson J. Glycan shielding and modulation of Hepatitis C virus neutralizing antibodies. Front Immunol. 2018;9:910. pmid:29755477
  18. Grant OC, Montgomery D, Ito K, Woods RJ. Analysis of the SARS-CoV-2 spike protein glycan shield reveals implications for immune recognition. Sci Rep. 2020;10(1):14991. pmid:32929138
  19. Gomez Lorenzo MM, Fenton MJ. Immunobiology of influenza vaccines. Chest. 2013;143(2):502–10. pmid:23381315
  20. Rodpothong P, Auewarakul P. Viral evolution and transmission effectiveness. World J Virol. 2012;1(5):131–4. pmid:24175217
  21. Tate MD, Job ER, Deng Y-M, Gunalan V, Maurer-Stroh S, Reading PC. Playing hide and seek: how glycosylation of the influenza virus hemagglutinin can modulate the immune response to infection. Viruses. 2014;6(3):1294–316. pmid:24638204
  22. Chen W, Zhong Y, Qin Y, Sun S, Li Z. The evolutionary pattern of glycosylation sites in influenza virus (H5N1) hemagglutinin and neuraminidase. PLoS One. 2012;7(11):e49224. pmid:23133677
  23. Sun S, Wang Q, Zhao F, Chen W, Li Z. Glycosylation site alteration in the evolution of influenza A (H1N1) viruses. PLoS One. 2011;6(7):e22844. pmid:21829533
  24. An Y, Parsons LM, Jankowska E, Melnyk D, Joshi M, Cipollo JF. N-Glycosylation of seasonal influenza vaccine hemagglutinins: implication for potency testing and immune processing. J Virol. 2019;93(2):e01693-18. pmid:30355697
  25. Altman MO, Angel M, Košík I, Trovão NS, Zost SJ, Gibbs JS, et al. Human Influenza A virus hemagglutinin glycan evolution follows a temporal pattern to a glycan limit. mBio. 2019;10(2):e00204-19. pmid:30940704
  26. Bryan SP, Zand MS. Future Sequon Finder (Version 1.0). Zenodo; 2025. https://doi.org/10.5281/zenodo.15473388
  27. Fraczkiewicz R, Braun W. Exact and efficient analytical calculation of the accessible surface areas and their gradients for macromolecules. J Comput Chem. 1998;19(3):319–33.
  28. Tamura K, Stecher G, Kumar S. MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol. 2021;38(7):3022–7. pmid:33892491
  29. Le SQ, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol. 2008;25(7):1307–20. pmid:18367465
  30. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
  31. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19(6):679–82. pmid:35637307
  32. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7. pmid:15034147
  33. Olson RD, Assaf R, Brettin T, Conrad N, Cucinell C, Davis JJ, et al. Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR. Nucleic Acids Res. 2023;51(D1):D678–D689.
  34. Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 2017;22(13):30494. pmid:28382917
  35. Boni MF. Vaccination and antigenic drift in influenza. Vaccine. 2008;26(Suppl 3):C8-14. pmid:18773534
  36. Jones S, Nelson-Sathi S, Wang Y, Prasad R, Rayen S, Nandel V, et al. Evolutionary, genetic, structural characterization and its functional implications for the influenza A (H1N1) infection outbreak in India from 2009 to 2017. Sci Rep. 2019;9(1):14690. pmid:31604969
  37. Chen Z, Bancej C, Lee L, Champredon D. Publisher correction: antigenic drift and epidemiological severity of seasonal influenza in Canada. Sci Rep. 2024;14(1):2952. pmid:38316957
  38. Otake S, Yoshida M, Dee S. A review of swine breeding herd biosecurity in the united states to prevent virus entry using porcine reproductive and respiratory syndrome virus as a model pathogen. Animals (Basel). 2024;14(18):2694. pmid:39335283
  39. Petrescu A-J, Milac A-L, Petrescu SM, Dwek RA, Wormald MR. Statistical analysis of the protein environment of N-glycosylation sites: implications for occupancy, structure, and folding. Glycobiology. 2004;14(2):103–14. pmid:14514716
  40. Jones J, Krag SS, Betenbaugh MJ. Controlling N-linked glycan site occupancy. Biochim Biophys Acta. 2005;1726(2):121–37. pmid:16126345
  41. Bañó-Polo M, Baldin F, Tamborero S, Marti-Renom MA, Mingarro I. N-glycosylation efficiency is determined by the distance to the C-terminus and the amino acid preceding an Asn-Ser-Thr sequon. Protein Sci. 2011;20(1):179–86. pmid:21082725
  42. Gupta R, Brunak S. Prediction of glycosylation across the human proteome and the correlation to protein function. Pac Symp Biocomput. 2002;310–22.
  43. Pitti T, Chen C-T, Lin H-N, Choong W-K, Hsu W-L, Sung T-Y. N-GlyDE: a two-stage N-linked glycosylation site prediction incorporating gapped dipeptides and pattern-based encoding. Sci Rep. 2019;9(1):15975. pmid:31685900
  44. Hou X, Wang Y, Bu D, Wang Y, Sun S. EMNGly: predicting N-linked glycosylation sites using the language models for feature extraction. Bioinformatics. 2023;39(11):btad650. pmid:37930896
  45. Pakhrin SC, Pokharel S, Aoki-Kinoshita KF, Beck MR, Dam TK, Caragea D, et al. LMNglyPred: prediction of human N-linked glycosylation sites using embeddings from a pre-trained protein language model. Glycobiology. 2023;33(5):411–22. pmid:37067908
  46. Pakhrin SC, Aoki-Kinoshita KF, Caragea D, Kc DB. DeepNGlyPred: a deep neural network-based approach for human N-linked glycosylation site prediction. Molecules. 2021;26(23):7314. pmid:34885895
  47. Wagh K, Hahn BH, Korber B. Hitting the sweet spot: exploiting HIV-1 glycan shield for induction of broadly neutralizing antibodies. Curr Opin HIV AIDS. 2020;15(5):267–74. pmid:32675574
  48. Grant OC, Montgomery D, Ito K, Woods RJ. Analysis of the SARS-CoV-2 spike protein glycan shield: implications for immune recognition. bioRxiv. 2020:2020.04.07.030445. pmid:32511307