Reconstruction of the Evolutionary Dynamics of A(H3N2) Influenza Viruses Circulating in Italy from 2004 to 2012

Background Influenza A viruses are characterised by their rapid evolution, and the appearance of point mutations in the viral hemagglutinin (HA) domain causes seasonal epidemics. The A(H3N2) virus has higher mutation rate than the A(H1N1) virus. The aim of this study was to reconstruct the evolutionary dynamics of the A(H3N2) viruses circulating in Italy between 2004 and 2012 in the light of the forces driving viral evolution. Methods Phylodinamic analyses were made using a Bayesian method, and codon-specific positive selection acting on the HA coding sequence was evaluated. Results Global and local phylogenetic analyses showed that the Italian strains collected between 2004 and 2012 grouped into five significant Italian clades that included viral sequences circulating in different epidemic seasons. The time of the most recent common ancestor (tMRCA) of the tree root was between May and December 2003. The tMRCA estimates of the major clades suggest that the origin of a new viral strain precedes the effective circulation of the strain in the Italian population by 6–31 months, thus supporting a central role of global migration in seeding the epidemics in Italy. The study of selection pressure showed that four codons were under positive selection, three of which were located in antigenic sites. Analysis of population dynamics showed the alternation of periods of exponential growth followed by a decrease in the effective number of infections corresponding to epidemic and inter-epidemic seasons. Conclusions Our analyses suggest that a complex interaction between the immune status of the population, migrations, and a few selective sweeps drive the influenza A(H3N2) virus evolution. Our findings suggest the possibility of the year-round survival of local strains even in temperate zones, a hypothesis that warrants further investigation.


Methods
Phylodinamic analyses were made using a Bayesian method, and codon-specific positive selection acting on the HA coding sequence was evaluated.

Results
Global and local phylogenetic analyses showed that the Italian strains collected between 2004 and 2012 grouped into five significant Italian clades that included viral sequences circulating in different epidemic seasons. The time of the most recent common ancestor (tMRCA) of the tree root was between May and December 2003. The tMRCA estimates of the major clades suggest that the origin of a new viral strain precedes the effective circulation of the strain in the Italian population by 6-31 months, thus supporting a central role of global migration in seeding the epidemics in Italy. The study of selection pressure showed that four codons were under positive selection, three of which were located in antigenic sites. Analysis of population dynamics showed the alternation of periods of exponential growth followed by a decrease in the effective number of infections corresponding to epidemic and inter-epidemic seasons.

Introduction
Influenza A viruses are characterised by their rapid evolution, which allows them to generate new strains against which humans are not immune on such a regularly that they cause seasonal epidemics and occasionally global pandemics. Human influenza viruses are one of the major cause of morbidity and mortality worldwide, on average accounting for infections in 5-15% of the global population and about 500,000 deaths a year [1,2].
The gene fragment coding for hemagglutinin (HA) is particularly important because surface glycoprotein HA is the main target of the immune system, and mutations in the globular head region of the protein (residues 50-230 of HA1 according to H3 HA numbering) give rise to antigenic novelty, species adaptation, and viral transmission [3] Influenza pandemics are characterised by a change in the viral HA subtype due to the re-assortment or introduction of a new virus (antigenic shift), whereas influenza epidemics are characterized by the acquisition of point mutations in the viral HA1 domain encoding the major antigenic sites of HA protein that lead to serial antigenic changes (antigenic drift) [4].
Of the eighteen known subtypes of HA and eleven subtypes of neuraminidase (NA) in influenza A viruses [1,4], H3N2 and H1N1 are currently the major circulating subtypes and have been circulating together since 1977 [1]; however, A(H3N2) has a higher mutation rate than A (H1N1) [5,6], and there has been a high rate of antigenic drift in the human H3 subtype since its emergence in 1968.
The aim of this study was to reconstruct the evolutionary dynamics of the A(H3N2) influenza viruses circulating in Northern Italy between 2004 and 2012 in the light of the forces driving viral evolution.

Ethics statement
The respiratory samples were collected by sentinel practitioners, and anonymously analysed at the reference laboratories of the Italian Influenza Surveillance Network (Influnet: http://www. iss.it/iflu/) and in the framework of severe influenza A surveillance program in Lombardy region (DGR IX/1046, 22 Dec. 2010 and DGR 5988, 30 Jun 2011). This retrospective study was performed according to the guidelines of the Institutional Review Board on the use of biological specimens for scientific purposes in keeping with Italian law (art. 13 D.Lgs 196/2003) and was approved by the Ethics Commitee of Fondazione IRCCS Policlinico San Matteo in Pavia, Italy. The work described here is a retrospective study performed on left over samples that were obtained as part of routine tests performed. No extra samples were obtained for this research. The retrospective analysis was anonymous. Therefore, informed consent (either written or verbal) was not required.

Sequence dataset
A sequence dataset was constructed that included 202 HA gene sequences obtained from as many A(H3N2)-positive respiratory specimens (nasal or oropharyngeal swabs) collected within the framework of Influnet from outpatients with symptoms of influenza-like illness (ILI) and hospitalised patients suffering from severe respiratory syndrome in Northern Italy. The A(H3N2) HA sequences were collected between January 2004 and April 2012 (i.e. during eight consecutive influenza seasons).
A preliminary global phylogenetic analysis was also made of a larger dataset constructed by aligning the sequences of these 202 patient isolates with 307 other A(H3N2) HA sequences from isolates collected throughout the world and obtained from the National Center for Biotechnology Information (NCBI: http://www.ncbi.nlm.nih.gov) GenBank or the Global Initiative on Sharing All Influenza Data (GISAID) EpiFlu database (platform.gisaid.org/epi3/). Their identification numbers are listed in S1 Table. HA amplification and sequencing The A(H3N2) HA gene was amplified directly from the clinical specimens. Total RNA was extracted from the respiratory samples using the Nuclisens easyMAG automated extraction kit (BioMerieux, Lyon, France), and the sequences were obtained by means of an RT-PCR assay specific for a 882 bp fragment (nt. 174-1,056) in the HA1 domain [7]. The PCR products were purified using a Microcon-100 microconcentrator in accordance with the manufacturer's instructions (Millipore, Bedford, MA, USA), and then sequenced using a BigDye Terminator Cycle-Sequencing kit (Applied Biosystems, Foster City, USA) and ABI Prism 3100 DNA sequencer (Applied Biosystems, Foster City, USA).

Likelihood mapping analysis
The phylogenetic signal of each sequence dataset was investigated by means of the likelihoodmapping analysis of 10,000 random quartets generated using TreePuzzle. All of the three possible unrooted trees for a set of four sequences (quartets) randomly selected from the dataset were reconstructed using the maximum likelihood approach and the selected substitution model. The posterior probabilities of each tree were then plotted on a triangular surface so that the dots representing the fully resolved trees fell at the corners and those representing the unresolved quartets in the centre of the triangle (star-tree) [8]. Using this strategy, which has been described in detail elsewhere [9], the data are considered unreliable for phylogenetic inference when more than 30% of the dots are in the centre of the triangle.

Path-O-Gen analysis
The clock-like signal of the dataset was investigated using Path-O-Gen software (freely available at http://tree.bio.ed.ac.uk/software/pathogen/), which analyses the correlations between time and the root-to-tip distances of a tree constructed without assuming a molecular clock.

Phylogenetic analysis
The sequences were aligned using CLUSTALW (integrated within the Bio-Edit sequence editor by Tom Hall, 2001; http://www.mbio.ncsu.edu/BioEdit/bioedit.html). The best-fitting nucleotide substitution model was estimated using JModeltest [10], and selected a GTR model [11] with gamma-distributed rates among sites.
The phylogenetic tree, model parameters, evolutionary rates and population growth were co-estimated using a Bayesian Markov Chain Monte Carlo (MCMC) method implemented in the BEAST package v.1.74 [12].
A strict clock and an uncorrelated log-normal relaxed clock model were both implemented under a GTR + G substitution model. A Bayes factor (BF, using marginal likelihoods) implemented in Beast selected the best-fitting models [13]. In accordance with [14], only values of 2lnBF 6 were considered significant. A less restrictive Bayesian skyline plot (BSP, a nonparametric piecewise-constant model) was used as coalescent prior. Two independent MCMC chains were run for 30 million generations (with sampling every 3,000 th generation), and were combined using the LogCombiner 1.74 included in the BEAST package. Convergence was assessed on the basis of the effective sampling size (ESS) after a 10% burn-in using Tracer software version 1.5 (http://tree.bio.ed.ac.uk/software/tracer/). Only ESS's of 200 were accepted.
Uncertainty in the estimates was indicated by 95% highest posterior density (95% HPD) intervals.
The obtained trees were summarised in a maximum clade credibility tree using the Tree Annotator program included in the BEAST package, and the tree with the maximum product of posterior probabilities (maximum clade credibility: MCC) after a 10% burn-in was displayed using Figtree version 1.3.1) (http://tree.bio.ed.ac.uk/software/figtree/).

Selection pressure
Tests for positive selection were conducted on the Datamonkey server [15] using the single-likelihood ancestor (SLAC), fixed-effects likelihood (FEL), internal branch fixed-effects likelihood (IFEL), mixed effects model of evolution (MEME), and fast unconstrained Bayesian approximation (FUBAR) methods, and the dN/dS ratios were calculated using the SLAC and FEL codonbased maximum likelihood approaches. SLAC counts the number of non-synonymous changes per non-synonymous site (dN) and tests whether it is significantly different from the number of synonymous changes per synonymous site (dS). FEL estimates the ratios of non-synonymous to synonymous changes for each site in an alignment [16]. The IFEL method is similar to FEL, but tests site-by-site selection for only along internal branches of the phylogeny. In order to avoid an excessive false-positive rate, sites with SLAC, FEL, IFEL and MEME p-values of <0.1 and a FUBAR posterior probability of >0.90 were accepted as candidates for selection. The property informed model of evolution (PRIME) was designed to take into account the biochemical properties of the amino acids: it works using the same conceptual frameworks as FEL and MEME but, unlike FEL and MEME, allows the non-synonymous substitution rate β to depend on which residues are being exchanged as well as on the site in question.

Likelihood mapping and root-to-tip regression analysis
The likelihood mapping of 10,000 random quartets showed that more than 87.5% were distributed at the corners of the likelihood map and only 9.8% in the central area, thus indicating that the dataset contained sufficient phylogenetic information in S1  In order to reconstruct the population dynamics of the A(H3N2) epidemics in Italy, we separately analysed the Italian isolates with a known month of isolation, and estimated the evolutionary rate.

Evolutionary rate estimates
A strict and a relaxed (log-normal) molecular clock model were implemented under two less stringent Bayesian coalescent models: a classical Bayesian skyline plot (BSP) and a GMRF skyride. As a BF test showed that the relaxed clock was significantly better than the strict clock (2lnBF = 93.6) and the BSP did not fit the data any better than the GMRF skyride (BSP: 2lnBF = 9, not significant), the skyride coalescent and relaxed clock models were selected for the reconstruction of time-scale phylogeny and population dynamics. Under these conditions, the estimated mean evolutionary rate of the HA sequences analysed using both the restricted Italian and the larger global alignments was 3.8×10 −4 subs/site/month (95% HPD: 2.7-4.7×10 −4 ).

Bayesian dated tree analysis
A Bayesian time-scaled tree reconstructed using the Italian isolates (Fig 1) showed that the Italian strains segregated into at least five significant clades ( Table 1) (Table 1), whereas the three isolates from the preceding epidemic seasons were at the outgroup of these subclades.
The estimated times of the most recent common ancestor (tMRCA) of the main clades are shown in Table 2.
The estimated tMRCA of the tree root was September 2003 (95% credibility interval: May 2003-December 2003). More detailed estimations of subclades TMRCAs are reported in S2 Table. Clade    Finally, the PRIME method was used to estimate the biochemical properties preserved or modified by the evolutionary process. There were two residues (positions 170 and 173) with conserved properties, whereas codon 225 had changed properties.

Population dynamics
Analysis of the skyride curve indicating the size of the viral population in a calendar timescale (Fig 4) showed an alternation of periods of exponential growth (corresponding to the epidemic  Clades A, B and E harboured sequences that were only identified in a single season, whereas clades C and D also included sequences isolated in preceding and/or subsequent seasons, which explains why there were sometimes two or more clades co-circulating in the population.   [19], and segregated into at least five subclades. In general, the tMRCA estimates of the major clades suggest that a new viral strain originates several months or even years (6-31 months) before its effective circulation in the Italian population, thus supporting the importance of global migration in seeding the epidemics in Italy [20,21].
A total of 10 significant subclades of exclusively Italian isolates encompassed 3-15 sequences. Most of the Italian groups originated and became extinct in the same season, thus suggesting that they represent local epidemiological networks; however, a highly significant subclade (including five Italian isolates within Italian clade D) circulated from 2007 to 2009. This finding may of course be simply due to a casual absence of sequences of this clade from the international databanks but, more interestingly, it may indicate that influenza A viruses can sometimes "over-summer" also in the northern hemisphere and persist for a few years. The possible existence of local reservoirs capable of perpetuating viral circulation and supporting the year-round survival of a local strain even in temperate zones deserves more specific investigations [22][23][24][25].
The study of selection pressure showed that most of the amino acid substitutions distinguishing the viral clades and subclades were neutral. Only a minority were under positive selection, thus suggesting that the forces driving the local evolution of influenza A viruses are mainly stochastic. The estimated overall mean dN/dS ratio was in line with the values reported by others [26,27]. When estimating site-by-site rate variation, a SLAC counting method used to infer selection but this did not identify any sites under positive selection pressure. Our analysis showed that one codon (144) had signatures of episodic diversifying selection using both SLAC and MEME models, but codon 193 was only identified by the MEME model. Codon 193 is located in the 190-helix of the RBS but, as suggested by [28], substitution S193F is associated with a negligible impact on receptor binding activity. The IFEL method indicated the presence of six codons under positive selection pressure, thus suggesting a variation within the lineages: half of these codons were included in the antigenic sites. Finally, the PRIME analysis identified residue 225 as being under positive selection pressure. Additionally, the changes (D$N/G) observed at the 225 site cause the changing of physicochemical properties of the residue. This position has also been reported to be responsible for altered receptor binding affinity [28].
Analysis of evolutionary population dynamics using a Bayesian skyride plot showed the typically fluctuating trend associated with the seasonal characteristic of influenza A outbreaks, with peaks corresponding to the major A(H3N2) epidemics. The greatest exponential growth was observed between summer 2007 and winter 2009, with a peak in November 2008 that was immediately followed by an evident bottleneck due to pandemic A(H1N1)09 virus in the following seasons. The epidemic resumed to grow in 2011, and reached a new peak in winter 2011-2012.
In conclusions, on the basis of our findings, it is clear that the main forces driving the local evolution of influenza A(H3N2) virus are stochastic [29], and mainly due to multiple founder effects. In any given season, different viral strains are imported from other geographical areas, and the epidemic is then amplified by local transmission networks, as in the case of A(H1N1) pdm09 [18]. This genetic drift leads to the loss of genetic heterogeneity within a population, but a large genetic distance between populations. This is particularly evident in season 2006-2007, when a viral variant (clade D) that probably entered Italy in July 2006 became the predominant strain, and was still being isolated in 2010-2011. During the same period, a purely Italian clade encompassed the seasons from 2007 to 2009. Important selective sweeps sometimes cause a dramatic change in the predominant strain, such that observed during the 2011-2012 season. In this case, the absence of A(H3N2) viruses in previous seasons due to the prevalent circulation of the pandemic A(H1N1)09 strain caused the rise of a completely new strain that showed several positively selected sites. This complex evolutionary behaviour can be explained by interactions between the immune status of the population, migrations and selective sweeps [30].
The selection of a new strain characterised by multiple changes that allow it escape immune defences causes the rapid spread of infection within a susceptible population, but the consequent increase in the percentage of immunised subjects rapidly clears it. On the contrary, the rise of a strain with only minor neutral modifications in comparison with the previous season, and therefore circulating in a partially immune population, spreads more slowly and can persist longer at both global and local level. This seems to be confirmed by the observation that the most enduring strains are those showing the weakest selective pressure.
Supporting Information S1 Fig. Likelihood mapping of the 202 A(H3N2) influenza virus HA gene sequences. Each dot represents the likelihoods of the three possible unrooted trees for each quartet randomly selected from the dataset: the dots near the corners or the sides respectively represent tree-like (fully resolved phylogenies where one tree is clearly better than the others) or network-like phylogenetic signals (three regions in which it is not possible to decide between the two topologies). The central area of the likelihood map represents a star-like signal (the region in which the star tree is the optimal tree). The numbers indicate the percentage of dots.  Table. tMRCAs with credibility intervals (95%HPD) and the corresponding months of the tree root and A(H3N2) Italian clades. (DOCX)