Reconstruction of the Evolutionary Dynamics of the A(H1N1)pdm09 Influenza Virus in Italy during the Pandemic and Post-Pandemic Phases

The aim of this study was to reconstruct the evolutionary dynamics of the A(H1N1)pdm09 influenza virus in Italy during two epidemic seasons (2009/2010 and 2010/2011) in the light of the forces driving the evolution of the virus. Nearly six thousands respiratory specimens were collected from patients with influenza-like illness within the framework of the Italian Influenza Surveillance Network, and the A(H1N1)pdm09 hemagglutinin (HA) gene was amplified and directly sequenced from 227 of these. Phylodynamic and phylogeographical analyses were made using a Bayesian Markov Chain Monte Carlo method, and codon-specific positive selection acting on the HA coding sequence was evaluated. The global and local phylogenetic analyses showed that all of the Italian sequences sampled in the post-pandemic (2010/2011) season grouped into at least four highly significant Italian clades, whereas those of the pandemic season (2009/2010) were interspersed with isolates from other countries at the tree root. The time of the most recent common ancestor of the strains circulating in the pandemic season in Italy was estimated to be between the spring and summer of 2009, whereas the Italian clades of the post-pandemic season originated in the spring of 2010 and showed radiation in the summer/autumn of the same year; this was confirmed by a Bayesian skyline plot showing the biphasic growth of the effective number of infections. The local phylogeography analysis showed that the first season of infection originated in Northern Italian localities with high density populations, whereas the second involved less densely populated localities, in line with a gravity-like model of geographical dispersion. Two HA sites, codons 97 and 222, were under positive selection. In conclusion, the A(H1N1)pdm09 virus was introduced into Italy in the spring of 2009 by means of multiple importations. This was followed by repeated founder effects in the post-pandemic period that originated specific Italian clades.


Introduction
In March 2009, a novel swine-derived A(H1N1) influenza virus -A(H1N1)pdm09 -emerged in Mexico and started spreading across the globe, prompting the World Health Organisation (WHO) to raise the level of influenza pandemic alert to phase 6 (WHO -available at: http://www.who.int/csr/disease/swineflu/ en/). Despite the rapidity with which the virus reached a large number of countries in the world, its transmission was initially sustained only in a subset of countries, particularly the USA and temperate countries in the southern hemisphere in which winter influenza transmission was ongoing and a full A(H1N1)pdm09 influenza epidemic was observed. The pandemic strain quickly became the predominant circulating influenza virus and replaced seasonal strains in most countries.
There was considerable heterogeneity in the pattern of A(H1N1)pdm09 spread in Europe. The UK experienced a substantial first wave of transmission in the early summer, followed by a second in the autumn, whereas most European countries (including Italy) experienced only limited transmission before the summer and a single wave in the autumn of 2009 [1]. A second epidemic wave was recorded during the post-pandemic period (November 2010-March 2011) during which the influenza A(H1N1)pdm09 virus was responsible for the majority of infections.
A(H1N1)pdm09 is a novel reassortant virus containing genes from the North American triple reassortant swine viruses and neuraminidase (NA) and matrix (M) genes derived from Eurasian swine viruses. It had probably been circulating undetected among swine during the previous decade, but only recently emerged among humans [2,3].
The emergence and subsequent rapid global spread of this influenza virus provided a unique opportunity to observe the evolutionary population dynamics of the first influenza pandemic virus after forty years, particularly in regions where virological surveillance is comprehensive, closely matched to the well-defined chronology of epidemic waves, and related disease surveillance (available at the Italian Influnet website: http://www.iss.it/iflu/). The molecular characterisation of the A(H1N1)pdm09 pandemic revealed seven globally distributed main clades [4].
The aim of this study was to reconstruct the evolutionary dynamics of the A(H1N1)pdm09 influenza virus in Italy during two epidemic seasons (2009/2010 and 2010/2011) in the light of the forces driving viral evolution.

Ethics statement
According to the Regional Surveillance and Preparedness Plan (DGR IX/1046, 22 Dec. 2010 and DGR 5988, 30 Jun 2011), diagnostic and clinical management of patients admitted at hospitals in the Lombardy Region with severe and moderate ILI included prospective influenza A detection, subtyping and sequencing. These activities were centralized at the two regional reference laboratories (S.S. Virologia Molecolare, Fondazione IRCCS Policlinico San Matteo, Pavia, and Dipartimento di Scienze Biomediche per la Salute, Università degli Studi di Milano, Milan). Mild respiratory infections were collected by sentinel practitioners and anonymously analyzed at the reference laboratory in Milan, in the frame of the National Surveillance Plan (Influnet). Data were analyzed anonymously according to a Regional Surveillance and Preparedness Plan. Mild ILI were collected and analyzed within the National Surveillance Plan (Influnet), following approval by the Ethic Commitee of Fondazione IRCCS Policlinico San Matteo, Pavia.

Dataset and patients
Within the framework of the Italian Influenza Surveillance Network, nasal swabs (NS) or broncho-alveolar lavages (BAL) were collected from outpatients with the symptoms of influenza-like illness (ILI) and hospitalised patients suffering from severe respiratory syndromes.
A dataset was constructed that included 227 HA gene sequences (835 nucleotides in length, positions 121-954) obtained from as many A(H1N1)pdm09-positive patients, whose characteristics are shown in Table S1.

A(H1N1)pdm09 detection and HA sequencing
Total RNA was extracted from the respiratory samples using the NuclisensH easyMAG TM automated extraction kit (BioMerieux, Lyon, France), and a virological diagnosis of A(H1N1)pdm09 infection was made by means of a real-time reverse-transcriptase Phylodynamics of A(H1N1)pdm09 Virus in Italy PLOS ONE | www.plosone.org polymerase chain reaction (RT-PCR) assay [5]. The A(H1N1)pdm09 HA gene was amplified directly from the clinical specimens using a SuperScriptIII TM One-step RT-PCR amplification kit (Invitrogen, Carlsbad, USA) and an RT-PCR assay specific for a 995 bp fragment (nt. positions 64-1,058) in the HA1 domain [6]. The PCR products were purified using a Microcon-100 microconcentrator in accordance with the manufacturer's instructions (Millipore, Bedford, MA, USA), and the purified products were sequenced using a BigDye Terminator Cycle-Sequencing kit (Applied Biosystems, Foster City, USA) and ABI Prism 3100 DNA sequencer (Applied Biosystems, Foster City, USA).

Phylogenetic analysis
The sequences were aligned using CLUSTALW (integrated within the Bio-Edit sequence editor by Tom Hall, 2001; http:// www.mbio.ncsu.edu/BioEdit/bioedit.html). The best-fitting nucleotide substitution model was estimated by means of JModeltest [7], and selected an HKY model [8] with gamma-distributed rates among sites.
The phylogenetic tree, model parameters, evolutionary rates and population growth were co-estimated using a Bayesian Markov Chain Monte Carlo (MCMC) method implemented in the BEAST v.1.54 package [9]. Statistical support for specific clades was obtained by calculating the posterior probability of each monophyletic clade. As coalescent priors, we compared four simple parametric demographic models (constant population size, and exponential, expansion and logistic population growth) and a piecewise-constant model, the Bayesian skyline plot (BSP) under both a strict and a relaxed (uncorrelated log-normal) clock [10].
Two independent MCMC chains were run for 100 million generations with sampling every 10,000 th generation, and were combined using the LogCombiner 1.54 included in the BEAST package. Convergence was assessed on the basis of the effective sampling size (ESS) after a 10% burn-in using Tracer software version 1.5 (http://tree.bio.ed.ac.uk/software/tracer/). Only ESS's of $200 were accepted. Uncertainty in the estimates was indicated by 95% highest posterior density (95% HPD) intervals, and the best-fitting models were selected using a Bayes factor (BF with using marginal likelihoods) implemented in BEAST [11]. In accordance with Kass and Raftery [12], the strength of the evidence against H 0 was defined as 2lnBF,2 = none; 2-6 = weak; 6-10 = strong; and .10 = very strong. A negative 2lnBF indicates evidence in favour of H 0 . Only values of $6 were considered significant. The trees were summarised in a target tree using the Tree Annotator program included in the BEAST package, choosing the tree with the maximum product of posterior probabilities (maximum clade credibility) after a 10% burn-in.
The basic reproductive number (R 0 ), indicating the mean number of secondary cases generated by a single primary case, was estimated on the isolates sampled during the pandemic period. It was calculated on the basis of the exponential growth rate (r) using the equation R 0 = rD+1, where D is the average duration of infectiousness [13], assuming a generation time similar to that of other pandemic viruses [14,15]. The doubling time of the epidemic was given by the relation l = ln(2)/r [16].

Bayesian phylogeography
The spatial reconstruction was obtained by means of the same Bayesian framework using a continuous time Markov Chain (CTMC) implemented in BEAST [17] over discrete sampling locations, and applying a Bayesian stochastic search variable selection (BSSVS) model, which allows the diffusion rates to be zero with a positive prior probability. Comparison of the posterior and prior probabilities of the individual rates being zero provided a formal BF to test the significance of the linkages between locations. Rates yielding a BF of .6 were considered well supported and formed the migration pathway.

Selection pressure analysis
The d N /d S ratio (v) was estimated using the maximum likelihood (ML) approach under a global single-ratio model implemented in the HyPhy program [18]. In particular, the global model (which assumes a single selective pressure for all branches) was compared with the local model (which allows selective pressure to change along every branch) using the  likelihood ratio test (LRT). The second model was not better than the first in any of the patients. Site-specific positive and negative selections were estimated using three different algorithms: single likelihood ancestor counting (SLAC), derived from the Suzuki-Gojobori approach [19]; fixed-effects likelihood (FEL), which fits an v ratio to every site and uses the likelihood ratio to test whether d N ? d S ; and random effect likelihood (REL), a variant of the Nielsen-Yang approach [20], which assumes the existence of a discrete distribution of rates across sites, and allows both d S and d N to vary independently site-by-site. The three methods were described in more detail elsewhere [18].
Finally, in order to investigate whether the sampled sequences have been subjected to selective pressure at population level (i.e. along internal branches), an internal fixed effects likelihood (IFEL) method [21] was also used.
In order to select the sites under selective pressure, we assumed a p value of #0.1 or a posterior probability of $0.9. The likelihood ratio test (LRT) was used to compare the performances of the M0 (one-ratio), M1 (nearly-neutral) and M2 (selection) models. Hyphy software was used for all of the analyses, some of which were made using the web-based Datamonkey interface (http://www.datamonkey.org/) [22].

Global phylogenetic analysis and population dynamics of the Italian A(H1N1)pdm09 epidemics
The maximum likelihood and Bayesian analyses of the global data set of 561 A(H1N1)pdm09 isolates (227 from Italy and 334 from all over the world) showed that the Italian isolates clustered into five significant groups (pp.0.9). The clusters included a total of 136 isolates, representing 59.9% of the Italian isolates and 100% of those of the post-pandemic season (2010/2011), whereas the isolates of the pandemic season (2009/2010) were interspersed with sequences from other countries ( Figure 1).
In order to reconstruct the population dynamics of the A(H1N1)pdm09 epidemics in Italy, we separately analysed the Italian isolates with a known month of isolation, and estimated the evolutionary rate.

Evolutionary rate estimates
A strict and a relaxed (log-normal) molecular clock model were implemented under the less stringent Bayesian skyline plot demographic model. The marginal likelihood comparison showed that the relaxed clock did not fit the data significantly better than the strict clock (2lnBF = 9.85). Moreover, the lower 95% HPD limit of the coefficient of variation and the evolutionary rate standard deviation estimates were always very small (respectively 1.01610 25 and 1.3610 24 ), thus indicating that the evolutionary rate varied only slightly over the branches of the tree. For these reasons, the strict clock model was selected for all of the subsequent analyses.

Temporal dynamics
The Bayesian time-scaled tree of the 227 A(H1N1)pdm09 HA gene sequences collected during the two A(H1N1)pdm09 epidemics showed that the isolates sampled in the pandemic season were mainly interspersed at the root of the tree, with only two significant clades respectively including 24 (clade E, pp = 0.93) and six isolates (clade F, pp = 0.97) sampled between August 2009 and January 2010. These two clades were also present in the global tree, but included multiple isolates from different countries in the world (Figure 1). On the contrary, most of the isolates collected in the post-pandemic season grouped into four highly significant clades (A-D) and a number of sub-clades, largely corresponding to those identified in the global tree.  (Table 1).

Signature amino acid substitutions in the Italian clades and HA selection pressure analysis
The mean genetic distance was 0.3% (substitutions per 100 sites) (60.1%) within the group of isolates obtained in the pandemic season, and 1% (60.2%) within the group of isolates obtained during the post-pandemic season. The mean distance between the isolates of the two seasons was 0.9% (60.2%).
A total of 16 codons showed mutations affecting more than 30% of the isolates included in at least one clade. Table 2 shows the clade-specific frequency of amino acid modifications at each site. Some codons were mutated in 100% of the isolates belonging to a specific clade: codons 205, 216 and 249 in clade A, 185 in clade B, 125 in clade C, 134 and 183 in clade D, 222 in clade E, and 32 in clade F. The most frequent mutation in our isolates was a D to N (or less frequently to K or H) substitution at codon 97 affecting the majority of clade A (98.6%) and B strains of the post-pandemic season (81.6%). Interestingly, codon 222 was mutated in 100% of the clade E strains (substitutions from D to E in all cases but one, in which the mutant amino acid was G) and was also sporadically mutated in 11 isolates dispersed in different clades although, in these cases, the mutant amino acid was always different from E.
The ML estimate of the d N /d S ratio (v) gave a mean value of 0.43 (95% CI: 0.35-0.52), with no significant difference between different lineages (LRT from a global and local model 2Dlikelihood = 161.7, p.0.1). Using the Nielsen-Yang approach, we found that a model of evolution assuming site-specific selection fitted our data better than a neutral evolution model (M1-2Dlikelihood = 13.06, p = 0.001 by LRT), and analysis of site-bysite selection pressure revealed that most of the methods used supported two sites under positive selection at a 90% level of significance (Table 3): codons 97 (from D to N/K/H) and 222 (from D to E/G/Y/N).
Six sites were selected along the internal branches (Table 3), mainly corresponding to previously described clade-specific substitutions (codons 97, 125, 141, 185, 249 and 297). In-terestingly, codon 222 was not one of the sites selected at population level.

Population dynamics
Bayes factor comparison of four simple parametric (constant population size, exponential, expansion and logistic growth) and one piecewise demographic model (BSP) showed that the last fitted the data better than the others (Table S2)

Local phylogeography
The local phylogeographical analysis was made by grouping the isolates on the basis of their sampling locations and building a spatial scaled phylogeny using the Bayesian framework. The location-annotated tree is shown in Figure S1.
Analysis showed that the first season of the epidemic was characterised by ancestors localised in Milan (the most probable location of the tree root) and in the area north of the city: the MRCA of clade E was most probably located in Milan (pp = 0.76), but the isolates included were from different places in Lombardy, whereas clade F was more restricted to Milan city and the hinterland. The post-pandemic season showed a more dispersed origin of the clades, which were localised in both northern and southern Lombardy: clades A and D most probably originated in the northern area of Milan (pp = 0.37), whereas clades B and C had MRCAs most probably located in southern Lombardy (pp = 0.37 and 0.34). The isolates included in all of the clades were dispersed throughout the region, including the southern part. The first wave of infection originated in localities with a high population density ($382 inhabitants/km 2 ), whereas the second wave involved less densely populated places (#250 inhabitants/ km 2 ) (Figure 4).

Discussion
We used a sophisticated Bayesian evolutionary framework for the molecular characterisation and reconstruction of the phylodynamics of the A(H1N1)09pdm influenza virus in northern Italy during two epidemics: the pandemic between summer and autumn 2009, and the post-pandemic period between November 2010 and March 2011.
In order to describe the Italian epidemics in the setting of the widespread diffusion of infection throughout the world, we analysed a total of 227 newly characterised northern Italian strains and a series of reference isolates from different countries retrieved from public databases. The first analysis included all of the patients' and reference isolates (global tree), and the second only the Italian strains, using a time-scaled phylogeny that assumed a strict molecular clock model.
Analysis of the trees suggested that the spread of the virus in Italy was the result of multiple independent introductions from different geographical areas because the Italian isolates sampled during the pandemic season were interspersed with sequences sampled in other countries at the root of the tree, and did not form any significant pure Italian clusters. The only exceptions were two clades in the Italian tree (E and F) that tended to group together in the global tree and included a number of isolates from other countries. Clade E was characterised by the presence of the signature substitution D222E, which has been previously described as circulating in the UK between July and September 2009 [23], thus suggesting possible importation from this country.
On the contrary, four or five highly significant pure Italian clades connected to the tree with long branches were observed during the post-pandemic season. Clade C was split into two different highly significant groups in the global tree, and clades A and D included a single non-Italian strain (one Turkish and one Tunisian).
These data suggest that multiple initial introductions of A(H1N1)09pdm in 2009 were followed by founder effects causing the local amplification of the infection in the post-pandemic season. In line with this hypothesis, the isolates of the postpandemic wave showed greater intra-seasonal genetic divergence from those of the pandemic period, probably because of the typical effect of genetic drift on genetic variability, which tends to be less within groups but greater between groups.
In order to reconstruct the population dynamics of the H1N1 pandemic in Italy on a calendar time scale, we estimated the evolutionary rate of the Italian isolates using a better fitting strict molecular clock implemented in the Bayesian framework. Our estimates ranged from 3.5610 23 to 6.4610 23 substitutions/site/ year, and were in line with those recently estimated for the HA gene by other authors [3,24].
The tMRCA estimate of the tree root dating back to February 2009 is in line with the majority of the previous estimations, which place the origin of the pandemic H1N1 strain in January 2009,  [3,25,26]. This pre-dates the first Italian identification of A(H1N1)09pdm infections and suggests that, also in Italy, unidentified infections in returning travellers may have occurred before the epidemiological alert [23]. In line with these epidemiological data, the tMRCAs of the four specific Italian clades were dated spring 2010 (between March and May), and the radiation of the Italian strains (corresponding to the tMRCAs of the nodes inside the clades) were dated late summer and autumn 2010. Moreover, coalescent-based population dynamics revealed two phases in the exponential growth of the effective number of infections corresponding to the two seasons. The first exponential growth phase was between May and December 2009, and the second was between October 2010 and January 2011. The estimation of the basic reproductive number of the first pandemic period, gave a R 0 close to the lower confidence limit estimated by Fraser, confirming the limited potential to spread of A(H1N1)pdm09, in comparison with previous pandemics [25].
In order to reconstruct the geographic dispersion of the infections, the northern Italian isolates were grouped on the basis of their place of isolation, and an estimate was made of the genetic flows between the different geographical areas. The analysis showed that the isolates obtained during the first pandemic season most probably originated in areas with high population densities, such as Milan and its north-western hinterland where there are important international airports, whereas the isolates of the second season were more dispersed and most probably originated in smaller and less densely populated areas such as southern Lombardy. Given the characteristics of the urbanisation of this area, these localities may represent the most probable geographical areas, in which the founder effect occurred, a hypothesis that is supported by the phylogeographical tree. This suggests that the geographical dispersion of A(H1N1)pdm09 was characterised by possibly gravity-like dynamics in which larger cities act as attractors and drive the spread of infection to their smaller counterparts [27][28][29].
The implementation of innovative advanced phylogenetic analysis methodologies to the study of emerging viruses at a molecular level will be of fundamental importance to improve concretely the epidemiological surveillance of emerging and reemerging infections. Our present data shed light on the relationships between the evolutionary and phylogenetic characteristics of influenza A(H1N1)2009 virus and its geo-epidemiology, allowing to estimate essential parameters such as the transmission potential of the virus or the most probable path of geographical dispersion. Indeed, the opportunity of studying the emerging pathogens and analyzing their ecology, diffusion and evolution will allow generating an early response in case of outbreaks, particularly in a pandemic caused by viral pathogens able to spread quickly in the human population. The rapidity with which the history of the origin and dispersion of A(H1N1)pdm09 has been reconstructed on the basis of the available genomes [3,25] it is an example of the power of the ''phylodynamic'' approach in the Public Health.
We also investigated the positive selection pressures acting on HA protein. Various algorithms revealed evidence of positive selection in the terminal and internal branches of six codons, most of which were fixed in the sequences of viruses belonging to different clades. The codons in the influenza HA gene identified as being positively selected are presumably encoding amino acid replacements that allow the virus to evade existing population immunity. In particular, two positions were observed with over 90% of significance. The D97N change was previously observed in influenza strains recovered from patients with fatal cases circulated in England [30] and in Indian isolates in co-occurrence with mutation E374K [31]. As regard the 222 residue, our data are in agreement with previous publications that report a positive selection on this site [30,32,33].
Particular attention was given to positions 187 and 222, which have recently been extensively studied because of their importance in receptor binding preference and cross-specific shifts [34][35][36][37][38]. In addition, the substitutions D222G/N/Y have been associated with severe influenza infection [36,[38][39][40][41], probably because of preferential binding to the a2-3 sialic acid receptor in lung tissue rather than to the a2-6 sialic acid receptor in the upper airways. The majority of A(H1N1)pdm09 viruses circulating during the pandemic did not bear the D222G/N/Y mutation, whereas the E change was present in all of the strains in clade E and was not associated with an increased risk of severe respiratory distress. However, it is worth noting that D222G/N/Y changes were observed in sequences belonging to all of the clades and that this position was under positive selection only in the terminal branches. The mutations at codon 222 associated with increased virulence were therefore not fixed in the population but seem to have emerged as a result of intra-patient selection. The segregation of these mutants in the lower respiratory tract and their lower affinity with sialic acid receptor (a2-6) are possible reasons for the unsuccessful spread of this virus. In conclusion, on the basis of all of these observations, we can hypothesise that the A(H1N1)pdm09 virus was introduced into Italy as a result of multiple importations by travellers coming from affected foreign areas. The initial transmission networks originated in the more densely populated locations in northern Italy between summer and autumn 2009, after which repeated founder effects occurred in more dispersed populations living in smaller cities and originated new specifically Italian clades that characterised the second season of the pandemic between November 2010 and March 2011. This suggests a possible gravity-like model of phylogeographical spread.