Intrasubtype Reassortments Cause Adaptive Amino Acid Replacements in H3N2 Influenza Genes

Reassortments and point mutations are two major contributors to diversity of Influenza A virus; however, the link between these two processes is unclear. It has been suggested that reassortments provoke a temporary increase in the rate of amino acid changes as the viral proteins adapt to new genetic environment, but this phenomenon has not been studied systematically. Here, we use a phylogenetic approach to infer the reassortment events between the 8 segments of influenza A H3N2 virus since its emergence in humans in 1968. We then study the amino acid replacements that occurred in genes encoded in each segment subsequent to reassortments. In five out of eight genes (NA, M1, HA, PB1 and NS1), the reassortment events led to a transient increase in the rate of amino acid replacements on the descendant phylogenetic branches. In NA and HA, the replacements following reassortments were enriched with parallel and/or reversing replacements; in contrast, the replacements at sites responsible for differences between antigenic clusters (in HA) and at sites under positive selection (in NA) were underrepresented among them. Post-reassortment adaptive walks contribute to adaptive evolution in Influenza A: in NA, an average reassortment event causes at least 2.1 amino acid replacements in a reassorted gene, with, on average, 0.43 amino acid replacements per evolving post-reassortment lineage; and at least ∼9% of all amino acid replacements are provoked by reassortments.


Introduction
The genome of influenza A virus consists of 8 segments, each represented by an RNA molecule.Coinfection of a cell by viruses of different genotypes occasionally leads to reassortments, i. e. formation of genotypes containing molecules from different sources.Most of the major Influenza A pandemics during the last century were caused by reassortant strains [1].Indeed, reassortments, especially those creating novel combinations of hemagglutinin (HA) and neuraminidase (NA) genes, may lead to radical changes in antigenic properties and give rise to viral types that escape the herd immunity.Still, after a reassortment event, the viral segments find themselves in a novel genetic environment, which may lead to disruption of coadaptations that previously existed between them and reduce viral fitness [2,3].
Thus, it is likely that only a small proportion of reassortment events lead to creation of novel, successful viral genotypes.Reassortments between different Influenza A subtypes gave rise to the major pandemics of the 1957, 1968, 2009, and possibly 1918 [1,4].Reassortments between strains belonging to a single subtype likely occur much more frequently than inter-subtype reassortments; however, they leave a less pronounced phylogenetic signal, and are therefore harder to study [5].
In theory, reassortments can be detected through the incongruencies between phylogenies of different segments of a viral genome.Indeed, after a reassortment, the segments obtained from the same viral isolate will occupy conflicting phylogenetic positions, due to the differences in their evolutionary histories.In practice, however, detecting reassortments is difficult.The influenza sequence databases are subject to ascertainment biases, with recent sequences being oversampled, and some countries sampled better than others.Reassortment events are prone to be missed when one or both parental strains are not sampled properly, or when they are closely related.Conversely, spurious reassortments may be inferred due to differences in phylogenies caused by phylogenetic noise.Multiple reassortments nested within a single clade compound the difficulties.
In most of the early studies, reassortments were inferred via manual detection of incongruencies between phylogenies of different viral segments [6][7][8][9].However, this approach is impractical for systematic analyses of large datasets of influenza genomes with complex reassortment histories.Recently, several methods for automatic detection of reassortments have been proposed.These methods can be broadly categorized into two groups.The distance methods [10,11] measure, for each viral segment, the degree of similarity between all pairs of viral genomes, and infer reassortments from the differences between the distance matrices obtained from different segments.The phylogenetic methods [4,[12][13][14] make explicit use of the evolutionary histories of individual segments, comparing their phylogenies and detecting incompatibilities between them.In general, phylogenetic methods are more robust than the distance methods [12,14], particularly in detecting reassortments that became fixed or reached high frequencies within the population.
Comparisons of relative frequencies of different progeny coming from coinfecting strains in experiments [3,[15][16][17][18][19][20] as well as phylogenetic analyses of circulating reassortant strains [10,21] demonstrate that reassortments between different segments are not equiprobable.Such differences likely arise, in part, from variance in the extent of epistatic interactions between pairs of genes.For example, a recent analysis of experimental data suggests that polymerase genes (PB1 and PA) tend to be inherited together, and that reassortment preferences for HA depend on the subtypes of the parental strains, while NA and matrix protein (MP) have no preferences in their reassortments with other segments [19].Analysis of distributions of coalescent times for different segments suggests, however, that reassortments between HA and NA are particularly frequent [22].
Besides radical shifts in antigenic properties, reassortments are associated with a reduction in genetic diversity of circulating strains, indicative of positive selection favoring the spread of the reassortant strain [22].Several observations also suggest that, subsequent to the ''antigenic shift'', reassortants tend to undergo increased ''antigenic drift'', i.e., elevated rate of amino acid replacements, perhaps due to follow-up coadaptation of genes that find themselves in new genetic environments [5,23].For example, a reassortment associated with host change has led to short-term positive selection in NS gene of swine Influenza A [24].Still, the phenomenon of coevolution of viral genes subsequent to reassortments has not yet been studied systematically.Here, we purport to close this gap.

Identification of reassortments
To study the within-subtype reassortments in the influenza H3N2 virus, we first used GiRaF [12] to automatically infer the reassortant taxa in the dataset of 1376 complete influenza H3N2 genotypes.GiRaF compares large pools of Monte Carlo-sampled phylogenetic trees constructed for each viral segment separately, inferring the topological incongruencies between them.Assuming that an incongruence between phylogenies of two segments, observed in a high fraction of comparisons, reflects an ancestral reassortment event, GiRaF then predicts the subsets of taxa that are descendant to such reassortments.The number of reassortments predicted by GiRaF in which a particular segment was involved was similar for all segments except M1 and NS1 (Table 1); in M1 and NS1, fewer reassortments were detected, probably because their shorter sequences and/or higher conservation (Table 1) lead to a weaker phylogenetic signal.We saw no preferences for particular pairs of segments to be reassorted (Table S1).
We then mapped the reassortments inferred by GiRaF onto the reconstructed phylogenies of each of the segments involved.We assumed that each reassortment event had happened on the phylogenetic branch leading to the last common ancestor of the reassortant taxa (reassortment-carrying branch, RCB).This way of mapping reassortments was self-evident for monophyletic sets of taxa.However, when the histories of reassortment events are complex (which is the case for Influenza A; [7,9]), the subsets of genotypes resulting from a single reassortment predicted by GiRaF are not necessarily monophyletic [12].Indeed, we found that in our dataset, many of the inferred sets of reassortant taxa were not monophyletic (Figure 1).To test the robustness of our conclusions, we therefore also tried alternative ways of mapping reassortment events, which also supported our key findings (see below).

Elevated rate of amino acid replacements after reassortments
We asked whether reassortments affect subsequent accumulation of amino acid-changing replacements.We hypothesized that subsequently to RCBs, the rate of accumulation of amino acid replacements will be temporarily elevated as the reassorted sets of genes coadapt to each other.
To address this, we inferred, for the genes coded by each of the 8 segments, the phylogenetic positions of all amino acid replacements, and studied those replacements that were descendant to at least one RCB.When a reassortment and a replacement occurred on the same phylogenetic branch, it was impossible to deduce which came first.To avoid this ambiguity, we considered a replacement to be descendant to a particular reassortment if it occurred on a phylogenetic branch descendant to an RCB, but not on the RCB itself.(Including the substitutions on the RCBs as descendants to reassortments gave similar results; see below.)Between 33% and 50% (depending on the segment) of the RCBs were terminal branches, and thus their effect on subsequent replacements could not be studied.14% to 33% more of the RCBs were pre-terminal branches.Amino acid replacements on the terminal branches, especially in RNA viruses, tend to be more deleterious, and may follow evolutionary patterns distinct from the replacements elsewhere on the phylogeny [25][26][27][28][29][30]; thus, we chose to exclude such replacements from our analyses.Therefore, the RCBs on pre-terminal branches could also have no effect on replacements in our dataset.The remaining 33% to 42% of RCBs could be followed by amino acid replacements, and in fact, most of the replacements occurred in reassortant clades.In NA, for example, 320 out of 411 (78%) observed non-terminal amino acid replacements were descendant to at least one RCB (Table 1).
To assess the effect of reassortments on subsequent accumulation of amino acid replacements, we measured the phylogenetic distance between each replacement and its most recent ancestral RCB.We then compared these distances to those expected if the

Author Summary
Influenza A is a rapidly evolving virus with genome composed of eight distinct RNA molecules called segments.This genetic structure allows formation of new combinations of segments when a cell is coinfected by multiple viral strains, in a process called reassortment.While ''antigenic drift'' -the process of continuous accumulation of point mutations that change the antigenic properties of the viral proteins -is mainly responsible for the seasonal flu, the heaviest pandemics were caused by spread of novel reassortant strains and the associated radical ''antigenic shift''.However, the association between these two types of processes has not been studied systematically.Here, we use the extensive available complete-genome sequencing data for Influenza A H3N2 subtype to infer the evolutionary timings of withinsubtype reassortment events, and study the patterns of point amino acid-changing replacements that followed reassortments.We find that reassortments were often rapidly followed by replacements, which possibly compensated for the loss of fitness associated with the reassortment or explored newly accessible fitness peaks.These findings may be relevant for prediction of future pandemic strains of Influenza A. phylogenetic positions of post-reassortment replacements were random in respect to reassortments.This approach is conservative, in that it ignores any possible long-term effect of reassortments spanning phylogenetic distances comparable with the height of the phylogenetic tree.
In two of the genes, NA and PB1, the mean distance was significantly lower than expected (Table 1), indicating that RCBs were followed by a transient increase in the rate of amino acid replacements.Since the inferred number of RCBs for each gene is moderate (e.g., 20 for NA, only 14 of which could have descendant replacements), individual RCBs could have a disproportionate effect on the distribution of distances.For the NA segment, we asked whether the observed accelerated evolution after reassortments is due to some single reassortment event.To this end, we repeated the analysis 14 times, each time excluding one of the RCBs, and comparing the observed and expected distances between the amino acid replacements and the remaining RCBs.In all 14 comparisons, the results remained significant, indicating that the acceleration of replacements is a phenomenon to which multiple reassortments contribute (Table 2).
To further test its robustness, we repeated the analyses using alternative methods of mapping reassortments onto the tree (Tables S2, S3, S4, S5, S6, S7).These methods differ in the strength of the required statistical evidence for inference of reassortments, and in the ways non-monophyletic sets of reassortant taxa are treated (see below).For NA, acceleration of evolution after reassortments was significant in 5 out of 6 analyses; in the sixth analysis, it was marginally significant (p = 0.09).For PB1, acceleration was significant in 3 out of 6 analyses.Moreover, in a number of analyses, acceleration was also observed in several other genes for which no significant result is observed in Table 1: M1 in 4 tests, and HA and NS1 in 3 tests each.Thus, 5 out of 8 genes show evidence for acceleration at least in half of the tests.Still, the results for NA are the most robust, and this is the gene for which the evidence for reassortments-caused-acceleration of evolution is the strongest.When the substitutions on the RCB itself were also counted as descendant to the reassortment, the results were similar (Tables S8, S9, S10, S11, S12, S13).
In PB1, the increase in the rate of amino acid replacements after the RCB is rather long-lived: it spans a phylogenetic distance of ,0.04 ds units, although the excess is not significant for most individual distance bins (Figure 2).In contrast, in NA, this increase is very brief, with most of the excess replacements observed on the very short phylogenetic branches that immediately follow the RCB (Figure 3).In particular, we observe 30 such replacements at phylogenetic distances up to ,0.003 ds units (which is just above the time it takes the NA gene to obtain a single synonymous replacement, and is therefore the highest phylogenetic resolution we can achieve; leftmost bin in Figure 3).Because virtually no such replacements would be expected to occur in such a short period of time if they had been independent of reassortments, all these excess 30 replacements are reassortment-provoked.Therefore, at least ,9% (30/320) of all amino acid replacements in NA were caused by reassortments; since some of the later replacements descendant to the RCBs could also be reassortment-provoked, the actual number is probably higher.Moreover, the fact that these replacements occur so fast implies that most of them are facilitated by positive selection, and thus comprise a post-reassortment adaptive walk [31,32].What is the length of such adaptive walks, i.e., the characteristic number of amino acid replacements provoked by an individual reassortment?The 30 ''fast'' replacements were descendant to 3 individual RCBs (Table 2); 11 more RCBs were not followed by replacements so soon, although they could be followed by reassortment-associated replacements later.Therefore, an average reassortment provoked ,2.1 (30/14) amino acid replacements in its descendant clade.However, these replacements could occur in multiple independent lineages, so that the number of postreassortment replacements per lineage was lower.Indeed, by the time the phylogenetic distance of 0.003 ds units after an RCB was reached, the descendant subtree had often multifurcated, so that these 3 RCBs gave rise to a total of 33 individual descendant lineages; 36 more lineages originated from the remaining 11 RCBs.In 2 out of the 3 RCBs that provoked replacements, different post-RCB lineages accumulated replacements independently (Table 2).As a result, over the evolutionary time of 0.003 ds units, an average post-RCB lineage attained 0.43 (30/69) reassortment-provoked replacements (Table 2).This is the lower boundary for the length of the ''adaptive walk'' per lineage associated with a reassortment event, as it excludes any effect of reassortments over longer timescales.

Distances to ancestral RCBs for different classes of amino acid replacements
We asked whether the post-RCB amino acid replacements are enriched in particular classes of mutations, compared to the rest of the replacements.In this analysis, we considered the NA gene, because in it, the effect of post-reassortment adaptive walk is the most robust; and also the HA gene, because it is the other primary determinant of antigenic properties, is known to evolve under continuous positive selection, and is highly epidemiologically relevant.
Several categories of mutations had biased phylogenetic distances from the RCBs, compared with the complementary sets (Table 3).Firstly, in NA, the replacements at amino acid sites experiencing positive selection tended to be farther from RCBs, while no such difference was observed for HA.Secondly, the    replacements at sites that distinguish the antigenic clusters [33] of HA tended to occur farther from RCBs.Thirdly, parallel replacements had a strong tendency to occur soon after the RCBs both in NA and HA.Fourthly, reversions in NA, but not in HA, occurred soon after the RCBs.Fifthly, the sites previously shown to be involved in intragenic epistatic interactions [34] as ''leading'' occurred father from RCBs both in NA and in HA, while the ''trailing'' sites occurred soon after the RCBs.Neither the NA nor the HA genes showed any bias for replacements at the epitopic sites.As was the case with the previous analysis, most of these results were also supported by alternative methods of inferring RCBs (Tables S14, S15, S16, S17, S18, S19); the exception were reversions in NA, which gave discordant results in different tests.

Discussion
Our study provides the first systematic analysis of association between reassortments and amino acid-level changes in influenza A. We show that a reassortment involving a particular segment provokes a transient increase in the rate of amino acid replacements at the gene encoded on this segment; these replacements tend to occur at sites that do not normally experience positive selection, and often involve parallel replacements.
One way to estimate the effect of RCBs on subsequent accumulation of amino acid replacements would be to directly compare the replacement rates on branches that had descended from reassortments and those that had not.However, this is not feasible in Influenza A, because in most segments, the majority of branches (50-85%) descend from at least one RCB (Table 1); the remaining branches tended to be deep-lying, and thus likely have biased patterns of replacements [29,30].Instead, we searched for a transient increase in the rate of amino acid changes on the branches descendant to the internal RCBs.This analysis could miss some of the very rapid reassortment-provoked changes that had happened on the same branches as the reassortments themselves.
We compared the phylogenetic distances between the amino acid replacements and the preceding RCBs to the null distribution expected under the assumption that the amino acid replacements were distributed over post-RCB, non-terminal branches randomly, with probability of a replacement to fall onto a particular branch proportional to the branch length.In NA and PB1 genes, the replacements occur sooner after the RCBs than in the null model.When alternative strategies of mapping reassortments were used, three other genes -M1, HA and NS1 -also show the same pattern in at least half of the tests.Since it is hard to validate the reassortment-mapping algorithms, the results of each individual analysis for each individual gene should be taken with some caution.Together, however, they provide strong support for the reassortment-caused accumulation of changes, especially in the NA gene.The post-RCB excess of amino acid replacements was not exclusively dependent on a single particular RCB (Table 2); thus, it seems to be a universal phenomenon.
Non-uniformity of the substitution rate has been described previously for Influenza, and has been attributed to episodic action of positive selection [35], to simultaneous fixation of multiple interacting advantageous mutations [36] or to frequent selective sweeps under clonal interference.There is no obvious mechanistic association between reassortments and either of these factors.If reassortments themselves are adaptive, they can spread through population rapidly by means of positive selection [22].Strong positive selection favoring an allelic variant may also drive to fixation neutral and even mildly deleterious point mutations linked with it [37]; this phenomenon is prevalent in Influenza A which evolves, on the within-segment level, under nearly complete linkage [38], and where clonal dynamics is largely determined by linkage with beneficial alleles [38,39].This phenomenon could cause an excess of replacements on the same branches as the reassortment events, which is, however, not observed in our data (data not shown).There is no way hitchhiking can cause accumulation of replacements on the branches descendant to RCBs.Therefore, the only feasible explanation for the post-reassortment increase of the rate of amino acid replacements seems to be that they constitute an adaptive walk [31,32], i.e., a burst of positively selected adaptive changes provoked by a shift in the fitness landscape.In NA, we observe a radical increase in the rate of amino acid replacements immediately after reassortments.This excess spans only a short period of time: it is mostly over by 0.003 ds units, i.e., by the time a single synonymous replacement occurs somewhere in the gene (which takes less than a year; [40]).
Thus, at least NA, and very likely other genes, experience transient positive selection after reassortment events.The excess of replacements at phylogenetic distances up to 0.003 ds (i.e., up to the time a single neutral replacement is expected to be accumulated at the gene) in NA suggest that a total of ,30 amino acid replacements in the entire phylogeny, or ,0.43 replacements per lineage, were facilitated by preceding RCBs.Therefore, reassortments are responsible for at least ,9% of all amino acid replacements in this gene.In fact, this may be an underestimate for at least two reasons.First, some of the reassortments were likely to have been undetected.Second, a fraction of the adaptive walks could have spanned longer phylogenetic distances than this threshold (Figures 2, 3).
What is the cause of the post-reassortment accumulation of positively selected replacements?Influenza A is a model system for studying positive selection, with most of the selective pressure exerted by the host immune system.Positive selection is most pronounced in the genes coding for the surface glycoproteins (NA and HA), and within these genes, at the epitopic sites which are most involved in the immune response.Conceivably, the immunity-driven positive selection could increase after a reassortment.We observe, however, that the post-RCB adaptive walk is mostly manifested at sites other than those under constant positive selection, or responsible for antigenic properties, and is not affected by the epitopic vs. non-epitopic location of the site (Table 3).This suggests that the post-reassortment adaptive walks are not driven by the pressure to evade the host immune system.
Rather, these replacements are probably associated with epistatic interactions between genes [41].In general, reassortments and host shifts lead to changes in the patterns of both synonymous and nonsynonymous substitutions, probably due to joint effects of changes in the mutation and selection [29,[42][43][44].After a reassortment event, a gene finds itself in a novel genetic environment which may, through epistatic interactions, exert novel selective pressures on its amino acid sites, facilitating further amino acid changes.An adaptive walk could compensate for the loss of fitness associated with the preceding reassortment [2]; however, as the reassortments themselves tend to be adaptive [22], it seems more likely that these replacements could exploit novel fitness peaks that have become newly accessible after the reassortments.
The evidence for the post-reassortment adaptive walk is the most robust for the NA gene.Sequence evolution of influenza NA does not always lead to changes in antigenic properties [45], and may be caused by other forces instead.Indeed, antibody-driven affinity-changing mutations in HA can be compensated by substitutions changing the activity of NA [46,47]; this indicates that the choice of the optimal NA genotype is dependent at least on HA, and possibly on other genes as well.Furthermore, reassortments often involve a currently circulating strain and an older strain [48], and a reassortment between HA and NA frequently involves an up-to-date variant of HA and an older variant of NA, as suggested by less discordance between sampling time and phylogenetic position of HA sequences than of NA sequences ( [7] and our data).Therefore, while the immune escape is the primary factor of evolution of HA, much of the NA evolution may be epistatic and, in particular, compensatory.
Parallel replacements are overrepresented after reassortments (Table 3).Overall, the rate of parallelism in Influenza A evolution is high [49,50], probably due to similar selective pressures exerted on different strains.The high parallelism observed in this study suggests that the replacements involved in a post-reassortment adaptive walk may also be adaptive in other contexts.
Epistatic interactions both within [51][52][53][54] and between segments [55] are wide-spread in Influenza A. One evolutionary manifestation of this phenomenon is positive epistasis between replacements: a replacement can facilitate subsequent replacements at different sites of the same protein [34].The sites involved in such epistatic interactions can be classified as ''leading'' or ''trailing'', depending on whether replacements in them tend to come as first or second in epistatic pairs; for example, replacements at leading sites can introduce radical changes to protein structure, while replacements at trailing sites may compensate those changes [34].We find that, while the replacements at leading sites are remote from reassortments, the replacements at trailing sites occur sooner after reassortments than expected.Therefore, the sites experiencing post-reassortment replacements are the same sites that also react to the change of the protein structure due to replacements elsewhere in the protein.This suggests that the class of sites denoted as ''trailing'' in [34], and involved in post-adaptive walk in this study, may be responsible for adaptation to novel genetic environment that stems from changes in the same gene as well as in other genes.
Association between reassortments and the rate of subsequent accumulation of amino acid mutations may be important for predicting future pandemic strains.For example, the avian H5N1 influenza is among the most likely candidates for the agent of a future pandemia [56][57][58].Naturally occurring strains of A/H5N1 are not transmittable between mammals; however, to become transmittable, they require just five additional mutations [59] or a reassortment with just four additional mutations [60].Two of these mutations are already frequent among the A/H5N1 viruses [61].If a reassortment commonly leads to accelerated accumulation of amino acid replacements, gaining the remaining mutations and evolving a natural mammalian-transmittable H5N1 strain may take less time than predicted by simple models [61].

Sequences and alignments
We downloaded all complete human H3N2 influenza A genotype sequences (N = 2205) available on 27.10.2011from the flu database [62].Nucleotide sequences for each segment were aligned using muscle [63,64].Genotypes containing truncated sequences, multiple unidentified nucleotides, or indels were discarded.We used CD-HIT [65] to cluster genotypes that had identical sequences of NA segments, and retained one random sequence from each cluster, thus retaining 1379 genotypes for further analysis.For segments encoding PB1, M1 and NS1 that contain overlapping ORFs, we excluded the overlapping regions, and analyzed the longest remaining ORF.All alignments are available at http://makarich.fbb.msu.ru/flu_walks/.

Phylogenetic analysis
For each segment, we Bayes-sampled the 1,000 phylogenetic trees using MrBayes MPI version [66][67][68] with the following settings: GTR+I+G model, 22 million iterations, sampling each 22,000th iteration.Three isolates: A/Ontario/RV123/2005, A/ Ontario/1252/2007 and A/Indiana/08/2011 were excluded from analyses, because we found the branch leading to the clade formed by them to be, for several segments of non-human origin (NP, M, NS, PB2 and PA), too long for a meaningful estimation of evolutionary parameters; these isolates are SOIV triple reassortants (see also [69]).Each phylogenetic tree was rooted by the isolate A/Albany/18/1968.These 1,000 trees were used to infer the reassortment events (see below).
For each segment, the consensus tree of the 1,000 MrBayessampled trees was used as input for HyPhy [70] to estimate the evolutionary parameters and to restore the ancestral sequences.The branch-specific dS values were estimated using the local MG94xHKY85 [71] model.The ancestral sequences were reconstructed using the GTR+I+G global nucleotide model.As an alternative approach, we also repeated our analyses using maximum likelihood trees constructed with PhyML [72] instead on consensus Bayesian trees, and obtained similar results.
For subsequent analyses, we rescaled the lengths of all branches in the consensus trees in the units of dS; this allowed us to study the distribution of nonsynonymous replacements independently of branch lengths.To obtain the gene-specific dN/dS values, we used global MG94xHKY85 model in HyPhy.

Inference of reassortments
For phylogenetic mapping of reassortments, we used a two-step procedure.First, we computationally predicted the subsets of taxa that occupied incompatible positions in phylogenies of different segments using GIRAF software [12] running on a cluster node with 512 Gb of RAM.GiRaF automatically predicted the subsets of taxa originating from each ancestral reassortment event on the basis of the MrBayes sampled trees for all eight segments.
Importantly, not all segments were necessarily involved in each inferred reassortment; although in reality each reassortment splits all segments into two subsets (the retained and the acquired segments), for some of the segments, the phylogenetic signal was often too weak to allow GiRaF to ascribe them to one of the two mutually reassorting sets of segments [12].For such segments, no reassortment event could then be inferred; as a result, although the same sets of taxa were considered for each segment, the number of reassortment events per segment varied (Table 1), and the number of segments involved in each reassortment (on either side) was usually under 8, with some of the segments ''abstaining''.We used two approaches to quality filtering of the predicted reassortments.In the first approach (''joint reassortments'', recommended in [12] and used in the main text), we acquired the subsets of taxa involved in reassortments from the ''catalog file'' produced by GiRaF.This file includes only those reassortments that involved inconsistencies between at least 3 pairs of segments; therefore, each reassortment involved between 4 and 8 segments.In the second approach (''high-confidence reassortments''), we used the reassortment subsets involving any number of pairs of segments (i.e., between 2 and 8 segments), but required the GiRaF-predicted confidence level for the reassortments of 1.0.
Second, we inferred, on the basis of these lists of reassortant taxa, the phylogenetic positions of the RCBs.In theory, each reassortment event should give rise to a monophyletic set of taxa; the last common ancestral branch to this clade is then the RCB.In reality, however, many of the predicted subsets of taxa were not monophyletic.This occurs because, under complex histories of sequential reassortments, GIRAF can either split the taxa descendant to a common reassortment event into multiple sets, or join the taxa descendent from multiple reassortment events into a single set [12].In such cases, the inference of RCBs is ambiguous.
We used three different approaches to infer the phylogenetic position of the RCBs.While these approaches produced identical results for monophyletic sets of reassortants, they differed in the way they treated non-monophyletic sets.For each set of reassortants, we inferred as RCB(s) (i) the (single) branch leading to the most recent common ancestor of all reassortants (''one-point inference'', used in the main text); (ii) the set of branches leading to the most recent common ancestors of each clade involving only reassortants (''two-point inference''); or (iii) the union of (i) and (ii) (''three-point inference'').Arguably, each approach has its merits.Under (i), the number of inferred RCBs is minimal (and equal to the number of sets of reassortant taxa), and so this approach is most parsimonious; conversely, under (ii) and (iii), a single subset of reassortant taxa could give rise to multiple RCBs on the same phylogenetic tree.Under (ii), only reassortant taxa are descendants to RCBs; finally, (iii) may be best for inference of sequences of nested reassortments such that a later reassortment affects a subset of lineages that were also involved in an earlier reassortment, and GIRAF underpredicts the set of reassortant taxa for the earlier reassortment.

Validation of reassortments
For NA, we used two approaches for validation of the observed reassortments.First, we compared our RCBs with the reassortments inferred previously on the basis of manual analysis of a much smaller dataset [7].Out of the 6 reassortments reported in [7], 5 mapped precisely to 4 of our RCBs (accounting for the strains missing in [7], and including one pre-terminal RCB; two of the reassortments mapped to the same RCB).The remaining reassortment mapped to a branch adjacent to an RCB (Table S20).The reassortments from [7] are included for comparison with our RCBs in the phylogenetic trees available at http:// makarich.fbb.msu.ru/flu_walks/.
Second, we analyzed the inconsistencies in the dates of sampling of strains.In general, the sampling dates of Influenza strains are highly correlated with their distance from the root on the ''cactusshaped'' Influenza phylogeny [33], consistently with the major role of continuous positive selection on immunity avoidance shaping it [73].However, recombination may lead to disruptions of this order, because a strain nested deep in the phylogeny may have a recent sampling date if it has reassorted with another recent strain.
We inferred the inconsistencies in the sampling dates as follows (the approach is similar to that used in [48]).We split all our strains (including those identical by NA sequence) into subsets depending on their most recent RCB, and further into smaller subsets based on the year of sampling.We then built a consensus sequence for each of these subsets, excluding those that carried fewer than 10 sequences, and used these consensuses to construct a new ML phylogenetic tree with branches scaled in ds units as described above.We then rotated the branches of the tree to order them by the sampling year.Some of the branches could not be thus ordered; i.e., had sampling dates inconsistent with their phylogenetic position.Specifically, seven of the branches, descendant to four inferred reassortments, had sampling dates later than some of the branches found to the right of them on the tree (Figure S1), supporting their origin from reassortment events.The reassortments that corresponded to these branches were among the top-ranking in our analysis (Table 2).The method of inference of reassortments based on inconsistencies in sampling dates of a single segment is orthogonal to our main approach based on inconsistencies between phylogenies of different segments; therefore, it provides an independent validation for the reassortments that we detect.

Phylogenetic distribution of nonsynonymous replacements
Using the reconstructed ancestral sequences, we inferred, for each segment, the phylogenetic positions of all nonsynonymous replacements.We then measured the distance between each nonsynonymous replacement and its ancestral reassortment.If multiple RCBs were ancestral to a given replacement, we considered the most recent one.In measurements of phylogenetic distances, we assumed that reassortments occurred at the middles of the RCBs, and that each nonsynonymous replacement occurred at the middle of the corresponding phylogenetic branch.We used two approaches for dealing with replacements on the RCBs themselves: they were considered to be either ancestral to this reassortment, and thus excluded from the list of its descendants (Tables 1-3, S2, S3, S4, S5, S6, S7, S14, S15, S16, S17, S18, S19); or descendant to this reassortment, with the distance between the reassortment and the replacement equal to zero (Tables S8, S9, S10, S11, S12, S13).The distances were measured in dS units.
To obtain the expected phylogenetic distribution of the replacements, we, in 10,000 Monte Carlo trials, redistributed the replacements among the tree branches.The probability of a replacement to fall onto a given branch was taken to be proportional to its dS value.We excluded from reshuffling the branches that were not descendant to at least one RCB, the terminal branches, and (for Tables 1-3, S2, S3, S4, S5, S6, S7, S14, S15, S16, S17, S18, S19) the RCBs themselves; since replacements on those branches are not included in corresponding analyses, the number of analyzed replacements was thus conserved.
For the analysis of the length of adaptive walk in NA, we considered a replacement reassortment-provoked if it occurred within 0.003 ds units after reassortment.We traced the number of phylogenetic lineages descendant to each RCB at this time point, and calculated the per-lineage length of the adaptive walk by dividing the number of replacements by the number of lineages.Post-RCB replacements were considered phylogenetically independent if none of them were descendant to any of the remaining ones; the number of such replacements equaled the number of lineages carrying replacements at distance of 0.003 ds.

Subsets of sites and classes of mutations in HA and NA
Positively selected sites were inferred by IFEL [70] and MEME [74] methods from the HyPhy package; a site was considered positively selected if it was predicted by either of these methods.Epitopic sites were taken from [34], and sites distinguishing the HA antigenic clusters, from [33].Replacements from a particular ancestral amino acid to a particular descendant one that occurred at more than one lineage on the phylogeny were categorized as parallel, and replacements that reverted to a once-ancestral state, reversing.
All manipulations with phylogenetic trees were done using the Perl Bio::Phylo package [75].The statistical analyses were performed with R [76].

Figure 1 .
Figure 1.Reassortment events inferred from comparison of phylogenies of HA and NA segments of Influenza A H3N2 virus, and subsequent accumulation of amino acid replacements.(A) The tanglegram [77] of the NA (left) and HA (right) genes for the same set of strains.Topologies of the phylogenies of the two genes are incongruent; such incongruence can be resolved by assuming reassortments.The clades corresponding to inferred reassortments are shown in color, with the lines of the corresponding color connecting the inferred reassortants for the two genes.(B) Zoom-in of the clade in rectangle in A. Red dot, inferred RCB (node id 62); green hatches, amino acid replacements on internal branches following the RCB.Some clades are collapsed into black triangles for clarity, and substitutions on such clades are not shown.doi:10.1371/journal.pgen.1004037.g001 Distance between post-RCB substitutions and their most recent ancestral RCB, excluding the current RCB, dS units Reassortment-provoked substitutions very soon after reassortments a Node id No. -terminal, non-pre-terminal RCBs are shown.The table shows the observed and expected distances between the amino acid replacements and the preceding reassortment, and the Wilcoxon p-value for the difference between the observed and expected values.The RCBs are ordered by the Wilcoxon p-value.a Substitutions at phylogenetic distances up to 0.003 ds units after the reassortment.b Phylogenetically independent substitutions are such that none of them are descendant to any of the remaining ones.c Number of substitutions accumulated per lineage.d RCBs also described in [7].e RCBs confirmed by sampling times.doi:10.1371/journal.pgen.1004037.t002

Figure 2 .Figure 3 .
Figure 2. Excess of amino acid replacements after RCBs in PB1.Top: frequencies of amino acid replacements observed at various phylogenetic distances from the most recent ancestral reassortment; pink: observed; gray: expected; red: overlayed observed and expected.Bottom: the excess of amino acid replacements, compared to the expectation.Gray shaded area indicates the 90% CI for the expected value.The expected distribution was obtained from 10,000 reshuffling trials.doi:10.1371/journal.pgen.1004037.g002

Table 1 .
Characteristics of RCBs and amino acid replacements.
The mean distances between the amino acid replacements and most recent RCBs that were significantly (p,0.05)lower than expected are in boldface.*Pre-terminal branches are those immediately ancestral to the terminal branches.doi:10.1371/journal.pgen.1004037.t001

Table 3 .
Phylogenetic distances from reassortments for different classes of amino acid replacements.The phylogenetic distances to the most recent ancestral RCB are compared between a given class of amino acid replacements and all remaining replacements.Classes that are significantly (p,0.05)closer to the most recent RCB are in boldface, and classes that are farther from RCB, in italic. doi:10.1371/journal.pgen.1004037.t003

Table S15
Phylogenetic distances from reassortments for different classes of amino acid replacements.Joint reassortments, two-point inference.Classes that are significantly (p,0.05)closer or farther from the most recent RCB are in boldface, with red color indicating closer to RCB, and blue, farther from RCB. (XLS) Table S16 Phylogenetic distances from reassortments for different classes of amino acid replacements.Joint reassortments, three-point inference.Classes that are significantly (p,0.05)closer or farther from the most recent RCB are in boldface, with red color indicating closer to RCB, and blue, farther from RCB. (XLS) Table S17 Phylogenetic distances from reassortments for different classes of amino acid replacements.High confidence reassortments, one-point inference.Classes that are significantly (p,0.05)closer or farther from the most recent RCB are in boldface, with red color indicating closer to RCB, and blue, farther from RCB. (XLS) Table S18 Phylogenetic distances from reassortments for different classes of amino acid replacements.High confidence reassortments, two-point inference.Classes that are significantly (p,0.05)closer or farther from the most recent RCB are in boldface, with red color indicating closer to RCB, and blue, farther from RCB. (XLS) Table S19 Phylogenetic distances from reassortments for different classes of amino acid replacements.High confidence reassortments, three-point inference.Classes that are significantly (p,0.05)closer or farther from the most recent RCB are in boldface, with red color indicating closer to RCB, and blue, farther from RCB. (XLS)