Phylodynamic Inference and Model Assessment with Approximate Bayesian Computation: Influenza as a Case Study

A key priority in infectious disease research is to understand the ecological and evolutionary drivers of viral diseases from data on disease incidence as well as viral genetic and antigenic variation. We propose using a simulation-based, Bayesian method known as Approximate Bayesian Computation (ABC) to fit and assess phylodynamic models that simulate pathogen evolution and ecology against summaries of these data. We illustrate the versatility of the method by analyzing two spatial models describing the phylodynamics of interpandemic human influenza virus subtype A(H3N2). The first model captures antigenic drift phenomenologically with continuously waning immunity, and the second epochal evolution model describes the replacement of major, relatively long-lived antigenic clusters. Combining features of long-term surveillance data from the Netherlands with features of influenza A (H3N2) hemagglutinin gene sequences sampled in northern Europe, key phylodynamic parameters can be estimated with ABC. Goodness-of-fit analyses reveal that the irregularity in interannual incidence and H3N2's ladder-like hemagglutinin phylogeny are quantitatively only reproduced under the epochal evolution model within a spatial context. However, the concomitant incidence dynamics result in a very large reproductive number and are not consistent with empirical estimates of H3N2's population level attack rate. These results demonstrate that the interactions between the evolutionary and ecological processes impose multiple quantitative constraints on the phylodynamic trajectories of influenza A(H3N2), so that sequence and surveillance data can be used synergistically. ABC, one of several data synthesis approaches, can easily interface a broad class of phylodynamic models with various types of data but requires careful calibration of the summaries and tolerance parameters.


S.1 Reconstruction of H3N2 case report data
For the Netherlands, sentinel surveillance case report data from 1970 to 2009 were provided by the National Institute for Health Services Research (NIVEL), and sentinel flu virological type and subtype counts from 1994 to 2009 from the National Institute for Public Health and the Environment (RIVM) [1,2]. Case report and virological data were collected through the Dutch sentinel general practitioner (GP) network. The sentinel GP network in the Netherlands is nationally representative by age, gender, regional distribution and population density, and the population varied between approximately 106,000 and 134,000 registered patients. ILI incidence is defined as the number of people who consulted their GP with ILI in a week divided by the population of GPs practices which reported the consultation numbers of the same week. ILI cases are defined by fever (ě 38.0˝C), sudden onset, and cough, sore throat, running nose, frontal headache, retrosternal chest pain or muscle pain (see also www.nivel.nl/peilstations). The virological data considered here correspond to patients that met ILI case definitions. Both data sets are reported by week, from Monday through Sunday, and information on age, gender and region was neglected in the present study. For For the Netherlands, type and subtype specific case report time series from 1994 to 2009 were estimated with a regression model that relates expected weekly ILI counts to weekly virological type and subtype counts. We found that the variability in the NIVEL case report data is appropriately described under an integer count model, and there was substantial evidence for overdispersion in explaining the spread of the residuals. To exclude multiplicity effects, we used a Negative Binomial model with identity link [3]. The virological surveillance data set includes weekly counts of ILI specimen that tested negative, which allowed us to model the baseline of negative ILI case reports explicitly. Compared to using a smooth, seasonally forced baseline function as is typically done [4,5], this approach led to slightly higher estimates of total incidence in winter seasons. This is because the baseline is not constrained to increase in winter seasons. Our initial regression model y t " NegBinpµ t , kq µ t " β 0`β1 rA(H1N1) t s`β 2 rA(H3N2) t s`β 3 rB t s`β 4 rRSV t s`β 5 rNeg t s (S1) consistently overestimates total ILI past 2005, see Figure S1. We found that this trend coincides with broad changes in relative sampling effort, which we defined as the number of ILI cases per virological specimen count. Figure S1B illustrates broad changes in sampling effort with a lowess fit. To adjust for broad changes in sampling effort, we model high sampling effort in an ad-hoc fashion with a binary variable S h t that equals one if the lowess curve in Figure S1B is above 4.5, and zero otherwise. Thus, a simple way to account for the interaction of the previous covariates with sampling effort is the regression model y t " NegBinpµ t , kq µ t " β 0r S h t s´β h 1 rA(H1N1) t s`β h 2 rA(H3N2) t s`β h 3 rB t s`β h 4 rRSV t s`β h 5 rNeg t s¯r 1´S h t s´β lw 1 rA(H1N1) t s`β lw 2 rA(H3N2) t s`β lw 3 rB t s`β lw 4 rRSV t s`β lw 5 rNeg t s¯.
(S2) Figure S1C displays the estimated type and subtype specific incidence time series under model (S2), and Figure S1D (red) illustrates that this model explains the variation in case report data more homogeneously than model (S1). The difference ∆ S1,S2 in the Akaike information criteria of models (S1) and (S2) is 93. Thus, accounting for broad changes in sampling effort leads to improved model predictions and improved model fit. While more refined statistical approaches to account for changes in sampling effort are possible in principle, the broad adjustment in terms of a binary categorical variable in model (S2) is sufficient for our purposes, particularly in comparison to the marked differences between estimated H3N2 incidence time series across countries.
Alternative estimation methods can have a profound effect on the magnitude and the shape of the estimated type and subtype specific time series [5]. We also compared the H3N2 time series under model (S2) to an alternative estimate derived under a standard Serfling regression model [4]. Specifically, we first reconstructed a baseline seasonal ILI time series from interannual ILI counts with the Negative Binomial Serfling regression model y t " NegBinpµ t , kq µ t " β 0`β1 t`β 2 t 2`β 3 cosp2πt{52q`β 4 sinp2πt{52q. (S3) Existing estimates of the timing and the duration of flu seasons were used to define the interannual period [6]. Next, we defined excess ILI as the difference between total ILI and predicted baseline ILI during epidemic periods as in [4], and took the weekly proportion of H3N2 virus counts among all specimen that tested positive in the respective week to reconstruct a weekly H3N2 excess ILI time series. Figure S2A illustrates that this excess H3N2 ILI time series is in broad agreement with the estimate under model (S2).
For France and the United States, we estimated H3N2-specific case report time series from 1997 to 2009 with the above Serfling approach. Excess ILI and H3N2 excess ILI are illustrated in Figures S2B-C for both countries. We observed large variations in relative sampling effort over time, which precluded the application of the multiple regression model (S1).
The reconstructed weekly H3N2 time series across countries are compared in Figure S2D, and differ in magnitude and interannual variatiability, see Figure 1 and Table 1.    Table S1). We computed respective maximum clade credibility trees with TreeAnnotator v1.6.1, and found that the corresponding phylogenetic summaries divergence, diversity, lineages, TMRCA did not differ substantially, see Figure 1 and Table 1. For different population demographic parameters and molecular clocks, posterior mean estimates from the BEAST MCMC trees after burn-in are reported. As an alternative to a constant effective population size, a piecewise constant Bayesian skyline of 20 groups was used. Table S2. Gelman-Rubin convergence diagnostics for the MCMC output under the SEIRS model and the epochal evolution model (Figure 3 and Figure 5 respectively)

S.3 Two-tier MCMC algorithm
Simple ABC rejection samplers are computationally inefficient whenever the posterior density π τ pθ|xq is markedly different from the prior density πpθq [7]. We expect this situation in most phylodynamic applications. Several MCMC and sequential importance sampling algorithms are available to improve computational efficiency [8,9]. These algorithms must be run for some time, or "burn-in", until samples from the ABC target density are generated. Phylodynamic simulations can be time consuming, particularly when the mutation rate is high. To reduce the number of simulations during burn-in, we use an MCMC sampler exactly as the one in Figure 2, but with a standard annealing scheme on the tolerances τ k and on the variance of the proposal density that updates at acceptance [10]. We applied the MCMC algorithm in Figure 2 with annealing on the tolerances τ k and the diagonal covariance matrix of the Gaussian proposal density. Here, the annealing schemes were updated at acceptance events. Since the MCMC sampler updates only a single particle in relation to its previous value, convergence to the target density is typically reached quickly.
To assess the convergence of the above algorithm, we run mABC in parallel and compare the variability of the generated Markov chains within a run to the variability across runs.
The Gelman-Rubin diagnostic was computed [11], and convergence was rejected when the diagnostic exceeded 1.2 (see Table S2). The acceptance probabilities of the MCMC sampler in Figure 2 are typically below 5% and the algorithm may occasionally get stuck in the tails of the target density as in Figure 3A [12]. As further detailed in the technical report [13], we suggest using a sequential importance sampler after convergence to quickly replenish effective sample sizes. Here, we only used the above MCMC sampler.

S.4 Calculating the effective reproductive number from model simulations
We showed that the basic reproductive number R 0 under a phylodynamic model can be substantially larger than the effective reproductive number R eff when loss of immunity due to the antigenic evolution is accounted for, see Figure 3 and Figure 5. In this section, we outline how R eff was back-calculated from model simulations.
We used simulated daily total population-level incidence data in the seasonally forced sink population to obtain reproductive numbers that are comparable to empirical estimates of R eff of H3N2 epidemics in the Northern Hemisphere. Growth rates cannot be well estimated when incidence is low in summer seasons, and were therefore calculated only when incidence increased above a tolerance of 10% of the following peak as illustrated in Fig-ure S3. Remaining noise was smoothed with a sliding window of two weeks as in [14], and the largest growth rate in each season was used to estimate seasonal effective reproductive numbers under the generation time distribution specified by the phylodynamic model [15].
In particular this implied that estimates of R eff always exceed one. Figure S3 illustrates that typical estimates of peak seasonal R eff from simulated daily population-level incidence data are fairly robust to the assumed shape of the generation time distribution and the degree of smoothing.
In the main text, we report the average of the seasonal effective reproductive numbers arising under an Exponentially distributed incubation period with a mean of 0.9 days, and an Erlang distributed infectiousness period with shape parameter 2 and a mean of 1.8 days [15].
Corresponding growth rate in winter seasons (grey), and two-week sliding window (red). (C) Estimated peak seasonal R eff under a fixed generation time of 2.8 days, and different generation time distributions and smoothing intervals. For the susceptible-infected-recovered (SIR) model, generation time follows an Exponential distribution. For the susceptible-exposed-infected-recovered (SEIR) model, the mean incubation period was set to 0.9 days, and the average time an individual remained infectious was set to 1.8 days. We considered two cases, first the infectiousness period was assumed to follow an Exponential distribution and second an Erlang distribution with shape parameter set to 2 as in the main text.
S.5 Choosing basic features of H3N2 sequence and incidence data S.5.1 Robust summaries of H3N2 associated case report data Interpandemic H3N2 case report data in Northern temperate regions are broadly characterized by explosive seasonal epidemics in winter months and substantial irregular interannual variation [1]. Particularly the magnitude of reported disease incidence varies across countries, likely reflecting differences in reporting practices and/or health-seeking behavior, and we identified summaries of H3N2 case report data that are largely insensitive to these differences (see Figure 1 and Table 1). A robust measure of the magnitude in interseasonal variation is the standard deviation in the log ratio of consecutive seasonal case report attack rates because it is based on fractional temporal information in cumulated case report data (σ-attack). To capture the irregularity in interseasonal variation, we compute the autocorrelation in case report peaks for the first few lag years (correlation). Autocorrelation coefficients at the same lag year can vary largely, also because the H3N2 time series is relatively short. We only use the correlation to penalize against strong periodic model behavior; see Table 1. Among other possibilities, the mean seasonal attack rate of reported cases (µattack) describes disease magnitude well and is informative of the country-specific reporting rate ρ. Finally, we quantify the explosiveness of winter epidemics with their average duration at half their peak size (explosiveness) to escape substantial uncertainty in identifying the onset and end of H3N2 seasons.

S.5.2 Inference with and without population level incidence data
In general, surveillance time series data alone is not sufficient to estimate both unknown reporting rates and the transmission parameters R 0 , 1{γ of a communicable disease that experiences waning immunity [16]. Here, we show that relatively vague information on population-level incidence data is sufficient to disentangle the reporting rate from the transmission parameters when the generation time distribution is assumed known. For H3N2, we consider the largest seasonal population-level attack rate in the sink population (pop-attack).
This summary is less sensitive to differences in periodic model behavior than the average population-level attack rate, and was favored in our analysis because the fitted SEIRS model showed strong periodicity.
For simplicity, we consider the first tier of the SEIRS model and fit it to the epidemiological summaries µ-attack, σ-attack, correlation, explosiveness and pop-attack as described in Table 1 under different specifications of the weighting scheme for pop-attack. We employ the same prior densities as in Table 2.  Figure S4. Parameter estimates for R 0 , 1{γ of the first tier of the spatial SEIRS model under broader weighting schemes with respect to pop-attack. We interfaced the first tier of the SEIRS model with the epidemiological summaries described in Table 1, broadening only τ`of pop-attack from (A) 0.05 to (B) 0.08, (C) 0.1, (D) 0.13. Two-dimensional histograms in (R 0 , 1{γ) of the estimated ABC target density illustrate that an increasingly multimodal and broader fit is obtained when population-level incidence is allowed to differ more broadly from empirical estimates.  1e+08 2e+08 3e+08 4e+08 5e+08 Figure S5. Parameter estimates for s, N oe of the second tier of the spatial SEIRS model with and without lineages. We fixed the parameters corresponding to the first tier of the SEIRS model and interfaced (s, ζ, N oe ) with the phylogenetic summaries described in Table 1 (A). We then repeated inference without the lineages (B). Two-dimensional histograms in (s, N oe ) of the estimated ABC target density illustrate that the source population size cannot be jointly estimated with s and ζ when lineages is excluded. We ran four MCMC chains in parallel, and the first tier of the SEIRS model was fixed to the four parameter sets R 0 " 3, 2.5, 2.6, 2.7, 1{φ " 0.9, 1{ν " 1.8, 1{γ " 9.6, 6.5, 8.1, 9.3, ρ " 0.11, 0.12, 0.07, 0.17, 1{µ oe " 50, ϕ Ó " 0.58, 0.31, 0.24, 0.54, ϕ oe " 0.02, 0.003, 0.004, 0.002, m Ó " 9.9, 7.9, 13.1, 12ˆ10 6 , m oe " 0.02, 0.08, 0.004, 0.009. These parameters were chosen as described in the Methods section. Table S3 and Figure S4 illustrate how the transmission parameters and the reporting rate of the fitted epidemiological model change with a broader pop-attack weighting scheme.
The 95% confidence intervals of R 0 , 1{γ and ρ increase as the pop-attack weighting scheme discriminates less against small maximum population-level attack rates, and the joint posterior density of the transmission parameters R 0 , 1{γ turns increasingly irregular. The tolerances τ´"´0.1 and τ`" 0.05 in Figure S4A correspond to largest population-level attack rates between 15-30%, which is well in line with epidemiological estimates of the population-level attack rate of influenza H3N2 in the Northern Hemisphere [17]. Larger τà llow for population-level attack rates that are known to be too low, and available data do not support to choose smaller τ`to the best of our knowledge.

S.5.3 Summaries of H3N2 sequence data
Based on the estimated HA phylogeny, we initially summarized H3N2's evolution with its fast divergence and its limited average genetic diversity across seasons (see Figure 1 and Table 1). Specifically, divergence is captured as the slope through the number of nucleotide substitutions of sampled sequences to the founder sequence across time (divergence). To 1e+08 3e+08 5e+08 7e+08 9e+08 Figure S6. Parameter estimates for s, N oe of the second tier of the spatial epochal evolution model with and without lineages. We fixed the parameters corresponding to the first tier of the epochal evolution model and interfaced (s, ζ, N oe ) with the phylogenetic summaries as detailed in Table 1 (A). We then repeated inference without the lineages (B). Two-dimensional histograms in (s, N oe ) of the estimated ABC target density illustrate that the source population size is broader when lineages is not included. We ran four MCMC chains in parallel, and the first tier of the epochal evolution model was fixed to the four parameter sets R 0 " 22, 14, 20, 24, 1{φ " 0.9, 1{ν " 1.8, 1{γ " 274, 276, 375, 377, ρ " 0.42, 0.61, 0.72, 0.86, 1{µ oe " 50, ϕ Ó " 0.12, 0.48, 0.32, 0.22, ϕ oe " 0.02, 0.01, 0.02, 0.01, m Ó " 3.9, 9.3, 4.9, 12.5ˆ10 6 , m oe " 0.02, 0.04, 0.07, 0.03. These parameters were chosen as described in the Methods section.
reflect genetic diversity, we compute the average pairwise diversity of any two sequences sampled in the same season (diversity). The number of dated HA sequences available before 1990 is very small, so that these years effectively do not contribute to the diversity. To make this sampling effect more apparent, all phylogenetic summaries except the divergence are only computed on the period 1991-2009.
We could not simultaneously estimate the source population size, the mutation rate and the residual selection parameter from H3N2's divergence and diversity alone. Here, we show that conditioning also on the lineages of the HA phylogeny is sufficient to estimate N oe , ζ and s. To this end, we interfaced these parameters with the phylogenetic summaries as described in Table 1 under both models, while keeping the epidemiological model parameters fixed. Figures S5-S6 illustrate that estimates of N oe are broader when information on H3N2's lineages is not used. Since pop-attack is much lower under the fitted epochal evolution model, an overall larger source population size is required to yield a similar number of new genetic variants.
Finally, we found that the TMRCA's are a useful summary to avoid simulated phylogenies with reasonable divergence, diversity and lineages but deep phylogenetic branch-  Figure S7. Parameter estimates for s, ζ of the spatial epochal evolution model with and without TMRCA. We repeated inference with all summaries as in Table 1 but without TMRCA. (A) Figure 5F for comparison. (B) Two-dimensional histogram of (s, ζ) of the estimated ABC target density without TMRCA.
ing. Figure S7 illustrates phylodynamic inference of the epochal evolution model with and without TMRCA. The selective advantage between replacing viral variants in the epochal evolution model is strong enough to limit the average pairwise diversity in simulated phylogenies, but fails to exclude deep phylogenetic branching. Thus, when inference conditions on TMRCA, substantial residual selection pressures must be included in the epochal evolution model. When deep phylogenetic branching is not explicitly penalized, the estimated residual selection parameter is much lower.
S.6 Inference and model assessment on simulated data S.6.1 Parameter inference To assess the accuracy of ABC phylodynamic inference in the context of the flu example, we first generated one data set under the SEIRS model. Model parameters were set to R 0 " 3.5, 1{φ " 0.9, 1{ν " 1.8, 1{γ " 10, ρ " 0.08, N oe " 2ˆ10 8 , 1{µ oe " 50, ϕ Ó " 0.41, ϕ oe " 0.006, m Ó " 4ˆ10 6 , m oe " 0.01, ζ " 1.5, s " 0.09, a sample of the posterior distribution under the summaries and tolerances in Table 1. Figure S8 illustrates ABC parameter estimates for this simulated data set, using summaries and tolerances as in Table 1. Estimates of R 0 , 1{γ, ρ were fairly broad, and broader than those obtained from inference against the real flu data set. We failed to estimate N oe , and the estimates of s, ζ did not correspond well to the true values. As on the real data study, ϕ Ó , ϕ oe , m Ó , m oe could not be estimated. The reason for such poor parameter inference is that the summary errors against data simulated from the same model are much smaller than those computed against real data. Thus, a broader set of parameters is now acceptable when the same tolerances are used.
Using tighter tolerances than those in Table 1, we could accurately estimate the same parameters as compared to inference on the real data (except N oe ). Using (the values used in the main text are added in brackets) τṕ op-attack "´0.03 p´0.1q, τp op-attack " 0.03 p0.05q, τσ -attack "´0.45 p´0.7q, τσ -attack " 0.35 p0.35q, τμ -attack "´0.45 p´1.3q, τμ -attack " 0.35 p0.35q, τé xplosiveness "´0.35 p´0.6q, τè xplosiveness " 0.35 p0.35q, we obtained more reliable estimates of R 0 , 1{γ, ρ as illustrated in Figure S9A. Next, we could obtain improved estimates of s, ζ with tighter tolerances on the phylogenetic summaries, τd ivergence " 0.175 p´0.4q, τd ivergence " 0.175 p0.4q, τd iversity "´0.2 p´0.6q, τd iversity " 0.2 p0.6q, τĺ ineages "´0.025 p´1.3q, τl ineages " 0.025 p1.3q, τT MRCA "´0.75 p´3q, τT MRCA " 0.75 p3q; see Figure S9B. We could not estimate an upper limit for N oe . Tolerances must be chosen in view of the error magnitude. If the error magnitude is considerably larger on real data than on simulated data, then ABC parameter inference on real data with tolerances τ ‹ calibrated on simulated data will suffer from very low acceptance rates and will thus be extremely unreliable. For example, the τ ‹ of the phylogenetic summaries are smaller than the errors between the SEIRS model and the H3N2 phylogeny, so that ABC parameter inference against real data is impossible with these τ ‹ .
We further checked that ABC based on the summaries in Table 1 can accurately estimate a range of SEIRS model parameters with suitable tolerances, see Figure S10. In general, for higher values of R 0 smaller tolerances had to be chosen because differences in population-level attack rates decrease. For relatively large tolerances, posterior distributions were broad, but never inaccurate and reliable. We emphasise that often, it is not appropriate to set the tolerances to tight values τ ‹ even when this is computationally feasible.
Available data may not support the use of narrow ABC tolerances τ ‹ . For example, it is possible to use tighter tolerances on pop-attack than those in Table 1. The tolerances τṕ op-attack "´0.1, τp op-attack " 0.05 correspond to maximum population-level attack rates between 15=30%, which is well in line current epidemiological estimates of population-level attack rates between 10-20% [17]. Considerably tighter tolerances would result in some form of overfitting.  Figure S8. Parameter estimates of the spatial SEIRS model on data simulated from the spatial SEIRS model (case R 0 " 3.5), using summaries and tolerances as in Table 1. One-dimensional histograms of the ABC fit using the summaries and tolerances in Table 1. Four MCMC chains were started at overdispersed starting values, a burn-in period was removed and the remaining samples are shown in color for each chain. True parameters from which the data set were generated are indicated in black.  Figure S9. Parameter estimates of the spatial SEIRS model on data simulated from the spatial SEIRS model (case R 0 " 3.5), using tighter tolerances than in Table 1. One-dimensional histograms of the ABC fit using the summaries in Table 1 but (A) tighter tolerances on µ-attack, σ-attack, pop-attack and (B) tighter tolerances on divergence, diversity, lineages, TMRCA as detailed in the text. In both cases, four MCMC chains were started at overdispersed starting values, a burn-in period was removed and the remaining samples are shown in color for each chain. True parameters from which the data set were generated are indicated in black. Figure S10. Parameter estimates of the spatial SEIRS model on data simulated from the spatial SEIRS model (case A: R 0 " 1.3, case B: R 0 " 14), using tighter tolerances than in Table 1. One-dimensional histograms of the ABC fit against data generated with the parameters (A) R 0 " 1.3, 1{γ " 5, ρ " 0.08 and (B) R 0 " 14, 1{γ " 150, ρ " 0.4 and all other parameters as before. We used the summaries in Table 1 with tolerances (A) as in Figure S9A, and (B) τp op-attack " 0.002, τṕ op-attack "´0.002, τσ -attack " τμ -attack " τμ -attack " τè xplosiveness " 0.2, τσ -attack " τμ -attack " τμ -attack " τé xplosiveness "´0.2 and all others as before. MCMC samples were generated as before, and true parameters from which the data set were generated are indicated in black.  Figure S11. Accuracy in estimating the inclusion probability ι s . One-dimensional histograms of parts of the ABC fit against simulated data, case R 0 " 3.5, (A) without and (B) with selection, see the text for details. Four MCMC chains were started at overdispersed starting values and the epidemiological parameters were held fixed. The first 1000 iterations were discarded and histograms of the remaining samples are shown in color for each chain. The correct value is indicated with a vertical black line.
As before, the tolerances had to be tightened for reliable parameter inference. We used the Indicator weighting scheme (3) for all summaries and the tolerances τd ivergence " 0.175, τd iversity " 0.2, τl ineages " 0.035, τT MRCA " 0.4 and τḱ "´τk . Figure S11 illustrates that the inclusion probability ι s can be reliably estimated. All other parameters are also well estimated, although credibility intervals are substantially broader under scenario (ii).
Notably, the variable selection prior penalizes large values of s so that the true values of s can be expected to be somewhat larger than estimates thereof.

S.6.3 Model assessment
In light of our results on the epochal evolution model, we first verified that this model can be accurately fitted to data generated from the same model, and that the associated summary errors are then centred around zero. Incidence and phylogenetic data were generated under the parameters R 0 " 20, 1{φ " 0.9, 1{ν " 1.8, 1{γ " 180, ρ " 0.75, σ " 0.75, N oe " 4ˆ10 8 , 1{µ oe " 50, ϕ Ó " 0.28, ϕ oe " 0.015, m Ó " 8ˆ10 6 , m oe " 0.05, ζ " 3, s " 0.09, κ " 2, λ " 380, which correspond well to the fitted epochal evolution model in the main text, Figure 5. We used ABC with the same summaries as in Table 1. Prior densities were chosen as in Table 2. As before, the tolerances had to be tightened to estimate parameters reliably, again because the summary errors were here considerably smaller for a broader set of model parameters. Figure S12A illustrates that R 0 , 1{γ, σ, ρ and λ could be accurately estimated under the same tolerances as in Figure S9A (which are tighter than those in Table 1). Next, again using tighter tolerances, we could obtain accurate estimates of the molecular genetic parameters (see Figure S12B). All other parameters not shown had fairly broad posterior distributions and could not be estimated. With the tolerances in Table 1, estimates of R 0 had a 95% confidence interval of r7, 27s and a posterior mean at 16.9, and s, ζ could not be estimated. Figure S12C shows that all the associated summary errors plotted in Figure 5I-L are now close to zero, and Figure S12D shows that the same is true for all other summary errors, indicating that the summary errors correctly indicate goodness of fit.
Finally, we verified that ABC with the summaries in Table 1 can detect discrepancies of the SEIRS model in reproducing data generated under the epochal evolution model. Here, we used the same tolerances as in Table 1, except for diversity: τd iversity "´1, τd iversity " 1. Figure S13A shows the ABC summary diagnostics that we used in the main text, and Figure S13B shows all remaining ones. The correlation, diversity and TMRCA summary errors deviate clearly from zero, indicating that the summary errors can correctly identify model mismatch.  Figure S12. Parameter estimates of the spatial epochal evolution model on data simulated from the same model, using tighter tolerances than in Table 1.
One-dimensional histograms of parts of the ABC fit based on (A) tighter tolerances of the epidemiological summaries (as in Figure S9A), and (B) tighter tolerances on the phylogenetic summaries τd ivergence "´0.2 p´0.4q, τd ivergence " 0.2 p0.2q, τd iversity "´0.4 p´0.6q, τd iversity " 0.4 p0.6q, τĺ ineages "´0.4 p´1.3q, τl ineages " 0.4 p1.3q, τT MRCA "´0.5 p´3q, τT MRCA " 3 p3q; values of Table 1 added in parentheses. MCMC samples were generated as before, and true parameters from which the data set were generated are indicated in black. (C-D) 2-D histograms of the associated summary errors.  Figure S13. Accuracy in detecting discrepancies of the SEIRS model when fitted to simulations of the epochal evolution model. Two-dimensional histograms of the nine-dimensional distribution of ABC summary diagnostics that correspond to fitting the SEIRS model with the summaries in Table 1, see the text for details. (A) Summary diagnostics shown in the main text for ABC analyses on real data, and (B) diagnostics for all other summaries. Four MCMC chains were started at overdispersed parameter values, the burn-in period was removed and all remaining samples were pooled across chains to produce the histograms.

S.7 Inference with and without phylogenetic summaries
We analyzed if the phylogenetic summaries divergence, diversity, lineages and TMRCA have any effect on estimates of the epidemiological parameters of SEIRS model. To this end, we fitted the first tier of the SEIRS model to the epidemiological summaries µ-attack, σ-attack, correlation, explosiveness and pop-attack in Table 1, using the same prior densities as in Table 2.
Comparing Tables 2-3 to the first column in Table S3, we find that without the phylogenetic summaries, the fitted source-sink SEIRS model results in smaller basic reproductive numbers 2.46˘0.80 and maximum population-level attack rates that are on average 15%˘4% (3.03˘0.55 and 12%˘3% respectively when the full phylodynamic model is fitted). Because lower incidence results in more narrow phylogenies, this discrepancy can be explained with the addition of the phylogenetic summaries. To substantiate this explanation, we fitted the tier 1 model under a relaxed pop-attack weighting scheme (τ`" 0.08), which gave a distribution of pop-attack posterior summary errors that is almost identical to the one obtained with the phylogenetic summaries; see the second column in Table S3. The associated R 0 is estimated at 3.2˘0.79, which is very similar to the estimate obtained with the phylogenetic summaries. Thus, the imposed penalties in the weighting scheme for narrow phylogenies (see Table 1) result in slight deviations to low maximum population-level attack rates, which implies relatively high values of R 0 when phylogenetic summaries are included. This implies that conditioning on phylogenetic summaries has an effect on parameter inference even under the SEIRS model, and the molecular genetic and epidemiological summaries are not independent of each other. S.8 Inference under models without spatial substructure Figure 3H-I illustrate that the spatial SEIRS model can reproduce the divergence of H3N2's HA phylogeny. To illustrate that a geographically separated source population is required to reproduce the divergence of H3N2's HA phylogeny, we consider here the corresponding SEIRS model in a seasonally forced, spatially homogeneous population that is, as before, calibrated to represent the Netherlands. Leaving demographic stochasticity aside, H3N2 phylodynamics are described by where all parameters are as in (5) for the sink population, and I " I 1`I2 , 1{φ is the average duration of incubation, m is the number of visiting infected travelers,Î is the number of infected individuals at disease equilibrium, and ψ ě 1 is an inflation factor. As we show below, this inflation factor is necessary to reproduce the divergence of the HA phylogeny.
The second tier of (S4) thus corresponds to a first tier that is inflated by ψ.
The spatially homogeneous SEIRS model was fitted to the phylodynamic summaries as described in Table 1. The inflaction factor is estimated at ψ " 10.8˘8.2, and Figure S14 illustrates that with large ψ, the behavior of spatially homogeneous model is similar to the behavior of the spatially heterogeneous SEIRS model. At smaller inflation factors, the spatially homogeneous model fails to reproduce the divergence of H3N2's HA phylogeny. The Dutch population size is too small to reproduce the basic features of the HA phylogeny, and the large inflation factor suggests that a viral reservoir with a population around 1.7ˆ10 8 individuals is needed to match the basic features of H3N2 surveillance and molecular genetic data in Table 1. This is well in line with the estimated size of the source population under the spatial SEIRS model, N oe " 1.85˘1.2ˆ10 8 . which depends itself on m Ó as well as further epidemiological variables. Thus, we took the 95% credibility interval of ϕ to define a plausible parameter range for ϕ Ó , see Table 2.

S.9 Sensitivity analyses
Our prior assumptions on the model parameter are listed in Table 2, along with a brief justification. Here, we describe how sensitive our results in the main text are to changes in the generation time 1{φ`1{ν, the birth rate µ oe , the functional form of the antigenic emergence rate h in Eqns. 5, and ϕ oe .
Most importantly, we assume that the strength of seasonal forcing in the source population is weak, ϕ oe P r0, 0.02s. While plausible, there is considerable uncertainty in the strength, timing and form of influenza's seasonality in tropical regions [18]. Here, we show that the epochal evolution model of major antigenic clusters is well in agreement with the summaries in Table 1 if strong seasonal forcing is assumed in the source population, ϕ oe ą 0.15. Essentially, strong ϕ oe leads to sufficiently severe genetic bottlenecks in which less favorable   Figure S16. Phylodynamics under the spatial epochal evolution model with weak seasonal forcing in the source population, ϕ oe " 0.05. (A-I) as in Figure S15.  Figure S17. Phylodynamics under the spatial epochal evolution model with weak seasonal forcing in the source population, ϕ oe " 0.07. (A-I) as in Figure S15.  Figure S18. Phylodynamics under the spatial epochal evolution model with weak seasonal forcing in the source population, ϕ oe " 0.1. (A-I) as in Figure S15.  Figure S19. Phylodynamics under the spatial epochal evolution model with weak seasonal forcing in the source population, ϕ oe " 0.15. (A-I) as in Figure S15. antigenic variants go readily to extinction. Therefore, and in contrast to the analysis in the main text, the duration of cluster-specific immunity 1{γ is not constrained by the shape of the phylogeny and can be set small. Figure S15-S19 illustrate phylodynamics under the epochal evolution model with the parameters R 0 " 3, 1{γ " 9, 1{φ " 0.9, 1{ν " 1.8, ρ " 0.1, N oe " 2ˆ10 8 , 1{µ oe " 50, ϕ Ó " 0.25, m Ó " 9.7ˆ10 6 , m oe " 0.05, ζ " 3.3, s " 0.09, κ " 2, λ " 180 but varying ϕ oe . For ϕ oe ą 0.15, 1{γ can be relatively short, so that R 0 can be much smaller too and hence the simulated pop-attack are within the range of empirical estimates.
Recent reanalyses of influenza infections in household studies estimate a generation time around 2.7 days, while higher estimates around 4 days have also been reported [14,19]. To evaluate if higher generation times up to 4 days could influence the fit and the goodness of fit of the both models considered, we initially considered the first tier only. The first tiers of both models were fitted to the epidemiological summaries in Table 1, using the same respective tolerances and weighting schemes. In order to fit the epochal evolution model without computationally expensive phylogenetic simulations, we also generated the annual time series of the number of coexisting antigenic clusters and computed their average value in 1968-2002. To favor strain replacement, on average no more than 1.5 clusters were allowed to coexist in any season. Table S5 demonstrates that higher mean generation times 1{φ`1{ν " 2.7, 3.5, 4 result in slightly larger estimates of R 0 , and do not affect the fit to any of the other parameters that can be estimated with the epidemiological summaries. Goodness of fit was insensitive to changes in the generation time. Consequently, full phylodynamic inference including phylogenetic summaries was not run.
Recent models of influenza evolution and epidemiology assume that antigenic variants emerge at a rate h that is either constant or increases through time [20,21]. Following the argument in [20], it is plausible that h might alternatively depend on cumulative incidence rather than time, i.e.
hpt, t e i q " To evaluate if a constant antigenic emergence rate or dependence of h on cumulative incidence might influence the fit and the discrepancies of the fitted epochal evolution model, we initially considered only the first tier for computational reasons. As above, we required that the average number of coexisting antigenic clusters stays below 1.5. Table S5 illustrates that the key epidemiological parameters that can be estimated with the epidemiological summaries in Table 1 are not sensitive to using κ " 1 or (S5).
Finally, we also analyzed if a higher birth rate in the source population changes parameter inference or model assessment. Briefly, a lower birth rate µ oe increases the extinction probability of phylodynamic simulations, especially for the fitted epochal evolution model, where infrequent but relatively strong cluster invasions lead to pronounced genetic bottlenecks, see Figure 6H. For an average lifespan of 80 years, more than 50% of all model simulations go extinct. A lower average lifespan than the 50 years used here reduced the extinction probability of phylodynamic simulations, but implied lineages that were 2-3 times as thick as in the observed HA phylogeny.  Figure S20. Reproducibility of interannual variability in H3N2 incidence for increasing nclust in 1968-2002. The first tier of the epochal evolution model was fitted with ABC to features of H3N2 surveillance data as specified in Table 1 under different assumptions on the number of antigenic clusters in 1968-2002: (A) nclust " 0, (B) nclust " 3˘1, (C) nclust " 5˘1, (D) nclust " 7˘1, (E) nclust " 11˘2, (F) nclust " 17˘4. (A) is the same as Figure 3G, and (E) is the same as Figure 5I. To circumvent computationally expensive phylogenetic simulations, we here generated the annual time series of the number of coexisting antigenic clusters and computed their average value in 1968-2002. To favor strain replacement, on average no more than 1.5 clusters were allowed to coexist in any season. Four MCMC chains were run in parallel, burn-in periods were pruned and the remaining samples were pooled. The two-dimensional histograms between correlation and σ-attack indicate that the interannual variability in H3N2 incidence is increasingly better reproduced with a larger number of antigenic clusters that replace each other. Under the epochal evolution model, the turnover of more than 10 clusters, or equivalently 1 replacement event every 3-4 years, results in irregular incidence time series that are consistent with H3N2's interannual variability.