Assessing Optimal Target Populations for Influenza Vaccination Programmes: An Evidence Synthesis and Modelling Study

Marc Baguelin and colleagues use virological, clinical, epidemiological, and behavioral data to estimate how policies for influenza vaccination programs may be optimized in England and Wales. Please see later in the article for the Editors' Summary


Introduction
This supplementary information report complements the main paper by describing in more detail some of the more technical points of the methods. Section 2.1 presents complementary details about the structure of the model. Section 2.2 is dedicated to the statistical inference of the parameters of the model. Section 2.3 makes explicit the algorithms used for the implementation of the model, and Section 2.4 reviews and discusses other existing methods. Section 3, the last part of this report, presents additional results that were not included in the main text. The final coverage by age, risk group and season used in the model can be found in Table S1.

Table S1: Final coverage by age, risk group and season used in the model. For the 0-1 age group, only infants over 6 months receive vaccination.
where S^N_ik and S^V_ik represent the number of susceptibles of age group i and risk group k with different vaccine histories. The susceptibles indexed by N are naive and those indexed by V have received the vaccine. E^1X_ik and E^2X_ik represent exposed, but not yet infectious, individuals with vaccine history X. I^1X_ik and I^2X_ik represent infectious individuals. R^X_ik represents immune individuals of age class i and risk group k with vaccine history X, and λ_i is the age-group-specific force of infection, given by

λ_i = q σ_i Σ_j c_ij Σ_k (I^1N_jk + I^2N_jk + I^1V_jk + I^2V_jk),

where q is the transmissibility parameter, c_ij is the rate at which individuals in age group i make contact with those in age group j and σ_i is the susceptibility of age group i; the rate µ_ik at which susceptibles move into the vaccinated classes is determined by ν_ik, the rate of immunization of risk group k in age group i.
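The force of infection can be sketched in C, the language of the study's implementation. This is our own minimal illustration, not the authors' code: it collapses the risk-group and vaccine strata into a single infectious count per age group, and all function and variable names are assumptions.

```c
#include <stddef.h>
#include <math.h>

/* Sketch of the age-group-specific force of infection
 * lambda_i = q * sigma_i * sum_j c_ij * I_j,
 * with the vaccine/risk strata collapsed into one infectious
 * count I_j per age group for illustration. */
void force_of_infection(size_t n_age, double q,
                        const double *sigma,      /* susceptibility by age group */
                        const double *c,          /* contact rates, row-major n_age x n_age */
                        const double *infectious, /* infectious individuals by age group */
                        double *lambda)           /* output: force of infection by age group */
{
    for (size_t i = 0; i < n_age; i++) {
        double contact_sum = 0.0;
        for (size_t j = 0; j < n_age; j++)
            contact_sum += c[i * n_age + j] * infectious[j];
        lambda[i] = q * sigma[i] * contact_sum;
    }
}
```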

Description of the observation model using sets
We would like to stress here that the ascertainment probability ε_i should not be confused with the propensity to consult among infected people. To illustrate the difference between the two, we can use set notation to write down the different states (all referring to the same subtype) to which an individual in age group i can belong at week j (illustrated in Figure 2 of the main paper):
• A: be infected by one given subtype of influenza;
• B: have an influenza infection detectable by the virological test as applied in field conditions;
• C: develop ILI symptoms;
• D: go to the GP;
• E: be recorded as having ILI;
• F: be tested virologically as part of the ILI testing scheme.
It is then possible to describe derived quantities using the "conditional" and "intersection" set operators, noted respectively | and ∩. These operators have both a logical and a probabilistic interpretation². A ∩ D can be read as "A and D", i.e. the event that both A and D are true, and thus corresponds to the combined event of being infected by a specific subtype of influenza and going to the GP. Similarly, P(B|A) refers to the probability of event B conditional on event A, so P(B|A) is the probability of being positive for a specific subtype (event B) given that the individual has been infected by this subtype (event A). Then, if we drop the age group and week indices for clarity, the model quantities can be written as:

m = |E|;  m⁺ = |B ∩ E|;  z_outside + z_θ = |A|;  n = |F|;  n⁺ = |B ∩ F|;  (3)

and the ascertainment probability is equal to P(B ∩ E|A), while the probability of consulting among infected people is P(D|A). Some other common and tangible epidemiological quantities can be formulated using this set notation:
• P(B|A) is the sensitivity of the virological test against influenza cases;
• P(B|A ∩ C) is the sensitivity of the virological test against clinical influenza cases;
• P(C|A) is the proportion of symptomatic³ among influenza cases;
• P(D|A ∩ C) is the propensity to consult⁴ for clinical influenza infections;
• P(E|A ∩ C ∩ D) is the GP's ability to recognize a patient with influenza symptoms.
We can thus express ε, the ascertainment probability, by means of these quantities. For this we start, using conditioning, by separating the GP and virological surveillance streams; P(E|A) can then be further broken down in terms of the elementary processes. As we assume that A ∩ E = A ∩ C ∩ E (all individuals recorded as ILI when consulting following an influenza infection have ILI symptoms) and E ⊂ D, so that P(A ∩ E) = P(A ∩ C ∩ D ∩ E), the remaining factor is P(E|A ∩ C ∩ D), i.e. the GP's ability to recognize a patient with influenza symptoms.
As A ∩ E = A ∩ C ∩ E, we can rewrite ε in terms of P(B|A ∩ C ∩ E), the probability of having developed a (virologically) detectable influenza infection given that the individual is infected and picked up as ILI by the GP. We can assume that B and E are independent conditional on A ∩ C.
That means that the probability that a symptomatic individual who goes to the GP has a detectable influenza infection is independent of the GP's decision to record this patient as ILI⁵. We can finally write Equation (7) using only elementary epidemiological quantities:

ε = P(E|A ∩ C ∩ D) · P(D|A ∩ C) · P(C|A) · P(B|A ∩ C),

where P(E|A ∩ C ∩ D) is the GP recognition ability, P(D|A ∩ C) the propensity to consult among symptomatic influenza cases, P(C|A) the proportion with symptoms among the infected, and P(B|A ∩ C) the sensitivity of the test against clinical influenza cases.

² Both fundamental logic and probability theory can ultimately be grounded in set theory.
³ Given the continuous range of symptoms associated with influenza, a case definition is needed for influenza-like symptoms. We match here the definition used for our ILI data, i.e. fever plus at least one other respiratory symptom.
⁴ Assuming that the influenza symptoms are the cause of the visit to the GP.
⁵ It is possible, though, that a more severe influenza infection would have a higher chance of being classified as ILI by the GP or of testing positive in the virology. Nevertheless, as this potential bias is integrated into the chain of probabilities making up the ascertainment probability ε, this assumption does not impact the model developed.
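The decomposition of ε into the four elementary quantities listed above can be written as a one-line computation. The numerical values in the test are purely illustrative, not estimates from the study.

```c
#include <math.h>

/* Ascertainment probability epsilon as the product of elementary
 * epidemiological quantities:
 * eps = P(E|A∩C∩D) * P(D|A∩C) * P(C|A) * P(B|A∩C). */
double ascertainment_probability(double p_recognise,   /* P(E|A∩C∩D): GP recognition ability      */
                                 double p_consult,     /* P(D|A∩C): propensity to consult         */
                                 double p_symptomatic, /* P(C|A): proportion symptomatic          */
                                 double p_detectable)  /* P(B|A∩C): test sensitivity, clinical    */
{
    return p_recognise * p_consult * p_symptomatic * p_detectable;
}
```

Because each factor conditions on the event chain of the previous ones, ε is typically much smaller than the propensity to consult P(D|A) alone, which is the point of the distinction made above.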

Reproduction numbers and mixing between groups
The epidemiological potential of a pathogen is often described by its basic reproduction number R_0, defined as the average number of secondary cases following the introduction of an infectious individual into a totally susceptible population. Mathematically, R_0 is the largest eigenvalue of the next generation matrix (NGM). When the population is not entirely susceptible (presence of immunity), a similar quantity can be defined, called the effective reproduction number and noted R_e. Nevertheless, these quantities synthesize the dynamical properties of potential outbreaks under a homogeneous mixing paradigm (all individuals are equally likely to contact each other) and thus do not capture the dynamics associated with intra- and inter-group transmission in populations stratified into non-homogeneously mixing groups.
In the setting of targeted vaccination campaigns whose aim is the interruption of transmission, summary quantities should reflect the transmission potential of the targeted groups. The requirement for more detail should be balanced against the necessity for summary quantities to be simple to interpret. We have thus chosen to compute, in addition to R_0 and R_e, the 2-by-2 NGM reduced to children (under 15 years) and adults.
The basic reproduction number R_0, the effective reproduction number R_e, and the 2-by-2 reduced NGM are derived quantities, dependent on other parameter values, rather than fitted directly in our model. R_0 and R_e are obtained from the dominant eigenvalue of the next generation operator, adjusting for susceptibility as appropriate. For a given mixing matrix and susceptibility profile, R_0 is proportional to q, the transmission coefficient, which is estimated for each strain and each year.
In the following sections we investigate the link between this 2-by-2 reduced NGM and the transmission potential in the whole population as summarized by R_0.
General formula for the 2-by-2 case
The NGM of a population with two age groups C and A can be written⁶ as:

K = [ R_CC  R_CA
      R_AC  R_AA ]

The basic reproduction number R_0 is given by the dominant eigenvalue of the NGM. To calculate the eigenvalues, one should find the roots of the characteristic polynomial det(K − ΛI) = 0, with I the 2-by-2 identity matrix. We thus need to solve:

Λ² − (R_CC + R_AA)Λ + R_CC R_AA − R_CA R_AC = 0,

and thus, assuming that R_CC R_AA ≠ 0, we have:

R_0 = ( R_CC + R_AA + sqrt( (R_CC + R_AA)² − 4 R_CC R_AA (1 − d_R) ) ) / 2,   (11)

where d_R = (R_CA R_AC) / (R_CC R_AA) is the degree of assortativity of the matrix. We assume that, for a disease such as influenza where transmission is associated with close contact, R_CC R_AA ≥ R_CA R_AC and thus 0 ≤ d_R ≤ 1.

⁶ In the main paper, for clarity reasons, R_CC and R_AA are denoted by R_C and R_A respectively.
Assortative sub-populations
If the transmission structure is very assortative then d_R ≈ 0, and developing Equation (11) with d_R = 0 shows that, for two fully assortative sub-populations:

R_0 = max(R_CC, R_AA).

Highly coupled sub-populations
With d_R = 1, the square root in Equation (11) reduces to R_CC + R_AA and thus:

R_0 = R_CC + R_AA.

Different epidemiology for the same R_0
As R_0 varies monotonically as d_R varies between 0 and 1, we have:

max(R_CC, R_AA) ≤ R_0 ≤ R_CC + R_AA.

As R_0 is a combination of the intra-group transmission and of the coupling between the groups, two populations with very different transmission structures can have the same R_0; for example, a strongly assortative NGM and a strongly coupled NGM can both have an R_0 equal to 2. Control strategies for the former population should target children, while control strategies for the latter should involve the whole population.

Estimation of the number of deaths due to influenza
Tables S2 and S3, alongside Figure S2, are presented below to complement the text on the estimation of the number of deaths due to influenza in the main manuscript, where they are referred to.

Statistical inference

Inference of the contacts
In the inference process, the advantages of basing the mixing structure on a resampled set of participants, rather than directly on a parametric contact matrix, are numerous. First, each resampled set of participants corresponds to one mixing matrix, with the whole collection of these matrices embedding the uncertainty resulting from the contact survey. Secondly, two similar datasets of participants are likely to lead to similar mixing matrices. This means that this structure defines a kind of distance in the matrix space. A 'random walk' in this matrix space can thus be performed by resampling a certain number of participants from the current matrix to obtain a proposal mixing matrix.

Figure S2: Distribution of the case fatality ratio (CFR) for the low risk group by age and strain, estimated by a generalized linear model.

Identifiability of the variables
Examination of Equation (2) for the force of infection λ_i in age group i shows that it is impossible to identify both the transmissibility q and the susceptibility profile {σ_i} from transmission events only. Indeed, for any set {q, σ_i}, the family of parameters {q/a, aσ_i} (for any constant a such that aσ_i ≤ 1 for all i) would give the same value for the force of infection, leading to the same transmission dynamics for the disease. To avoid this identifiability problem, some other sources of data are required to inform the priors of these two parameters. In the next section, we present the methods used to derive priors on susceptibility, transmissibility and the ascertainment probabilities using existing serology data from one of the seasons and subtypes of the study.
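The non-identifiability argument can be demonstrated numerically: scaling q by 1/a and every σ_i by a leaves each λ_i unchanged. The sketch below (our own names, single age group for brevity) makes this exact cancellation explicit.

```c
#include <math.h>

/* Force of infection for one age group:
 * lambda = q * sigma * sum_j c_row[j] * infectious[j].
 * Scaling (q, sigma) -> (q/a, a*sigma) cancels exactly, which is
 * why q and the sigma_i cannot both be identified from
 * transmission events alone. */
double foi_one_group(double q, double sigma, const double *c_row,
                     const double *infectious, int n)
{
    double contact_sum = 0.0;
    for (int j = 0; j < n; j++)
        contact_sum += c_row[j] * infectious[j];
    return q * sigma * contact_sum;
}
```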

Derivation of the priors
After the 2003/04 season, serology assays were performed for antibodies to the A/H3N2 subtype in sera taken before and after the influenza season [19]. This allows us to derive informative priors on susceptibility at the start of the season and on the ascertainment probabilities {ε_i} for this particular year. For the other years, we use the information gained on this particular year to inform the transmissibility q and the {ε_i}.
Susceptibility at the start of season 2003/04
In order to use the serology data to derive susceptibility levels for the start of the 2003/04 season, we need a way of deriving the level of protection from infection by the influenza virus as a function of the titre of haemagglutination inhibition (HI) antibodies. The most commonly used model [40] is based on an inverse logit function of the logarithm of the titre and two parameters α and β:

π(T, α, β) = e^{β(log(T) − α)} / (1 + e^{β(log(T) − α)}),

where 0 ≤ π(T, α, β) ≤ 1 is the protection conferred by an HI titre of T. If the baseline risk for an individual to develop influenza is λ in the absence of protective antibodies, the risk of developing influenza with antibodies is λ(1 − π(T, α, β)). The parameter α is the logarithm of the titre giving 50% protection and β governs the shape of the function. Different studies [41,42] have shown that this relationship does not appear to depend on the origin (vaccination or infection) or the subtype of the antibodies (at least not statistically significantly). Nevertheless, Black et al. [42] have demonstrated that age is important, as younger individuals need higher titres to experience the same levels of protection. Given α_i and β_i, the parameters of the protection equation corresponding to age group i, the link between the protective antibodies of the individuals belonging to this age group and the susceptibility σ_i at the population level is given by the following equation:

σ_i = 1 − (1/N_i) Σ_{j=1}^{N_i} π(T_j, α_i, β_i),

where N_i is the number of individuals in age group i and T_j the titre of the j-th individual in this age group.
To establish a prior on the three inferred susceptibility levels σ_0−14, σ_15−64 and σ_65+ at the start of the 2003/04 season, we use titres from a study by Johnson [19] (Figure S3). The values for the protection function are sampled using the methodology and values from Coudeville et al. [41] for the 1-4 year old age group and from Black et al. [42] for individuals older than 15 years. For the 5-14 year old age group, for which we have not found studies measuring the range of α and β (the Coudeville study is restricted to children of 6 to 72 months), we use an intermediate value by sampling from a broad distribution encompassing the two distributions. The obtained distributions allow us to derive estimates of the susceptibility levels σ_0−14, σ_15−64 and σ_65+, assumed to be normally distributed (truncated to [0,1]). The values found for the susceptibility priors are summarised in Table S4.

Table S4: Derivation of the susceptibility priors.

Ascertainment probability
Using the set-based definition of the ascertainment probability ε, an estimator can be derived from the data. Indeed, ε is the ratio of the number of people recorded as ILI and potentially positive for the virological test (|B ∩ E|) to the total number of people infected by influenza A/H3N2 during the 2003/04 season (|A|).
To calculate an estimator m̂⁺_i,tot of Σ_{j=1}^{52} m⁺_ij during the year 2003/04, the number of people in age group i who were recorded as ILI and who would have been positive for influenza A/H3N2 if tested, we summed over the season. To get an estimate Ẑ_i,tot of the total number of infections in age group i over the season, we use the number of individuals who seroconverted during this season, defined as in [19] by the difference in the proportion of people with a titre ≥ 40 before and after the season.

Table S5: Derivation of the ascertainment probability priors.

As an important proportion of children seroconverted, we have a good estimate of ε_0−14, while for the elderly there is a lot of uncertainty regarding which individuals actually seroconverted. This is reflected in the upper bound for ε_65+, which is very high. To reflect the skewness of the data, we use lognormal distributions, with the mode at m̂⁺_i,tot/Ẑ_i,tot and the mean at half of the boundaries.
Transmissibility of the virus
Susceptibility levels in a population change due to antibody dynamics (immunity built up during past seasons, and its decay), demographic renewal and drift of the virus. It is thus difficult to infer immunity from another season, especially one distant in time or measured for a different subtype. Transmissibility, though probably not constant every year, is much more likely to remain similar from one year to another. The value of transmissibility from one year is thus likely to be a good prior for other years.
As susceptibility levels and transmissibility are intimately linked and cannot be measured independently, we use the value of the transmissibility inferred for H3N2 in the 2003/04 season as a prior for the other seasons and strains, while employing an uninformative prior on the susceptibility. To account for uncertainty, we have chosen as prior for the transmissibility for the other years a normal distribution with the same mean as the posterior for the 2003/04 season and 1.5 times its standard deviation.

Derivation of the likelihood
We start by deriving the likelihood for the full model and then derive the likelihood for the simplified model that we used for the implementation.
Likelihood of the full model with augmented data
Let D be the set of observable variables (i.e. our data). We are interested in inferring the parameters θ of the dynamical system, the ascertainment probabilities ε and the outside infection probability ψ. At the same time, we need to account for the m⁺, which appear in the hierarchical component of the observation model and can be considered "nuisance parameters". Two options exist: the first is to treat them as normal parameters and then consider the marginal distributions from the posterior; the second is to integrate them out. These two options lead to two different possible "likelihood" functions.
In a Bayesian framework the first approach would be favoured, as it allows one, for example, to define a prior distribution on these parameters. Nevertheless, the second is equivalent, as integrating out the m⁺ can be done using a prior. In terms of computation, the former results in a more complex parameter space to explore while the latter leads to a more complex likelihood to compute. We present both versions of the likelihood, but we have chosen to use the second, which is equivalent to assuming a flat prior on the m⁺ but reduces the dimension of our parameter space.
The first likelihood function L_1 is derived from the hierarchical structure of the observation part of our model (see Figure 1 of the main paper) by breaking down the virological and GP attendance parts of the model. The virological testing is independent of θ, so the first part of Equation (16) is given by the hypergeometric distribution of n⁺_ij: the probability of getting exactly n⁺_ij positives among the n_ij samples drawn at week j in age group i. The second part of the likelihood, describing the model of attendance at the GP and test sensitivity, is a sum of two random variables, and thus its distribution is the discrete convolution of the two distributions B(z^θ_ij, ε_i) and P(ψ_i N^mon_ij):

P(m⁺_ij) = Σ_{h=h^min_ij}^{m⁺_ij} B(m⁺_ij − h; z^θ_ij, ε_i) P(h; ψ_i N^mon_ij),   (18)

with h^min_ij = max(m⁺_ij − z^θ_ij, 0) and z^θ_ij the number of cases in age group i predicted in week j by the model given the parameters θ. Hence, by combining (17) and (18) in (16), we obtain the complete likelihood for one season.

Likelihood of the full model with summation over m⁺_ij
It is possible to derive a likelihood function free of the m⁺_ij by summing over all the values they can take; for this we first need to find the lower and upper boundaries of all possible m⁺_ij. Using the set notation, with the indices i and j denoting respectively the age group and the week, we decompose E_ij over B_ij and its complement B̄_ij (Equation (20)). As B_ij ∩ B̄_ij = ∅, we can combine the values from (3) with Equation (20), and from (F_ij ∩ B_ij) ⊆ (E_ij ∩ B_ij) we obtain an inequality relating n⁺_ij and m⁺_ij. Using the same decomposition for F_ij as (20) for E_ij, and plugging in the values from (3) with B_ij ∩ B̄_ij = ∅, we obtain (24). Then, by combining (22) with (24), we obtain the first upper boundary for m⁺_ij; since in addition n⁺_ij ≤ m⁺_ij, we get the final boundaries for m⁺_ij. Hence, if we define k^max_ij = min(z^θ_ij, m_ij − n_ij + n⁺_ij), the complete likelihood for one season can be written down, with z^θ_ij the number of cases in the monitored population in age group i predicted in week j by the model given the parameters θ.

Implementation
We present here some details of the implementation of our study. Though the work has been coded in C for speed and flexibility, we describe the algorithms used in pseudo-code. Transcription into other programming languages should be easy.

Implementation of the epidemic model
We implement the epidemic model using a simple Euler integration with a time step of 0.25 days.

Proposal for the contact matrix
The Metropolis-Hastings algorithm works efficiently when the proposed set of parameters at each step leads to a reasonable acceptance rate, in order to provide good mixing. A trade-off needs to be found between high acceptance of proposed parameters close to the current ones, leading to slow mixing, and low acceptance of more distant values, leading to poor efficiency of the algorithm and high correlation of the sampled values. For this, one needs to be able to define a metric on the parameter sets to determine whether the proposed values are close to or far from the current ones, and to adapt them (whether manually or using more sophisticated approaches). It is relatively easy to define such a distance for real-valued parameters (though tuning the actual proposal for high-dimensional parameters remains a challenging problem). Here, we propose a way of efficiently producing proposals for contact matrices "near" the current matrix in a reversible way (a condition necessary for the validity of the resulting Markov chain). The aim of this section is not to define formally the proposed solution in terms of a distance on a discrete matrix space, but to explain our approach and allow reproducibility of our results.
Instead of tracking the transmission matrix itself (or a contact matrix of some sort), we track a representative set of contacts. The update is done on the set of contacts rather than on the transmission (or contact) matrix coefficients, by resampling the original contact survey (POLYMOD) with replacement. The more similar the sets of contacts, the more similar the transmission matrices will be, and one can tune the extent of possible change at each step by allowing j substitutions with probability p_j. When this selection is done with a flat likelihood, the random walk on the matrices resulting from progressively updating the set of contacts produces a sample of transmission matrices whose expectation is the POLYMOD matrix and whose variance is that of the original survey. This method allows the inclusion of the transmission matrix in the inference process, with the transmission matrix effectively representing an additional parameter of the model.
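The proposal step on the set of contacts can be sketched as below. This is a simplified illustration with our own names: each currently resampled participant is independently replaced, with some probability, by a fresh draw (with replacement) from the original survey; `rand()` stands in for whichever sampler is actually used.

```c
#include <stdlib.h>

/* Proposal for the contact structure: from the current set of n resampled
 * survey-participant indices, independently replace each entry with
 * probability p_sub by a uniform draw from the n_survey original
 * participants. Returns the number of substitutions made. Because each
 * substitution is a symmetric resampling move, the proposal is
 * reversible, as required for the Markov chain. */
int propose_contacts(const int *current, int *proposal, int n,
                     int n_survey, double p_sub)
{
    int n_changed = 0;
    for (int k = 0; k < n; k++) {
        double u = (double)rand() / ((double)RAND_MAX + 1.0); /* in [0,1) */
        if (u < p_sub) {
            proposal[k] = rand() % n_survey;
            n_changed++;
        } else {
            proposal[k] = current[k];
        }
    }
    return n_changed;
}
```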

Implementation of the likelihood
To shorten the computation time, we need to optimize the calculation of the likelihood function. We start by reordering the terms in Equation (27). In order to be used by the Metropolis-Hastings algorithm, the likelihood function only needs to be defined up to a multiplicative constant. This means that the computation of any multiplicative factor not depending on ε, θ or ψ can be omitted in order to speed up the calculation. As the binomial coefficient (m_ij choose n_ij) depends only on the data, it can be dropped; the resulting expression can be expanded using factorials and further simplified by noticing that n⁺_ij and n_ij are observable (and thus constant). The computation of the likelihood function involves the summation of the "elementary" terms inside the square brackets in two directions, giving a factor that is part of a bigger product over each week and age group. Figure S4 depicts the surface of summation for one week and age group. Each of the blue squares represents one of the terms of the double sum in Equation (29) for a given i and j. If we call a^ij_kh the term at position (k, h), then from a practical point of view the summation can be made in any order, as long as all the "blue squares" are integrated. Equation (27) suggests a summation by columns (varying h first), but any other scheme is valid. For example, the likelihood can be computed by summing over the lines (h constant, k varying) or the diagonals (h − k constant, h and k varying). The summation scheme exploits the fact that neighbouring a^ij_kh share a large number of common factors, so that passing from one to another only involves relatively simple calculations based on a recursive scheme.
For example, the following systems, giving respectively the vertical, horizontal and diagonal increments, allow one to compute all the a^ij_kh starting from any value of a^ij_kh. The sum a^ij_sum of the a^ij_k,h allows us to derive the likelihood in a more computation-friendly form. The value of the constant involved is typically extremely large; the use of the log-likelihood rather than the likelihood avoids numerical problems.
Approximation of the likelihood
In order to speed up the computation of Equation (34), we use an approximation of a^ij_sum. For this, we start by reordering the terms in Equation (33); given a constant d, the RHS of Equation (35) can be broken into two parts, with h^trunc_ij = min(d, m_ij − n_ij + n⁺_ij). By combining Equations (36) and (37), we then obtain (38). As m^outside_ij is Poisson distributed with mean ψ_i N^mon_ij ≪ 1, we have P(m^outside_ij > d) ≈ 0, and we can thus define the truncated likelihood L^d_3.

Relationship between L_2(D|ε, ψ, θ) and L^d_3(D|ε, ψ, θ)
Given the range of values taken by ψ and ε_i during the inference, computing log(L^d_3(D|ε, ψ, θ)) is in practice equivalent to (and much quicker than) computing log(L_2(D|ε, ψ, θ)). More precisely, this follows by plugging (38) into (34) in the range of values considered.

Algorithm to compute the approximate likelihood
To compute the approximate likelihood function L^d_3(D|ε, θ, ψ), we need to define an initial value and a summation path over the reduced surface of summation. For computational reasons, as the values of the terms in the sum decrease quickly as h increases, it is best to start with the lowest line and then sum line by line until reaching the line at height d. The height h_init of the lowest line is determined by the data. If h_init > d then the surface of summation is empty and L^d_3(D|ε, θ, ψ) = 0. Otherwise, the first value (k = n⁺_ij) on the line can be computed using the set of equations in System (43). The calculation of the initial seeding value from System (43), which avoids explicitly calculating the factorial function, depends on the different possible values of n_ij, n⁺_ij and z^θ_ij.
We can break down the different elements of a^ij_init and evaluate them for each of the different possibilities, with corresponding values B_ij, C_ij and D_ij. When the seed value for the first line has been calculated (see Algorithm 1), it is possible to start calculating and summing the values of the first line by iterating k between n⁺_ij and min(z^θ_ij + h_init, m_ij − n_ij + n⁺_ij) using the relationship given in System (31) (horizontal increment). Once all the values of the first line are summed, the initial seed is used to calculate the seed for the line above: the vertical increment of System (31) is used if h ≤ n⁺_ij, else the diagonal increment is used. The calculation and summing of the values of the second line are then performed, and the same process is repeated to calculate iteratively the values on the lines above until reaching height d. Figure S5 depicts the summation surface and paths for the truncated likelihood calculation, and some code is provided in Algorithm 2.

Figure S5: Surface of summation of the truncated likelihood function for a given age group i and week j. The red dot gives the starting position of the recursive calculation.
In practice
Even extremely small values of d give good results for the inference. For example, for our dataset, a value of d = 0 works for most seasons. Only for the 2008/09 season, with a huge increase in sampling due to the testing for the new pandemic strain, was a value of d = 2 chosen.

Implementation of the Markov chain Monte Carlo
Reduction of the dimension of the parameter space
The use of the likelihood with summation over the m⁺_ij (Equation (27), simplified in (40)) rather than the original likelihood (19) allowed us to reduce the dimension of the parameter space to explore by 260 for each of the years considered (52 weeks in 5 age groups). Furthermore, to avoid overfitting and stabilise estimates, susceptibility was grouped into three age bands, children (<15), adults (15-64) and the elderly (65+), as was the ascertainment probability ε_i.
The dimension of the final model used for the inference is thus considerably lower than that of the original model (Table S6). While we would have needed to estimate 11,466 parameters in the original model (273 × 3 subtypes × 14 seasons) and 42 sets of contacts, this drops to 378 parameters with 42 sets of contacts in the simplified model. This considerably simplifies the parameter space to explore, even if its geometry remains complex, with highly correlated variables and banana-shaped connected regions.
Proposal mechanism and adaptation
We are interested in drawing from the posterior of Θ and the set of resampled contacts. For this, we used a proposal mechanism combining an Adaptive Metropolis type proposal [43] for Θ with the replacement of one contact in the current set A with probability 0.1.
The proposal function Q(x, ·) for the Adaptive Metropolis is defined with Σ_n the empirical variance-covariance matrix, i.e. the unbiased estimator of the variance-covariance matrix, and Θ̄ = (1/n) Σ_{k=1}^{n} Θ_k the empirical mean of the samples Θ_k. The chain {Θ_k}_{1≤k≤n} loses the Markov property it has with other proposal functions (e.g. for symmetric random walks), but fulfils the Diminishing Adaptation and Bounded Convergence conditions [43], which guarantee the ergodicity and asymptotic convergence of the chain towards the correct target distribution.
This algorithm, with variance (2.38)²Σ_n/d for the adaptive part, variance (0.1)²I_d/d for the fixed part and ζ = 0.05, leads to poor acceptance rates due to the shape of the posterior distribution of our parameters, and for some seasons to "learning" problems, as the fixed part at the start of the chain does not lead to any valid values. For this we used an improvement of the algorithm [44], adding another adaptive element to "learn" the scale of the proposal so as to reach an acceptance rate close to the theoretical 0.234 value. For this we define a proposal with an adaptive scaling factor ξ_n and a suitably chosen fixed diagonal matrix ∆, where ξ_n is updated at each step with a step size δ_n satisfying lim_{n→∞} δ_n = 0, to keep the Diminishing Adaptation property. The use of this adaptive scaling factor allows the algorithm to learn the optimal scaling of the proposal distribution and gives very good results in practice.
From an implementation point of view, to avoid storing the whole history of the chain {Θ_k}_{1≤k≤n}, with n potentially very large (several million for long chains), we use the fact that Σ^ij_n, the element of the empirical variance-covariance matrix at row i and column j, can be written as

Σ^ij_n = (1/(n − 1)) ( Σ_{k=1}^{n} Θ^i_k Θ^j_k − n Θ̄^i Θ̄^j ),

so that we only need to store at each step the running sums Σ_k Θ_k and Σ_k Θ_k Θ_k^T. For each function, the results of the "fast" algorithms coded in R were compared with the results of the built-in functions of the R package (usually identical up to a constant). Then the results of the C code were compared with the results given by the "fast" functions implemented in R. The parallel development of three codes, though more time consuming, reduces the chances of coding errors.
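The running-sum update described above can be sketched as follows; names and the struct layout are our own, not the authors' C implementation.

```c
#include <math.h>

/* Running accumulation of the chain's empirical mean and covariance
 * without storing its history: keep sum_k Theta_k and the outer-product
 * sums sum_k Theta_k Theta_k^T, and recover Sigma_n with the unbiased
 * estimator. */
typedef struct {
    int     d, n;
    double *sum;    /* length d   : running sum of samples       */
    double *sum_sq; /* length d*d : running sum of outer products */
} CovAccumulator;

void cov_update(CovAccumulator *acc, const double *theta)
{
    acc->n++;
    for (int i = 0; i < acc->d; i++) {
        acc->sum[i] += theta[i];
        for (int j = 0; j < acc->d; j++)
            acc->sum_sq[i * acc->d + j] += theta[i] * theta[j];
    }
}

/* Sigma_n[i][j] = (sum_sq[i][j] - n * mean_i * mean_j) / (n - 1) */
double cov_get(const CovAccumulator *acc, int i, int j)
{
    double mi = acc->sum[i] / acc->n;
    double mj = acc->sum[j] / acc->n;
    return (acc->sum_sq[i * acc->d + j] - acc->n * mi * mj) / (acc->n - 1);
}
```

Memory use is thus O(d²) regardless of the chain length, which is what makes million-step chains practical.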
The C code for the transmission model and the inference algorithm are available from the corresponding author on request.
Performance of the algorithm
The inference presented in this work was produced using chains of 11 million iterations (1 million for the burn-in), with a thinning of 10,000. This took ≈ 4 hours on a cluster for the 14 seasons and 3 subtypes of the study.
Performance of the algorithm on simulated data
Additionally, we checked whether we could retrieve parameters from simulated epidemics using the associated (simulated) data. Given the simulated data, the inference algorithm estimated the correct values and propagated the uncertainty associated with the sampling scheme adopted to generate the mock data sets. As in any Bayesian analysis, though, the ability of the resulting posterior to retrieve the "true" values depends on both the quality of the data and the assumptions made in the prior specification. Strongly informative priors distant from the "true" values will bias the estimates. This is true for the real-valued parameters but also for the inferred structure of contacts.
To illustrate this and demonstrate how our inference algorithm works on simulated data, we present the following example. We simulated an epidemic peaking at a similar time to the 1997/98 season for H3N2. We assumed that the background of ILI (from different pathogens) mirrors the epidemiology of the influenza virus considered (the simulated influenza epidemic is thus assumed to be a constant 10% of the total ILI). The values of the ascertainment probabilities and of the transmissibility are chosen near the median of the prior distribution. We assumed a non-informative prior on the susceptibility parameters and a contact matrix distant from the POLYMOD mean matrix (the algorithm infers which contacts are most likely to describe the epidemic, using the POLYMOD data for this). The results of the inference can be seen in Figure S6, with the "true" values marked with a dashed line (panels C, D and E) and the posteriors in red. The contact matrix resulting from the "true" set of resampled contacts used to simulate the epidemic and the mean POLYMOD matrix are represented for comparison in panel B of the figure.  Figure S6: Inference results from a simulated dataset. A: Comparison of the fit of the model to the age-specific time series of positive ILI cases (m+_ij) generated from the simulated epidemic. For the model, the mean (red line) with 50 and 95% CI (shaded areas) for m+_ij are based on the binomial process m+_ij ∼ Binom(Z^θ_ij, ε_i). For the "data", since n+_ij ∼ HyperGeo(n_ij, m+_ij, m_ij), the unbiased estimator for m+_ij is n+_ij × m_ij/n_ij (black dots), whereas its 95% CI can be computed using the R function hypergeo.conf (error bars). B: Comparison of the contact matrix used to simulate the data (left panel) to the POLYMOD contact matrix resulting from the contact survey used as a prior for the structure of contacts in the population (right panel).
C: Age-specific probability of being recorded as ILI and positive if tested and infected (ε_i). D: Age-specific susceptibility at the beginning of the flu season (σ_i). E: Transmission coefficient (q, left panel) and derived quantities: basic (R_0, middle panel) and effective (R_e(t = 0), right panel) reproduction numbers. For C–E, the prior distribution is shown in blue and the posterior distribution in red.
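The unbiased estimator n+_ij × m_ij/n_ij used for the black dots above follows from the mean of the hypergeometric distribution, and its unbiasedness can be checked by simulation. A minimal Python sketch (the function name is ours; the study uses R for this step):

```python
import numpy as np

def estimate_positive_ili(n_pos, n_tested, m_ili):
    """Point estimate of the number of influenza-positive ILI consultations
    m_plus, given n_pos positives among n_tested swabs drawn without
    replacement from m_ili ILI consultations.

    Since n_pos ~ HyperGeo(n_tested, m_plus, m_ili) has mean
    n_tested * m_plus / m_ili, the estimator n_pos * m_ili / n_tested
    is unbiased for m_plus."""
    return n_pos * m_ili / n_tested
```

Averaging this estimator over many simulated hypergeometric draws recovers the true number of positive consultations.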
Results show that the algorithm is "learning" from the data. The algorithm is able to estimate the parameters correctly, in particular the ones with no prior information (initial susceptibility). The correlation between susceptibility and ascertainment probability within a given age group also appears clearly, together with the "banana" shape of the likelihood function (see Figure S7). However, the fitted epidemic curves underestimate the number of cases in the 0-4 year old children and overestimate the infections among the 5-14 year olds. As the two age groups share the same ascertainment probability and the same susceptibility profile, the estimation relies on the contact matrix to adjust the mixing according to the patterns observed in the data. A closer examination of (for example) the posterior distribution of the dominant eigenvalues of the inferred contact matrices (used as a summary statistic) shows that the resulting distribution "moves" away from the prior towards the "true" value but still largely covers the prior spectrum of values (Figure S8).
If, for the same dataset, the priors (apart from the one on the contacts) are tightened towards their "true" values, then the inferred matrices move towards the "true" matrix. This is illustrated in Figure S9, where the posterior of the dominant eigenvalues can be seen moving towards its "true" value.
Tests of the algorithm on simulated data thus suggest that it is able to infer correctly both the real-valued parameters, with their uncertainty and correlation structure, and the structure of the relevant epidemiological contacts within the population.

Comparison of our statistical framework with other approaches
Our study is the first to combine, at the scale of 14 seasons and for the three subtypes circulating in human populations, an evidence synthesis with dynamical modelling. We use this analysis to assess the impact of a change of vaccination policy. It takes the health-care perspective into account by including the risk structure of the population as well as the age-specific mixing. One of the most difficult parts of such a project is to propose a coherent framework combining rigorous statistical inference with dynamical modelling.
The main difficulty of the statistical modelling is to link accurately the set of infectious individuals (used in the transmission model) with the set of people recorded by GPs as presenting with ILI (the observation variables). In simple terms, one needs to assess how many of the cases recorded by the GPs as ILI are actually consulting following infection by influenza viruses. The observation process can be seen as the intersection of sets (see Venn diagram in Figure 2 of the main paper). The accurate linkage of these two quantities is the keystone of the approach and is performed in our study by the following model:

n+_ij ∼ HyperGeo(n_ij, m+_ij, m_ij).

This equation mechanistically describes the sampling process and thus represents the observation process as faithfully as possible. The inclusion of a hypergeometric distribution makes the numerical implementation of the problem computationally challenging. Our approach to implementing the model is described in Section 2.3.
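A minimal forward simulation of this nested observation process might look as follows; this is an illustrative Python sketch under assumed inputs (the background-ILI count and all names are ours, not quantities estimated in the study):

```python
import numpy as np

def simulate_observation(rng, z_infected, epsilon, m_background, n_swabbed):
    """Sketch of the observation process as nested sampling: m_plus
    influenza-positive ILI consultations arise from Z infections with
    ascertainment probability epsilon, and swabs are then drawn without
    replacement from all ILI consultations (positive plus background)."""
    m_plus = rng.binomial(z_infected, epsilon)   # flu infections recorded as ILI
    m_ili = m_plus + m_background                # total ILI consultations
    n = min(n_swabbed, m_ili)                    # cannot swab more people than consult
    if n == 0:
        n_plus = 0
    else:
        # positives among the swabs: hypergeometric draw
        n_plus = rng.hypergeometric(m_plus, m_ili - m_plus, n)
    return m_plus, m_ili, n_plus
```

Running this forward model many times and comparing with recorded (n, n_plus) pairs is essentially what the likelihood evaluation has to invert.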
To circumvent the difficulties of its implementation, it could be tempting to approximate the hypergeometric distribution with a simpler distribution such as the binomial. For the approximation to work well, the size of the sample should be much smaller than the population from which it is drawn, and the probability of success must not be too close to 0 or 1. Unfortunately, under typical surveillance-oriented sampling, these conditions are frequently violated. For example, at the beginning of the epidemic, when ILI cases are sparse in the monitored population and surveillance is high, the number of virological samples can be close to the number of people consulting. If the circulation of other pathogens is low, the positivity might be high over the course of the epidemic. The binomial approximation is thus likely to introduce biases, especially during the beginning of the epidemic, the period on which the estimation of R_0 critically depends. Two recently published studies (both with pandemic data) share the objective of coupling statistical inference from health-care-associated data with dynamical models [45,46]. Both adopted a binomial approximation, which led these studies to add further layers to enable the inference (negative binomial distributions to introduce overdispersion, strong priors on some of the parameters, etc.).
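The breakdown of the binomial approximation when the number of swabs approaches the number of consultations can be seen by comparing the two distributions directly. The helper functions below are our own sketch (names and the numerical example are illustrative, not taken from the study):

```python
from math import lgamma, exp, log, log1p

def log_binom(n, k):
    """log of the binomial coefficient C(n, k) via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_pmf(n_plus, n, m_plus, m):
    """P(n_plus positives in n swabs drawn without replacement from
    m ILI consultations of which m_plus are influenza-positive)."""
    return exp(log_binom(m_plus, n_plus) + log_binom(m - m_plus, n - n_plus)
               - log_binom(m, n))

def binom_pmf(n_plus, n, p):
    """Binomial approximation with success probability p = m_plus / m."""
    return exp(log_binom(n, n_plus) + n_plus * log(p) + (n - n_plus) * log1p(-p))
```

For example, with m = 20 consultations, m_plus = 10 positives and n = 18 swabs, the hypergeometric concentrates sharply (almost everyone is sampled, so there is little residual variance), whereas the binomial keeps a variance of n p(1 − p) = 4.5; the approximation would badly overstate the observation noise.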
In the first study, Birrell et al. [45] combined age-specific daily ILI, virological, confirmed-case and serological data collected during the 2009 influenza pandemic in London. In comparison with our study, the authors thus benefited from two additional types of data as well as a finer temporal resolution. Focusing only on the types of data we have in common, Birrell et al. [45] assume that the virological data (n+_ij) follow a binomial sampling process and that the ILI data (m_ij) follow a negative binomial process around a deterministic mean, where m−_ij is the age-specific daily number of non-flu ILI cases (background consultations), η_j is a dispersion parameter, p_sympt the probability, given infection, of becoming symptomatic, p_GP_ij the probability, given symptomatic infection, of consulting a GP and being recorded as ILI (an age- and time-dependent probability), and f_lag(k) the probability that the lag between infection and GP report, assumed gamma-distributed with fixed mean (µ_lag = 4.1 days) and standard deviation (σ_lag = 3.5 days), lies in the interval [k, k + 1). Thus, in addition to the choice of the binomial for the virological sampling, Birrell's model differs from ours in using a deterministic approach for the mean number of people consulting among the infected, combined with a negative binomial integrating the background consultation rate. The deterministic part of the process includes a time lag in ILI reporting by GPs. Though we could have included such a lag in our model, little effect is expected on weekly aggregated ILI data, as the time scale of this lag is likely to be less than one week. In our context, the computationally expensive evaluation of this equation (as noted by Birrell et al. [45]) therefore does not appear fully justified.
This model is coupled with strongly informative priors on the m−_ij. For this, the background consultation rate (m−_ij) in London is assumed to follow the log-linear relation:

log(m−_ij) = α_j + β_i + γ_ij,

where the time (α_j), age (β_i) and interaction (γ_ij) terms are obtained by regressing this log-linear model on empirical estimates of m−_ij = (1 − n+_ij/n_ij) m_ij obtained from the regions outside London. More precisely, the log-linear regression provides means and a covariance matrix for these parameters, which are then used as priors in the Bayesian framework of Birrell et al. [45].
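A least-squares sketch of such a log-linear regression is shown below. This is our own illustration, not Birrell et al.'s code: for simplicity it fits only the main effects α_j and β_i (interaction terms omitted) and uses reference coding for the age effect.

```python
import numpy as np

def fit_loglinear_background(m_minus):
    """Least-squares sketch of the log-linear model log(m-_ij) = alpha_j + beta_i
    for the background (non-flu) ILI consultation rate, fitted on empirical
    estimates from outside the study region. m_minus is an (ages x weeks)
    array of strictly positive counts; the first age group is the reference
    (beta_0 = 0), which the week effects absorb."""
    n_age, n_week = m_minus.shape
    y = np.log(m_minus).ravel()  # row-major: row index i, column index j
    X = np.zeros((n_age * n_week, n_week + n_age - 1))
    for i in range(n_age):
        for j in range(n_week):
            r = i * n_week + j
            X[r, j] = 1.0                      # week effect alpha_j
            if i > 0:
                X[r, n_week + i - 1] = 1.0     # age effect beta_i
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:n_week], np.concatenate([[0.0], coef[n_week:]])
```

In the actual framework, the regression also supplies the covariance matrix of the coefficients, which becomes the prior covariance in the Bayesian model.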
The second paper, by Dorigatti et al. [46], combined age-specific weekly ILI data with weekly virological data collected during the 2009 influenza pandemic in Italy. In contrast to our study, the weekly virological data were only available at the population level; put another way, the authors had to use the age-aggregated weekly numbers of tested (n_j = Σ_i n_ij) and positive (n+_j = Σ_i n+_ij) samples. Dorigatti et al. [46] introduce the weekly true positivity in the entire ILI population, π_j = Σ_i m+_ij / Σ_i m_ij, and assume that m+_ij and n+_j are generated by sampling processes in which the background consultations m−_ij follow a negative binomial distribution. The use of the negative binomial to generate the m−_ij is mathematically convenient, as it allows a conjugate beta prior to be used to simplify the calculation of the associated likelihood. It is nevertheless not appropriate as a representation of the sampling mechanism, as noted by Dorigatti et al. [46] themselves.

Other seasons
Figures S10 to S51 compare the fit of the model to the age-specific swabbing data (left four panels) and show posterior distributions for some of the estimated model parameters and the derived quantities R_0 and R_e (bottom row), for H3N2, H1N1 and influenza B and for each year of the period. The comparison of the predicted positivity with the observed one suggests that the model is able to capture the timing and magnitude of the strain-specific epidemic peaks, as well as the age-specific positivity. It also highlights the wide credibility intervals on the swabbing data, which are derived from the relatively small samples taken in any given week for any given age group.
In general, for a given strain, in years in which there were significant numbers of positive cases, the expected pattern of susceptibility with age is observed, with higher levels of susceptibility in the younger age groups, declining into adulthood and the elderly (see for instance H3N2 1996/97, H1N1 2000/01 and B 2000/01). In these years, the effective reproduction number at the outset of the epidemic (middle plot of panel D) is estimated to be well above one, as expected. In years in which there were very few positive samples, the model cannot accurately distinguish between low numbers of initial seeds (reflecting low importation) with perhaps relatively high susceptibility, and higher numbers of seeds with lower susceptibility. Hence, in these years there are wide credibility intervals on the estimates of age-specific susceptibility (see for instance H1N1 1996/97). However, even in these years very high levels of susceptibility are not well supported by the data, and the distribution of the reproduction number only just exceeds one. Examining the pattern of susceptibility over time for a given strain, the model suggests that susceptibility is generally higher at the outset of an epidemic year and lower in years in which there is no epidemic (see for instance, for influenza B, the posterior distribution of susceptibility around the epidemic years of 1996/97, 2000/01 and 2005/06).
Estimates of the contact patterns (top-right panel of Figures S10 to S51) are relatively tight and consistent from one year to the next and between strains, with contacts within and between age group 3 (5-14) and other age groups consistently estimated as the highest. The exception is the mixing within the first age class (under 1), for which there are very wide CIs on the underlying contact rate. This is due to a combination of the very small sample size in the original POLYMOD dataset for this age group and the absence of age-specific epidemiological data (no swabs were taken from the under 1s).

Strain: H1N1
Figure S10: Inference results for H1N1 during the 1995-96 season. A: Comparison of the fit of the model to the age-specific time series of positive ILI cases (m+_ij) estimated from the data. For the model, the mean (red line) with 50 and 95% CI (shaded areas) for m+_ij are based on the binomial process m+_ij ∼ Binom(Z^θ_ij, ε_i). For the data, since n+_ij ∼ HyperGeo(n_ij, m+_ij, m_ij), the unbiased estimator for m+_ij is n+_ij × m_ij/n_ij (black dots), whereas its 95% CI can be computed using the R function hypergeo.conf

Posteriors of the contact matrices
The posterior distributions of the contact matrices during the epidemic years are very close to the POLYMOD prior distribution. This indicates the consistency of the POLYMOD data with the observed epidemics, but also illustrates the uncertainty present in the data: little is learnt about the contact structure beyond the information contained in the POLYMOD survey. To illustrate the similarity between the posterior distributions of the inferred contact matrices and the POLYMOD prior, we have plotted the posterior of the dominant eigenvalues of these matrices for the seasons in which a particular strain was circulating (Figure S55). The posterior of the contact values overlaps almost entirely with the prior POLYMOD matrix. Figure S56 shows the values of the specific reproduction numbers (R_CC and R_AA) and the assortativity (d_R) computed from the posterior distributions of the model for the 23 strain-specific seasonal outbreaks observed during the 14 seasons of the study. Figure S57 shows the values of the specific effective reproduction numbers (R^e_CC and R^e_AA).
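The dominant eigenvalue used above as a one-dimensional summary statistic of a posterior contact matrix can be computed as follows (illustrative Python sketch; for a non-negative contact matrix the dominant eigenvalue is real, by the Perron-Frobenius theorem):

```python
import numpy as np

def dominant_eigenvalue(contact_matrix):
    """Dominant eigenvalue of a contact matrix, used as a scalar summary
    statistic when comparing posterior contact matrices with the prior."""
    eigvals = np.linalg.eigvals(np.asarray(contact_matrix, dtype=float))
    # for non-negative matrices the spectral radius is attained at a real eigenvalue
    return float(np.max(eigvals.real))
```

Applying this function to each posterior draw of the contact matrix yields the distributions of dominant eigenvalues plotted against the prior.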

Pairwise significance of differences between strategies
Due to the correlation structure, the posterior distributions of some of the outcomes used to compare the different vaccination strategies may present overlapping credibility intervals. This is particularly true for deaths, owing to the substantial uncertainties surrounding some of the CFRs. In order to assess the significance of the differences between strategies when analysed jointly (rather than by comparing marginals), we computed in Tables S58 and S59 the p-values of the differences in terms of cases and deaths averted per dose between each pair of strategies.
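One common way to compute such a p-value from paired posterior samples is sketched below; this is our own hedged construction (the study's exact definition may differ), but it illustrates why pairing the samples, rather than comparing marginals, preserves the correlation structure:

```python
import numpy as np

def posterior_pvalue(outcome_a, outcome_b):
    """Two-sided tail probability for the difference between two strategies,
    computed on paired posterior samples (the same MCMC draws evaluated under
    both strategies). Returns the posterior probability mass on the side
    opposing the dominant direction of the difference."""
    diff = np.asarray(outcome_a, dtype=float) - np.asarray(outcome_b, dtype=float)
    p_greater = float(np.mean(diff > 0))
    return min(p_greater, 1.0 - p_greater)
```

Because the two outcomes are evaluated on the same posterior draws, strongly correlated uncertainties (e.g. a shared CFR) cancel in the difference, so the test can detect differences even when the marginal credibility intervals overlap.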

Sensitivity analysis
To assess the robustness of our study, we tested the stability of the results under two variations of the original assumptions of our model.

Prior on transmissibility
We tested the sensitivity of the results to the choice of variance for the transmissibility prior (chosen to be 1.5 times the variance of the posterior of transmissibility for the year for which we had serology). For this, we scaled our prior by 1.2 and 1.8 instead of the 1.5 value. Figure S60 shows that the results are not affected by this change in the variance of the transmissibility prior.

Improved vaccine efficacy in elderly
As a base case for the study, we assume, based on Fleming et al. [28], that the vaccine efficacy is lower for the 65+ age group. Here we measure the impact of a possible improvement in vaccine efficacy for older age groups: we tested the robustness of our conclusions to the assumption that the vaccine efficacy in the 65+ year olds is the same as in other age groups. Results are shown in Figure S61. As expected, strategies involving vaccination of the elderly become more efficient; nevertheless, the conclusion that vaccination targeting children is the most efficient strategy (in terms of reducing infections and deaths) remains true.

Figure S58: Computation of the pairwise significance of the differences observed between strategies in terms of cases averted. Strategies are ordered by increasing number of doses given. A circle indicates that the tested strategy averts cases compared with the reference strategy. The size of the circle gives the range of the difference; the colour indicates the significance of the difference between the two posterior distributions of the strategies. Red disks show high significance with p-values < 0.01, green disks show significance with p-values < 0.05, and blue disks show non-significant differences with p-values between 0.05 and 0.5. The black diagonal line represents the bisector; tested strategies appearing below the bisector are very effective: they involve fewer doses than the reference with which they are compared but result in a greater number of cases averted per dose.

Figure S59: Computation of the pairwise significance of the differences observed between strategies in terms of deaths averted. Strategies are ordered by increasing number of doses given. A circle indicates that the tested strategy averts deaths compared with the reference strategy. The size of the circle gives the range of the difference; the colour indicates the significance of the difference between the two posterior distributions of the strategies. Red disks show high significance with p-values < 0.01, green disks show significance with p-values < 0.05, and blue disks show non-significant differences with p-values between 0.05 and 0.5. The black diagonal line represents the bisector; tested strategies appearing below the bisector are very effective: they involve fewer doses than the reference with which they are compared but result in a greater number of deaths averted per dose. Strategy labels: S4: S1+S2; S5: S1+S3; S6: S1+S2+S3.

Figure S62: Effectiveness of the alternative programmes when taking into account the epidemiological uncertainty for each year (panels for 30%, 50% and 70% coverage).