Inference of epidemiological parameters from household stratified data

We consider a continuous-time Markov chain model of SIR disease dynamics with two levels of mixing. For this so-called stochastic households model, we provide two methods for inferring the model parameters—governing within-household transmission, recovery, and between-household transmission—from data recording the day on which each individual became infectious and the household in which each infection occurred, as might be available from First Few Hundred studies. Each method is a form of Bayesian Markov chain Monte Carlo that allows us to calculate a joint posterior distribution for all parameters and hence the household reproduction number and the early growth rate of the epidemic. The first method performs exact Bayesian inference using a standard data-augmentation approach; the second performs approximate Bayesian inference based on a likelihood approximation derived from branching processes. The two methods are compared in terms of computational efficiency and the posteriors they produce. The branching process is shown to be a good approximation and remains computationally efficient as the amount of data is increased.


Introduction
First Few Hundred (FF100) studies are data collection exercises carried out in the early stages of pandemic influenza outbreaks [1][2][3][4]. The aim of these studies is to characterise a novel strain to determine its impact and hence inform public health planning [5,6]. FF100 studies involve the collection of data from households where one person is confirmed to be infected. The members of the household are surveilled to identify their time(s) of symptom onset, and the study is continued until the first few hundred cases have been observed, or adequate characterisation has been achieved. Households are the primary unit of observation because they are convenient to surveil, in contrast to more general contact tracing, and because a large fraction of transmission occurs within the household [7].
Stochastic models, where the population is split into households with different rates of mixing within and between households, are a natural framework for understanding FF100 data [8]. Recent work inferred within-household epidemic parameters from this type of household stratified data [9,10]. In [10], inference is performed using a Bayesian MCMC framework. Motivated by the computational demands of such approaches, we develop another approach, based on an approximation of the original process, and compare it to a data-augmentation method. Our main goal in developing this is to produce a more computationally efficient algorithm that can potentially be used for real-time inference. Our approach is to carefully consider the dynamics and structure of the problem, allowing us to derive an approximation to the exact likelihood that can be evaluated using a novel combination of numerical methods (matrix exponential methods [18], stochastic simulations [19] and numerical convolutions). This allows us to use a simple Metropolis-Hastings algorithm to compute a joint posterior for all the parameters of interest. There are three main assumptions underpinning our method. The first is that we can approximate the early-time behaviour of the epidemic as a branching process in which only a single introduction to each household is possible. This is a very mild assumption and we would expect data collected in the early stages of an outbreak, say from an FF100 study, to conform to this reasonably closely. The second, more technical, assumption is that we can replace certain random variables that arise in the problem with their mean values. The third assumption is that households are infected at uniformly distributed times on the day of their initial infection.
We show that our method provides a good approximation to the full model, and the final posteriors that we compute show good convergence to the true model parameters as the amount of household data is increased. The method becomes more efficient than the standard DA-MCMC when dealing with large numbers of households, at the expense of introducing some positive bias in our estimates of the between-household transmission rate.

Households model and data
The dynamics of the epidemic are modelled as a continuous-time Markov chain. Individuals are grouped into H mutually exclusive households and make effective contact at a high rate within households and at a low rate between households. In this paper, for simplicity, we assume that all households are of the same size, N, and that disease dynamics follow an SIR model. Thus each individual is classified as susceptible to infection, s, infectious and able to infect susceptible individuals, i, or recovered and immune to the disease, r. As N is fixed, the state or configuration of a household can be specified by the number of susceptible and infectious individuals within the household (where r = N − s − i). Note that in this paper we do not consider models with a latent/exposed period. Extensions allowing for this are detailed in the Discussion.
If we index households by j = 1, . . ., H, then the state of the system, Y(t), can be specified by an H × 2 matrix whose jth row gives the number of susceptible and infectious individuals in household j. The state space is then (dropping the dependence on time)

$$\mathcal{S} = \left\{ (s_j, i_j)_{(j=1:H)} \in \{0, 1, \dots, N\}^{H \times 2} \;:\; s_j + i_j \le N \ \forall j \right\}.$$

Note that there are lower-dimensional representations of household models in which a state is a vector describing the total number of households in each possible configuration [20]; however, we adopt the higher-dimensional version here as it simplifies inference. The dynamics of the SIR household model are defined by the transitions that can occur throughout the population and their corresponding rates. Infectious individuals make effective contact within their household at rate β. In household j the probability that effective contact within the household leads to an infection is $s_j(t)/(N-1)$, and hence the rate of within-household infection is $\beta s_j(t) i_j(t)/(N-1)$. Each infectious individual recovers at rate γ, so recoveries in household j occur at rate $\gamma i_j(t)$. Lastly, infectious individuals may make effective contact with any individual in the population outside of their own household at rate α. Thus between-household effective contact results in an infection in household j at rate

$$\frac{\alpha\, s_j(t)\,\big(I(t) - i_j(t)\big)}{N(H-1)},$$

where $I(t) = \sum_j i_j(t)$ is the total number of infectious individuals in the population. We assume that the first infection is seeded in a single household at some U(0, 1) distributed time, θ<sub>0</sub>, such that the matrix encoding the first state, Y(θ<sub>0</sub>), has first row (N − 1, 1) and all other rows (N, 0). Note that, as our data only reveal cases of infectiousness at a daily resolution, the time of the first infection is unknown.
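To make these transition rates concrete, the following is a minimal Gillespie-style simulation sketch of the households model (our own illustrative code, not the authors' implementation; the function name and parameter defaults are assumptions):

```python
import numpy as np

def simulate_households(H, N, alpha, beta, gamma, T, rng):
    """Gillespie simulation of the SIR households model.
    s[j], i[j] = susceptibles / infectives in household j.
    Returns the (time, household) of every infection event."""
    s = np.full(H, N)
    i = np.zeros(H, dtype=int)
    s[0] -= 1; i[0] += 1            # seed one infective in household 0
    t = rng.uniform(0.0, 1.0)       # theta_0 ~ U(0, 1)
    events = [(t, 0)]
    while t < T and i.sum() > 0:
        I = i.sum()
        within = beta * s * i / (N - 1)                 # within-household infection
        between = alpha * s * (I - i) / (N * (H - 1))   # between-household infection
        recover = gamma * i.astype(float)               # recovery
        rates = np.concatenate([within, between, recover])
        total = rates.sum()
        t += rng.exponential(1.0 / total)
        if t >= T:
            break
        k = rng.choice(3 * H, p=rates / total)
        j = k % H
        if k < 2 * H:               # an infection of either kind
            s[j] -= 1; i[j] += 1
            events.append((t, j))
        else:                       # a recovery
            i[j] -= 1
    return events

rng = np.random.default_rng(1)
ev = simulate_households(H=200, N=3, alpha=0.32, beta=0.4, gamma=1/3, T=30, rng=rng)
```

The event list produced here is exactly the complete-information version of the data; the daily observation scheme described next discards the exact times.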
Data. Suppose we have observed the start of an epidemic over some time period (0, T]. We assume our data counts the cumulative number of infections in each household each day, where day t is defined as the time interval (t − 1, t]. Here we are assuming that symptoms coincide with infectiousness. Each household is labelled by j = 1, . . ., M in the order that they became infected, but note that as the process is only observed at a daily resolution (taken to be the end of each day) the ordering within a day is arbitrary. It is natural to specify this data in terms of two quantities: the days on which each household is infected and the time series of cumulative infection counts within each household, starting from their day of infection. More precisely, let ψ<sub>t</sub> be the set of the labels, j, of the households that became infected on day t. Then let $w^{(j)} = (w_k)^{(j)}$ be a vector where $w_k$ is the cumulative number of infection events within the jth household, recorded at the end of day k, from the day of the household's initial infection up to day T. Thus the data is completely specified by the sets $\{\psi_t\}_{t=1:T}$ and the vectors $\{w^{(j)}\}_{j=1:M}$, which we denote

$$\mathcal{D} = \big\{ \{\psi_t\}_{t=1:T},\ \{w^{(j)}\}_{j=1:M} \big\}. \qquad (2)$$

These quantities are illustrated for a specific example in Fig 2. We also define $\Omega_t = \bigcup_{j=1}^{t-1} \psi_j$, which is the set of labels of households that became infected before day t; this will be used in the derivation of the branching process approximation.
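As a concrete illustration of this data structure, the sketch below (our own hypothetical helper, not part of the study) converts a list of (time, household) infection events into the pairs ({ψ_t}, {w^(j)}) described above:

```python
import math
from collections import defaultdict

def build_daily_data(events, T):
    """Convert (time, household) infection events into FF100-style daily data:
    psi[t] = labels of households first infected on day t (day t = (t-1, t]);
    w[j]   = cumulative within-household infection counts at the end of each
             day, from the household's day of first infection up to day T.
    Households are relabelled 1..M in order of first infection."""
    first_day, label = {}, {}
    counts = defaultdict(lambda: [0] * T)
    for time, hh in sorted(events):
        day = math.ceil(time)                 # event time lies in (day-1, day]
        if hh not in label:
            label[hh] = len(label) + 1
            first_day[label[hh]] = day
        j = label[hh]
        for d in range(day, T + 1):           # cumulative counts from `day` onward
            counts[j][d - 1] += 1
    psi = {t: {j for j, d in first_day.items() if d == t} for t in range(1, T + 1)}
    w = {j: counts[j][first_day[j] - 1:] for j in first_day}
    return psi, w

# toy event list: household 7 infected twice on day 1, once on day 2;
# household 3 infected on days 2 and 3
events = [(0.4, 7), (0.9, 7), (1.2, 3), (1.7, 7), (2.5, 3)]
psi, w = build_daily_data(events, T=3)
```

Here household 7 receives label 1 (first infected on day 1) and household 3 receives label 2, matching the convention that labels follow the order of first infection.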

Data augmented MCMC
Data augmented Markov chain Monte Carlo (DA-MCMC) is a powerful, exact Bayesian inference method for data with missing information. We adopt an approach similar to [11] to infer the joint posterior density of (α, β, γ). The general approach is to construct an augmented likelihood, the joint density of the data and the missing information given the model parameters, and use this to construct a single-component Metropolis-Hastings algorithm. This method proves useful for FF100 study data as the exact times of infection over each day are missing, and the number of recovery events, and the times at which they occur, are entirely unknown. Although the data-augmented approach is a standard method for this kind of problem, we are not aware of it having been implemented in a household model where no transition times are known exactly and in which all parameters are unknown. For example, data-augmented MCMC has been implemented for a similar model with data obtained at regular discrete times; however, parameters associated with the infectious period or recovery distribution were assumed to be known [12,13]. Hence we outline the algorithm developed for our particular problem.
As per the usual approach we augment our data with the transition times $\theta \in \mathbb{R}^m$ and corresponding states Y = {Y(θ<sub>1</sub>), . . ., Y(θ<sub>m</sub>)} in the underlying model, where m is the unknown number of transitions over the time interval (θ<sub>0</sub>, T], which is allowed to vary. Additionally we consider the classification of infection events as missing, that is, we augment the data by transition labels z ∈ {recovered, within, between}<sup>m</sup>. This is such that we can construct sets of transition indices, A, B and C, which correspond to within-household infection, between-household infection and recovery events respectively. In writing down the expression for the augmented likelihood function we adopt the convention that all quantities (s, i) are evaluated immediately prior to a transition. Hence we have

$$\mathcal{L}_{DA} := f(\mathcal{D}, \theta, Y, z \mid \alpha, \beta, \gamma, \theta_0) = \mathbf{1}_{\{\mathcal{D},\theta,Y,z\}} \prod_{k \in A} \frac{\beta s^{(k)} i^{(k)}}{N-1} \prod_{k \in B} \frac{\alpha s^{(k)} \big(I^{(k)} - i^{(k)}\big)}{N(H-1)} \prod_{k \in C} \gamma i^{(k)} \prod_{k=0}^{m} \exp\!\left( - \sum_{j=1}^{H} \left[ \frac{\beta s_j^{(k)} i_j^{(k)}}{N-1} + \frac{\alpha s_j^{(k)} \big(I^{(k)} - i_j^{(k)}\big)}{N(H-1)} + \gamma i_j^{(k)} \right] \big(\theta_{k+1} - \theta_k\big) \right),$$

where superscripts denote transition indices, subscripts denote household indices, terms without a subscript refer to the household which changes state, $\mathbf{1}_{\{\mathcal{D},\theta,Y,z\}}$ denotes an indicator function equal to one if the data, D, could have arisen from the events defined by (θ, Y, z), and θ<sub>m+1</sub> ≔ T for simplicity. The indicator function ensures that the augmented data has the same number of infections in each household, each day, as our observed data, and that the augmented data corresponds to a feasible realisation of a household SIR model. For example, the indicator takes the value 0 if there is a within-household infection in a completely susceptible household. Note that inference could be made without labelling the two kinds of infection; however, this more explicit representation produces gamma or truncated gamma marginal densities of β and α for uniform, gamma, inverse uniform or truncated gamma priors, hence they may be efficiently sampled.
Marginal posterior densities of α, β, γ and θ<sub>0</sub> can be evaluated and sampled from in a similar way to [11]. Lastly, the joint posterior density, $f(\theta, Y, z \mid \alpha, \beta, \gamma, \theta_0, \mathcal{D})$, is proportional to $\mathcal{L}_{DA}$, so it can be sampled from by randomly choosing between five kinds of moves that modify the augmented data, according to an arbitrary probability mass function with non-zero components, {q<sub>1</sub>, . . ., q<sub>5</sub>}; moves (iv) and (v) insert and remove a recovery event, respectively. Acceptance probabilities are the ratio of the likelihood functions multiplied by the probability of returning to the current state, divided by the proposal density. For moves (iv) and (v), related to the insertion and removal of recovery times, the probability density of choosing move (iv), selecting a particular household and inserting a recovery time is q<sub>4</sub>/(M(T − θ<sub>k</sub>)), and the probability of choosing move (v) and selecting one of the |C| recoveries is q<sub>5</sub>/|C|. Each iteration of the DA-MCMC algorithm comprises Gibbs samples of α, β, γ and θ<sub>0</sub> followed by a Metropolis-Hastings step for (θ, Y, z) as per moves (i)-(v). The distribution of these samples converges to the joint posterior distribution of (θ, Y, z, α, β, γ), though consecutive samples will be highly correlated. The marginal over the parameters is obtained by simply ignoring the samples of (θ, Y, z).
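As an illustration of why the gamma full conditionals make the Gibbs steps cheap, the sketch below (our own code; the function name and prior values are hypothetical) draws a rate parameter such as β from its gamma full conditional. Under a gamma prior, the conditional shape is the prior shape plus the number of events of that type (e.g. |A| for β), and the conditional rate is the prior rate plus the corresponding unit-rate hazard integrated over the augmented trajectory:

```python
import numpy as np

def gibbs_rate_update(n_events, hazard_integral, prior_shape, prior_rate, rng):
    """Sample a rate parameter from its gamma full conditional.
    For beta: n_events = |A| (within-household infections) and
    hazard_integral = sum_k (theta_{k+1} - theta_k) * sum_j s_j i_j / (N - 1).
    Analogous statistics give the alpha and gamma updates."""
    return rng.gamma(prior_shape + n_events, 1.0 / (prior_rate + hazard_integral))

rng = np.random.default_rng(0)
# e.g. 40 within-household infections and an integrated hazard of 100 person-days,
# with a Gamma(1, 1) prior: the full conditional is Gamma(41, rate 101)
draws = np.array([gibbs_rate_update(40, 100.0, 1.0, 1.0, rng) for _ in range(20000)])
```

The empirical mean of the draws should be close to the conditional mean 41/101 ≈ 0.406.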

Branching process approximation
We now provide an approach for analysing the household model by approximating it by a model that acts like a branching process at the household level. This model is equivalent to the model obtained by letting the number of households tend to infinity. As a consequence, between-household infection occurs into completely susceptible households almost surely; this approximation is reasonable, as the data we wish to perform inference on is from the very earliest stages of an outbreak in a large population. The main reason for considering the branching process model is that households act independently after initial infection. Hence we can consider the dynamics within each infected household, following their initial infection, in isolation from each other [8,21]. Under this model we construct an approximate likelihood with the aim of obtaining accurate estimates for the joint posterior distribution of (α, β, γ). The posteriors using this approximation are compared with the full household epidemic model in the Results section. We show that the resulting posterior distribution approximates the exact posterior distribution of the household model well, while the independence assumption allows for computational gains in the inference as the data set grows in size.
As we are considering households in isolation from each other, we define the state space for a single household as

$$\mathcal{S} = \big\{ (s, i) \in \{0, 1, \dots, N\}^2 : s + i \le N \big\}.$$

The within-household dynamics are defined by the transitions that can occur within an individual household and their corresponding rates; these are simply the within-household infection and recovery transitions with rates as described before. The within-household process can be defined in terms of its infinitesimal transition rate matrix, Q, with non-zero off-diagonal entries

$$Q_{f(s,i),\, f(s-1,\, i+1)} = \frac{\beta s i}{N-1}, \qquad Q_{f(s,i),\, f(s,\, i-1)} = \gamma i,$$

and diagonal entries such that each row sums to zero, where $f : \mathcal{S} \to \{1, \dots, |\mathcal{S}|\}$ is a bijective map [22]. The first infection within each household moves it into state (N − 1, 1), at which point the within-household dynamics determine how the disease spreads within the household for the remainder of the epidemic. We assume that between-household infection occurs due to homogeneous mixing of all the individuals in the population at rate α, thus the rate at which new households are infected is simply αI(t), where I(t) is the total number of infected individuals in the population at time t. The model is initialised with a single infected household at a U(0, 1) distributed time.
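The generator Q can be assembled directly from the within-household rates; the sketch below (our own illustrative code, using the same N, β and γ as in the Results section) builds Q for a household of size N = 3:

```python
import numpy as np

def build_Q(N, beta, gamma):
    """Infinitesimal generator of the within-household SIR chain on
    S = {(s, i): s + i <= N}, with a bijection f from states to indices."""
    states = [(s, i) for s in range(N + 1) for i in range(N + 1 - s)]
    f = {st: k for k, st in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (s, i), k in f.items():
        if s > 0 and i > 0:                               # within-household infection
            Q[k, f[(s - 1, i + 1)]] = beta * s * i / (N - 1)
        if i > 0:                                         # recovery
            Q[k, f[(s, i - 1)]] = gamma * i
        Q[k, k] = -Q[k].sum()                             # rows sum to zero
    return Q, states

Q, states = build_Q(N=3, beta=0.4, gamma=1/3)
```

For N = 3 the state space has 10 configurations, so Q is a 10 × 10 matrix; the one-day transition matrix used later is then e^Q.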
For this model we can identify the threshold parameter, $R_*$, which is a household (population-level) reproduction number [8]. This is the expected number of households infected by a primary infectious household in an otherwise susceptible population of households, where a household is considered infectious while it contains at least one infectious individual and susceptible if it contains only susceptible individuals. It is one of at least five reproduction numbers that might be used when assessing the controllability of a disease in a community of households [23][24][25], but we adopt it herein as it is relatively easy to calculate and interpret. Let $\{X_t\}_{t \in \mathbb{R}_+}$ be the Markovian process that describes the state of an individual household from the time of its infection (i.e., the time of the first infection within the household). Let I(k) be the function which returns the number of infectious individuals corresponding to state k. Then we have

$$R_* = \alpha\, \mathbb{E}\left[ \int_0^\infty I(X_t)\, \mathrm{d}t \right],$$

where X<sub>0</sub> = (N − 1, 1) is the initial state of the process [8,26,27]. This can be calculated by solving a system of linear equations that depends on the parameters of the epidemic model [27,28]. Also of interest from a public health perspective is the early growth rate, r; this is also called the Malthusian parameter. Under the same conditions as above, this is defined as the unique solution of

$$\alpha \int_0^\infty e^{-rt}\, \mathbb{E}\big[I(X_t)\big]\, \mathrm{d}t = 1.$$

This can once again be evaluated efficiently [27].

Approximate likelihood. The branching process likelihood approximation relies on expressing the likelihood in terms of the data on a given day, t, partitioned into the newly infected households, ψ<sub>t</sub>, and the previously infected households, Ω<sub>t</sub> (see Fig 2 for an example). With this partition, the likelihood for (α, β, γ) can then be written as

$$f(\mathcal{D} \mid \alpha, \beta, \gamma) = \prod_{t=1}^{T} P\big(\psi_t \mid \{w^{(j)}\}_{j \in \Omega_t}\big)\, P\big(\{w^{(j)}\}_{j \in \psi_t} \mid \psi_t\big), \qquad (4)$$

where we have invoked the independence between $(\{w^{(j)}\}_{j \in \psi_t} \mid \psi_t)$ and $\{w^{(j)}\}_{j \in \Omega_t}$ due to the branching process assumption.
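Both quantities can be computed with elementary linear algebra. The sketch below (our own code, not the authors' implementation) restricts the within-household generator to the transient states (those with i ≥ 1, so that recovery from i = 1 is absorption). The expected infectious person-time is then $e_0^\top(-Q)^{-1} i$, giving $R_*$, and r can be found by bisection since $\alpha\, e_0^\top(rI - Q)^{-1} i$ is decreasing in r:

```python
import numpy as np

N, alpha, beta, gamma = 3, 0.32, 0.4, 1/3   # values used in the Results section

# within-household generator restricted to transient states (i >= 1)
states = [(s, i) for s in range(N + 1) for i in range(1, N + 1 - s)]
f = {st: k for k, st in enumerate(states)}
n = len(states)
Q = np.zeros((n, n))
for (s, i), k in f.items():
    rate_inf = beta * s * i / (N - 1)
    if s > 0:
        Q[k, f[(s - 1, i + 1)]] = rate_inf
    if i > 1:
        Q[k, f[(s, i - 1)]] = gamma * i
    Q[k, k] = -(rate_inf + gamma * i)        # total outflow, including absorption

e0 = np.zeros(n); e0[f[(N - 1, 1)]] = 1.0    # initial state (N - 1, 1)
num_inf = np.array([i for (_, i) in states], dtype=float)

# R_* = alpha * E[ int_0^inf I(X_t) dt ] = alpha * e0' (-Q)^{-1} i
R_star = alpha * e0 @ np.linalg.solve(-Q, num_inf)

# r solves alpha * int_0^inf e^{-rt} E[I(X_t)] dt = alpha * e0' (rI - Q)^{-1} i = 1
def g(r):
    return alpha * e0 @ np.linalg.solve(r * np.eye(n) - Q, num_inf) - 1.0

lo, hi = 0.0, 10.0                           # g(0) = R_* - 1 > 0, g(10) < 0
for _ in range(200):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
r = (lo + hi) / 2
```

With these parameter values the computation reproduces R_* ≈ 1.8, matching the value quoted in the Results section.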
That is, we use the fact that under the branching process assumption households are conditionally independent given their initial infection. Note that we have split the likelihood in a way that does not use the Markov property; this is because the Markov property cannot be easily exploited here, as the state of the process is never observed exactly.
As Ω<sub>1</sub> = ∅, that is, there are no households infected before day 1, the term $P\big(\psi_1 \mid \{w^{(j)}\}_{j \in \Omega_1}\big) = P(\psi_1)$ is determined by the initial condition. Further, households in ψ<sub>t</sub> are identically distributed in the absence of within-household information. Thus their labels are arbitrary and only the number of households in ψ<sub>t</sub> is relevant, where |ψ<sub>t</sub>| denotes the cardinality of ψ<sub>t</sub>, that is, the number of households infected over day t. Thus we can factor the likelihood, Eq (4), into two parts related to the within-household dynamics and between-household dynamics respectively. That is,

$$\mathcal{L}_w = \prod_{j=1}^{M} P\big(w^{(j)}\big) \qquad (5)$$

and

$$\mathcal{L}_b = \prod_{t=1}^{T} P\big(|\psi_t| \,\big|\, \{w^{(j)}\}_{j \in \Omega_t}\big). \qquad (6)$$

We refer to $\mathcal{L}_w$ as the within-household likelihood function and $\mathcal{L}_b$ as the between-household likelihood function. In the following we detail how we calculate $\mathcal{L}_b$ and $\mathcal{L}_w$.

Between-household likelihood, $\mathcal{L}_b$. Each term in the product for the between-household likelihood, Eq (6), is the probability that we observe $H_t := |\psi_t|$ newly infected households on day t, given the data, over the time period [0, T], for households that were infected before day t. We decompose $H_t$ into two components. The first component, $H_t^{(1)}$, is the number of newly infected households on day t that are infected by a household in Ω<sub>t</sub>, i.e., a household infected before day t. The second component, $H_t^{(c)}$, is the remaining number of newly infected households on day t, i.e., those that are infected by households that themselves become infected on day t. We do not observe this demarcation in our data, but it assists us in the evaluation of the likelihood.
We start by considering the calculation of the probability mass function (pmf) of $H_t^{(1)}$, denoted $h_t^{(1)}$. Then, we consider the evaluation of the pmf of $H_t^{(c)}$, denoted $h_t^{(c)}$. The required pmf of $H_t$, $h_t$, is subsequently evaluated using efficient methods for calculating convolutions.
First generation of households, $H_t^{(1)}$. To calculate $h_t^{(1)}$, the pmf of the number of first-generation infected households, we note that on the first day of the epidemic there is only a single household, infected at a U(0, 1) distributed time. Hence there is exactly 1 infected household in the first generation of households, so $P(H_1^{(1)} = 1) = 1$. For t ≥ 2 we consider the rate at which the households in Ω<sub>t</sub> infect new households. As we model the outbreak as a branching process, we assume that only completely susceptible households are infected, hence the instantaneous rate of infection at time τ is

$$\alpha \sum_{j \in \Omega_t} I\big(X^{(j)}_\tau\big),$$

where $X^{(j)}_\tau$ is the state of household j at time τ and I(k) is a function returning the number of infectious individuals in a household in state k.
Thus the first generation of households is created by an inhomogeneous Poisson process, and conditioning on the paths of the households in Ω<sub>t</sub> over day t,

$$H_t^{(1)} \,\big|\, \Lambda_t \sim \mathrm{Poisson}(\alpha \Lambda_t), \qquad (7)$$

where

$$\Lambda_t = \int_{t-1}^{t} \sum_{j \in \Omega_t} I\big(X^{(j)}_\tau\big)\, \mathrm{d}\tau. \qquad (8)$$

Hence, we need to evaluate the distribution of $\Lambda_t$. However, this is expensive to compute, so instead we replace $\Lambda_t$ in Eq (7) with its expectation, which can be evaluated in a feasible manner. Precisely, we use

$$H_t^{(1)} \sim \mathrm{Poisson}\big(\alpha\, \mathbb{E}[\Lambda_t]\big).$$

This approximation allows tractability of the between-household likelihood; we do not, however, restrict the paths of the households.

Second generation of households, $H_t^{(c)}$. We assume that the infections of the $H_t^{(1)}$ first-generation households are uniformly distributed over day t, and since their dynamics are independent, we have that $H_t^{(c)}$ is a sum of $H_t^{(1)}$ independent and identically distributed random variables; we will use G to denote one of these random variables. Each of these random variables corresponds to the number of households subsequently infected, by time 1, in a household branching process that was initialised at a Uniform(0, 1) distributed time. The calculation of the pmf of G is once again computationally expensive, so here we choose to estimate this distribution using simulation [19]. We are simulating over a short period of time (a day) and use the most efficient representation to minimise computational time. This allows a large number of simulations to be produced in a computationally efficient manner.
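A minimal Monte Carlo sketch of this estimation step is given below (our own illustrative code; the paper uses an efficient representation and more replicates). Each replicate starts one household at a Uniform(0, 1) time; because households evolve independently under the branching assumption, each active household can be simulated on its own clock up to time 1:

```python
import numpy as np

def simulate_G(N, alpha, beta, gamma, n_sims, rng):
    """Estimate the pmf of G: the number of further households infected, by the
    end of the day, in a household branching process started from one newly
    infected household at a Uniform(0, 1) time within the day."""
    counts = np.zeros(n_sims, dtype=int)
    for m in range(n_sims):
        active = [(rng.uniform(), N - 1, 1)]   # (time, s, i) of active households
        g = 0
        while active:
            t, s, i = active.pop()
            while i > 0 and t < 1.0:
                r_within = beta * s * i / (N - 1)
                r_recover = gamma * i
                r_between = alpha * i          # infects a fully susceptible household
                total = r_within + r_recover + r_between
                t += rng.exponential(1.0 / total)
                if t >= 1.0:
                    break
                u = rng.uniform() * total
                if u < r_within:
                    s -= 1; i += 1
                elif u < r_within + r_recover:
                    i -= 1
                else:
                    g += 1
                    active.append((t, N - 1, 1))   # new household, simulated later
        counts[m] = g
    return np.bincount(counts) / n_sims

rng = np.random.default_rng(7)
pmf_G = simulate_G(N=3, alpha=0.32, beta=0.4, gamma=1/3, n_sims=2000, rng=rng)
```

With these parameter values most replicates produce no new households within the day, so the pmf is concentrated at G = 0.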
Once we have estimated the pmf of G, we can calculate $h_t$ from $h_t^{(1)}$ as

$$h_t = M\, h_t^{(1)},$$

where the convolution matrix, M, is defined as follows. Let ϕ be a column vector containing the pmf of the random variable G + 1. Then let

$$c_k = c_{k-1} * \phi, \qquad c_1 = \phi,$$

where '∗' denotes a discrete convolution. The matrix M is then given by

$$M = \big( e_1, \; c_1, \; c_2, \; \dots \big),$$

where $e_1$ is a vector of 0s with the exception of a 1 in the first entry. The matrix M is truncated such that no probability needed for the calculation of the likelihood is lost. This calculation is not expensive, even for large matrices, as the convolutions can be performed using discrete Fourier transforms [29]. In this paper we simply use the built-in MATLAB function conv(), although other methods may provide computational gains, if required.
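The construction of M can be sketched as follows (illustrative Python rather than the paper's MATLAB, with a made-up pmf for G + 1; `np.convolve` plays the role of conv()):

```python
import numpy as np

def convolution_matrix(phi, K):
    """Build M whose (k+1)-th column is the k-fold convolution of phi, the pmf
    of G + 1; column 0 is e_1, since zero first-generation households means no
    new households at all. Then h_t = M @ h1_t gives the pmf of H_t from the
    pmf of H_t^(1), truncated to length K."""
    M = np.zeros((K, K))
    M[0, 0] = 1.0                               # e_1
    col = np.zeros(K); col[:len(phi)] = phi[:K]  # c_1 = phi (padded)
    for k in range(1, K):
        M[:, k] = col
        col = np.convolve(col, phi)[:K]          # c_{k+1} = c_k * phi, truncated
    return M

phi = np.array([0.0, 0.7, 0.2, 0.1])   # illustrative pmf of G + 1 (support 1..3)
M = convolution_matrix(phi, K=12)
h1 = np.zeros(12); h1[1] = 1.0          # exactly one first-generation household
h = M @ h1                               # then H_t has the pmf of G + 1 itself
```

As a sanity check, with exactly one first-generation household the total count H_t simply has the distribution of G + 1, which the matrix product reproduces.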

Single household dynamics, E[Λ t ] and L w .
Recall that the evaluation of the pmf $h_t^{(1)}$ for t ≥ 2, corresponding to the number of first-generation infected households on day t, requires the evaluation of the expected force of infection over day t from households infected prior to day t, E[Λ<sub>t</sub>]. We begin by detailing the evaluation of E[Λ<sub>t</sub>], and then note how the within-household likelihood, $\mathcal{L}_w$, follows.
The computation of E[Λ<sub>t</sub>] can be expressed in terms of integrals of the expected number of infectious individuals within each household in Ω<sub>t</sub>, Eq (8). As this expression is a sum over independent households, we simplify our exposition by detailing the calculation for a single arbitrary household in Ω<sub>t</sub> with observed data w (thus dropping the superscript 'j' notation for now). The independence also means we can rescale time within the household to begin at the start of the day of the first infection. Thus we need to calculate the expected number of infected individuals over each of the |w| = ω days since the first infection within that household, i.e.

$$\mathbb{E}\big[I(X_\tau) \mid w\big] = \sum_{k} I(k)\, P\big(X_\tau = k \mid w\big) \qquad (9)$$

for all τ ∈ (t − 1, t], where t = 2, . . ., ω. Note that as we are conditioning on the entire observed data within the household, w, the random variable ω is implicitly conditioned on. That is, we are conditioning on knowing that the household became infected on day T − ω + 1. In the remainder of this subsection all probabilities are conditioned on ω, but this is not written explicitly, for concision. The expectation in Eq (9) can be calculated efficiently, and hence the integral of the expectation to find the force of infection can also be calculated efficiently and accurately with, say, Simpson's rule. Our calculation is similar to that of the forward-backward algorithm [30], but is more involved as we need to calculate the expectations for all τ, not just at the discrete time points at which observations occur. First we define some quantities. As $w_t$ is the total number of infections observed in the household by the end of day t (within-household time), $X_t$ must be in a set of states such that N − s(t) = $w_t$. These states are encoded by indicator vectors, $z_t$, with 1s in entries corresponding to states where N − s(t) = $w_t$ and zeros otherwise; these are treated as either row or column vectors as required.
Define the row vector $f_t$ as the 'forward' probabilities of the system, so the kth element is the probability the system is in state k at the end of day t, given the observed data up to day t,

$$[f_t]_k = P\big(X_t = k \mid w^{(1:t)}\big).$$

These can be calculated in a recursive manner as follows:

$$f_t = \frac{\big(f_{t-1}\, e^{Q}\big) \odot z_t}{\big(f_{t-1}\, e^{Q}\big) \cdot z_t},$$

where '⊙' is an element-wise vector product. The first vector, $f_1$, is determined from the initial condition as follows: let v be a probability vector with a 1 in the entry corresponding to state (N − 1, 1). Then, as the infection is introduced into each household at a Uniform(0, 1) distributed time on their day of infection, the distribution of $X_1$ is given by

$$u = \int_0^1 v\, e^{Q(1-\tau)}\, \mathrm{d}\tau.$$

Conditioning on $w_1$ gives $f_1 = u \odot z_1 / (u \cdot z_1)$. We also define the 'backward' probabilities, $b_t$, with elements

$$[b_t]_k = P\big(w^{(t+1:\omega)} \mid X_t = k\big).$$

These are the probabilities of observing the remainder of the data given that the system is in state k at the end of day t. These can be calculated in a similar recursive way to the forward probabilities, but working backward from the final observation:

$$b_t = e^{Q}\big(z_{t+1} \odot b_{t+1}\big), \qquad b_\omega = \mathbf{1}.$$

Applying Bayes' theorem to the pmf in Eq (9) and using the Markov property, we arrive at

$$\mathbb{E}\big[I(X_\tau) \mid w\big] = \frac{\sum_k I(k)\, \big[f_{t-1}\, e^{Q(\tau - t + 1)}\big]_k\, \big[e^{Q(t - \tau)}\big(z_t \odot b_t\big)\big]_k}{\sum_k \big[f_{t-1}\, e^{Q(\tau - t + 1)}\big]_k\, \big[e^{Q(t - \tau)}\big(z_t \odot b_t\big)\big]_k},$$

where, in vector form, i is a vector whose elements are the number of infected individuals in each state. This allows us to numerically evaluate $h_t^{(1)}$; note that all matrix exponential calculations here can be expressed as powers of $e^Q$, so we only need to compute the matrix exponential once per parameter set and take powers of the resulting matrix. Further, when numerically integrating, using a symmetric grid about t − 1/2 allows us to take advantage of the symmetry of $e^{Q(\tau - t + 1)}$ and $e^{Q(t - \tau)}$, effectively halving the number of times we need to take powers of $e^Q$.
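The forward-backward recursions above can be sketched as follows (our own illustrative Python, not the authors' MATLAB code, using a simple scaling-and-squaring matrix exponential and a crude quadrature for the initial distribution; the observed data w is made up):

```python
import numpy as np

N, beta, gamma = 3, 0.4, 1/3

def expm(A, squarings=10, terms=20):
    """e^A by truncated Taylor series with scaling and squaring."""
    B = A / 2.0**squarings
    E, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ B / k
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

# within-household generator on S = {(s, i): s + i <= N}, as in the model section
states = [(s, i) for s in range(N + 1) for i in range(N + 1 - s)]
f = {st: k for k, st in enumerate(states)}
Q = np.zeros((len(states),) * 2)
for (s, i), k in f.items():
    if s > 0 and i > 0:
        Q[k, f[(s - 1, i + 1)]] = beta * s * i / (N - 1)
    if i > 0:
        Q[k, f[(s, i - 1)]] = gamma * i
    Q[k, k] = -Q[k].sum()
P = expm(Q)                                   # one-day transition matrix e^Q
num_inf = np.array([i for (_, i) in states], dtype=float)

w = [1, 2, 2]                                 # cumulative infections, days 1..omega
omega = len(w)
z = [np.array([1.0 if N - s == wt else 0.0 for (s, _) in states]) for wt in w]

# distribution of X_1: infection at a Uniform(0,1) time on day 1 (crude quadrature)
v = np.zeros(len(states)); v[f[(N - 1, 1)]] = 1.0
u = np.mean([v @ expm(Q * (1 - tau)) for tau in np.linspace(0, 1, 101)], axis=0)

# forward pass; normalising constants multiply to the within-household likelihood
fvecs, consts = [], []
fw = u * z[0]; consts.append(fw.sum()); fvecs.append(fw / fw.sum())
for t in range(1, omega):
    fw = (fvecs[-1] @ P) * z[t]
    consts.append(fw.sum()); fvecs.append(fw / fw.sum())
L_w_single = np.prod(consts)                  # contribution of this household to L_w

# backward pass
bvecs = [None] * omega
bvecs[-1] = np.ones(len(states))
for t in range(omega - 2, -1, -1):
    bvecs[t] = P @ (z[t + 1] * bvecs[t + 1])

def expected_infectives(tau, t):
    """E[I(X_tau) | w] for tau in day t (within-household time), 2 <= t <= omega."""
    a = fvecs[t - 2] @ expm(Q * (tau - (t - 1)))          # forward to time tau
    b = expm(Q * (t - tau)) @ (z[t - 1] * bvecs[t - 1])   # remaining observations
    p = a * b
    return (p @ num_inf) / p.sum()

m = expected_infectives(1.5, 2)               # mid-day expectation on day 2
```

Integrating `expected_infectives` over each day (e.g. with Simpson's rule) gives this household's contribution to E[Λ_t].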
Using the quantities calculated above, we can also calculate the within-household likelihood, $\mathcal{L}_w$, described in Eq (5). Let $l^{(j)}$ denote the length of $w^{(j)}$. Note, under the branching process assumption, infected households act independently of each other, so their within-household dynamics following their infection are independent. Hence

$$\mathcal{L}_w = \prod_{j=1}^{M} \prod_{t=1}^{l^{(j)}} P\big(w^{(j)}_t \mid w^{(j)}_{(1:t-1)}\big);$$

that is, the within-household likelihood is a product of the normalising constants for the forward probabilities. Thus the within-household likelihood is calculated as a by-product of the expectation calculations.

Results
Our inference methods are compared based upon 50 simulations with true parameter values (α, β, γ) = (0.32, 0.4, 1/3) and 50,000 households of size N = 3 (the average household size in Australia is estimated to be 2.6 [31]). These simulations are from the full stochastic household model, not the simplified model where households are conditionally independent after their initial infection. These parameters are chosen such that the average infectious period is three days (a typical infectious period for influenza), R<sub>0</sub> = β/γ = 1.2 and $R_* \approx 1.8$. For the branching process approximation (BPA) the number of realisations used to estimate the distribution of G was 10<sup>3</sup>. Each algorithm is based upon a Bayesian Markov chain Monte Carlo (MCMC) framework in order to estimate the joint posterior distribution of our parameters [32]. In particular, the BPA is a Metropolis-Hastings algorithm and the DA-MCMC is a single-component Metropolis-Hastings algorithm. Each algorithm is run at various stages of the epidemic in order to show how the posterior distributions converge as more households become infected; the inference for each simulation is run after 50, 100, 200, 300 and 400 households become infected. All kernel densities were estimated using the freely available MATLAB packages kde2.m and akde.m [33]. Means and standard deviations for MAP estimates are given explicitly in Table 1. From Fig 3, we observe that MAP estimates begin negatively biased for all parameters and converge towards fixed points as more data are obtained. The medians of the MAPs of the BPA method for β and γ are lower than those of the DA-MCMC method, whereas the median of the MAPs of α is higher. The boxes associated with β and γ for each method overlap, whereas the boxes associated with α are biased higher for the BPA method when data is based upon 300 and 400 infected households.
In Fig 4, we observe that the boxes of the MAP estimates converge to the true values of $R_*$ and r for the DA-MCMC method, whereas they are biased above the true values for the BPA method. The positive bias in these quantities is due to the overestimation of α by the BPA method. The box plots indicate a general trend that the variability of the MAP estimates decreases as more data is obtained. It should be noted that these box plots do not show the correlation structure of the parameters; this is not presented here as the dimension of the parameter space makes the correlation structure difficult to display. In Fig 5 the posterior densities of $R_*$ and r appear similar between the two methods, although the bias of the BPA is clear.
For both methods the variability in the posterior distribution is observed to decrease in a similar way as more households are infected. Table 1 shows that when inference is run after 400 households are infected, the means of the MAPs of both methods lie within a standard deviation of the true values of (α, β, γ). We also find that the average MAP estimate of β/γ is 1.2484 under both methods; this excellent agreement at the household level indicates that the branching process is an appropriate approximation to the full household epidemic process. Of the two methods, only the means of the MAPs from the DA-MCMC method lie within a standard deviation of the true values of $R_*$ and r. As the DA-MCMC method is an exact method, and both methods were run on the same simulations, we can compare the differences of the MAPs from the two methods; these are given in the final row of Table 1. The differences of the MAPs for β and γ lie within a standard deviation of 0, while the differences for α, $R_*$ and r are in excess of 2.5 standard deviations from 0. This indicates that the BPA method leads to a significantly different answer, in terms of α, $R_*$ and r, compared to exact methods. On average we saw a 7.8%, 16.0% and 24.7% positive error in α, $R_*$ and r respectively.
The efficiency of the two algorithms cannot be compared directly in terms of iterations per unit time, as samples from the DA-MCMC are more highly correlated than samples from the BPA [16]. Hence, the algorithms are compared in terms of their multivariate effective sample size per hour, where the multivariate effective sample size is an estimate of the number of independent samples in a dataset [34]. Fig 6 shows that the DA-MCMC is initially much more efficient than the BPA algorithm; however, it scales poorly as more data is obtained and is less efficient than the BPA after 200 households are infected. The efficiency of the BPA algorithm appears to be highly left skewed, as there were some outlying simulations that were much less efficient than the others. Even these outliers were more efficient under the BPA method when inference is based on 400 infected households. Note that the multivariate effective sample sizes of the BPA and DA-MCMC averaged 3366 and 4138 respectively when inference is based on 400 infected households, so even though the results are based upon different sample sizes, the multivariate effective sample sizes are comparable and sufficiently large. On average the DA-MCMC algorithm with 50, 100, 200, 300 and 400 infected households will take 0.06, 0.19, 1.45, 5.14 and 13.72 hours respectively to obtain an effective sample size of 3000, whereas the BPA algorithm can do the same in 0.49, 0.53, 0.70, 0.94 and 1.24 hours respectively. The BPA method is twice as efficient as the DA-MCMC algorithm by the time 200 households are infected and more than 11 times as efficient when 400 households are infected.

Discussion
In this paper we have implemented a DA-MCMC algorithm for exact inference on a stochastic SIR household model and derived a method to approximate the likelihood for an SIR household branching process. These allow us to perform Bayesian MCMC inference to compute posteriors for both R* and r, which are of importance for public health planning. This is the first study that we are aware of to estimate these quantities using FF100 study data and a Bayesian framework. The posterior distributions for the DA-MCMC method converge towards the true parameter values and the MAPs exhibited a standard deviation of less than 0.034 for all model parameters; this indicates that the FF100 study data can be highly informative despite the amount of missing information. In particular, posterior densities exhibit little variability after 200 households have become infectious. The posteriors from both methods appear to have a similar shape and a similarly decreasing variability; however, the BPA method leads to some systematic positive bias in α, the between-household transmission parameter. The average positive bias when 400 households were infected was only 7.8%, though this then leads to larger positive biases in both R* and r. Both methods accurately estimate the within-household parameters, β and γ. The branching process should be least applicable with 400 infected households, yet the BPA performs as well as the DA-MCMC method for estimating β/γ, hence the assumption of a branching process is good.

Table 1. MAP estimates of (α, β, γ, R*, r). The means and standard deviations of the 50 MAP estimates based upon data with 400 infected households are shown in the form mean(standard deviation) for the BPA and DA-MCMC methods. The last row shows the difference in the mean and standard deviation between the two methods.
Thus the most likely cause of the positive bias in α in the BPA method is the replacement of the random variable Λ_t, the path integral of the number of infectious individuals over the day, with its expectation. This bias may arise because Λ_t is left skewed, so its mean is situated to the left of the majority of the probability mass of its density. Hence αE[Λ_t] would often be less than αΛ_t, leading to an overestimation of α.
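The mechanism can be illustrated with a toy distribution. The sketch below uses a hypothetical left-skewed stand-in for Λ_t (an upper bound minus a lognormal tail; the actual distribution of the path integral is not available in closed form) to show that the mean of a left-skewed variable sits below its median, so αE[Λ_t] understates the realised αΛ_t in the majority of realisations, whatever the value of α > 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical left-skewed stand-in for Lambda_t: a ceiling minus a
# lognormal tail (illustrative only, not the model's distribution)
lam = 10.0 - rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

mean_lam = lam.mean()
median_lam = np.median(lam)
# Fraction of realisations in which alpha * E[Lambda_t] falls short of
# alpha * Lambda_t; independent of alpha since alpha > 0
frac_understated = (lam > mean_lam).mean()
```

Here the mean falls below the median and roughly 60% of realisations exceed the mean, so an inference scheme that fixes the exposure at E[Λ_t] must compensate for the shortfall in infectious pressure by inflating α, consistent with the positive bias observed.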
We compared the efficiency of the methods and found that the DA-MCMC is superior for data with up to 100 infected households; however, due to the poor scaling properties of the DA-MCMC, the BPA is much more efficient for data of 200 or more infected households, keeping in mind that the BPA may introduce some positive bias into estimates of R* and r. Both methods are able to produce reasonable effective sample sizes within a day of computation for up to 400 infected households and hence could be useful in the early stages of a disease outbreak; however, the BPA allows for a more immediate assessment once more than 100 households are infected.
Clearly a more complex epidemic model, which accounts for the latent period of the disease as well as partial observation [10], will be needed for analysing real FF100 data, but the type and amount of data assumed in this paper would remain essentially the same (only symptom onset times). Hence the amount of missing data to be inferred in a DA approach would increase, and it is a well-known drawback of this method that mixing, and hence the overall speed of sampling, deteriorates with increasing amounts of missing data [15,16]. The speed-up achieved with the BPA method therefore indicates that such approximations could be the only way forward for near real-time estimation with more complicated models where the DA-MCMC approach becomes too inefficient. Such considerations motivate further investigation of these methods, particularly looking at improvements to reduce the bias in the α parameter.
Another way to analyse the stochastic SIR household model is to cast it as a multi-type branching process [35,36]. The theory of such processes is well developed, and hence we can write down equations for many of the quantities needed to calculate a likelihood, but actually solving these is too inefficient for practical inference, where the likelihood calculation is embedded in an MCMC scheme and hence must be repeated many times. Indeed, the equations for the probability generating function of the full branching process are very simply stated, but their solution involves a multi-dimensional inversion [37]. As such, the approach we have taken with the BPA is to factor the likelihood in a non-standard way, using a small number of well-motivated assumptions. Our factorisation allows us to calculate its parts using a combination of numerical techniques: matrix exponential methods, stochastic simulations and numerical convolutions. Each method is appropriate for its task and relatively efficient. For example, the simulation to calculate the distribution of G can be programmed efficiently as there is no conditioning involved, so all the generated realisations can be used. There may be room to improve the accuracy of estimates by choosing a more appropriate distribution for the initial infection times of the households over each day, rather than assuming they are uniformly distributed over the day. One approach would be to split the day into discrete time steps and let households be infected at times drawn from these intervals, weighted by the expected force of infection over each interval.
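The final suggestion can be sketched directly: split the day into intervals, weight each interval by an assumed expected force of infection, and draw infection times from the resulting discrete mixture. In the sketch below the rate function and the exponential-growth example are purely illustrative stand-ins, not quantities taken from our model.

```python
import numpy as np

def sample_infection_times(n, rate_fn, steps=24, rng=None):
    """Draw n infection times in [0, 1) (fractions of a day).  The day is
    split into `steps` intervals, each weighted by `rate_fn` evaluated at
    its midpoint (a stand-in for the expected force of infection over the
    interval), and times are placed uniformly within the chosen interval."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.linspace(0.0, 1.0, steps + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    weights = np.asarray([rate_fn(t) for t in mids], dtype=float)
    probs = weights / weights.sum()
    idx = rng.choice(steps, size=n, p=probs)          # pick an interval
    return edges[idx] + rng.uniform(0.0, 1.0 / steps, size=n)

# Example: force of infection growing exponentially at rate r over the day,
# so sampled times concentrate later in the day than the uniform assumption
r = 0.5
times = sample_infection_times(10_000, lambda t: np.exp(r * t),
                               rng=np.random.default_rng(2))
```

Under growth, the mean sampled time exceeds 0.5, whereas the uniform assumption would place it at exactly 0.5; this is the correction to household infection times that the suggested approach would supply.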
Still, there is room to improve the efficiency of many aspects of the procedures for each of the algorithms. In particular, no attempt has been made to parallelise any part of the BPA algorithm. This would be relatively straightforward, as most of the calculations are independent of each other, and would provide a large speed-up. For example, the simulations to calculate G and the expectation calculations within each household could be parallelised. In [10], we also used a tree data structure to minimise the number of operations needed to calculate the within-household likelihood. A similar approach could be taken here to minimise the cost of the expectation calculations, as well as casting them as explicit path integrals that can typically be solved more efficiently [28]. Another aspect of our algorithm that can be tuned is the time step used in the numerical integrations: increasing it will result in a faster running time, but a larger error in our final posteriors. While the DA-MCMC algorithm may not be parallelisable, efforts could be made to optimise the move proposal density, {q_1, ..., q_5}, to optimise the number of proposals made in each iteration, or to use proposals informed by the model [16,38]. It would also be instructive to compare our results to those obtained assuming a discrete-time model [39]. The assumption of continuous time is computationally expensive; if discrete-time models can perform inference to the same level of accuracy, they would be preferred.
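As an illustration of why the per-household calculations parallelise so readily, the sketch below farms out a toy per-household integrand with Python's concurrent.futures. The integrand is a hypothetical placeholder (the real kernels involve matrix exponentials over the household state space); a process pool would be preferable for kernels that hold the GIL, but the structure of the map is the same.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def household_expectation(params):
    """Toy stand-in for the per-household expectation calculation.  Each
    call depends only on its own (beta, gamma, size) tuple, which is what
    makes the map embarrassingly parallel."""
    beta, gamma, size = params
    t = np.linspace(0.0, 1.0, 1001)
    y = size * beta * np.exp(-gamma * t)   # illustrative integrand
    # trapezoidal rule written out by hand
    return float(np.sum((y[:-1] + y[1:]) * np.diff(t)) / 2.0)

def all_household_expectations(param_list, workers=4):
    """Map the per-household calculation over all households in parallel;
    Executor.map preserves input order, so results line up with inputs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(household_expectation, param_list))
```

Because each household's result is independent of the others, the parallel map returns exactly what a serial loop would, only faster when the kernels are expensive.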
There are a number of extensions that could be made quite easily to this methodology. An exposed/latent period can be added to the model, but this changes the process somewhat in that households are no longer observed at the time of first infection, but after the first individual becomes infectious (and displays symptoms). Thus we would need to track the distribution of households that are exposed but not yet infectious. One aspect that becomes easier for the BPA method with the addition of an exposed period is that longer chains of household infections become less likely on a given day; if the exposed period is sufficiently long with high enough probability (say, typically greater than 1 day), then we can approximate the distribution of newly exposed households with just a single generation. The efficiency of the DA-MCMC method is likely to scale much worse for an SEIR model, as there will be much more missing information to sample. Another extension would be the incorporation of a realistic distribution of household sizes within the population. For the BPA, the expectation calculations would remain essentially the same, but the proportions of each household size would need to be taken into account in the between-household likelihood calculation, in the simulations and in the convolution procedure. For the DA-MCMC, these proportions would need to be accounted for in the augmented likelihood and acceptance probabilities.
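For the BPA, the household-size extension amounts to mixing the per-size quantities over the size distribution. The sketch below is a minimal illustration with hypothetical size proportions and illustrative per-size contributions μ_n; none of these numbers come from our model.

```python
import numpy as np

# Hypothetical household size distribution (e.g. taken from census data)
sizes = np.array([1, 2, 3, 4, 5])
props = np.array([0.30, 0.34, 0.16, 0.14, 0.06])   # proportions, sum to 1

# Illustrative expected contribution mu_n of a size-n household to the
# between-household force of infection (placeholder values)
mu_by_size = np.array([1.0, 1.9, 2.7, 3.4, 4.0])

# Mixture over sizes: expected contribution of a household drawn
# uniformly at random from the size distribution
expected_contribution = float(props @ mu_by_size)

# If new infections land on households in proportion to size (contacts
# scaling with occupancy), the relevant mixture is size-biased instead
size_biased = props * sizes / (props @ sizes)
expected_contribution_sb = float(size_biased @ mu_by_size)
```

Whichever weighting is appropriate for the model, the same proportions would then enter the between-household likelihood, the simulations and the convolution procedure described above.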
The biggest weakness of this work is that we assume perfect detection of infectious cases. Especially for diseases such as influenza, there can be a large fraction of asymptomatic cases, and hence partial detection is the best that can be achieved. In previous work we did not make this perfect detection assumption, but instead assumed some probability of detection per case [10]. Many aspects of this work could be extended to incorporate partial detection, but the largest challenge for the BPA is modelling the distribution of currently unobserved households and how they contribute to the overall force of infection. The convolution approach may be appropriate here, especially given how fast it is on modern GPU hardware, but this is a topic for further research. Partial detection also becomes challenging for the DA-MCMC approach, as the missing data can be very high dimensional when sampling from regions of parameter space corresponding to low observation probabilities.