A Systematic Bayesian Integration of Epidemiological and Genetic Data

Genetic sequence data on pathogens have great potential to inform inference of their transmission dynamics, ultimately leading to better disease control. Where genetic change and disease transmission occur on comparable timescales, additional information can be inferred via the joint analysis of such genetic sequence data and epidemiological observations based on clinical symptoms and diagnostic tests. Although recently introduced approaches represent substantial progress, for computational reasons they approximate genuine joint inference of disease dynamics and genetic change in the pathogen population, capturing the joint epidemiological-evolutionary dynamics only partially. Improved methods are needed to fully integrate such genetic data with epidemiological observations, in order to achieve more robust inference of the transmission tree and of other key epidemiological parameters such as latent periods. Here, building on current literature, a novel Bayesian framework is proposed that simultaneously and explicitly infers the transmission tree and the unobserved transmitted pathogen sequences. Our framework facilitates the use of realistic likelihood functions and enables systematic and genuine joint inference of the epidemiological-evolutionary process from partially observed outbreaks. Using simulated data it is shown that this approach can accurately infer the joint epidemiological-evolutionary dynamics, even when pathogen sequences and epidemiological data are incomplete, and when sequences are available for only a fraction of exposures. These results also characterise and quantify the value of incomplete and partial sequence data, which has important implications for sampling design, and demonstrate the ability of the introduced method to identify multiple clusters within an outbreak. The framework is used to analyse an outbreak of foot-and-mouth disease in the UK, enhancing current understanding of its transmission dynamics and evolutionary process.


Supplementary Details of the MCMC Algorithm
The model parameters and the unobserved quantities are updated sequentially (see below). In this section we give supplementary details of the algorithm not described in the main text.

Sampling of E j
In Part I of the description of the algorithm in the main text, the exposure time E_j is proposed as a random draw E_j ∼ U(t_l, t_u).
If j is a primary infection we have

t_l = 0,    (2)

and

t_u = min{t^s_j, I_j},  if j has an observed sequence sample (at t^s_j) and j ∈ χ_I,
      t^s_j,            if j has an observed sequence sample and j ∉ χ_I,
      I_j,              if j has no observed sequence sample and j ∈ χ_I,
      t_max,            if j has no observed sequence sample and j ∉ χ_I.    (3)
In the case of ψ_j ∈ χ_I, we have

t_l = I_{ψ_j},    (4)

and

t_u = min{t^s_j, I_j, R_{ψ_j}},  if j has an observed sequence sample (at t^s_j) and j ∈ χ_I,
      min{t^s_j, R_{ψ_j}},       if j has an observed sequence sample and j ∉ χ_I,
      min{I_j, R_{ψ_j}},         if j has no observed sequence sample and j ∈ χ_I,
      R_{ψ_j},                   if j has no observed sequence sample and j ∉ χ_I.    (5)
When ψ_j ∉ χ_R, the upper bound in Equation (5) reduces to that in Equation (3). In Part II of the description of the algorithm in the main text, E_j is proposed in the same manner, with ψ_j now being replaced by the proposed source ψ′_j. The reader is reminded that the sampling of E_j is only one part of the joint sampling procedures described in the main text.
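As a concrete illustration, the bounds in Equations (2)−(5) and the uniform draw can be sketched as follows. This is a minimal Python sketch (the actual implementation is in C++); all function and argument names are illustrative, and omitting R_source corresponds to the case ψ_j ∉ χ_R.

```python
import random

def propose_exposure_time(is_primary, has_sample, t_s, in_chi_I,
                          I_j, I_source, R_source, t_max, rng=None):
    """Draw a candidate exposure time E_j ~ U(t_l, t_u).

    t_l is 0 for a primary infection and the source's onset time I_source
    otherwise (Equations 2 and 4); t_u is the tightest of the applicable
    upper bounds (Equations 3 and 5).
    """
    rng = rng or random
    t_l = 0.0 if is_primary else I_source
    bounds = []
    if has_sample:
        bounds.append(t_s)       # exposure must precede the sampling time t^s_j
    if in_chi_I:
        bounds.append(I_j)       # exposure must precede onset of infectiousness
    if not is_primary and R_source is not None:
        bounds.append(R_source)  # the source must not yet be removed
    t_u = min(bounds) if bounds else t_max
    return rng.uniform(t_l, t_u)
```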

Sampling of I j
To incorporate some uncertainty in the onset of infectiousness, I_j is assumed known only within a range. Let t_o denote the actual time of symptom onset. Then we assume that I_j is known only within the range t_o ± D. For the simulation studies, we take t_o to be the true I_j and set D = 0.6. Taking account of the additional constraint that E_j < I_j < R_j, we propose I_j uniformly between t_a = max{E_j, t_o − D} and t_b = min{R_j, t_o + D}, where R_j is replaced by t_max if j ∉ χ_R.
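The update above can be sketched as follows. Note that the explicit equation for (t_a, t_b) is elided in the text, so the bounds below are our reading of the stated constraints; names are illustrative.

```python
import random

def propose_onset_time(E_j, R_j, t_o, D=0.6, t_max=100.0,
                       in_chi_R=True, rng=None):
    """Propose I_j ~ U(t_a, t_b) subject to E_j < I_j < R_j and |I_j - t_o| <= D.

    t_a = max(E_j, t_o - D) and t_b = min(R_j, t_o + D); R_j is replaced
    by t_max when j is not in chi_R (i.e., not yet removed).
    """
    rng = rng or random
    upper = R_j if in_chi_R else t_max
    t_a = max(E_j, t_o - D)
    t_b = min(upper, t_o + D)
    return rng.uniform(t_a, t_b)
```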

Sampling of G 1,j
In addition to the joint sampling of E_j and G_{1,j}, separate updating of G_{1,j} is necessary to explore thoroughly the domain of G. We implement a simple updating algorithm for proposing G′_{1,j}: for an individual j ∈ χ_E, each nucleotide base G^i_{1,j} is sampled uniformly from the set ω_N = {A, C, G, T}.

Sampling of θ
Each parameter in θ = (α, β, a, b, γ, η, κ, p, µ_1, µ_2) is updated sequentially with a standard random-walk Metropolis-Hastings algorithm. For example, a new parameter value α′ is proposed from a normal distribution centered on the current value, α′ ∼ N(α, ρ²), where ρ controls the step-size of the random walk.
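A single random-walk update of this kind can be sketched as below; `log_post` stands for the (user-supplied) log-posterior of the parameter with all other quantities held fixed, and the step is a generic Metropolis-Hastings move, not the authors' C++ code.

```python
import math
import random

def rw_metropolis_step(theta, log_post, rho=0.1, rng=None):
    """One random-walk Metropolis-Hastings update of a single parameter.

    A candidate theta' ~ N(theta, rho^2) is drawn; because the normal
    proposal is symmetric, the acceptance probability reduces to
    min(1, posterior ratio), evaluated here on the log scale.
    """
    rng = rng or random
    cand = rng.gauss(theta, rho)
    log_alpha = log_post(cand) - log_post(theta)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return cand   # accept the candidate
    return theta      # reject: the chain stays at the current value
```

With a log-concave target this mixes well for moderate ρ; in practice ρ is tuned so that roughly a quarter to a half of proposals are accepted.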

Sampling of Cryptic Exposures
Denote by ω_C the set of exposed individuals who do not have an observed sample and are not in χ_I. We refer to j ∈ ω_C as a cryptic exposure. We incorporate j in our framework by imputing the sequence transmitted to j. Allowing cryptic exposures requires a 'swap' of individuals between the sets ω_C and χ_S, and a transmitted sequence needs to be imputed when an individual moves from χ_S to ω_C. After the individual to be swapped has been proposed, the sequence is imputed in a similar manner to Part II of the algorithm described in the main text. The acceptance probability is similar to that in Part II of the algorithm, with an additional term that accounts for the 'swapping' probability [1].

Initialisation of the Transmission Graph ψ
When only a subset of individuals j ∈ χ E have an observed sequence sample, the choice of the starting value of ψ becomes important for the rate of convergence of the Markov chain. In this case, we sample the starting value ψ 0 from the marginal posterior distribution of ψ, P e (ψ), obtained from only fitting the epidemic model to the epidemic data using standard data-augmentation methods. Effectively, we set g(·) = h(·) = 1 in the likelihood function (main text) and do not attempt to impute the unobserved sequences.

Sampling and Initialisation of G M
The master sequence G_M determines the source sequence for a particular cluster and the choice of its initial value in the MCMC algorithm is very important. Specifically, we choose the first observed sample in the population as the initial value. We implement a simple updating algorithm similar to the updating of G_{1,j}: each nucleotide base G^i_M is proposed uniformly from the set ω_N \ {G^i_M}.
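The two base-level proposals (for G_{1,j} and for G_M) differ only in whether the current base is excluded. A minimal sketch, with illustrative names:

```python
import random

# Nucleotide alphabet omega_N used by the sampler
OMEGA_N = ("A", "C", "G", "T")

def propose_base_G1(rng=None):
    """G_{1,j} update: a base drawn uniformly from omega_N
    (the proposal may coincide with the current base)."""
    rng = rng or random
    return rng.choice(OMEGA_N)

def propose_base_GM(current, rng=None):
    """G_M update: a base drawn uniformly from omega_N excluding
    the current base, as described above."""
    rng = rng or random
    return rng.choice([b for b in OMEGA_N if b != current])
```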

Initial Values of Other Parameters
In the application to FMDV, we set α = 0.0002, β = 3.0, a = 2.0, b = 2.0, µ = 8.0, κ = 0.1, µ_1 = 0.0001 and µ_2 = 0.00005 as the initial values. In the simulation studies, each parameter in θ is initialised to one half of its true value. An initial value of I_j, j ∈ χ_I, is drawn uniformly at random within the range t_o ± D. The initial χ_E is set to χ_I (i.e., no cryptic exposures). Let ω_ψ = {j ∈ χ_I | I_j ≤ t_u} be the set of potential sources for a particular individual i ∈ χ_E other than the index case. The source of i, ψ_i, is then chosen uniformly from ω_ψ. Note that in the case of fitting the full model, a candidate drawn from P_e(ψ) is set to be ψ_i if it is also in ω_ψ (see also main text). After initialising the transmission network and the times of events (i.e., E_j and I_j), the transmitted sequences are initialised sequentially in the order of E_j according to the evolutionary model specified by Equation (2) in the main text.

Computing Time and Other Benchmarks
The MCMC algorithm was coded in the C++ language (executed on a system with an Intel i7-2600, 3.40 GHz CPU). To provide a benchmark, we report the computing time and some key features of the Markov chain from the simulated single-cluster example where full genome sequencing and full sampling of exposures were considered (i.e., population size N = 150, sequence length n = 8000 and sampling proportion = 100%).
Convergence and mixing of the chain were assessed by visual inspection of trace plots. MCMC output is a chain of autocorrelated samples, and a common measure of mixing and of the size of an equivalent independent sample is the so-called effective sample size (Eff), which aims to "un-couple" the effect of autocorrelation (it is often advised not to stop the MCMC while Eff < 100). The effective sample size of a parameter θ is commonly defined as Eff(θ) = S / (1 + 2 Σ_k ρ_k(θ)), where S is the number of posterior samples and ρ_k(θ) is the autocorrelation at lag k (the sum is usually truncated at the first lag k for which ρ_k(θ) < 0.05). The effective sample sizes for individual model parameters were computed using a package [2] available in the statistical software R. We obtained a converged and well-mixed chain with reasonable effective sample sizes (from 400,000 iterations after a burn-in of 50,000). The computing time was 63,803 seconds (17.7 hours), which is considered practical and efficient [3,4]. The effective sample sizes were Eff(θ) = (286, 912, 998, 5380, 1950, 7416, 9945, 30133), with elements corresponding to the parameters θ = (β, a, b, γ, η, κ, µ_1, µ_2). We also note that the computing time is greatly reduced (to 2.3 hours) in the case of partial genome sequencing with n = 1000; this has practical implications, since the estimates of most epidemiological parameters obtained using partial genome sequencing show no material difference from those using full genome sequencing (S10 Fig.−S11 Fig.). Finally, our code may be parallelised fairly easily for a potentially significant reduction in run time on increasingly common multi-core machines; for example, mutations at different nucleotide sites are assumed to be independent of each other, which can be exploited for parallelisation in a straightforward manner using platforms such as MPI or OpenMP.
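The truncated-autocorrelation definition of the effective sample size can be computed directly; the sketch below is a plain-Python version of that formula (in practice one would use the R package cited above).

```python
def effective_sample_size(samples, cutoff=0.05):
    """Eff = S / (1 + 2 * sum_k rho_k), truncating the sum at the
    first lag k for which the empirical autocorrelation rho_k < cutoff."""
    S = len(samples)
    mean = sum(samples) / S
    var = sum((x - mean) ** 2 for x in samples) / S
    if var == 0.0:
        return float(S)  # a constant chain carries no autocorrelation signal
    acf_sum = 0.0
    for k in range(1, S):
        rho_k = sum((samples[i] - mean) * (samples[i + k] - mean)
                    for i in range(S - k)) / (S * var)
        if rho_k < cutoff:
            break
        acf_sum += rho_k
    return S / (1 + 2 * acf_sum)
```

For near-independent draws the result is close to S, while a strongly trending chain yields a much smaller value.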

Acceptance Probabilities
The acceptance probability of a proposed parameter value θ′_i, given current value θ_i, is

min{1, [P(y|θ′_i) P(θ′_i) q(θ_i|θ′_i)] / [P(y|θ_i) P(θ_i) q(θ′_i|θ_i)]},

where P(y|·) denotes the likelihood, P(θ_i) is the prior distribution of θ_i and q(θ′_i|θ_i) is the proposal distribution of θ′_i given the current value θ_i. The probability of accepting a proposal for a component of the augmented data z_i is of a similar form.
In most cases q is a symmetric proposal distribution, and hence the proposal ratio q(θ_i|θ′_i)/q(θ′_i|θ_i) reduces to 1, simplifying the computation. When the proposal density is less straightforward, however, the proposal ratio must be treated explicitly. We describe in detail the computation of the proposal ratio for the joint sampling of E_j and G_{1,j} described in Part I of the algorithm in the main text. As an illustration, we consider only the case where G_p and G_f are both defined. We have to compute the forward proposal probability (i.e., the denominator) and the backward proposal probability (i.e., the numerator). As ψ_j is unchanged, the domains of E_j and E′_j are identical, so the uniform proposal densities for the exposure time cancel in the ratio. The proposal probability of the sequence involves m_f, the number of nucleotides of G_f which match the nucleotide in the corresponding position of G_{1,j}. The quantity q_2(G_{1,j}|E_j, E′_j, G′) is computed similarly by considering the reverse move; in particular, we must re-define G_p and G_f, as the direction of change of time is reversed. The proposal ratio for the joint sampling procedure in Part II of the algorithm in the main text can be computed in a similar fashion. In this case we must take account of the difference between the domains of E_j and E′_j in the respective proposal distributions, and of the ratio of the probabilities of proposing ψ_j and ψ′_j, as described in Part II of the algorithm in the main text.

Non-informative uniform priors with "unrealistically" wide intervals are specified for all model parameters. For example, the prior for the mutation rates is U(0, 0.1), the mean latent period has prior U(0, 50) and the secondary transmission rate β has prior U(0, 50).
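The general acceptance probability with an explicit proposal ratio can be written compactly on the log scale, as in the sketch below (a generic Metropolis-Hastings expression; argument names are illustrative).

```python
import math

def mh_accept_prob(loglik_cur, loglik_prop, logprior_cur, logprior_prop,
                   logq_fwd, logq_bwd):
    """Metropolis-Hastings acceptance probability with an explicit
    proposal ratio:

        alpha = min(1, [L' P' q(cur|prop)] / [L P q(prop|cur)]),

    where logq_fwd = log q(prop|cur) and logq_bwd = log q(cur|prop).
    Working on the log scale avoids under/overflow.
    """
    log_alpha = (loglik_prop + logprior_prop + logq_bwd) \
        - (loglik_cur + logprior_cur + logq_fwd)
    return math.exp(min(0.0, log_alpha))
```

With a symmetric proposal (logq_fwd = logq_bwd) the q-terms cancel, recovering the simplified ratio discussed above.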

Contribution of Genetic Data to Model Assessment
It has been shown that effective model assessment of a general spatio-temporal model may be achieved by proposing suitably designed non-centred parameterization schemes and imputing the corresponding residuals, whose sampling distributions are known, in such a manner that posterior distributions are sensitive to mis-specifications of particular components of the model [6]. Here we investigate how the genetic data may help in assessing, in particular, the goodness-of-fit of a specified spatial kernel by utilizing the so-called infection-link residual (ILR).
The set of ILRs, hereinafter denoted as r = {r_1, r_2, · · · , r_{n_e}} where n_e is the total number of exposures, uniquely determines the respective infection link (i.e., source of infection) for every exposure. The distribution of each r_i can be shown to be U(0, 1) under the construction scheme of [6] and the model assumption given by Equation (1) in the main text, and is independent a priori of the form of the spatial kernel. Posterior samples of r, hereinafter denoted as r̂, can be easily imputed in standard data-augmentation algorithms such as Markov chain Monte Carlo (MCMC) by inverting the construction procedure of the ILR and imputing the infection links. On applying a classical test to r̂ for its compliance with U(0, 1), a posterior distribution of p-values is generated, from which the evidence against the model assumption can be discerned. Specifically, we measure the evidence against the model by π(P(r̂) < 0.05 | y), the proportion of the posterior p-values which are less than 0.05. The Anderson-Darling hypothesis test [5] is adopted (for details see [6]). We consider the six-cluster epidemic data mentioned in the main text.
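The Anderson-Darling statistic for a fully specified U(0, 1) null can be computed directly from the order statistics, as sketched below. The 5% asymptotic critical value of roughly 2.49 (for a fully specified null) is a standard tabulated value, not taken from this paper; comparing the statistic against it stands in for the p-value computation.

```python
import math

def anderson_darling_uniform(u):
    """Anderson-Darling statistic A^2 of a sample against U(0, 1):

        A^2 = -n - (1/n) * sum_i (2i - 1) [ln u_(i) + ln(1 - u_(n+1-i))],

    where u_(i) are the order statistics. Large values indicate
    departure from uniformity (5% critical value is roughly 2.49
    when the null distribution is fully specified).
    """
    u = sorted(u)
    n = len(u)
    s = sum((2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
            for i in range(n))
    return -n - s / n
```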
We consider fitting three forms of spatial kernel:
• An exponentially-bounded kernel (Kernel A): K(d_ij, κ_1) = exp(−κ_1 d_ij);
• A power-law kernel (Kernel B): K(d_ij, κ_2) = d_ij^{−κ_2};
• A Cauchy-type kernel (Kernel C).
It is noted that Kernel A is the actual spatial kernel used in the simulations.
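Kernels A and B can be sketched directly from their formulas (Kernel C's formula is elided in the text, so it is not reproduced here):

```python
import math

def kernel_A(d, kappa1):
    """Exponentially-bounded kernel: K(d) = exp(-kappa1 * d)."""
    return math.exp(-kappa1 * d)

def kernel_B(d, kappa2):
    """Power-law kernel: K(d) = d ** (-kappa2), defined for d > 0."""
    return d ** (-kappa2)
```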
In the main text we have shown that increased availability of genetic data improves the estimation of the transmission graph. Given that the imputations of ILR rely on the imputed infection links (equivalently the transmission graph), increased availability of genetic data may potentially increase the sensitivity of the test based on imputed ILR over the mis-specification of the model. Table S7 shows that this improvement of sensitivity is indeed achieved.

Further Simulated Epidemics
In this section we consider 15 independent replicate epidemics. Specifically, we simulate 5 epidemics from each of the 3 sets of model parameters under which multiple-cluster scenarios were investigated. All the epidemics considered here comprise more than one cluster. To recap, compared to the first set of model parameters, the second set is characterized by a higher background transmission rate and hence is expected to give rise to epidemics exhibiting higher numbers of clusters than those generated from the first and third sets. The third set is characterized by lower mutation rates, matching the foot-and-mouth disease scenario.
For epidemics simulated from the first and second sets of model parameters, we compare the estimation performance at sampling levels 100%, 50% and 0% of exposures. For epidemics simulated from the third set of model parameters, we compare the estimation performance at sampling levels 100%, 10% and 0% of exposures.
In the main text it is observed that posterior uncertainty in the model parameter estimates tends to increase as the sampling percentage drops, with this effect being most pronounced for the secondary transmission rate β and the spatial kernel parameter κ. Tables S1−S3, which show the sample means and standard deviations of the posterior samples of β and κ, suggest similar findings. Note that for the third set (characterized by significantly lower mutation rates and showing a higher tolerance to sub-sampling) we considered a sampling level of 10% (instead of 50%), obtaining results similar to the 0% setting.
Tables S4−S6 show the absolute difference between the number of clusters obtained from the posterior samples and the actual number of clusters, denoted as ∆N c . They also show the number of different bases (out of 1,000) between the imputed master sequence G M and the actual ones, denoted as ∆M , and the overall coverage rate obtained from the posterior samples. Similar to the findings shown in the main text, it is observed that ∆N c and ∆M in general increase when the sampling percentage reduces. In the case where no genetic data are available the mean of ∆N c , and its degree of variation, are quite considerable. Also, comparison of the values of ∆M from Table S4 and Table S5 reveals that the estimation of G M may become more reliable as the number of clusters increases. It is observed that when there are fewer than 3 clusters in the actual epidemic (e.g., Replicate 1, Replicate 3 and Replicate 5 in Table S4), ∆M becomes considerably larger. The overall coverage rate increases with the sampling percentage and becomes less dispersed.

A Markov Process to Model the Evolutionary Process
A continuous-time Markov process with states (i.e., nucleotide bases) taking values in the set ω_N = {A, C, G, U} can be defined to model the nucleotide substitution process. Let µ_xy be the transition rate between states x ∈ ω_N and y ∈ ω_N. Moreover, let P(∆t) be the transition probability matrix whose entry p_xy(∆t) is the probability of a transition to state y after an arbitrary evolutionary time ∆t, given the initial state x.
The particular form of P(∆t) depends on the model assumptions; for details of the derivation of P(∆t) see [7]. For example, the simplest nucleotide substitution model (the Jukes-Cantor (JC) model) assumes that the transition rates between any pair of nucleotide bases are the same (i.e., µ_xy = µ for x ≠ y). Under the JC model,

p_xy(∆t) = 0.25 − 0.25 e^{−4µ∆t}, for x ≠ y.

The Kimura model we adopt [7] allows different rates for transitions and transversions. Let µ_1 and µ_2 be the rates of transition and transversion respectively. Then, under the Kimura model,

p_xy(∆t) = 0.25 + 0.25 e^{−4µ_2∆t} − 0.5 e^{−2(µ_1+µ_2)∆t}, for x ≠ y where the change is a transition,    (15a)
p_xy(∆t) = 0.25 − 0.25 e^{−4µ_2∆t}, for x ≠ y where the change is a transversion.    (15b)

Note that the notation used here to denote transition probabilities differs from that used in Equation (2) in the main text; the notation p_xy(·) used here is more intuitive in the context of a matrix.
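The Kimura transition probabilities can be sketched as a small function; the diagonal entry follows from the rows of P(∆t) summing to 1. The sketch uses the DNA alphabet {A, C, G, T}, which works identically with U in place of T.

```python
import math

def k80_transition_prob(x, y, mu1, mu2, dt):
    """Kimura (K80) transition probability p_xy(dt).

    mu1: rate of transitions (A<->G, C<->T); mu2: rate of transversions.
    The x == y entry is the complement of the off-diagonal entries, so
    each row of the matrix sums to 1.
    """
    e4 = math.exp(-4.0 * mu2 * dt)
    e2 = math.exp(-2.0 * (mu1 + mu2) * dt)
    if x == y:
        return 0.25 + 0.25 * e4 + 0.5 * e2
    purines, pyrimidines = {"A", "G"}, {"C", "T"}
    is_transition = ({x, y} <= purines) or ({x, y} <= pyrimidines)
    if is_transition:
        return 0.25 + 0.25 * e4 - 0.5 * e2   # Equation (15a)
    return 0.25 - 0.25 * e4                  # Equation (15b)
```

At ∆t = 0 the matrix is the identity, and as ∆t → ∞ every entry tends to 0.25, as expected for a stationary uniform base distribution.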

Signum Function
The signum function of a real number t is defined as

sgn(t) = 1, if t > 0;  0, if t = 0;  −1, if t < 0.    (16)

Table S1: Independent replicates of multiple-cluster epidemics simulated from the 1st set of model parameters with α = 0.0004, β = 8, κ = 0.02, a = 10, b = 0.5, γ = 2.0, η = 2.0, p = 0.01, µ_1 = 0.002 and µ_2 = 0.0005. Sample means and standard deviations of the posterior samples of β and κ obtained from fitting the model to 5 independent replicates of multiple-cluster epidemics simulated from the first set of model parameters.

Table S2: Independent replicates of multiple-cluster epidemics simulated from the 2nd set of model parameters, which is characterized by a higher background transmission rate and higher mutation rates compared to the first set of model parameters.