Model diagnostics and refinement for phylodynamic models

Phylodynamic modelling, which studies the joint dynamics of epidemiological and evolutionary processes, has made significant progress in recent years due to increasingly available genomic data and advances in statistical modelling. These advances have greatly improved our understanding of transmission dynamics of many important pathogens. Nevertheless, there remains a lack of effective, targetted diagnostic tools for systematically detecting model mis-specification. Development of such tools is essential for model criticism, refinement, and calibration. The idea of utilising latent residuals for model assessment has already been exploited in general spatio-temporal epidemiological settings. Specifically, by proposing appropriately designed non-centered, re-parameterizations of a given epidemiological process, one can construct latent residuals with known sampling distributions which can be used to quantify evidence of model mis-specification. In this paper, we extend this idea to formulate a novel model-diagnostic framework for phylodynamic models. Using simulated examples, we show that our framework may effectively detect a particular form of mis-specification in a phylodynamic model, particularly in the event of superspreading. We also exemplify our approach by applying the framework to a dataset describing a local foot-and-mouth (FMD) outbreak in the UK, eliciting strong evidence against the assumption of no within-host-diversity in the outbreak. We further demonstrate that our framework can facilitate model calibration in real-life scenarios, by proposing a within-host-diversity model which appears to offer a better fit to data than one that assumes no within-host-diversity of FMD virus.

The source of infection is also updated in a similar fashion [3].
Other model parameters such as infection rates α and β are inferred using standard MCMC algorithm. Given inferred model quantities (including m the number of observed mutation), latent residual sample r can be obtained by reversing the procedure described in Equation 5 in the main text. Non-informative uniform priors with wide intervals are used for all model parameters. S1 Table  lists the prior distributions used for fitting the model to the FMD data.

Refining S-D-S Model M 0
Pseudo-likelihood Function We propose a likelihood function for the within-host-diversity 'emulator' M p for calibrating the single-dominant-strain model M 0 . The likelihood function accounts for the complete genetic data by augmenting the genetic data with the unknown T * (G A , G B ) (see Equation 9 in main text) for each successive pair of transmitted strains. We describe briefly how this likelihood is constructed.
• Consider the a chain of infection events − individual k is infected at t k , and it subsequently infects individual i at t i and then j at t j (t j > t i > t k ).
• Let the dominant sequence at t k on k be G k 0 , the sequence at t i on i be G i 0 and the sequence at t j on j be G j 0 . Noted that as we assume that the dominant strain on an infector will also be transmitted to a new infected individual, we have G k 1 = G i 0 and G k 2 = G j 0 − accordingly the sequence data on k may be denoted as • Each infection (i.e., k, i and j) is associated with a T * l ∼ Beta(γ, η), where l = k, i, or j. T * l will be used to determine the effective genetic time between the latest sequence on the infector (by the time of the infection) and the sequence on the infected individual (at the time of the infection). For example, for the infected individual j, the corresponding effective genetic time between G j 0 and G i 0 Then, for individual k, we can construct a pseudo-likelihood function as follows.
where P B is the density of Beta distribution and λ is the mutation rate and Noted that Equation 3 only described a likelihood function for the evolutionary process − a complete likelihood function requires components that account for the epidemiological process which can be found in [3]. Equation 3 is in effect a pseudo-likelihood as it assumes that the relationship of G k m+1 to G k m is independent of the latter's relationship to any previous strains. Compared to M 0 , this approach allows the effective time between strains to be an additional imputed quantity by inferring γ and η, and therefore allows departure from the more restricted s-d-s assumption made in M 0 . propose a new and statistically consistent sequence somewhere between two 'known' sequences (namely the nearest past sequence and the nearest future sequence) at either side of a newly proposed infection time, where a 'known' sequence can either be an observed or imputed sequence. Noted that effective genetic time, associated with the proposed G i 0 and the latest sequence on the proposed infector by the proposed time t i , is also simultaneously updated (See Equation 2). The effective times between the (to-be-proposed) G i 0 and the two 'known' sequences at either side G i 0 are to used to determine the probability that a nucleotide base on G i 0 takes the value on the nearest past/future sequence at the same position. Acceptance probability of the proposed v and the associated effective genetic time is determined similarly following [3].

Model Inference Model inference is performed by adapting the MCMC algorithm used to fit the s-d-s model in [3] (see an overview above in Bayesian Data-augmentation and Model Inference of S-D-S Model
The algorithm is then applied sequentially to all infected hosts in each MCMC iteration, so that the whole set of effective genetic times, sources of infection, infection times and transmitted sequences may be updated. Furthermore, at each iteration, the collection of T * for each infection is also updated using a Metropolis-Hastings sampler where new T * is proposed using the Beta distribution parameterized by the current values of γ and η. The parameters γ and η are proposed using random-walk centered at current values. We use the same set of prior distributions for the shared parameters found in the s-d-s model (S1 Table), and additionally, we have non-informative prior Exponential(rate = 0.01) for parameters γ and η.