Fig 1.
Schematic illustration of the rationale for the marked latent residuals.
Assume that a single pathogen strain enters a host and begins to evolve at time t0. A within-host-diversity model, crudely illustrated by (a), allows new strains (i.e. new branches) to become established through mutations occurring at internal nodes. The s-d-s model M0, illustrated by (b), allows mutations only along a single linear branch and therefore assumes only one (dominant) strain at any particular time point. Assume that two sequence samples GA and GB (superscripts are dropped where unambiguous) are randomly sampled from the pathogen population at tA and tB respectively, where tA − tB ≈ 0 (i.e. ζ(k) ≈ 1). Model (a) may predict distinct GA and GB, whereas (b) would predict minimal difference between them because it implies a linear relationship between mutations and time. Residuals associated with high ζ(k) may therefore be expected to be large if a model takes insufficient account of within-host diversity.
Fig 2.
Empirical cumulative distribution functions of the p-values obtained by applying the Anderson-Darling test to the subset of residuals with non-zero marks and to the full set of residuals, for simulation sets 1 to 5 in Scenario II, where the s-d-s model M0 is (inappropriately) fitted to data generated from a within-host-diversity model M1; see also Table 1.
The proportions of p-values less than 0.05 (indicated by colored text) are in this case consistently higher for the subset of residuals corresponding to non-zero marks (red text).
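The quantities summarised in this caption can be illustrated with a minimal sketch, assuming the Anderson-Darling p-values (one per posterior sample) are already available as plain lists. The function names and the example numbers below are hypothetical and are not taken from the study.

```python
def ecdf(values):
    """Return the sorted values and their empirical cumulative probabilities."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def proportion_below(values, threshold=0.05):
    """Fraction of p-values strictly below the threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical p-values, for illustration only (not results from the paper).
p_full = [0.40, 0.03, 0.21, 0.65, 0.08]    # full set of residuals
p_marked = [0.04, 0.01, 0.12, 0.30, 0.02]  # residuals with non-zero marks

xs_marked, probs_marked = ecdf(p_marked)
prop_full = proportion_below(p_full)
prop_marked = proportion_below(p_marked)
```

Plotting `probs_marked` against `xs_marked` as a step function gives one of the curves, and `prop_marked` corresponds to the proportion annotated on it.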
Table 1.
Proportions of p-values less than 0.05, which indicate overall evidence against the null model, in two scenarios: (1) fitting the correct model structures and (2) fitting the s-d-s model M0 to data generated from a within-host-diversity model M1.
For scenario (2), five datasets are generated independently from M1 with the same set of parameter values (Models); these are used to reveal any consistent difference in the evidence of model mis-specification relative to scenario (1). Note that in both scenarios the same (correct) epidemic process component is fitted. The marked residuals form the subset of the full set of residuals associated with non-zero marks ζ(k).
Fig 3.
Systematic deviation revealed by the marked latent residuals.
Panels (a)-(e) correspond to simulation sets 1 to 5 (from Scenario II, where the s-d-s model M0 is (inappropriately) fitted to data generated from a within-host-diversity model M1). The histograms in the first row are formed by aggregating residuals whose associated marks ζ(k) lie in the top tercile of marks, over all posterior samples for which the p-value of the Anderson-Darling test is less than 0.05. The histograms of ζ(k) are shown in the second row. Residuals associated with smaller ζ(k) may exhibit a multiplicity of patterns (see SI: S1 Fig).
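The aggregation described in this caption can be sketched as follows. This is an illustration only: it assumes the tercile boundary is computed separately within each posterior sample (the caption does not say whether it is per sample or pooled), and the data layout, function names, and numbers are hypothetical.

```python
import statistics


def top_tercile_residuals(residuals, marks):
    """Keep the residuals whose marks exceed the upper tercile boundary."""
    # statistics.quantiles(..., n=3) returns the two tercile cut points.
    cutoff = statistics.quantiles(marks, n=3)[1]
    return [r for r, m in zip(residuals, marks) if m > cutoff]


def aggregate_flagged(samples, alpha=0.05):
    """Pool top-tercile residuals over posterior samples with p-value < alpha."""
    pooled = []
    for s in samples:
        if s["p_value"] < alpha:
            pooled.extend(top_tercile_residuals(s["residuals"], s["marks"]))
    return pooled


# Hypothetical posterior samples, for illustration only.
samples = [
    {"p_value": 0.01,
     "residuals": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
     "marks": [1, 2, 3, 4, 5, 6, 7, 8, 9]},
    {"p_value": 0.50,  # not flagged by the test, so it is skipped
     "residuals": [0.5, 0.5, 0.5],
     "marks": [1, 2, 3]},
]
pooled = aggregate_flagged(samples)
```

A histogram of `pooled` would then play the role of one first-row panel.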
Fig 4.
Residuals associated with the top tercile of marks ζ(k), in applying our diagnostic framework to a foot-and-mouth dataset.
Fig 5.
Logarithm of the ratio between inferred effective genetic time teff = T(GA, GB) and the ‘absolute’ genetic time tabs = tB − tA used in the s-d-s model.
Consider an individual who becomes infected at time t0 = 0 and causes two infections at times tA and tB, where tA < tB are generated as the first two event times of a Poisson process. Then teff/tabs = 1 + 2 × T*(GA, GB) × Z (see also Eq 10), where Z = u/(1 − u) with u ∼ Unif(0, 1). For each simulated Z, we draw a corresponding T*(GA, GB) from its distribution (see Eq 10), with γ and η set to their respective posterior means.
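The simulation recipe in this caption can be sketched as below. The distribution of T*(GA, GB) is given by Eq 10 and is not reproduced here, so the sketch takes it as a user-supplied function `draw_t_star`; that name, and the constant value used in the example call, are hypothetical placeholders rather than the fitted model.

```python
import math
import random


def simulate_log_ratios(draw_t_star, n, seed=0):
    """Simulate log(teff/tabs) with teff/tabs = 1 + 2 * T* * Z, Z = u/(1-u)."""
    rng = random.Random(seed)
    log_ratios = []
    for _ in range(n):
        u = rng.random()          # u ~ Unif(0, 1); random() never returns 1.0
        z = u / (1.0 - u)
        t_star = draw_t_star(rng)  # placeholder for a draw based on Eq 10
        log_ratios.append(math.log(1.0 + 2.0 * t_star * z))
    return log_ratios


# Example call with a constant stand-in for T*, purely for illustration.
log_ratios = simulate_log_ratios(lambda rng: 0.5, n=1000, seed=1)
```

Because T* and Z are non-negative, every simulated log ratio is non-negative, i.e. teff is never smaller than tabs under this construction.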