Model diagnostics and refinement for phylodynamic models

doi:10.1371/journal.pcbi.1006955

Fig 1.

Schematic illustration of the rationale of the marked latent residuals.

Assume that a single pathogen strain enters a host and begins to evolve at time t₀. A within-host-diversity model, crudely illustrated by (a), allows new strains (i.e. new branches) to become established through mutations occurring at internal nodes. The s-d-s model M₀, illustrated by (b), allows mutations only along a single (linear) branch and therefore assumes only one (dominant) strain at any particular time point. Assume that two sequence samples G_A and G_B (superscripts dropped, without ambiguity) are randomly sampled from the pathogen population at t_A and t_B respectively, where t_A − t_B ≈ 0 (i.e. ζ^(k) ≈ 1). Model (a) may predict distinct G_A and G_B, while (b) would predict minimal difference between them because it implies a linear relationship between mutations and time. Therefore, residuals associated with high ζ^(k) may be expected to be large if a model takes insufficient account of within-host diversity.

Fig 2.

Empirical cumulative distribution functions of the p-values obtained by applying the Anderson-Darling test to the subset of residuals with non-zero marks and to the full set of residuals, for simulation sets 1 to 5 in Scenario II, where the s-d-s model M₀ is (inappropriately) fitted to data generated from a within-host-diversity model M₁; see also Table 1.

The proportions of p-values less than 0.05 (indicated by colored text) are consistently higher in this case for the subset of residuals corresponding to non-zero marks (red text).
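The two summaries shown in this figure, an empirical CDF of p-values and the proportion falling below 0.05, can be sketched as follows. This is only an illustration: the function names `ecdf` and `proportion_below` are hypothetical, and the p-values are invented toy numbers, not results from the paper.

```python
def ecdf(values):
    """Return the empirical CDF as a list of (x, F(x)) step points."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def proportion_below(p_values, alpha=0.05):
    """Fraction of p-values strictly below the threshold alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Invented toy p-values (assumption, for illustration only):
# one list for the full set of residuals, one for the non-zero-mark subset.
p_full = [0.40, 0.03, 0.62, 0.20, 0.08]
p_subset = [0.01, 0.04, 0.30, 0.02, 0.06]

print(ecdf(p_subset))
print(proportion_below(p_full), proportion_below(p_subset))
```

A higher proportion for the subset than for the full set is the pattern the caption describes for the mis-specified fits.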

Table 1.

Proportions of p-values less than 0.05, which indicate overall evidence against the null model, in two scenarios: (1) fitting the correct model structures and (2) fitting the s-d-s model M₀ to data generated from a within-host-diversity model M₁.

For scenario (2), five datasets are generated independently from M₁ with the same set of parameter values (Models); these are used to reveal any consistent difference in the evidence of model mis-specification compared with scenario (1). Note that in both scenarios the (same) correct epidemic process component is fitted. The subset of residuals is the part of the full set of residuals associated with non-zero marks ζ^(k).
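Each p-value counted here comes from a goodness-of-fit test on a set of residuals (the Anderson-Darling test, per Fig 2). A minimal sketch of the Anderson-Darling statistic is given below under a strong assumption that this section does not state: that the residuals are Uniform(0, 1) under the null model. The function name and the use of the asymptotic 5% critical value for a fully specified null (about 2.492) in place of an exact p-value are also choices made for this sketch.

```python
import math


def anderson_darling_uniform(residuals, eps=1e-12):
    """Anderson-Darling statistic A^2 against Uniform(0, 1).

    Assumption: the null distribution of the residuals is Uniform(0, 1).
    Values are clipped to (eps, 1 - eps) so the logarithms stay finite.
    """
    u = sorted(min(max(r, eps), 1.0 - eps) for r in residuals)
    n = len(u)
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) [ln u_(i) + ln(1 - u_(n+1-i))]
    s = sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
        for i in range(n)
    )
    return -n - s / n


# Asymptotic 5% critical value for a fully specified null distribution.
CRITICAL_5_PERCENT = 2.492


def rejects_uniformity(residuals):
    """True if A^2 exceeds the 5% critical value."""
    return anderson_darling_uniform(residuals) > CRITICAL_5_PERCENT
```

Applying `rejects_uniformity` once to the full set and once to the non-zero-mark subset for every posterior sample, then averaging the outcomes, would give proportions of the kind tabulated here.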

Fig 3.

Systematic deviation revealed by the marked latent residuals.

(a)-(e) correspond to simulation sets 1 to 5 (from Scenario II, where the s-d-s model M₀ is (inappropriately) fitted to data generated from a within-host-diversity model M₁). The histograms in the first row are formed by aggregating the residuals whose associated marks ζ^(k) lie in the top tercile of marks, over every posterior sample for which the p-value of the Anderson-Darling test is less than 0.05. The histograms of ζ^(k) are shown in the second row. Residuals associated with smaller ζ^(k) may exhibit a multiplicity of patterns (see SI: S1 Fig).
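The aggregation behind the first-row histograms can be sketched as below. The data layout (a dict per posterior sample with keys `p_value`, `marks` and `residuals`) and both function names are hypothetical, and the sketch assumes the tercile cut is computed separately within each posterior sample, which the caption does not specify.

```python
from statistics import quantiles


def top_tercile_residuals(marks, residuals):
    """Keep residuals whose mark lies above the upper tercile cut point."""
    upper_cut = quantiles(marks, n=3)[1]  # second of the two tercile cut points
    return [r for m, r in zip(marks, residuals) if m > upper_cut]


def aggregate_flagged(posterior_samples, alpha=0.05):
    """Pool top-tercile residuals over samples whose test p-value is below alpha."""
    pooled = []
    for sample in posterior_samples:
        if sample["p_value"] < alpha:
            pooled.extend(
                top_tercile_residuals(sample["marks"], sample["residuals"])
            )
    return pooled
```

The pooled list is what would then be passed to a histogram routine.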

Fig 4.

Residuals associated with the top tercile of marks ζ^(k), obtained by applying our diagnostic framework to a foot-and-mouth dataset.

Fig 5.

Logarithm of the ratio between the inferred effective genetic time t_eff = T(G_A, G_B) and the ‘absolute’ genetic time t_abs = t_B − t_A used in the s-d-s model.

Consider an individual who becomes infected at time t₀ = 0 and causes two infections at times t_A and t_B, where t_A < t_B are generated as the first two event times of a Poisson process. Then t_eff/t_abs = 1 + 2 × T*(G_A, G_B) × Z (see also Eq 10), where Z = u/(1 − u) with u ∼ Unif(0, 1). For each simulated Z, we draw a corresponding T*(G_A, G_B) from its distribution (see Eq 10), with γ and η taken to be their respective posterior means.
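The simulation described in this caption can be sketched as follows. The distribution of T*(G_A, G_B) is given by Eq 10, which is not reproduced in this section, so the sketch takes the sampler as an argument; the exponential draw in the usage lines is purely a placeholder assumption and not the paper's distribution. The function name is hypothetical.

```python
import math
import random


def simulate_log_ratio(draw_t_star, n_draws, rng):
    """Simulate log(t_eff / t_abs) = log(1 + 2 * T* * Z) with Z = u / (1 - u).

    draw_t_star: callable taking the random generator and returning one
    non-negative draw of T*(G_A, G_B); its true form (Eq 10) is not known here.
    """
    log_ratios = []
    for _ in range(n_draws):
        u = rng.random()      # u in [0, 1), so 1 - u is strictly positive
        z = u / (1.0 - u)
        t_star = draw_t_star(rng)
        log_ratios.append(math.log(1.0 + 2.0 * t_star * z))
    return log_ratios


# Placeholder sampler for T* (assumption): Exponential with rate 1.
rng = random.Random(1)
log_ratios = simulate_log_ratio(lambda r: r.expovariate(1.0), 1000, rng)
```

Because T* and Z are non-negative, every simulated log ratio is at least zero, which corresponds to t_eff being no smaller than t_abs.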
