
iPAR: A framework for modelling and inferring information about disease spread when the populations at risk are unknown

  • Stephen Catterall,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    stephen@bioss.ac.uk

    Affiliation Biomathematics and Statistics Scotland, Edinburgh, United Kingdom

  • Thibaud Porphyre,

    Roles Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Universite Claude Bernard Lyon 1, CNRS, VetAgro Sup, Marcy l’Étoile, France

  • Glenn Marion

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Biomathematics and Statistics Scotland, Edinburgh, United Kingdom

Abstract

We introduce the inference for populations at risk (iPAR) framework which enables modelling and estimation of spatial disease dynamics in scenarios where the population at risk is unknown or poorly mapped. This framework addresses a gap in spatial infectious disease modelling approaches, with current methods typically requiring data on the spatial distribution of the population at risk. The principles for iPAR are demonstrated in the context of a susceptible-infected disease dynamics model coupled with Bayesian inference implemented via data-augmentation Markov chain Monte Carlo (MCMC). This implementation of iPAR is tested for a range of scenarios using simulated outbreak data. Results indicate that the method can effectively estimate key properties of disease spread from spatio-temporal case reports and make useful predictions of future spread. The method is then applied to a case study exploring the 2014–2019 Estonian outbreak of African Swine Fever (ASF) in wild boar. Estimates of epidemiological parameters reveal evidence for long distance transmission, as well as disease control via reduction of the wild boar population in Estonia.

Author summary

Models for the spatial spread of infectious disease are essential tools to help us understand the way that disease spreads across a landscape during a disease outbreak. In addition to knowledge of disease case reports, most such models also assume knowledge of the spatial distribution of the disease hosts. If the distribution of the disease hosts is unknown then it becomes harder to interpret any apparent patterns of disease spread. Furthermore, most existing modelling techniques can no longer be applied. This paper introduces a novel modelling framework that can be used to analyse outbreak data even when we don’t know the spatial distribution of the disease hosts. To achieve this, we exploit spatial data for the region of the outbreak, for example land use data, that can help to better inform us of the distribution of the population at risk, when combined with available disease case reports. As a case study, we explore the use of the new framework for modelling data from the 2014–2019 outbreak of African Swine Fever in wild boar in Estonia.

Introduction

Outbreaks of infectious disease can be very costly and can have serious impacts on both natural systems and human economic activities [1,2]. For example, African swine fever (ASF), a disease that can infect both domestic pigs and wild boar, has spread since its introduction in Georgia in 2007 through many countries in Europe and Asia, leading to millions of culled pigs and consequent serious impacts on the global pig industry [3]. More broadly, including for diseases of humans and plants, epidemic models can be usefully employed to assess risks arising from both current and future disease outbreaks [4–6]. Specifically, they can be fitted to reported outbreak data to infer important characteristics of disease spread [7]. These characteristics can then be used to inform potential disease control measures [8,9].

The spatial distribution of the population at risk is very important because it strongly influences the contact structure of the population, which in turn is of key importance in the disease transmission process [10]. Further, spatially explicit epidemic models can offer insights and information critical to the control of real-world outbreaks, e.g., identifying high risk areas and control zones [11,12]. Typically, such models assume knowledge of the spatial distribution of the population at risk, e.g., foot and mouth disease (FMD) transmission models [13] and classical swine fever (CSF) transmission models [14]. However, a frequently encountered and challenging scenario is when we do not have complete knowledge of the population at risk of infection [5] and, in particular, we may not know the spatial distribution of this population. This is especially true for wildlife populations, which may not be well mapped, but can also frequently be the case for livestock. Records of livestock holdings may be non-existent (especially for small ‘backyard’ holdings), out of date, incomplete, or not publicly available for privacy reasons (e.g., farm locations in the USA [15]). In such cases there is a need to make use of case-only data. Here, we therefore focus on the scenario where cases are reported with possibly imprecise spatial and temporal coordinates.

A number of techniques have been proposed to model disease transmission when there is incomplete knowledge of the population at risk. One class of approaches involves modelling the disease case generating process, with the susceptible population not explicitly specified. Transmission trees are one example of this approach, e.g., [16]. Another example is the use of techniques based on contact distribution models [17], for example [18,19]. Such approaches benefit from the fact that, typically, more is known about disease cases than is known about general members of the susceptible population. On the other hand, this approach may be problematic in scenarios where the set of disease case reports is incomplete, or where depletion of susceptible individuals is an important factor in determining the progression of the epidemic. There is also a risk of bias in scenarios where individuals are highly mobile, such as susceptible wildlife populations. In such scenarios the location of a case (often where an animal is found dead) may be a poor proxy for the spatial history of the individual [20]. The above referenced approaches do not allow for spatial variation in susceptible population density, with the exception of [19] in which the susceptible population density is both allowed for and known.

A second class of approaches, which has been applied to livestock epidemic data, involves making assumptions/predictions about unknown farm locations to permit spatially explicit transmission modelling. For example, assuming that farms are located uniformly at random across the landscape can sometimes be sufficient to identify optimal control measures [21], though this can only be applied to certain controls (ring culling) and requires extensive outbreak data. However, using land use data and other landscape covariates to predict farm locations can improve the accuracy of livestock epidemic models [15,22]. Such approaches seem useful in scenarios where we know some aspects of the susceptible population, e.g., for livestock the total number of farms in the region of interest, but it is unclear how they might be applied to less well mapped populations such as wildlife, ‘backyard’ holdings or in jurisdictions where data is more limited.

We consider how to model disease spread when the susceptible population density is unknown, but we have spatial covariates that may explain population density and other aspects of the population of relevance to the transmission process. Intuitively, case reports with approximate spatial locations are informative of the distribution of the population at risk, and coupled with observation times also provide insight into rates of spread in space and time. To motivate our proposed approach, consider again how knowledge of the population at risk is used to construct spatially explicit epidemic models, e.g., those used to model the spread of FMD [23]. The rate of transmission between an infectious location and a susceptible location is typically assumed to be an overall rate multiplied by (a) a kernel function depending on the distance between the locations, (b) the infectivity of the infectious location and (c) the susceptibility of the susceptible location. Thus, from a modelling perspective, in the most general setting, epidemiologically relevant knowledge of the population at risk is captured in two ‘surfaces’ covering the landscape: an infectivity surface and a susceptibility surface. In reality, although these surfaces will be influenced by the spatial distribution of the population at risk, they are also influenced by many other factors, e.g., variations in biosecurity, variations in host susceptibility. As a consequence, the two surfaces are not regarded as population density estimates for the host, but as we will see below in some cases, they can provide insights into spatial variation of the underlying population at risk. In this paper, assuming limited or no knowledge of the population at risk, we develop an approach to estimate these two surfaces along with key characteristics of spatial disease transmission, given reported disease cases, but avoiding making any explicit assumptions about the spatial distribution of the population at risk. 
We refer to this as the inferring (information about) populations at risk (iPAR) framework.

To constrain the set of possible surfaces, we assume that each surface is a function of multiple landscape covariates, with parameters estimated from the disease case data. We also aggregate spatially to a set of patches covering the landscape, with the patches being the infectious units of the model. This aids computational tractability but also enables use of uncertain spatial locations. Patch-based models have been used successfully to model the spread of infectious disease, the spread of invasive non-native plant species and other spatio-temporal ecological processes [24–26]. Although spatial aggregation can result in a loss of spatial resolution, the degree of aggregation (patch size) can be controlled to minimise any potential loss of information. The iPAR framework is also highly flexible and can make use of a wide array of spatial covariate data. Land use and climate data might be expected to explain at least some of the variation in patch susceptibility/infectivity. Other covariates could be useful in specific scenarios, e.g., human population density, hunting bag data and previously published estimates of the susceptible population density. The precise choice of covariates will depend on the specific application at hand and the availability of spatially explicit covariates. In the case of a human disease, relevant covariates could include local human population density and key indices, e.g., measuring social deprivation and air pollution.

In this paper we describe an initial implementation of the iPAR framework outlined above. This is described in detail in Methods along with the MCMC implementation of Bayesian inference used for parameterisation from disease case reports for the region of interest. This inference framework is extensively tested using simulated case data in Results. We illustrate the potential of the iPAR framework through application to case reports arising from the ASF outbreaks in Estonia that occurred between 2014 and 2019 in Case study. Finally, we discuss potential application to other systems and possible future development of the iPAR framework in Discussion.

Methods

Here we define an underlying spatial disease transmission model and associated inference tools that together provide a flexible approach to deal with the problem of missing information on populations at risk in spatial epidemiology. This defines an initial implementation; many extensions are possible within the iPAR framework.

The generic model

We model transmission of disease in continuous time across a study region. In principle, any disease could be modelled, whether of animals, plants or humans. However, the case study presented focusses on animal disease. We partition the region into a collection of patches, with each patch indexed by an integer and associated with a given sub-region of the study region. Within the study region, the spatial distribution of the population at risk is uncertain. However, we assume that we have covariates available that can be used to model the variation in patch susceptibility/infectivity across the region. In practice, we choose covariates that are likely to explain at least some of the variation in host population risk profile across the region. The patches are the units of infection within the model framework. Each patch has an associated set of covariate values. In principle, the covariates may vary both temporally and spatially. However, for notational simplicity, we do not show explicit time dependence in the presentation that follows, and in all the applications of the methodology that we consider in this manuscript the covariates are constant in time.

The chosen size of the patches depends on a number of factors, including the spatial resolution of the covariates and the disease case data. If the covariates have a different spatial resolution to the case data then one or both data sets may be aggregated or de-aggregated to the chosen resolution. Patch size should be small enough to capture the apparent pattern of disease spread across the region.

For simplicity we consider modelling transmission according to a spatial SI model. That is, each patch is either susceptible to infection or has become infected/infectious. As shown in Case study, even where individual-host dynamics are not limited to susceptible and infectious compartments, patch-level SI dynamics can still provide a good approximation to the spatial transmission process. In addition, we note that more complex dynamics and a greater set of possible states could in principle be accounted for at the patch scale if suitable data were available. However, such complexities are outside the scope of the current implementation. The SI assumption is most appropriate for the early stages of a disease outbreak, but can continue to be reasonable if the disease is persistent at the patch level on the timescales of the outbreak. This criterion should be considered when selecting an appropriate patch size for the analysis (for example, see Case study). The transmission of disease is modelled over a fixed time interval via a continuous-time, discrete-state-space Markov process. The initial conditions comprise a list of the patches that are infected at the start of this interval, with the remainder assumed susceptible.

Given the initial conditions, the transmission model is fully specified once we define the force of infection on a susceptible patch at any point in time. The force of infection

(1)

on a susceptible patch is comprised of two terms, the first term representing primary infection due to external transmission from outside the region under study, the second term representing transmission from infected patches within the region . Each patch has susceptibility and, when infected, infectivity which can be functions of spatially varying covariates – meaning that they vary between patches. The parameter controls the overall rate of background infection, while controls the overall rate of secondary transmission. The summation is over all infectious patches , is a so-called transmission kernel and is the distance between the centroids of patches and . Typically, we use a truncated power law kernel . Here, is an indicator function (equal to 1 if its argument is a true statement, 0 otherwise), is a normalisation constant defined in S1 Appendix (see also [27]), is a fixed maximum transmission distance, and is a parameter estimated from the outbreak data. If the patches are all squares of the same size then distance, as an input to the kernel function, is measured in units such that one unit is equal to the width of a patch, meaning that, e.g., precisely when patches and are nearest neighbours. The above expressions together represent what we call the constant-in-time model, although as noted above this can accommodate temporal variation in the covariates.
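The verbal description above can be sketched in code. The following is a minimal illustration, not the authors' implementation (which was written in C): it computes a force of infection on each susceptible patch as a background term plus a kernel-weighted sum over infectious patches. All names (`eps`, `beta`, `alpha`, `d_max`, `norm_const`), the specific kernel form (distance raised to the power minus `alpha`, divided by a constant, and zero beyond `d_max`), and the choice to leave the background term unscaled by susceptibility are assumptions made for illustration only; the paper's own definitions are those of equation (1) and S1 Appendix.

```python
import numpy as np

def truncated_power_law(d, alpha, d_max, norm_const=1.0):
    """Assumed kernel: d**(-alpha) / norm_const for 0 < d <= d_max, else 0."""
    d = np.asarray(d, dtype=float)
    inside = (d > 0) & (d <= d_max)
    out = np.zeros_like(d)
    out[inside] = d[inside] ** (-alpha) / norm_const
    return out

def force_of_infection(centroids, infected, suscept, infect,
                       eps, beta, alpha, d_max):
    """Force of infection on every patch (set to zero for infected patches)."""
    centroids = np.asarray(centroids, dtype=float)
    infected = np.asarray(infected, dtype=bool)
    # Pairwise centroid distances, measured in patch-width units.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    kern = truncated_power_law(dist, alpha, d_max)
    # Secondary pressure: sum over infectious patches j of infectivity_j * K(d_ij).
    pressure = kern[:, infected] @ np.asarray(infect, dtype=float)[infected]
    # Assumption: background rate eps is added without a susceptibility factor.
    lam = eps + beta * np.asarray(suscept, dtype=float) * pressure
    lam[infected] = 0.0
    return lam

# Three unit-spaced patches in a row; only patch 0 is infectious.
lam = force_of_infection(
    centroids=[(0, 0), (1, 0), (2, 0)],
    infected=[True, False, False],
    suscept=[1.0, 1.0, 0.5],
    infect=[1.0, 1.0, 1.0],
    eps=0.01, beta=0.2, alpha=2.0, d_max=5.0,
)
```

In this toy example the nearest neighbour of the infectious patch sits at distance 1, so its kernel weight is 1 under the assumed kernel, and the more distant, less susceptible patch receives a correspondingly smaller force of infection.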

We also consider a variant of this model, which we term the varying-in-time model, in which the total force of infection is multiplied by a function of time . This function is intended to model the effects of disease control. For identifiability we assume that , with then typically decreasing following the application of control measures such as removal of hosts. However, we assume only that for all , allowing to both increase and decrease over time, so that such changes are estimated from the data. For simplicity we assume that is piecewise constant on pre-defined subintervals of the modelled time interval with for identifiability, then on the ’th subinterval (.
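A piecewise-constant multiplier of this kind might be sketched as follows. The function name, the argument names, and the convention that each level applies on a half-open subinterval are hypothetical; the only features taken from the text are that the multiplier is constant on pre-defined subintervals and fixed at 1 on the first of them for identifiability. Non-negativity of the levels is an assumption here.

```python
import numpy as np

def time_multiplier(t, breakpoints, levels):
    """Piecewise-constant multiplier of the force of infection.

    levels[k] applies on [breakpoints[k], breakpoints[k + 1]); the final
    endpoint is assigned to the last subinterval.
    """
    levels = np.asarray(levels, dtype=float)
    if len(breakpoints) != len(levels) + 1:
        raise ValueError("need one more breakpoint than levels")
    if levels[0] != 1.0:
        raise ValueError("first level is fixed at 1 for identifiability")
    if np.any(levels < 0):
        raise ValueError("levels are assumed non-negative")
    # Index of the subinterval containing each time point.
    k = np.searchsorted(breakpoints, t, side="right") - 1
    k = np.clip(k, 0, len(levels) - 1)
    return levels[k]

# Hypothetical example: three 12-month subintervals.
breaks = [0.0, 12.0, 24.0, 36.0]
levels = [1.0, 0.6, 0.8]
values = time_multiplier(np.array([0.0, 11.9, 12.0, 30.0, 36.0]), breaks, levels)
```

In the varying-in-time model the total force of infection at time `t` would then be the constant-in-time value multiplied by `time_multiplier(t, ...)`, with all levels after the first estimated from the data.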

We now define the susceptibility and infectivity of a patch in terms of the covariates for that patch. There are many ways of defining the relationship between susceptibility/infectivity and the covariates. Here we suggest a general approach. Typically, covariates are either compositional or non-compositional [28]. Non-compositional covariates have unconstrained values, for example, average temperature and rainfall for the patch. Compositional covariates are constrained and dependent, for example, the proportion of the patch occupied by different land uses. These covariates are proportions that sum to one. Suppose that the set of covariates for patch comprises non-compositional covariates and compositional covariates , with being the proportion of patch on which a spatial categorical variable takes value (of distinct possible values). For example, these proportions may be obtained from a partition of each patch into distinct land uses. Then, a plausible model for the susceptibility of a patch is where the Greek letters are parameters to be estimated and each . We also require that for identifiability, bearing in mind that will be multiplied by the overall rate in the expression for the secondary force of infection. This general form of susceptibility function is analogous to the suitability function defined in [27] in the context of biological invasions. A similar expression is assumed for patch infectivity . Again, the Greek letters for and for are parameters to be estimated, with each and .

Each of the categories contributes additively to the susceptibility, with the contributions weighted by the covariates . Each non-compositional covariate contributes multiplicatively, with negative relationships possible for these covariates when . However, it is important to note that negative relationships are also possible for the compositional covariates, even though the are non-negative, due to the constraint that . For example, if the smallest component of is then the covariate will be negatively related to susceptibility in the sense that, if increases and the for decrease subject to the summation constraint, then the overall effect is that will decrease.
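To make the additive and multiplicative contributions concrete, the sketch below evaluates one functional form that is consistent with the description above: a weighted sum over the compositional covariates, multiplied by an exponential of a linear combination of the non-compositional covariates. The exponential link, the function name and the argument names (`gamma`, `sigma`) are assumptions for illustration and may differ from the form used in the paper.

```python
import numpy as np

def patch_susceptibility(noncomp, comp, gamma, sigma):
    """Assumed form: exp(noncomp @ gamma) * (comp @ sigma).

    noncomp: (n_patches, n_noncomp) unconstrained covariates
    comp:    (n_patches, n_categories) proportions, each row summing to 1
    gamma:   (n_noncomp,) real coefficients; negative values give
             negative relationships
    sigma:   (n_categories,) non-negative weights summing to 1
    """
    comp = np.asarray(comp, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if np.any(sigma < 0) or not np.isclose(sigma.sum(), 1.0):
        raise ValueError("sigma must be non-negative and sum to 1")
    if not np.allclose(comp.sum(axis=1), 1.0):
        raise ValueError("each row of comp must sum to 1")
    additive = comp @ sigma
    multiplicative = np.exp(
        np.asarray(noncomp, dtype=float) @ np.asarray(gamma, dtype=float)
    )
    return multiplicative * additive

# Two patches, three land-use categories, no non-compositional covariates.
sigma = [0.1, 0.3, 0.6]
comp = [[0.2, 0.4, 0.4],   # patch A
        [0.6, 0.2, 0.2]]   # patch B: more of the lowest-weight category
s = patch_susceptibility(np.zeros((2, 0)), comp, np.zeros(0), sigma)
```

Patch B has a larger share of the category with the smallest weight and therefore a lower susceptibility than patch A, which illustrates how a compositional covariate can be negatively related to susceptibility even though every weight is non-negative.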

Bayesian inference and prediction based on disease case reports

This section describes details of the approach to estimating model parameters from data, how we compare between different model formulations, quantify uncertainty in model predictions and test performance of the iPAR framework. Some readers may wish to skip these details on initial reading.

Parameter and latent variable estimation.

Having defined the model, we now present methodology that allows fitting of the model to data including estimation of unobserved (latent) infection times. For a disease outbreak, case reports describe observed cases of the disease in individual animals or on individual farms. We assume that each case has a location and a detection time associated with it. Locations are taken to be specified at patch resolution or better. Detection times of disease cases are taken to be interval-censored, which allows for the common scenario whereby new case reports are provided at regular intervals (e.g., daily, or monthly), but can also accommodate irregular observation times. We denote these reporting times by where is the time interval over which case report data are available. We assume that interval-censored detection times should constrain the actual infection times of the individuals/farms within the patch, i.e., that a case occurring after time but before time is reported at time . In practice, we expect performance of this modelling framework to degrade if a significant number of infections occurring in an interval are unobserved at . To some extent this can be countered by extending these reporting time windows used in the analysis resulting in greater estimated uncertainty in inferred infection times (see below). When aggregating to the patch level, we define the infection time of a patch to be the earliest recorded infection time of the set of individuals/farms within it. We assume that all infected patches are detected; in this sense the system is taken to be perfectly observed, a common assumption in the literature, see, e.g., [19,29].
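The aggregation of interval-censored case reports to the patch level can be sketched as below. This is an illustrative helper rather than the authors' code: the function name, the data layout and the boundary convention (a detection exactly at a reporting time is assigned to that reporting time) are assumptions.

```python
import bisect

def patch_report_index(case_reports, reporting_times):
    """Map each patch to the index of the reporting time of its earliest case.

    case_reports:    iterable of (patch_id, detection_time) pairs
    reporting_times: sorted list of reporting times
    Returns {patch_id: k}, where reporting_times[k] is the first reporting
    time at or after the earliest detection in that patch. Index 0 means the
    patch was already reported as infected at the first reporting time.
    Patches absent from the result were never reported in the window.
    """
    earliest = {}
    for patch, t in case_reports:
        if t > reporting_times[-1]:
            continue  # detected after the end of the data window
        k = bisect.bisect_left(reporting_times, t)
        earliest[patch] = min(k, earliest.get(patch, k))
    return earliest

# Hypothetical monthly reporting times and individual case reports.
times = [0, 1, 2, 3]
reports = [("A", 0.4), ("A", 2.5), ("B", 2.0), ("C", -1.0), ("D", 3.7)]
categories = patch_report_index(reports, times)
```

Here patch A has two cases and inherits the earlier reporting index, patch C was already infected at the first reporting time, and patch D is treated as unreported because its only case falls outside the window.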

We use Bayesian techniques to infer the model parameters from outbreak data. The true infection time of each patch is not known with certainty so, as is common when fitting epidemic models, we use data-augmented MCMC to facilitate the inference [7]. This general approach to inference is well established in the literature and has been applied in, for example, [29–31]. Let denote the set of model parameters and let denote the set of unobserved patch infection times. For each patch , let denote the ‘time category’ into which the patch falls in terms of reporting. So, if was already reported as infected at time and () if was first reported as infected at reporting time . Finally, if was not reported as infected at any point in the time interval , which we assume implies that remains susceptible at time . We also let denote the set of observed . These observed time categories are the data we use to infer the model parameters and the actual and unobserved or latent patch infection times . Bayes’ theorem states that

Here represents the joint prior distribution of the model parameters. The prior distribution is typically chosen to be non-informative (see S2 Appendix for details). The factor above is usually termed the complete data likelihood for the model parameters conditioned on the observed data and a given set of latent infection times. The transmission process is a continuous time discrete state space Markov process, so this likelihood can be computed using standard techniques as described in, for example, [30,32]. We are now in a position to draw samples from the joint posterior distribution . This is achieved by Markov Chain Monte Carlo (MCMC) methods. Full details of the likelihood calculation and sampling algorithm may be found in S3 Appendix. The computer code for running the simulations and inference was written in C, with R used for pre- and post-processing of data.
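For orientation, the factorisation described in this paragraph can be written schematically as follows. The notation is assumed for illustration and may differ from the paper's: theta stands for the model parameters, bold t for the latent patch infection times, bold c for the observed time categories, L for the complete data likelihood and pi for prior and posterior densities.

```latex
\pi(\theta, \mathbf{t} \mid \mathbf{c}) \;\propto\; L(\theta; \mathbf{t}, \mathbf{c}) \, \pi(\theta)
```

Under this reading, data-augmented MCMC alternates between updating the parameters given the current latent infection times and updating the latent infection times subject to the constraints imposed by the observed time categories.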

Model selection.

We employ a widely used metric for assessing goodness of fit, the Deviance Information Criterion (DIC) [33]. The DIC is not uniquely defined for latent variable models [34]. The authors of [14] assessed a number of DIC variants in the context of spatio-temporal epidemic modelling and found that their metric (which corresponds to in [34]) performed well. Therefore, we use this metric to compare models, in particular to compare the constant-in-time and varying-in-time models when both are fitted to the same set of outbreak data. See S4 Appendix for details of the definition and computation of this metric. Model selection is a challenging problem, and is not the main focus of this manuscript.

Uncertainty in model predictions.

Having fitted the model to outbreak data, we can then use the fitted model to predict the timing of future transmission events beyond the period of the observed data by sampling from the appropriate posterior predictive distribution. This is achieved by forward simulation of the transmission model over a future time interval, with model parameters sampled from the joint posterior distribution and initial conditions typically set to the infection status of the patches at the final time point of the data used to fit the model. We typically conduct large numbers of such simulations, say at least 10,000, so as to capture the full distribution of possible outcomes.

The model predictions obtained from sampling the posterior predictive distribution can be summarised in a number of ways. We can plot the number of infected patches as a function of time. We can also plot risk maps that depict, for each patch , the probability that it is infected by some future time , which we calculate as the proportion of simulations in which becomes infected by time .
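Both summaries can be computed directly from an array of simulated patch infection times. The sketch below assumes a hypothetical layout in which each row is one posterior predictive simulation, each column is one patch, and `np.inf` marks a patch that was never infected in that simulation; the function names are illustrative.

```python
import numpy as np

def risk_map(sim_infection_times, horizon):
    """Per-patch probability of infection by `horizon`.

    Computed as the proportion of simulations in which the patch has an
    infection time at or before `horizon`.
    """
    times = np.asarray(sim_infection_times, dtype=float)
    return (times <= horizon).mean(axis=0)

def infected_count_curves(sim_infection_times, grid):
    """Number of infected patches at each grid time, one row per simulation."""
    times = np.asarray(sim_infection_times, dtype=float)
    grid = np.asarray(grid, dtype=float)
    return (times[:, None, :] <= grid[None, :, None]).sum(axis=2)

# Four toy simulations of three patches; patch 0 is infected from the start.
sims = [[0.0, 5.0, np.inf],
        [0.0, 14.0, 20.0],
        [0.0, np.inf, 9.0],
        [0.0, 3.0, np.inf]]
risk = risk_map(sims, horizon=12.0)
curves = infected_count_curves(sims, grid=[0.0, 12.0, 24.0])
```

Quantiles of `curves` across simulations (for example via `np.quantile(curves, [0.025, 0.975], axis=0)`) give a predictive band for the number of infected patches over time.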

The use of the constant-in-time model to predict future events in this way is straightforward. On the other hand, fitting the varying-in-time model leads to an estimate of the function for , but we cannot use this fitted model to make predictions without further assumptions on the behaviour of for . For this reason, we only use the constant-in-time model to make predictions, although one could envisage future projections conditioned on specified scenarios for at .

The predictions can also be compared to what actually happened, if such data are available. This is possible for simulated data where we generate a realisation of the model that represents the ‘truth’. Alternatively, with real data from a disease outbreak we can hold out some portion of the data from later stages of the outbreak, fitting the model only to data from the earlier stages of the outbreak. In either case, we can compare the predictions to such ‘truth’ in two ways. First, we may directly compare the total number of infected patches as it increases over time. Second, we can assess the ability of the predictions to identify the correct infected patches, rather than simply the correct overall number of infected patches, by computing discrimination metrics that compare the risk map to the true map of infection. For a given probability threshold , we calculate the true positive rate (TPR, the proportion of patches infected during with infection probability exceeding ) and the false positive rate (FPR, the proportion of patches uninfected at time with infection probability exceeding ). Plotting the true and false positive rates for multiple thresholds gives the Receiver Operating Characteristic (ROC) curve [35], from which standard metrics such as the AUC (Area under the ROC curve) and the TPR at 5% FPR can be derived.
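The discrimination metrics described above can be computed as in the following sketch. It is a generic ROC calculation, not the authors' code; the function names are hypothetical, and the inputs are assumed to cover only patches that were still susceptible at the start of the prediction window, with at least one patch in each outcome class.

```python
import numpy as np

def roc_curve(risk, truth):
    """False and true positive rates over all probability thresholds.

    risk:  predicted infection probability for each patch
    truth: True where the patch was actually infected in the window
    A patch is flagged when its probability strictly exceeds the threshold.
    """
    risk = np.asarray(risk, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    thresholds = np.concatenate(([np.inf], np.unique(risk)[::-1], [-np.inf]))
    tpr = np.array([(risk[truth] > q).mean() for q in thresholds])
    fpr = np.array([(risk[~truth] > q).mean() for q in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def tpr_at_fpr(fpr, tpr, max_fpr=0.05):
    """Largest TPR achieved at any threshold with FPR not exceeding max_fpr."""
    return float(tpr[fpr <= max_fpr].max())

# Toy risk map and held-out truth for five patches.
risk = [0.9, 0.8, 0.4, 0.3, 0.1]
truth = [True, False, True, False, False]
fpr, tpr = roc_curve(risk, truth)
area = auc(fpr, tpr)
```

For this toy example, five of the six infected/uninfected patch pairs are ranked correctly by the risk map, and the area under the curve equals that proportion.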

Testing the iPAR modelling approach.

To assess the iPAR framework, it is important to first test it using simulated outbreak data. The use of simulated data has two key benefits. First, we can explore a wider range of scenarios than is possible using real outbreak data, which can be very helpful when exploring the strengths and weaknesses of methods, as well as the limits of their applicability. Second, with simulated data we always know the ‘true’ data generating process, which allows us to assess, for example, how close parameter estimates obtained from the simulated data are to their ‘true’ values. Such comparisons between true and estimated parameter values are impossible with real outbreak data. On the other hand, simulated data may have properties that are different from real outbreak data, and the conclusions may depend on the choice of synthetic trajectories. Therefore, it is also critical to test methods via application to real world data and problems as we do in Case study. Model assumptions can be validated externally by comparing fitted model outputs with other sources of data on the outbreak and the susceptible population that were not used to fit the model. Such approaches depend very much on the details of the scenario under investigation. We apply the iPAR framework to real-world disease case report data and describe some specific validation approaches in Case study.

Simulation-based methods are used to systematically test the constant-in-time iPAR model in Results. We employ a form of sensitivity analysis in which we start from a baseline model parameterisation based on the parameter estimates obtained from the ASF outbreak data in Case study. We then vary each model parameter separately (keeping all others fixed), simulating a single set of outbreak data for each parameter combination and then inferring model parameters from the simulated outbreak data. In this way, we can build up plots of true versus inferred parameter values (95% credible intervals) for each model parameter. This kind of sensitivity analysis has been conducted in, e.g., [36]. If the model parameters are sampled from the prior then the average coverage is equal to the nominal value [37] but, for a specific choice of parameters, Bayesian credible intervals don’t necessarily have their nominal coverage. This means that the true value of a parameter may not fall within a 95% Bayesian credible interval for 95% of the time. Nevertheless, we can visually assess the output from the sensitivity analysis in order to check (a) whether the inference is recovering the ‘true’ parameter values in the way we would expect, and (b) which parameters are more difficult to recover from the simulated data. For parameters that are difficult to recover, the inferred parameter may only depend weakly on the true parameter, and may instead be strongly influenced by the prior distribution assigned to the parameter.

In the setting of the constant-in-time model with no non-compositional covariates, which is the setting we use in Estimation of key epidemiological parameters in Results, the full set of model parameters is as follows: the land use susceptibilities , the land use infectivities , and three remaining parameters (kernel parameter), (overall transmission rate) and (background transmission rate). In order to perform the sensitivity analysis, each parameter is taken in turn and varied while keeping other parameters fixed. One issue is that the susceptibilities are not independent, being constrained to sum to 1, and similarly for the infectivities. To circumvent this problem, we reparametrize the model using the isometric logratio transform , , which maps the simplex-valued parameters to real multivariate parameters (see S5 Appendix for details). Thus, the components of and are unconstrained and can be varied independently as part of the sensitivity analysis.
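A minimal sketch of an isometric logratio transform and its inverse is given below. The particular orthonormal basis (a Helmert-type contrast matrix) is an assumption, since the basis used in S5 Appendix is not reproduced here; any orthonormal basis of the sum-to-zero hyperplane gives a valid transform. The function names are illustrative, and the forward transform requires strictly positive components.

```python
import numpy as np

def ilr_basis(n_parts):
    """Orthonormal (Helmert-type) basis of the vectors that sum to zero."""
    basis = np.zeros((n_parts, n_parts - 1))
    for j in range(1, n_parts):
        basis[:j, j - 1] = 1.0 / j
        basis[j, j - 1] = -1.0
        basis[:, j - 1] *= np.sqrt(j / (j + 1.0))
    return basis

def ilr(x):
    """Map a composition (positive parts summing to 1) to real coordinates."""
    x = np.asarray(x, dtype=float)
    log_x = np.log(x)
    clr = log_x - log_x.mean()          # centred logratio
    return ilr_basis(x.size).T @ clr    # project onto the basis

def ilr_inverse(z):
    """Map real coordinates back to the simplex."""
    z = np.asarray(z, dtype=float)
    clr = ilr_basis(z.size + 1) @ z
    w = np.exp(clr - clr.max())         # subtract max for numerical stability
    return w / w.sum()

# Hypothetical land-use susceptibilities for three categories.
sigma = np.array([0.1, 0.3, 0.6])
z = ilr(sigma)                  # two unconstrained coordinates
sigma_back = ilr_inverse(z)     # recovers the original composition
```

In a sensitivity analysis of the kind described above, each component of `z` can be perturbed independently and mapped back with `ilr_inverse`, which always returns a valid set of weights summing to one.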

Results

Assessing the iPAR framework performance

We now assess the iPAR framework by applying it to simulated outbreak data. In Estimation and prediction for an illustrative simulated outbreak, we analyse data from a single simulated outbreak – simulated from the constant-in-time model with spatially varying susceptibility/infectivity – in order to illustrate the typical workflow and outputs obtained from the model fitting process. In Estimation of key epidemiological parameters, we conduct a systematic simulation study in which we generate outbreak data for a wider range of scenarios, though still for the constant-in-time model with spatially varying susceptibility/infectivity, and assess the ability of the inference to recover the ‘true’ parameters. In Estimation of temporal trends in transmission, we explore the ability of the inference to recover temporal trends in transmission, using simulated outbreak data from the varying-in-time model with spatially varying susceptibility/infectivity. Finally, in Benefits of modelling spatial variation in susceptibility and infectivity, we assess the benefits of modelling spatial variation in susceptibility and infectivity, one of the key aspects of the iPAR framework, both in terms of estimating key parameters and in terms of predicting held-out future simulated outbreak data. To do this, we generate simulated outbreak data from the constant-in-time model with spatially varying susceptibility/infectivity, but we fit either a model of similar form or a model which assumes spatially homogeneous susceptibility/infectivity.

The results of the assessments carried out here demonstrate that the iPAR framework is able to reliably infer epidemiologically relevant information from case-only reports across the range of scenarios considered. We found that the simulations run very quickly but the inference code can take between 1 and 5 hours to run for the scenarios presented here. The time taken increases with the number of patches and the number of patches that become infected, but in a non-straightforward manner due to optimisations implemented in the inference code. Computations were carried out using an Intel i7-8665u processor with 32GB memory. Five MCMC chains were run in parallel, with 100000 iterations per chain. Simulated trajectories were selected at random. Some readers may wish to skip the following details on initial reading and proceed directly to Case study, later returning to gain greater insights into those latter results.

Estimation and prediction for an illustrative simulated outbreak

Outbreak data are simulated using the constant-in-time iPAR model on the Estonian landscape, which is divided into 575 10km square patches, with model parameters similar to those estimated from the outbreak data for ASF in wild boar in Estonia (see Case study). As in the case study, land use data for Estonia are used as the basis for computing covariates used to model spatial variation in both susceptibility and infectivity. Four patches are initially seeded with infection, and the duration of the simulation is chosen such that a substantial proportion of patches become infected by the end of the simulation.

A single simulated outbreak can be summarised by plotting the number of infected patches over time (Fig 1A, black line). The constant-in-time iPAR model was fitted to the early-stage epidemic data up to month 12 from this single simulated outbreak, with simulated patch infection times interval censored into one-month long intervals. The resultant posterior distribution provides estimates for all model parameters. For example, we can visualise the posterior marginal distribution of the background transmission rate parameter (Fig 1B, with true value represented by a vertical line) by aggregating MCMC samples. We can also generate the 95% credible intervals for parameters and functions of the model parameters such as the transmission kernel (Fig 1C, true kernel in black, 95% credible interval in blue shading). Credible intervals for the components of the susceptibility parameter (Fig 1D, 95% credible intervals represented by blue lines) are often quite wide, but in general these intervals align well with the ‘true’ values.

Fig 1. Estimated parameters and predictions from fitting the iPAR model to simulated outbreak data.

A: number of infected patches over time in the simulated outbreak data (black line). The model was fitted to the outbreak data up to month 12. Predicted number of infected patches (95% credible interval) beyond month 12 is shown in blue. B: marginal posterior distribution for the background transmission rate, with ‘true’ value indicated by the black vertical line. C: ‘true’ kernel function (black line) and estimated kernel function 95% credible interval (blue shading). D: estimated susceptibility parameter for each land use (95% credible interval, blue) and ‘true’ value (black circle).

https://doi.org/10.1371/journal.pcbi.1012622.g001

The model, fitted to data up to month 12, was then used to predict progression of the epidemic beyond the first 12 months. This was achieved by simulating 10000 times with parameters sampled from the posterior distribution and initial conditions specified by the state of the epidemic at month 12. These predictions may be summarised by 95% credible intervals for the number of infected patches over time (Fig 1A, blue shading). The intervals contain the ‘true’ epidemic curve in this case, suggesting satisfactory inference of the model parameters.
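The prediction procedure described above (draw parameters from the posterior, simulate forward, summarise pointwise) can be sketched as follows. This is a minimal illustration only: the simulator is a toy non-spatial stand-in rather than the iPAR model, and the function names, parameter values and posterior draws are all hypothetical.

```python
import math
import random

def simulate_counts(beta, eps, n_patches, n_infected, n_months, rng):
    """Toy non-spatial SI stand-in: cumulative number of infected patches per month."""
    counts = []
    for _ in range(n_months):
        # Monthly infection probability for each susceptible patch.
        p = 1.0 - math.exp(-(eps + beta * n_infected / n_patches))
        new = sum(1 for _ in range(n_patches - n_infected) if rng.random() < p)
        n_infected += new
        counts.append(n_infected)
    return counts

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def predictive_intervals(posterior_samples, n_sims, **sim_args):
    """Pointwise 95% intervals from simulations driven by posterior draws."""
    rng = random.Random(1)
    runs = []
    for _ in range(n_sims):
        beta, eps = rng.choice(posterior_samples)  # one posterior draw per simulation
        runs.append(simulate_counts(beta, eps, rng=rng, **sim_args))
    intervals = []
    for month in range(sim_args["n_months"]):
        values = sorted(run[month] for run in runs)
        intervals.append((quantile(values, 0.025), quantile(values, 0.975)))
    return intervals

# Hypothetical posterior draws of (transmission rate, background rate).
posterior_samples = [(0.30, 0.001), (0.35, 0.002), (0.25, 0.001)]
intervals = predictive_intervals(
    posterior_samples, n_sims=200, n_patches=575, n_infected=40, n_months=6
)
```

Because each simulation uses a different posterior draw, the resulting intervals reflect both parameter uncertainty and the stochasticity of the epidemic process.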

Estimation of key epidemiological parameters

We systematically tested the constant-in-time model over a wide range of parameter combinations using the sensitivity analysis approach described in Testing the iPAR modelling approach in Methods. The simulation model set-up was identical to that in the previous section except for the use of different sets of model parameters. Data from most of the simulated outbreak (up to month 52) were used to fit the model. Plots of true versus inferred parameters are available in Fig 2 for the kernel parameter, the overall transmission rate and the background transmission rate, and in S6 Appendix for the remaining parameters. Key model parameters such as the kernel parameter and the background transmission rate are well recovered from the simulated outbreak data, though the overall transmission rate tends to be somewhat overestimated. Reasons for the latter phenomenon will be discussed later. The susceptibility parameters tend to be well recovered, while the infectivity parameters seem more difficult to recover, which is consistent with the results in [36].

Fig 2. Reliable parameter inference for the constant-in-time iPAR model.

Each panel corresponds to a parameter of the model. The ‘true’ value of the parameter is plotted against the posterior median (cross) and posterior 95% credible interval (vertical line). For comparison purposes we also superimpose the diagonal line representing perfect agreement between inferred and true values. This figure includes panels for the kernel parameter, the overall transmission rate and the background transmission rate. Analogous figures for the other parameters may be found in S6 Appendix.

https://doi.org/10.1371/journal.pcbi.1012622.g002

Estimation of temporal trends in transmission

Next, we focus on the varying-in-time iPAR model with parameters similar to those obtained from fitting this variant of the model to the Estonian ASF outbreak data (see Case study) as a baseline parameterisation. We then vary the form of the temporal trend function according to three pre-chosen scenarios: (1) decreasing linearly (perhaps representing increasing levels of disease control), (2) constant (no disease control), and (3) initially constant then suddenly dropping to low levels (corresponding to rapid strong disease control after some delay). The changepoints of the function are fixed and are kept equal to those used in the case study. The function from the first scenario is closest to what was actually estimated from the data in the case study. Note that, in the case study, the temporal trend function was seen to initially increase before decreasing, which is why we have also allowed for an initial increase in this function in the first and third scenarios. We assess the extent to which these temporal trends can be recovered by inference from data simulated under each of the above three scenarios. We also fit the constant-in-time model to the same sets of simulated outbreak data, which allows us to compare model fits (constant-in-time versus varying-in-time) using the model selection metric discussed in Model selection in Methods.

Fig 3 shows the simulated ‘true’ incidence curve for each of the three temporal trends. Overlaid on this ‘true’ incidence curve is a region representing 95% credible intervals for incidence, obtained from simulations of the fitted varying-in-time model with initial conditions the same as those of the simulated outbreak data used for fitting the model. For all three temporal trends, the agreement between the credible intervals and the ‘true’ incidence curve shows that the algorithm is able to reliably recover parameters giving temporal trends in incidence that agree with the ‘true’ incidence curve. Fig 3 also shows the 95% credible interval for the temporal trend function in the same set of three simulations. Despite the striking fact that the form of this function is not at all obvious if presented with a single outbreak incidence curve, the trend function is inferred correctly from single outbreaks, though with some overestimation. We also computed model selection metric values for each model fit (see S4 Appendix for actual values obtained). For scenarios (1) and (3), in which the temporal trend function varies over time, the metric indicated a strong preference for the varying-in-time model. For scenario (2), in which the temporal trend function is constant, there was a mild preference for the constant-in-time model. These results were all as we would expect.

Fig 3. Reliable recovery of temporal trends in the time varying simulation study.

A: the black line shows the incidence curve from a single simulation of the “linear decrease” scenario. The blue area shows the 95% credible intervals for incidence, defined as newly infected patches each month, obtained from the posterior predictive distribution of the fitted varying-in-time model, given the data from that single simulation. B: true and estimated temporal trend function for the same scenario. Black line represents true value, blue area shows the 95% credible interval. C,D: same plots for the “constant over time” scenario. E,F: same plots for the “sudden drop off” scenario (see text for details).

https://doi.org/10.1371/journal.pcbi.1012622.g003

Benefits of modelling spatial variation in susceptibility and infectivity

We assess the importance of modelling spatial variation in susceptibility and infectivity in the context of the constant-in-time iPAR model with land use covariates but no non-compositional covariates. The spatial variation in the patch susceptibilities and infectivities of the model arises as a result of spatial variation in the covariates, combined with differences in the susceptibility and infectivity parameters between land use categories. Our approach here is to simulate outbreak data incorporating such spatial variation and then fit two model variants to the simulated outbreak data. First, we fit a heterogeneous model which accounts for this spatial variation and has the same form as the model used for simulation. Second, we fit a simpler homogeneous model in which there are no covariates and patch susceptibility/infectivity is spatially constant. Comparing the two model fits and their associated predictions enables us to assess the importance of taking into account variation in susceptibility and infectivity.

We consider model fitting given two levels of data availability. First, we fit models to data from the early stages of an outbreak. Second, we fit models to data from an entire outbreak. At both levels, we consider four different parameter combinations for the simulated outbreak data. This is fewer than the number of parameter combinations considered in Estimation of key epidemiological parameters, but the chosen parameter combinations cover a range of common transmission patterns. We use either a wide or a moderately wide transmission kernel. Narrow kernels, such as those approximating nearest neighbour kernels, are avoided because we intuitively expect there to be little difference between the predictive performance of the heterogeneous and homogeneous models in this case – see Case study and associated discussion for more on this topic. In the simulations, we assume that there is a single land use with high susceptibility (all other land uses are assigned equal low susceptibilities), and the same applies for infectivity. Broadleaf woodland is taken to have high susceptibility in all scenarios, while the high infectivity land use alternates between broadleaf woodland and arable land. Precise details of the scenarios are provided in S8 Appendix.

We repeat the model fitting and prediction for 20 replicate simulations per parameter combination, in order to give a fuller picture of parameter recovery and predictive performance in each case. Mean (over all 20 replicates) AUC and mean TPR at 5% FPR measure discriminatory power. We compute the posterior predictive distribution of the ratio of the predicted number of infected patches to the true number of infected patches, and its reliability, which is the percentage of replicates in which the 95% CrI for this ratio includes the value corresponding to perfect prediction. We also compute the mean bias (posterior median of the ratio minus that value, averaged over all 20 replicates). Corresponding quantities are computed for the kernel parameter. For the susceptibility and infectivity parameters we adopt a different approach. We focus on estimation of the large susceptibility parameter and the large infectivity parameter. These parameters are likely more difficult to estimate than the other susceptibility/infectivity parameters because they are far from the main mass of the prior distribution on the simplex. So, we measure the difference between the posterior and prior medians (averaged across all 20 replicates), which captures the extent to which the posterior has moved away from the prior. If the difference is very small then there is little information in the data regarding that parameter. For the given simulation set-up, a posterior median that is 0.73 higher than the prior median corresponds to perfect estimation of the large susceptibility/infectivity parameter.
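The performance summaries described above can be sketched in a few lines of plain Python. The function names and example data are hypothetical, and the AUC and TPR calculations follow their standard definitions rather than any implementation detail confirmed by the paper.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true positive rate among thresholds with false positive rate <= max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    best = 0.0
    for threshold in sorted(set(scores)):
        fpr = sum(n >= threshold for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= threshold for p in pos) / len(pos))
    return best

def reliability(intervals, target):
    """Percentage of replicates whose (lower, upper) interval contains the target."""
    return 100.0 * sum(lo <= target <= hi for lo, hi in intervals) / len(intervals)

def mean_bias(medians, target):
    """Average of (posterior median - target) over replicates."""
    return sum(m - target for m in medians) / len(medians)

# Hypothetical predicted infection probabilities and observed outcomes for seven patches.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0]
example_auc = auc(scores, labels)
example_tpr = tpr_at_fpr(scores, labels)

# Hypothetical credible intervals and posterior medians from three replicates.
example_reliability = reliability([(0.8, 1.2), (1.1, 1.4), (0.9, 1.0)], target=1.0)
example_bias = mean_bias([1.1, 0.9, 1.3], target=1.0)
```

In a simulation study these functions would be applied per replicate and then averaged, since the true infection status and true parameter values are known.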

Using only early-stage data, predictive performance assessed using discrimination metrics was consistently better for the heterogeneous model, if not by a very large margin (Table 1). Ability to predict the number of infected patches was good for both the heterogeneous and homogeneous models. In terms of parameter recovery using only early-stage data, we focus on the transmission kernel parameter, the susceptibility parameter associated with the high susceptibility land use class and the infectivity parameter associated with the high infectivity land use class. The kernel parameter was typically well estimated (Table 2), though the homogeneous model tended to somewhat overestimate this parameter, a phenomenon that was also observed by [27]. However, susceptibility and infectivity parameters were imperfectly recovered (Table 2). Comparison of the posterior and prior medians suggests that the early-stage data used for fitting provide some information on susceptibility but are less informative on infectivity.

In the ‘entire outbreak’ data availability setting, it is no longer meaningful to consider predictive performance on held-out future outbreak data. Parameter recovery is assessed for the same set of parameter combinations as for the ‘early stages’ setting. Reliability of inference for the transmission kernel parameter is excellent for the heterogeneous model but less good for the homogeneous model (Table 3). This parameter is consistently overestimated, especially by the homogeneous model, which again is consistent with what was observed in [27]. The posterior medians of the (large) susceptibility and infectivity parameters have shifted closer to their true values when compared to Table 2. This is as we would expect given that we have more data available to fit the model in this setting. In particular, there is now clearer evidence of a shift in the infectivity parameter posterior distribution away from the prior distribution.

Finally, to put these results into context, we take a closer look at the spatial variation in the Estonian landscape, which we have used for these simulations. At the 10km resolution, the Estonian landscape is not particularly heterogeneous in terms of land use (Fig 4). Most of the 575 10km square patches comprise a mixture of land uses, with very few patches consisting mostly of a single land use. The low heterogeneity in land use is one factor leading to limited variation in estimated patch susceptibilities (Fig 5). In general, we might expect that increased heterogeneity in land use, increased differences in susceptibility/infectivity between land uses, a wide kernel and higher numbers of available case reports are all factors that should increase the benefits of modelling spatial variation in susceptibility and infectivity.

Fig 4. Landscape heterogeneity in Estonia.

A: ternary diagram showing the land use composition of every 10km square patch covering Estonia (land use amalgamated into three categories – forest, agricultural and other – for ease of visualisation). B: histogram showing maximum patch land use proportion for every 10km square patch in Estonia, with six land uses as in the model covariates.

https://doi.org/10.1371/journal.pcbi.1012622.g004

Fig 5. Typical distribution of inferred patch susceptibilities.

Typical distribution of (inferred) posterior mean patch susceptibilities given data from a simulated disease outbreak in Estonia. There is some variation but few patches with very high or low susceptibilities.

https://doi.org/10.1371/journal.pcbi.1012622.g005

Case study: African swine fever in the Baltic states

Background/approach

Here, we used all ASF cases in wild boar populations reported in Estonia between 2014 and 2019 to the World Organisation for Animal Health (WOAH). The Estonian ASF outbreak had very serious impacts for both the pig industry and wildlife. African Swine Fever leads to high mortality, and its spread is difficult to control. It continues to pose a major threat in many regions of Europe [38–40]. In this study we focus on better quantifying the risk of transmission and spread of ASF in the Estonian wild boar population. Knowledge of the wild boar population in this region is incomplete, though estimated wild boar density maps have been published [39]. Within Estonia, there have been extensive efforts to control the ASF outbreak, partly by reducing the density of wild boar. These efforts are reflected in density maps showing an apparent reduction in the population over time [39]. Case reports of ASF in dead wild boar are available at monthly time intervals, with detailed location information. Therefore, given the uncertainty in knowledge of the Estonian wild boar population described above, we apply the iPAR framework to this outbreak.

To apply the framework to the ASF outbreak in Estonian wild boar, we must first define a grid of patches covering Estonia, as well as covariates likely to be related to wild boar density and hence to patch susceptibility and infectivity. Land use is one of the key factors likely to determine wild boar density [41], and hence we use high resolution CORINE land use data available for Estonia [42]. The case report data for wild boar in Estonia also have high spatial resolution. However, if the patches used to define the infectious units of the model are too small then (a) fitting the model will be very computationally intensive, and (b) the persistence assumption inherent in our assumed SI framework is less likely to be reasonable. As a compromise we initially opted for square patches of size 10km by 10km, though smaller patches would also be feasible. This choice of patch size was informed by a persistence of infection analysis, which revealed that, at the 10km resolution, ASF in wild boar in Estonia typically persists for long periods of time, often multiple years (S9 Appendix). Any of the 10km patches composed wholly of sea or inland water were removed from the set of patches. In the case of Estonia, there are 575 patches covering the country. Fig 6 illustrates the spatial spread of ASF through the wild boar population in Estonia at the 10km patch resolution.

Fig 6. Spread of ASF in Estonian wild boar.

Spatially aggregated African Swine Fever wild boar cases reported in Estonia. Snapshots at: November 2014, September 2015, July 2016. Bottom right panel shows the outbreak incidence curve, generated using spatially aggregated wild boar case reports. Map base layer taken from https://gadm.org/maps.html.

https://doi.org/10.1371/journal.pcbi.1012622.g006

We took CORINE land cover data [42] at 100 metre resolution and spatially aggregated it to the 10km patches to obtain proportions of each land use category for each patch. We also aggregated some of the CORINE land cover categories, resulting in a total of 6 categories: 1 = urban, 2 = agriculture, 3 = broadleaf/mixed forest, 4 = coniferous forest, 5 = semi-natural, 6 = wetlands/water bodies excluding sea. We defined the susceptibility of a patch in terms of the land use susceptibility parameters, which are to be estimated, and the proportion of land in the patch occupied by each land use category. Similarly, consistent with the notation in Methods, the infectivity of a patch is defined in terms of the land use infectivity parameters and the same land use proportions.
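The aggregation step can be illustrated with a short sketch. It assumes, purely for illustration, that patch susceptibility is the proportion-weighted sum of the land use susceptibility parameters; the exact expression is the one given in Methods. The function names, the cell counts and the parameter values are hypothetical.

```python
from collections import Counter

def land_use_proportions(cell_categories, n_categories=6):
    """Proportion of cells in a patch belonging to each category (numbered from 1)."""
    counts = Counter(cell_categories)
    total = len(cell_categories)
    return [counts.get(k, 0) / total for k in range(1, n_categories + 1)]

def patch_value(proportions, land_use_params):
    """Proportion-weighted sum of land use parameters (assumed functional form)."""
    return sum(p * s for p, s in zip(proportions, land_use_params))

# Hypothetical patch: 100 land cover cells labelled with the six aggregated categories
# (3 = broadleaf/mixed forest, 2 = agriculture, 4 = coniferous forest).
cells = [3] * 50 + [2] * 30 + [4] * 20

# Hypothetical land use susceptibilities, constrained to sum to 1.
susceptibility_params = [0.05, 0.10, 0.60, 0.10, 0.10, 0.05]

props = land_use_proportions(cells)
susceptibility = patch_value(props, susceptibility_params)
```

The same `patch_value` function would give a patch infectivity if passed land use infectivity parameters instead.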

The latent period of ASF in wild boar is very short relative to the scale of the spreading process, so we have chosen to ignore this aspect of disease dynamics, which is implicit in the SI assumption. In addition, as discussed above the SI assumption represents the patch-scale dynamic which may differ from disease progression in an individual host. A case report from the outbreak is associated with a specific month. So, the time of the case report is considered to be interval-censored, with the time interval equal to the relevant month. The model can now be fitted to the outbreak data as described in Methods. We could have taken into account other covariates but, for simplicity, we chose to use land use data only. This case study is intended as a proof of principle, rather than a definitive analysis of ASF transmission in wild boar in Estonia.

Parameter estimates

The constant-in-time and varying-in-time heterogeneous models were fitted to the full set of Estonian outbreak data, with the estimated parameters as in Table 4. The model selection metric was computed for both model fits, and was found to strongly favour the varying-in-time model (details in S4 Appendix). This is not surprising given the strong decreasing trend in the temporal trend function that was identified (Table 4). Generally, parameter estimates were consistent between the two model variants. There is much uncertainty, but broadleaf woodland seems to have the highest susceptibility, while agricultural land has the highest infectivity. External transmission occurs at very low levels, and the estimated kernel is relatively narrow. The estimates of the overall transmission rate for the two models are not easily comparable because, in the varying-in-time model, this rate is multiplied by the temporal trend function, which is estimated as less than one for most of the observation time interval. So it is not surprising that the estimated value of the overall transmission rate is higher for the varying-in-time model.

Table 4. Parameter estimates obtained from fitting the iPAR model to data from the entire Estonian outbreak.

https://doi.org/10.1371/journal.pcbi.1012622.t004

Fig 7. Model prediction and validation. Left column: Predicted numbers of infected 10km patches from models fitted to the early stages of the ASF outbreak in Estonian wild boar. Two constant-in-time models (heterogeneous, homogeneous) were fitted to outbreak data up to September 2015. The black lines show the true cumulative incidence curve at 10km patch resolution. A: the blue line and region show the posterior predictive median and 95% credible interval for the total number of infected patches under the homogeneous model. The red line and region show the corresponding quantities under the heterogeneous model. B: the red line and region show the posterior predictive median and 95% credible interval for the total number of infected patches under the heterogeneous model. The grey line and region show the corresponding quantities under a varying-in-time heterogeneous model fitted to the full set of outbreak data. Right column: External validation via hunting bag data from [40]. C: estimated wild boar density in the east of Estonia. D: estimated wild boar density in the west of Estonia. E: estimated overall spread rate from the varying-in-time model fitted to the entire ASF outbreak in Estonia. The black line represents the posterior median, the blue area shows the 95% credible interval.

https://doi.org/10.1371/journal.pcbi.1012622.g007

Model predictions

Finally, we assess the predictive performance of the model fitted to the Estonian data. We fit the constant-in-time model to the early stages of the outbreak in Estonia, holding out data from later stages of the outbreak in order to test the model’s predictions. Both the heterogeneous and homogeneous variants of the model were fitted.

The predicted number of infected patches under the heterogeneous and homogeneous models tends to be overestimated, though the predictions of the heterogeneous model lie a little closer to the actual data (Fig 7A). The predicted number of infected patches from simulation of a heterogeneous varying-in-time model fitted to the entire outbreak is much closer to the actual data (Fig 7B). This indicates that the varying-in-time model is sufficiently flexible to capture the key trends in the observed cumulative incidence curve in Estonia.

External validation

Many aspects of the fitted model are difficult to validate externally. However, one aspect of interest is the time varying function, which was estimated to initially increase from the baseline value of 1 in 2015 to a maximum of 1.71 in 2016, before dropping to 0.06 by the end of the outbreak (Table 4). The function is intended to model the effects of disease control measures, and in Estonia a key control measure was reduction of the wild boar population. The trend seen in the estimated time varying function is broadly consistent with reported trends in wild boar population density across Estonia based on information from hunters (see Fig 5 in [40]). This comparison provides some evidence for the validity of the fitted varying-in-time model (see Fig 7C–E).

Quantitative risk assessment for disease control

In addition to the retrospective analysis described above which shows the impact of the control measures that were adopted, the iPAR framework also provides information relevant to the within-outbreak targeting of disease control activities. For example, Fig 8A illustrates the estimated susceptibilities of patches in Estonia for the constant-in-time model, identifying which locations are more likely to be susceptible to disease incursion. Although there are some areas with greater susceptibilities there does not seem to be a strong trend in susceptibility across the country. However, Table 4 also provides potentially useful information to inform disease control efforts by identifying the land use classes (broadleaved and agricultural) that contribute most strongly to susceptibility and infectivity. Such information may be useful in guiding development of control policies and programmes.

Fig 8. Quantifying disease risk across Estonia.

(A) Susceptibility map for Estonia obtained from fitting the constant-in-time model to the Estonian ASF wild boar outbreak data. (B) Risk map illustrating predictions from the constant-in-time heterogeneous model fitted to the early stages of the ASF outbreak in wild boar in Estonia. Grey patches are recorded as infected at the end of the early stage of the outbreak. Red colours indicate probability of infection four months beyond this end point. The black dots show patches that did in fact become infected within this time horizon. Alternative summaries of the same predictions may be found in Table 5 and Fig 7A. Map base layer taken from https://gadm.org/maps.html.

https://doi.org/10.1371/journal.pcbi.1012622.g008

Table 5. Predictive performance of the fitted constant-in-time model.

https://doi.org/10.1371/journal.pcbi.1012622.t005

In addition, risk maps can be helpful for visualising spatial predictions of which areas will become infected over defined scenarios and specified time horizons. Based on observation of the early-stage epidemic, up to September 2015 (grey patches), Fig 8B shows model predictions of the probability of future infection. Such information can aid decisions about where to focus limited resources for disease control as the outbreak unfolds in real time. This prediction is over a specified time horizon under the assumption that rates of spread remain fixed. However, we also note that such use of this information means that actual spread will ideally be different from that predicted. This is due to the impact of disease controls as illustrated in Fig 7C–E, which suggests a substantial benefit of disease control from 2016 onwards.
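A risk map of this kind can be assembled from forward simulations by recording, for each patch not yet infected, the fraction of simulations in which it becomes infected within the time horizon. The sketch below shows only this summarisation step; the function name, the patch identifiers and the simulated outcomes are hypothetical.

```python
def infection_probabilities(simulated_infected_sets, patch_ids, already_infected):
    """Estimate per-patch infection probability as a fraction of simulations.

    simulated_infected_sets: one set per simulation, holding the patches newly
    infected within the time horizon. Patches already infected at the start of
    the prediction period are excluded from the map.
    """
    n_sims = len(simulated_infected_sets)
    return {
        pid: sum(pid in infected for infected in simulated_infected_sets) / n_sims
        for pid in patch_ids
        if pid not in already_infected
    }

# Hypothetical outcomes of four forward simulations over a four-month horizon.
simulations = [{1, 2}, {2}, {2, 3}, {1, 2}]
risk = infection_probabilities(simulations, patch_ids=[1, 2, 3, 4, 5], already_infected={4})
```

Each simulation would be run with a fresh draw from the posterior distribution, so that the mapped probabilities account for parameter uncertainty.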

By comparing spatial risk prediction to actual case reports over a limited time horizon we see the heterogeneous model identifies future infected patches slightly better than the homogeneous model (Table 5). However, the differences between the two models’ predictions are not substantial. One reason for this observation is likely to be that the landscape of Estonia is relatively homogeneous in the sense that there is little spatial variation in susceptibility at large scales (see Benefits of modelling spatial variation in susceptibility and infectivity in Results). Another explanation may be the narrowness of the estimated kernel, which is much narrower than the kernels used in Benefits of modelling spatial variation in susceptibility and infectivity in Results. Further simulations (S10 Appendix) similar to those in Benefits of modelling spatial variation in susceptibility and infectivity in Results but with a narrow kernel give predictive performance similar to that in Table 5.

Discussion

This paper develops the novel iPAR (inference for populations at risk) framework for inferring and modelling spread of disease in an unknown population at risk, given outbreak data in the form of only case reports with possibly less than precise spatial and temporal coordinates. The methodology introduced thus also contributes to tools suitable for scenarios where anonymisation of data is required. The method exploits the intuitive idea that case reports simultaneously provide information on both disease spread and the population at risk, enabling modelling in relatively data-poor scenarios not previously amenable to inference based on mechanistic transmission models. Where such data are available, the iPAR framework also allows for use of spatial covariate data that may be informative about the population at risk. iPAR uses a patch-based model with patch susceptibility and infectivity varying across the region over which disease is spreading. By focussing on estimation of patch susceptibilities and infectivities we circumvent any requirement for detailed knowledge of the spatial distribution of the population at risk. In particular, we emphasise that the patch susceptibilities/infectivities are not population density estimates, but they should account for at least some of the variation in population density. If spatial population density estimates are available then they could be used as covariates in the model, possibly with a strong prior attached to the associated effect if we believe that the estimates are particularly informative.

As discussed in the Introduction, other methods have been developed to handle situations where there is uncertainty regarding the population at risk. Transmission tree models may be better suited to questions of “who infected who?” but are not suited to making predictions of future spread. Contact distribution models are better suited to making predictions, and it would be interesting to compare these models to the iPAR approach, but note that only the latter accounts for depletion of susceptibles. Models based on prediction of host (e.g., farm) locations can also be used to predict future spread, but in such cases modelled host locations are based solely on prior knowledge unlike in the iPAR approach where disease cases inform the inferred susceptibility and infectivity surfaces, which may correlate with host density but are directly informative of transmission processes.

We demonstrate, via a range of simulation studies, that the inference techniques employed here reliably estimate model parameters, with uncertainty reflecting available case data. Typically, parameters in the transmission kernel, as well as external transmission rates, are well estimated, susceptibility parameters are moderately well estimated, while, as is to be expected, infectivity parameters are the least well estimated, e.g., see [36]. It is challenging to precisely define the level of outbreak data availability needed to robustly estimate model parameters. Simple spatial spread models with no variation between infectious units can be fitted even when there are fewer than 10 reported disease cases [14]. Larger numbers of cases may be required for precise estimates when fitting more complex models. Analytic expressions for the precision of parameter estimates have been derived for non-spatial models that allow for variation in susceptibility and infectivity [43]. To successfully fit the iPAR model, the required number of cases will depend on the model structure and choice of spatial covariates. For the scenarios considered in this paper, we found that, as a very rough rule of thumb, between 10 and 100 cases are required to obtain robust parameter estimates. The duration of time over which the cases used to fit the model are collected is of lesser importance than the number of cases.

The iPAR approach is also able to estimate the ‘overall’ rate of infection, and the time variation in this quantity. In the case study this estimated time variation was shown to correlate with impacts of disease control efforts to reduce wild boar density in Estonia. Application of the iPAR approach is thus able to retrospectively characterise an outbreak and assess likely benefits of control actions taken. Perhaps more critically, iPAR can also be used to provide information useful for risk assessment during an outbreak. In the ASF case study presented here this includes: (i) identifying which land use classes, and thus areas, pose the highest intrinsic disease risks, potentially enabling targeted controls; (ii) quantifying the scale of spatial spread, information that could be used to design protocols for exclusion zones, ring vaccination or culling; and (iii) spatio-temporal quantification of future risk, identifying when and where new cases are most likely to arise and thereby directing effort to where it is most needed in real time.

A key innovation of the iPAR framework is the ability to account for variation in susceptibility and infectivity across the landscape. Simulations highlighted scenarios where the heterogeneous model, with varying susceptibility/infectivity, significantly outperforms the corresponding homogeneous model. In our case study, however, the heterogeneous model offered only a small improvement over the homogeneous model, although it was nonetheless favoured by model selection. It is possible that the use of other covariates not available to our analysis might yield a heterogeneous model that offers a greater improvement, but the narrowness of the estimated kernel (short range transmission) makes this less likely. Intuitively, with a narrow kernel, there are fewer possible destinations for onward transmission of disease, so a heterogeneous model that attempts to distinguish between these possible destinations becomes less useful.

The most appropriate spatial resolution at which to model the spread of disease is dependent on multiple application-specific factors. If the spatial resolution is too low then the model may not be able to produce accurate estimates of quantities of interest [44], but if the resolution is too high then there may be computational issues and a lack of suitable outbreak/covariate data. Other factors to consider are the intended application, model assumptions and the key quantities of interest. In the case study presented here, the model assumes persistence of infection, which is less likely to be reasonable at very high resolutions. We chose 10 km resolution for the case study as analysis demonstrated good persistence at this scale. Furthermore, preliminary analysis of the case study data at 5 km resolution gave similar results (S11 Appendix), suggesting considerable robustness of results to patch scale.

To handle model parameters constrained to the simplex, e.g., the susceptibility parameter, this work made use of methods from compositional data analysis, such as the isometric logratio transform; this use is novel within the context of transmission models. As noted previously, there is typically slight overestimation of the ‘overall’ rate of infection, which is proportional to the rate of transmission between two uniform patches (patches containing equal proportions of all land uses). Investigation of this phenomenon suggests that it arises from the influence of the prior distribution on parameters for which the available data are not strongly informative (S7 Appendix). However, we note here that the properties of transmission models containing such constrained parameters do not seem to have been well explored in the epidemic modelling literature. Issues for further research include specification of prior distributions for parameters on the simplex, and interpretation of credible intervals for parameter components that are highly constrained and interdependent. In scenarios where knowledge of the population at risk is limited, compositional covariates such as land use proportions may be the only data that we have available, so methods that make best use of such data could have significant practical impact on disease control in data-poor scenarios such as those addressed here.
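To illustrate the kind of transformation referred to above, the following is a minimal sketch of the isometric logratio (ilr) transform and its inverse, written with NumPy. It is not the implementation used in this work: the function names (`ilr_basis`, `ilr`, `ilr_inverse`) and the particular choice of orthonormal (pivot, or Helmert-type) basis are assumptions made for illustration, and other bases are equally valid. The transform maps a D-part composition with strictly positive components to D-1 unconstrained real coordinates.

```python
import numpy as np


def ilr_basis(d):
    """Return a (d, d-1) matrix whose columns form an orthonormal basis
    of the hyperplane {v : sum(v) = 0}. This pivot (Helmert-type) basis
    is one illustrative choice among many."""
    v = np.zeros((d, d - 1))
    for i in range(1, d):
        v[:i, i - 1] = 1.0 / i      # equal weight on the first i parts
        v[i, i - 1] = -1.0          # contrasted against part i+1
        v[:, i - 1] *= np.sqrt(i / (i + 1.0))  # scale column to unit length
    return v


def ilr(x):
    """Map a composition x (strictly positive parts) to d-1 real coordinates."""
    x = np.asarray(x, dtype=float)
    log_x = np.log(x)
    clr = log_x - log_x.mean()      # centred logratio; sums to zero
    return clr @ ilr_basis(x.size)


def ilr_inverse(z):
    """Map d-1 real coordinates back to a composition that sums to one."""
    z = np.asarray(z, dtype=float)
    clr = ilr_basis(z.size + 1) @ z
    w = np.exp(clr - clr.max())     # subtract the max for numerical stability
    return w / w.sum()


if __name__ == "__main__":
    # Hypothetical three-part composition, e.g. proportions over three classes.
    composition = np.array([0.5, 0.3, 0.2])
    coords = ilr(composition)
    print(coords)
    print(ilr_inverse(coords))
```

Because the coordinates returned by `ilr` are unconstrained, one possible design is to let a sampler propose moves in that space and map them back to the simplex with `ilr_inverse`. Whether and how this interacts with the prior specification is exactly the kind of issue raised above, and the sketch does not address it.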

Future methodological developments could extend the implementation of the iPAR framework to include more general patch-level disease dynamics beyond the susceptible-infected dynamics employed here. For example, an SEIR model could be implemented, but this would depend on either sufficiently informative data or strong prior assumptions. It would also be possible to allow for uncertainty in observations by treating them as imperfect diagnostic tests, but note that the impact of false negative observations is somewhat mitigated in the iPAR framework, with both patch-scale and interval censoring reducing the probability that patches with significant infection remain unobserved. The Bayesian inference is currently based on a likelihood approach. The use of a more complex transmission model might increase the computational requirements, possibly necessitating a change to an approximate inferential approach, though this would depend on the specifics of the model and data.

It would also be desirable to implement iPAR for multiple host species, e.g., transmission of ASF in wild boar and domestic pigs, or transmission of Xylella fastidiosa in multiple susceptible plant species. Nonetheless, there are numerous potential applications of the implementation of the iPAR framework introduced here. These include not only the study of other outbreaks of ASF in wild boar and/or domesticated pigs, but also other transmissible disease outbreaks where case-only data are available. These might include Newcastle disease in poultry or highly pathogenic avian influenza in wild birds and/or backyard poultry.

In conclusion, we have developed a modelling framework that fills an important gap in epidemiological modelling. The iPAR approach allows modelling of spatial disease spread when the distribution of the disease host is uncertain, and the model can be parameterised using only disease case report data. In recent years, the amount of data available to inform models for spatial disease spread has rapidly increased. At the same time, data availability and data privacy continue to be an issue in many situations. Our approach can circumvent the requirement for host distribution data and so can open up many disease outbreak datasets for analysis.

Supporting information

S5 Appendix. The simplex and the isometric logratio transformation.

https://doi.org/10.1371/journal.pcbi.1012622.s005

(DOCX)

S6 Appendix. Additional figures for Estimation of key epidemiological parameters in Results.

https://doi.org/10.1371/journal.pcbi.1012622.s006

(DOCX)

S7 Appendix. Overestimation of the transmission rate.

https://doi.org/10.1371/journal.pcbi.1012622.s007

(DOCX)

S8 Appendix. Details of the setup for simulations in Benefits of modelling spatial variation in susceptibility and infectivity in Results.

https://doi.org/10.1371/journal.pcbi.1012622.s008

(DOCX)

S10 Appendix. Narrow transmission kernel simulations.

https://doi.org/10.1371/journal.pcbi.1012622.s010

(DOCX)

S11 Appendix. Spread of ASF in Estonia at different spatial scales.

https://doi.org/10.1371/journal.pcbi.1012622.s011

(DOCX)

References

  1. Knight-Jones TJD, Rushton J. The economic impacts of foot and mouth disease - what are they, how big are they and where do they occur? Prev Vet Med. 2013;112(3–4):161–73. pmid:23958457
  2. Kaye AD, Okeagu CN, Pham AD, Silva RA, Hurley JJ, Arron BL, et al. Economic impact of COVID-19 pandemic on healthcare facilities and systems: International perspectives. Best Pract Res Clin Anaesthesiol. 2021;35(3):293–306. pmid:34511220
  3. Brown VR, Miller RS, McKee SC, Ernst KH, Didero NM, Maison RM, et al. Risks of introduction and economic consequences associated with African swine fever, classical swine fever and foot-and-mouth disease: A review of the literature. Transbound Emerg Dis. 2021;68(4):1910–65. pmid:33176063
  4. McBryde ES, Meehan MT, Adegboye OA, Adekunle AI, Caldwell JM, Pak A, et al. Role of modelling in COVID-19 policy development. Paediatr Respir Rev. 2020;35:57–60. pmid:32690354
  5. Brooks-Pollock E, de Jong MCM, Keeling MJ, Klinkenberg D, Wood JLN. Eight challenges in modelling infectious livestock diseases. Epidemics. 2015;10:1–5. pmid:25843373
  6. Gilligan CA, van den Bosch F. Epidemiological models for invasion and persistence of pathogens. Annu Rev Phytopathol. 2008;46:385–418. pmid:18680429
  7. Swallow B, Birrell P, Blake J, Burgman M, Challenor P, Coffeng LE, et al. Challenges in estimation, uncertainty quantification and elicitation for pandemic modelling. Epidemics. 2022;38:100547. pmid:35180542
  8. Kretzschmar ME, Ashby B, Fearon E, Overton CE, Panovska-Griffiths J, Pellis L, et al. Challenges for modelling interventions for future pandemics. Epidemics. 2022;38:100546. pmid:35183834
  9. Keeling MJ, Hill EM, Gorsich EE, Penman B, Guyver-Fletcher G, Holmes A, et al. Predictions of COVID-19 dynamics in the UK: Short-term forecasting and analysis of potential exit strategies. PLoS Comput Biol. 2021;17(1):e1008619. pmid:33481773
  10. Gudelj I, White KAJ. Spatial heterogeneity, social structure and disease dynamics of animal populations. Theor Popul Biol. 2004;66(2):139–49. pmid:15302223
  11. Boender GJ, Hagenaars TJ, Bouma A, Nodelijk G, Elbers ARW, de Jong MCM, et al. Risk maps for the spread of highly pathogenic avian influenza in poultry. PLoS Comput Biol. 2007;3(4):e71. pmid:17447838
  12. Porphyre T, Correia-Gomes C, Chase-Topping ME, Gamado K, Auty HK, Hutchinson I, et al. Vulnerability of the British swine industry to classical swine fever. Sci Rep. 2017;7:42992. pmid:28225040
  13. Keeling MJ, Woolhouse ME, Shaw DJ, Matthews L, Chase-Topping M, Haydon DT, et al. Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape. Science. 2001;294(5543):813–7. pmid:11679661
  14. Gamado K, Marion G, Porphyre T. Data-Driven Risk Assessment from Small Scale Epidemics: Estimation and Model Choice for Spatio-Temporal Data with Application to a Classical Swine Fever Outbreak. Front Vet Sci. 2017;4:16. pmid:28293559
  15. Tildesley MJ, Ryan SJ. Disease prevention versus data privacy: using landcover maps to inform spatial epidemic models. PLoS Comput Biol. 2012;8(11):e1002723. pmid:23133352
  16. Haydon DT, Chase-Topping M, Shaw DJ, Matthews L, Friar JK, Wilesmith J, et al. The construction and analysis of epidemic trees with reference to the 2001 UK foot-and-mouth outbreak. Proc Biol Sci. 2003;270(1511):121–7. pmid:12590749
  17. Mollison D. Spatial Contact Models for Ecological and Epidemic Spread. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1977;39(3):283–313.
  18. MacCalman L, McKendrick IJ, Denwood M, Gibson G, Catterall S, Innocent G, et al. MAPRA: Modelling Animal Pathogens: Review and Adaptation. EFSA Supporting Publications; 2016;13(12). https://doi.org/10.2903/sp.efsa.2016.EN-1112
  19. Lau MSY, Gibson GJ, Adrakey H, McClelland A, Riley S, Zelner J, et al. A mechanistic spatio-temporal framework for modelling individual-to-individual transmission-With an application to the 2014-2015 West Africa Ebola outbreak. PLoS Comput Biol. 2017;13(10):e1005798. pmid:29084216
  20. Cross PC, Caillaud D, Heisey DM. Underestimating the effects of spatial heterogeneity due to individual movement and spatial scale: infectious disease as an example. Landscape Ecol. 2012;28(2):247–57.
  21. Tildesley MJ, House TA, Bruhn MC, Curry RJ, O’Neil M, Allpress JLE, et al. Impact of spatial clustering on disease transmission and optimal control. Proc Natl Acad Sci U S A. 2010;107(3):1041–6. pmid:19955428
  22. Sellman S, Tildesley MJ, Burdett CL, Miller RS, Hallman C, Webb CT, et al. Realistic assumptions about spatial locations and clustering of premises matter for models of foot-and-mouth disease spread in the United States. PLoS Comput Biol. 2020;16(2):e1007641. pmid:32078622
  23. Chis Ster I, Singh BK, Ferguson NM. Epidemiological inference for partially observed epidemics: the example of the 2001 foot and mouth epidemic in Great Britain. Epidemics. 2009;1(1):21–34. pmid:21352749
  24. Lau MSY, Marion G, Streftaris G, Gibson GJ. New model diagnostics for spatio-temporal systems in epidemiology and ecology. J R Soc Interface. 2014;11(93):20131093. pmid:24522782
  25. van den Driessche P. Spatial Structure: Patch Models. 2008:179–89.
  26. Catterall S, Cook AR, Marion G, Butler A, Hulme PE. Accounting for uncertainty in colonisation times: a novel approach to modelling the spatio‐temporal dynamics of alien invasions using distribution data. Ecography. 2012;35(10):901–11.
  27. Cook A, Marion G, Butler A, Gibson G. Bayesian inference for the spatio-temporal invasion of alien species. Bull Math Biol. 2007;69(6):2005–25. pmid:17457652
  28. Pawlowsky‐Glahn V, Egozcue JJ, Tolosana‐Delgado R. Modelling and Analysis of Compositional Data. Wiley. 2015. https://doi.org/10.1002/9781119003144
  29. Jewell CP, Kypraios T, Neal P, Roberts GO. Bayesian analysis for emerging infectious diseases. Bayesian Anal. 2009;4(3).
  30. O’Neill PD, Roberts GO. Bayesian Inference for Partially Observed Stochastic Epidemics. Journal of the Royal Statistical Society Series A: Statistics in Society. 1999;162(1):121–9.
  31. Stockdale JE, Kypraios T, O’Neill PD. Modelling and Bayesian analysis of the Abakaliki smallpox data. Epidemics. 2017;19:13–23. pmid:28038869
  32. Gibson G. Estimating parameters in stochastic compartmental models using Markov chain methods. Mathematical Medicine and Biology. 1998;15(1):19–40.
  33. Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. Bayesian Measures of Model Complexity and Fit. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2002;64(4):583–639.
  34. Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models. Bayesian Anal. 2006;1(4).
  35. Fielding AH, Bell JF. A review of methods for the assessment of prediction errors in conservation presence/absence models. Envir Conserv. 1997;24(1):38–49.
  36. Pooley CM, Marion G, Bishop SC, Bailey RI, Doeschl-Wilson AB. Estimating individuals’ genetic and non-genetic effects underlying infectious disease transmission from temporal epidemic data. PLoS Comput Biol. 2020;16(12):e1008447. pmid:33347459
  37. Cook SR, Gelman A, Rubin DB. Validation of Software for Bayesian Models Using Posterior Quantiles. Journal of Computational and Graphical Statistics. 2006;15(3):675–92.
  38. Cwynar P, Stojkov J, Wlazlak K. African Swine Fever Status in Europe. Viruses. 2019;11(4):310. pmid:30935026
  39. European Food Safety Authority, Depner K, Gortazar C, Guberti V, Masiulis M, More S, et al. Epidemiological analyses of African swine fever in the Baltic States and Poland: (Update September 2016-September 2017). EFSA J. 2017;15(11):e05068. pmid:32625356
  40. Schulz K, Staubach C, Blome S, Viltrop A, Nurmoja I, Conraths FJ, et al. Analysis of Estonian surveillance in wild boar suggests a decline in the incidence of African swine fever. Sci Rep. 2019;9(1):8490. pmid:31186505
  41. Tarvydas A, Belova O. Effect of Wild Boar (Sus scrofa L.) on Forests, Agricultural Lands and Population Management in Lithuania. Diversity. 2022;14(10):801.
  42. CLMS 4 2. Corine land cover 2018 (vector/raster 100 m), Europe, 6-yearly. 2018.
  43. Pooley C, Marion G, Bishop S, Doeschl-Wilson A. Optimal experimental designs for estimating genetic and non-genetic effects underlying infectious disease transmission. Genet Sel Evol. 2022;54(1):59. pmid:36064318
  44. Mills HL, Riley S. The spatial resolution of epidemic peaks. PLoS Comput Biol. 2014;10(4):e1003561. pmid:24722420