Bayesian Analysis for Inference of an Emerging Epidemic: Citrus Canker in Urban Landscapes

Outbreaks of infectious diseases require a rapid response from policy makers. The choice of an adequate level of response relies upon available knowledge of the spatial and temporal parameters governing pathogen spread, affecting, amongst others, the predicted severity of the epidemic. Yet, when a new pathogen is introduced into an alien environment, such information is often lacking or of no use, and epidemiological parameters must be estimated from the first observations of the epidemic. This poses a challenge to epidemiologists: how quickly can the parameters of an emerging disease be estimated? How soon can the future progress of the epidemic be reliably predicted? We investigate these issues using a unique, spatially and temporally resolved dataset for the invasion of a plant disease, Asiatic citrus canker in urban Miami. We use epidemiological models, Bayesian Markov-chain Monte Carlo, and advanced spatial statistical methods to analyse rates and extent of spread of the disease. A rich and complex epidemic behaviour is revealed. The spatial scale of spread is approximately constant over time and can be estimated rapidly with great precision (although the evidence for long-range transmission is inconclusive). In contrast, the rate of infection is characterised by strong monthly fluctuations that we associate with extreme weather events. Uninformed predictions from the early stages of the epidemic, assuming complete ignorance of the future environmental drivers, fail because of the unpredictable variability of the infection rate. Conversely, predictions improve dramatically if we assume prior knowledge of either the main environmental trend, or the main environmental events. A contrast emerges between the high detail attained by modelling in the spatiotemporal description of the epidemic and the bottleneck imposed on epidemic prediction by the limits of meteorological predictability. We argue that identifying such bottlenecks will be a fundamental step in future modelling of weather-driven epidemics.


Dispersal in two dimensions: definition
We assume that inoculum is emitted from a given "source" tree, with emission rateβ, along a random direction given by an angle θ ∈ (0, 2π] drawn from the probability distribution g(θ). Inoculum travels along the bearing direction θ and is deposited at a distance r given by the conditional probability distribution, or one-dimensional dispersal kernel with parameter α, f (r|θ; α) (see e.g. [1]). If inoculum dispersal is isotropic, g(θ) is a uniform distribution, g(θ) = 1/(2π), and f (r|θ; α) ≡ f (r; α). The probability that inoculum falls in a small area at distance r and angle θ, delimited in polar coordinates by the diameter dr and the subtending arc dθ (dA = r dr dθ), is given by P (dA) ≃ f (r|θ; α)g(θ)dr dθ = 1 2π f (r; α) dr dθ = 1 2πr f (r; α) dA. (S1) The definition of two-dimensional dispersal kernel, K(r; α), is P (dA) = K(r; α) dA. Hence, the relationship between K(r; α) and the one-dimensional kernel f (r; α) is: We assume that the probability that inoculum lands on a "target" tree depends on an effective area A T , unknown but identical for all trees (it could represent, e.g., the projection of the tree crown on the ground, but the specific definition of A T is not relevant here, as we are not trying to estimate its value). If we assume that A T is small with respect to any source-target distance r, then the probability that inoculum lands on the target can be approximated by K(r; α) A T . The rate of infection of the target tree is then given by: whereβ is the rate of emission of inoculum from the source, as above, and P INF is the conditional probability that inoculum infects the target once contact is made. Hence, the infection rate used in the Manuscript is given by β =β P INF A T , with dimensions L −2 T −1 .

Short-distance behaviour
We re-write below the dispersal kernels from Equations (3) in the Manuscript, adding the indices E for "exponential", and C for "Cauchy": The information about the spatial location of trees in census sites was obtained using differential GPS (DGPS) modules [2]. The accuracy of the DGPS readings varied between ±7.5 m and ±2.0 m, with an accuracy for measured inter-tree distances between ±15 m and ±4 m. Kernels (S4a) and (S4b) both have a singularity at d = 0, hence they are sensitive to distances below the DGPS accuracy. In order to make our results robust to short-distance observations, we used modified functional forms for the kernels, whereby dispersal coincides with Equation (S4) for distances d beyond a "hard-core" radius d 0 , but is distance-independent for d ≤ d 0 : where N E , N C are normalising constants that can be calculated analytically.
For the sake of completeness, we give below the functional forms of the other two dispersal kernels mentioned in the Manuscript: the Gaussian kernel (with index G) and the power-law kernel (with index P): where again N G , N P are normalising constants. The parameter α has the usual meaning of a dispersal length for the Gaussian kernel (Equation S6a), while for the power-law kernel (Equation S6b) it is a dimensionless parameter with α ∈ [0, 1). Parameter estimates found using different "hard-core" radii d 0 were compared (d 0 ranging from 2 m to 25 m). The results were essentially equivalent in all cases. All the results presented in the Manuscript and in this text were found using the kernels in (S5), with d 0 = 5 m.
1.3 Comparison of exponential and Cauchy kernels (Table S1, Figure   S1) In  [3], models with DIC within 2 of the "best" (i.e., the lowest score) deserve consideration, and models with a difference greater than 10 are given substantially no support. Table S1 shows a very strong preference for models with monthly-varying rates, which reproduce better the environmental drivers (as expected by comparing the middle and bottom rows in Manuscript Figure 3). It also shows a trend for the performance of model E over model C to improve going from ∆T = 6 months to ∆T = 1 month intervals. The latter effect is most likely due to the time-dependent external rate, ǫ t , becoming free to vary more frequently and "explain" some infection events that would otherwise require a long-tail kernel. For convenience, the following discussion focuses on the results for ∆T = 6 months (conclusions drawn from ∆T = 1 month estimates being very similar). Table S1 shows no significant differences between the E and C models, except for site D1, for which model E is favoured. For the other census sites, the two models are essentially equivalent. In Manuscript Figure 7, we gave a partial explanation of this result by showing that the dispersal kernels for model E, ) are almost identical up to distances of ∼ 300m (∼ 150m for site D1). Beyond those distances, which are still only a fraction of the size of the census sites (1km-4km), the relative differences between E and C kernels increase rapidly (Manuscript Figure 7). Hence, in principle, it should still be possible to detect the effect of such differences from spatio-temporal maps of disease.
However, the differences between the two kernels are balanced by the primary infection rate, as an illustrative example shows in Figure S1. We consider a disease snapshot, at 150 days, for census site D2 (Figure S1A), and use the observed positions of infected hosts (red circles in Figure S1A), together with the posterior distributions for the model parameters calculated with MCMC, to estimate the spatial distribution of the probability of new infections at the successive snapshot (180 days).
The infectious pressure φ s on a susceptible host s (cf. Equation (2b) in the Manuscript) is given by the sum of two terms: = ǫ is the contribution coming from primary (external) sources, and φ (2) s = β i∈I(150d) K( r s − r i ; α) is the contribution from secondary (internal) sources i already infected at 150d. Here, r s ≡ (x s , y s ) is the two-dimensional vector of the coordinates of s (the same for r i ), and r s − r i = d si . The probability that s is infected between 150d and 180d can be estimated as where we assumed that the integral in the exponent is small, and took into account that the infectious pressure φ s (t) ≡ φ s is constant between 150d and 180d. The reader is warned that the latter assumption was made for the sake of simplicity, and that it neglects the effect on φ s (t) of any other host infected between 150d and 180d. For ease of presentation, we replace the spatially discrete distribution of susceptible hosts with a spatially continuous (smoothed) host density, ρ(r) (r ≡ (x, y)), shown in gray scale in Figure S1A (the density units being km −2 ). Likewise, the infectious pressure on individual hosts is replaced with a spatially continuous density: ρ(r)φ(r) = ρ(r)φ (1) (r) + ρ(r)φ (2) (r), where φ (1) (r) = ǫ as above, while φ (2) (r) = β i∈I(150d) K( r−r i ; α). Finally, the probability of infection for individual host s is replaced with the expected number of new infections per unit area, 30d · ρ(r)φ(r). The maps are built using the posterior mean values of the parameters estimated for model E ( Figure S1(B,D)) and model C ( Figure S1(C,E)).
In Figure S1(B,C), we show the estimated density of new infections generated by secondary sources, ρ(r)φ (2) (r). The iso-density curves for the E model ( Figure S1B) and for the C model ( Figure S1C) are very similar down to densities of about 1km −2 (several hundred times less than the maximum density value). For lower values of the density, given that the internal sources of infection lie in the bottom half of the figure (geographical South), differences between E and C become visible in the top half (geographical North). In the northernmost region of the census site, the infection density for C is still non-negligible, although small, whereas the infection density for the exponential kernel is several orders of magnitude smaller. In the absence of external infection rate, such differences would clearly emerge from parameter estimation (e.g., giving DIC results strongly favouring the Cauchy kernel). However, when the infectious pressure generated by primary sources are added ( Figure S1(D,E)), the differences disappear, and maps of potential infections generated by the two models are essentially identical. Hence, exponential and Cauchy models tend to converge in the presence of external infection. The scale of our observations (i.e., the size of the census sites), however, may play a role in this result. It is possible that, with data available at a larger scale, differences between the effect of long-range transmission (both density-and distance-dependent), modelled with a Cauchy kernel, and the effect of primary infection (only density-dependent) would be detectable. The estimator of spatial autocorrelation used in the Manuscript is the spline correlogram introduced by Bjørnstad and Falck [4]. The main underlying ideas and methods are briefly explained below, following closely reference [4].
We define z j (t) as the infection indicator for host j at a given time t (z j (t) = 0 if j is susceptible, z j (t) = 1 if j is infected), so thatz(t) = j z j (t)/N = I(t)/N ≡ ρ I (t) is the average amount of infection in the system at any given time. The sample local autocorrelation between hosts j and k can be estimated as [4,5]: A widely used, spatially discrete estimator of the autocorrelation function is the spatial correlogram [5], which approximates the underlying continuous correlation function with a spatially discrete function. The spatial correlogram is built by specifying a set of distances {d q }, q = 1, . . . , q max , and assigning each inter-point distance d jk to distance class q if it falls within a given "tolerance distance" from d q : for example, δ/2 being the tolerance distance, a possible choice is d q = δ(q − 1/2), with distance class q given by the interval [d q − δ/2, d q + δ/2). The spatial correlogram is defined for d q as the local average ofĉ jk over the associated distance class: where the indicator function 1 q (d jk ) is equal to 1 if d jk belongs to class q, and equal to 0 otherwise. The spatial correlogram is the most commonly used estimator of spatial covariance. Some limitations, however, have been remarked [4]. For example, the spatial correlogram is not a valid covariance function itself, as it is not positive definite [6]. Moreover, it is not easy to obtain confidence intervals for the estimator (see the discussion in [4]) Hall and Patil [7] introduced a spatially continuous, kernel-smoothed estimator of the autocorrelation function:C where K() is a kernel function [8], and the bandwidth h controls the smoothness of the estimator. Equation( S8) gives a consistent estimator of the covariance function, in the sense that it is possible to find a value of h for whichC t (d) converges to the true covariance function (provided that the latter has continuous first and second derivatives) when N → ∞ [6,7].
In the spline correlogram [4], a cubic B-spline function, B λ (d) [8], fitted to the data, is used as the equivalent kernel smoother (see [9] for seminal work on splines as kernel smoothers, and also [10]). More explicitly, Equation( S8) is replaced with an optimisation problem: finding the function B λ (d) that minimises the penalised residual sum of squares: where B λ (d) is a piecewise cubic polynomial, and λ ≥ 0 is a smoothing parameter that controls the tradeoff between closeness to the data (first term) and roughness of the curve (second term or "penalty", increasing with the curvature of B λ (d)).
The spline correlogram has several desirable properties (see [4] for a discussion): notably, as shown below, it is possible to define a bootstrap method to calculate confidence regions for the estimator. The smoothing parameter λ in Equation( S9) plays a role analogous to the bandwidth h in Equation( S8), i.e., the smoothness of the fit increases with λ. There are several ways to select the value of λ, hence the degree of smoothness of a spline, for a given problem [8]. A commonly used method is to fix the number of effective degrees of freedom df λ , corresponding to the effective number of parameters used in the fit (see [8] for a rigorous definition). Since df λ is monotonically decreasing with λ, by choosing df λ the value of λ is specified automatically.
Spline smoothing is typically carried out with numerical methods [8]. In our case, the spline correlogram routines from a dedicated R package [11], and the related spline fitting routines in the R stats library [12], were adapted and rewritten in C++, in order to be integrated efficiently with our simulation code. In all the results shown in the paper, the degree of smoothing was determined by choosing df λ = 15 for spatial autocorrelation functions estimated in a range of distances from 0 to 1km.
The same smoothing methods were used to estimate the time-lagged statistic, R T t 0 (d). In this case, some of the reasons for choosing the spline correlogram estimator (e.g., positive definiteness) do not apply. The flexibility of splines as kernel smoothers, and the availability of a method to calculate confidence intervals for the estimator, were the main reasons for the choice.

Bootstrapping
We used the dedicated bootstrap algorithm introduced by Bjørnstad and Falck [4] in order to generate confidence intervals for the spatial statistics extracted from experimental data. The status of the system at any given time t is fully described by a set of N triplets (N being the number of hosts in the system): {x j , y j , z j }, j = 1 . . . N , where x j , y j are the coordinates of host j, and z j is the infection indicator defined above. A bootstrap dataset was generated by sampling (with replacement) N observations from the original set of triplets. Spatial summary statistics (the autocorrelation function and the time-lagged spatial statistic) were estimated from the bootstrap dataset in the usual way, with the only difference that pairs of the kind j-j, containing two replicates of the same host, were discarded before the computation (to avoid bias towards very small distances). By repeating the process several times (1000), we generated a bootstrap distribution for the smoothed autocorrelation function and the time-lagged statistic for several pre-chosen values of the distance, and used the 2.5 − 97.5% quantile interval at each of those values to build the 95% confidence envelope shown in the Manuscript.

Random Labelling
The significance of spatial statistics calculated both from experimental and simulated datasets was tested against the random labelling hypothesis (see e.g. [13]). A re-labelled dataset can be created by keeping the host locations fixed but re-allocating their infection status randomly: the new set is given by {x j , y j , z σ(j) }, j = 1 . . . N , where σ(j) is the j-th element of a random permutation of the indices {1, . . . , N }. Smoothed summary statistics can be calculated from the re-labelled dataset in the usual way. A Monte Carlo significance envelope was generated by creating 1000 re-labelled dataset and calculating the 95% confidence envelope from the distri-butions of summary statistics stored at a set of pre-defined distances. When spatial summary statistics were estimated for intervals (t 0 , t 1 ) with t 0 > 0, the significance envelope was found by keeping fixed the infectious status of hosts already infected at t 0 , and re-labelling all the other hosts.
2.2 Further results on spatial goodness-of-fit tests 2.2.1 Time evolution of spatial statistics (Figures S5, S6).
In Manuscript Figure 5, we showed only results for the spatial statistics at times t 1 corresponding to the end of the six-month intervals. Statistics calculated at intermediate times can provide complementary, possibly useful information. In Figure S5 Figure 5). For the 3-9 month period ( Figure S5(A1-A6)), and to a lesser extent for the 6-12 month period ( Figure S5(B1-B6)), the agreement of simulated and experimental summary statistics is good for all the intermediate times. The simulated autocorrelation functions for the 9-15 month period ( Figure S5(C1-C6)) shows good agreement with the experimental autocorrelation functions up to 2 and 4 months from the beginning of the period ( Figure S5(C1,C2)), and the discrepancy between simulated and experimental summary statistics observed in Manuscript Figure 5J emerges in the last two months ( Figure S5(C3)). The discrepancy is most likely caused by a large above-average fluctuation of the primary infection rate ǫ at month 15 (not shown here), or equivalently, by the fact that a higher-than average fraction of the new infections at month 15 was produced by external sources. Our model (with β and ǫ constant over the 6-month observation period) is not able to capture that effect: as a consequence, the model output contains a higher fraction of infections caused by secondary transmission, which in turn generate higher spatial autocorrelation, as found in our simulated data.
As a second example, we consider Figure S6(C-E) for the spatial statistics of site D2 related to the 3-9 months observation period. The simulated autocorrelation function C 9 (d) ( Figure S6D) is significantly overestimating the experimental autocorrelation function up to 300m. A similar effect is observed for the time-lagged statistic R 9 3 (d) ( Figure S6E). Finally, simulated disease progress also consistently overestimates the observed disease progress ( Figure S6C). The origin of the discrepancy in the latter case is clear: for two months (the first three red circles from the left in Figure S6C), the observed epidemic spread very little (the cumulative infection count being: 26, 28, 34 infected trees at months 3, 4, 5, respectively). Monthly estimates of the secondary infection rate β are consistently very small for months 4 and 5 (cf. Manuscript Figure 3I). The offset gained by the simulated disease progress in the first two months is maintained up to the end of the observation period ( Figure S6C); thus, experimental disease progress "lags behind" simulated disease progress. The correlation between the positions of infected trees is generated when infection is transmitted from within the system (i.e., via secondary transmission), and increases with the number of sources of infection present in the system at a given time, I(t). Hence, a time lag between simulated and experimental number of infections, I(t), can generate a corresponding time lag between simulated and experimental autocorrelation. In Figure S6, this hypothesis is tested in two stages: first, simulated and experimental spatial statistics are compared ( Figure S6(A-F)); second, the experimental spatial statistics is artificially shifted forward in time and the comparison repeated ( Figure S6(G-J)). In Figure S6(A-F), we show the statistics at 5 months (A,D), 7 months (B,E), and 9 months (C,F) (respectively 2, 4, and 6 months from the beginning of the period, the latter corresponding to Figure S6(D,E)). At 5 months (Figure S6(A,D)) the simulated statistics (shaded gray) are already significant (as compared with the cyan dash-dotted line), while the experimental statistics (red lines) are not. The discrepancy between experimental and simulated statistics increases at 7 months ( Figure S6(B,E)) and 9 months ( Figure S6(C,F)). In Figure S6(G-J), experimental statistics at 7 months ( Figure S6(G,I)) and 9 months ( Figure S6(H,J)) are compared with the corresponding simulated statistics calculated two months before, at 5 and 7 months, respectively. There is substantial agreement between simulated and experimental statistics with the 2-month lag, which confirms that the discrepancy in spatial statistics is mainly due to a delay of two months in the epidemic progress of the experimental system.
2.2.2 Spatial statistics to discriminate a wrong model ( Figure S7) In the Manuscript and in Section 1.3 of the present text we explained that models with exponential dispersal kernel and Cauchy dispersal kernel give substantially identical goodness-of-fit spatial statistics, because of the confounding effect of primary infection. In Figure S7, we show that the spatial summary statistics used in the paper can discriminate between a "good" model (with exponential kernel and primary infection, cf. Manuscript Figure 5) and a wrong model that assumes negligible import of infection from outside (Cauchy kernel with very small primary infection). Model M ∆T α , ∆T = 6 months, with dispersal given by a Cauchy kernel as in Equation (S5b), was fitted to the data for the Miami Dade site D1. The primary infection rate ǫ was not estimated, but kept constant at a small value, ǫ = 10 −6 (enough to initiate the epidemic, but negligible for subsequent epidemic dynamics). The estimated parameters were then used to generate simulated epidemic datasets and related summary statistics. The results (for the same time periods as in Manuscript Figure 5) are shown in Figure S7. In the 0 − 6 month period ( Figures S7(A,B)), disease progress is inhibited by the small value of ǫ, hence the results are not relevant to the present discussion. For the other periods ( Figures S7(C-K)), although the progress of the epidemic is well reproduced (Figures S7(C,F,I)), simulated spatial statistics (Figure S7(D,G,J) and Figure S7(E,H,K)) clearly and consistently overestimate experimental spatial statistics (compare with Manuscript Figure 5 for the exponential kernel with external infection).