A Bayesian Inference Framework to Reconstruct Transmission Trees Using Epidemiological and Genetic Data

The accurate identification of the route of transmission taken by an infectious agent through a host population is critical to understanding its epidemiology and informing measures for its control. However, reconstruction of transmission routes during an epidemic is often an underdetermined problem: data about the location and timings of infections can be incomplete, inaccurate, and compatible with a large number of different transmission scenarios. For fast-evolving pathogens like RNA viruses, inference can be strengthened by using genetic data, which are nowadays easily and affordably generated. However, significant statistical challenges remain to be overcome in the full integration of these different data types if transmission trees are to be reliably estimated. We present here a framework leading to a Bayesian inference scheme that combines genetic and epidemiological data and is able to reconstruct the most likely transmission patterns and infection dates. After testing our approach with simulated data, we apply the method to two UK epidemics of Foot-and-Mouth Disease Virus (FMDV): the 2007 outbreak, and a subset of the large 2001 epidemic. In the first case, we are able to confirm the role of a specific premise as the link between the two phases of the epidemic, while transmissions more densely clustered in space and time remain harder to resolve. When we consider data collected from the 2001 epidemic during a time of national emergency, our inference scheme robustly infers transmission chains and uncovers the presence of undetected premises, thus providing a useful tool for epidemiological studies in real time. The generation of genetic data is becoming routine in epidemiological investigations, but the development of analytical tools maximizing the value of these data remains a priority. Our method, while applied here in the context of FMDV, is general and with slight modification can be used in any situation where both spatiotemporal and genetic data are available.

Mutations at one position appear randomly and independently in time, so that the number of mutations during a time interval $\Delta$ follows a Poisson distribution with intensity $m\Delta$. Let $P_\Delta$ be the probability that, at position $x$, the value at time $\Delta$ is the same as the value at time 0. If a mutation appears, it is uniform between all possible new values, so the probability of observing a given value different from the value at $\Delta = 0$ is $(1 - P_\Delta)/3$. Therefore, the conditional distribution of the number of differences between two sequences given $\Delta$ is:

For the simple transmission tree drawn in Fig. S1, div(i, j) = i, div(k, j) = k, div(k, i) = i, div(l, j) = i, div(l, i) = l and div(l, k) = i, and the conditional pseudo-distribution of observed genetic sequences is:

Substitutes for the conditional distribution of observed genetic sequences
We tested two expressions of the conditional distribution of observed genetic sequences $p_{m,s}(S^{\mathrm{obs}} \mid T^{\mathrm{obs}}, J, \ldots)$. The expression which led to the best reconstruction of the transmission tree is given in … distribution can be written:

In the following, premise indices are reordered at each MCMC iteration so that they satisfy the following timing constraints:

Proposal distributions
In the following, the star * is used to denote proposed values.
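The mutation model described at the start of this section can be illustrated numerically. The sketch below is not the expression used in the paper: it assumes four possible values per position (consistent with the factor of 3 above), independent positions, and the standard closed form $P_\Delta = 1/4 + (3/4)\,e^{-4m\Delta/3}$ that follows from those assumptions, in which case the number of differing positions is binomial. The function names `prob_same` and `diff_count_pmf` are hypothetical.

```python
import math


def prob_same(m, delta):
    """Probability that one position shows the same value after time delta.

    ASSUMPTION: four possible values, mutations arriving as a Poisson
    process with rate m, each mutation picking one of the three other
    values uniformly (a Jukes-Cantor-type model).
    """
    return 0.25 + 0.75 * math.exp(-4.0 * m * delta / 3.0)


def diff_count_pmf(d, n_sites, m, delta):
    """Probability of observing d differing positions out of n_sites.

    ASSUMPTION: positions evolve independently, so the count of
    differences is Binomial(n_sites, 1 - P_delta).
    """
    p_diff = 1.0 - prob_same(m, delta)
    return math.comb(n_sites, d) * p_diff ** d * (1.0 - p_diff) ** (n_sites - d)


# Example: distribution of differences over 100 positions after 10 time units.
pmf = [diff_count_pmf(d, 100, 1e-3, 10.0) for d in range(101)]
```

With no elapsed time the two sequences are identical, and for very long times the probability of agreement at a position tends to 1/4, as expected for four equally likely values.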

To update $J(i)$, the proposal distribution $q_i(J^* \mid J)$ for the first infected premise, $i = 1$, differs from that for the other premises, $i > 1$:

• The first infected premise, $i = 1$, is permuted with the second infected premise, $i = 2$ (if several premises are infected at time $T^{\mathrm{inf}}_2$, one of them is selected uniformly at random). To maintain the consistency of the transmission tree, we permuted $T_1$ and $T_2$, $L_1$ and $L_2$, and modified $D_1$ and $D_2$ to satisfy the equation

• For $i > 1$, a candidate value $J^*(i)$ for $J(i)$ was drawn uniformly among the possible source premises satisfying constraints (5). All premises infected by $i$ remain infected by $i$.
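The update for $i > 1$ can be sketched as follows. Constraints (5) are not reproduced here, so the sketch substitutes a simple stand-in rule, clearly an assumption: a premise qualifies as a candidate source if its infection time is strictly earlier than that of premise $i$. The function name `propose_source` and the list `t_inf` of infection times are hypothetical.

```python
import random


def propose_source(i, t_inf, rng=random):
    """Draw a candidate source J*(i) uniformly among admissible premises.

    ASSUMPTION: "admissible" means infected strictly before premise i;
    this stands in for the paper's timing constraints (5).
    """
    candidates = [j for j in range(len(t_inf)) if j != i and t_inf[j] < t_inf[i]]
    if not candidates:
        raise ValueError("premise %d has no admissible source" % i)
    return rng.choice(candidates)


# Example: premise 2 (infected at t = 5) may be assigned source 0, 1 or 3.
rng = random.Random(0)
candidate = propose_source(2, [0.0, 2.0, 5.0, 3.0], rng)
```

Under this stand-in rule the candidate set does not depend on the current value of $J(i)$, so the forward and reverse proposal probabilities are equal.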

The proposal distribution was chosen as a truncated normal distribution:
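A minimal sketch of a truncated normal proposal is given here for illustration. It assumes a proposal centred on the current value with a hypothetical step size `sd` and hypothetical bounds `low` and `high`; none of these values come from the paper. Because truncation makes the proposal asymmetric in general, the log-density is also provided so that the forward and reverse proposal densities can enter the acceptance ratio.

```python
import math
import random


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_truncated_normal(mean, sd, low, high, rng=random):
    """Draw from N(mean, sd^2) restricted to [low, high] by rejection.

    Simple but slow when [low, high] lies far in a tail of the normal.
    """
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x


def log_truncated_normal_pdf(x, mean, sd, low, high):
    """Log-density of the truncated normal, including its normalising mass."""
    if not (low <= x <= high):
        return -math.inf
    mass = normal_cdf((high - mean) / sd) - normal_cdf((low - mean) / sd)
    log_normal = -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))
    return log_normal - math.log(mass)


# Example: propose a new value around a current value of 1.0 within [0, 2].
rng = random.Random(42)
proposed = sample_truncated_normal(1.0, 1.0, 0.0, 2.0, rng)
```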

Acceptance probabilities
At each iteration of the algorithm, variables were sequentially updated with the following acceptance probabilities.

$(\ldots, L^*_1, L^*_2, D^*_1, D^*_2)$ is accepted with probability:

The proposal distribution for $J(i)$, $i > 1$, is symmetric. Thus, the proposed value $J^*(i)$ is accepted with probability:

where $J^*$ is equal to $J$ except that $J(i)$ is replaced by $J^*(i)$.

The proposed vector of values $(T^{\mathrm{inf}*}_i, L^*_i)$ is accepted with probability:

The proposed vector of values $(D^*_i, L^*_i)$ is accepted with probability:

where $(D^*, L^*)$ is equal to $(D, L)$ except that $(D_i, L_i)$ is replaced by $(D^*_i, L^*_i)$.
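The acceptance steps above all follow the generic Metropolis-Hastings pattern, which can be sketched in log space as below. This is an illustration of the general rule only, not the paper's specific acceptance ratios: the arguments are hypothetical log posterior values and log proposal densities supplied by the caller. For a symmetric proposal, such as the one for $J(i)$ with $i > 1$, the two proposal terms cancel and can be left at their default of zero.

```python
import math
import random


def mh_accept(log_post_current, log_post_proposed,
              log_q_forward=0.0, log_q_reverse=0.0, rng=random):
    """Return True if the proposed state is accepted.

    Accepts with probability min(1, ratio), where
    log(ratio) = [log_post_proposed - log_post_current]
               + [log_q_reverse - log_q_forward].
    log_q_forward is the log density of proposing the new state from the
    current one; log_q_reverse is the log density of the reverse move.
    """
    log_ratio = (log_post_proposed - log_post_current) + (log_q_reverse - log_q_forward)
    if log_ratio >= 0.0:
        return True
    # exp of a negative number is safe; -inf gives 0.0 and is always rejected.
    return rng.random() < math.exp(log_ratio)


# Example: a proposal half as probable as the current state, symmetric proposal.
rng = random.Random(1)
accepted = mh_accept(0.0, math.log(0.5), rng=rng)
```

Working with logarithms avoids numerical underflow when the posterior is a product of many small likelihood terms.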

The proposed vector of values $\alpha^*$ is accepted with probability:

The proposed vector of values $\beta^*$ is accepted with probability:

Performance of the estimation algorithm
The analysis of Ref. [3] indicated that two of these 15 premises (A, N) were infected from a second source outwith our sample. In order to maintain the assumption of a single introduction required by our model, we initially applied our inference scheme to the 13 remaining premises. We inferred that premise B acted as a "hub" of the outbreak, infecting 7 premises (see Fig. S8), in contrast with Cottam et al. [3], where the role of the hub was assigned to premise K, which was inferred as a source for B. The sequences collected on the premises infected by the hub are indeed closer to K than to B; thus, genetic data support K as the hub.

Finally, we notice that the posterior probabilities for the mean latency duration and the mean transmission distance (Fig. S17) have a similar shape to those obtained for the 20 premise simulation (Fig. 2), but their width is much smaller. In the case of the mean latency duration, the true value of this parameter is not contained in the 95% credible interval of the corresponding posterior distribution. This is probably due to the "extreme" character of the epidemic, as described above. However, given the small width of this posterior distribution, the difference between the true value of the parameter and the median of the posterior is less than a day, which in absolute terms is less than what was obtained for the 20 premise simulation.
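The comparison between a true parameter value and its posterior can be made concrete with a short sketch. It assumes that MCMC draws of the parameter are available as a plain list and uses an equal-tailed 95% interval with linear interpolation between order statistics; the function names and the example numbers are hypothetical and do not reproduce the paper's results.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def summarize_posterior(samples, true_value, level=0.95):
    """Median, equal-tailed interval, coverage flag and absolute error."""
    if not samples:
        raise ValueError("samples must not be empty")
    xs = sorted(samples)
    tail = (1.0 - level) / 2.0
    low, high = quantile(xs, tail), quantile(xs, 1.0 - tail)
    median = quantile(xs, 0.5)
    return {
        "median": median,
        "interval": (low, high),
        "covered": low <= true_value <= high,
        "abs_error": abs(median - true_value),
    }


# Example with made-up draws: a narrow posterior can miss the true value
# while its median is still close to it in absolute terms.
summary = summarize_posterior([4.6, 4.7, 4.8, 4.9, 5.0, 5.1, 5.2], true_value=5.5)
```

A narrow interval that excludes the true value together with a small absolute error is the situation described above for the mean latency duration.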