The authors have declared that no competing interests exist.

Exploiting pathogen genomes to reconstruct transmission represents a powerful tool in the fight against infectious disease. However, their interpretation rests on a number of simplifying assumptions that regularly ignore important complexities of real data, in particular within-host evolution and non-sampled patients. Here we propose a new approach to transmission inference called SCOTTI (Structured COalescent Transmission Tree Inference). This method is based on a statistical framework that models each host as a distinct population, and transmissions between hosts as migration events. Our computationally efficient implementation of this model enables the inference of host-to-host transmission while accommodating within-host evolution and non-sampled hosts. SCOTTI is distributed as an open source package for the phylogenetic software BEAST2. We show that SCOTTI can generally infer transmission events even in the presence of considerable within-host variation, can account for the uncertainty associated with the possible presence of non-sampled hosts, and can efficiently use data from multiple samples of the same host, although there is some reduction in accuracy when samples are collected very close to the infection time. We illustrate the features of our approach by investigating transmission from genetic and epidemiological data in a Foot and Mouth Disease Virus (FMDV) veterinary outbreak in England and a

We present a new tool, SCOTTI, to efficiently reconstruct transmission events within outbreaks. Our approach combines genetic information from infection samples with epidemiological information of patient exposure to infection. While epidemiological information has been traditionally used to understand who infected whom in an outbreak, detailed genetic information is increasingly becoming available with the steady progress of sequencing technologies. However, many complications, if unaccounted for, can affect the accuracy with which the transmission history is reconstructed. SCOTTI efficiently accounts for several complications, in particular within-patient genetic variation of the infectious organism, and non-sampled patients (such as asymptomatic patients). Thanks to these features, SCOTTI provides accurate reconstructions of transmission in complex scenarios, which will be important in finding and limiting the sources and routes of transmission, preventing the spread of infectious disease.

Understanding the dynamics of transmission is fundamental for devising effective policies and practical measures that limit the spread of infectious diseases. In recent years, the introduction of affordable whole genome sequencing has provided unprecedented detail on the relatedness of pathogen samples [

A number of approaches have been developed that reconstruct transmission from genetic data. One method, based on pathogen genetic data, rules out direct transmission if isolates from different hosts are separated by a number of substitutions above a fixed threshold [

Reconstruction of transmission can be hindered by several complexities causing disagreement between the actual transmission history and the phylogeny of the sampled pathogen. Here we show four examples of these complexities:

Several methods emerged in recent years explicitly modelling both the transmission process and genetic evolution to perform inference of the history of transmission events [

Here, we propose a new Bayesian approach called SCOTTI (Structured COalescent Transmission Tree Inference) that not only accounts for diversity and evolution within a host, but also for other sources of bias, namely non-sampled hosts and multiple infections of the same host. This new method builds on our recent progress in efficiently modelling migration between populations using an approximation to the structured coalescent [

Many methods that infer transmission from pathogen genetic data assume that the pathogen population within a host is genetically homogeneous, thereby overlooking within-host variation. A popular example is Outbreaker [

In the present work we consider three different models of pathogen evolution within an outbreak:

A natural way to account for within-host evolution is via an extension of the multi-species coalescent model [

While we simulate outbreak data under the multispecies model, to infer transmission we propose a model based on the structured coalescent, SCOTTI. In the structured coalescent multiple distinct populations are present at the same time, lineages in the same population can coalesce (find a common ancestor), and lineages can migrate between populations at certain rates. In SCOTTI, each host represents a distinct pathogen population, and migration of a lineage represents a transmission event (

To test the accuracy of our new method SCOTTI in inferring the origin of transmission, and to compare it to the accuracy of the software Outbreaker, we simulated pathogen evolution within two distinct, fixed transmission histories, one from a 2001 FMDV outbreak [

In our base simulation setting, SCOTTI has higher accuracy than Outbreaker, in particular when provided multiple samples per host. The coloured “Maypole” tree (see Fig A in

SCOTTI shows higher accuracy than Outbreaker in all scenarios except with early sampling, while Outbreaker credible sets are poorly calibrated. Pathogen sequence evolution was simulated under transmission history 1, used in

Overall, SCOTTI shows higher accuracy than Outbreaker across scenarios (

Another difference between the two methods is that Outbreaker tends to infer a posterior distribution supporting a narrower range of origins. This, paired with its limited accuracy, leads the method to exclude a true origin from 95% credible sets in about half of the simulations. SCOTTI is instead much better calibrated, with 95% credible sets containing the true origin between 90% and 100% of the time (

In most of our scenarios the amount of genetic information available to distinguish different transmission histories is rather limited, with 2-3 SNPs per sampled host (total number of SNPs divided by the number of sampled hosts) on average. From such data a single phylogenetic tree relating the sequenced samples cannot be inferred unambiguously. When we increase phylogenetic signal, either by simulating longer infection times (“long infection” scenario) or with longer genetic sequences (“abundant genetic”), the accuracy of the methods substantially increases (Fig F in

SCOTTI and Outbreaker require distinct formats for epidemiological information. Outbreaker requires as input a probability distribution over the possible durations and intensity of infectivity, and sampling times. In contrast, SCOTTI requires the user to specify an exposure interval for each host. In simulations where the exposure intervals provided for each host were doubled in length compared to the true ones (“inaccurate epi” scenario) SCOTTI appeared relatively robust (

Furthermore, we investigated the effect of sampling times on the two methods. Outbreaker has higher accuracy when sampling times are close to the start of infection (“early sampling” scenario). Indeed this is the one setting in which Outbreaker outperforms SCOTTI in inference accuracy (

We did additional simulations to test the performance of SCOTTI under random transmission histories, variable host features, different levels of within-host genetic variation, and proportions of non-sampled hosts (see

We simulated a number of outbreaks with varying numbers of hosts and samples to test the computational applicability and efficiency of SCOTTI (see

To investigate the impact of our method on the study of real outbreaks, we examined the transmissions inferred by SCOTTI and Outbreaker in two real outbreaks of FMDV in 2007 [

FMDV infects cloven-hoofed animals and causes an economically devastating disease for the farming sector. The 2007 FMDV outbreak occurred in southern England as two distinct transmission clusters, one in August and one in September (Fig L in

Outbreaker (

We also investigated the same outbreak using Beastlier [

As another example of empirical analysis, we also investigated an antimicrobial resistant

Outbreaker (

Methods to infer transmission events within outbreaks are essential to determine the causes and patterns of transmission, and therefore to inform policies preventing and limiting transmission. Genomic data from pathogen samples give the opportunity to investigate, at an unprecedented level of detail, the relatedness of pathogens from different hosts. However, common real life complexities such as within-host variation (in particular for bacterial and chronic viral infections [

Here we have presented SCOTTI, a novel method of host-to-host transmission inference (who infected whom) that is built around a computationally efficient model of pathogen evolution based on the structured coalescent. By modelling each host as a distinct pathogen population, and transmission as migration of lineages between hosts, we have shown that it is possible to model within-host evolution and estimate transmission events with good accuracy, even in the presence of non-sampled hosts. We compared the accuracy of SCOTTI with that of the similar software Outbreaker [

Although SCOTTI has broad applicability, it has important limitations that we will address in future work. One problem is that SCOTTI ignores transmission bottlenecks, that is, the rapid growth in pathogen population size within a host following transmission. While the presence of strong transmission bottlenecks alone does not seem to increase the error of SCOTTI, the presence of bottlenecks in conjunction with very early samples (close to the start of exposure) can considerably reduce its accuracy, as we showed in the simulation setting involving early sampling times. A somewhat ad hoc solution to this problem could be to artificially shift back the starting time of exposure for hosts sampled very early, as we have also shown that such a decrease in the informativeness of epidemiological data has limited effect on SCOTTI. On the other hand, due to the absence of transmission bottlenecks in its model, SCOTTI is likely to suit outbreaks with frequent mixed infections and large transmission inocula (see e.g. [

Transmission tree topology also seems to have an important effect on the observed patterns (see e.g. [

Throughout our work we assumed neutrality, but in some cases selection can lead to phylogenetic and transmission inference biases [

In conclusion, we have presented a new method to reconstruct transmission events, SCOTTI, that addresses the urgent need for software to analyse genomic and epidemiological data while accommodating incomplete or patchy host sampling, mixed infections, and within-host variation. For these reasons, our method can help to reconstruct transmission histories in a broad range of outbreaks, both bacterial and viral. This information will in turn be essential for devising effective strategies to fight the spread of infectious disease.

SCOTTI is distributed as an open source package for the Bayesian phylogenetic software BEAST2. It can be downloaded from

Recently, we proposed a BAyesian STructured coalescent Approximation (BASTA) that uses the structured coalescent framework (also known as the coalescent with migration) to infer migration rates and events between populations [

To allow the inclusion of epidemiological data, each population (host) _{i} ∈ (−∞, +∞] and a removal time _{r} ∈ [−∞, +∞), with _{r} < _{i} (we consider time backward as typical in coalescent theory). The interval [_{r}, _{i}] represents the exposure interval for population _{i} and _{r} represent respectively the times at which first it was possible for the host to have been infected, and last to have been infectious. For example, in a nosocomial outbreak, _{i} and _{r} would represent respectively the time of arrival and departure of host _{i} and _{r} are provided by the user and are therefore treated here as auxiliary data (we do not model host exposure, and exposure times are always conditioned on). In the worst case scenario where no information on host _{i} = +∞ and _{r} = −∞). We will denote as _{D} is not fixed, but is estimated within a range specified by the user. In the remainder of this work, we will assume that non-sampled demes have unlimited exposure times, but we also provide the option in SCOTTI of specifying regularly distributed introduction and removal times. Here, _{D} does not necessarily correspond to the number of non-sampled intermediate hosts in the outbreak, as each host can be infected multiple times, so a non-sampled host, due to its infinite exposure time, can model more than one real-life non-sampled intermediate host. An additional important difference to [_{e}. This means that we assume that transmission is

We assume that a set of samples _{i} ∈ _{i} ∈ _{I}, and a sampled host _{i} ∈

To infer the transmission history in a Bayesian statistical framework, we aim to approximate the following joint posterior distribution:

To approximate _{I}, _{e}, _{D}) in _{i} = [_{i−1}, _{i}], where _{i} is the older event time of _{i} and _{i−1} the more recent, the probability density of interval _{i} can be written as
_{i} is the set of all extant lineages during interval _{i}, _{l} is the host to which lineage _{l} = _{l′} = _{i} is the contribution of the particular event:

To approximate _{i} we substitute _{l} = _{l′} = _{l} = _{l′} = _{l,t} to be the vector whose _{l,t,d} = _{l} = _{i} into two sub-intervals of equal length _{i1} = [_{i−1}, (_{i} + _{i−1})/2] and _{i2} = [(_{i} + _{i−1})/2, _{i}], and replace _{l,t} with _{l,αi−1} for all _{i1} and _{l,αi} for all _{i2}. This corresponds to approximating the distribution of lineages among hosts within an interval, as the same distribution at the interval boundaries. As the vast majority of intervals are generally relatively short, and no event occurs within them, this approximation has limited effect but substantially reduces the computations as now we only have to calculate _{l,t} at interval boundaries. We also call _{i}: = _{i} − _{i−1}. The approximated probability density contributions of _{i1} and _{i2} become:

The probability density of the genealogy under the structured coalescent, integrated over migration histories, is finally approximated as

The probability distribution of lineages among demes is updated iteratively starting from the most recent event toward the past as
_{i} is the number of hosts (sampled or non-sampled) exposed during interval _{i}. This comes from the assumption that any lineage migrates away from the current host at total rate _{l,t} is a vector whose _{1} and _{2} coalesce to an ancestral lineage _{i} = _{i} is the introduction time for a deme, all remaining lineages in host _{i} we update the probabilities as in _{i}, _{l,αi,d} is set to 0, and its value is distributed uniformly over all other hosts. If the considered event is a removal of host _{i} _{l,αi,d} is initiated with the value 0.
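The boundary updates described above for host introduction and removal events can be illustrated with a minimal Python sketch. It is not the SCOTTI implementation: the function names, the dictionary representation of one lineage's host probabilities, and the choice to redistribute mass only over the hosts passed in `exposed_hosts` are all assumptions made for illustration.

```python
from typing import Dict, Iterable


def apply_introduction(probs: Dict[str, float], host: str,
                       exposed_hosts: Iterable[str]) -> Dict[str, float]:
    """Going backward in time past the introduction time of `host`.

    The lineage can no longer be in `host`, so its probability there is set
    to 0 and that mass is shared uniformly among the other hosts
    (assumption: "other hosts" means the other entries of `exposed_hosts`).
    """
    others = [h for h in exposed_hosts if h != host]
    if not others:
        raise ValueError("no other host available to receive the lineage")
    updated = dict(probs)
    mass = updated.get(host, 0.0)
    updated[host] = 0.0
    share = mass / len(others)
    for h in others:
        updated[h] = updated.get(h, 0.0) + share
    return updated


def apply_removal(probs: Dict[str, float], host: str) -> Dict[str, float]:
    """Going backward in time past the removal time of `host`.

    The host becomes available from this point, and the lineage's
    probability of being in it is initialised to 0.
    """
    updated = dict(probs)
    updated[host] = 0.0
    return updated


if __name__ == "__main__":
    lineage = {"A": 0.5, "B": 0.3, "C": 0.2}
    print(apply_introduction(lineage, "A", ["A", "B", "C"]))
```

In this sketch the total probability of a lineage over hosts is preserved by the introduction update, which is the property the text relies on when it redistributes the mass of a host that is no longer exposed.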

Samples from the posterior distribution in

SCOTTI allows a large number of populations to be investigated, as the assumption of uniformity of migration rates and effective population sizes greatly reduces the computational demand and parameter space compared to [_{D} has no effect on the computational demand of SCOTTI. Example files and data from the analyses described hereby can be found in

We test the performance in transmission inference of SCOTTI and Outbreaker using a broad range of simulation scenarios. We simulate within-outbreak pathogen evolution using the transmission events observed in two example real-life outbreaks. For half of simulations we use a subset of the FMDV transmission history inferred in [

While we simulate the coalescent process randomly, the transmission process is fixed _{e}, which is the same for all hosts. Lineages within a host can freely coalesce back in time as in the standard coalescent.

In addition to within-host evolution, we want to simulate a typical transmission: a small proportion of the pathogen population passed on at transmission (due to limited inoculum size), followed by rapid growth in the recipient (see e.g. [_{e} generations (a weak bottleneck through which two lineages have a probability of ≈ 63% of coalescing), or 100_{e} generations (a strong bottleneck through which two lineages almost surely coalesce). For half of the simulation scenarios we use a weak bottleneck, for the other half a strong one. In the population merger after the bottleneck all lineages remaining in the recipient host are moved to the donor host. Transmission bottlenecks are neither modelled in SCOTTI, nor in Outbreaker (
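The coalescence probabilities quoted for the two bottleneck strengths follow from the standard coalescent, in which a pair of lineages coalesces at rate 1 when time is measured in units of the effective population size. The short Python sketch below makes this arithmetic explicit. The function name is hypothetical, and the weak bottleneck duration of one such time unit is an assumption inferred from the stated value of about 63%, since that duration is not legible in this text.

```python
import math


def pairwise_coalescence_probability(duration: float) -> float:
    """Probability that two lineages coalesce within `duration`.

    `duration` is measured in units of effective-population-size generations,
    so the pairwise coalescence rate is 1 and the waiting time is Exp(1).
    """
    if duration < 0:
        raise ValueError("duration must be non-negative")
    return 1.0 - math.exp(-duration)


if __name__ == "__main__":
    # Assumed weak bottleneck: one time unit.
    print(round(pairwise_coalescence_probability(1.0), 3))
    # Strong bottleneck: 100 time units, as stated in the text.
    print(pairwise_coalescence_probability(100.0))
```

Under these assumptions the weak bottleneck gives 1 − e^{−1} ≈ 0.632, and the strong bottleneck gives a probability indistinguishable from 1, matching the description that two lineages almost surely coalesce.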

Finally, half of the simulations are performed providing one sample per host, the other half providing two samples per host, although Outbreaker is only run with one sample per host because it does not accept more. In summary we have 2 × 2 × 2 = 8 groups of simulations:

Weak vs strong bottleneck

First vs second transmission history

One vs two samples per host.
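The three binary factors listed above can be enumerated mechanically, as in the following small Python sketch. The labels and function names are hypothetical and chosen only for illustration; the restriction of Outbreaker to single-sample groups follows the description in the text.

```python
from itertools import product

# Hypothetical labels for the three binary factors described in the text.
BOTTLENECKS = ("weak", "strong")
TRANSMISSION_HISTORIES = ("first", "second")
SAMPLES_PER_HOST = (1, 2)


def simulation_groups():
    """Return every combination of the three factors (2 x 2 x 2 = 8)."""
    return [
        {"bottleneck": b, "history": h, "samples_per_host": s}
        for b, h, s in product(BOTTLENECKS, TRANSMISSION_HISTORIES, SAMPLES_PER_HOST)
    ]


def outbreaker_groups():
    """Groups on which Outbreaker is run: one sample per host only."""
    return [g for g in simulation_groups() if g["samples_per_host"] == 1]


if __name__ == "__main__":
    for group in simulation_groups():
        print(group)
```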

For each of the aforementioned eight groups, eight different scenarios (or subgroups) are simulated, for a total of 64 distinct simulation settings. We define a basic subgroup (called “base”), and seven variants, in each of which one aspect of the base subgroup is modified. In “base”, sampling times are picked uniformly at random and independently within host exposure times, the average time of infection is 2N_{e} generations, every host is sampled, the alignment length is 1500 bp, and the epidemiological data provided to SCOTTI is accurate (introduction and removal times correspond to infection and recovery time of hosts). The seven variant settings are:

_{e} generations).

For each of the total 64 subgroups, 100 datasets are simulated under an HKY substitution model [^{−3} substitution rate per base per _{e} generations, and uniform nucleotide frequencies.

For each simulated dataset we infer transmission with Outbreaker under the HKY substitution model and with 10^{6} MCMC iterations. Each analysis is initiated with a random starting tree and with uniform prior infection and sampling probabilities over the maximum observed infection time interval. We run SCOTTI with an HKY substitution model, between 0 and 2 non-sampled hosts, and 10^{6} MCMC iterations. For both methods we assess the performance by checking how often the correct origin of infection of each sampled host is recovered, and with what posterior probability. If transmission from host 1 to host 2 is inferred (either by SCOTTI or Outbreaker), then the infection origin of host 2 is inferred to be host 1. If an indirect transmission to host 2 is inferred, or host 2 is inferred to be an index host or an imported host, then the infection origin of host 2 is inferred to be non-sampled. Lastly, if SCOTTI infers multiple origins of the same host, then, if more than one origin is a sampled host, we always consider the inference as wrong; otherwise we consider the only sampled origin. We use two metrics to define the accuracy of an origin inference: (i) the number of replicates in which the simulated origin is the one with the highest posterior probability, and (ii) the average posterior probability of the simulated origin across replicates.
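The two accuracy metrics, together with the 95% credible sets used elsewhere in this work to assess calibration, can be computed from per-replicate posterior distributions over origins. The Python sketch below is only illustrative: the function names, the dictionary input format, and the greedy construction of the credible set (adding origins in decreasing order of posterior probability until the level is reached) are assumptions, not the scripts used for the analyses.

```python
from typing import Dict, List, Set, Tuple


def origin_accuracy(posteriors: List[Dict[str, float]],
                    true_origins: List[str]) -> Tuple[int, float]:
    """Return (i) the number of replicates where the true origin has the
    highest posterior probability and (ii) the mean posterior probability
    assigned to the true origin across replicates."""
    if not posteriors or len(posteriors) != len(true_origins):
        raise ValueError("need one true origin per non-empty list of posteriors")
    top_hits = 0
    total = 0.0
    for post, truth in zip(posteriors, true_origins):
        best = max(post, key=post.get)  # origin with the highest posterior
        top_hits += int(best == truth)
        total += post.get(truth, 0.0)
    return top_hits, total / len(true_origins)


def credible_set(post: Dict[str, float], level: float = 0.95) -> Set[str]:
    """Smallest set of origins, taken in decreasing posterior order,
    whose cumulative posterior probability reaches `level`."""
    chosen: Set[str] = set()
    cumulative = 0.0
    for origin, prob in sorted(post.items(), key=lambda kv: kv[1], reverse=True):
        chosen.add(origin)
        cumulative += prob
        if cumulative >= level:
            break
    return chosen


def coverage(posteriors: List[Dict[str, float]], true_origins: List[str],
             level: float = 0.95) -> float:
    """Fraction of replicates whose credible set contains the true origin."""
    hits = sum(t in credible_set(p, level) for p, t in zip(posteriors, true_origins))
    return hits / len(true_origins)
```

A well-calibrated method should give a coverage close to the chosen level; here the label used for a non-sampled origin is simply another key of the dictionary.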

The simulations outlined above were performed under two fixed transmission trees. This gives us the possibility to investigate which particular transmission events are more accurately reconstructed than others. However, this also restricted us to a very limited number of scenarios. To address the question of how SCOTTI performs under more general and random transmission trees, we simulated outbreaks within a hospital setting. Every outbreak started with one case within the hospital, and two other non-infected patients. Each day, a new non-infected patient was allowed to enter the hospital with 10% probability (while we use days as time units, a different time unit would not qualitatively change the structure of the simulated outbreaks). These new patients remained in the hospital for a sojourn time normally distributed with mean 60 days and standard deviation 10 days. Sojourn time was therefore pre-determined and not affected by transmission events. Also, patients did not recover from infection. Each day, each infected patient had a 1/60 probability of infecting each non-infected patient. Each patient was only allowed to be infected once. In the case where patient infectivity was variable, an infectivity multiplier was randomly sampled for each patient with a uniform distribution between 0.5 and 1.5. Infection was halted when a total of 12 patients were infected. A number of infected patients, between 3 and 12 depending on the particular scenario, was randomly selected and sampled. Sampling times were uniformly selected within the second and third quarters of the infected sojourn time for each sampled patient. In the base scenario, each infection lasted 5N_{e} pathogen generations, and transmission bottlenecks corresponded to 5N_{e} pathogen generations, causing lineages to coalesce within hosts with very high probability. However, when exploring the effect of within-host diversity, we set the duration of infections and the intensity of the bottleneck to 2, 1, 0.5, or 0.2N_{e} pathogen generations. In the scenario with variable N_{e}, for each patient we sampled a within-host effective population size multiplier from an exponential distribution with mean 1.
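The hospital transmission process described above can be sketched as a simple daily simulation. The following Python code is a minimal illustration under stated assumptions, not the simulator used for the analyses: the names `Patient` and `simulate_outbreak` are hypothetical, sojourn times are rounded to whole days, a patient becomes infectious the day after infection, and the simulation stops early if no infected patient remains in the hospital. Sampling of patients and within-host pathogen evolution are not included.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patient:
    pid: int
    arrival: int                  # first day in hospital
    departure: int                # first day no longer in hospital
    infectivity: float = 1.0      # multiplier on the daily infection probability
    infected_on: Optional[int] = None
    infector: Optional[int] = None


def simulate_outbreak(seed: int = 0, target_infected: int = 12,
                      variable_infectivity: bool = False,
                      max_days: int = 100_000) -> List[Patient]:
    rng = random.Random(seed)
    patients: List[Patient] = []

    def admit(day: int) -> Patient:
        # Sojourn ~ Normal(60, 10) days, fixed at admission (rounded: assumption).
        sojourn = max(1, round(rng.gauss(60, 10)))
        mult = rng.uniform(0.5, 1.5) if variable_infectivity else 1.0
        patient = Patient(len(patients), day, day + sojourn, mult)
        patients.append(patient)
        return patient

    index_case = admit(0)
    index_case.infected_on = 0
    admit(0)
    admit(0)
    n_infected = 1

    for day in range(1, max_days):
        if rng.random() < 0.10:   # a new non-infected patient may arrive
            admit(day)
        present = [p for p in patients if p.arrival <= day < p.departure]
        infectious = [p for p in present
                      if p.infected_on is not None and p.infected_on < day]
        susceptible = [p for p in present if p.infected_on is None]
        for target in susceptible:
            for source in infectious:
                if rng.random() < min(1.0, source.infectivity / 60):
                    target.infected_on = day      # each patient infected once
                    target.infector = source.pid
                    n_infected += 1
                    break
            if n_infected >= target_infected:
                return patients                   # outbreak halted at 12 cases
        # Stop if every infected patient has left (outbreak extinct: assumption).
        if not any(p.infected_on is not None and p.departure > day + 1
                   for p in patients):
            break
    return patients
```

The returned list records, for every infected patient, the day of infection and the identifier of the infector, which together define a random transmission tree of the kind used to test SCOTTI in this setting.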

To test the computational requirements of SCOTTI (Fig K in _{e}, and all other details were identical to the “Long infection” scenario in previous simulations. Estimations were run with SCOTTI with 4 million MCMC steps, 4 replicates for each scenario, 10% burn-in and a step size of 50. All replicates reached convergence (ESS>500 for posterior probability).

We apply and compare SCOTTI and Outbreaker on two real datasets: one from the 2007 FMDV outbreak in UK [^{6} MCMC iterations on FMDV, and 10^{9} iterations on ^{8} MCMC iterations under an HKY substitution model and with between zero and two non-sampled hosts. In all cases we checked MCMC convergence with the likelihood trace and effective sample size. These datasets are provided in

The supplementary text contains further details of the methods and analyses, in particular all supplementary figures and tables.

(PDF)

File containing information to replicate results.

(ZIP)

We thank Xavier Didelot, David Eyre, Thibaud Jombart, Erik Volz and Nicole Stoesser for comments and suggestions on early versions of the manuscript. We are also grateful to Matthew Hall for help in running Beastlier.