Bayesian reconstruction of transmission within outbreaks using genomic variants

Pathogen genome sequencing can reveal details of transmission histories and is a powerful tool in the fight against infectious disease. In particular, within-host pathogen genomic variants identified through heterozygous nucleotide base calls are a potential source of information to identify linked cases and infer direction and time of transmission. However, using such data effectively to model disease transmission presents a number of challenges, including differentiating genuine variants from those observed due to sequencing error, as well as the specification of a realistic model for within-host pathogen population dynamics. Here we propose a new Bayesian approach to transmission inference, BadTrIP (BAyesian epiDemiological TRansmission Inference from Polymorphisms), that explicitly models evolution of pathogen populations in an outbreak, transmission (including transmission bottlenecks), and sequencing error. BadTrIP enables the inference of host-to-host transmission from pathogen sequencing data and epidemiological data. By assuming that genomic variants are unlinked, our method does not require the computationally intensive and unreliable reconstruction of individual haplotypes. Using simulations we show that BadTrIP is robust in most scenarios and can accurately infer transmission events by efficiently combining information from genetic and epidemiological sources; thanks to its realistic model of pathogen evolution and the inclusion of epidemiological data, BadTrIP is also more accurate than existing approaches. BadTrIP is distributed as an open source package (https://bitbucket.org/nicofmay/badtrip) for the phylogenetic software BEAST2. We apply our method to reconstruct transmission history at the early stages of the 2014 Ebola outbreak, showcasing the power of within-host genomic variants to reconstruct transmission events.


Introduction
Understanding transmission is important for devising effective policies and measures that limit the spread of infectious diseases. In recent years, affordable whole genome sequencing has provided unprecedented detail on the relatedness of pathogen samples [1][2][3][4]. Consequently, accurately inferring transmission between hosts is becoming more feasible. However, this requires robust statistical approaches that make use of the full extent of genetic and epidemiological data available. Here, we present a new approach that makes use of within-host genetic variation and epidemiological data to infer transmission.
A number of approaches have been developed that reconstruct transmission from genetic data. The number of substitutions between samples from different hosts can be used to rule out transmission [5][6][7], or the phylogenetic tree of the pathogen samples can be used as a proxy for the transmission history [8,9]. However, while the phylogenetic signal can be very informative of transmission, it can also be misleading [10,11], due to within-host variation that can generate discrepancies between the phylogenetic and epidemiological relatedness of hosts, and can bias estimates of infection times [12,13].
In recent years a number of methods have been proposed explicitly modelling both the transmission process and within-host pathogen genetic evolution to infer transmission events [11,[13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28]. Some of these methods use epidemiological data and genetic sequences from pathogen samples, and ignore within-host evolution and other causes of phylogenetic discordance with transmission history [14][15][16][17][18][19][21][22][23]. Methods that explicitly model pathogen evolution within hosts and within an outbreak [13,20,24,25,27] generally assume, among other things, that samples provide individual and reliable pathogen haplotypes. This is often true for bacteria that are sampled and cultured before being sequenced, but it is mostly false for viruses and bacteria that are sequenced directly from samples without culturing. In fact, in these cases the sequencing process delivers reads coming from the different pathogen haplotypes that constitute the within-host pathogen population, and it is often very hard (if not impossible) to reconstruct complete haplotypes from these reads. In such cases, within-sample genetic variation is often neglected, and a single haplotype (which we call the consensus sequence of the sample) is built. While this procedure might lead to errors (and maybe biases), it also certainly discards a very informative part of the available genetic data, because within-sample genetic variants can be very informative of epidemiological distance, direction of transmission, time from infection and transmission bottleneck intensity (see [29][30][31][32] and Fig 1). Furthermore, it is generally assumed that the pathogen does not recombine, so that a single phylogeny describes the variants (from possibly multiple samples per host) to reconstruct transmission (including directionality and time of infection), but also combines this information with epidemiological data and an explicit model of within-host pathogen population evolution and transmission. We use the phylogenetic models with polymorphisms PoMo [36][37][38] to model population evolution along branches of the transmission tree; thanks to this, our transmission tree and phylogenetic tree are the same entity, and within-host evolution and recombination (resulting from a single primary infection, not multiple infections) do not create discrepancies that make statistical inference hard and computationally demanding [24,25,27]. We also explicitly model transmission bottlenecks, with one parameter defining the intensity of the bottleneck, and therefore the number of pathogen particles that establish a new population at transmission. Another feature of our approach is that we assume that different genomic positions are unlinked, an assumption also made by other methods using within-host variants [30,32]; most coalescent-based methods assume instead no recombination at all. Because of our assumption of no linkage, we expect our approach to work well when recombination is strong enough to break linkage between genetic variants in the same host, or when the evolutionary rate is slow so that very few new mutations originate with each new transmission.
BadTrIP is implemented as an open-source package for the Bayesian phylogenetic software BEAST2 [39], and as such, it can be freely installed, used, and modified. We compare the performance of BadTrIP, of the shared variants-based clustering (SVC) method of [30], and of the coalescent-based method SCOTTI [13] on simulated data and on a real dataset from the early stages of the 2014 Ebola outbreak [40]. These applications show that BadTrIP has high accuracy to reconstruct transmission thanks to its explicit model of population evolution, the use of within-host genetic variants, and the inclusion of epidemiological data, and can provide important information to understand and limit the spread of infectious disease.
In the rest of the manuscript, we refer to a "host" as any entity that can contain and transmit a pathogen. Typically a host is a human within a community or nosocomial outbreak, or patients, but the concept of host can also be generalised for example to farms within a livestock outbreak. We will refer to the collection of all pathogens of the type under consideration within an individual host at a certain time as a "pathogen population" (for example all Ebola virions within an infected host, excluding non-Ebola pathogens and Ebola virions from other hosts). We will call a "pathogen unit" a single pathogen individual within a population, for example an individual bacterial cell or an individual virion. We call a pathogen population "polymorphic" at a particular genome position if pathogen units with different nucleotides at that position are present in the population; in this case, we also call the considered genome position a "genetic variant".
The adoption of such a population genetic model within a transmission tree structure means that the phylogenetic tree and the transmission tree are now the same entity, and that each point of the tree represents the state of the pathogen population at a certain time within a host (Fig 2). Each bifurcation in the tree represents a transmission event, where the pathogen population splits in two groups: one remaining in the current host, and a small sub-population colonising a new host. We use a population bottleneck at time of transmission for the colonising branch to better model the transmission process.
Our method uses two sources of information: epidemiological and genetic data. Epidemiological data is in the form of dates: the times when genetic samples are collected (it is possible to give any number of samples ! 0 for any host, even no sample at all) and a time interval for each host describing when it can contribute to the outbreak. Each host can only be infected, be sampled, and can infect other hosts within its time interval [13]. Genetic data from each sample is in the form of nucleotide counts: for each position of the genome, for a certain sample, There are three hosts in this outbreak, represented by the black rectangles: H1 infects H2, which in turn infects H3. Time is on the vertical axis, and transmission events are represented by the thick arrows between hosts. Within each host, while it is colonised, the pathogen population consists of 15 units, each of which can have one of the four nucleotides at the considered position and at any time. For example, H1 starts off with all 15 pathogen units having an A, but during infection one of them mutates to C, and through genetic drift when H1 infects H2 it has 4 C's and 11 A's. While instantaneously only small changes can occur (one pathogen unit changing its nucleotide), along a time interval any number of changes can occur. As H2 is infected by H1, H2 is colonised by a copy of the pathogen population of H1, but the transmission bottleneck in this case causes one of the nucleotides to be lost, so that H2 is founded by a homogenous population of A's. Within H2 again a mutation occurs and now a G is present in the pathogen population, but when H3 is colonised by H2 both nucleotides survive the transmission bottleneck, so H3 starts off with a polymorphic population. In the figure, H1 and H3 both have samples extracted and sequenced once, while H2 is not sampled at all. The sequencing process can result in any coverage (24 for H1 and 7 for H3 at the considered position). Furthermore, the observed nucleotide frequencies don't necessarily exactly match the real nucleotides frequencies due to the randomness of read sampling, and because sequencing error can cause absent nucleotides to be observed at very low frequencies. https://doi.org/10.1371/journal.pcbi.1006117.g002 Transmission reconstruction from genomic variants the model expects the number of times each of the four nucleotides is observed in the reads (for example: 59 As, 0 Cs, 12 Gs, 1 Ts). We assume that reads are sampled with replacement from the pathogen population according to the (hidden) true nucleotide frequencies, and we model the sequencing error. This in particular means that sites without any sequencing coverage, or with very low coverage, are also allowed, and that differently from similar approaches (i.e. [30,32]) we don't require the specification of a minimum genetic variant frequency threshold. While in our model we make the strong assumption that sites are completely unlinked, we test the performance of our approach with simulations in which we explicitly model withinhost recombination events and we assume that a limited number of individuals in the pathogen population is sequenced. We even simulate scenarios in the total absence of recombination (complete linkage) to measure the robustness of our method. We simulate a broad range of scenarios: different transmission bottleneck severities (weak vs. strong), different amounts of genetic information, different recombination and mutation rates, different sequencing coverage levels, different sequencing error rates, and different virtual population sizes. We give further details on the model used and the simulations in the Materials and Methods section.

Accuracy of inference on simulated data
To test the accuracy of our new method BadTrIP in inferring transmission events, and to compare it to previous methods [13,30], we simulated pathogen evolution within outbreaks and sample sequencing, and we used different methods to reconstruct the transmission history from sequencing and epidemiological data. To simulate pathogen evolution, first we simulated an outbreak using SEEDY [42] (we used a fixed population of 15 hosts, one initial case, and a basic reproduction number of 1.43, see Materials and methods); then, we translated the transmission history into a population history, and simulated within-population pathogen coalescent, recombination and mutation with fastsimcoal2 [43]. Throughout the simulations each host in the outbreak is sampled exactly once. We measure the accuracy of a method as the frequency with which the correct transmission source of each host is inferred to be the most likely a posteriori. We also give a measure of how well calibrated [44] methods are by counting how often the correct source is in the 95% posterior credible set, defined as the minimum set of sources with cumulative probability ! 95% such that all sources in the set have higher posterior probability than all sources outside of it.
BadTrIP shows elevated accuracy in detecting the correct source of transmission (between 50% and 90%) and calibration (between 80% and 100%), in particular compared to the SVC approach (accuracy between 20% and 45% and calibration between 45% and 95%), see Fig 3. This shows that the use of epidemiological data and an explicit model of evolution can help to reconstruct transmission. Using alternative statistics for accuracy and calibration leads to similar patterns (Fig F in S1 Text). BadTrIP also shows more accuracy than the coalescent approach SCOTTI (accuracy between 25% and 70%). The latter method appears very conservative in this application (calibration between 95% and 100%). SCOTTI uses the same epidemiological information as BadTrIP, but a different format of genetic data and a different model of genetic evolution. In fact, like most coalescent-based approaches, SCOTTI requires a full haplotype to be given for each sample; in these simulations we used the consensus sequence of a sample as its haplotype for SCOTTI, discarding within-sample genetic variation. The fact that SCOTTI has strictly less genetic information available than BadTrIP can explain why generally it has less accuracy and is more conservative, however it is not the only factor at play, another being recombination. For example in the scenario with 1x coverage BadTrIP seems to have higher accuracy than SCOTTI, despite the two methods having the same information available: this can be explained with the fact the SCOTTI wrongly assumes that there is no recombination. Similarly, the simulations suggest that the accuracy gap between SCOTTI and BadTrIP reduces with no recombination, and increase at high recombination: this fits well with the fact BadTrIP assumes no linkage between genomic positions, while SCOTTI assumes complete linkage (no recombination). While these results are very suggestive and fit with our expectations, we also have to warn that for each individual scenario we have 10-20 simulated outbreaks, so while the general patterns are clear, the specific patterns of each scenario are subject to considerable uncertainty.
Comparing the base scenario with the one with almost no mutation, we see that BadTrIP accuracy drops from about 80% to about 50%; this drop hints to the contribution given by genetic data to the inference of transmission. Also, since in the latter scenario almost no genetic information is available, it also suggests what is the contribution of epidemiological information alone. Calibration of BadTrIP seems to increase as mutation rate decreases, one probable contributing factor being that as mutation rate decreases the effect of genetic linkage on the pathogen evolutionary dynamics decreases (neither method models genetic linkage), or possibly as a result of the increased uncertainty on the evolutionary process. The complete We represent accuracy as the frequency with which the correct simulated transmission event is more likely a posteriori than the alternatives. B) Calibration is the frequency with which the correct transmission event is in the 95% posterior credible set (the minimum set of sources with cumulative probability ! 95% such that all sources in the set have higher posterior probability than all sources outside of it). Bars represent percentages (from 0, worst, to 100, best) for BadTrIP (red), SCOTTI [13] (yellow) and the shared variants-based clustering (SVC) approach [30] (blue). On the x axis are different simulation scenarios with the first one, "base", being the basic simulation scenario with 10-15 cases per outbreak, about 300-500 SNPs among all hosts, recombination 10 times stronger than mutation, complete bottleneck (no transmission of within-host genetic variants), read coverage of 40x, PoMo virtual population size of 15, actual pathogen population size of 1000, and genome size of 5 kb. All other scenarios are obtained from the base one changing one or two parameters: in "no recombination" the recombination rate is set to 0; in "high recombination" the recombination rate is 10 times higher; in "high mutation" the mutation rate is 10 times higher resulting in 2000-3000 SNPs per outbreak; in "low mutation" the mutation rate is 10 times lower resulting in 30-50 SNPs per outbreak; in "very low mutation" the mutation rate is 1000 times lower, resulting in 0-1 SNPs per outbreak; in "weak bottleneck" at transmission 5 pathogen units from the infector colonised the infected host, instead of just 1; in "high rec. weak bott." both the recombination rate is 10 times higher and the founding population at transmission is made of 5 pathogen particles; in "high coverage" read coverage in sequencing is 100x instead of 40x; in "1x coverage" read coverage in sequencing is 1x instead of 40x; in "sequencing error" 0.2% of read bases are randomly modified to simulate sequencing error, coverage is reduced to 20x, and genome size is reduced to 1kb; in "high N" the PoMo virtual population size is 25 instead of 15. https://doi.org/10.1371/journal.pcbi.1006117.g003 Transmission reconstruction from genomic variants absence of recombination seems to negatively affect calibration in BadTrIP, but the difference is not dramatic (from about 90% to about 80%) suggesting that even in the worst case scenario of complete absence of recombination BadTrIP can still provide meaningful inference and posterior distributions. Accuracy of all methods seems to decrease with decreasing mutation rate, as is expected because of the reduced genetic information. However, very high mutation rates (to the point that about half the genome, of length 5kb, is polymorphic within the outbreak) do not seem to improve inference, probably because of saturation.
Accuracy of BadTrIP seems higher (around 10% difference) in the presence of a strong bottleneck (small inoculum) than a weak bottleneck (large inoculum), while calibration seems almost unaffected; this probably happens because, with strong bottlenecks, polymorphisms are unlikely shared between hosts, and so polymorphisms leading to substitutions (see Fig 1B) become more informative for identifying infectors. An increase in coverage (from 40x to 100x) does not seem to bring improvement in accuracy or calibration to BadTrIP; on the other hand, when a single uniform colony is sequenced (which is equivalent to reducing coverage to 1x, and therefore removing information on within-host genetic variation), accuracy seems moderately reduced (% 10%) but not calibration. Introducing sequencing error (0.2% of mis-called bases, slightly more than what typical for high-throughput DNA sequencing [45]) accompanied by reduced coverage (20x) and genome length (1kb) still seems to result in elevated accuracy (72.5%) and calibration (97.5%). Increasing the PoMo virtual population size (from 15 to 25, while the actual simulated population size remains 1000) showed negligible effects on the inference.
BadTrIP also infers the time of infection. Calibration seems to increase with recombination, and to decrease with mutation (Fig 4), probably again an effect of our assumption of no linkage. Also, very high mutation rates seem to reduce the error in time inference, as do high coverage and virtual population size.
The running time of BadTrIP is affected by the number of genetic variants present in the alignment and by the number of hosts present in the outbreak (Fig A in S1 Text). The number of variants affect the number of likelihoods that need to be calculated at each MCMC step, while the number of hosts affects the size of the transmission/population tree (so both the computational and statistical complexity of BadTrIP). However, the time required to complete an analysis is not always a linear function of these two quantities: at low mutation rates Bad-TrIP requires similar times for different outbreak sizes. The reason is probably that with less data there is more uncertainty (in particular in the posterior distribution of the mutation rate), and so it takes longer to explore the the parameter space effectively. Overall, it takes a few hours to completely investigate an outbreak of moderate size (one or two dozen hosts) with BadTrIP.

Analysis of the early 2014 Ebola outbreak in Sierra Leone
To demonstrate the applicability of BadTrIP and the advantage of using a model that combines epidemiological and within-sample genetic variation data, we use BadTrIP to infer transmission within the early cases of the 2014 Ebola outbreak in Sierra Leone. We use data published by Gire and colleagues [40] and previously analysed with the SVC method by Worby and colleagues [30]. One of the factors that make this dataset important to this study is the presence of within-host variants shared by multiple hosts, with one genetic variant that was even shared by eleven hosts [40]. While classical approaches based on consensus sequences would struggle to accommodate such data, in particular due to their assumption of strong transmission bottleneck that would not allow the transmission of variants, BadTrIP can accommodate such features, and such shared genomic variants are expected to increase the resolution of our transmission history inference. We investigate a collection of 62 samples with associated time and location of sampling. As observed by previous researchers, the number of substitutions (and more generally the number of SNPs) within this partial outbreak is very limited, and as such we expect to see a lot of uncertainty in the inference [30]; furthermore, all the samples were collected over a time interval of two months, and we assume transmission from a host to be possible from three weeks prior to three weeks following the sample collection, so the epidemiological data are also not very informative. Indeed, we see that most of the cases are inferred by BadTrIP to have a flat distribution of possible infectors, with highest per-infectee values generally under 30% posterior probability (Fig 5). However, we also see that BadTrIP identifies some pairs of infector-infectee with very high posterior probabilities (Fig B in S1 Text). These pairs not only generally fit with the geographical epidemiological data, with most transmission with posterior probability > 50% happening within chiefdoms (with two exceptions discussed later), but also with the SVC inference [30]. Of these, transmission from EM119 to G3770 was inferred by Worby and colleagues [30] using consensus sequence genetic distance, while transmission from EM096 to G3679, from G3826 to G3827, from G3820 to G3838, from EM110 to G3809, and from G3729 to G3795 was inferred with the help of shared within-host genetic variants. All highly likely transmission pairs in [30] are also inferred by BadTrIP, but there are some highly likely transmission events inferred by BadTrIP that were not detected by SVC. For example, transmission from G3834 to G3817 is inferred by BadTrIP and is supported by a 3% frequency variant within G3834 that becomes fixed in G3817; however, such a variant fixation, attributable to the transmission dynamics described in Fig 1B, is not informative in the SVC method [30] and was further ignored due to the imposition of a 5% variant frequency  Transmission reconstruction from genomic variants threshold that we could avoid thanks to our explicit model of sequence evolution and sequencing error. Other cases similar to the latter are the inferred transmissions from EM110 to G3856, from EM110 to G3822, and from EM111 to G3724.
Cross-chiefdom transmissions inferred by BadTrIP with elevated posterior distributions are from EM110 in the chiefdom of Jawie, district of Kailahun, to G3856 in the chiefdom of Nongowa, district of Kenema; and from G3834 in the chiefdom of Kpeje to G3817 in the chiefdom of Jawie, both in the district of Kailahun. Neither of them had a high probability in [30], but they are both supported by low-frequency variants becoming fixed in the recipient.
Our inference of the sequencing error rate is extremely low (2 Á 10 −7 < < 7 Á 10 −7 ) consistent with the thorough filtering steps adopted by Gire and colleagues [40] prior to withinhost variant calling.

Discussion
Methods to infer transmission histories within outbreaks are important to determine the causes of transmission, and to limit and prevent future outbreaks. Genomic pathogen data from an outbreak reveals in detail the genetic relatedness of pathogens from different cases. Most methods to infer transmission from pathogen genetic data require full haplotypes, but it is often not possible to reconstruct haplotypes due to pathogen recombination and short or inaccurate reads. This leads in many cases to discarding information regarding within-sample genetic diversity, and only use a sample consensus.
In recent years two methods have been proposed to infer transmission from genetic distances between samples and shared within-sample variants [30,32]. Here we presented Bad-TrIP, a Bayesian approach to transmission inference that makes use of within-sample variants and allows inference of transmission direction and time. Compared to other similar methods [30,32], our approach has the advantage of implementing an explicit model of pathogen population evolution, transmission and sequencing, of allowing the inclusion of epidemiological data (sampling times and host exposure times), of not requiring minimum thresholds for within-host variant frequencies, of accounting for sequencing errors, and of being implemented as part of an open source phylogenetic package (BEAST2 [39]). These aspects can result in more applicability, but also, as we have seen in our simulations, in greater accuracy. Compared to existing methods based on the coalescent (e.g. [13,24,25,28]) BadTrIP does not require the reconstruction of haplotypes and consensus sequences, but instead uses data of within-sample genetic variability, therefore having access to important information that can reveal otherwise cryptic transmission events.
Using simulations, we show that our approach achieves higher accuracy and calibration than SVC [30], has more accuracy than the coalescent-based method SCOTTI [13] used on consensus sequences of pathogen population genetic data, and can reliably identify likely transmission histories. The comparison between BadTrIP and SCOTTI is particularly interesting, because it shows us that reducing the genetic data of a within-host pathogen population to a single consensus sequence leads not only to the loss of within-host genetic diversity information, but can also lead to errors by ignoring recombination and weak transmission bottlenecks. Also, using a dataset of the early 2014 Ebola outbreak in Sierra Leone, and making use of information of within-sample variation and an explicit population evolution model, BadTrIP could infer previously unidentified likely transmission events, including transmissions between different geographic locations.
BadTrIP infers transmission from both epidemiological time data and pathogen genetic data. In most circumstances, both types of data are extremely useful, and we see in our simulations that removing genetic data information leads to a loss of % 30% accuracy, and similarly the epidemiological data is expected to provide % 40% accuracy (the baseline accuracy without any data is expected to be around 10% in our simulations). However, the contribution of the two types of data will be extremely dependent on the particular context at hand. As we showed in our simulations, BadTrIP can account for uninformative genetic data, with which it still provides meaningful inference. Our approach can however also account for uninformative epidemiological data: in the absence of exact dates, the user can specify arbitrarily large exposure intervals, allowing hosts to be infected any time by any host; as with the lack of genetic data, in this case we would also expect a significant drop in the accuracy of our method.
Despite these results, BadTrIP also has limitations, for example its model of genetic linkage. By assuming that all sites are unlinked, our model could be poorly calibrated in cases where there is no within-host recombination but high within-host mutation, causing strong correlations between inherited variants that are not expected in our model. However, we show in our simulations that our method is robust in a large variety of scenarios, including in the absence of recombination and with reads coming from few pathogen units. Another limitation is that our approach is generally not fast enough to deal with very large datasets, and, at the current stage, application is recommended to outbreaks with fewer than 100 cases. Also, BadTrIP is only applicable to the case where all hosts in the outbreak have been observed. In fact, our current implementation does not allow to infer the number of non-observed hosts (hosts for which there is no sample or epidemiological data). However, BadTrIP does allow to model non-sampled hosts with epidemiological data, or a fixed number of non-observed hosts (such hosts could be given uninformative epidemiological data, such as exposure intervals without ends). The assumption that all cases are observed or sampled is very common among transmission inference methods [11, 14-20, 23-26, 28], but it limits their applicability. Extending our method to infer the presence of possible non-sampled and non-observed intermediate hosts would be relatively straightforward and would increase the method's applicability, but it would also lead to a significant increase in the statistical complexity and computational demand (but see [13,27]).
Another scenario that is not accounted for in our model is multiple infections of the same host (one host being infected by multiple sources, or by the same source multiple times). This scenario can be relatively frequent in many viruses, for example HIV [46], but it is very hard to model in our context as it would require the use of a population network (see e.g. [47]) instead of a population phylogeny, which would make likelihood calculation more computationally demanding. Another similarly looking and equally concerning problem is sample contamination. We recommend sequencing data to be tested for possible contaminations and multiple infections using methods such as PHYLOSCANNER [48] prior to being investigated with Bad-TRiP. In our Ebola dataset we found no obvious pattern of mixed infection or contamination (like an excess of similar frequency SNPs in one sample). However, none of these approaches would detect multiple infections from closely related cases. BadTrIP uses a very simple model of sequencing error, only accounting for the two most common nucleotides at a given position and sample. This sequencing error model would probably have sub-optimal performance when sequencing error rate is high (e.g. with Nanopore sequencing technologies) and coverage is high or mutation rate is elevated. In these circumstances, a more realistic and computationally demanding model of sequencing error might be preferable. Similarly, our model of evolution only allows 2 alleles for one genome position in one host at one time. If mutation rates are so high that more than 2 alleles are frequently present simultaneously in the same host, time and position, then our model could have sub-optimal performance. However, our approach can still account for the more common scenario where a site has more than 2 alleles but not all in the same host: for example if at a certain position host 1 has a fixed A, host 2 has a polymorphism with A and C, and host 3 has a polymorphism with C and G.
BadTrIP does not account for selective pressure, which could sometimes cause errors, for example by creating homoplasies due to the same mutation appearing multiple times in different hosts, or by the same polymorphism being maintained by balancing selection. However, our approach weighs information from both fixed substitutions and polymorphic variants, so the same mutation appearing in different genetic backgrounds will not be as nearly as misleading as for the SVC method (which gives much more weight to shared variants than to genetic distances). We assume that within-host population sizes are constant after an initial expansion. Size fluctuations in all hosts are unlikely to cause problems, as the PoMo drift rate would in this case represent the average drift rate in hosts. On the other hand, if fluctuations only happen in certain hosts, so that different hosts have different average drift rates, it might have averse effects on the estimate of infection times.
As our model is implemented in BEAST2, it is possible to specify a broad range of models of genomic variation in substitution rates which could at least partly account for the effects of selection. An additional feature that could be added to BadTrIP is indel evolution. For example, by assuming an infinite sites mutation model, indel data could be reduced to 0-1 states, and a PoMo matrix with two alleles instead of 4 nucleotides could be used. This approach could be useful to complement SNP data, but would only work at relatively low indel rates.
Finally, it is possible that errors in the bioinformatic processing of reads, for example mapping errors, cause the identification of the same spurious genetic variants in multiple hosts. We therefore encourage the investigations of genetic variants shared by many hosts to assess their biological plausibility. In the future we will work to solve some of the limitations of Bad-TrIP, in particular to reduce its computational demand and to model non-sampled nonobserved hosts.
In conclusion, we have presented a new method that addresses the urgent need for software to efficiently and accurately analyse genomic and epidemiological data, in particular taking advantage of within-sample genetic variants to identify transmission pairs and reconstruct direction and time of infection. BadTrIP can be used in a broad range of outbreaks, and will be important for devising effective strategies to fight the spread of infectious disease.

Model of transmission
We model each host as a deme d 2 D that can be colonised by a pathogen population, with total number of hosts-demes being n D . Each deme d is associated with an exposure interval limited by an introduction time i d 2 (−1, +1] and a removal time r d 2 [−1, +1), with r d < i d (we consider time backward as typical in coalescent theory); the host only contributes to the outbreak within this interval, which is determined by the epidemiological data. In the least informative scenario where no information on host d exposure is provided, it is assumed that d is exposed for the whole outbreak (i d = +1 and r d = −1). We will denote as X the collection of exposure times.
Each host-deme starts off as non-colonised and is colonised (infected) at some time t d between i d and the time that the first sample is collected from d (if no sample is collected from d, then we require only t d > r d ). Also, unless d is the first host to be infected in the outbreak, d is infected by another host in the outbreak I d 6 ¼ d, such that r I d < t d < t I d , that is, d is infected after I d is infected, but before I d reaches its removal time. If d is indeed the first case of the outbreak, then I d is assigned the ; (we assume ; = 2 D). We assume for simplicity that transmission between any pair of hosts and at any time is equally likely, as long as it is consistent with the epidemiological data. A transmission event of host d at time t d is inconsistent with the epidemiological data if t d is outside the exposure interval of d or its infector I d , or if d is sampled, infects another host, before t d . Given the epidemiological data, some infector-infectee pairs are a priori more likely than others, depending on the length of time that a transmission between them is allowed.
Each host is also provided with a (possibly empty) set of samples, S d . Each sample s consists of a sampling time t s and genetic data G s . Each sample s in S d has to be collected after d is infected (t s < t d ) and before d is removed (t s > r d ). Assuming that the genome is L bases long, then the genetic data G s of every sample s has to be in the form of a list of L quadruples, with for example the quadruple for genome position i being G si = (a i , c i , g i , t i ), the four positive natural values being the numbers of A's, C's, G's and T's observed at position i in the sample. If there is no read mapping to position i in sample s, then its quadruple is simply G si = (0, 0, 0, 0). We denote the set of all sequencing data as G.
All hosts share a common parameter B (with real positive values) describing the intensity of the transmission bottlenecks associated with transmission events. Generally, the value of B can be inferred jointly with other model parameters, however its interpretation in terms of the size of the transmission inoculum is not straightforward. T denotes the transmission-population tree consisting of all sampling times, all infection times and all infectors of each host, and μ denotes the pathogen evolution model (described below). An example of tree T and of model parameters is given in Fig C in S1 Text.
We aim to sample from the following joint posterior distribution with a Monte Carlo Markov Chain approach: PðT; μ; BjG; XÞ / PðGjT; μ; BÞPðTjXÞPðμÞPðBÞ: ð1Þ P(μ) and P(B) are the prior probabilities for respectively the substitution model and the bottleneck size, which can be chosen arbitrarily by the user. We ignore the prior for the transmission tree P(T|X) as in [13]. P (G|T, μ, B) is the likelihood of the sequences given the genealogy and substitution model, and is calculated as described below, using an adaptation of [36][37][38] to transmission trees. So once we calculate the likelihood P (G|T, μ, B), we can use Eq 1 with an MCMC to infer a posterior distribution of infection times, infectors, bottleneck size and substitution model parameters.

Model of pathogen evolution
Here, we make use of a phylogenetic model for population evolution, PoMo [36][37][38], to model mutation and drift in the within-host pathogen populations; also, we extend the model to include transmission bottlenecks and sequencing errors. Sequence evolution is usually modelled along phylogenetic trees, which can differ from the transmission tree [13]. However, PoMo describes evolution along species (or population) trees, and the population tree of a pathogen within an outbreak corresponds to the transmission tree T described in the previous section. If we consider the pathogen community within a host d as a population, we see that this population exists from time of infection t d , when it originates from a split with the population of its infector I d . So, transmission events corresponds to timed splits in the population tree, similar to the bifurcations of a species tree. However, one difference is that the split is asymmetrical, as we assume that the pathogen population size is not affected at t d in I d , but at the start of the branch leading to d it undergoes a bottleneck of intensity B. All events in the tree are timed in real time (e.g., days) with some values fixed (for samples) and some values inferred in the MCMC (infection times).
We use a procedure very similar to the Felsenstein pruning algorithm [49] to calculate the likelihood of the genetic data over the tree. First of all, the substitution process along the branches of the transmission-population tree is not a simple DNA substitution process, but is similar to a 4-allelic Moran model [41] with mutation. We assume we have a continuous-time Markov process along each branch of the tree, where the state space is not made by the four nucleotides, as is typical, but by all 1-and 2-allelic states possible for a population of N units. Typical values of N that we use here are 15 or 25, that is, we describe evolution of a large within-host pathogen population (possibly with billions of units) with a small virtual withinhost population of N units. Such an approximation generally leads to reasonably good results as long as we rescale the mutation rates between the real and the virtual population [36][37][38]. N here is not estimated, but is fixed by the user. Lower values of N are expected to reduce the computational demand of the method, but can result in lower accuracy. The states of our Markov process always include the four fixed states, where only one nucleotide is present in the population. In addition, they also include six groups of polymorphic states, where two nucleotides are present in the virtual population at the same site at the same time. Each group corresponds to one of the six unordered pairs of nucleotides ({A, C}, {A, G}, {A, T}, {C, G}, {C, T}, {G, T}) and contains N − 1 states: if the two nucleotides present in the population are n 1 and n 2 , then such N − 1 states are the ones in which the population contains i times nucleotide n 1 and N − i times nucleotide n 2 , for 0 < i < N. So in total our state space is of size 4 + 6(N − 1). Our substitution rate matrix is sparse, in that we only allow one unit in the virtual population to change at the time. So, from a fixed state with nucleotide n 1 , a instantaneous move is only possible to one of the three states with N − 1 times nucleotide n 1 and one time any other nucleotide n 2 different from n 1 . Such moves correspond to mutation events, and we represent their rates as m n 1 ;n 2 . Instead, if we are already in a polymorphic state with i times nucleotide n 1 and N − i times nucleotide n 2 , we only allow nucleotide counts to instantaneously change by one, so an instantaneous move is only possible to the state with i + 1 times nucleotide n 1 and N − i − 1 times nucleotide n 2 , or to state i − 1 times nucleotide n 1 and N + 1 − i times nucleotide n 2 (one of these two latter states might be a fixed state). The instantaneous rate at which such changes happen is iðN À iÞ N 2 R which corresponds to the rate of genetic drift; here R scales the rate of drift in the virtual population in units of real time; the rate of drift in the virtual population also depends on N, and it represents the rate of drift in a real pathogen population, which in turn depends on the pathogen effective population size, the pathogen generation time, and the time unit. All other non-diagonal substitution rates are set to 0. All these states and rates constitute the substitution process E. The rate matrix is further described in Fig D in  S1 Text. Our model only allows 2 alleles to be present in one host at one time at one position. This can be unrealistic where mutation rates are extremely high, or selection favours several variants at the same site.
The likelihood of T is calculated starting from the hosts in the outbreaks who don't infect others (the leaves of the transmission tree). For such leaves, the likelihood is first calculated from the latest sample (if no sample is present, then the likelihood of such leaf at time of their transmission is 1 for every state). Given any state of our substitution process with nucleotides n 1 and n 2 with respectively abundances i and N − i in the virtual population (here for generality i can also be 0), given a sample and site at which the nucleotides with the highest coverage are x 1 with coverage c 1 , and x 2 with coverage c 2 (we ignore the nucleotides with lower counts for numerical stability, and in case of a tie random nucleotides are selected from the tying ones), then the likelihood of this state at this sample and site is approximated as: Where is a parameter describing the sequencing error rate. Here, due to sequencing errors and to random sampling of reads from the pathogen population, the observed alleles c 1 and c 2 are allowed be different from the alleles n 1 and n 2 in the virtual population. We assume that each read has the same probability to represent any of the individuals in the virtual population, and that there is a probability that the considered position of the read is a sequencing error (in which case any of the three wrong nucleotides is equally likely to be on the read).
Þ is the probability to see a n 1 nucleotide: the first part is the probability that the read comes from an individual in the virtual population with nucleotide n 1 at the given position and that no sequencing error happened; the right end part represents the probability that the virtual individual had a different nucleotide but there was a sequencing error. can be estimated with the other model parameters as we do with the real data and with the simulations including sequencing error. For all other simulations we set = 0. This sequencing model assumes that there are at most 2 alleles in the reads data for one sample at one position. If more than 2 alleles are observed, then only the counts from the 2 most common alleles are retained.
Along branches of T, the likelihood is updated using the matrix exponential of E. At bifurcations (corresponding either to internal samples or transmission events) the likelihood is also updated according to the classical pruning algorithm, but at transmission events an extra step is added. A new drift-only substitution matrix E D is defined by setting the mutation rates in E to 0. Then, we describe a bottleneck as a branch of length B along which the population evolves under drift alone, that is, under E D . The length B does not count toward the branch lengths in real time, so that changing the intensity of the bottleneck does not affect the timing of the events in T. Under this model, a more intense bottleneck, corresponding to a small transmission inoculum, will be represented by a longer bottleneck branch, so a larger B. If we have a transmission event from I d to d at time t d , we first calculate the likelihood within population I d up to right before time t d (likelihoods are updated backward in time), then within population d up to right before time t d , then we update the likelihood within d using the bottleneck branch, and finally we multiply the two likelihoods in d and I d to obtain the likelihood in I d right after t d (again backward in time). This backward-in-time likelihood update process is terminated after the transmission event of the index case, and before its bottleneck we assume state equilibrium frequencies. We now describe an example of likelihood calculation in Fig E in S1 Text.

MCMC proposals
We use typical BEAST2 scalar proposals for B, and E, which, given a constant s and a random uniform real number 0 < u < 1, propose to scale the given parameter by a factor of s + (1/s − s)u; the reciprocal of this factor is the Hastings ratio of the proposal. We also define below five new operators (proposals) for updating our transmission-population tree. We will now give a very informal intuition of why the above proposals make an irreducible MCMC. We will focus on the transmission history, and not on B, or E, but the extension is trivial. We will discuss intuitively how it is possible to move from any given tree T to a specific treeT (the tree we use as a starting point of the MCMC). As proposals are reversible, this is sufficient to have irreducibility.T is defined as the tree where the host with the earliest introduction time is the index case; each non-index host d inT is infected by the host I d with the earliest introduction time i I d among those with an exposure overlap to d; inT infection time t d of any host d is set to i d (for a more formal proof an infinitesimal interval after i d might be considered). Starting from T, we first approachT by moving host h, the index case inT , up from its starting position in T by using repeatedly operator four. At each step, before applying operator four, we use operator one to move t h up to make sure it is the first infectee of its infector, and that it is infected before the first sample of its infector is collected. Repeating these two steps long enough, h is guaranteed to become the root, at which point we can apply operator one to make sure its infection time is the same as inT . We then proceed to apply a similar strategy iteratively on all other hosts in order based on their introduction time (from earlier to latest). We stop when we obtainT .

Simulations of pathogen evolution
To test the accuracy of our new method BadTrIP in inferring transmission events, and to compare it with the SVC method [30], we simulated pathogen evolution within outbreaks and sample sequencing, and we used different methods to reconstruct the transmission history from sequencing and epidemiological data. To simulate pathogen evolution, first we simulated an outbreak using SEEDY [42] with a host population of 15 hosts and an infection rate of 0.1 per day, a recovery rate 0.07 per day, and conditionally accepting only outbreaks that achieve a minimum total of 10 infected cases. Given these parameters, SEEDY will start at time 0 with one infected individual in the community of 15 hosts. Each day every infected host has a 0.1 chance of infecting any other host, and a 0.07 chance of recovering (recovered hosts are no more infectious or infectable). If the outbreak runs out of infected hosts before a total of 10 hosts are infected, the simulation is repeated. We then took the outbreak simulated by SEEDY and translated the transmission history into a population history, assuming a within-host pathogen population size of 1000 and using fastsimcoal2 [43] to simulate pathogen coalescent, recombination and mutation with scenario-dependent parameters. fastSimCoal2 is an approximate coalescent simulator implementing the sequential Markov coalescent model [50,51] with cross-over recombination. This model describes viral recombination more appropriately than bacterial recombination, for which a coalescent simulation software modeling gene conversion is preferable [52,53]. The use of a coalescent simulator with recombination is also the main difference with the simulations made by [30], where within-host recombination was not allowed. Within each infection we assume that the population size is constantly 1000 individuals, but at the time of transmission we assume an instantaneous population bottleneck (founding population size either 1 or 5 individuals depending on the scenario). At the time of a transmission (simulated by SEEDY) the whole infectee population is, backward in time, merged with the infector population. We observed that some times, in particular at high recombination rates, fastSimCoal can crash: if this happens we simply repeat the fastSimCoal2 simulation with a different seed. Throughout all simulations each host was sampled exactly once.
We define a basic group of simulations (called "base"), and nine variants, in each of which one or two aspects of the base group of simulations is modified. In "base" we simulated about 300-500 SNPs (counting also variants present at very low frequency in just one host) or 45 substitutions per outbreak (which might be typical for HIV but high for many other pathogens), recombination rate 10 times higher than the mutation rate, complete bottlenecks (no transmission of within-host genetic variants), homogeneous read coverage of 40x, no sequencing error, PoMo virtual population size of 15, all equal mutation rates, and genome size of 5 kb. The eleven variant settings are: • no recombination-the recombination rate is set to 0.
• high recombination-the recombination rate is increased 10-fold.
• high mutation-the mutation rate is 10-fold higher resulting in 2000-3000 SNPs and about 385 substitutions per outbreak.
• low mutation-the mutation rate is 10-fold lower resulting in 30-50 SNPs and about 4-5 substitutions per outbreak.
• very low mutation-the mutation rate is 1000-fold lower, resulting in 0-1 SNPs and 0 substitutions per outbreak.
• weak bottleneck-at transmission, 5 pathogen particles from the infector colonise the infected host, instead of just 1.
• high recombination and weak bottleneck-the recombination rate is 10-fold higher and the founding population at transmission is made of 5 pathogen particles.
• high coverage-read coverage is higher (100x instead of 40x).
• sequencing error-read coverage is lower (20x instead of 40x), genome size is reduced (1kb instead of 5kb) and read bases are randomly modified to simulate sequencing error (0.2% of bases in reads are wrong).
• high N-the PoMo virtual population size is 25 instead of 15 (this only affects the BADTRIP inference and not the simulation itself).
We ran 10 replicates for all scenarios, and 20 for "base", "weak bottleneck" and "no recombination" (some scenarios are more computationally demanding due to the effect of recombination on coalescent simulations and of genetic diversity on transmission inference). For each repeat in each scenario we ran a completely different simulation with different seeds resulting in different transmission and coalescent histories, even when outrbeak or coalescent parameters do not change across scenarios. We ran the BadTrIP MCMC for 5 Á 10 5 steps for each replicate, sampled from the posterior every 100 steps and with a 20% burn-in. We specified in BadTrIP the true simulated sampling time and removal time of each host, while we specified as introduction time of each host its infection time minus one quarter of the mean duration of infection (so that the true infection time is contained within the exposure time of the host). For SCOTTI we used the same epidemiological data and options as for BadTrIP, except that we ran the MCMC for 2 Á 10 6 steps for each replicate. We did not allow unobserved cases in SCOTTI. We measured accuracy as the frequency with which the correct transmission source of each host is inferred by a method to be the most likely a posteriori. We also measured calibration as how often the correct transmission source is the the 95% posterior credible set (the minimum set of sources with cumulative probability ! 95% such that all sources in the set have higher posterior probability than all sources outside of it).
We also used the SVC method [30] to infer transmission from simulated data. This method consists of selecting, for each host d, the set of possible infectors as those cases with most shared variants with d, or, if d does not share variants with other hosts, the cases with the smallest consensus genetic distance from d. If a single possible infector is found, it is assigned 100% posterior probability, otherwise if multiple possible infectors are found they are assigned the same posterior probability. For example, if 4 cases all have 2 shared genetic variants with d, and all other cases have fewer than 2, than each of those 4 cases is assigned a posterior probability of 25% of infecting d. This is very different from BadTrIP, which always weighs the information from shared variants, genetic distances, and epidemiological data simultaneously from all cases. So, the 4 cases sharing 2 genetic variants with d can have very different posterior probabilities in BadTrIP of being infectors of d, depending on the other data. For example, if one of these 4 cases has very high genetic distance from d, or epidemiological data incompatible with a transmission to d, BadTrIP would infer very low (or null) probability of it being the infector of d.

The 2014 Sierra Leone Ebola dataset
We use sequencing and epidemiological data published by Gire and colleagues [40] and analysed by Worby and colleagues [30]. In particular, we use information from sampling dates, nucleotide frequencies and sequencing coverage. We specify the introduction date (removal date) of each host as its sampling date minus (plus) 21 days. This means that we allow each host to be infected at most 21 days before it being sampled, and to infect others at most 21 days after being sampled. We ran the BadTrIP MCMC until an effective sample size of 1000 was reached for each parameter and for the posterior probability (requiring % 3.5 million MCMC steps). to reduce the computational time required we subsampled the reads from each sample to obtain a per-base coverage of at most 100.

Software availability
BadTrIP is distributed as an open source package for the Bayesian phylogenetic software BEAST2 [39]. It can be downloaded from https://bitbucket.org/nicofmay/badtrip/ or via the BEAUti interface [54] of BEAST2.
Supporting information S1 Text. The supplementary text containing supplementary figures. (PDF) S1 Data. Contains the xml script to replicate our Ebola analysis with BadTrIP, an R script to run the SVC approach, and a python script to replicate our simulations.