New method to reconstruct phylogenetic and transmission trees with sequence data from infectious disease outbreaks

Whole-genome sequencing (WGS) of pathogens from host samples becomes more and more routine during infectious disease outbreaks. These data provide information on possible transmission events which can be used for further epidemiologic analyses, such as identification of risk factors for infectivity and transmission. However, the relationship between transmission events and WGS data is obscured by uncertainty arising from four largely unobserved processes: transmission, case observation, within-host pathogen dynamics and mutation. To properly resolve transmission events, these processes need to be taken into account. Recent years have seen much progress in theory and method development, but applications are tailored to specific datasets with matching model assumptions and code, or otherwise make simplifying assumptions that break up the dependency between the four processes. To obtain a method with wider applicability, we have developed a novel approach to reconstruct transmission trees with WGS data. Our approach combines elementary models for transmission, case observation, within-host pathogen dynamics, and mutation. We use Bayesian inference with MCMC for which we have designed novel proposal steps to efficiently traverse the posterior distribution, taking account of all unobserved processes at once. This allows for efficient sampling of transmission trees from the posterior distribution, and robust estimation of consensus transmission trees. We implemented the proposed method in a new R package phybreak. The method performs well in tests of both new and published simulated data. We apply the model to to five datasets on densely sampled infectious disease outbreaks, covering a wide range of epidemiological settings. Using only sampling times and sequences as data, our analyses confirmed the original results or improved on them: the more realistic infection times place more confidence in the inferred transmission trees. Author Summary It is becoming easier and cheaper to obtain whole genome sequences of pathogen samples during outbreaks of infectious diseases. If all hosts during an outbreak are sampled, and these samples are sequenced, the small differences between the sequences (single nucleotide polymorphisms, SNPs) give information on the transmission tree, i.e. who infected whom, and when. However, correctly inferring this tree is not straightforward, because SNPs arise from unobserved processes including infection events, as well as pathogen growth and mutation within the hosts. Several methods have been developed in recent years, but none so generic and easily accessible that it can easily be applied to new settings and datasets. We have developed a new model and method to infer transmission trees without putting prior limiting constraints on the order of unobserved events. The method is easily accessible in an R package implementation. We show that the method performs well on new and previously published simulated data. We illustrate applicability to a wide range of infectious diseases and settings by analysing five published datasets on densely sampled infectious disease outbreaks, confirming or improving the original results.

datasets with matching model assumptions and code, or otherwise make simplifying assumptions 23 that break up the dependency between the four processes. To obtain a method with wider 24 applicability, we have developed a novel approach to reconstruct transmission trees with WGS 25 data. Our approach combines elementary models for transmission, case observation, within-host 26 pathogen dynamics, and mutation. We use Bayesian inference with MCMC for which we have 27 designed novel proposal steps to efficiently traverse the posterior distribution, taking account of 28 all unobserved processes at once. This allows for efficient sampling of transmission trees from 29 the posterior distribution, and robust estimation of consensus transmission trees. We 30 implemented the proposed method in a new R package phybreak. The method performs well in 31 tests of both new and published simulated data. We apply the model to to five datasets on 32 densely sampled infectious disease outbreaks, covering a wide range of epidemiological settings. 33 Using only sampling times and sequences as data, our analyses confirmed the original results or 34 improved on them: the more realistic infection times place more confidence in the inferred 35 transmission trees. 36 Introduction 53 As sequencing technology becomes easier and cheaper, detailed outbreak investigation 54 increasingly involves whole-genome sequencing (WGS) of pathogens from host samples [1]. 55 These sequences can be used for studies ranging from virulence or resistance related to particular 56 genes [1,2], to the interaction of epidemiological, immunological and evolutionary processes on 57 the scale of populations [3,4]. If most or all hosts in an outbreak are sampled, it is also possible 58 to use differences in nucleotides, i.e. single-nucleotide polymorphisms (SNPs), to resolve 59 transmission clusters, individual transmission events, or complete transmission trees. With that 60 information it becomes possible to identify high risk contacts and superspreaders, as well as 61 characteristics of hosts or contacts that are associated with infectiousness and transmission [5,6]. 62 Much progress has been made in recent years in theory and model development, but existing 63 methods typically include assumptions to address specific datasets, with fit-for-purpose code for 64 data analysis. An easily accessible method with the flexibility to cover a wide range of infections 65 is currently lacking, and would bring analysis of outbreak sequence data within reach of a much 66 broader community. 67 The interest in easily applicable methods for sequence data analysis in outbreak settings 68 is demonstrated by the community's widespread use of the Outbreaker package in R [7-10]. 69 However, the model in Outbreaker assumes that mutations occur at the time of transmission, 70 which does not take the pathogen's in-host population dynamics into account, nor the fact that 71 mutations occur within hosts. The publications by Didelot et al [11] and Ypma et al [12] revealed 72 that within-host evolution is crucial to relate sequence data to transmission trees, as is illustrated 73 in Fig 1A: there are four unobserved processes, i.e. the time between subsequent infections, the 74 time between infection and sampling, the pathogen dynamics within hosts, and mutation. The 75 difference in sequences between host b and infector a result from all of these processes. As a 76 result, a host's sample can have different SNPs from his infector's ( Fig 1B: hosts a and b); a host 77 can even be sampled earlier than his infector with fewer SNPs (Fig 1B: hosts a and c). topology, and accounted for unobserved hosts by a sampling model (which is an additional 93 complication). This two-step approach is likely to work better if the phylogenetic tree is properly 94 resolved (unique sequences with many SNPs), but less so if there is uncertainty in the 95 phylogenetic tree. However, also in that case construction of the phylogenetic tree is done 96 without taking into account that only lineages in the same host can coalesce, and that these go 97 through transmission bottlenecks during the whole outbreak. That is likely to result in incorrect 98 branch lengths and consequently incorrect infection times. 99 Hall and Rambaut [19] implemented a method in BEAST for simultaneous inference of 100 transmission and phylogenetic trees. BEAST allows for much flexibility when it comes to 101 phylogeny and population dynamics reconstruction (for which it was originally developed [17,102 We built further on that principle to reconstruct the transmission trees of outbreaks, in a 126 new model and estimation method. The method requires data on pathogen sequences and 127 sampling times. The model includes all four underlying stochastic processes (Fig 1A), each 128 described in a flexible and generic way, such that we avoid putting unnecessary prior constraints 129 on the order of unobserved events (Fig 1B) between a primary and a secondary case. The model contains two parameters: the 159 shape a G , which we fixed at 3 during our analyses, and the mean m G , which is 160 estimated and has a prior distribution with mean μ G and standard deviation σ G . In an 161 uninformative analysis, μ G = 1, and σ G = ∞. 162 -sampling: the sampling model consists of a Gamma distribution for the sampling 163 interval, which is the interval between infection and sampling of a case. The model 164 contains two parameters: the shape a S , which is fixed during the analysis, and the 165 mean m S , which is estimated and has a prior distribution with mean μ S and standard 166 deviation σ S . In an uninformative analysis, μ S = 1, and σ S = ∞; in a naïve analysis we 167 additionally set a S = 3. 168 -within-host dynamics: The within-host model describes a linearly increasing 169 pathogen population size ( ) τ τ = w r , at time τ since infection of a host. The slope r 170 has a Gamma distributed prior distribution with shape a r and rate b r . In an 171 uninformative analysis, a r = b r = 1. 172 -The mutation model is a site-homogeneous Jukes-Cantor model, with per-site 173 mutation rate μ. The prior distribution for log(μ) is uniform. 174

Analysis of the newly simulated datasets 175
We generated new simulated datasets were generated with the above model, in a 176 population of 86 individuals and a basic reproduction number R 0 = 1.5, to obtain 25 datasets of 177 50 cases. Parameters were a G = a S = 10, m G = m S = r = 1, μ = 10 -4 and sequence length 10 4 , 178 resulting in 1 genome-wide mutation per mean generation interval of one year. 179  Results). Using prior information on the mean sampling interval did not improve on this, but if 209 prior information is incorrect, fewer hosts have a strongly supported infector, which makes 210 inference more uncertain. In conclusion, the method is fast and efficient if applied to simulated 211 data fitting the model. In that case, no informative priors are needed for transmission tree 212 inference. 213

Analysis of previously published simulated data 214
We applied the method to previously published outbreak simulations [19]. Briefly, a 215 spatial susceptible-exposed-infectious-recovered (SEIR) model was simulated in a population of 216 50 farms, with a latent period (exposed) of two days and a random infectious period with mean 217 10 days and standard deviation 1 day, at the end of which the farm was sampled. Two mutation 218 rates were used, either Slow Clock or Fast Clock, equivalent to 1 or 50 genome-wide mutations 219 per generation interval of one week, respectively. 220 Table 2 shows performance of the method with naïve and informative prior information 221 on the sampling interval distribution (see S1 Results for uninformative). Effective sample sizes 222 are a bit smaller than with the new simulations, but in most cases still good for infection times, 223 whereas sampling of infectors was excellent. The low variance of the sampling interval 224 distribution caused some problems in efficient sampling of m S because of its high correlation 225 with the associated infection times. This is best seen in the ESS of m S and infection times in the 226 uninformative Slow Clock analysis (S1 Results), but it also causes problems in the burn-in phase 227 if inference starts with parameter values far from their actual values (not shown). Posterior 228 median mutation rates are slightly higher than used for simulation, which could be due to 229 different rates for transition and transversion in the simulation model [19]. 230 The results for the four smaller datasets are shown in Table 3, which shows that mixing 253 of the MCMC chains was generally good.    The main difference compared to the original analysis lies in the shape of the phylogenetic tree 280 and the estimated infection times (Fig 3a). Whereas the topology is very similar, the timing of 281 the branching events is different: in the original tree, internal branches decrease in length when 282 going from root to tips, consistent with a coalescent tree based on a single panmictic population. 283 By taking into account the fact that coalescent events take place within individual hosts, our 284 analysis shows branch lengths that are more balanced in length across the tree. Importantly, this 285 results in a more recent dating of root of the tree: early 2008 (Fig 3a) instead of early 2004 [11]. 286 The MRSA data were analysed with an informative prior for the mean sampling interval 287 m S and a shape parameter a S based on data on the intervals between hospitalisation and the first 288 positive sample. The estimated mutation rate is similar to literature estimates [31, 32], but the 289 posterior median m S of 30 days is considerably higher than the prior mean of 15 days (Table 3). 290 This may be explained by the two health-care workers (HCW_A and HCW_B), which have very 291 long posterior sampling intervals that were not part of the data informing the prior (Edmonds' 292 consensus tree, Fig 2b). In contrast with the original analysis, we now identify a transmission 293 tree rather than only a phylogenetic tree, resulting in the observation that the two health-care 294 workers may not have infected any patient in spite of their long infection-to-sampling interval. 295 Almost all transmission events with low support (<20%) involved unsequenced hosts. Three of 296 them were identified as possible infector, in the initial stage of the outbreak, when only few 297 samples were sequenced. This indicates that some unsequenced hosts may have played a role in 298 transmission, but that it is not clear which. Finally, a major difference between our results and 299 those in the original publication is the shape of the phylogenetic tree and the dating of the tree 300 root: around 1st January (Fig 3b) instead of 1st September the year before [21]. 301 Analysis of the FMD2001 and FMD2007 datasets resulted in posterior sampling intervals 302 with means of 13 and 9 days, respectively, close to the 8.5 days estimated from epidemic data 303 [33] . Generation intervals were about the same (Table 3). Both datasets contained more SNPs 304 than the Mtb and MRSA data, with unique sequences for each host and higher mutation rates, 305 similar to published rates in FMD outbreak clusters [34]. This resulted in equal Edmonds' and 306 MPC consensus transmission trees, and much higher support for most infectors (Figs 2cd, 3cd). 307 Our transmission tree is almost identical to the one from Ypma et al [12], who used a closely 308 related method but did not allow for transmission after sampling. When comparing to the 309 analysis of these data by Morelli et al [24], the topologies of the phylogenetic trees (Fig 3cd)  The H7N7 dataset was analysed with the sequences of the three genes HA, NA, and PB2 317 separately, and combined; with informative priors for both the mean sampling and mean 318 generation intervals. Five parallel chains were run, and mixing was generally good (Table 4); it 319 took about 7 hours on a 2.6GHz CPU to obtain 25,000 unthinned samples in a single chain. 320 Analysis of the three genes combined resulted in a posterior median m S of 8.5 days, slightly 321 longer than the 7 days on which the informative prior was based [35], and longer than in the 322 separate analyses. The mean generation time was shorter than the prior mean: 3.9 days with all 323 genes. We also calculated the parsimony scores of the posterior sampled trees, defined as the 324 minimum numbers of mutations on the trees that can explain the sequence data [36], and 325 compared these with the numbers of SNPs in the data (Table 4). It appeared that with the genes 326 separately analysed, parsimony scores were 6-13% higher than the numbers of SNPs, indicating 327 some homoplasy in the phylogenetic trees (this was not seen with any of the other datasets). The 328 analysis of all genes together resulted in parsimony scores of 18% higher than the number of 329 SNPs. The estimated mutation rates are among the highest estimates for Avian Influenza Virus, 330 as already noted before in earlier analyses of the same data [25,37]. Fig 4 shows  to enter the data, to run the MCMC chain, and to analyse the output, e.g. by making consensus 359 trees and plotting these (as in Fig 3). for the interval between infection and sampling, has a direct link to inferred infection times, and 374 is the model for which it is most likely that prior information is available from epidemiological 375 data in the same or other outbreaks. We used simulated data to study the effect of uninformative 376 or incorrect prior information on shape parameter a S and mean m S . It appears that an incorrect a S 377 or an incorrect informative prior for m S does reduce accuracy of inferred infection times. 378 However, consensus trees are hardly affected, at least if the number of SNPs is in the order of the 379 number of hosts as we saw in the actual datasets (Table 1 and Table 2 Slow Clock). Only the 380 precision of consensus trees is reduced, i.e. there are fewer inferred infectors with high support. 381 Results with the Fast Clock simulations did show a significant reduction in consensus tree 382 accuracy. In that case, there are so many SNPs that the phylogenetic tree topology and times of 383 coalescent nodes are almost fixed; then, too much variance in sampling intervals (low a S ) results 384 in many incorrect placements of infection events on that tree. Possibly, with so many SNPs it 385 could be more efficient to first make the phylogenetic tree, and then use that tree to infer 386 transmission events [11, 16], but it is questionable whether genome-wide mutation rates are ever 387 so high that this becomes a real issue [38]. 388 The submodel for transmission is relevant for transmission tree inference in describing 389 the times between subsequent infection events. Transmission is modelled as a homogeneous 390 branching process, implicitly assuming that there was a small outbreak in a large population, 391 with a reproduction number (mean number of secondary cases per primary case) of 1. Our 392 approach assumes that everyone in the outbreak was known, which is a potential limitation, as would then be possible to test for increased infectivity of the health-care workers, or to test for 465 differences in transmissibility in the three wards. In general, the use of additional host data 466 would make dealing with hosts for which a sequence is not available less problematic: the 467 method currently can include these hosts, but without additional data their role is unclear and 468 they are often placed at the end of transmission chains in consensus trees (Fig 2b, Fig 3). (2) 525 We now introduce the four models, the associated likelihoods, and prior distributions for 526 associated parameters. 527 Transmission. We assume that the outbreak started with a single case. Each case produced 528 secondary cases at random generation intervals after their own infection (Gamma distribution 529 with shape a G and mean m G ). We consider that all untimed transmission tree topologies are 530 equally likely, so that the probability of the transmission tree only depends on its timing. The (3) 535 Sampling. We assume that all cases are observed and sampled at random times after they were 536 infected, according to a Gamma distribution with shape a S and mean m S . Transmission and 537 sampling are independent, so transmission can take place after sampling, and a case can be 538 sampled earlier than its infector. The likelihood is the product of probability densities of all 539 sampling intervals in the outbreak: 540  increases by 1 at each coalescent node and decreases by 1 at each transmission event and at 568 sampling. The likelihood for each mini-tree can then be written as 569 The first term is the probability to have no coalescent events during the 571 intervals in which there are two or more lineages, the second term is the product of coalescent 572 rates at the coalescent nodes. 573 Mutation. We use a single fixed mutation rate μ for all sites, with mutation resulting in any of 574 the four nucleotides with equal probability (Jukes-Cantor). This parameterisation means that the 575 effective rate of nucleotide change is 0.75μ. Given the phylogenetic tree, this results in the 576 likelihood: 577 Here, we multiply over all coalescent and transmission nodes x, which occur at time t

Inference method 590
We use Bayesian statistics to infer transmission trees and estimate the model parameters 591 from the data, and MCMC methods to obtain samples from the posterior distribution. The 592 procedure is implemented as a package in R (phybreak), which can be downloaded from GitHub 593 (www.github.com/donkeyshot/phybreak). The package also contains functions to simulate data, 594 and to summarize the MCMC output. 595 The main novelty of our method lies in the proposal steps for the phylogenetic and 596 transmission trees, used to generate the MCMC chain. It makes use of the hierarchical tree 597 perspective, in which the phylogenetic tree is described as a collection of phylogenetic mini-trees 598  proposal, the topology of the transmission tree is changed (see below), and in most cases the 633 phylogenetic tree as well (80% probability). However, we also allowed for proposal steps 634 without changing the phylogenetic tree (20% probability); this greatly improves mixing of the 635 MCMC chain if there are many SNPs, which more or less fixes the phylogenetic tree topology. 636 The 80%-20% distribution for the two types of proposal was not optimized but chosen such that 637 mixing of the phylogenetic tree is only limitedly less efficient than without the second type of 638 proposal (keeping the phylogenetic tree fixed). 639

Proposals for changes in transmission and phylogenetic trees.
Here we describe how changes 640 in the transmission and phylogenetic trees are proposed for six different situations, based on the 641 preliminary proposal for the infection time I i ' and on whether the index case is involved. Fig 5  642 shows the proposed changes. More detail on the proposal distribution and calculation of 643 acceptance probability is given in the S2 Methods. 644 A. The focal host i is index case, and the preliminary I i ' is before the first transmission 645 event. In that case, the infection time of host i becomes I i ', and no topological changes 646 are made in the transmission tree (Fig 5A). 647 B. The focal host i is index case, and the preliminary I i ' is after the first transmission event, 648 but before host i's second transmission event, if there is any. In that case, the infection 649 time of host i becomes I i ', and host i's first infectee becomes index case, transmitting to i 650 ( Fig 5B). 651 C. The focal host i is index case, and the preliminary I i ' is after host i's second transmission 652 event, if there is any. In that case, the infection times of host i and its first infectee are 653 switched, and host i's first infectee becomes index case. They may or may not exchange 654 infectees, with 50% probability (Fig 5C). 655 D. The focal host i is not index case, and the preliminary I i ' is before infection of the index 656 case. In that case, the infection time of host i becomes I i ', and host i becomes index case, 657 transmitting to the original index case (Fig 5D). 658 E. The focal host i is not index case, and the preliminary I i ' is after infection of the index 659 case, but before host i's first transmission event. In that case, the infection time of host i 660 becomes I i ', and a new infector is proposed from all hosts infected before I i ' (Fig 5E). 661 F. The focal host i is not index case, and the preliminary I i ' is after host i's first transmission 662 event. In that case, the infection times of host i and its first infectee are switched, as well 663 as their position in the transmission tree. They may or may not exchange infectees, with 664 50% probability (Fig 5F). the focal host is not the index case, and I i ' is before MRCA I,II of the focal host and his infector 717 (the index case), and I j ' is after MRCA I,II ; (C) the focal host is not the index case, and I i ' is 718 before MRCA II,III of the focal host and his infector, but after the MRCA I,III of the focal host and 719 his infector's infector; also, I j ' is after MRCA II,III . 720

Evaluating the method 721
We took three approaches to evaluate our method: analysis of newly simulated data, 722 analysis of published simulated data [19], and analysis of published observed data. When not 723 specified, the following parameter settings and priors were used: shape parameters for sampling 724 and generation interval distributions a S = a G = 3, uninformative priors for mean sampling and 725 generation intervals with μ S = μ G = 1 and σ S = σ G = ∞, and an uninformative prior for within-host 726 growth parameter r with a r = b r = 1. The prior for log(μ) (mutation rate) is always uniform. Simulations were repeated until 25 outbreaks were obtained of the desired size. 747 Given the infection times, sampling times were drawn, and phylogenetic mini-trees were 748 simulated for each host. These were combined into one phylogenetic tree on which random 749 mutation events were placed according to a Poisson process with rate 1. Each mutation event was 750 randomly assigned to one site, and generated one of the four nucleotides with equal probabilities 751 (reducing the effective mutation rate by 25%). By giving the root an arbitrary sequence, the 752 sampled sequences were obtained by following the paths from root to sample and changing the 753 nucleotides at the mutation events. can be found. Briefly summarizing some characteristics, 50 hosts were placed on a grid and a 766 spatial transmission model was run, with exponential transmission kernel. Outbreaks with fewer 767 than 45 cases were discarded. An SEIR (susceptible -exposed -infectious -removed) 768 transmission model was used, with fixed latent period of 2 days and normally distributed 769 infectious period (mean(sd) of 10(1) days). Sampling occurred at the time of removal. 770 Phylogenetic mini-trees were simulated using a logistic within-host growth model 771 [35], from which the mean and standard deviation of the sampling interval was calculated (7.0 788 and 2.2 days). We used this to inform the sampling intervals with: a S = 10, μ S = 7, σ S = 0.5. 789 Because transmission after culling is not possible, we also used a weak informative prior for the 790 mean generation interval: μ G = 5, σ G = 1. Two methods were used to make consensus transmission tree topologies (who infected 806 whom), both based on the frequencies of infectors for each host among the 50,000 posterior 807 trees. The support of host j being the infector of host i is defined as the proportion of posterior 808 trees in which host i infected host j. The first consensus tree is the maximum parent credibility 809 (MPC) tree [19], which is the tree among all posterior trees that has the highest product of 810 infector supports. The second consensus tree is the tree constructed using an adaptation of 811 Edmond's algorithm, which starts by taking the infector with highest support for each host, and 812 resolves cycles if there are any [20]. Because the actual algorithm requires prior choice of an 813 index case, we adapted it by repeating the algorithm for all supported index cases, and selecting 814 the tree with highest sum of posterior supports (the measure used in the algorithm itself). 815 Posterior infection times were summarized either outside the context of a consensus tree, 816 i.e. based on all MCMC samples, or for a particular consensus tree, i.e. for each host based only 817 on those samples in which the infector was the consensus infector. The latter is to improve 818 consistency between topology and infection times, although even then consistency is not 819 guaranteed. For plotting transmission trees only, we used the Edmond's consensus tree; for 820 plotting transmission and phylogenetic trees together, we used the MPC consensus tree, which 821 comes with a consistent phylogenetic tree because it is one of the sampled trees. 822 In panels A, B, D, and E, the initial situation is at the top, and the proposal below. In panels C 1024 and F, the initial situation is in the middle, and two alternative proposal above and below. Every 1025