## Figures

## Abstract

The use of genetic data to reconstruct the transmission tree of infectious disease epidemics and outbreaks has been the subject of an increasing number of studies, but previous approaches have usually either made assumptions that are not fully compatible with phylogenetic inference, or, where they have based inference on a phylogeny, have employed a procedure that requires this tree to be fixed. At the same time, the coalescent-based models of the pathogen population that are employed in the methods usually used for time-resolved phylogeny reconstruction are a considerable simplification of epidemic process, as they assume that pathogen lineages mix freely. Here, we contribute a new method that is simultaneously a phylogeny reconstruction method for isolates taken from an epidemic, and a procedure for transmission tree reconstruction. We observe that, if one or more samples is taken from each host in an epidemic or outbreak and these are used to build a phylogeny, a transmission tree is equivalent to a partition of the set of nodes of this phylogeny, such that each partition element is a set of nodes that is connected in the full tree and contains all the tips corresponding to samples taken from one and only one host. We then implement a Monte Carlo Markov Chain (MCMC) procedure for simultaneous sampling from the spaces of both trees, utilising a newly-designed set of phylogenetic tree proposals that also respect node partitions. We calculate the posterior probability of these partitioned trees based on a model that acknowledges the population structure of an epidemic by employing an individual-based disease transmission model and a coalescent process taking place within each host. We demonstrate our method, first using simulated data, and then with sequences taken from the H7N7 avian influenza outbreak that occurred in the Netherlands in 2003. We show that it is superior to established coalescent methods for reconstructing the topology and node heights of the phylogeny and performs well for transmission tree reconstruction when the phylogeny is well-resolved by the genetic data, but caution that this will often not be the case in practice and that existing genetic and epidemiological data should be used to configure such analyses whenever possible. This method is available for use by the research community as part of BEAST, one of the most widely-used packages for reconstruction of dated phylogenies.

## Author Summary

With sequence data becoming available in increasing high volumes and at decreasing costs, there has been substantial recent interest in the possibility of using pathogen genome sequences as a means to retrace the spread of disease amongst the infected hosts in an epidemic. While several such methods exist, many of them are not fully compatible with phylogenetic inference, which is the most commonly-used methodology for exploring the ancestry of the isolates represented by a set of sequences. Procedures using phylogenetics as a basis have either taken a single, fixed phylogenetic tree as input, or have been quite narrow in scope and not available in any current package for general use. For their part, standard phylogenetic methods usually assume a model of the pathogen population that is overly simplistic for the situation in an epidemic. Here, we bridge the gap by introducing a new, highly flexible method, implemented in the publicly-available BEAST package, which simultaneously reconstructs the transmission history of an epidemic and the phylogeny for samples taken from it. We apply the procedure to simulated data and to sequences from the 2003 H7N7 avian influenza outbreak in the Netherlands.

**Citation: **Hall M, Woolhouse M, Rambaut A (2015) Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set. PLoS Comput Biol 11(12):
e1004613.
https://doi.org/10.1371/journal.pcbi.1004613

**Editor: **Marcel Salathé,
Ecole Polytechnique Federale de Lausanne, SWITZERLAND

**Received: **February 15, 2015; **Accepted: **October 17, 2015; **Published: ** December 30, 2015

**Copyright: ** © 2015 Hall et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

**Data Availability: **All simulated data are included in the XML input files under Supporting Information. Sequence data for the H7N7 outbreak is available from the GISAID database (accession numbers EPI_ISL_67934, EPI_ISL_68268-68352, EPI_ISL_82373-82470, and EPI_ISL_83984-84031). The distance matrix of farm locations is available on request for researchers who meet the criteria to access the data (gertjan.boender@wur.nl).

**Funding: **MH was supported by a Ph.D. studentship from the Scottish Government-funded EPIC (Epidemiology, Population Health and Infectious Disease Control) program under grant number EPIC II RA1651 544MWX (http://epicscotland.org/). AR has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 278433-PREDEMICS and ERC Grant agreement no. 260864 (http://cordis.europa.eu/fp7/home_en.html). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

The increasing availability of faster and cheaper sequencing technologies is making it possible to acquire genetic data on the pathogens involved in outbreaks and epidemics at a very fine resolution. It is likely that in future outbreaks where most or all clinical cases can be identified, pathogen nucleotide sequences will be available from each one as a matter of course. Identification of a high proportion of cases is plausible in several scenarios, such as agricultural outbreaks, where the infected unit will usually be taken to be the farm and considerable government resources will be employed to identify every one, HIV, where almost all infected individuals will eventually seek treatment, and epidemics involving a population that can be closely monitored, such as those occurring in hospitals or prisons. The prospect of acquiring complete or nearly complete sequence datasets from an outbreak naturally suggests the possibility that genetic data could be used to reconstruct the transmission tree, determining which infected host or premises infected which others. Such a procedure would be of value in epidemiological investigations, with genetic data providing a means to complement traditional methods of contact-tracing.

There has been considerable recent work in the development of computational methods to perform analyses of this sort. Early papers inferred links based on pairwise comparisons between isolate sequences [1–4], sometimes combined with epidemiological data, but without explicitly modelling the mutation process. More recent work has instead employed a phylodynamic framework, in which inference is performed using a combination of epidemiological and evolutionary models [5–12]. A Bayesian Markov Chain Monte Carlo (MCMC) approach has almost always been used, as the probability spaces involved are of very high dimension and mathematically complicated.

The most frequent approach has been to start with a model of transmission and attach a mutation model to it, making simplifications that link the evolutionary process with host-to-host transmission events. These simplifications often violate some of the basic principles of phylogenetic inference. Jombart el al. [4, 9] treat mutation as a consequence of transmission, with none occurring within-host, whereas the work of Morelli et al [7], extended by Mollentze et al [11], while allowing for within-host mutation, still only allows a single pathogen lineage to exist within each host at any given time. Such simplifications may be reasonable to make when analysing an epidemic. However, as phylogenetic analysis is the most commonly-used tool for investigating of the history of pathogen lineages, there is scope for the development of methods which are fully compatible with it.

Also implicit within these assumptions of no within-host genetic diversity is another assumption, which is that branching times in the phylogeny and transmission events coincide; this is shared with a number of phylodynamic methods for analysis of less well-sampled datasets [13–17]. Effectively, the transmission tree and phylogeny are taken to be the same entity. Previous studies have shown that this assumption can be problematic [8, 10]. Two pathogen lineages coexisting in a host can share a common ancestor immediately after infection (or even before), but the first transmission from that host to another is unlikely to occur until several days afterwards. An error of several days is of some significance in investigating an infectious disease emergency.

Some methods do acknowledge that the phylogenetic and transmission trees are separate, although related, entities. An exploration of this was performed by Ypma et al. [8], who linked up individual within-host phylogenies according to a transmission tree structure to build a single tree describing the history of the pathogen lineages for an entire epidemic. Other papers have noted that, instead of dealing with multiple phylogenies, a transmission history can be reconstructed by augmenting the internal nodes of a single tree for samples taken from the epidemic with information about the host in which the corresponding lineage was located. This would be a preferable approach in general because it is much more compatible with existing computational methods for phylogeny reconstruction which estimate only a single tree. Cottam et al. [5] were the first to identify this, and it was revisited and refined by Didelot et al. [10].

These two studies, however, were constrained by the lack of a method to co-estimate the complete phylogeny simultaneously with its node labels; they have instead employed a “two-step” procedure, using a fixed tree pre-generated by a standard phylogenetic method. This approach has two problems. Firstly, it will ignore any uncertainty in estimates of the phylogeny. If a Bayesian phylogeny reconstruction method is used, this can be mitigated by using the same method on each one of a sample of trees drawn from the posterior distribution, but at the cost of greater computational time. Secondly, the method used to construct such a fixed tree will often have made assumptions about the structure of population of pathogens or infected hosts that is inconsistent with that of an epidemic. Standard analyses for estimation of time-resolved phylogenies will assume that, all lineages are part of a single, freely mixing population, with the probability of a tree calculated based on the assumption that it was generated by a coalescent process in this population. The result is that phylogenies may display features that are not epidemiologically plausible. For example, even for the fastest-evolving RNA viruses it remains true that many sequences collected over the short timescale of an epidemic will be identical [18]. If this is the case for two isolates, they are likely to form a “cherry” in the reconstructed phylogeny whose time of most recent common ancestor (TMRCA) can take values very close to the sampling time of the earlier isolate, because in a panmictic population, there is no reason to rule this out. In an epidemic situation where each sample is taken from a different host, we know that this is impossible, as there must have been at least one infection event since that TMRCA, and in the time from infection to sampling, a host will have gone through an incubation period and probably also a non-negligible period from manifestation of symptoms to sampling. If a single tree with these short terminal branch lengths is then used to estimate epidemiological parameters, estimates of times from infection to sampling are unlikely to be reliable.

Phylogenetic inference, too, would benefit from a more realistic population model for data from epidemics than the free mixing that is assumed in the standard coalescent-based methods. Much more sophisticated models, designed specifically with epidemics in mind, exist in the field of mathematical epidemiology. Of particular interest are those [19–21] that treat each infected host or premises as an individual entity rather than the member of a compartment, as this aligns closely with phylogenetics, where each isolate must come from one particular host, and allows inference that uses detailed epidemiological data, which can be acquired at the same time that a pathogen sample is taken for sequencing.

Our contribution here is threefold. Firstly, we provide a more rigorous mathematical definition of the correspondence identified by Didelot et al. [10] between an annotation of the internal nodes of a phylogeny with host data and a transmission tree. Secondly, we provide a full, flexible, Bayesian MCMC framework for “one-step”, simultaneous estimation of transmission trees and phylogenies, which uses a model of the pathogen population that is consistent with host-to-host transmission during an epidemic, and can make use of relevant epidemiological data. Thirdly, as our method is fully integrated into the existing phylogenetics application BEAST [22], it provides a freely-available implementation of a method of this type for use by the research community, as well as platform for future development that has access to all the models and methods that are already implemented in that package.

## Methods

### Transmission trees as partitions of the set of nodes of a phylogeny

We suppose that during an infectious disease outbreak or epidemic that infected *N* different units (be they infected organisms or infected premises—we use the word “host” in this section to avoid ambiguity), each of the *N* underwent one or more examinations which detected whether it was infected or not. Hosts that were found to be infected at an examination provided a pathogen isolate from which was obtained a nucleotide sequence (a positive examination); hosts that were not provided nothing (a negative examination). An examination could produce at most one sequence but multiple examinations could be performed simultaneously. We assume that each host experienced at least one positive examination, so we are aware of all infections. The nucleotide sequences resulting from these examinations, together with information on negative examinations, forms our dataset *D*. It may be that there are known hosts in the epidemic for which no sequence is actually available; in these cases it is obvious that if an examination was made at a time at which we know infection was present, it would have been positive and have provided a sequence, so we declare that such an examination occurred but produced a noninformative sequence (i.e. consisting entirely of the code “N” representing “any nucleotide”). As a result we have *M* pathogen sequences with *N* ≤ *M*. We denote the set of examination times by **T**^{exam}. We also assume no superinfection or reinfection, and that transmission is a complete bottleneck; only one genetic variant is passed from an infectious host to a newly infected one.

A phylogeny , with branch lengths in units of time, describes the ancestral relationship between the sequences from all positive examinations. The “height” of nodes in this full tree is defined in backwards time relative to the time at which the last positive examination was made. If **A** is the complete set of *N* hosts, a transmission tree in our terminology is a rooted tree with *N* nodes labelled with the elements of **A**. The root node of such a tree is labelled with the first host in the outbreak and edges indicate infections. They do not include information on timings and consist solely of a description of which host infected which others. As such, the tree can be regarded as a map taking each host to its infector, or to nothing if it is the index host.

Didelot et al. [10], traced the spread of a pathogen amongst hosts by annotating each internal node of a phylogeny with the host that the corresponding lineage was present in. We use the same principle. Formally, there is a correspondence between possible transmission trees and ways in which the internal nodes of can be partitioned into subsets subject to two rules: that, for each such subset, the subgraph of consisting of all the nodes in the subset and all edges connecting them (this is called the subgraph induced by the subset) is connected, and that each subset contains all the tips corresponding to the isolates taken from one and only one host. The annotation then associates each node with the host from which the tips in its subset were taken from. Fig 1 shows this correspondence for the simple case where there are three hosts in the epidemic with one isolate taken from each. In S1 Text we give a more extensive mathematical treatment of this correspondence, demonstrating that it is one to one if the phylogeny is fixed, that not every transmission tree arises as a partition of the nodes of such a fixed phylogeny if there are more than two hosts, but that every one does arise as a partition of the nodes of some phylogeny.

In such a partitioned phylogeny, infection events occur on branches joining nodes which have been annotated with different hosts. The host that the parent node is annotated with infects the host that the child node is annotated with. If the latter host is *a*_{i}, we call this branch the infection branch of *a*_{i}. The infection branch of the host that was the index case in the epidemic is the root branch of the phylogeny, which we, in contrast to most phylogenetic methods, give a finite length. The timings of the two nodes joined by this branch constrain the infection time of *a*_{i}, but the partition does not exactly specify it. Assuming that infection times and times of coalescence of lineages cannot exactly coincide, to fully describe the epidemic we introduce, for each host *a*_{i}, a parameter *q*_{i} between 0 and 1 such that if the infection branch of *a*_{i} starts at *t*_{1} and ends at *t*_{2}, then *a*_{i} was infected at .

### MCMC procedure

As many transmission histories cannot be reconstructed by partitioning the nodes of a single phylogeny, a reconstruction procedure that is able to fully explore the space of transmission trees cannot simply take a fixed tree as input. Instead, it must explore the full space of phylogenetic trees as well. The most common methods for estimation of time-resolved phylogenies involve the use of Bayesian MCMC to sample from the probability distribution of phylogenetic trees given the available sequence data. If the data is as outlined above, such procedures can be extended to simultaneously sample from the probability distribution of reconstructed epidemics if each sampled tree is augmented with a partition of its internal nodes as well as parameters determining the exact times of infection of each host. We have implemented this procedure in the package BEAST [22]. Because of the special requirements of this type of augmentation, the standard MCMC moves on a phylogenetic tree topology are unsuitable as they will generally not make modifications that respect the rules of the node partitions. Instead, a specialised set moves have been devised to alter the phylogeny and partition in such a way that the transmission tree structure is maintained, which we now describe.

#### Infection branch operator.

This operator changes the partition of the phylogeny while keeping itself fixed. We first need to introduce some terminology. If is a partition of the nodes of as described above, and *u* is a node of , let be the host from which the tips in the same element of (remembering that elements of are subsets of the set of nodes of ) as *u* were sampled. This is in fact the host in which the pathogen lineage at *u* was present. Say that an internal node *u* of is ancestral under if it is an ancestor of at least one tip of that is a member of the same element of as itself (a tip that is associated with the same host as itself). For a host *a*_{i}, let *c*(*a*_{i}) be the most recent common ancestor node of all the tips in that correspond to samples taken from *a*_{i}. (If there is only one such tip, *c*(*a*_{i}) is just that tip.) Observe that must be *a*_{i} itself, because if it is not then the subgraph induced by all nodes mapping to *a*_{i} under will not be connected. Say that a host *a*_{i} is root-blocked by a host *a*_{j} if *c*(*a*_{j}) is an ancestor of *c*(*a*_{i}). The reason for this nomenclature is that if this is true, the root *r* of can never have because if the nodes of a connected subgraph of include both *r* and *c*(*a*_{i}) then they must also include *c*(*a*_{j}), and this is untrue because . See subfigure A in Fig 2 for an illustration of these concepts.

Nodes in all cases are coloured by the partition element containing them. (A) An example partitioned phylogeny. Tips are labelled by the hosts that the isolates corresponding to them were taken from. Where more than one isolate is taken from a host *a*_{i}, *c*(*a*_{i}) is labelled; in all other cases *c*(*a*_{i}) is the single tip corresponding to an isolate taken from that host. Black diamonds designate nodes that are not ancestral under the partition. The hosts *a*_{7} and *a*_{8} are root-blocked by *a*_{6} due to the position of *c*(*a*_{6}) (black cross). (B) The downward infection branch move. The move attempts to move the node *u* from the green partition element to the red (which already contains its parent *uP*). In i), the move is impossible because *u* is the MRCA node of the tips in the green element. In ii), it can be done with no further modifications required to obey the rules. In iii), the node *uC*_{2}, which is not ancestral under the initial partition, must also be moved to the red element so the result obeys the rules. (C) The upward infection branch move. The move attempts to move the node *vP* from the red partition element to the green (which already contains its child *v*). In i), the move is impossible because *vP* is ancestral under the partition and the host represented by the green element is root-blocked by the host represented by the red. In ii), it can be done with no further modifications required to obey the rules. In iii), the node *vS*, which is not ancestral under the initial partition, must also be moved to the green element, and in iv) the node *vG* must be because *vS* is ancestral. (D) The type A phylogeny moves. The exchange move exchanges the nodes *u* and *v*; the subtree slide and Wilson-Balding moves change the position of the node *u* and its parent *uP*. (E) The type B phylogeny moves. The exchange move exchanges the nodes *u* and *v*; the subtree slide move the node *w* and its parent *wP*, and the Wilson-Balding the node *v* and its parent *vP*. After the latter two moves the transplanted parent node is randomly assigned to one of two new partition elements with equal probability.

The move operates by first randomly picking a host *a*_{i} that is not the index host, and finding its infection branch. If this ends in *u* and begins in *uP*, then and are different hosts. Suppose *p*_{1} is the partition element containing *u* and *p*_{2} contains *uP*. We want to produce a new partition in which both *u* and *uP* are in either *p*_{1} or *p*_{2}, adjusting the membership of all elements so that the subgraphs remain connected. This pushes the infection branch of *a*_{i} up or down the tree (in our terminology “up” is towards the root). This is not always possible in both directions, but if it is, then we select upwards or downwards each with probability 0.5. If neither is possible then the move fails.

The downward move, which moves the infection branch of *a*_{i} towards the tips and puts *u* into *p*_{2}, is impossible if *u* = *c*(*a*_{i}) as removing the MRCA of all the tips in a subtree will always disconnect that subtree. This also prevents the move from changing which partition element any tip belongs to, because *u* is a tip then *u* = *c*(*a*_{i}). If *u* ≠ *c*(*a*_{i}), then let *uC*_{1} and *uC*_{2} be the two children of *u*. If then the reassignment of *u* to *p*_{2} will have disconnected the subgraph induced by *p*_{1}. But we know that all the tips that were contained in *p*_{1} must be descended from only one of *uC*_{1} and *uC*_{2} (or else *u* = *c*(*a*_{i})); in other words only one of these nodes is ancestral under . Without loss of generality say it is *uC*_{1}. We move *uC*_{2} and all descendants of it that are also in *p*_{1} to *p*_{2} as well. This move is depicted in Fig 2B.

The upward move works in the opposite way; *uP* is moved to *p*_{1}. If , then this is impossible if *a*_{i} is root-blocked by *a*_{j} and *uP* is ancestral under . Suppose *uG* is *uP*’s parent (which may not exist if *uP* is the root) and *uS* its other child. If either *uG* does not exist, or *uP* and *uS* are in different elements of , then nothing further is required. Otherwise, changing the partition element containing *uP* disconnects the subgraph induced by *p*_{2}. Only one of its components can contain any tips, because if both did, *uP* would be ancestral under since it would be the ancestor of the tips in one component, and *a*_{i} would be root-blocked by *a*_{j}. If *uP* is ancestral under then the component containing *uS* contains them; if it is not then the one containing *uG* does. We complete the move by moving all nodes in the component with no tips to *p*_{2} are well. This move is depicted in Fig 2C.

We then note that:

- The downward move on
*u*is reversed by the upward move on the child*uC*_{1}of*u*that is ancestral under . The Hastings ratio is 1 multiplied by 2 if and by 0.5 if*u*is ancestral under and is root-blocked by . - If
*uP*is not ancestral under , then the upward move on*u*is reversed by the downward move on*uP*. The Hastings ratio is 1 multiplied by 0.5 if and by 2 if*uG*is ancestral under and is root-blocked by . - If
*uP*is ancestral under , and the upward move on*u*is possible, then it is reversed by the upward move on its sibling*uS*. The Hastings ratio is 1 multiplied by 0.5 if and then by 2 if .

#### Phylogeny operators.

BEAST in its default configuration uses three types of phylogeny operator: exchange [23], subtree slide [24], and Wilson-Balding [25]. All three have been modified to produce two special cases which respect node partitions. The “type A” operators do not change the transmission tree, whereas the “type B” moves simultaneously rearrange both trees. For brevity we sketch these here; a full treatment can be found in S1 Text, in which we also show that the Markov chain is irreducible. Figs 2D and 2E depict examples of the modifications made to a partitioned tree by the type A and type B moves respectively.

The “wide” version of the standard exchange operator randomly selects two nodes in the phylogeny that are not siblings or the root, and whose heights and parent heights are such that swapping their parents would not lead to a situation where a node is lower in the tree than its child, and does that swap. Our type A modification randomly selects two nodes that are also not siblings or the root and whose parents are members of the same partition element, and does the same. This preserves the infector of every host (see section S1.3.2.1 in S1 Text for details). The type B version selects instead a random two nodes whose parents are in different partition elements to themselves, and again swaps these parents. If *u* and *v* are the two nodes moved and is the original partition, then and exchange infectors in the transmission tree (if they were different to start with). The Hastings ratio must be calculated by specifically enumerating the number of possible exchange partners for each node if they were the first of the pair to be selected.

The standard subtree slide operator picks a random node *u*, draws a value *d* from a probability distribution that has support on the whole real line and is symmetric about 0, and moves *u*’s parent *uP* a distance Δ up or down the tree (according to Δ’s sign) along a path connecting the root to the tips; where the move is towards the tips a random branch is chosen at every split encountered. The move fails if *uP* is taken so far down the tree that its height is equal to or smaller than that of *u*. It also may be that *uP* moves so far up the tree that it becomes the root, which is a legal move. The type A modification does the same, but insists that the final position of *uP* is such that it is adjacent to a node in the same partition element as itself. This ensures the transmission tree structure is unchanged. The Hastings ratio must be calculated, again, by enumerating the set of possible origins and destinations. The type B version picks *u* such that *uP* is in a different partition element to itself, then performs the standard move. Afterwards, *uP* is randomly allocated to the partition element containing either its new parent or its new second child with probability 0.5 each. This moves transplants the subtree of the transmission tree rooted at to a new location. The Hastings ratio is the same as for the standard move, save for a trivial modification to take into account the random allocation of *uP* to a new partition element.

The standard Wilson-Balding move picks a random node, and then prunes and reattaches its parent to a random position so long as there is no height conflict. The modifications are largely the same as those for subtree slide; type A can be applied to any node and reattaches its parent in a position adjacent to a node in the same partition element, whereas type B chooses a node with a parent in a different partition element to itself, performs the standard move, then reallocates the parent to a new element. The modifications to the Hastings ratio calculations are also analogous to subtree slide.

Moves are also needed to adjust branch lengths in the phylogeny; these are inherited from BEAST with no modification required.

#### Infection times.

None of these these moves change the value of any of the *q*_{i} parameters that exactly determine infection times; new values of those are proposed and evaluated separately by draws from a uniform distribution. Nevertheless, changes to either tree may involve modifications of the times of infection of some hosts. For example, the infection branch operator changes the branch on which ’s infection occurs, so it must change even if it does not change *q*_{i}. Even a move that has no effect on the partition or phylogenetic tree topology, such as a change to branch lengths, may alter the height of the nodes which *a*_{i}’s infection branch connects, which will also modify while *q*_{i} remains the same.

### Model description

We assume that the epidemiological and evolutionary processes involved in an epidemic can be described by three models: a stochastic model of infection and between-host transmission dynamics, a deterministic model of the population dynamics of a within-host population of “agents”, and a stochastic model of sequence evolution. Table 1 summarises the notation we will use to describe them in the following paragraphs.

In contrast to the previous work of Didelot et al. [10], whose underlying model of transmission was a compartmental SIR model, we use an individual-based model similar to those employed in previous work on agricultural outbreaks [5–8]. This much more readily allows for the accommodation of host heterogeneity, and makes no assumption of random mixing. We start with a population of susceptible hosts. We may know *a priori* some characteristics that allow us to define relationships between these hosts; if so, call these characteristics *L*. *L* could, for example, be the spatial locations of farms in an agricultural outbreak. The epidemic starts when a single susceptible is infected by an external source. If *a*_{i} is a host, is its time of infection. It is infectious from until a time . The value of is randomly determined at , by a draw from a probability distribution with parameters *ρ*. Let **T**^{inf} be the complete set of infection times and **T**^{end} the complete set of noninfectiousness times. For now, we assume that a host becomes infectious immediately upon infection; we relax this assumption in a later section. If *a*_{i} is infectious and *a*_{j} susceptible, *a*_{i} inflicts a constant force of infection on *a*_{j} given by a rate *b* modified by multiplication by a positive real number *F*(*a*_{i}, *a*_{j}), where *F* is a positive function with parameters *ϕ* defining a relationship between *a*_{i} and *a*_{j} based on the information in *L*. In other words, the time between the infection of *a*_{i} and a possible infection of *a*_{j} by *a*_{i} is drawn from an exponential distribution with mean 1/(*bF*(*a*_{i}, *a*_{j})). If the time drawn is such that *a*_{i} was no longer infectious at that point, or if some other infectious host had infected *a*_{j} at an earlier time, nothing happens. Otherwise, *a*_{j} becomes infected after this time. After , *a*_{i} is considered removed and plays no further part in the epidemic.

There are many possible choices for *F*. If we assume no spatial structure or other heterogeneity affecting transmission then we can just take *F*(*a*_{i}, *a*_{j}) = 1 for all hosts *a*_{i} and *a*_{j}. Otherwise, it can be based on, for example, geographical distance between sampling sites, a network metric, or shared membership in some risk group. It can also be used to state prior information about the transmission tree structure; if it is known *a priori* that *a*_{i} did not infect *a*_{j}, then *F*(*a*_{i}, *a*_{j}) can be set to zero. There is also no requirement that *F* be symmetric.

We assume that each host is examined at least once while it is infected, and that examination does not disturb the course of the infection. Beyond that no concrete assumptions need to be made about the examination process; any number of examinations can be made of any hosts at any time. If examinations are instead restricted so that they only occur at at the point of noninfectiousness of each host, however, there are mathematical advantages, as will be seen.

As in previous work [8, 10] we take the model of the dynamics of the “agents” to be a coalescent process, with parameters *ψ*, amongst lineages in a freely-mixing population within each host. If the hosts are single organisms, the agents will naturally be individual pathogens. If, on the other hand, they are infected locations, they could instead be considered to be infected organisms. In either case, only a very small proportion of the total agent population are represented by lineages in the tree, and the assumption of a low sampling fraction required for use of the coalescent process is satisfied.

The sequence evolution model is of the standard type used in the reconstruction of time-resolved phylogenies [23]. It consists of both a continuous-time Markov chain model of sequence evolution (such as the commonly-used HKY [26] or GTR [27] models) and a molecular clock model. Denote the parameters of both by *ω*. We assume that mutation is a neutral process, and that it occurs independently of the host-to-host transmission structure.

### Bayesian decomposition

In this section we show how the likelihood of a partitioned phylogeny can be calculated using the three models described above. We condition on **T**^{exam} and *L*. The noninfectiousness times **T**^{end} are formally considered to be latent variables and could be estimated, but for this paper we assume them to be known and fixed to their actual values, much as Didelot et al. [10] treat removal times. If any or all hosts are known to have remained infectious indefinitely, the corresponding values of **T**^{end} can be set to the time of analysis. It should be noted that the **T**^{exam} are not strictly sampling times. They instead represent times at which it is known that hosts were examined, and an infected host would provide a sequence. *D* is the results of these examinations, including the results of negative ones. This formulation allows for some convenient mathematics but has consequences for estimation of the prior distribution (see Discussion). Alternatively, if the data is such that all samples from each host were taken at the same time and the assumption that all hosts ceased to be infected immediately after this time is reasonable, we need not treat **T**^{exam} in this way and *D* can consist solely of sequence data as it does in standard phylogenetic analyses; see “an alternative approach” below.

Ideally, it should be possible to enumerate all individuals or premises which were susceptible to infection but never experienced it, and *L* should include background information on them. Their never-infected status is assumed *a priori* and since they will never appear in the phylogeny, there is no need to consider examination times for them. For convenience we give these hosts infection and noninfectiousness times whose values are fixed to the time of analysis, as we need to evaluate the probability that they were not infected at any time before the present. Consideration of the never-infected set is necessary for unbiased estimation of *b* and *ϕ*, which should only be interpreted literally if such data is present in the analysis. If it is not, we are actually estimating parameters *b*′ and *ϕ*′, which is what *b* and *ϕ* would be if all susceptibles did experience infection.

The posterior probability we are interested in calculating is . By Bayes’ Theorem this is equal to
As usual, we need not calculate the denominator if we are uninterested in model comparison as it does not vary. If *D* may contain the results of negative examinations, we must explicitly state that if *D* includes *M* sequences but has any number of tips other than *M*, then the probability of *D* given is zero. A with a different number of tips does not necessarily have zero prior probability, but it does result in zero likelihood for the data, so we need not concern ourself with exploring the posterior probability space of such phylogenies. Given a with the right number of tips, *D* depends by the assumptions of the mutation model only on and *ω*, and the likelihood reduces to , which can be calculated using the Felsenstein pruning algorithm and the chosen molecular clock model in the normal way [23, 28, 29]. It remains to calculate the prior probability . We decompose this as
We make the following assumptions:

- All parameters are independent of
*ω*; the mutation process has no bearing on the infection dynamics inside or between hosts. - The phylogeny is conditionally independent of
*b*,*ϕ*,*ρ*, and*L*given*ψ*, ,**T**^{inf},**T**^{end}, and**T**^{exam}. The former parameters determine the transmission model and are not relevant if we already know the full transmission tree and its timings. - The transmission tree and its event timings
**T**^{inf}and**T**^{end}are conditionally independent of**T**^{exam}and*ψ*given*ϕ*,*ρ*, and*L*. The parameters of the within-host model are not relevant to the between-host model and examination is assumed not to disturb the transmission process. *b*,*ϕ*,*ψ*,*ρ*, and*ω*are independent of**T**^{exam},*L*and each other. The parameters determining transmission, within-host growth, infectious periods, and mutation are independent of each other, the examination process, and the exact relationships amongst this set of hosts.

The decomposition is therefore reduced to

We first need to calculate . The first observation we make is that the combination of **T**^{inf}, **T**^{end} and **T**^{exam} determines which examinations were positive, and that positive examinations correspond to the tips of . If the number of positive examinations of a given host and the number of tips corresponding to sequences taken from that host differ, then this term must be equal to zero. In theory, we can calculate it for a phylogeny with any number of tips up to the total number of examinations, but in practice we need not if we are sampling from the posterior distribution, as any tree that does not have *M* tips will have zero posterior probability because the likelihood will be zero. So we can assume that has *M* tips and that no tip date is before the infection date or after the noninfectiousness date of the corresponding host, and merely check that **T**^{inf}, **T**^{end} and **T**^{exam} imply *M* positive observations.

If the tip count is correct, we then calculate this probability by extending the procedure outlined by Didelot et al. [10] to allow for the use of any of the standard models of deterministic population growth, and the possibility of host heterogeneity. The latter is accomplished by dividing the set of hosts into categories and assigning a separate demographic model to all of those in each one. Categories can be assigned from known epidemiological data about the hosts; for example, in a livestock disease outbreak, they may reflect the size of farm. Naturally, there is no requirement that there be more than one category. If **c** is such a category, there is a corresponding demographic function with parameters *ψ*_{c} where *N*_{c}(*t*) is the product of the effective population size and generation time of the agents at time *t* on a separate backwards timescale in each host. Let *cc*(*i*) be the category that *a*_{i} belongs to.

Suppose that, according to , **T**^{inf} and **T**^{exam}, *a*_{i} ∈ **A** infected *n*_{i} other hosts and that there were *m*_{i} positive observations of *a*_{i}. Suppose is a phylogenetic tree that describes the part of the outbreak that took place within *a*_{i}. Because we assume transmission is a complete bottleneck, it is a single tree with a root note *r*. It will have *n*_{i} + *m*_{i} tips, one for each infection event and each positive observation. If the time of the root *r* is , we know that is later than and we give a root branch of length . If we have a for each *a*_{i}, and we know , we can build a phylogenetic tree for the entire epidemic by, if , attaching the root node of to the tip of that corresponds to the infection of *a*_{i} by a branch with length equal to the root branch length of . If cannot be built up from in this way, . Otherwise, we calculate it as

In the standard coalescent model [30], the probability density function for the time *t* (in backwards time) of the first coalescence of *K* ≥ 2 lineages after *t*_{0} where the demographic function is *N*_{c} is given by
and if we know which two specific lineages coalesced, the first *K*(*K* − 1)/2 cancels. As Didelot et al. [10] note, this is not quite sufficient for our purposes because we have a maximum height for the last coalescence. If this is *t*_{max}, the normalised probability distribution for the time of first coalescence is
(1)
This is the probability of an interval in ending in a coalescent event. The probability of an interval ending in a transmission or sampling event is the probability that no events occur in the interval, which is one minus the cumulative distribution function *P*(*t*|*t*_{max})
(2)
Note that while with no maximum root height, the formula happens to work for *K* = 1, here it does not as the denominator is 0 for *t*_{0} ≤ *t* < *t*_{max}, and we instead set the probability of any interval with one lineage to 1. In particular, if *a*_{i} has no children then .

These formulae can be used to calculate for every in the established way for a tree with temporally offset tips [23]. It is most intuitive to standardise the timescale of each such that the effective population size at the point of the infection can be the same across all hosts. As a result, when (and only when) dealing with within-host phylogenies we depart from the convention of making height 0 the time of the last tip, and instead put it at the time of infection (i.e. *t*_{max} = 0), with all later events occurring at negative heights. Appropriate demographic functions should be picked for the *N*_{c}s; we suggest exponential or logistic growth [30, 31].

We now calculate the product . The first half, , is the probability that the observed transmission tree and all its timings occurred for a given *b*, *ϕ* and *ρ*. This can be calculated using a procedure similar to that employed by Deardon et al. [21]. If there are, in addition to the *N* infected hosts *a*_{1}, …, *a*_{N}, *N*′ known potential hosts *a*_{N+1}, …, *a*_{N+N′} that were never infected, let *o* be a permutation function such that is in increasing order of time (breaking ties arbitrarily and remembering that never-infected hosts are given “infection times” after any others).

For the real infection events, there are *N* − 1 inter-infection intervals *I*_{2}, …, *I*_{N} where each ; let and (the end of the latter interval is the time of infection assigned to all the never-infected susceptibles). Let be the set of susceptible hosts, the set of infected hosts, and the set of noninfectious hosts at time *t*. For any *i* > 1, and . Recall that we assume that the noninfectiousness time of each host is determined upon infection by a draw from a probability distribution with parameters *ρ*. For any *i* ≤ *N*, the event *E*_{i} is taken to represent the combined occurance of a) *a*_{o(i)} being infected by at the end of *I*_{i}, b) the time of removal of *a*_{o(i)} being , and c) no other infection events taking place during *I*_{i}. The event *E*_{N+1} represents no infections occurring amongst any remaining susceptibles during *I*_{N+1}. We now derive the probability :
(3)
The first term *p*(*E*_{1}|*b*, *ϕ*, *L*, *ρ*) is the product of , and the probability that *a*_{o(1)} was infected at time and was the first in the epidemic. The latter should be defined by a prior; call this . For *E*_{i} with 2 ≤ *i* ≤ *N*:

- Let
*X*_{i}be the probability infected*a*_{o(i)}at , but not before that during*I*_{i}: - Let
*Y*_{i}be the probability that no host in other than infected*a*_{o(i)}before in*I*_{i}. Noting that the upper bound on the time that such an*a*_{j}could have infected*a*_{o(i)}before is either itself if*a*_{j}was still infectious at that point or if it was not, this is given by - Let
*Z*_{i}be the probability that no host in infected any host other than*a*_{o(i)}in (a set that always includes all the never-infected susceptibles) during*I*_{i}. Again, the upper bound on the time at which an*a*_{j}could infect a third host*a*_{k}before is .

Then .

Finally, consider *E*_{N+1}. After , the only remaining susceptibles were never infected, and their assignment of as an infection date is a consequence of this. If *W* = *p*(*E*_{N+1}|{*E*_{i}: *i* < *N*+1}, *b*, *ϕ*, *ρ*, *L*), then it is just the probability that no never-infected susceptible is infected after (the probability that these were not infected earlier is handled in the construction of *Z*_{i} above) and this is given by

The time period after need not be considered. If the epidemic ended with the host *a*_{j} ceasing to be infectious at , the infectious pressure applied to all never-infected susceptibles after will be zero permanently and hence the probability that they were not infected after , if they were not infected before it, is 1. All other hosts were removed by this time and cannot be reinfected. If, on the other hand, any hosts remain infectious at the time of analysis, the probability of no infection for any member of the never-infected set must be calculated up to that time, which is, by construction, . This method is not designed to infer future events, so the timeline is truncated at this point.

We multiply the conditional probabilities of each *E*_{i} together to form Eq (3) and we have
*W* and each product *X*_{i} *Y*_{i} *Z*_{i} can be calculated individually. A term can be seen as the probability that the infectious period of *a*_{i} has length given *ρ*. Writing and assuming infectious periods are independent of each other, then

There are two ways to handle this term. The simpler is to treat *ρ* as known, in which case *p*(*ρ*) = 1 and what we are effectively doing is placing a prior distribution on the length of each infectious period. No assumption is made that each is drawn from the same distribution; indeed every single infectious period can be treated as coming from a different distribution. Previous work on foot-and-mouth disease virus [5, 7] has used clinical data to estimate times of infection, and if this kind of information is available, it can be used to determine an individual prior for each *l*_{i}. This is also the preferable approach if infections are ongoing at the time of sampling. If we cannot use information of this type, a similar approach can be taken to that in the coalescent calculations above, assigning each host *a*_{i} to an infectious period category *ic*(*a*_{i}). This again allows us to accommodate known heterogeneity; for example in an agricultural outbreak it is likely that infectious periods decrease as time goes by and control measures are brought to bear.

It may be, however, that we want to estimate the distribution of infectious periods from the genetic data. Suppose hosts in category *A* have infectious periods distributed according to a distribution *D*_{A} with unknown parameters *ρ*_{A}. In this case *p*(*ρ*_{A}) would be determined by hyperpriors. We suggest that, as an alternative to using MCMC to estimate both the parameters of *D*_{A} and a set of draws from it, the actual values of *ρ*_{A} be integrated out by appealing to a conjugate prior distribution for *D*_{A} and calculating the marginal likelihood of the set {*l*_{i}: *ic*(*a*_{i}) ∈ *A*} given the hyperpriors. Candidates for *D*_{A} are then those continuous distributions for which this marginal likelihood is analytically tractable. Examples are normal, lognormal, exponential, and gamma if the shape parameter is known. Although it it not absolutely ideal as infectious periods are non-negative parameters, we suggest the normal distribution as the prior for the reason that its mean and variance are independent.

Finally, all that remains is to place prior distributions on the parameters making up *b*, *ϕ*, *ψ*, and *ω*.

### An alternative approach

If it is reasonable to assume that all hosts cease to be infectious upon examination (and all examinations of each host take place simultaneously), then **T**^{end} and **T**^{exam} coincide (meaning that we no longer must assume examination does not disturb the transmission process) and we can replace the latter with a term **N**^{exam} which simply counts the number of sequences taken from each host. Negative examinations no longer need to be considered, and *D* is simply sequence data. A tree with a number of tips other than *M* has zero prior probability as well as zero likelihood because we condition on **N**^{exam}. The calculations are identical except that the initial check for consistency of the number of examinations with **T**^{inf} and **T**^{end} can be skipped.

### Latent periods

The above formulation has taken the course of infection to follow a SIR process; hosts are assumed to be infectious as soon as they are infected. It is straightforward to replace this with a SEIR process instead. While it is possible to treat latent periods as draws from a probability distribution in the same way as described for infectious periods in the previous section, in simulations this resulted in poor mixing of the MCMC chain if an strongly informative hyperprior on the parameters of this distribution was used, and poor estimation of their values if the hyperprior was weaker. Instead, we again subdivide the set of hosts into one or more discrete categories and assign a single value to the latent period for all hosts in each category, so that the latent period of host *a*_{i} is *lc*(*a*_{i}). Let the complete set of latent periods be *λ*. Let **T**^{trans} be the set of infectiousness times of each host; then if is the infectiousness time of *a*_{i}, . We assume that hosts are infectious by the time they cease to be infected, and that examinations of infected but noninfectious hosts are positive. The phylogeny is assumed to be conditionally independent of **T**^{trans} given **T**^{inf} and **T**^{end}.

Aside from the prior on *λ*, only the second term in this product is different, and it is not a major modification to the SIR version to calculate it; see S2 Text.

### Simulations

Epidemics and sequences were simulated using examples of the three models described above. The epidemic simulations were intended to represented a situation analogous to an agricultural outbreak, with the hosts as farms. The units of time were intended to represent days. In each replicate of the simulation, **A** consisted of 50 potential hosts arranged spatially on a regular 5 × 5 grid contained in the unit square, such that every grid point contained two whose distance from each other was zero. A single host was chosen at random to be infected first at time 0. The infection of each followed a SEIR process: upon infection, a host *a*_{i} was latently infected for a time *P*^{lat} which was identical across all hosts and subsequently infectious for a period drawn from a normal distribution (negative draws were discarded, but the distribution used was such that the probability of these occurring was negligible). Let **P**^{inf} be the set of all the infectious periods.

*F* was an exponential spatial transmission kernel function: the time for an newly infectious *a*_{i} to infect a susceptible *a*_{j} was drawn from an exponential distribution with mean *be*^{ − αd(ai, aj)} where *d*(*a*_{i}, *a*_{j}) is the Euclidean distance between the locations of *a*_{i} and *a*_{j}. The process was run until no infections remained. A single positive examination was simulated at the point of noninfectiousness of each host. As no infections persisted following the acquisition of a sequence, we are in the special case outlined under “an alternative approach” above and so there is no need to consider the possibility of negative examinations in the analysis. Only simulations in which at least 45 of the 50 susceptibles (i.e. *N* ≥ 45) were eventually infected were kept.

Once the epidemic simulation was completed, the transmission tree was transformed into a phylogenetic tree by simulating a within-host phylogeny under a coalescent process. Variation in the product of effective population size and generation time of the agents within each host was identical and obeyed a logistic growth function *N*_{e}(*t*):
where the timescale is in negative time and distinct for each host and *t* = 0 is the point of infection. *N*_{0} represents the effective population size at *t* = 0, *r* the growth rate during the exponential growth phase of the logistic function, and *T*_{50} the time such that *N*_{e}(*T*_{50}) is half the value of the limit of *N*_{e}(*t*) as it approaches −∞. We conditioned the simulation on all lineages coalescing before *t* = 0. The complete set of such phylogenies was then joined up to produce a single phylogeny for the entire simulated epidemic.

This full phylogeny was then used to generate simulated sequences using the program πBUSS [32]. Sequences consisted of 14,000 base pairs (roughly equivalent to a full influenza A genome). A strict molecular clock model with no rate variation between sites and equal nucleotide frequencies was used. Two sets of sequences were generated. The first used an unrealistically fast molecular clock with a rate of 5 ⋅ 10^{−4} substitutions per site per day (0.183 per site per year) while the second had a rate of 1 ⋅ 10^{−5} per site per day (3.65 ⋅ 10^{−3} per site per year). The slower rate was intended to be be similar to the genuine substitution rate for influenza A. Both used the HKY substitution model [26] with a *κ* value of 2.718. Table 2 gives the parameter values actually used in the simulations.

Mathematical symbols are given where they appear in the text.

Sequence datasets from a total of 25 simulation replicates were used for analysis. We used the within-host coalescent (WHC) method outlined in the previous sections, implemented in BEAST, to reconstruct the full phylogeny and transmission tree for each replicate, and estimate the parameters of the model that generated them. We also performed the same analysis using a blank alignment, sampling from the prior distribution only. Uninfected susceptibles were included in the analysis. For comparison, we also reconstructed the phylogeny only using a GMRF Bayesian skyride [33] tree prior. Table 2 also details the prior distributions used on all parameters. In this paper we concentrate primarily on the between-host model, so the chosen priors on the within-host parameters were somewhat informative about their known values. In the prior, the identity of the index host and its time of infection were taken to be independent, so . A couple of points warrant further explanation.

Firstly, in the reconstruction we assumed that all infectious periods are drawn from an unknown normal distribution with mean *μ*_{inf} and precision *τ*_{inf} and placed a conjugate NormalGamma(*μ*_{0}, *κ*_{0}, *α*_{0}, *β*_{0}) hyperprior on *μ*_{inf} and *τ*_{inf}. The meaning of this is that *τ*_{inf} is gamma distributed with shape *α*_{0} and rate *β*_{0}, and for a known value of *τ*_{inf}, *μ*_{inf} is normally distributed with mean *μ*_{0} and precision *κ*_{0} *τ*_{inf}. Initial analyses of both datasets had *μ*_{0} = 10, *κ*_{0} = 0.01, *α*_{0} = 1, *β*_{0} = 1. While this value of *μ*_{0} is equal to the actual mean of the distribution used to generate the simulations, the low *κ*_{0} actually means that this hyperprior is only very weakly informative about *μ*_{inf}. As it proved that for datasets generated with the slower clock this resulted in a systematic underestimation of the length of infectious periods (see Results), the analysis was repeated with *κ*_{0} = 100, a modification which makes the hyperprior much more informative about *μ*_{inf}.

Secondly, very large values of the probability expressions Eqs (1) and (2) can be obtained when their denominators are very small. This occurs when the coalescence of all lineages before the point of infection (in backwards time) is actually highly unlikely given the parameters of the within-host model (because the denominators are the probability of coalescence of all lineages before infection), contrary to the assumption that transmission is a complete bottleneck. There is therefore a mismatch between this bottleneck assumption and some values of the parameters making up *ψ*, and this must be remedied by appropriate prior distributions for the latter; uninformative priors are completely inappropriate. The nature of the mathematics of the coalescent process used here is such that no values will literally make the bottleneck complete, so we instead ensure that it is not unreasonably wide. The ratio *S* of the final asymptotic value of *N*(*t*) to *N*_{0}, its value at the point of infection, in our logistic model is
The concerning situation is where *S* is small. If *T*_{50} is positive then *S* cannot be greater than 2, so we assume it is negative. We then place a lognormal prior on *S*. This prior, combined with one on either *T*_{50} or *r*, specifies the prior probability of *r* so we give the latter no explicit distribution. We also fixed *N*_{0} to its correct value in all simulations.

In analysing the sequence datasets generated by the slower molecular clock, the amount of genetic variation accumulated over the timescale of each epidemic was found to be insufficient, for some simulations, to provide good estimates of the clock rate. As a result, this parameter was also fixed to its correct value. The same was done for the prior analysis. All MCMC chains were run for sufficiently long to give effective sample sizes of at least 200 for all numerical model parameters.

Accuracy of the reconstructed phylogenetic tree topology was assessed by counting, for each tree in the posterior sample, the number of subtree prune and regraft (SPR) moves required to take it to the correct phylogeny and taking the posterior median value of this count; we used the program rSPR [34] to determine this. In addition, to investigate the extent to which the imposition of a transmission model constrains the space of plausible phylogenies, we calculated the number of unique clades in the 50% credible set of phylogenies for the WHC and skyride analyses of each slow and fast clock dataset.

We used two methods to assess procedures by which transmission tree might be reconstructed in practice. Firstly, the posterior set of trees was summarised in a single maximum parent credibility (MPC) transmission tree, analogous to the maximum clade credibility (MCC) tree for phylogenies. The posterior distribution of parents for each host in the epidemic was calculated for each host in turn, and the parent credibility of each tree in the sample was calculated as the product of the posterior probabilities of each link in the chain. The MPC tree is the tree in the sample that maximises this product. This was compared to the correct transmission tree, and the proportion of parents that were correctly identified calculated.

As an alternative approach we identified, for each host, the infector with the highest posterior probability, regardless of whether the result of doing this for every host actually constituted a proper transmission tree that was connected with no cycles. We calculated the proportion of parents that would be correctly identified by doing this, firstly if the actual value of the posterior probability was not considered, and subsequently for different values of a threshold probability below which inference of parental relationships would not be made.

### Analysis of sequences from the 2003 H7N7 avian influenza outbreak in the Netherlands

We used the WHC method to reanalyse the data from the Dutch H7N7 avian influenza outbreak of 2003. The outbreak has been the subject of many previous papers [35–37], including several that incorporated genetic data [6, 18, 38–40].

Epidemiological data from the Dutch epidemic consisted of cull dates for all 241 farms, and the matrix of spatial distances between them, rounded to the nearest kilometer. (We did not have access to their precise spatial locations.) The GISAID database [41] contains sequences for isolates taken from 229 of the farms (95.0%); this consists of the HA, NA and PB2 segments in 226 cases, the HA and PB2 in 2, and the HA and NA in 1. The dates upon which these samples were taken were also available. In the absence of any other information, we assumed that a single examination of each farm took place at this time.

The HA, NA and PB2 sequences were each aligned using the MUSCLE algorithm [42]; segments which were missing were given noninformative sequences consisting entirely of the code “N”. This included entirely noninformative sequences for the twelve farms for which we had no genetic data at all; the examination date of these was set to the cull date of the farm, a time at which it was certainly possible to acquire a sequence. The three segments were then concatenated to produce a single alignment. The 143rd codon position of the HA segment, which has been observed to cause discrepancies between reconstructed phylogenies for each segment probably as the result of convergent evolution [18], was removed. As we lacked data on the location of uninfected farms in the country, we did not include uninfected premises in the analysis, and as a result we were estimating *b*′ and *ϕ*′ (see Bayesian Decomposition).

The parameters of this analysis, and the prior distributions used for them, are summarised in Table 3. We used the SRD06 nucleotide substitution model [43] and an uncorrelated lognormal relaxed molecular clock [29]; the mean clock rate was not fixed *a priori*. The type of spatial transmission kernel function used here was the same as that used by Boender et al. [37] in their analysis of the same epidemic, determined by a logistic expression
where *d*(*a*_{i}, *a*_{j}) is the distance between the farms *a*_{i} and *a*_{j}. As before, the latent period of the disease was assumed to be constant, and we placed a strong prior with a mean of two days on its length. We also followed Boender et al. in assuming that the distribution of farm infectious periods prior to the discovery of the epidemic and the implementation of control measures was distinct from that afterwards, and grouped the set of farms into “high-risk” and “low-risk” categories accordingly. The first five detected cases (F1–F5) were in the high-risk category. The hyperpriors on the distribution of infectious periods in both categories and the prior distribution for the time of the initial infection were informed by estimates from Boender et al. and Stegeman et al. [36]. The prior distribution for the identity of the index farm was such that each high-risk farm was given ten times the weight of each low-risk farm.

We chose to regard the agent population as being made up of infected birds. As in the simulations, we assumed that the product of the effective size and the generation time of this population within each farm underwent logistic growth, and that the same growth function was shared by all farms. Also as in the simulations, we did not estimate *N*_{0} and instead assumed that the effective population size at the point of infection was 1, and that the generation time (the serial interval of the infection) was 3.37 days, a number derived from White and Pagano [44].

Multiple MCMC runs were performed, and the results combined using the LogCombiner utility in order to achieve ESS values over 200 for the posterior and prior probabilities, the likelihood, and all parameters listed in Table 3. The MPC transmission tree was visualised with Cytoscape 3.2 [45].

## Results

### Simulations

The simulation needed to be run only 26 times in order to obtain 25 instances in which at least 45 hosts were infected, suggesting that there was little bias involved in discarding those that failed to meet this threshold. Fig 3 summarises the accuracy of the reconstruction of the transmission tree and the estimates of infection times for each host. For the latter, we saw low bias and error when the molecular clock rate was fast. However, the use of realistic sequences led to a systematic tendency to underestimate times from infectiousness to removal when the mean parameter of the probability distribution from which infectious periods are drawn was not given a strongly informative prior. It is clear from the results of the prior analysis that the effective prior distribution favours short infected periods. Re-running the analysis with an informative prior on *μ*_{inf} (by setting *κ*_{0} = 100, see Methods) greatly reduced this effect, but did not entirely eliminate it.

Each violin plot represents the density of a statistic calculated from the results of separate analyses of 25 simulated datasets; the clock model used to generate the dataset and the analysis method are indicated on the *y*-axis. (A) posterior median of mean bias in estimation of infection dates. (B) posterior median of mean error in estimation of infection dates. (C) Posterior median proportion of hosts whose infector is correctly identified. (D) Proportion of hosts whose infector is correctly identified in the maximum parent credibility (MPC) transmission tree.

The transmission tree was very well reconstructed when the clock was fast, with the posterior median proportion of parents being correctly identified, across the 25 simulations, having a median of 0.94 (range 0.8–1). For the slower clock this was considerably reduced, with a median of 0.64 (0.46–0.78). Increasing *κ*_{0} had no noticeable effect on this (median 0.62, range 0.37–0.78). As expected, reconstruction of the transmission tree when MCMC samples were taken from the prior distribution only was extremely poor. The MPC transmission tree’s median proportion of correctly identified parents was 0.96 (0.82, 1.00) for the fast clock dataset, 0.71 (0.46, 0.88) for initial slow clock dataset, and 0.72 (0.43, 0.86) for the slow clock dataset with *κ*_{0} = 100.

Table 4 summarises the accuracy of the procedure of picking the infector with the highest posterior probability, for no probability threshold and thresholds of 0.5, 0.8, and 0.9. It can be seen that for a threshold of 0.8, inferences are highly accurate even for the slow clock dataset and that the use of a value of this size leaves up to two-thirds of hosts with an inferred infector.

Numbers are median and range across the 25 simulations.

For the fast clock sequences, the phylogeny was sufficiently well resolved by the genetic data that both methods, the established skyride coalescent tree prior and the WHC introduced in this paper, performed similarly in reconstructing it, but WHC performed better when the molecular clock rate was more realistic (Fig 4). Error and bias in the estimates of the TMRCA of each pair of sequences was notably reduced for WHC. Using an informative prior on *μ*_{inf} made estimates better still. The reconstruction of the topological structure of the phylogeny was also improved, with the number of SPR moves needed to take a sampled tree from the MCMC chain to the true phylogeny being consistently smaller for WHC, where the median (across the 25 simulations) posterior median number of required SPR moves was 15 (range 8–21), compared to the skyride analysis, where it was was 18 (range 11–24). The informative prior on *μ*_{inf} made no noticeable difference in this case. In the slow clock analyses, the number of unique clades in the 50% credible set of phylogenies was always larger when the skyride was used than when the WHC was, sometimes by a factor of as much as 2; the median ratio of the former to the latter was 1.87 (range 1.19–2.78). For the fast clock datasets, however, the numbers from each method were extremely similar, indicating the extent to which the phylogeny could be resolved by the genetic data alone.

(A) posterior median of mean bias in estimation of all pairwise TMRCAs. (B) posterior median of mean error in estimation of all pairwise TMRCAs. (C) Posterior median SPR distance from the true phylogeny.

Table 5 summarises the posterior parameter estimates and their accuracy. Figures given are the medians across the 25 simulations. The tendency of WHC to substantially underestimate infectious periods unless an informative prior is used on the mean of their distribution is also clear here; latent periods were also slightly underestimated although the true values were always well within the 95% highest posterior density (HPD) interval. It is also noticeable that the parameters *r* and *T*_{50} of the logistic growth function describing within-host effective population size are not well estimated for the slow clock dataset, with very wide HPD intervals and also a great bias towards underestimating the value of the latter, to the extent that the 95% HPD was frequently inaccurate. The informative prior on *μ*_{inf} improved matters somewhat, at least ensuring that the true *T*_{50} usually lay within the HPD interval. On the other hand, the ratio *S* was recovered with much more precision and much less error and bias. These within-host parameters were in fact rather better estimated when sampling was from the prior only, but this is presumably because the prior distributions on them were chosen with knowledge of their true values. To investigate this further, we re-ran the analyses on the fast and slow clock datasets (with *κ*_{0} = 0.01), fixing the parameters of the within-host model to their true values, and once again determined the accuracy of the reconstruction of the transmission tree. The results of this can be seen in S2 Fig. For the slow clock dataset, this improved the reconstruction somewhat, although the effect was not dramatic.

The median values, across the 25 simulations, of the posterior median, relative error in the posterior median, relative bias in the posterior median and relative width of the 95% HPD interval of each parameter are given, along with the number out of 25 simulations that the correct value was contained within the 95% HPD interval. Where estimates are not given for a particular analysis, this parameter was either fixed to its correct value or, in the case of WHC-related parameters in skyride analyses, not part of the analysis. Mathematical symbols are given where they are referred to in the text.

### Analysis of sequences from the 2003 H7N7 avian influenza outbreak in the Netherlands

The MPC transmission tree can be seen in Fig 5. It can be seen that most of the inferred transmissions have a quite low posterior probability. In fact, if we were to use a posterior probability threshold of 0.5 to infer transmissions, we would draw conclusions about only 90 farms (37.3%), with this dropping to 24 (10%) for a threshold of 0.8, and 9 (3.7%) for a threshold of 0.9. None of the five “high-risk” farms met the 0.5 threshold, although the posterior probability that the index case was among these five was 0.62 compared to the prior probability of 0.17. This lack of resolution is the reason why in the MPC tree the presumed index farm F1 is not correctly identified, and why, while it and the other five high-risk period farms are close together at the start in the transmission chain, they are intermingled with other infected farms identified early in the epidemic. The posterior median date of the first infection was the 19th February, 2003, nine days prior to detection, with the 95% HPD ranging from the 16th until the 21st. This is somewhat later than previous estimates [36, 46]. The orange-bordered nodes in the tree are the twelve farms for which no sequence is available. Notably, the procedure placed them amongst their geographical neighbours. S2 Fig is the MCC phylogeny, with branches coloured by individual farm. It should be noted that the branch colourings in this figure do not reflect a history of the epidemic that is particularly representative of the posterior sample of transmission trees; they are simply the colourings of the phylogeny from the posterior with the highest clade credibility.

Nodes represent farms and are coloured by geographical region. Arrows represent direct transmissions and are coloured by the posterior probability of this particular direct infection. The cyan-bordered nodes, which are also labelled with farm ID numbers from previous literature [18], are were detected during the “high-risk” period before the implementation of control measures. Orange-bordered nodes are farms for which no sequence was available.

Table 6 summarises the parameter estimates. Of note, while we used an extremely informative prior distribution with a mean of two days on the length of the latent period, the estimate was still considerably shorter (posterior median 1.47 days, 95% HPD 1.26–1.69). The estimate of the mean infectious period in the low-risk period did not deviate greatly from the prior expected value of 7.3 (posterior median 7.52 days, 95% HPD 7.02–8.05). The posterior distribution for the mean during for the high-risk period, however, actually had a smaller median at 6.19 days (95% HPD 4.57–7.65), considerably shorter than the prior expected value of 13.8. On the other hand, the median infectious period of the index case (whichever it was in each MCMC state) was 12.7 days (9.59–16.7). Compared to the prior expected value of the precision of the distribution in both high-risk and low-risk periods of 0.263 days^{−2}, the estimated precisions were lower, with posterior medians of 0.167 (6.1 ⋅ 10^{−2}-5.41)days^{−2} for the high-risk period and 8.03 ⋅ 10^{−2} (6.28 ⋅ 10^{−2}-0.105)days^{−2} for the low-risk period. The parameters of the within host population function suggested that the effective size of the infected population rose very quickly towards its asymptotic value, achieving values extremely close to it within a day or so. If the median estimates were used, this asymptotic population size was 10.6 times the value at the point of infection. While this behaviour would not seem to reflect the likely course of an epidemic within a flock, it should be remembered that the within-farm model was extremely simplistic for this example. As the analysis in here lacks data on uninfected susceptibles, the estimated parameters of the transmission kernel here differ considerably to the maximum likelihood figures from Boender et al (which were, in our notation, *b* = 2 ⋅ 10^{−3} days^{−1}, *α*_{1} = 2.1 and *α*_{2} = 1.9) since the authors of that paper had access to data about every farm in the Netherlands.

While properties of epidemics such as reproduction numbers are not readily derived from the parameters of a model of this type, they can be estimated post-hoc. For example, the posterior median number of farms infected by the index farm in this epidemic, which is analogous to the basic reproduction number (for farms) *R*_{0}, was 7 (3–11). We also calculated, for each day in the epidemic, the mean number of farms subsequently infected by a farm infected on that day, which is equivalent to the effective case reproduction number *R*; the posterior distribution of this is summarised in S3 Fig. It should be noted the high values of *R* towards the start of the timeline come only from MCMC states for which the date of the index case was earlier than the median date of the 19th February; for many states no epidemic was present at that time and hence there were no infections to contribute to the calculation.

## Discussion

We provide here a novel method for simultaneous reconstruction of both phylogenies and transmission trees, fully incorporated into the existing BEAST package. Being part of an established package has the advantage that users of our method have access to existing models and methods for, for example, relaxed molecular clocks, ancestral sequence reconstruction, coalescent population models, and marginal likelihood estimation, without the need for extra programming work. The prior probability decomposition outlined above is also very flexible, allowing for many different distributions of infectious periods and models of spread between hosts.

The framework here builds primarily on the previous work of Didelot et al. [10] and Ypma et al. [8] combining the co-estimation of both trees from the latter with the internal node annotation of the former. The “colours” of Didelot et al correspond to our partition elements. (The earlier work of Cottam et al. [5] took a similar approach but insisted that each node was in the same partition element as one of its children, which makes the implicit assumption that lineages split at transmission as it is then impossible for any lineage to exist in a host if it is not the ancestor of a tip taken from that host.) We have extended this annotation to allow more than one tip to come from the same host, under the assumption that the host was infected only once. (If this may not be true, it might be more appropriate to treat the two introductions as separate “hosts”, particularly in an agricultural scenario.) This would be of use in, for example, the study of HIV, where multiple samples are often taken from the same patient over the course of treatment [47]. Our infection branch move serves the same purpose as the single move described by Didelot et al. but takes a rather different approach. The main difference is that it is a change to the tree partition, which may indirectly change an infection date, rather than to the infection dates themselves. Direct changes to infection dates in our framework are constrained to be those that cannot change the transmission tree, as they modify just the *q*_{i}s. Other differences are that our version makes only moves that respect the partition rules (and hence the proposal never violates them) and makes no assumption that hosts cease to be infectious at any point (which is left as a job for likelihood calculations).

The work of Ypma et al. [8] has now been placed fully in the framework of modern phylogenetic inference. That paper treated every within-host phylogeny as a separate entity to be modified individually. For compatibility with existing packages such as BEAST, which estimate a single tree, it is more convenient to partition that tree instead. We also provide a more formal mathematical exploration of the properties of the joint space (see S1 Text), demonstrating that it is impossible for an MCMC procedure to fully explore the space of transmission trees without letting the phylogeny vary if the latter has more than two tips, and also that varying the phylogeny does allow the algorithm complete access. Finally, we show that this is indeed true in our case, by demonstrating irreducibility of our Markov chain. The move modifying the transmission tree proposed by Ypma et al. is most similar (although not identical) to our type B Wilson-Balding proposal, but irreducibility is not proven in the paper; the partition structure that we propose makes this much easier to show.

Exploration of the space of both trees is important, as the short timespan of phylogenies from epidemics and outbreaks often results in phylogenies that are not particularly well resolved, and as a result, a two-step procedure such as that of Didelot et al. or Cottam et al. may make it impossible to infer many plausible transmission histories. While in some circumstances access to the full space of transmission trees may not be necessary because some are very implausible (it is unlikely, for example, that the last host in an outbreak lasting months was infected by the first), which trees are implausible will vary greatly depending on the nature of the pathogen. It would be reasonable to rule out direct transmission between two individuals if their infection dates were separated by years if the pathogen was influenza, but not if it was HIV. Therefore, we considered it important, in designing a method intended to be general and flexible, to allow access to every single transmission tree. A simultaneous procedure such as this is to be preferred to the option of running a separate fixed-tree analysis on each of a set of trees from a Bayesian posterior sample for reasons of computational time, and because the fixed trees may have been estimated using an inappropriate model (see below).

The extent to which a fixed phylogeny constrains the space of transmission trees, or indeed vice versa, is a mathematical problem which is worthy of investigation. For a fixed phylogeny with three tips, 5 out of 9 transmission tree are possible (a proportion of 0.56); for four tips, it is either 12 or 13 out of 64 (0.19 or 0.2) (see Figure S1.1 in S1 Text); the total number of transmission trees increases faster than the total number of partitions of the phylogeny. As the partition count varies with the phylogenetic topology, there is no simple expression for the former based simply on the tip count of the latter, and analytical work to formally describe the relationship would be instructive. The opposite question, by how much the set of potential phylogenetic topologies is constrained if the transmission tree is known, also arises. While every transmission tree is possible in the joint framework presented here, the imposition of a transmission model certainly has a constraining effect on the credible set of phylogenies, as we saw when we compared the number of clades in the credible sets between the WHC and skyride slow clock results. The lack of difference between the two for the fast clock results is due to the fact that the genetic data alone is enough to largely resolve the phylogeny in that case and the reconstruction needs no additional help from the population model.

We found here that the WHC method was superior in estimating both the topological structure of the phylogeny and its node heights than an analysis using the GMRF Skyride tree prior; the latter, which assumes all lineages belong to a single, freely-mixing population, is amongst the most frequently employed current methods for the reconstruction of time-resolved phylogenies. This adds weight to the concerns we expressed in the Introduction about two-step methods that use a phylogeny or set of phylogenies estimated by another method as input for epidemiological reconstruction; the assumptions under which those trees were made may violate the population model that the epidemiological inference is using, and as a result they may not be accurate. As we have here developed a more accurate tree prior for an epidemic situation, we would recommend that the WHC be used for reconstruction of the phylogeny of suitable datasets even if the transmission tree is not of interest.

A frequent concern surrounding analyses of this sort has been the question of unsampled hosts or clinical cases. Some progress has been made in dealing with this issue recently in the non-phylogenetic methods [4, 11], and Numminen et al. [48] outlined a novel two-step, importance sampling method for the investigation of transmission trees using potentially sparsely sampled data based on a fixed, maximum-parsimony phylogeny. We go some way to addressing this problem in a one-step process because, as demonstrated, our method can include epidemiological information for known clinical cases for which no sequence is available. Scenarios of this sort, indeed, provide another reason to prefer a one-step approach; as a standard phylogenetic analysis is unaware of any epidemiological information other than dates of sampling, it has no information to use in placing a noninformative sequence in the tree. The position of a corresponding tip in a fixed phylogeny used as input for a two-step method will be effectively random. Our method can, instead, use epidemiological data such as the location of the case, as well as a prior or hyperprior on the time from infection to noninfectiousness, to place these with more certainty. It can be seen from the reconstructed transmission tree (Fig 5) from the H7N7 epidemic that the farms for which no sequence was available are placed amongst their geographical neighbours, which would be expected unless there was a particular reason to believe otherwise.

This is obviously not a complete solution to the problem; more challenging is the issue of the identification of unknown unsampled hosts in the transmission chain, and the quantification of the number of them. This is the principal limitation of this method, and further work is needed to address it. Two approaches have been suggested previously [10, 11], both of which could be accommodated as a modification to the WHC. The first is to create a pool of unsampled hosts, of variable size, and use reversible-jump MCMC [49] to add and subtract from it. Internal nodes in the phylogeny can then be assigned to elements of this set, obeying the rules about connectedness but disregarding that about each partition element containing a tip. The second is to allow hosts to “indirectly” infect others even after they ceased to be infectious. The assignment of a node to a particular partition element would no longer indicate that the lineage represented by that node was actually present in the host represented by the tip in the same element, but just that it was infected by that host before it entered any other sampled host. We suggest a third option, which is to allow the assignment of internal nodes to no host at all. The mathematical framework would require modification; for example, an expression would be needed for the probability of the infection of a host from an unknown source.

The assumption that transmission is a complete bottleneck is hard to relax, as one of the fundamental principles of the correspondence between transmission trees and partitions, that the nodes in a partition element form a connected subtree of the whole phylogeny, must then be discarded as the common ancestral node of two nodes in the same host may be outside that host. The realism of this assumption is often unclear and will vary from pathogen to pathogen; while the bottleneck has been found to be quite loose for individual-to-individual transmission of influenza [50], this may be less true when, as in our example, transmission is between farms [18]. For other organisms, such as HIV [51] and hepatitis C virus [52], it has been found that the number of transmitted variants is usually very small between individuals.

The assumption that the parameters of the three models are independent of each other is a simplification which could be relaxed in subsequent work. Potentially, all three could interact: within-host dynamics are likely to affect both the infectiousness of a particular host and the duration of its infection, and particular mutations may modify the behaviour of the infection both within and between hosts. The assumption that mutation is a neutral process is quite standard in phylogenetics, but there is considerable scope for work which instead accounts for selection.

Treating the infection status of a host upon examination as part of the data, rather than as background information, simplifies the mathematics surrounding infectious periods while allowing multiple examinations of the same host at different times. It has other consequences. On the positive side, it opens up the possibility of including the results of genuinely negative examinations as data, in a way that does not involve adjusting individual prior probabilities for infection times. Such a negative examination must be as near as possible to conclusive, however; the absence of clinical symptoms is certainly not sufficient. On the negative side, it means that an algorithm to sample from the *prior* probability distribution must be able to vary the number of tips in the tree, a very non-standard procedure which is not implemented currently in BEAST or any other commonly-used package. In addition, the assumption that examination does not disturb the infection will be in most cases a simplification; for many pathogens, positive examinations will result in clinical intervention which will shorten the time to noninfectiousness. The alternative described here whereby hosts become noninfectious upon examination avoids both these issues. If this too is regarded as unacceptable, the framework here could be adjusted to model the time from infection to first examination, rather than from infection to noninfectiousness.

As outlined in Methods, the parameters *b* and *ψ* determining between-host transmission cannot be estimated unless the set of uninfected susceptibles can be described. This may not be possible in many cases; in our H7N7 analysis we did not have such data and hence can only estimate *b*′ and *ψ*′, the values that they would take if no susceptibles remained at the end of the epidemic. One would expect transmission rates to be overestimated if there is no contribution from hosts that were never infected. If it is not feasible to acquire this information and these parameter are still of particular interest, then a method to estimate the contribution of the set of uninfecteds is needed.

The WHC method was very successful in recapturing the epidemiological parameters of the simulations where sequences were generated by a fast clock, in which case the level of genetic diversity was such that there was little uncertainty in the phylogeny. Moving to a more realistic level of diversity decreased the accuracy of estimates considerably, and this illustrates the importance of using existing information, be it genetic or epidemiological, when configuring an analysis of this type. In particular, the results of the simulation analysis showed a clear bias towards underestimating the infectious periods of hosts. The reason for this is that the kind of within-host phylogenies that maximise the probability expressions Eqs (1) and (2) with an increasing within-host population are those that have only short periods in which the tree has more than one lineage, and where those short periods are close to the time of infection. The probability each such phylogeny is therefore increased, all else being equal, by moving the infection time towards the tips. This phenomenon seems similar to that which arises from the specification of “calibration densities” [53] for the root height in a standard time-tree analysis where that prior distribution conflicts with the root height that would be expected from the priors on the coalescent parameters, although as the WHC considers multiple coalescent trees with variable tip dates, the situation is considerably more complicated. The interaction between within- and between-host models is clearly complex and could motivate analytical work to find variants in which the effective priors do not interfere with each other in this way.

In any case, the bias in estimation of infectious periods is overwhelmed by sufficient genetic data (as in the fast clock dataset) and can be largely mitigated by placing a suitably informative prior on the length of infectious periods. A clear implication of this is that genetic data should not be relied upon to estimate infectious periods on their own if other information is available to inform such a prior. (We note that this bias does not affect estimation of who infected who, as the accuracy of transmission tree reconstruction did not significantly change when we changed the prior on *μ*_{inf}.) In a similar vein, the lack of genetic diversity in the slow clock dataset meant that in some cases the molecular clock rate itself could not be reliably estimated and had to be fixed to its known value. In an actual epidemic situation it would seem perfectly reasonable to do this, using a rate derived from older data, unless it was clear that the pathogen in question was novel.

The concerns about the use of uninformative priors on infectious periods led us to use informative distributions taken from previous literature in the H7N7 analysis, and we continued the practice from the simulations of using highly informative priors on latent periods as preliminary work had shown that these tended not to be well estimated using uninformative priors. Some analysis results nevertheless deviated from what would be expected under the prior distributions; in particular the estimated mean infectious period during the high-risk period, and the estimated latent period, were both considerably smaller than their prior expected values. While it is possible that these underestimates are at least partly the result of the bias that was noted in the simulations, the difference is considerably more extreme than anything observed there. While our analysis agrees that the infectious period in the index case may have been around two weeks, the genetic data seems to suggest, contrary to previous work [36, 37], this was not the case for the remaining high-risk farms and that they may have been infectious for a shorter time than the low-risk period farms. This suggests that the infection was present within the index farm for a considerable time before transmission began to occur. The estimated shorter periods in the other high-risk farms may be due to increased surveillance in nearby facilities once the index infection was discovered. In the case of the latent period, the MCMC never actually sampled a value of two days or more at all, which previous work had assumed *a priori* as its length. While it is true that the assumption of a single latent period for all farms is a considerable simplification, this still suggests that the phylogeny is simply unable to accommodate a situation where all latent periods are of two days or greater. This analysis also suggested much greater variation in the lengths of infectious periods than had been previously estimated, in the low-risk period at least. These three observations suggest possible insights that genetic data can provide have not been apparent in traditional analyses.

The other model parameters that are not well recovered in the slow clock simulation dataset are those of the within-host coalescent process. This paper has concentrated on the between-host model, and if the WHC method it is to be used to investigate within-host dynamics in detail then further work is needed. It may be that the situation is improved if multiple sequences are taken from the same host. The estimated parameters of the logistic growth function for H7N7 should certainly not be overinterpreted, for several reasons. First, it is a gross oversimplification to assume that the infected population of each farm grew according to the same function, especially when the farms infected in this epidemic ranged from hobby farms to large agricultural facilities. Secondly, the “effective number of infections” will differ from the true size of the infected population due to the violation of assumptions made in the coalescent process. While the WHC is designed to deal with violations of the assumption of homogeneous mixing of lineages that in fact infect separate hosts in an epidemic, lineages would not be expected to mix freely within farms (or indeed host organisms) either. It has also been shown that if the population in a coalescent model is treated as being made up of infected individuals, the relationship between the effective population size and prevalence is not as straightforward as might be assumed [17, 54]. Lastly, logistic growth may be too simplistic a model of the infected population. It was picked here because it is clearly a better fit to growth within a farm than a constant population size or exponential growth, but true dynamics are no doubt more complicated still.

Even the “slow clock” simulation dataset was intended to represent the full genome of influenza A, one of the most fast-evolving pathogens that is likely to cause an outbreak of this type. The resolution in the reconstruction of the transmission tree for the H7N7 outbreak could be increased if sequences for the remaining segments of the genome were available; consistently higher posterior probabilities for infectors were observed in the simulation analyses. As the short timescale of an epidemic already places a limit on the amount of information that can be gleaned from genetic data, we would suggest that resources be expended to sequence as much of the pathogen genome as possible in a situation of this sort.

In conclusion, what we have demonstrated here is both a new phylogenetic method for the analysis of genetic data taken from outbreaks and epidemics, and a new transmission tree reconstruction method. For phylogeny reconstruction we have developed a population model that is more realistic than the assumptions of freely-mixing lineages that are made in the most widely-used current methods. For transmission tree reconstruction, we have advanced the development of models that accommodate within-host diversity with a procedure that maintains the previously-noted correspondence between transmission trees and the annotation of internal nodes in a phylogeny while exploring the full space of phylogenies, which is required to allow access to the full space of transmission trees. As part of BEAST, it is publicly available (as of version 1.8.2), and compatible with any other model of interest that is implemented in that package. We hope that it will prove useful in future for researchers working on genetic analysis of outbreaks.

## Supporting Information

### S1 Fig. The effects of fixing the within-host model parameters on the reconstruction of the transmission tree.

Each violin plot represents the density of a statistic over the 25 simulations; results come from analyses where the within-host model parameters were estimated and fixed for the fast and slow clock datasets. (A) posterior median of mean bias in estimation of infection dates. (B) posterior median of mean error in estimation of infection dates. (C) Posterior median proportion of hosts whose infector is correctly identified. (D) Proportion of hosts whose infector is correctly identified in the maximum parent credibility (MPC) transmission tree.

https://doi.org/10.1371/journal.pcbi.1004613.s001

(EPS)

### S2 Fig. Maximum clade credibility phylogentic tree for the H7N7 outbreak.

This is the actual sampled tree with highest clade credibility from the posterior set; branch lengths have not been adjusted. Branches are coloured by farm; colour changes along branches reflect infection events.

https://doi.org/10.1371/journal.pcbi.1004613.s002

(EPS)

### S3 Fig. Reconstruction of the change in effective case reproduction number over the H7N7 epidemic.

The black line represents the posterior median value of the mean number of secondary cases caused by a case infected during each day, the blue lines the upper and lower bounds of the 95% highest posterior density interval.

https://doi.org/10.1371/journal.pcbi.1004613.s003

(EPS)

### S1 Text. Supplementary methods.

Full details of the correspondence between transmission trees and partitions of the nodes of a phylogeny, and of the MCMC proposals.

https://doi.org/10.1371/journal.pcbi.1004613.s004

(PDF)

### S2 Text. Supplementary methods 2.

Derivation of the probability .

https://doi.org/10.1371/journal.pcbi.1004613.s005

(PDF)

### S1 Data. BEAST XML files for simulated data.

BEAST input file for all WHC analyses of all simulated datasets.

https://doi.org/10.1371/journal.pcbi.1004613.s006

(ZIP)

## Acknowledgments

We would like to thank Trevor Bedford, Samantha Lycett and Melissa Ward for input into the development of this method. We also thank Jantien Backer, Rolf Ympa and the Central Veterinary Institute of Wageningen UR for giving us access to the farm distance matrix for the H7N7 outbreak.

## Author Contributions

Conceived and designed the experiments: MH MW AR. Performed the experiments: MH. Analyzed the data: MH. Contributed reagents/materials/analysis tools: AR. Wrote the paper: MH AR MW.

## References

- 1. Liu J, Lim SL, Ruan Y, Ling AE, Ng LFP, Drosten C, et al. SARS transmission pattern in Singapore reassessed by viral sequence variation analysis. PLOS Med. 2005;2:e43. pmid:15736999
- 2. Spada E, Sagliocca L, Sourdis J, Garbuglia AR, Poggi V, Fusco CD, et al. Use of the minimum spanning tree model for molecular epidemiological investigation of a nosocomial outbreak of hepatitis C virus infection. J Clin Microbiol. 2004;42:4230–4236. pmid:15365016
- 3. Aldrin M, Lyngstad TM, Kristoffersen AB, Storvik B, Borgan Ø, Jansen PA. Modelling the spread of infectious salmon anaemia among salmon farms based on seaway distances between farms and genetic relationships between infectious salmon anaemia virus isolates. J Roy Soc Interface. 2011;8:1346–1356.
- 4. Jombart T, Eggo RM, Dodd PJ, Balloux F. Reconstructing disease outbreaks from genetic data: a graph approach. Heredity. 2011;106:383–390. pmid:20551981
- 5. Cottam EM, Thébaud G, Wadsworth J, Gloster J, Mansley L, Paton DJ, et al. Integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus. P Roy Soc B. 2008;275:887–895.
- 6. Ypma RJF, Bataille AMA, Stegeman A, Koch G, Wallinga J, van Ballegooijen WM. Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data. P Roy Soc B. 2011;279:444–450.
- 7. Morelli MJ, Thébaud G, Chadœuf J, King DP, Haydon DT, Soubeyrand S. A Bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data. PLOS Comput Biol. 2012;8:e1002768. pmid:23166481
- 8. Ypma RJF, Ballegooijen WMv, Wallinga J. Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics. 2013;195:1055–1062. pmid:24037268
- 9. Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C, Ferguson N. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLOS Comput Biol. 2014;10:e1003457.
- 10. Didelot X, Gardy J, Colijn C. Bayesian inference of infectious disease transmission from whole genome sequence data. Mol Biol Evol. 2014;31:1869–1879. pmid:24714079
- 11. Mollentze N, Nel LH, Townsend S, Roux Kl, Hampson K, Haydon DT, et al. A Bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data. P Roy Soc B. 2014;281:20133251.
- 12. Worby CJ, Chang HH, Hanage WP, Lipsitch M. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genetics. 2014;198:1395–1404. pmid:25313129
- 13. Stadler T, Kouyos R, von Wyl V, Yerly S, Böoni J, Bürgisser P, et al. Estimating the basic reproductive number from viral sequence data. Mol Biol Evol. 2011;29:347–357. pmid:21890480
- 14. Stadler T, Kühnert D, Bonhoeffer S, Drummond AJ. Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). P Natl Acad Sci USA 2013;110:228–233.
- 15. Leventhal GE, Günthard HF, Bonhoeffer S, Stadler T. Using an epidemiological model for phylogenetic inference reveals density dependence in HIV transmission. Mol Biol Evol. 2014;31:6–17. pmid:24085839
- 16. Kühnert D, Stadler T, Vaughan TG, Drummond AJ. Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth–death SIR model. J Roy Soc Interface. 2014;11:20131106.
- 17. Volz EM, Pond SLK, Ward MJ, Brown AJL, Frost SDW. Phylodynamics of infectious disease epidemics. Genetics. 2009;183:1421–1430. pmid:19797047
- 18. Bataille A, van der Meer F, Stegeman A, Koch G. Evolutionary analysis of inter-farm transmission dynamics in a highly pathogenic avian influenza epidemic. PLOS Pathog. 2011;7:e1002094. pmid:21731491
- 19. Gibson GJ, Austin EJ. Fitting and testing spatio-temporal stochastic models with application in plant epidemiology. Plant Pathol. 1996;45:172–184.
- 20. Keeling MJ, Woolhouse MEJ, Shaw DJ, Matthews L, Chase-Topping M, Haydon DT, et al. Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape. Science. 2001;294:813–817. pmid:11679661
- 21. Deardon R, Brooks SP, Grenfell BT, Keeling MJ, Tildesley MJ, Savill NJ, et al. Inference for individual-level models of infectious diseases in large populations. Statistica Sinica. 2010;20:239–261. pmid:26405426
- 22. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29:1969–1973. pmid:22367748
- 23. Drummond AJ, Nicholls GK, Rodrigo AG, Solomon W. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics. 2002;161:1307–1320. pmid:12136032
- 24. Hohna S, Defoin-Platel M, Drummond A. Clock-constrained tree proposal operators in Bayesian phylogenetic inference. 8th IEEE International Conference on BioInformatics and BioEngineering, 2008. BIBE 2008; 2008. p. 1–7.
- 25. Wilson IJ, Balding DJ. Genealogical inference from microsatellite data. Genetics. 1998;150:499–510. pmid:9725864
- 26. Hasegawa M, Kishino H, Yano Ta. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J of Mol Evol. 1985;22:160–174.
- 27. Rodriguez F, Oliver JL, Marin A, Medina JR. The general stochastic model of nucleotide substitution. J Theor Biol. 1990;142:485–501. pmid:2338834
- 28. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. pmid:7288891
- 29. Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLOS Biology. 2006;4:e88. pmid:16683862
- 30. Slatkin M, Hudson RR. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991;129:555–562. pmid:1743491
- 31. Pybus OG, Charleston MA, Gupta S, Rambaut A, Holmes EC, Harvey PH. The epidemic behavior of the hepatitis C virus. Science. 2001;292:2323–2325. pmid:11423661
- 32. Bielejec F, Lemey P, Carvalho LM, Baele G, Rambaut A, Suchard MA. πBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios. BMC Bioinformatics. 2014;15:133. pmid:24885610
- 33. Minin VN, Bloomquist EW, Suchard MA. Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Mol Biol Evol. 2008;25:1459–1471. pmid:18408232
- 34. Whidden C, Zeh N, Beiko RG. Supertrees based on the subtree prune-and-regraft distance. Syst Biol. 2014;63:566–581. pmid:24695589
- 35. Elbers ARW, Fabri THF, Vries TSd, Wit JJd, Pijpers A, Koch G. The highly pathogenic avian influenza A (H7N7) virus epidemic in the Netherlands in 2003: lessons learned from the first five outbreaks. Avian Dis. 2004;48:691–705. pmid:15529997
- 36. Stegeman A, Bouma A, Elbers ARW, De Jong MCM, Nodelijk G, De Klerk F, et al. Avian influenza A virus (H7N7) epidemic in the Netherlands in 2003: course of the epidemic and effectiveness of control measures. J Infect Dis. 2004;190:2088–2095. pmid:15551206
- 37. Boender GJ, Hagenaars TJ, Bouma A, Nodelijk G, Elbers ARW, de Jong MCM, et al. Risk maps for the spread of highly pathogenic avian influenza in poultry. PLOS Comput Biol. 2007;3:e71. pmid:17447838
- 38. Jonges M, Bataille A, Enserink R, Meijer A, Fouchier RAM, Stegeman A, et al. Comparative analysis of avian influenza virus diversity in poultry and humans during a highly pathogenic avian influenza A (H7N7) virus outbreak. J Virol. 2011;85:10598–10604. pmid:21849451
- 39. Van Borm S, Jonges M, Lambrecht B, Koch G, Houdart P, van den Berg T. Molecular epidemiological analysis of the transboundary transmission of 2003 highly pathogenic avian influenza H7N7 outbreaks between the Netherlands and Belgium. Transbound Emerg Dis. 2014;61:86–90. pmid:22994451
- 40. Ypma RJF, Jonges M, Bataille A, Stegeman A, Koch G, Boven Mv, et al. Genetic data provide evidence for wind-mediated transmission of highly pathogenic avian influenza. J Infect Dis. 2013;207:730–735. pmid:23230058
- 41. Bogner P, Capua I, Lipman DJ, Cox NJ, et al. A global initiative on sharing avian flu data. Nature. 2006;442:981–981.
- 42. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. pmid:15034147
- 43. Shapiro B, Rambaut A, Drummond AJ. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol Biol Evol. 2006;23:7–9. pmid:16177232
- 44. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504. pmid:14597658
- 45. White LF, Pagano M. A likelihood-based method for real-time estimation of the serial interval and reproductive number of an epidemic. Stat Med. 2008;27:2999–3016. pmid:18058829
- 46. Bos MEH, Boven MV, Nielen M, Bouma A, Elbers ARW, Nodelijk G, et al. Estimating the day of highly pathogenic avian influenza (H7N7) virus introduction into a poultry flock based on mortality data. Vet Res. 2007;38:493–504. pmid:17425936
- 47. Vrancken B, Rambaut A, Suchard MA, Drummond A, Baele G, Derdelinckx I, et al. The genealogical population dynamics of HIV-1 in a large transmission chain: bridging within and among host evolutionary rates. PLOS Comput Biol. 2014;10:e1003505. pmid:24699231
- 48. Numminen E, Chewapreecha C, Sirén J, Turner C, Turner P, Bentley SD, et al. Two-phase importance sampling for inference about transmission trees P Roy Soc B. 2014;281:20141324.
- 49. Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
- 50. Hughes J, Allen RC, Baguelin M, Hampson K, Baillie GJ, Elton D, et al. Transmission of equine influenza virus during an outbreak is characterized by frequent mixed infections and loose transmission bottlenecks. PLOS Pathog. 2012 Dec;8:e1003081. pmid:23308065
- 51. Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, Salazar MG, et al. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. P Natl Acad Sci USA. 2008;105:7552–7557.
- 52. Bull RA, Luciani F, McElroy K, Gaudieri S, Pham ST, Chopra A, et al. Sequential bottlenecks drive viral evolution in early acute hepatitis C virus infection. PLOS Pathog. 2011 Sep;7:e1002243. pmid:21912520
- 53. Heled J, Drummond AJ. Calibrated tree priors for relaxed phylogenetics and divergence time estimation Syst Biol. 2012;61:138–149. pmid:21856631
- 54. Frost SDW, Volz EM. Viral phylodynamics and the search for an ‘effective number of infections’ Philos T Roy Soc B. 2010;365:1879–1890.