Bayesian phylodynamic inference with complex models

Population genetic modeling can enhance Bayesian phylogenetic inference by providing a realistic prior on the distribution of branch lengths and times of common ancestry. The parameters of a population genetic model may also have intrinsic importance, and simultaneous estimation of a phylogeny and model parameters has enabled phylodynamic inference of population growth rates, reproduction numbers, and effective population size through time. Phylodynamic inference based on pathogen genetic sequence data has emerged as useful supplement to epidemic surveillance, however commonly-used mechanistic models that are typically fitted to non-genetic surveillance data are rarely fitted to pathogen genetic data due to a dearth of software tools, and the theory required to conduct such inference has been developed only recently. We present a framework for coalescent-based phylogenetic and phylodynamic inference which enables highly-flexible modeling of demographic and epidemiological processes. This approach builds upon previous structured coalescent approaches and includes enhancements for computational speed, accuracy, and stability. A flexible markup language is described for translating parametric demographic or epidemiological models into a structured coalescent model enabling simultaneous estimation of demographic or epidemiological parameters and time-scaled phylogenies. We demonstrate the utility of these approaches by fitting compartmental epidemiological models to Ebola virus and Influenza A virus sequence data, demonstrating how important features of these epidemics, such as the reproduction number and epidemic curves, can be gleaned from genetic data. These approaches are provided as an open-source package PhyDyn for the BEAST2 phylogenetics platform.

c , i c , j c )} c=1:(n−1) is the set of internal nodes represented by a tuple and each also correspond to coalescent events. The indices i and j correspond to the daughter lineages of the internal node, and t a is the time when the internal node occurs T S = (t 1 , · · · , t n ) denotes times of sampling. We assume the state at time of sampling s i (t i ) is known.
x , k x )} x=1:n denotes the set of sample events represented by a tuple. The time of sampling is t s and the observed deme at time of sampling is k. D = {(t d , k d , l d )} d=1:N d denotes the set of N d deme-change events represented by a tuple. The time of the event is denoted t d . The initial deme is k and the deme following the change is l.
Let T = (t 1 , · · · , t N ) denotes the ordered sequence of all event times (sampling, migration, or coalescence) in the tree building process.
Note that the total coalescence rate between lineages i and j given known lineage-deme configuration S is λ ij (t)|S = φ si(t),sj (t) (t) + φ sj (t),si(t) (t) We can define the cumulative hazard of migration or coalescence between consecutive time points in T . Note that the configuration S is fixed during this period, but migration and coalescence rates may vary. The cumulative hazard over an interval is: The joint probability density of the genealogy and deme-configuration S(t) can now be defined: The first line provides the probability of no events occurring in each internode interval, the second and third lines provide the rates of each coalescent and deme-change event, and the last line describes the density of the sample times and sample states [5]; in most applications there is not a model of the sampling process or prior information to establish this density, and it is not included. Since in practice the configurations S(t) are not observed except at times of sampling, estimation of migration rates and/or effective population sizes then requires a strategy for marginalizing over latent S(t), and this is typically done by MCMC simulation [4]. This entails a large computational cost of integrating over a vary large configuration space [6], so instead we pursue a strategy of developing equations to marginalize over lineage states and will derive an approximate likelihood in terms of the probability that a lineage is in a particular deme. Our strategy is to derive equations for p ik (t) = p(s i (t) = k) at all times t over the history of each lineage i. Deriving equations for the joint probability of s i and s j is complicated, so we make the simplifying approximation that the conditional probability of state j is equal to the marginal: p(s j = l|s i = k) ≈ p(s j = l). We may then compute rates of deme-change for lineage i and the rate of coalescence of i with all other lineages. Where the configuration is unknown, we have We compute the cumulative hazard of coalescence in internode intervals as before, but considering only times of sampling and times of coalescence. In contrast to the above equation for Ω i |S, the lineage configuration will change over the course of each internode interval. The symbol S(t) denotes the set of all possible lineage-deme configurations at time t and p(S(t)) is the marginal probability of a particular configuration S.
We introduce the indicator function I(x = y) which evaluates as one if and only if x equals y and is zero otherwise. Rearranging the order of summation Recognizing that p(s i = k)p(s j = l|s i = k) = S∈S(t) p(S(t))I(s i = k)I(s j = l) , we can simplify the preceding: Thus we can avoid computing the full intractable density p(S(t)) and use the approximation for λ ij (t) in equation 3 provided that we can solve for the marginal probabilities p(s i = k) and it is shown how to do that below.
Marginalising over all possible configurations in S(t), the density of a genealogy is We now derive a system of m + 1 equations for the probability p ik (t) = p(s i (t) = k). These equations are defined in terms of the (F, G, Y ) model decomposition and the reader should review the main text and reference [1] for how this works. As stated here, these equations are self-consistent, and it is only of importance that the rates of different migration and coalescent events through time are pre-specified. The events which can modify the deme of lineage i when tracing its history backwards in time are Migration. Lineage i will move from deme k to l at a rate G lk (t)/Y k (t).
Birth. A lineage not ancestral to the sample (not represented by a lineage in the tree) may give birth to lineage i. The deme of lineage i would then be the deme of the lineage which gave birth to i.
Coalescence. We must condition on the event that no coalescence is observed over each internode interval.
We find it necessary to distinguish between birth events which result in coalescence and birth events which will not. Note that this is not usually required in phylogeographic coalescent models because birth events would not result in deme change but only in potential coalescence. Collecting the first two types of events which change the deme of a lineage, we can define the following time dependent migration matrix for lineage i: This provides the rate that lineage i migrates from deme k to l at time t. As above, the joint density for s i and s j is not tractable, so we use the approximation To account for the effect of coalescence on evolution of the state vector p i (t), we instead develop equations for the augmented state vectorp i (t), which has m + 1 elements and wherep i,m+1 (t) represents the probability that lineage i has coalesced at some time t < t. If t i represents the initial time of lineage i (the most recent time that i exists on forward time axis), we havẽ p ik (t) = p ik (t i ) k < m + 1 andp i,m+1 (t i ) = 0. This vector provides the marginal probability of a 'leaky' continuous time Markov process with m + 1 states, where m + 1 is an absorbing state. To retrieve the state vector conditioning on no coalescent event having occurred, we use the identity if k < m + 1 4/8 The forward equations for the evolution ofp i are found by tabulating the effects of migration, birth, and coalescent events outlined above: +O(∆t) Using the approximation p(s j = l|s i = k) ≈ p jl (t) and simplifying this becomes

+O(∆t)
Note that the state vector p j is used instead ofp j when computing rates of coalescence since we must condition as well on no coalescent event having occurred on lineage j. Computing the limit ∆t → 0 and making appropriate substitutions of M (t) and φ(t) for F (t) and G(t) gives the system of differential equations: The 'initial' state of a lineage following coalescence is found according to the model in [1]. If i and j coalesce at a lineage a, p ak (t) is the probability that i was in state k and gave birth to l or vice versa:

Derivation of PL1
In [1], the following approximation (denoted "com12") was presented which required the solution of the following differential equations: 5/8 where R com12 is the m × m matrix with elements in equation 15, which was an approximation intended to account for the fact that only births from lineages not ancestral to the sample could cause a lineage to change state without resulting in coalescence. The form of this equation was found to provide an accurate approximation to the lineages through time in [1], when solving the system of equations A drawback of the com12 model is that it does not condition on all possible events that lineage i may experience during its evolution; namely it does not condition on the observation that no coalescent events occurred during an internode interval. Consequently, this model tends to overestimate the probability that a lineage occupies a deme with a high coalescent rate. This issue was thoroughly explored in a recent investigation by Muller et al. [7] in the special case of phylogeographic models (diagonal F (t), constant rates, constant population size). A preliminary model which also addresses this issue in was introduced by Volz in the rcolgem R package [8] which was also described in [9], and we build on the approach developed therein which explicitly computes the cumulative probability that lineage i has coalesced.
Now we collect state vectors into a matrixP (P L1) (t) with m rows and columns (p 1 , · · · ,p 2n−2 ). And we will define a matrix Ψ with the same dimensions asP P L1 which gives the rate of coalescence in each lineage due to births by that lineage and conditional on membership in each deme. Ψ has elements We can now solve for the state vectors with the following system of equations Normalizing vectors and computing likelihoods then proceeds as above.
Note that PL1 is a little faster than PL2, but is making an additional approximation: It does not account for the number or type of ancestral lineages when computing the rate that a lineage changes state do to birth events (note the way that F (t) enters into the equation for R(t)).

Derivation of QL
Unfortunately, the system of equations 12 and 19 can be slow to solve since it requires recursion over extant lineages (twice) and m + 1 ancestral states. We therefore provide an additional approximation which greatly reduces computational cost. As above,p i represents an augmented state vector with m + 1 elements wherep i,m+1 represents the probability that lineage i has coalesced. Note that m + 1 is an absorbing state which increases at the rate λ i· (t). We define Q(t, T ) to be the m × m matrix of transition probabilities such thatp QL i (τ ) = Q (t, τ )p QL i (t) andp QL i provides the length m unnormalized state vector excluding the m + 1'th element. This vector can be renormalized giving the unit vector p QL ik =p QL ik /|p QL i |. We can also approximate the number of lineages in each deme over the interval using where t < τ < T . Finally, we can modify equation 12 to use the vector A(τ ), avoiding the need to sum over extant lineages. The following matrix provides transition rates between states.
Note that the diagonal of the rate matrix also includes the approximate rate that a lineage will coalesce given that it begins the internode interval in each of the m demes. Thus this is a 'leaky' Markov process and the probabilities in each column of Q will not in general sum to one. Initially Q(t, t) = I is the identity matrix. The matrix Q(t, T ) is computed by solving the equations d dτ Q(t, τ ) = R QL (τ ) Q(t, τ ) Therefore only m 2 differential equations need to be solved over every internode interval. Note that the rate of coalescence in equation 21 exceeds that in 12, so that this fast approximation will slightly over-estimate the probability that a lineage coalesces over an interval.