Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes

Demographic events shape a population’s genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.


Introduction
The demographic history of a species has a profound impact on its genetic diversity. Changes in population size, migration and admixture events, and population splits and mergers, shape the genealogies describing how individuals in a population are related, which in turn shape the pattern and frequency of observed genetic variants in extant genomes. By modeling this process and integrating out the unobserved genealogies, it is possible to infer the population's demographic history from the observed variants. However, in practice this is challenging, as individual mutations provide limited information about tree topologies and branch lengths. If many mutations were available to infer these genealogies this would not be problematic, but the expected number of observed mutations increases only logarithmically with the number of observed genomes, and recombination causes genealogies to change along the genome at a rate proportional to the mutation rate. As a result there is considerable uncertainty about the genealogies underlying a sample of genomes, and because the space of genealogies across the genome is vast, integrating out this latent variable is hard. A number of approaches have been proposed to tackle this problem [reviewed in 1]. A common approximation is to treat recombination events as known and assume unlinked loci, either by treating each mutation as independent [2][3][4][5][6][7], or by first identifying tracts of genetic material unbroken by recombination [8][9][10][11][12]. To account for recombination while retaining power to infer earlier demographic events, it is necessary to model the genealogy directly. ARGWeaver [13] uses Markov chain Monte Carlo (MCMC) for inference, but does not allow the use of a complex demographic model, and since mutations are only weakly informative about genealogies this leaves the inferred trees biased towards the prior model and less suitable for inferring demography. 
Restricting itself to single diploid genomes, the Pairwise Sequentially Markovian Coalescent (PSMC) model [14] uses an elegant and efficient inference method, but with limited power to detect recent changes in population size or complex demographic events. Several other approaches exist that improve on PSMC in various ways [15][16][17][18], but they remain limited particularly in their ability to infer migration.
We here focus on the general problem of inferring demography from several whole-genome sequences, which is informative about demographic events in all but the most recent epochs [13,14,16]. A promising approach which so far has not been applied to this problem is to use a particle filter. Particle filters have many desirable properties [19][20][21][22], and applications to a range of problems in computational biology have started to appear [23][24][25][26]. Like MCMC methods, particle filters converge to the exact solution in the limit of infinite computational resources, are computationally efficient by focusing on realisations that are supported by the data, do not require the underlying model to be approximated, and generate explicit samples from the posterior distribution of the latent variable. Unlike MCMC, particle filters do not operate on complete realisations of the model, but construct samples sequentially, which is helpful since full genealogies over genomes are cumbersome to deal with.
To use particle filters, we use a formulation of the coalescent model in which the state is a genealogical tree at a particular genome locus, which "evolves" sequentially along the genome, rather than in evolutionary time. To avoid confusion, in this paper "time" by itself refers to the variable along which the model evolves, while evolutionary (coalescent, recombination) time refers to an actual time in the past on a genealogical tree.
Originally, particle filters were introduced for models with discrete time evolution and with either discrete or continuous state variables [19,27]. In this paper, the latent variable is a piecewise constant sequence of genealogical trees along the genome, with trees changing only after recombination events that, in mammals, occur once every several hundred nucleotides. The observations of the model are genetic variants, which are similarly sparse. Realizations of the discrete-time model of this process (where "time" is the genome locus) are therefore stationary (remain in the same state) and silent (do not produce an observation) at most transitions, leading to inefficient algorithms. Instead, it seems natural to model the system as a Markov jump process (or purely discontinuous Markov process, [28]), a continuous-time stochastic process whose realisations are piecewise constant functions x : [1, L) → T, where T is the state space of the Markov process (the space of genealogical trees over a given number of genomes) and L is the length over which observations are made (here the genome size).
Particle filters have been generalised to continuous-time diffusions [29][30][31], as well as to Markov jump processes on discrete state spaces [32,33], and hybrids of the two [34,35], as well as to piecewise deterministic processes [36]; for a general treatment see [37,38]. Here we focus on Markov jump processes that are continuous in both time and state space; to our knowledge the method has not been extended to this case. The algorithm we propose relies on Radon-Nikodym derivatives [see e.g. 31], and we establish criteria for choosing a finite set of "waypoints" that makes it possible to reduce the problem to the discrete-time case, while ensuring that particle degeneracy remains under control.
Although the algorithm generally works well, we found that for the CwR model we obtain biased inferences for some parameters. For example, coalescent rates for recent epochs are associated with tree nodes that persist across long genomic segments (the model exhibits "long forgetting times"), because their short descendant branches attract few recombinations. They have few informative mutations as well, and collecting these mutations therefore requires long lags in the fixed-lag smoothing procedure, in turn resulting in increased particle degeneracy [39]. For discrete-time models the Auxiliary Particle Filter [40] addresses a related problem by "guiding" the particle filter towards states that are likely to be relevant in future iterations, using an approximate likelihood that depends on data one step ahead. This approach does not work well for some continuous-time models, including ours, that have no single preferred time scale. Instead we introduce an algorithm that shapes the resampling process by an approximate "lookahead likelihood" that can depend on data at arbitrary distances ahead. Using simulations we show that this substantially reduces the bias.
The particle filter generates samples from the posterior distribution of the latent variable, here the sequence of genealogies along the genome, and we infer the model parameters from this sample. One strategy is to use stochastic expectation-maximization [SEM; 41]. However, such approaches yield point estimates, ignoring any uncertainty in the inferred parameters. Combined with the bias due to self-normalized importance sampling, which causes particle filters to under-sample low-rate events, this results in a non-zero probability of inferring zero event rates, which are fixed points of any SEM procedure. In principle this can be avoided by using an appropriate prior on the rate parameters. To implement this we use Variational Bayes to estimate an approximate joint posterior distribution over parameters and latent variables, partially accounting for the uncertainty in the inferred parameters, as well as providing a way to explicitly include a prior. In this way zero-rate estimates are avoided, and more generally we show that this approach further reduces the bias in parameter estimates.
Applying these ideas to the coalescent-with-recombination (CwR) model, we find that the combination of lookahead filter and Variational Bayes inference enables us to analyze four diploid human genomes simultaneously, and infer demographic parameters across epochs spanning more than 3 orders of magnitude, without making model approximations beyond passing to a continuous-locus model.
The remainder of the paper is structured as follows. We first introduce the particle filter, generalise it to continuous-time and -space Markov jump processes, describe how to choose waypoints, introduce the lookahead filter, and describe the Variational Bayes procedure for parameter inference. In the results section we first introduce the continuous-locus CwR process, then discuss the lookahead likelihood, choice of waypoints and parameter inference for this model, before applying the model to simulated data, and finally show the results of analyzing sets of four diploid genomes of individuals from three human populations. A discussion concludes the paper.

The sequential coalescent with recombination model
The coalescent-with-recombination (CwR) process, and the graph structures that are the realisations of the process, was first described by Hudson [42], and was given an elegant mathematical description by Griffiths [43], who named the resulting structure the Ancestral Recombination Graph (ARG). Like the coalescent process, these models run backwards in evolutionary time and consider the entire sequence at once, making it difficult to use them for inference on whole genomes. The first model of the CwR process that evolves sequentially rather than in the evolutionary time direction was introduced by Wiuf and Hein [44], opening up the possibility of inference over very long sequences. Like Griffiths' process, the Wiuf-Hein algorithm operates on an ARG-like graph, but it is more efficient as it does not include many of the non-observable recombination events included in Griffiths' process. The Sequential Coalescent with Recombination Model (SCRM) [45] further improved efficiency by modifying Wiuf and Hein's algorithm to operate on a local genealogy rather than an ARG-like structure. Besides the "local" tree over the observed samples, this genealogy includes branches to noncontemporaneous tips that correspond to recombination events encountered earlier in the sequence. Recombinations on these "non-local" branches can be postponed until they affect observed sequences, and can sometimes be ignored altogether, leading to further efficiency gains while the resulting sample still follows the exact CwR process. An even more efficient but approximate algorithm is obtained by culling some non-local branches. In the extreme case of culling all non-local branches the SCRM approximation is equivalent to the SMC' model [46,47]. 
With a suitable definition of "current state" (i.e., the local tree including all non-local branches) these are all Markov processes, and can all be used in the Markov jump particle filter; here we use the SCRM model with tunable accuracy as implemented in [45].
The state space T of the Markov process is the set of all possible genealogies at a given locus. The probability measure of a complete realisation x can be written as

    π_x(x) = exp( −∫_1^L ρ(s) B(x_s) ds ) ∏_{j=1}^{|x|} ρ(s_j) C(τ_j) exp( −∫_{ν_j}^{τ_j} b_u(x_{s_j}) C(u) du ).    (1)

Here x is the sequence of genealogies along the genome; |x| is the number of recombinations that occurred on x; b_u(x_s) is the number of branches in the genealogy at locus s at evolutionary time u; B(x_s) = ∫_0^{root(x_s)} b_u(x_s) du is the total branch length of x_s; ρ(s) is the recombination rate per nucleotide and per generation at locus s, so that ρ(s)B(x_s) is the exit rate of the Markov process in state x_s; (s_j, ν_j) is the locus and recombination time of the jth recombination event; τ_j > ν_j is the coalescence time of the corresponding coalescence event; and C(u) = 1/2N_e(u) is the coalescence rate in generation u. See Appendix ("The sequential coalescent with recombination process") for more details. The distribution π_x(x) has a density with respect to the Lebesgue measure (ds)^{|x|} (du)^{2|x|}, because each of the |x| recombination events is associated with a sequence locus, a recombination time, and a coalescent time.
Mutations follow a Poisson process whose rate at s depends on the state x_s via μ(s)B(x_s), where μ(s) is the mutation rate at s per nucleotide and per generation. Mutations are not observed directly, but their descendants are; a complete observation is represented by a set y = {(s_j, A_j)}_{j=1,...,|y|} ∈ Y, where s_j ∈ [1, L) is the locus of mutation j, and A_j ∈ {0, 1}^S are the wildtype (0) and alternative (1) alleles observed in the S samples. The conditional probability measure of the observations y given a realisation x is

    π(y|x) = exp( −∫_1^L μ(s) B(x_s) ds ) ∏_{j=1}^{|y|} P(A_j | x_{s_j}, μ(s_j)),    (2)

where P(A|x_s, μ) is the probability of observing the allelic pattern A given a genealogy x_s and a mutation rate μ per nucleotide and per generation; this probability is calculated using Felsenstein's peeling algorithm [48]. Note that B(x_s)μ(s) = ∑_{A ≠ (0,...,0)} P(A|x_s, μ(s)).
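As an illustration, Felsenstein's peeling (pruning) algorithm mentioned above can be sketched for a two-allele model. The nested-tuple tree encoding, the symmetric mutation model and the wildtype-root assumption below are illustrative simplifications for this sketch, not the paper's implementation:

```python
import math

def peel(tree, mu, pattern):
    """Felsenstein's peeling algorithm on a small genealogy (sketch).

    `tree` is a nested tuple: a leaf is an int index into `pattern`;
    an internal node is (left, right, t_left, t_right) with branch
    lengths to its children.  A symmetric two-allele mutation model
    with rate `mu` is assumed, and the root is conditioned to carry
    the wildtype allele 0; both are simplifying assumptions.
    Returns P(pattern | tree, mu)."""
    def change(t):  # P(allele flips along a branch of length t)
        return 0.5 * (1.0 - math.exp(-2.0 * mu * t))

    def partial(node):
        # Returns [P(data below node | node carries 0),
        #          P(data below node | node carries 1)]
        if isinstance(node, int):  # leaf: observed allele
            return [1.0 if pattern[node] == a else 0.0 for a in (0, 1)]
        left, right, tl, tr = node
        pl, pr = partial(left), partial(right)
        out = []
        for a in (0, 1):
            pc_l = sum((change(tl) if a != b else 1 - change(tl)) * pl[b] for b in (0, 1))
            pc_r = sum((change(tr) if a != b else 1 - change(tr)) * pr[b] for b in (0, 1))
            out.append(pc_l * pc_r)
        return out

    return partial(tree)[0]  # condition on wildtype root

# A four-tip genealogy ((0,1),(2,3)) with unit branch lengths:
tree = ((0, 1, 1.0, 1.0), (2, 3, 1.0, 1.0), 1.0, 1.0)
p_doubleton = peel(tree, mu=0.01, pattern=[1, 1, 0, 0])  # a "cherry" doubleton
p_singleton = peel(tree, mu=0.01, pattern=[1, 0, 0, 0])
```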

Particle filters
Particle filter methods, also known as Sequential Monte Carlo (SMC) [22], generate samples from complex probability distributions with high-dimensional latent variables. An SMC method uses importance sampling (IS) to approximate a target distribution using weighted random samples (particles) drawn from a tractable distribution. We briefly review the discrete-time case. Suppose that particles {(x^(i), w^(i))}_{i=1,...,N} approximate a distribution with density p(x), such that

    ( ∑_i w^(i) f(x^(i)) ) / ( ∑_i w^(i) ) ≈ ∫ f(x) p(x) dx    (3)

for suitable test functions f. Setting w̃^(i) = w^(i) q(x^(i))/p(x^(i)), the same sum with weights w̃^(i) shows that {(x^(i), w̃^(i))} approximate q(x)dx. The normalisation ensures that any constant factor in w^(i) drops out, so that it is sufficient to know the ratio q(x)/p(x) up to a constant. A particle filter builds the desired distribution sequentially, making it suited to hidden Markov models, for which the joint distribution of latent variables X and observations Y has the form

    P(X_{1..s} = x_{1..s}) = p(x_1) ∏_{t=2}^{s} p(x_t | x_{t−1}),    (4)
    P(Y_{1..s} = y_{1..s} | X_{1..s} = x_{1..s}) = ∏_{t=1}^{s} g(y_t | x_t).    (5)

Here 1..s denotes the set {1, 2, ..., s}, and x = x_{1..s} = (x_1, x_2, ..., x_s) and y = y_{1..s} are vectors. Let {(x^(i), w^(i))} be particles approximating the target distribution P(X_{1..s} = x_{1..s} | Y_{1..s} = y_{1..s}), which for brevity we write as P(x_{1..s} | y_{1..s}). If x̃^(i) is the vector obtained by extending x^(i) with a sample from p(x_{s+1} | x^(i)_s), then from (4) and (5) it follows that {(x̃^(i), w^(i))} approximate P(x_{1..s+1} | y_{1..s}) ∝ P(x_{1..s+1}, y_{1..s}). Now, P(x_{1..s+1} | y_{1..s+1}) ∝ P(x_{1..s+1}, y_{1..s+1}) = P(x_{1..s+1}, y_{1..s}) g(y_{s+1} | x_{s+1}), so that using IS and setting

    w̃^(i) = w^(i) g(y_{s+1} | x̃^(i)_{s+1})    (6)

we obtain particles {(x̃^(i), w̃^(i))} that approximate P(x_{1..s+1} | y_{1..s+1}). This shows how to sequentially construct particles that approximate the target distribution P(x_{1..L} | y_{1..L}). Instead of sampling from p(x_{s+1} | x^(i)_s), any proposal distribution q(x_{s+1} | x^(i)_s, y_{1..L}) (subject to conditions) can be used, which is advantageous if q is easier to sample from, is closer to the target distribution, or has heavier tails than p. Again, IS accounts for the change in sampling distribution, resulting in

    w̃^(i) = w^(i) g(y_{s+1} | x̃^(i)_{s+1}) p(x̃^(i)_{s+1} | x^(i)_s) / q(x̃^(i)_{s+1} | x^(i)_s, y_{1..L}).    (7)
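The sequential weight update described above can be illustrated with a minimal bootstrap filter. The Gaussian random-walk model with Gaussian observations is a hypothetical toy model chosen for this sketch; with the prior as proposal (q = p), the transition densities cancel and the incremental weight reduces to the observation likelihood:

```python
import math, random

random.seed(1)

def g(y, x, obs_sd=1.0):
    """Observation density g(y | x): Gaussian, an illustrative choice."""
    return math.exp(-0.5 * ((y - x) / obs_sd) ** 2) / (obs_sd * math.sqrt(2 * math.pi))

def bootstrap_step(particles, weights, y, step_sd=0.5):
    """One sequential-importance-sampling step with the prior as proposal
    (q = p), so the incremental weight is just the likelihood g(y | x')."""
    new_p, new_w = [], []
    for x, w in zip(particles, weights):
        x_new = x + random.gauss(0.0, step_sd)  # sample x_{s+1} ~ p(.|x_s)
        new_p.append(x_new)
        new_w.append(w * g(y, x_new))           # w~ = w * g(y_{s+1}|x_{s+1})
    return new_p, new_w

# Track a short synthetic observation sequence.
particles = [random.gauss(0.0, 1.0) for _ in range(500)]
weights = [1.0 / 500] * 500
for y in [0.2, 0.1, -0.3, 0.0]:
    particles, weights = bootstrap_step(particles, weights, y)

total = sum(weights)
posterior_mean = sum(w * x for w, x in zip(weights, particles)) / total
```

With observations near zero and a standard-normal prior, the self-normalized weighted mean lands close to zero, as expected.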

PLOS ONE
For now we will choose q to be independent of y. Because samples from q do not follow the desired target P(x|y), the fraction of particles close to the target's mode diminishes exponentially at each iteration until (3) fails altogether. To address this, we occasionally draw samples from the approximating distribution itself, assigning each resampled particle weight 1/N (interestingly, if we interpret fitness as proportional to the likelihood g(y_{s+1} | x^(i)_{s+1}), this is the same process that is used in the Wright-Fisher model with selection to describe how fitness differences shape an evolving constant-size population [49]). Doing this tends to remove particles that have drifted from the mode of the target and have low weight, and duplicates particles with large weights, while (3) remains valid. Although resampling substantially decreases the future variance of (3), it increases the variance at the current iteration. To avoid increasing this variance unnecessarily, resampling is performed only when the estimated sample size, defined as ESS = (∑_i w^(i))² / ∑_i (w^(i))², drops below a threshold, e.g. N/2. In addition, we use systematic resampling to minimize the variance that is introduced when resampling is performed [50]. This leads to Algorithm 1 [19].
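The ESS criterion and systematic resampling can be sketched as follows (toy Python, not the SMCSMC implementation):

```python
import random

def ess(weights):
    """Estimated sample size (sum w)^2 / sum w^2."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

def systematic_resample(particles, weights):
    """Systematic resampling: one uniform offset gives N evenly spaced
    pointers into the cumulative weight distribution, minimizing the
    variance introduced by resampling."""
    n = len(particles)
    total = sum(weights)
    u = random.uniform(0.0, total / n)
    cumulative, out, i = 0.0, [], 0
    for x, w in zip(particles, weights):
        cumulative += w
        while i < n and u + i * total / n < cumulative:
            out.append(x)
            i += 1
    return out, [1.0 / n] * n  # resampled particles carry weight 1/N

random.seed(0)
particles = list(range(10))
weights = [0.001] * 9 + [1.0]          # one dominant particle
if ess(weights) < len(particles) / 2:  # resample only when the ESS is low
    particles, weights = systematic_resample(particles, weights)
```

Here the ESS of the initial weights is close to 1, so resampling fires and the dominant particle is duplicated.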
Note that the algorithm can be seen as a recipe to transform a sample from P(X) to a sample from P(X)P(Y|X)/P(Y) = P(X|Y), that is, an application of Bayes' theorem. Following this interpretation we will refer to P(X) as the prior distribution, and P(X|Y) as the posterior.
The algorithm generates an approximation to Pðx 1��s jy 1��s Þ rather than Pðx s jy 1��s Þ, but we follow [22] in calling it a particle filter algorithm instead of a smoothing algorithm (although our use of fixed-lag distributions for parameter estimation is a partial smoothing operation).
The marginal likelihood can be estimated (although with high variance, see [51]) by setting the weights to N^{−1} ∑_i w^(i)_s rather than N^{−1} when particles are resampled. This makes the weights asymptotically normalized, so that (3) holds without the explicit normalisation, and ∑_i w^(i)_L approximates the marginal likelihood.

Continuous-time and -space Markov jump processes
For the hidden process we now consider Markov jump processes, which have as realisations piecewise constant functions x : [1, L) → T, where T is the state space of the Markov process.
Recall that in the model we consider, T is the space of rooted genealogical trees with branch lengths. Let (X, F_x, π_x) be a probability space, where X = T^{[1,L)} is the space of possible realisations of the hidden stochastic process X = {X_s}_{s ∈ [1,L)}, F_x ⊆ P(X) is the σ-algebra of events, and π_x(X) is the probability measure on X induced by the stochastic process X. See the Appendix ("Conditional distributions and the Markov property") for some remarks on how to define a Markov model when the phase space T is uncountable. The complete model is defined by specifying the observation process. We consider models where observations Y are generated by a Poisson process whose intensity at time (i.e. locus) s depends on X_s [a Cox process, see e.g. 52]. The space of observations Y consists of finite subsets of [1, L) × M, where M is a discrete set of potential events, each of which may occur at some s ∈ [1, L). For a full observation y = ((s_1, m_1), ..., (s_k, m_k)) ∈ Y we write |y| ≔ k for the number of events in y. Writing λ(y) for the Lebesgue measure (ds)^{|y|}, the emission distribution π(Y|X = x) has a density r(y|x) relative to λ(y). For Cox processes this density has the form

    r(y|x) = exp( −∫_1^L r(x_s) ds ) ∏_{j=1}^{|y|} r(s_j, m_j | x_{s_j}),    (8)

where r(s, m|x_s) is the rate at which event m occurs at time s conditional on X_s = x_s, and

    r(x_s) = ∑_{m ∈ M} r(s, m | x_s)    (9)

is the intensity of the emission Poisson process at s conditional on X_s = x_s. The probability space for the joint process is (X × Y, F, π), and the posterior distribution of interest is π conditioned on an observation y ∈ Y, written as π(X|Y = y).
The absence of events in an interval s ∈ [a, b) is also informative about the latent variable through the exponential factor in (8). In practice however, not all intervals may have been observed, so that events may or may not have occurred in these intervals. Assuming that the "observation process" is independent of the Markov jump process X, such unobserved intervals can simply be left out of the integral in (8).
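The Cox-process density, with the exponential factor restricted to observed intervals, can be sketched as follows. The piecewise-constant intensity and the crude midpoint integration are illustrative assumptions of this sketch:

```python
import math

def cox_log_density(event_rates, total_intensity, observed):
    """Log density of a Cox-process observation (sketch).

    `event_rates`        -- the rate r(s_j, m_j | x_{s_j}) at each observed event;
    `total_intensity(s)` -- the total rate r(x_s) at locus s, assumed to vary
                            slowly so midpoint integration is adequate;
    `observed`           -- (a, b) intervals that were actually observed: the
                            exponential factor integrates only over these.
    Names and signature are illustrative, not from the paper's code."""
    log_dens = sum(math.log(r) for r in event_rates)
    for a, b in observed:
        n = 1000                       # midpoint rule, adequate for a sketch
        h = (b - a) / n
        log_dens -= sum(total_intensity(a + (k + 0.5) * h) for k in range(n)) * h
    return log_dens

# Constant intensity 0.01 on an observed interval [0, 100) with two events,
# each of rate 0.01; analytically the log density is 2*log(0.01) - 1.0.
ld = cox_log_density([0.01, 0.01], lambda s: 0.01, [(0.0, 100.0)])
```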
Some more notation is needed to describe the Markov jump process version of Algorithm 1. As above, π_x denotes the prior distribution of the latent variable X, and ξ_x denotes the proposal distribution, both Markov processes on X, playing the role of p(x) and q(x) in the discrete case. We write a:b for the interval [a, b) ⊂ ℝ, and α_{a:b} for the restriction of a measure or function α to a:b; similarly α_{[a,b)}. The particle filter algorithm uses the notation (dα/dβ)(x) for distributions α and β to denote their Radon-Nikodym derivative: the ratio of their density functions with respect to a common reference measure, evaluated at x. To simplify notation we write the Radon-Nikodym derivative of two conditional distributions α(X|G) and β(X|G) at x as (dα/dβ)(x|G), and we also do not explicitly restrict distributions to their appropriate intervals when this is clear from the context, so that we write for example (dπ/dλ)(y_{s_j:s_{j+1}} | X_{s_j:s_{j+1}} = x) instead of (dπ_{s_j:s_{j+1}}(Y | X_{s_j:s_{j+1}} = x) / dλ_{s_j:s_{j+1}})(y_{s_j:s_{j+1}}). With this notation we can formulate Algorithm 2.
Algorithm 2 Particle filter for Markov jump processes

The choice of waypoints s_1, ..., s_K is discussed below; in particular they need not be the same as the event loci s̃_1, ..., s̃_{|y|} of the observation y. Note that there is no initialization step; instead, initially x^(i)_{1:1} = ∅, and the first sample will be drawn from ξ conditioned on an empty set, i.e. the unconditional distribution. The loop invariant holds when j = 0 since 1:s_0 = ∅. As with Algorithm 1 it is possible to estimate the likelihood density π_θ(y_{1:L}) by replacing the factors N^{−1} with N^{−1} ∑_i w^(i)_{s_j}; then the likelihood density w.r.t. λ(dy) = (ds)^{|y|} is approximated by ∑_i w^(i)_L. Note that by the nature of Markov jump processes, particles that start with identical latent variables have a positive probability of remaining identical after a finite time. Combined with resampling, this causes a considerable number of particles to have one or more identical siblings. For computational efficiency we represent such particles once, and keep track of their multiplicity k. When evolving a particle with multiplicity k > 1, we increase the exit rate k-fold, and when an event occurs one particle is spawned off while the remaining k − 1 continue unchanged.
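The multiplicity trick in the last paragraph can be sketched as follows, assuming exponential waiting distances between exits; the function and variable names are illustrative, not from the SMCSMC code:

```python
import random

random.seed(2)

def evolve_with_multiplicity(rate, k, span):
    """Evolve k identical particles jointly over `span` sequence units.

    A blob of k identical particles exits (here: experiences its next
    recombination) at total rate k * rate; at each exit one particle is
    spawned off while the remaining k - 1 continue unchanged.  Returns
    the number of distinct particles at the end.  Illustrative sketch."""
    t, distinct = 0.0, 1          # one representative with multiplicity k
    while k > 1:
        t += random.expovariate(k * rate)  # k-fold exit rate
        if t >= span:
            break
        distinct += 1                      # spawn one particle off
        k -= 1
    return distinct

n = evolve_with_multiplicity(rate=0.01, k=8, span=100.0)
```

Only one exponential draw per exit is needed, instead of simulating the k particles separately.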

Using lookahead to improve the particle filter
At the jth iteration, Algorithm 2 uses data up to waypoint s_j to build particles approximating π(X_{1:s_j} | Y_{1:s_j} = y_{1:s_j}). This is reasonable as π(X_{1:s_j} | y_{1:s_j}) is independent of data beyond s_j. However, not all particles are equally important for approximating subsequent posteriors, which suggests emphasising particles that will be relevant in the future at the expense of those relevant only to π(X_{1:s_j} | y_{1:s_j}). This echoes the justification of resampling: although resampling increases the variance of the approximation to the current partial posterior, it decreases the variance at subsequent iterations by increasing the number of particles that are likely to contribute to future distributions. For discrete-time models p(X_{1:n} | y_{1:n}), the Auxiliary Particle Filter (APF) [40] implements this intuition by targeting a resampling distribution [53], which includes a "lookahead" factor p(y_{i+1} | x_i) approximating the probability of observing data y_{i+1} given the current state x_i. Importance sampling is used to keep track of the desired distribution p(X_{1:i} | y_{1:i}).
In the continuous-time context it is natural to look an arbitrary distance ahead. Similar to APF, the lookahead distribution can be conditioned on the current state only, and must be an approximation of the true distribution. It should be heavy-tailed with respect to the true distribution to ensure that the variance of the estimator remains finite [22], which implies that the distribution should not depend on data too far beyond s; what is "too far" depends on how well the lookahead distribution approximates the true distribution.
The lookahead distribution is only evaluated on a fixed observation y, and is used to quantify the plausibility of a current state x^(i)_s, rather than to define a distribution over y. For this reason we call it a lookahead likelihood. In fact, for correctness of the algorithm it is not necessary that this likelihood derives from a probability distribution. We define the lookahead likelihood as a family of functions h_s(y_{s:L} | x_s) : Y_{s:L} × T → ℝ, and an associated family of unnormalized distributions π̃_s(x_{1:s}, y_{1:L}) = π_{1:s}(x_{1:s}, y_{1:s}) h_s(y_{s:L} | x_s) λ_{s:L}(y_{s:L}) on X_{1:s} × Y. The functions h_s can be chosen arbitrarily, except that h_s(·, x_s)λ_{s:L} must be absolutely continuous w.r.t. π_{s:L}(·|X_s = x_s) to ensure that importance sampling is justified. The lookahead Algorithm 3 keeps track of two sets of weights, which together with a single set of samples form two sets of particles that approximate the resampling and target distributions.
Algorithm 3 Markov-jump particle filter with lookahead

For each i = 1, ..., N, the resampling weights are updated as

    w̃^(i)_{s_{j+1}} = w^(i)_{s_j} (dπ_x/dξ_x)(x^(i)_{s_j:s_{j+1}} | X_{s_j} = x^(i)_{s_j}) (dπ̃_{s_{j+1}}/dπ̃_{s_j})(y_{s_j:L} | X_{s_j:s_{j+1}} = x^(i)_{s_j:s_{j+1}})

(see Appendix, "Proof of Algorithm 3"). To implement the lookahead particle filter we need a tractable approximate likelihood of future data given a current genealogy. To do this we simplify the full likelihood, and ignore all data except for a digest of singletons and doubletons that are informative of the topology and branch lengths near the tips of the genealogy: in particular, singletons are informative of terminal branch lengths, and doubletons identify the existence of nodes with precisely two descendants ("cherries"). This digest consists of the distance t_i to the nearest future singleton for each haploid sequence, and up to n/2 mutually consistent cherries c_k = (a_k, b_k), identified by their two descendants a_k, b_k, together with the loci s'_k ≤ s''_k where their first and last supporting doubletons were observed (Fig 1a). Under some simplifying assumptions we derive an approximation of the likelihood h_s({t_i}, {c_k, s'_k, s''_k} | x_s) of the current genealogy given these data; see Appendix ("A lookahead likelihood") for details.
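Computing the singleton/doubleton digest from a list of variants might look as follows; this sketch omits the mutual-consistency filtering of cherries described in the Appendix, and the data encoding is illustrative:

```python
def singleton_doubleton_digest(variants, n_haplotypes, start):
    """Digest of future data used by the lookahead likelihood (sketch).

    `variants` is a list of (locus, carriers) with `carriers` the set of
    haplotypes carrying the derived allele.  Returns, for each haplotype,
    the distance from `start` to its nearest future singleton (None if
    there is none), and for each observed cherry (pair of haplotypes
    supported by doubletons) the first and last supporting locus."""
    singleton_dist = [None] * n_haplotypes
    cherries = {}  # (a, b) -> [first_locus, last_locus]
    for locus, carriers in sorted(variants, key=lambda v: v[0]):
        if locus < start:
            continue
        if len(carriers) == 1:                  # singleton
            (h,) = tuple(carriers)
            if singleton_dist[h] is None:
                singleton_dist[h] = locus - start
        elif len(carriers) == 2:                # doubleton supporting a cherry
            key = tuple(sorted(carriers))
            if key in cherries:
                cherries[key][1] = locus
            else:
                cherries[key] = [locus, locus]
    return singleton_dist, cherries

# Hypothetical variants over 4 haplotypes, scanning forward from locus 100:
variants = [(120, {0}), (150, {2, 3}), (200, {0}), (340, {2, 3}), (410, {1})]
dists, cherries = singleton_doubleton_digest(variants, 4, start=100)
```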

Choosing waypoints
The choice of waypoints s_j can significantly impact the performance of the algorithm: choosing too few increases the variance of the approximation, and choosing too many slows down the algorithm without increasing its accuracy. Waypoints determine where the algorithm might perform a resampling step. A high density of waypoints is therefore always acceptable, but a low density may result in particle degeneracy. Choosing a waypoint at every event ensures that any weight variance induced at these sites is mitigated, but there is still the opportunity for weight variance to build up between events. If ξ_x = π_x, particle weights diverge only because different particles (x^(i)_{1:s}, w^(i)_s) experience a different total intensity r(x^(i)_s) of observed events. If ESS_0 is the current estimated sample size, then under some assumptions, along an interval of length L where no events occur we have

    ESS ≈ ESS_0 exp( −σ² L² )

(see Appendix, "Particle weight variance"), where σ² is the variance of the total event intensity r(x_s) (9) under the prior π_x(x). Therefore, if we choose waypoints at every event, adding additional waypoints so that they are never more than a distance 1/√(2σ²) apart, the ESS will not drop more than a factor √(1/e) ≈ 0.6 between waypoints, and particle degeneracy is avoided. To apply this to our situation, assume a panmictic population with constant diploid effective population size N_e. The variance of the total coalescent branch length in a sample of n individuals is (4N_e)² ∑_{i=1}^{n−1} i^{−2} [54]. The variance of the total mutation intensity σ² is obtained by multiplying this by μ², since the rate of mutations on the coalescent tree is μ times the total branch length. Rewriting this in terms of the heterozygosity θ = 4N_e μ, and approximating the sum with ∑_{i=1}^∞ i^{−2} = π²/6, gives σ² = θ²π²/6, and a minimum waypoint distance of 1/√(2σ²) = √3/(πθ) ≈ 1/(2θ).
Because the assumptions mentioned above are in practice only met approximately, this minimum waypoint density should be taken as a guide; breakdown of the assumptions can be monitored by tracking the ESS, increasing the density of waypoints if necessary.
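As a worked example of the waypoint-spacing rule, using the parameter values of the simulation study below (N_e = 10,000, μ = 2.5 × 10^−8):

```python
import math

# Minimum waypoint spacing for a panmictic constant-size population (sketch).
# sigma^2 = theta^2 * pi^2 / 6 is the variance of the total mutation intensity;
# waypoints at most 1/sqrt(2 sigma^2) apart keep the ESS from dropping by more
# than a factor sqrt(1/e) ~ 0.6 between waypoints.
N_e, mu = 10000, 2.5e-8              # values from the simulation study
theta = 4 * N_e * mu                 # heterozygosity, here 1e-3
sigma2 = theta ** 2 * math.pi ** 2 / 6
spacing = 1 / math.sqrt(2 * sigma2)  # ~ 1/(2 theta) nucleotides

# ESS drop over one maximal waypoint interval: ESS ~ ESS_0 * exp(-sigma^2 L^2)
drop = math.exp(-sigma2 * spacing ** 2)
```

For these values the spacing comes out at roughly 550 nucleotides, i.e. about 1/(2θ), with the ESS dropping by at most a factor e^{−1/2} between waypoints.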

Parameter inference
Parameters can be inferred by stochastic expectation maximization (SEM), which involves maximizing the expected log likelihood over the posterior distribution of the latent variable. The probability density for a Poisson process is (1/c!) θ^c e^{−qθ}, where c is the event count, and θ is the rate of events per unit of "opportunity" q, measured in units of time or space or some combination of them. The expected log likelihood c log θ − qθ (ignoring constants) is maximized for θ = c/q, where c and q are the expected event count and opportunity. We consider Markov jump processes X_s with parameters θ_i and distribution

    π(x|θ) = h(x) ∏_i θ_i^{|x|_i} e^{−θ_i Q_i(x)},

where |x|_i is the event count and Q_i(x) is the total opportunity for events of type i in realisation x; both can be random variables, and h(x) does not depend on θ. Similar to the Poisson case, the parameters maximizing the expected log likelihood are

    θ'_i = E[|x|_i] / E[Q_i(x)].

The expectations can be computed by using samples x ∼ π(x|y, θ) as approximated by Algorithm 3.
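The rate update E[count]/E[opportunity] under the weighted particle approximation can be sketched as follows; the particle encoding is illustrative:

```python
def sem_rate_update(particles):
    """EM update theta_i' = E[count_i] / E[opportunity_i], with expectations
    taken under the weighted particle approximation of the posterior.

    Each particle is (weight, counts, opportunities), where `counts` and
    `opportunities` are dicts keyed by event type.  This sketches only the
    estimator, not the full SEM loop."""
    total_w = sum(w for w, _, _ in particles)
    types = set().union(*(c.keys() for _, c, _ in particles))
    est = {}
    for i in types:
        e_count = sum(w * c.get(i, 0) for w, c, _ in particles) / total_w
        e_opp = sum(w * q.get(i, 0.0) for w, _, q in particles) / total_w
        est[i] = e_count / e_opp
    return est

# Two hypothetical particles carrying recombination counts and the total
# branch-length opportunity over which recombinations could have occurred:
particles = [
    (0.7, {"recomb": 12}, {"recomb": 1.2e6}),
    (0.3, {"recomb": 8},  {"recomb": 1.0e6}),
]
rates = sem_rate_update(particles)
```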
To evaluate the expectations above we do not use the complete set of events in the full realisations x, since resampling causes early parts of x to become degenerate due to "coalescences" of the particle's history of events along the sequence, which would lead to high variance of the estimates. Using only the most recent events is also problematic as these have not been shaped by many observations and mostly follow the prior π x (x|θ), resulting in highly biased estimates.
Smoothing techniques such as two-filter smoothing [55] cannot be used here since finite-time transition probabilities are intractable. For discrete-time models fixed-lag smoothing is often effective [39]. For our model the optimal lag depends on the epoch, as the age of a tree node strongly influences its correlation distance. For each epoch we determine the correlation distance empirically, and for the lag we use this distance multiplied by a factor α; we obtain good results with α = 1.
Particularly in cases where some event types are rare, Variational Bayes can improve on EM by iteratively estimating posterior distributions rather than point estimates of θ. A tractable algorithm is obtained if the joint posterior π(x, θ|y)dxdθ is approximated as a product of two independent distributions over x and θ, and an appropriate prior over θ is chosen. For the Poisson example above, combining a Γ(θ|α_0, β_0) prior with the likelihood θ^c e^{−qθ} results in a Γ(θ|α_0 + c, β_0 + q) posterior. Similarly, with this choice the Variational Bayes approximation results in an inferred posterior distribution of the form

    π(θ_i) = Γ(θ_i | α_0 + E[|x|_i], β_0 + E[Q_i(x)]),

where expectations are taken over x ∼ ∫ π(x|y, θ)π(θ)dθ, and π(θ) is the current posterior over θ. It would appear that obtaining samples x from this distribution is intractable. However, if π(θ) is a Gamma distribution, θ can be integrated out analytically in the likelihood π(x, y|θ)Γ(θ|α, β), resulting in an expression that is identical to the likelihood of the point estimate θ_i = α_i/β_i except for an additional scaling factor e^{ψ(α_i)}/α_i for each event of type i in x, where ψ is the digamma function. These scaling factors render the normalization constant of the likelihood intractable, but fortunately SMC algorithms only require densities to be defined up to normalization. As a result, Algorithm 3 can be used to generate samples from this distribution at no additional computational cost. See the Appendix ("Variational Bayes for Markov Jump processes") for more details. Explicitly, for model (1) the parameters θ' = (ρ'_EM, C'_EM) maximising E[log π_x(x|θ')], where the expectation is taken over the posterior x ∼ π(x|y, θ)dx as approximated by Algorithm 3, are

    ρ'_EM = E[|x|] / E[ ∫_1^L B(x_s) ds ],    (13)
    C'_EM = E[|x|] / E[ ∑_{j=1}^{|x|} ∫_{ν_j}^{τ_j} b_u(x_{s_j}) du ],    (14)

where θ = (ρ, C) is the vector of current parameter estimates. Note that C'_EM in (14) is constant in evolutionary time.
In practice we maximize (1) with respect to piecewise constant functions C′_EM(t), which yields C′_EM(t) = E[|x|_{ν,τ}]/E[B_{ν,τ}(x)] for t ∈ [ν, τ), where |x|_{ν,τ} denotes the number of coalescent events in x that occur in the epoch [ν, τ), and B_{ν,τ}(x) the coalescence opportunity in that epoch. Similarly, a Variational Bayes inference procedure uses ρ ∼ Γ(α_ρ + E[|x|_ρ], β_ρ + E[B_ρ(x)]) (16) and C(t) ∼ Γ(α_C + E[|x|_{ν,τ}], β_C + E[B_{ν,τ}(x)]) for t ∈ [ν, τ) (17), where expectations are taken over x ∼ ∫ π(x|y, θ)p(θ)dθ, where p(θ) is the posterior parameter distribution (16 and 17) of the previous iteration, and α_ρ, β_ρ, α_C, β_C parameterize the prior distributions ρ ∼ Γ(α_ρ, β_ρ) and C ∼ Γ(α_C, β_C).

Simulation study
We implemented the model and algorithm above in a Python/C++ program SMCSMC (Sequential Monte Carlo for the Sequentially Markovian Coalescent) and assessed it on simulated and real data.
To investigate the effect of the lookahead particle filter, we simulated four 50 megabase (Mb) diploid genomes under a constant population-size model (N_e = 10,000, μ = 2.5 × 10⁻⁸ and ρ = 10⁻⁸, both per generation and per site, generation time g = 30 years). We inferred population sizes N_e through evolutionary time, defined as the inverse of twice the instantaneous coalescent rate, as a piecewise constant function across 9 epochs (with boundaries at 400, 800, 1200, 2k, 4k, 8k, 20k, 40k and 60k generations) using the particle filters of Algorithms 2 and 3, as well as a recombination rate, which was taken to be constant through evolutionary time (and along the genome). Although the recombination rate can be inferred, we here focus on the accuracy of the inferred N_e through evolutionary time. Observations are often available as unphased genotypes, and we assessed both algorithms using phased and unphased data, using the same simulations for both. Experiments were run for 15 EM iterations and repeated 15 times (Fig 2a).
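For reference, the population-scaled parameters implied by this simulation setup are easily computed; this back-of-the-envelope sketch uses the θ = 4N_e μ and 4N_e r per-site scaling convention of ms/scrm-style coalescent simulators:

```python
# Scaled mutation and recombination parameters for the constant-size simulation.
N_e = 10_000          # diploid effective population size
mu = 2.5e-8           # mutation rate per site per generation
r = 1e-8              # recombination rate per site per generation
L = 50_000_000        # 50 Mb

theta_per_site = 4 * N_e * mu      # ~0.001
rho_per_site = 4 * N_e * r         # ~0.0004
theta_total = theta_per_site * L   # ~50,000 over the whole segment
rho_total = rho_per_site * L       # ~20,000
```

In other words, each simulated genome carries tens of thousands of mutations and recombinations, but individual genealogies are informed by only a handful of mutations each.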
On phased data (Fig 2a, top rows), N_e values inferred without lookahead show a strong positive bias in recent epochs, corresponding to a negative bias in the inferred coalescence rate. Increasing the number of particles reduces this bias somewhat. By contrast, the lookahead filter shows no discernible bias on these data, even with as few as 1,000 particles. On unphased data (Fig 2a, bottom rows), the default particle filter continues to work reasonably well; in fact the bias appears somewhat reduced compared to phased data analyses, presumably because integrating over the phase makes the likelihood surface smoother, reducing particle degeneracy. By contrast, the lookahead particle filter shows an increased bias on these data compared to the default implementation. This is presumably because of the reliance of the lookahead likelihood on the distance to the next singleton; this statistic is much less informative for unphased data, making the lookahead procedure less effective, and even counterproductive for early epochs.
We next investigated the impact of using Variational Bayes instead of stochastic EM, using the lookahead filter on phased data. We simulated four 2 gigabase (Gb) diploid genomes using human-like evolutionary parameters (μ = 1.25 × 10⁻⁸, ρ = 3.5 × 10⁻⁸, g = 29, N_e(0) = 14,312) under a "zigzag" model similar to that used in [16] and [18], and inferred N_e across 37 approximately exponentially spaced epochs; see Appendix ("Implementation Details"). Both approaches give accurate N_e inferences from 2,000 years up to 1 million years ago (Mya); other experiments indicate that population sizes can be inferred up to 10 Mya (but see Fig 3b). The upwards bias in the most recent epochs is reduced considerably by the Variational Bayes approach compared to SEM (Fig 2b), although some bias remains.

Inference on human subpopulations
We applied SMCSMC to three sets of four phased diploid samples, of Northern European (CEU), Han Chinese (CHB) and Yoruban (YRI) ancestry respectively, from the 1000 Genomes project. For comparison we also ran msmc [16] on the same data, and on subsets of 2 and 1 diploid samples. Inferences show good agreement where msmc has power (Fig 3). Since the inferences show some variation, particularly in more recent epochs, we simulated data under a demographic model closely resembling CEU and YRI ancestry as inferred by multiple methods (see Appendix, "Implementation Details"), and we inferred population sizes using SMCSMC and msmc as before. This confirmed the accuracy of SMCSMC inferences from about 5,000 to 5 million years ago, while inferences in more recent epochs show more variability. A representative comparison of run times is provided in Table 1.

Discussion
Motivated by the problem of recovering a population's demographic history from the genomes of a sample of its individuals [1], we have introduced a continuous-locus approximation of the CwR model, and developed a particle filter algorithm for continuous-time Markov jump processes with a continuous phase space, by evaluating the doubly-continuous process at a suitably chosen set of "waypoints", and applying a standard particle filter to the resulting discrete-time continuous-state process. It however proved very challenging to obtain reliable parameter inferences for our intended application using this approach. To overcome this challenge we have extended the standard particle filter algorithm in two ways. First, we have generalized the Auxiliary Particle Filter of Pitt and Shephard [40] from a discrete-time one-step-lookahead algorithm to a continuous-time unbounded-lookahead method. This helped to address a challenging feature of the CwR model, namely that recent demographic events induce "sticky" model states with very long forgetting times. With an appropriate lookahead likelihood function (and phased genotype data), we showed that the unbounded-lookahead algorithm mitigates the bias that is otherwise observed in the inferred parameters associated with these recent demographic events. Some bias however remained, particularly for very early epochs. We reduced this remaining bias by a Variational Bayes alternative to stochastic expectation maximization (SEM), which explicitly models part of the uncertainty in the inferred parameters, and avoids zero-rate estimates, which are fixed points of the SEM procedure. The combination of a continuous-time particle filter, the unbounded-lookahead method, and VB inference allowed us to infer demographic parameters from up to four diploid genomes across many epochs, without making model approximations beyond passing to the continuous-locus limit.
On three sets of four diploid genomes, from individuals of central European, Han Chinese and Yoruban (Nigeria) ancestry respectively, we obtain inferences of effective population size over epochs ranging from 5,000 years to 5 million years ago. These inferences agree well with those made with other methods [14][15][16][17][18], and show higher precision across a wider range of epochs than was previously achievable by a single method. Despite the improvements from the unbounded-lookahead particle filter and the Variational Bayes inference procedure, the proposed method still struggles in very recent epochs (more recent than a few thousand years ago), and haplotype-based methods [e.g., 12] remain more suitable in this regime. In addition, methods focusing on recent demography benefit from the larger number of recent evolutionary events present in larger samples of individuals, and the proposed model will not scale well to such data, unless model approximations such as those proposed in [18] are used.
A key advantage of particle filters is that they are fundamentally simulation-based. This allowed us to perform inference under the full CwR model without having to resort to model approximations, such as requiring coalescences to occur at certain evolutionary times only, that characterize most other approaches. The same approach will make it possible to analyze complex demographic models, as long as forward simulation (along the sequential variable) is tractable. The proposed particle filter is based on the sequential coalescent simulator SCRM [45], which already implements complex models of demography that include migration, population splits and mergers, and admixture events. Although not the focus of this paper, it should therefore be straightforward to infer the associated model parameters, including directional migration rates.

Table 1. Runtimes (total CPU time, hours) for analyzing one or two diploid human genomes using msmc (40 EM iterations) and SMCSMC (15 Variational Bayes iterations). The table lists means ± one standard deviation across 10 independent runs in a high-performance compute environment. Note that due to parallel execution of SMCSMC (146 genomic chunks) and msmc (8 cores), wall-clock time was considerably less than the total CPU time.

PLOS ONE

In addition, several aspects of the standard CwR model are known to be unrealistic. For instance, gene conversions and doublet mutations are common [56,57], and background selection profoundly shapes the heterozygosity in the human genome [58]. These features are absent from current models aimed at inferring demography, but impact patterns of heterozygosity and may well bias inferences of demography if not included in the model. As long as it is possible to include such features in a coalescent simulator, a particle filter can model their effects, reducing the biases otherwise expected in other parameters due to model misspecification. Because a particle filter produces an estimate of the likelihood, any improved model fit resulting from adding any of these features can in principle be quantified, provided these likelihoods can be estimated with sufficiently small variance. However, even improved models will capture only a fraction of the relevant features of a population's evolution, and the inferred effective population sizes will continue to have a complex relationship with the census population size due to population substructure, variation in family size, and many other aspects [59].
A further advantage of a particle filter is that it provides a sample from the posterior distribution of ancestral recombination graphs (ARGs). Such explicit samples simplify the estimation of the age of mutations and recombinations, and the explicit identification of sequence tracts with particular evolutionary histories, for instance tracts arising from admixture by a secondary population. In contrast to MCMC-based approaches [13], a particle filter can provide only one self-consistent sample of an ARG per run. However, for marginal statistics such as the expected age of a mutation or the expected number of recombinations in a sequence segment, a particle filter can provide weighted samples from the posterior in a single run.
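For such marginal statistics, the posterior expectation is simply the weight-normalized average over particles; a generic sketch with made-up particle values:

```python
def weighted_mean(values, weights):
    """Posterior expectation of a marginal statistic (e.g. a mutation's age)
    from a single run's weighted particle sample."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

ages = [1200.0, 900.0, 1500.0]   # age of a mutation in each particle's ARG (illustrative)
weights = [0.2, 0.5, 0.3]        # particle weights
expected_age = weighted_mean(ages, weights)
```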
The algorithm presented here scales in practice to about 4 diploid genomes, but requires increasingly large numbers of particles as larger numbers of genomes are analyzed jointly. This is because the space of possible tree topologies increases exponentially with the number of genomes observed, while the number of informative mutations grows much more slowly, resulting in increasing uncertainty in the topology given observed mutations. This uncertainty is further compounded by uncertainty in branch lengths. Nevertheless, the many effectively independent genealogies underlying even a single genome provide considerable information about past demographic events [14], and a joint analysis of even modest numbers of genomes under demographic models involving migration and admixture events enables more complex demographic scenarios to be investigated. Our results show that particle filters are a viable approach to demographic inference from whole-genome sequences, and the ability to handle complex models without having to resort to approximations opens possibilities for further model improvements, hopefully leading to more insight into our species' recent demographic history.

Conditional distributions and the Markov property
Here we outline how to define a conditional distribution π(·|G) given a distribution π on X and a conditioning subset G ⊂ X of measure 0. Suppose {G_τ} is a family of subsets of X such that ∪_τ G_τ = X. A particular subset G_τ for a fixed τ plays the role of the conditioning event B in the standard definition P(A|B) = P(A ∩ B)/P(B). It can be shown that, under some conditions, there exists an essentially unique family of measures π^{G_τ} and a measure μ so that π^{G_τ} is concentrated on G_τ, π^{G_τ}(X) = 1 for all τ, and E_π[f] = ∫∫ f(x) π^{G_τ}(dx) μ(dτ) for well-behaved functions f [60], making it possible to define the conditional expectation as E_π[f|G] = E_{π^G}[f] = ∫ f(x) π^G(dx). Using this, the Markov property of π can be expressed in terms of conditional expectations: E_π[f(X_t) | X_{s_1}, ..., X_{s_k}] = E_π[f(X_t) | X_{s_k}] for loci s_1 < s_2 < ... < s_k < t and any well-behaved function f.

Proof of Algorithm 3
The algorithm is proved by induction on j. For j = 0 the loop invariant holds, while for j = K it implies the output condition. Suppose the loop invariant is true for some j. If ESS < N/2, assume w.l.o.g. that the v^(i) = v^(i)_{s_j} are normalized, let i_k be the index of the kth new particle, let v̂^(k) = N⁻¹ and ŵ^(k) = N⁻¹ w^(i_k)/v^(i_k) be its weights, and write π_j = π(X_{1:s_j} | Y_{1:s_j} = y_{1:s_j}) and π̂_j = π̂_{s_j}(X_{1:s_j} | Y_{1:s_j} = y_{1:s_j}); then, since each of the N draws selects index i with probability v^(i), E[Σ_k ŵ^(k) f(x^(i_k))] = N Σ_i v^(i) · N⁻¹ (w^(i)/v^(i)) f(x^(i)) = Σ_i w^(i) f(x^(i)), and similarly for the v̂-weights, so that the loop invariant continues to hold after the optional resampling step.
After sampling x^(i)_{s_j:s_{j+1}} ∼ χ_x(X_{s_j:s_{j+1}} | X_{s_j} = x^(i)_{s_j}), the particles {(x^(i)_{1:s_{j+1}}, w^(i)_{s_j})} approximate π(X_{1:s_j} | Y_{1:s_j} = y_{1:s_j}) χ_x(X_{s_j:s_{j+1}} | X_{s_j}). To make this distribution absolutely continuous w.r.t. π(X_{1:s_{j+1}}, Y_{s_j:s_{j+1}} | Y_{1:s_j}), multiply it with the constant measure λ_{s_j:s_{j+1}}(y_{s_j:s_{j+1}}); any measure will do as long as it has a density w.r.t. λ_{s_j:s_{j+1}} and is independent of X. Taking the Radon-Nikodym derivative of these two distributions gives

dπ(X_{1:s_{j+1}}, Y_{s_j:s_{j+1}} | Y_{1:s_j}) / d[π(X_{1:s_j} | Y_{1:s_j}) χ_x(X_{s_j:s_{j+1}} | X_{s_j}) λ_{s_j:s_{j+1}}(Y_{s_j:s_{j+1}})] (x_{1:s_{j+1}}, y_{s_j:s_{j+1}}) = (dπ_x/dχ_x)(x_{s_j:s_{j+1}} | X_{s_j} = x_{s_j}) · (dπ/dλ)(y_{s_j:s_{j+1}} | X_{s_j:s_{j+1}} = x_{s_j:s_{j+1}}).

This shows that {(w^(i)_{s_{j+1}}, x^(i)_{1:s_{j+1}})} form particles approximating π(X_{1:s_{j+1}}, y_{s_j:s_{j+1}} | y_{1:s_j}), and since π(x_{1:s_{j+1}}, y_{s_j:s_{j+1}} | Y_{1:s_j} = y_{1:s_j}) ∝ π(x_{1:s_{j+1}} | Y_{1:s_{j+1}} = y_{1:s_{j+1}}) λ_{s_j:s_{j+1}}(y_{s_j:s_{j+1}}), they also approximate π(x_{1:s_{j+1}} | Y_{1:s_{j+1}} = y_{1:s_{j+1}}). The argument showing that (v^(i)_{s_{j+1}}, x^(i)_{1:s_{j+1}}) approximate π̂_{s_j}(x_{1:s_{j+1}} | Y = y_{1:L}) is analogous. This proves the loop invariant for j + 1, and the algorithm.
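The key identity in the resampling step — that the weights ŵ^(k) = N⁻¹ w^(i_k)/v^(i_k) leave the weighted empirical distribution unchanged in expectation — can be checked directly. This is a generic illustration with made-up particles, not the paper's code:

```python
def expected_after_resampling(weights, probs, values):
    """Exact expectation of sum_k w_hat^(k) f(x^(i_k)) over multinomial
    resampling of N indices drawn with probabilities `probs`, where
    w_hat^(k) = N^{-1} * w^(i_k) / v^(i_k)."""
    n = len(weights)
    # Each of the N draws picks index i with probability probs[i] and then
    # contributes N^{-1} * weights[i]/probs[i] * values[i]; summing over draws:
    return sum(n * probs[i] * (1.0 / n) * (weights[i] / probs[i]) * values[i]
               for i in range(n))

w = [0.1, 0.5, 0.4]   # current particle weights
v = [0.2, 0.3, 0.5]   # normalized resampling (auxiliary) probabilities
f = [2.0, -1.0, 3.0]  # an arbitrary test function evaluated at the particles
before = sum(wi * fi for wi, fi in zip(w, f))
after = expected_after_resampling(w, v, f)
```

The factors v^(i) cancel exactly, so the expectation is preserved for any choice of resampling probabilities, which is what makes the auxiliary weights v a free design choice.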

Particle weight variance
To derive a criterion on the waypoints that limits the effect of weight variance build-up, let R(s) = f(X_s) be the stochastic variable that measures the instantaneous rate of occurrence of emission events for a particular (random) particle X, and let W(s) = W_0 exp(−∫_0^s R(u) du) be that particle's position-dependent weight; the dependence of W on X is not written explicitly. Note that the expression for W(s) is valid as long as no events have occurred in the interval [0, L). We assume that R(s) is homogeneous, that it can be approximated by a Gaussian process, that particles are drawn from the equilibrium distribution, and that W_0 and R(s) are independent. Write ⟨V(X)⟩ ≔ ∫ V(X) dπ(X) for the expectation of V over π(X). Writing R(s) = μ + R̃(s), where μ = ⟨R(s)⟩ is the mean event rate (which is independent of s by assumption), then

⟨W(L)⟩ = ⟨W_0⟩ e^{−μL} ⟨exp(−∫_0^L R̃(u) du)⟩ = ⟨W_0⟩ e^{−μL + C/2},  ⟨W(L)²⟩ = ⟨W_0²⟩ e^{−2μL + 2C},

where in the second equality we used the formula for higher moments of a Gaussian distribution, K is the covariance function of the Gaussian process R̃(s), and C is the integral ∫_{s_1,s_2=0}^L K(s_1, s_2) ds_1 ds_2. Now define σ² ≔ K(s, s) and assume that the covariance function satisfies 0 ≤ K(s_1, s_2) ≤ σ²; then 0 ≤ C ≤ σ²L² and ⟨W(L)²⟩/⟨W(L)⟩² = (⟨W_0²⟩/⟨W_0⟩²) e^C, so that across an interval [0, L) where no events occur, ESS_L ≳ ESS_0 e^{−σ²L²}, where ESS_0 is the effective sample size at s = 0, and ≳ denotes convergence in distribution as N → ∞ as before.
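The bound can be illustrated in the simplest extreme case of a perfectly correlated Gaussian rate (K(s_1, s_2) = σ² everywhere, so C = σ²L²): a Monte Carlo sketch with made-up parameter values, checking that ⟨W(L)²⟩/⟨W(L)⟩² grows by the factor e^{σ²L²}:

```python
import math
import random

random.seed(1)

mu, sigma, L = 1.0, 0.5, 1.0    # mean rate, rate std dev, waypoint spacing (illustrative)
n = 200_000

# Perfectly correlated case: R(s) = R constant in s, with R ~ N(mu, sigma^2),
# so W(L) = exp(-R * L) (taking W_0 = 1).
ws = [math.exp(-random.gauss(mu, sigma) * L) for _ in range(n)]
m1 = sum(ws) / n
m2 = sum(w * w for w in ws) / n
ratio = m2 / (m1 * m1)          # close to exp(sigma^2 * L^2)
ess_factor = 1.0 / ratio        # ESS_L / ESS_0, close to exp(-sigma^2 * L^2)
```

Choosing waypoints so that σ²L² stays of order one therefore keeps the ESS loss per waypoint interval bounded.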
In practice particles will not be drawn from the equilibrium distribution π_x(X), but from the joint distribution on X and Y conditioned on observations y. However, for most likelihoods conditioning will reduce the variance of R, as observations tend to constrain the distribution of likely particles, making this a conservative assumption. The other assumption that is likely not met is that R(s) is a Gaussian process; it is less clear whether making this approximation will in practice be conservative.

The sequential coalescent with recombination process
In formula (1), if s is a recombination point, x_s is the genealogy just left of the recombination point and includes the infinite branch from the root, so that b_u(x_s) = 1 for u above the root.
The measure (1) describes the CwR process exactly as long as x encodes both the local genealogy and the non-local branches used by the SCRM algorithm. In practice the SCRM algorithm prunes some of these branches, and we use (1) on the pruned x.
Note that we take the view that the realisation x encodes not only the sequence of genealogies x_s but also the number of recombinations |x| (some of which may not change the tree), their loci s_j = s^x_j, and the recombination and coalescence times ν^x_j and τ^x_j. This information is also kept in the implementation of the algorithm, and is used to calculate the sufficient statistics required for inference of the coalescence and recombination rates.
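The sufficient statistics in question are, per event type and epoch, the event counts |x|_i and opportunities B_i(x). A minimal sketch of how they might be accumulated from event records (the data layout here is our own illustration, not the SMCSMC implementation):

```python
from collections import defaultdict

def harvest(records, epochs):
    """Accumulate per-epoch event counts and opportunities.

    records: list of (event_type, time, opportunity) tuples;
    epochs: sorted list of epoch start times.
    Returns {(event_type, epoch_index): (count, opportunity)}.
    """
    def epoch_of(t):
        # Index of the last epoch boundary <= t.
        i = 0
        while i + 1 < len(epochs) and epochs[i + 1] <= t:
            i += 1
        return i

    stats = defaultdict(lambda: [0, 0.0])
    for etype, time, opp in records:
        s = stats[(etype, epoch_of(time))]
        s[0] += 1
        s[1] += opp
    return {k: tuple(v) for k, v in stats.items()}

records = [("coal", 350.0, 120.0), ("coal", 900.0, 80.0), ("recomb", 500.0, 60.0)]
stats = harvest(records, epochs=[0.0, 400.0, 800.0])
# Per-epoch rate estimates are then count/opportunity, cf. the EM update.
coal_rate_epoch2 = stats[("coal", 2)][0] / stats[("coal", 2)][1]
```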

Variational Bayes for Markov jump processes
We consider hidden Markov models where the latent variable follows a Markov jump process over x ∈ X, that with respect to a suitable measure dxdy admits a probability density of the form

π(x, y | θ) = f(x, y) ∏_i θ_i^{|x|_i} exp{−θ_i B_i(x)},

where f does not depend on θ. Here, |x|_i is the event count for events of type i in realisation x, and B_i(x) is the total opportunity for events of that type in x. For example, in our case B_i(x) and |x|_i are the recombination and coalescence opportunities and counts occurring in an epoch U ⊂ [0, ∞).
A Variational Bayes approach approximates the true joint posterior density π(x, θ|y) ∝ π_xy(x, y|θ)π_θ(θ), where π_θ is a prior on the parameters, with a probability density ϕ(x, θ) that is easier to work with (here the constant of proportionality implied by "∝" hides a constant density λ(y)). Following Hinton and van Camp [61] and MacKay [62], we choose to constrain ϕ by requiring it to factorize as ϕ(x, θ) = ϕ_x(x)ϕ_θ(θ), and we choose to optimize it by minimizing the Kullback-Leibler divergence KL(ϕ||π), also referred to as the variational free energy [63],

F(ϕ) = ∫∫ ϕ(x, θ) log[ϕ(x, θ)/π(x, θ|y)] dx dθ.

To optimize ϕ_θ(θ) we write F(ϕ) as a function of ϕ_θ with ϕ_x fixed, as

F(ϕ) = ∫ ϕ_θ(θ) log[ϕ_θ(θ) / exp{E_{ϕ_x}[log π_xy(x, y|θ)π_θ(θ)]}] dθ + const.

This is minimized by setting log ϕ_θ(θ) equal to the log of the denominator. We can still choose the prior π_θ(θ); a product of Gamma distributions ∏_i Γ(α_i, β_i)(θ_i) ∝ ∏_i θ_i^{α_i − 1} exp{−β_i θ_i} is suitable as it is conjugate to the factors appearing in the denominator. The result is that

ϕ_θ(θ) = ∏_i Γ(α′_i, β′_i)(θ_i),  with α′_i = α_i + E_{ϕ_x}[|x|_i] and β′_i = β_i + E_{ϕ_x}[B_i(x)].

Next, to optimize ϕ_x(x) we write F(ϕ) as a function of ϕ_x with ϕ_θ fixed,

F(ϕ) = ∫ ϕ_x(x) log[ϕ_x(x) / exp{E_{ϕ_θ}[log π_xy(x, y|θ)]}] dx + const.

Define θ̄_i ≔ E_{ϕ_θ}[θ_i] and θ*_i ≔ exp{E_{ϕ_θ}[log θ_i]}; then using properties of the Gamma distribution we get θ̄_i = α′_i/β′_i and θ*_i = exp{ψ(α′_i) − log β′_i}, where ψ is the digamma function. Again, F(ϕ) is minimized if the numerator and denominator are proportional, which happens for

ϕ_x(x) ∝ π_xy(x, y | θ̄) ∏_i η_i^{|x|_i},  where η_i ≔ θ*_i/θ̄_i = e^{ψ(α′_i)}/α′_i.

As given, the algorithms in this paper sample from a distribution of the form π(x|y, θ̄), but they can easily be modified to sample from ϕ_x(x) instead by including an additional factor η_i in a particle's weight for every event of type i that occurs.
To approximate the likelihood of the doubleton data, note that a node c with precisely two descendants (a, b) (a "cherry") at height l changes if a recombination occurs in either branch a or b and the new lineage coalesces out, or a recombination occurs outside of a and b and coalesces into either (Fig 1b). Again assuming that all TBLs are l and coalescences occur before l, the total rate of change is 2lρ(n−2)/n + (n−2)lρ · 2/n = 4lρ(n−2)/n ≕ ρ_C. When a cherry changes, we assume that the new cherry is drawn from the equilibrium distribution. To calculate the probability of observing c = (a, b) at equilibrium, assume that a tree supports 1 ≤ k ≤ n/2 cherries. The branches of c are among the 2k branches subtended by the tree's k cherries with probability (2k/n) · ((2k−1)/(n−1)), and a is paired with b with probability 1/(2k−1). Since k has mean n/3 if n ≥ 3 [64], the probability of observing (a, b) at equilibrium is 2/(3(n−1)). We approximate the likelihood of a doubleton by 0 if the cherry c is not in the tree, and by 1 if it is. Then, the likelihood of observing c_k = (a_k, b_k) at the last known locus s″_k conditional on the tree currently containing c_k is

p(a_k, b_k; s′_k, s″_k | (a_k, b_k; l) ∈ τ) = e^{−ρ_C s″_k} + (2/(3(n−1)))(1 − e^{−ρ_C s″_k}),  (30)

where (a_k, b_k; l) ∈ τ expresses that τ contains cherry c_k = (a_k, b_k) at height l. Now suppose c_k ∉ τ and let l̄ be the average TBL in τ. Under similar assumptions, cherries are created at a rate (n−1)ρ l̄, and assuming that new cherries are drawn from the equilibrium distribution, the likelihood of observing c_k at the first known locus s′_k is

p(a_k, b_k; s′_k, s″_k | l̄, (a_k, b_k) ∉ τ) = (2/(3(n−1)))(1 − e^{−ρ′_C s′_k}),  (31)

where ρ′_C = (n−1) l̄ ρ is the effective rate of recombinations that potentially result in the creation of c_k. Note that (30) and (31) are likelihoods for τ supporting c_k at the given locus, rather than for a doubleton mutation actually occurring.
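In code, likelihoods of the form (30) and (31) are straightforward; the sketch below is our own transliteration with illustrative parameter values, and checks the boundary behaviour: at distance 0 the present cherry is certain, and for large distances both expressions approach the equilibrium probability 2/(3(n−1)):

```python
import math

def cherry_equilibrium(n):
    # Probability that a given cherry (a, b) is present at equilibrium.
    return 2.0 / (3.0 * (n - 1))

def lik_cherry_present(s2, n, l, rho):
    """Eq (30): likelihood of the cherry at the last known locus s2,
    given that the current tree contains it at height l."""
    rho_c = 4.0 * l * rho * (n - 2) / n
    return math.exp(-rho_c * s2) + cherry_equilibrium(n) * (1.0 - math.exp(-rho_c * s2))

def lik_cherry_absent(s1, n, lbar, rho):
    """Eq (31): likelihood of the cherry at the first known locus s1,
    given it is currently absent; lbar is the average TBL."""
    rho_c = (n - 1) * lbar * rho
    return cherry_equilibrium(n) * (1.0 - math.exp(-rho_c * s1))

n, l, rho = 8, 1.0e4, 1e-8   # illustrative sample size, branch length, recombination rate
```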
These likelihoods show good performance, but result in some negative bias in the inferred population size for recent epochs. We traced this to the lack of correlation between l_i and l′_i, requiring a single very recent coalescence to explain a long segment devoid of singletons, rather than allowing for the possibility of several correlated coalescences, each in slightly earlier epochs. To model correlations, we averaged the likelihood above over ρ′ = ρ and ρ′ = ρ/2, each weighted with probability 1/2. This effectively removed the negative bias.
To deal with missing data, we reduce μ proportionally to the missing segment length and the number of lineages missing. For unphased mutation data, singletons and doubletons can still be extracted, and are greedily assigned to compatible lineages. The likelihoods are also similarly calculated, by greedily assigning cherries to observed doubletons. Unphased singletons can result from mutations on either of the individual's alleles; the likelihood term uses the sum of the two branch lengths for that individual to calculate the expected rate of unphased singletons.

Implementation details
While x 1:s refers to the entire sequence of genealogies along the sequence segment 1: s, storing this sequence would require too much memory. Instead we only store the most recent genealogy x s (including non-local branches where appropriate), which is sufficient to simulate subsequent genealogies using the SCRM algorithm. To implement epoch-dependent lags when harvesting sufficient statistics, we do store records of the events (recombinations, coalescences and migrations) that changed x along the sequence, as well as the associated opportunities, for each particle and each epoch; this implicitly stores the full ARG. To avoid making copies of potentially many event records when particles are duplicated at resampling, these are stored in a linked list, and are shared by duplicated particles where appropriate, forming a tree structure. Records are removed dynamically after contributing to the summary statistics, and when particles fail to be resampled, ensuring that memory usage is bounded.
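A minimal sketch of such a shared event-record structure (class names and the refcounting scheme are ours, not the SMCSMC code): each particle holds a pointer to the head of its record list; duplicated particles share the tail, so the records overall form a tree, and a chain of records is freed as soon as no particle or child record references it:

```python
class Record:
    """One event record; `parent` links to the next-older record, so
    particles with shared history share the tail of the list."""
    def __init__(self, event, parent=None):
        self.event = event
        self.parent = parent
        self.refs = 1              # one reference for the holder creating it
        if parent is not None:
            parent.refs += 1       # child record also references its parent

def release(record):
    # Drop one reference; free whole chains of records no longer referenced.
    while record is not None and record.refs == 1:
        record.refs = 0            # record is now dead
        record = record.parent
    if record is not None:
        record.refs -= 1

# Two particles diverge after a shared prefix of two events:
shared = Record("recomb@1000", Record("coal@500"))
p1 = Record("coal@1500", shared)
p2 = Record("recomb@1800", shared)  # duplicated particle shares `shared`
release(p1)                          # particle 1 fails to be resampled
```

Releasing particle 1 frees only its private record; the shared prefix survives because particle 2 still references it, which mirrors how memory usage stays bounded as particles die and are duplicated.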