Abstract
Phylodynamic models capture joint epidemiological-evolutionary dynamics during an outbreak, providing a powerful tool to enhance understanding and management of disease transmission. Existing phylodynamic approaches, however, mostly rely on various non-mechanistic or semi-mechanistic approximations of the underlying epidemiological-evolutionary process. Previous work by Lau and colleagues has shown that full Bayesian mechanistic models, without relying on these approximations, can enable highly accurate joint inference of the epidemiological-evolutionary dynamics including the unobserved transmission tree. However, the Lau method faces major computational bottlenecks. As the volume of genomic data collected during outbreaks continues to grow, it is crucial to develop scalable yet accurate phylodynamic methods. Here we propose a new Bayesian phylodynamic model, overcoming the major scalability issue in the previous method and enabling a readily deployable, yet accurate, phylodynamic modeling framework. Specifically, we develop a scalable spatio-temporal phylodynamic framework for inferring the transmission tree (ScITree) and other key epidemiological parameters considering the infinite sites assumption in modeling mutation on the sequence level, in contrast to the Lau method in which mutation was modeled explicitly on the nucleotide level. Our approach features full Bayesian implementation utilizing an exact likelihood to mechanistically integrate epidemiological and evolutionary processes. We develop a computationally-efficient data-augmentation Markov Chain Monte Carlo algorithm, inferring key model parameters and unobserved dynamics including the transmission tree. We assess performance of our method using multiple simulated outbreak datasets. Our results indicate that our method can achieve high inference accuracy, comparable to the performance of the Lau method. 
Additionally, our method scales significantly more efficiently for large outbreaks, with computing time increasing linearly with outbreak size, compared to the exponential scaling of the Lau method. We also demonstrate our method’s utility by applying our validated modeling framework to a dataset describing a foot-and-mouth disease outbreak in the UK. Our results show that our method is able to generate estimates of the transmission dynamics consistent with those from the prior method, further demonstrating the robustness of our new approach. In summary, our method provides a computationally-efficient, highly scalable, accurate modeling framework for inferring the joint spatio-temporal dynamics of epidemiological and evolutionary processes, facilitating timely and effective outbreak responses in space and time. Our method is implemented in our R package ScITree.
Author summary
Phylodynamic models integrate epidemiological and evolutionary dynamics to better understand disease transmission during outbreaks. However, many existing models rely on approximations that limit their accuracy and interpretability. Previous work by Lau and colleagues has shown that full Bayesian mechanistic models, without relying on these approximations, can enable highly accurate joint inference of the epidemiological-evolutionary dynamics including the unobserved transmission tree. Their method, however, faced significant computational challenges, particularly with large datasets. In this study, we present a new Bayesian phylodynamic model, ScITree, designed to overcome the scalability issues of the Lau method. By adopting the infinite-sites assumption for modeling mutations, rather than explicitly modeling nucleotide-level changes, ScITree achieves a significant improvement in computational efficiency while retaining accuracy of model inference. Our method, validated through simulations and real outbreak data, provides results comparable to the original Lau model but at a fraction of the computational cost, demonstrating its scalability and practical application for real-time outbreak responses. ScITree is implemented as an R package, making it accessible for further research and public health use.
Citation: Waddel H, Koelle K, Lau MSY (2025) ScITree: Scalable Bayesian inference of transmission tree from epidemiological and genomic data. PLoS Comput Biol 21(6): e1012657. https://doi.org/10.1371/journal.pcbi.1012657
Editor: Timothy Glenn Vaughan, ETH Zürich, SWITZERLAND
Received: November 19, 2024; Accepted: May 22, 2025; Published: June 10, 2025
Copyright: © 2025 Waddel et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data and code are available in a GitHub repository: https://github.com/hbwddl/ScITree.
Funding: H.W. was partially supported by the Emory MP3 (Molecules and Pathogens to Populations and Pandemics) initiative. The funder has no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Advances in next-generation sequencing have reduced the cost and complexity of genetic sequencing, making pathogen sampling and sequencing routine in outbreak surveillance and response [1,2]. Consequently, the volume of available pathogen genetic data has expanded rapidly. For example, millions of pathogen genetic sequences are publicly available on platforms such as GISAID or Genbank [3,4], in addition to private or restricted repositories. Practitioners using these datasets require methods that scale to the size of available genetic data, while maintaining accuracy and producing actionable information. This creates an environment conducive to the use of phylodynamic methods, which unify the epidemiological process of disease transmission and evolutionary process of the pathogens to sharpen the inference of the transmission dynamics [5–9]. One variable of particular interest to practitioners is the transmission tree (i.e., who-infected-whom), which can facilitate the estimation of key population-level epidemiological parameters such as the reproductive number [10] and fine-grained transmission dynamics including superspreading events [11,12].
While methods solely built on phylogeny can capture the relationships between genetic samples in an outbreak, a phylogeny calculated from those samples does not directly represent the transmission tree of that outbreak [13,14]. For instance, a phylogeny alone will not show the direction of disease transmission, and complications arise when ancestor and descendant pathogens are sampled in the same outbreak [5,15,16]. While multiple phylodynamic models that explicitly incorporate the underlying transmission dynamics have been proposed, they are subject to certain limitations. First, many phylodynamic methods conduct sequential or iterative inference of the phylogenetic tree and the transmission tree [1,15,17–20]. While these approaches and their variants can expedite inference and perform well when the primary inferential target is the phylogenetic tree, they may not yield the most accurate estimation of the transmission tree [21]. Second, other non-iterative methods simplify parts of the data likelihood and conduct inference using an ad-hoc pseudo-likelihood or composite likelihood, rather than a complete likelihood [22,23]. With these approaches, however, systematic inference and interpretation of certain epidemiological quantities of interest, such as the timing of the transmission tree, are generally challenging.
In this paper, we develop a fully Bayesian mechanistic phylodynamic model that utilizes an exact likelihood to describe the underlying joint epidemiological-evolutionary process. In particular, we extend a method previously developed by Lau et al. [24]. Briefly, the Lau method is an individual-level method which fits a mechanistic transmission model to epidemiological data (location, demographics, or other characteristics) and sampled pathogen genetic data. It accomplishes this by developing a custom Bayesian data-augmentation MCMC algorithm [25,26] which is able to efficiently leverage the complete-data likelihood and explore the high-dimensional parameter space of the joint epidemiological-evolutionary model. As such, it can robustly infer the posterior distributions of model quantities including the unobserved transmission tree (i.e., who-infected-whom) and its timing. It was shown to be able to accurately estimate the joint (epidemiological-evolutionary) process, and performed best in estimating the transmission tree in a study comprehensively comparing multiple phylodynamic methods [21]. However, the Lau method faces major computational constraints due to explicitly modeling mutation on the nucleotide level. Specifically, it requires nucleotide-wise imputation of all unobserved transmitted sequences, which results in a vast parameter space, including missing data and model parameters, which must be explored in the MCMC. Though its accuracy makes it an attractive option to use in disease outbreak transmission tree reconstruction, such nucleotide-wise imputation significantly impedes its computational scalability. The rapid nature of disease outbreaks as they unfold necessitates methods that can be used quickly and accurately. This paper aims to develop a new phylodynamic model scaling the Lau method to larger outbreaks, without compromising the accuracy of the inference of the transmission dynamics [24]. 
We accomplish this by incorporating an infinite-sites assumption to describe the evolutionary process, with mutations accumulating within an individual following a Poisson process. Rather than imputing mutations at each base pair in a genetic sequence, we model the genetic mutations between sequences through time. In parallel, we develop an efficient data-augmentation MCMC algorithm for this new model, which enables full Bayesian inference of the transmission dynamics leveraging a complete likelihood describing the observations including the observed epidemiological data and sampled genetic data.
Our methodology is first tested using simulated datasets, where we demonstrate that our new model can achieve similar parameter estimation and transmission tree coverage as the Lau method. Additionally, our results show that our model scales much more efficiently with increasing outbreak size compared to the Lau method. Furthermore, our methodology effectively accommodates incomplete sampling of infected individuals, maintaining reasonable accuracy in estimating the transmission tree under moderate sampling coverage. To demonstrate the real-life utility of our methodology, we apply our model to a previously analyzed outbreak of Foot-and-Mouth Disease (FMD) in livestock that occurred in the UK in 2001 [24,27]. Our results suggest that our new method can reconstruct critical estimates similar to those from the Lau method. Thus, we demonstrate our model’s ability to perform accurate and scalable phylodynamic inference for disease outbreaks.
Model and methods
Stochastic epidemiological process
We model the epidemiological transmission process using a general continuous-time spatio-temporal SEIR framework with Susceptible (S), Exposed (E), Infectious (I), and Removed (R) compartments. Let $\xi_S(t)$, $\xi_E(t)$, $\xi_I(t)$, and $\xi_R(t)$ indicate the sets of individuals in each category at time $t$ (Table 1). A susceptible individual $j \in \xi_S(t)$ is exposed as a new infection by a currently infectious individual $i \in \xi_I(t)$ at a stochastic rate $\beta K(d_{ij}; \kappa)$. The function $K(d_{ij}; \kappa)$ denotes a spatial kernel function, which allows the infectious challenge from individual $i$ to $j$ to vary with distance $d_{ij}$. For this work, we assume an exponentially-decaying kernel $K(d_{ij}; \kappa) = \exp(-\kappa d_{ij})$, though other kernel options are possible. The overall probability of $j$ becoming exposed to the pathogen in the time interval $[t, t + dt)$ is

$$P\big(j \text{ is exposed in } [t, t + dt)\big) = \Big\{ \sum_{i \in \xi_I(t)} \beta K(d_{ij}; \kappa) \Big\}\, dt + o(dt),$$

where $o(dt)$ represents a term that becomes negligible when $dt$ is infinitesimally small.
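As a minimal sketch of the exposure rate described above, the following Python snippet sums the kernel-weighted contributions of the currently infectious individuals to a single susceptible individual. The function names, the coordinate dictionary, and the values of `beta` and `kappa` are illustrative assumptions and are not taken from the ScITree package.

```python
import math


def exponential_kernel(distance, kappa):
    """Exponentially decaying spatial kernel K(d; kappa) = exp(-kappa * d)."""
    return math.exp(-kappa * distance)


def exposure_rate(j, infectious, coords, beta, kappa):
    """Total infectious challenge on individual j from all infectious individuals."""
    xj, yj = coords[j]
    total = 0.0
    for i in infectious:
        xi, yi = coords[i]
        d_ij = math.hypot(xi - xj, yi - yj)  # Euclidean distance between i and j
        total += beta * exponential_kernel(d_ij, kappa)
    return total


# Hypothetical toy population: individual 0 is susceptible, 1 and 2 are infectious.
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 2.0)}
rate = exposure_rate(0, infectious=[1, 2], coords=coords, beta=2.0, kappa=1.0)

# Over a short interval dt, the exposure probability is approximately rate * dt.
dt = 0.01
prob_exposed = rate * dt
```

The closer infectious individual contributes more to the rate, which is the behavior the distance-dependent kernel is meant to capture.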
After an individual is exposed, their sojourn time spent in the exposed category (class E) is modelled using a Gamma(a, b) distribution with shape a and scale b, with a density fE(a,b) and cumulative distribution FE(a,b) (Table 2) [28,29]. After this latent period has elapsed, the individual moves into the infectious category (class I) and spends an amount of time governed by a Weibull(c, d) distribution with shape c, scale d, density fI(c,d) and distribution FI(c,d) [28,29]. Following this sojourn time in the infectious category, the individual recovers or is removed from the population (class R). Note that these sojourn time distributions are selected as appropriate, and do not necessarily need to be the Gamma or Weibull distributions. The sojourn times are assumed to be independent between individuals.
Stochastic evolutionary process
Surrogate modeling of evolutionary dynamics: Infinite-sites model.
Our model aims to mechanistically and fully capture the joint epidemiological-evolutionary process schematically illustrated in Fig 1, which requires inference of unobserved transmission and partially-observed evolutionary dynamics.
Fig 1. Grey circles and lines represent unobserved timepoints and events, such as the exact transmission time from individual i to individual j, and the number of mutations that occur between events. Genetic sequences for the individuals are sampled at their respective sampling timepoints.
In the Lau method, pathogen evolution is explicitly modeled on the nucleotide level using continuous-time Markov processes such as the Kimura model [24,30]. Here we describe the evolutionary process using a parsimonious model that adopts the infinite-sites assumption, where a mutation can occur at most once at each nucleotide site, and no mutation reversions can take place. Specifically, we assume that mutations occur according to a Poisson process, where the number of mutations in a sequence of length $n$ over a period of time $dt$ is modeled as $\mathrm{Pois}(\lambda n\, dt)$. The parameter $\lambda$ characterizes the rate of mutations per unit time per site.
It is assumed that this evolutionary process is conditionally independent of the epidemiological process, given the transmission tree and exposure times (modeled by the stochastic epidemiological process previously described).
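To make the Poisson mutation model concrete, the sketch below computes the expected number of mutations over an interval and draws mutation counts with a simple Poisson sampler. The per-site rate of 1e-4 per day and the 5-day interval are hypothetical values chosen for illustration; the sequence length of 8000 matches the simulations reported in Results. The sampler uses Knuth's multiplication method, which is a convenience choice here and not necessarily what the ScITree implementation uses.

```python
import math
import random


def expected_mutations(rate_per_site, n_sites, elapsed):
    """Poisson mean for the number of mutations: rate * sequence length * time."""
    return rate_per_site * n_sites * elapsed


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's method (suitable for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(1)
mean = expected_mutations(rate_per_site=1e-4, n_sites=8000, elapsed=5.0)
draws = [sample_poisson(mean, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)  # should be close to the Poisson mean
```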
Simulating ground-truth evolutionary dynamics: Nucleotide substitution Markov model.
Note that the infinite-sites assumption offers a surrogate modeling approach for more complicated Markov nucleotide substitution processes [31]. To test the robustness of our surrogate model incorporating the infinite-sites assumption, in simulation studies (described in Results) we adopt the two-parameter continuous-time Kimura Markov model [30] used in the Lau method [24] as the ‘ground-truth’ model to simulate fine-grained nucleotide-level mutations. Briefly, the Kimura model we use is a nucleotide substitution model which assumes that transition mutations (pyrimidine to pyrimidine or purine to purine) occur at a different, typically higher rate than transversion mutations (pyrimidine to purine or vice versa). The model is governed by two parameters, $\mu_1$ and $\mu_2$, which determine the rates of transition and transversion from base $x$ to base $y$ over an interval of length $t$ according to the probabilities

$$P_{xy}(t) = \begin{cases} \tfrac{1}{4} + \tfrac{1}{4} e^{-4\mu_2 t} + \tfrac{1}{2} e^{-2(\mu_1 + \mu_2) t}, & x = y, \\ \tfrac{1}{4} + \tfrac{1}{4} e^{-4\mu_2 t} - \tfrac{1}{2} e^{-2(\mu_1 + \mu_2) t}, & x \to y \text{ is a transition}, \\ \tfrac{1}{4} - \tfrac{1}{4} e^{-4\mu_2 t}, & x \to y \text{ is a transversion}. \end{cases}$$

The model allows for reversion of a mutation back into the original nucleotide. Note that, while this model allows substitutions between time points, it assumes no genetic diversity within a host at a given time, following the single-dominant-strain assumption [24].
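The sketch below evaluates the standard two-parameter Kimura transition probabilities and checks two properties that matter for the comparison with the infinite-sites model: the probabilities sum to one, and over a short interval the probability of any change is close to $(\mu_1 + 2\mu_2)\,t$. It assumes the usual parameterization in which `mu1` is the transition rate and `mu2` is the rate of each of the two possible transversions; the numerical values are hypothetical.

```python
import math


def k80_probabilities(mu1, mu2, t):
    """Return (no change, transition, single transversion) probabilities."""
    decay_tv = math.exp(-4.0 * mu2 * t)
    decay_ts = math.exp(-2.0 * (mu1 + mu2) * t)
    same = 0.25 + 0.25 * decay_tv + 0.5 * decay_ts
    transition = 0.25 + 0.25 * decay_tv - 0.5 * decay_ts
    transversion = 0.25 - 0.25 * decay_tv  # for each of the two transversions
    return same, transition, transversion


mu1, mu2, t = 2e-5, 1e-5, 10.0  # hypothetical rates per site per day, and days
same, transition, transversion = k80_probabilities(mu1, mu2, t)

total = same + transition + 2.0 * transversion  # one transition, two transversions
p_change = 1.0 - same
first_order = (mu1 + 2.0 * mu2) * t  # short-interval approximation
```

Because reversions are rare over short intervals, `p_change` and `first_order` agree closely, which is why a single per-site rate can stand in for the two Kimura parameters when substitution rates are low.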
A Bayesian modeling framework
Complete-data likelihood.
As our inferential procedures make extensive use of data augmentation techniques [25,26], we begin by discussing the formulation of a complete-data likelihood for the joint epidemiological-evolutionary model, assuming all model quantities are known. It is noteworthy that some of the quantities required to calculate the likelihood will be observed directly while others will be inferred/augmented. Notation is explained in Table 1.
We model a population of N individuals, for which we observe the geospatial locations of all the individuals. We observe an outbreak in this population between time 0 and time $T_{max}$. We define $q_j(T)$ as the accumulated infectious challenge for individual $j$ until time $T$:

$$q_j(T) = \int_0^T \sum_{i \in \xi_I(t)} \beta K(d_{ij}; \kappa)\, dt,$$

where $\xi_I(t)$ is the set of infectious individuals at time $t$, $\beta$ is the transmission rate, $K$ is the spatial kernel function with parameter $\kappa$, and $d_{ij}$ is the Euclidean distance between individuals $i$ and $j$. The function $P(j, \psi_j)$ defines the contribution to the likelihood arising from the exposure of $j$ by the source $\psi_j$, and is given by

$$P(j, \psi_j) = \beta K(d_{\psi_j j}; \kappa).$$

For an individual $j$ who is exposed, there is a set of $m_j$ timepoints $t_j = (t_{j,1}, \dots, t_{j,m_j})$, where these timepoints correspond to “critical” events such as times of infections and sequence sampling times. Rather than modeling the missing genetic sequences that would be observed at those timepoints, we model the genetic distances between successive sequences. The genetic distance is measured using the Hamming distance, which is the number of base pair differences between sequences [32]. We denote these distances for an individual $j$ as $g_j = (g_{j,1}, \dots, g_{j,m_j - 1})$, where $g_j$ is the set of genetic distances between the critical events in individual $j$ (Fig 1).
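For concreteness, the Hamming distance between two aligned sequences of equal length can be computed as in the sketch below. The function name and the short example sequences are illustrative only.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Two hypothetical 8-base sequences differing at the 4th and 8th positions.
distance = hamming_distance("ACGTACGT", "ACGAACGG")
```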
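The contribution to the likelihood from the mutation of an individual $j$’s genetic sequences is given by the following formula:

$$M_j = \prod_{k=1}^{m_j - 1} f_{Pois}\big(g_{j,k};\ \lambda n\,(t_{j,k+1} - t_{j,k})\big),$$

where $g_{j,k}$ is the genetic distance accumulated between the successive critical timepoints $t_{j,k}$ and $t_{j,k+1}$, and $f_{Pois}$ is the probability mass function of a Poisson distribution with a rate of $\lambda n\,(t_{j,k+1} - t_{j,k})$, with $n$ the sequence length and $\lambda$ the mutation rate per site per unit time.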
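The mutation contribution for a single host can be sketched on the log scale as a sum of Poisson log-probabilities, one per interval between successive critical timepoints. The function and argument names are hypothetical, the example values are invented for illustration, and the timepoints are assumed to be strictly increasing so that every Poisson mean is positive.

```python
import math


def poisson_log_pmf(count, mean):
    """Log of the Poisson probability mass function; requires mean > 0."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def mutation_log_likelihood(times, distances, rate_per_site, n_sites):
    """Sum of Poisson log-probabilities over successive critical timepoints."""
    if len(distances) != len(times) - 1:
        raise ValueError("need exactly one distance per pair of successive times")
    total = 0.0
    for k, distance in enumerate(distances):
        elapsed = times[k + 1] - times[k]
        mean = rate_per_site * n_sites * elapsed
        total += poisson_log_pmf(distance, mean)
    return total


# Hypothetical host: exposed at day 0, with critical events at days 2.5 and 5.
log_lik = mutation_log_likelihood(
    times=[0.0, 2.5, 5.0], distances=[0, 1], rate_per_site=1e-4, n_sites=8000
)
```

Working with distances keeps the number of augmented quantities per host equal to the number of intervals, rather than the number of nucleotides.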
Denoting by $\theta$ the vector of scalar model parameters, our full likelihood, which combines the epidemiological and genomic contributions and is evaluated on the complete data $z$ (both observed and unobserved), is thus given by:
where parameters are defined previously (Tables 1 and 2). Note that the set of ever-exposed individuals over which the exposure contributions are taken excludes the earliest exposure, i.e., the index case.
Custom MCMC inferential algorithm: Joint sampling of transmission tree, exposure time, and genetic mutations.
One major challenge in conducting full Bayesian inference for our previously described joint epidemiological-evolutionary model is developing an efficient MCMC algorithm to explore the vast latent/unobserved parameter space, particularly the joint space of the transmission tree, exposure times, and genetic mutations. Here, we develop a custom MCMC algorithm which can efficiently and effectively explore the vast model parameter space.
We begin by proposing a new infecting source for j, , drawn with probability
Given the new source $\psi'_j$, a new exposure time $t'_{E,j}$ is proposed from a uniform distribution bounded by the infectious times of the infection source and recipient, as well as the genomic sampling time, as an individual would not have a genetic sample before being exposed:

$$t'_{E,j} \sim \mathrm{Unif}\big(t_{I,\psi'_j},\ \min\{t_{I,j},\ t_{G,j}\}\big),$$

where $t_{I,\psi'_j}$ and $t_{I,j}$ are the times at which the source and the recipient become infectious, and $t_{G,j}$ is the time at which the genetic sample of $j$ is taken.

Given the sampled source of exposure $\psi'_j$ and exposure time $t'_{E,j}$, we now describe how we sample the genetic mutations. The main idea is to exploit a local greedy sampler that respects and imposes the infinite-sites assumption in the vicinity of the newly proposed exposure time $t'_{E,j}$ (Fig 2). The algorithm is greedy in the sense that it does not necessarily respect the infinite-sites assumption when considering aggregate mutations across all the time points within a particular host, nor across the entire transmission tree. This maintains computational efficiency, while still recovering the parameters and transmission tree (Figs 3 and 4). Specifically, our greedy sampler proposes new genetic distances along the lineage connecting the observed genetic samples $G_{\psi'_j}$ and $G_j$ associated with the newly proposed source $\psi'_j$ and the individual $j$, respectively. Under the infinite-sites assumption along a lineage, a particular genetic distance $g_k$ neighboring $t'_{E,j}$ follows a Binomial distribution, i.e.,

$$g_k \sim \mathrm{Binomial}(D, p_k),$$

where $D$ is the genetic distance between the observed samples of $\psi'_j$ and $j$, and $p_k$ is generally the time duration associated with $g_k$ normalized by the total elapsed time along the lineage. Fig 2 illustrates our greedy sampler for two typical scenarios. Other scenarios can be readily accommodated and are described in S1 Text. Note that the acceptance probability for a particular proposed value needs to be properly specified.

Fig 2. Our local greedy algorithm respects the infinite-sites assumption for mutations neighboring the newly proposed exposure time $t'_{E,j}$, along a lineage (represented by a blue line). We illustrate our algorithm using two scenarios in which both the newly proposed source $\psi'_j$ and the infectee individual $j$ have observed genetic samples neighboring the newly proposed exposure time $t'_{E,j}$. In Scenario A, the genetic sample of $\psi'_j$ is an ancestor of that of the individual $j$. In Scenario B, the two genetic samples share the same most recent common ancestor.

Fig 3. The coverage rate here is defined as the proportion of maximum posterior sources which are correct. We see comparable performance by ScITree and the Lau method. If we also count individuals whose second-most likely source is the correct one, we see comparable or even improved performance.

Fig 4. Dashed lines denote the true simulation values of the parameters which are comparable between the Lau method and ScITree. For the mutation rate, the dashed line shows the approximate value under the Kimura model.
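The Binomial proposal used by the local greedy sampler can be sketched as follows: each of the mutations separating the two observed samples is assigned to a given segment of the lineage with probability equal to that segment's share of the total elapsed time. This is only an illustration of the proposal step under assumed inputs; the function name and example values are hypothetical, and the acceptance probability that a full Metropolis-Hastings update requires is not shown.

```python
import random


def propose_segment_distance(total_distance, segment_duration, lineage_duration, rng):
    """Draw a Binomial(total_distance, segment_duration / lineage_duration) count."""
    if not 0.0 <= segment_duration <= lineage_duration or lineage_duration <= 0.0:
        raise ValueError("segment must lie within a lineage of positive duration")
    p = segment_duration / lineage_duration
    # Each mutation is placed on the segment independently with probability p.
    return sum(rng.random() < p for _ in range(total_distance))


rng = random.Random(7)
total_distance = 12       # hypothetical distance between the two observed samples
lineage_duration = 7.0    # hypothetical elapsed time along the connecting lineage
segment_duration = 3.0    # part of the lineage adjacent to the proposed exposure time
proposed = propose_segment_distance(
    total_distance, segment_duration, lineage_duration, rng
)
remaining = total_distance - proposed  # mutations left for the rest of the lineage
```

Because the proposed count and the remainder always add up to the observed distance, no mutation is counted twice along the lineage, which is how the proposal respects the infinite-sites assumption locally.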
Further details of our algorithm are described in S1 Text, including the sampling of the scalar parameters in $\theta$ and the prior distributions adopted.
Results
Simulation studies
We tested our proposed method using multiple simulated synthetic outbreaks. We simulated our outbreaks under the more fine-grained evolutionary model described in Lau, in which a 2-parameter Kimura model is used to model genetic mutation on the nucleotide level (see section Stochastic Evolutionary Process) [24,30]. The Kimura model also allows for reversion of mutations through time. By testing our model, using the infinite-sites assumption, against data simulated under the Kimura model, we can rigorously evaluate the robustness of our surrogate modeling approach under more general genetic mutation conditions.
We simulate datasets based on the simulation studies from the Lau paper [24]. We consider that an outbreak in a population of N = 150 individuals begins at time t = 0, given an index case, and proceeds for 104 days. We use epidemiological parameters of , with sojourn times in the exposed category following a Gamma(10,0.5) distribution (mean 5 days, standard deviation 1.58 days) and sojourn times in the infectious category following a Weibull(2, 2) (mean 1.77 days, standard deviation 0.93 days). Pathogen genetic sequences with n = 8000 bases are simulated using the mutation parameters of the Kimura model,
and
. The Lau model estimates the Kimura model parameters $\mu_1$ and $\mu_2$, while ScITree estimates $\lambda$, the rate of mutations per day per nucleotide site.
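The means and standard deviations quoted above for the sojourn-time distributions follow from the closed-form moments of the Gamma (shape, scale) and Weibull (shape, scale) distributions. The short check below reproduces them; the function names are illustrative.

```python
import math


def gamma_mean_sd(shape, scale):
    """Mean and standard deviation of a Gamma(shape, scale) distribution."""
    return shape * scale, math.sqrt(shape) * scale


def weibull_mean_sd(shape, scale):
    """Mean and standard deviation of a Weibull(shape, scale) distribution."""
    first = math.gamma(1.0 + 1.0 / shape)
    second = math.gamma(1.0 + 2.0 / shape)
    mean = scale * first
    variance = scale ** 2 * (second - first ** 2)
    return mean, math.sqrt(variance)


gamma_mean, gamma_sd = gamma_mean_sd(shape=10.0, scale=0.5)        # latent period
weibull_mean, weibull_sd = weibull_mean_sd(shape=2.0, scale=2.0)   # infectious period
```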
We compare performance by assessing the coverage accuracy of the imputed transmission tree. As in [21], we define coverage accuracy as the proportion of individuals for whom the source with the most posterior support was the true source. We observe in Fig 3 that our new method achieves a coverage rate comparable to that of the Lau method, with coverage above 90% for each dataset. We also examined the posterior distributions of the scalar parameters in $\theta$ for each replicate (Fig 4). The two methods yield similar posterior distributions, though those from the new method are typically slightly wider. Although the Kimura substitution rate parameters $\mu_1$ (transitions) and $\mu_2$ (transversions) are not directly comparable to the ScITree mutation rate $\lambda$, we may approximate the corresponding expected mutation rate (i.e., $\lambda$ under the Kimura model) as $\mu_1 + 2\mu_2$, since each base has one possible transition and two possible transversions. The posterior distribution of our mutation parameter $\lambda$ is also broadly consistent with this approximation, suggesting that our algorithm is able to explore and infer the latent model space efficiently. We report credible interval coverage rates for scalar parameters and the transmission tree for further simulations in S3 Table.
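The coverage accuracy defined above can be computed from MCMC output as in the sketch below. It assumes, purely for illustration, that each posterior sample is stored as a dictionary mapping an individual to its sampled source; this data layout and the function names are not taken from the ScITree package.

```python
from collections import Counter


def max_posterior_sources(posterior_samples):
    """For each individual, return the source sampled most often across iterations."""
    counts = {}
    for sample in posterior_samples:
        for individual, source in sample.items():
            counts.setdefault(individual, Counter())[source] += 1
    return {ind: counter.most_common(1)[0][0] for ind, counter in counts.items()}


def coverage_accuracy(posterior_samples, true_sources):
    """Proportion of individuals whose maximum posterior source is the true source."""
    best = max_posterior_sources(posterior_samples)
    hits = sum(best[ind] == source for ind, source in true_sources.items())
    return hits / len(true_sources)


# Three hypothetical MCMC iterations for two infected individuals (2 and 3).
samples = [{2: 1, 3: 1}, {2: 1, 3: 2}, {2: 1, 3: 2}]
truth = {2: 1, 3: 1}
best = max_posterior_sources(samples)
coverage = coverage_accuracy(samples, truth)
```

In this toy example the source of individual 2 is recovered while that of individual 3 is not, giving a coverage of one half.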
We also assess the robustness of our method when the onset of infectiousness is imperfectly observed (S1 Text). In general, we find that our method remains robust when infectious times are known within a time window.
In addition, we investigate the limiting scenarios of our method (see S1 Text). Our results show that when the true substitution rate is extremely (if not unrealistically) high, the infinite-sites assumption may be less applicable and our model will tend to underestimate the rate. However, the accuracy of transmission tree inference generally remains high.
Improved computational scalability
We evaluated the computational performance of our method on outbreaks simulated under the same parameter values, with population sizes between N = 100 and N = 1000. We simulated a viral genetic sequence n = 8,000 base pairs in length, similar to the genome length of the Foot-and-Mouth Disease Virus and on a similar scale to other viral genome lengths [33,34].
We see significant improvement in computation time for larger outbreaks when compared to the Lau method, as the computation time for ScITree scales linearly with increasing outbreak size while the computation time for the Lau method scales non-linearly. For a moderately-sized and densely-sampled outbreak of N = 1000 individuals, ScITree computation times are, on average, less than one-fourteenth of those of the Lau method (Fig 5). This is expected, as the Lau method must make 8,000 nucleotide-level imputations for each transmitted sequence, resulting in exploration of the vast joint parameter space of the transmission tree, exposure time, and sequence data [24]. In contrast, our model requires far fewer imputations. We expect the discrepancy in computational performance between our method and the Lau method to increase further for sequences longer than the length (n = 8,000) considered here.
Fig 5. For population sizes from N = 100 to N = 1000, five outbreaks were simulated at each population size under the same parameter set.
Tolerance to incomplete genetic sampling
In real-world conditions, it is possible that only a subset of infected individuals in an outbreak will have pathogen genetic samples sequenced. Thus, we tested our method’s tolerance to incomplete sampling, where less than 100% of the infected individuals have a pathogen genetic sample. Following Lau, we examine the coverage at each MCMC iteration to assess incomplete sampling performance [24]. For each of the five simulated datasets, we considered the baseline scenario with 100% sampling (i.e., every infection has a genetic sample), then subsequently reduced the sampling percentage by randomly removing the genetic samples of some infections, though still observing the case, and re-ran the inference. We observe in Fig 6 that the tree accuracy of our method decreases as the sampled percentage decreases. This is not unexpected, however, and the coverage of our method still remains around 60%, even as the sampling proportion drops to 10%. This demonstrates that our ScITree model can effectively incorporate available genetic samples to improve inference, while also accommodating a moderate sampling rate in practice and achieving a reasonable estimate.
Fig 6. Each density plot represents the posterior distribution of the tree coverage rate, pooled from five subsampled datasets derived from the baseline fully sampled datasets, at a specific subsampling rate.
Impact of unobserved infections
Our method assumes that every infectious individual is observed during an outbreak. Here we test whether our method is still able to detect the transmission tree “lineage” and infer the previous source in the chain of transmission as the maximum posterior source, when some infections are unobserved.
For each outbreak simulated in section “Simulation studies”, we randomly removed observations of ten individuals which were the source of exposure for at least one other individual in the outbreak. For those individuals who were infected by a removed individual, we tested whether their new inferred source is the source of transmission for their removed true source of infection. We calculated whether the most probable posterior source recaptured the lineage, as well as whether the top two most probable posterior sources detected the lineage. We repeated this experiment three times for each outbreak, removing different sets of ten individuals for each inference.
In general, we found that the ScITree method is still able to reconstruct the chain of transmission when a source of exposure is missing. Across replicates, we were able to recapture, on average, 81.9% of the transmission chains using the most probable posterior sources, and 91.3% of the transmission chains if we also included the second-most probable posterior sources.
Case study: Foot-and-Mouth Disease Virus (FMDV) outbreak in the UK
We apply our algorithm to an FMDV outbreak which occurred in 2001 on 12 farms in Darlington, County Durham, UK. This dataset was previously analyzed by Lau et al. (2015) and Morelli et al. (2012) [24,27]. In this case study, following previous work [24,27], we consider premises with spatially confined host populations as the unit of infection, using the centroids of premises as geographical coordinates. Consensus FMDV genomes sampled from each premises are used in the inference, with the removal of a premises from the population representing its culling. Every premises was sampled, yielding 12 genomic samples with a sequence length of n = 8196 nucleotides each. The data included geographical location, sampling time and sampled sequences, estimated onset time of lesions (here, taken to be the time of becoming infectious), and removal/culling times of the infected premises.
Fig 7 shows the transmission tree constructed by taking each individual’s most probable posterior source. Our results largely agree with the transmission tree estimated in Lau, and in particular, we reconstruct the longest sequence of transmission ( identified in both previous analyses [24,27].
The infection source with the highest posterior probability of transmission is used as the estimated source for each farm. The same set of labels for the farms used in [27] is used. We reconstruct the transmission chain, the longest sequence of transmissions that was identified both in [27] and [24] (highlighted in red). Farms are plotted based on their geographic locations (longitude and latitude).
Table 3 displays posterior summaries for the model’s scalar epidemiological and mutation parameters. We estimate similar median epidemiological and mutation parameters as the Lau method, indicating that our method is able to achieve robust estimates of the joint epidemiological-evolutionary dynamics without explicitly considering nucleotide-level mutations. Using the approximation , with the Lau method we obtained a median estimate of the mutation rate under the Kimura model of
and a 95% credible interval of (
,
). This is broadly consistent with our estimate of
using ScITree, with a median of
and a 95% credible interval of (
,
) (Table 3). We observe that the posterior credible intervals for scalar parameters tend to be slightly wider for our results than for the Lau method.
Discussion
Phylodynamic models capture the joint epidemiological and evolutionary dynamics of an outbreak, providing a powerful tool for enhancing our understanding and management of disease transmission. However, existing phylodynamic approaches have several limitations. In particular, many rely on ad-hoc, non-mechanistic, or semi-mechanistic approximations of the underlying epidemiological-evolutionary process. While these approximations have proven robust when the primary focus is on estimating evolutionary dynamics, systematic inference and mechanistic interpretation of the underlying epidemiological dynamics, particularly the transmission tree, are generally challenging with these approximations.
Lau et al. made the first attempt at fully mechanistically integrating epidemiological and genomic data within a Bayesian data-augmentation framework [24]. Their methodology is able to utilize a genuine complete-data likelihood that more realistically captures the underlying epidemiological-evolutionary process, as opposed to the ad-hoc pseudo-likelihoods used in many approaches. As such, as shown in a study comprehensively comparing multiple phylodynamic methods [21], their methodology can yield the most accurate estimate of the transmission tree. Their method, however, is limited by poor computational scalability as epidemic size increases. As the amount of genetic data available for outbreaks continues to grow, it becomes imperative to develop a phylodynamic model that not only performs well but is also feasible for use with large, modern outbreak datasets.
In this paper, building on the framework developed by Lau et al. [24], we develop a more efficient and scalable phylodynamic framework for inferring the transmission dynamics, including the transmission tree. Our results show that our method retains the inferential accuracy of the original approach while significantly reducing the computational burden by bypassing the need to explicitly model mutations at the nucleotide level. Our results also suggest that our method can accommodate incomplete genomic sampling of infected individuals without significantly affecting the accuracy of the inferred tree.
We also demonstrate our method’s utility by applying our validated modeling framework to a dataset describing a foot-and-mouth disease (FMD) outbreak in the UK. Our results show that our method generates estimates of the transmission dynamics consistent with those from the Lau method [24], further demonstrating the robustness of our new method. In summary, our method provides a computationally efficient, highly scalable, accurate modeling framework for inferring the joint spatiotemporal dynamics of epidemiological and evolutionary processes, facilitating timely and effective outbreak responses in space and time. Our method represents a distinct modeling framework that complements existing phylodynamic methods built on coalescent theory [17,35–37]. It employs genuine joint inference of the (partially observed) epidemiological-evolutionary dynamics, directly and accurately reconstructing the transmission tree.
Our study has several limitations, which future extensions of our methodology could address. An inherent limitation of our method, compared to the Lau method, arises because we work with a summary statistic, the Hamming distance, rather than the nucleotide-level model [24]. This creates a trade-off between scalability and precision that must be balanced carefully in practice. The parsimony in our method results in slightly “flatter” posterior distributions for both scalar parameters and the source of infection. Nevertheless, we observe very comparable performance in inferring the underlying epidemiological-evolutionary dynamics between ScITree and the Lau method. In addition, the local greedy algorithm, while efficient, does not necessarily apply the infinite-sites assumption globally in the transmission tree. Further work could involve imposing the assumption more globally, such as across a single host or even the entire transmission tree. Finally, our method works with a consensus sequence of each host’s pathogen population, which may not show much divergence over a very short time period. Further work with this model may incorporate non-consensus sequence data or haplotype networks to better capture the within-host population dynamics that are present during an outbreak.
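To make the summary statistic discussed above concrete, the following minimal sketch computes the Hamming distance between two aligned sequences of equal length. The function name and the toy sequences are hypothetical and are not taken from the ScITree package; the sketch only illustrates why the distance is a coarser summary than a nucleotide-level model.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Toy consensus sequences (hypothetical, for illustration only).
reference = "ACGTACGT"
variant_1 = "ACGTACGA"  # last site changed T -> A
variant_2 = "ACGTACGC"  # last site changed T -> C

# Both variants are the same distance from the reference, although the
# underlying nucleotide changes differ; this is the information that a
# distance-based summary does not retain.
d1 = hamming_distance(reference, variant_1)
d2 = hamming_distance(reference, variant_2)
```

Here `d1` and `d2` are equal even though the substitutions differ, which is consistent with the slightly flatter posteriors noted above when only distances are modeled.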
Supporting information
S1 Text. We present the following supplementary information in the S1 Text.
1) Our general MCMC framework for sampling unobserved data and scalar parameters; 2) Additional scenarios for jointly sampling , and
and scenarios in which genetic sampling data is unavailable for a transmission pair; 3) Additional simulations and credible interval coverage rates; 4) Inference of infectious time Ij; 5) Method performance under extreme mutation rates; 6) Performance benchmarking; 7) Computing environment.
https://doi.org/10.1371/journal.pcbi.1012657.s001
(PDF)
S1 Fig. Proposing sequence genetic distances with a local greedy algorithm when genomic sampling data is available for a transmission pair.
The local greedy algorithm we use respects the infinite-sites assumption for mutations adjacent to the proposed exposure time. When genomic sampling data are available, we propose the genetic distances adjacent to the proposed exposure time with a binomial draw from the sample genetic distance, while respecting the current genetic distances in the source i.
https://doi.org/10.1371/journal.pcbi.1012657.s002
(TIF)
S2 Fig. Proposing sequence genetic distances with our local greedy algorithm when genomic sampling data is unavailable.
When sample data are not available for the transmission pair i and j, we propose the new genetic distances adjacent to the proposed exposure time with a binomial draw if we are inserting the exposure time into an existing genetic distance (scenario A), or with a Poisson draw using the current value of the mutation rate λ in the MCMC (scenario A, host j, and scenario B).
https://doi.org/10.1371/journal.pcbi.1012657.s003
(TIF)
S3 Fig. Posterior distribution of scalar parameters when Ij is also estimated.
In this case, we assumed that we knew Ij within a 2-day window (which, in practice, may be informed by the symptom onset time).
https://doi.org/10.1371/journal.pcbi.1012657.s004
(TIF)
S4 Fig. Posterior distribution of scalar parameters at extreme values of λ.
Inference was done for a simulated dataset with 8,000 base pairs in the pathogen genomic data. In figure (a), the mutation rate was λ≈0.006 mutations per base per day, for a mutation rate across the entire sequence of approximately 48 mutations per day. In figure (b), the mutation rate was λ≈0.03 mutations per base per day, for a mutation rate across the entire sequence of approximately 240 mutations per day.
https://doi.org/10.1371/journal.pcbi.1012657.s005
(TIF)
S5 Fig. Effective sample size of scalar posterior distributions for ScITree and the Lau 2015 method.
We approximate λ for the Lau 2015 method with the formula λ≈μ1 + 2μ2.
https://doi.org/10.1371/journal.pcbi.1012657.s006
(TIF)
S1 Table. Prior distributions for model parameters in simulation analyses.
https://doi.org/10.1371/journal.pcbi.1012657.s007
(PDF)
S2 Table. Prior distributions for model parameters in Foot-and-Mouth Disease outbreak analysis.
https://doi.org/10.1371/journal.pcbi.1012657.s008
(PDF)
S3 Table. 95% Credible Interval coverage rate and posterior source coverage rate for 50 simulations.
https://doi.org/10.1371/journal.pcbi.1012657.s009
(PDF)
S4 Table. Transmission tree coverage when estimating Ij.
https://doi.org/10.1371/journal.pcbi.1012657.s010
(PDF)
S5 Table. Transmission tree coverage under extreme values of the mutation rate λ.
https://doi.org/10.1371/journal.pcbi.1012657.s011
(PDF)
Acknowledgments
We acknowledge the support received through the Emory MP3 (Molecules and Pathogens to Populations and Pandemics) initiative, and Drs. Anice Lowen and David VanInsberghe in the Swine MP3 group for discussions.
References
- 1. Cottam EM, Wadsworth J, Shaw AE, Rowlands RJ, Goatley L, Maan S, et al. Transmission pathways of foot-and-mouth disease virus in the United Kingdom in 2007. PLoS Pathog. 2008;4(4):e1000050. https://doi.org/10.1371/journal.ppat.1000050 pmid:18421380
- 2. Attwood SW, Hill SC, Aanensen DM, Connor TR, Pybus OG. Phylogenetic and phylodynamic approaches to understanding and combating the early SARS-CoV-2 pandemic. Nat Rev Genet. 2022;23(9):547–62. pmid:35459859
- 3. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 2017;22(13):30494. pmid:28382917
- 4. Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020;48(D1):D84–6. pmid:31665464
- 5. Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, Mumford JA, et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science. 2004;303(5656):327–32. pmid:14726583
- 6. Pybus OG, Rambaut A. Evolutionary analysis of the dynamics of viral infectious disease. Nat Rev Genet. 2009;10(8):540–50. https://doi.org/10.1038/nrg2583 pmid:19564871
- 7. Volz EM, Kosakovsky Pond SL, Ward MJ, Leigh Brown AJ, Frost SDW. Phylodynamics of infectious disease epidemics. Genetics. 2009;183(4):1421–30. https://doi.org/10.1534/genetics.109.106021 pmid:19797047
- 8. Stadler T, Kouyos R, von Wyl V, Yerly S, Böni J, Bürgisser P, et al. Estimating the basic reproductive number from viral sequence data. Mol Biol Evol. 2012;29(1):347–57. https://doi.org/10.1093/molbev/msr217 pmid:21890480
- 9. Volz EM, Koelle K, Bedford T. Viral phylodynamics. PLoS Comput Biol. 2013;9(3):e1002947. https://doi.org/10.1371/journal.pcbi.1002947 pmid:23555203
- 10. Keeling MJ, Grenfell BT. Individual-based perspectives on R(0). J Theor Biol. 2000;203(1):51–61. pmid:10677276
- 11. Lloyd-Smith JO, Schreiber SJ, Kopp PE, Getz WM. Superspreading and the effect of individual variation on disease emergence. Nature. 2005;438(7066):355–9. pmid:16292310
- 12. Lau MSY, Grenfell B, Thomas M, Bryan M, Nelson K, Lopman B. Characterizing superspreading events and age-specific infectiousness of SARS-CoV-2 transmission in Georgia, USA. Proc Natl Acad Sci U S A. 2020;117(36):22430–5. pmid:32820074
- 13. Romero-Severson E, Skar H, Bulla I, Albert J, Leitner T. Timing and order of transmission events is not directly reflected in a pathogen phylogeny. Mol Biol Evol. 2014;31(9):2472–82. pmid:24874208
- 14. Klinkenberg D, Colijn C, Didelot X. Methods for outbreaks using genomic data. Handbook of infectious disease data analysis. Chapman and Hall/CRC. 2019. p. 245–62.
- 15. Didelot X, Gardy J, Colijn C. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol Biol Evol. 2014;31(7):1869–79. https://doi.org/10.1093/molbev/msu121 pmid:24714079
- 16. Gavryushkina A, Welch D, Stadler T, Drummond AJ. Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Comput Biol. 2014;10(12):e1003919. https://doi.org/10.1371/journal.pcbi.1003919 pmid:25474353
- 17. Hall M, Woolhouse M, Rambaut A. Epidemic reconstruction in a phylogenetics framework: transmission trees as partitions of the node set. PLoS Comput Biol. 2015;11(12):e1004613. pmid:26717515
- 18. Wymant C, Hall M, Ratmann O, Bonsall D, Golubchik T, de Cesare M, et al. PHYLOSCANNER: inferring transmission from within- and between-host pathogen genetic diversity. Mol Biol Evol. 2018;35(3):719–33. pmid:29186559
- 19. Sashittal P, El-Kebir M. Sampling and summarizing transmission trees with multi-strain infections. Bioinformatics. 2020;36(Suppl_1):i362–70. pmid:32657399
- 20. Colijn C, Hall M, Bouckaert R. Taking a BREATH (Bayesian Re-construction and Evolutionary Analysis of Transmission Histories) to simultaneously infer phylogenetic and transmission trees for partially sampled outbreaks. bioRxiv. 2024:2024-07.
- 21. Firestone SM, Hayama Y, Bradhurst R, Yamamoto T, Tsutsui T, Stevenson MA. Reconstructing foot-and-mouth disease outbreaks: a methods comparison of transmission network models. Sci Rep. 2019;9(1):4809. pmid:30886211
- 22. Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C, Ferguson N. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput Biol. 2014;10(1):e1003457. https://doi.org/10.1371/journal.pcbi.1003457 pmid:24465202
- 23. Campbell F, Didelot X, Fitzjohn R, Ferguson N, Cori A, Jombart T. outbreaker2: a modular platform for outbreak reconstruction. BMC Bioinformatics. 2018;19(Suppl 11):363. pmid:30343663
- 24. Lau MSY, Marion G, Streftaris G, Gibson G. A systematic Bayesian integration of epidemiological and genetic data. PLoS Comput Biol. 2015;11(11):e1004633. pmid:26599399
- 25. Gibson G. Estimating parameters in stochastic compartmental models using Markov chain methods. Math Med Biol. 1998;15(1):19–40.
- 26. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. J Am Statist Assoc. 1987;82(398):528–40.
- 27. Morelli MJ, Thébaud G, Chadœuf J, King DP, Haydon DT, Soubeyrand S. A Bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data. PLoS Comput Biol. 2012;8(11):e1002768. https://doi.org/10.1371/journal.pcbi.1002768 pmid:23166481
- 28. Streftaris G, Gibson GJ. Bayesian inference for stochastic epidemics in closed populations. Statist Model. 2004;4(1):63–75.
- 29. Streftaris G, Gibson GJ. Non-exponential tolerance to infection in epidemic systems--modeling, inference, and assessment. Biostatistics. 2012;13(4):580–93. pmid:22522236
- 30. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980;16(2):111–20. pmid:7463489
- 31. Rodríguez F, Oliver JL, Marín A, Medina JR. The general stochastic model of nucleotide substitution. J Theor Biol. 1990;142(4):485–501. https://doi.org/10.1016/s0022-5193(05)80104-3 pmid:2338834
- 32. Li SZ, Jain A. Hamming distance. Encyclopedia of biometrics. Boston, MA: Springer US. 2009. p. 668.
- 33. Johnson N. A short introduction to disease emergence. The role of animals in emerging viral diseases. Elsevier. 2014. p. 1–19. https://doi.org/10.1016/b978-0-12-405191-1.00001-6
- 34. Cann AJ. Genomes. In: Cann AJ, editor. Principles of molecular virology. 6th ed. Boston: Academic Press. 2016. p. 59–104.
- 35. Ypma RJF, van Ballegooijen WM, Wallinga J. Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics. 2013;195(3):1055–62. https://doi.org/10.1534/genetics.113.154856 pmid:24037268
- 36. De Maio N, Wu C-H, Wilson DJ. SCOTTI: efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput Biol. 2016;12(9):e1005130. pmid:27681228
- 37. Klinkenberg D, Backer JA, Didelot X, Colijn C, Wallinga J. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput Biol. 2017;13(5):e1005495. https://doi.org/10.1371/journal.pcbi.1005495 pmid:28545083