How does date-rounding affect phylodynamic inference for public health?

Leo A. Featherstone; Danielle J. Ingle; Wytamma Wirth; Sebastian Duchene

doi:10.1371/journal.pcbi.1012900

Abstract

Phylodynamic analyses infer epidemiological parameters from pathogen genome sequences for enhanced genomic surveillance in public health. Pathogen genome sequences and their associated sampling dates are the essential data in every analysis. However, sampling dates are usually associated with hospitalisation or testing and can sometimes be used to identify individual patients, posing a threat to patient confidentiality. To lower this risk, sampling dates are often given with reduced date-resolution to the month or year, which can potentially bias inference. Here, we introduce a practical guideline on when date-rounding biases the inference of epidemiologically important parameters across a diverse range of empirical and simulated datasets. We show that the direction of bias varies for different parameters, datasets, and tree priors, while compounding with lower date-resolution and higher substitution rates. We also find that bias decreases for datasets with longer sampling intervals, implying that our guideline is most applicable to emerging datasets. We conclude by discussing future solutions that prioritise patient confidentiality and propose a method for safer sharing of sampling dates that translates them uniformly by a random number.

Author summary

Phylodynamic analyses estimate epidemiological parameters using pathogen genome sequences and offer insight for public health. The essential data in every analysis are genome sequences, which allow measurement of evolutionary divergence, and their associated sampling times, which allow evolutionary divergence to be modelled as a rate over time. However, the sampling times of pathogen genome sequences are frequently associated with hospitalisation and can be used to identify particular patients. As a result, sampling times are often shared between public health labs and phylodynamics practitioners with reduced date resolution (such as to the month or year) to protect patient identity. Using real-world data and a matching simulation study, we emulate the effects of date rounding on phylodynamic inference to characterise how reduced date resolution introduces error into inference. We find that error arises where sampling dates are given at a resolution coarser than the average amount of time it takes for a pathogen to accrue one substitution. We find that this relationship is useful for predicting biased estimation for datasets reflecting short-term sampling. We conclude by discussing how accurate sampling dates can be shared in a way that protects patient identity while preserving accuracy.

Citation: Featherstone LA, Ingle DJ, Wirth W, Duchene S (2025) How does date-rounding affect phylodynamic inference for public health? PLoS Comput Biol 21(4): e1012900. https://doi.org/10.1371/journal.pcbi.1012900

Editor: Joel O. Wertheim, University of California San Diego, UNITED STATES OF AMERICA

Received: September 16, 2024; Accepted: February 21, 2025; Published: April 11, 2025

Copyright: © 2025 Featherstone et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All of the code and data required to reproduce the simulation study, empirical analyses, and figures/tables are available at https://github.com/LeoFeatherstone/pdp

Funding: This work received funding from: the Inception program (Investissement d’Avenir grant ANR-16-CONV-0005 awarded to SD), the Australian National Health and Medical Research Council (2017284 awarded to SD), and the Australian Research Council (FT220100629 awarded to SD). SD received a salary from the Inception program and the Australian Research Council. LAF received a salary from the National Health and Medical Research Council and Australian Research Council (DP230102424). DJI received a salary from the National Health and Medical Research Council (GNT1195210). WW was partially supported by a Chan Zuckerberg Initiative Essential Open Source Software for Science grant EOSS6-0000000637. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Phylodynamics is commonly used to estimate the parameters of viral spread, with increasing application to bacteria. It allows estimation of important epidemiological parameters including rates of transmission, the age of outbreaks, rates of spatial advance, and the prevalence of variants of concern [1–4]. It is applicable across the scales of transmission, from the pandemic and epidemic scales, such as for SARS-CoV-2 and Ebola virus [5,6], to long-term bacterial transmission such as in Salmonella enterica and Klebsiella pneumoniae. Phylodynamic analyses are most useful where temporal and spatial records of transmission are sparse, using genomic information to help fill in the gaps.

The basis of all phylodynamic inference is that epidemiological spread leaves a trace in the form of substitutions in pathogen genomes that can be used to reconstruct transmission histories. Pathogen populations meeting this assumption are said to be ‘measurably evolving populations’ [7,8]. In accordance, phylodynamics uses a combination of genome sequences and associated sampling dates to leverage measurable evolution and infer temporally explicit parameters of transmission and pathogen demography.

Ideal phylodynamic datasets should include precise sampling dates alongside genome sequences [9], but sampling dates necessarily carry over sensitive information about times of hospitalisation, testing, or treatment that can be used to identify individual patients. This can pose an unacceptable risk for patient confidentiality. In some cases, sampling dates or dates of admission are even available for purchase or have allowed identification for a majority of patients in a given record [10]. In acknowledgement of this risk, [11] suggest that Expert Determination govern whether sampling dates be released alongside genome sequences, and the resolution to which they are disclosed (day, month, year). Essentially, this approach involves an expert opinion on whether information is safe to release on a case-by-case basis.

From a phylodynamic point of view, sampling dates with reduced resolution are usable. Uncertainty in sampling dates can be accommodated in Bayesian inference [12], but such an approach is only effective when samples with uncertain dates comprise a small proportion of the total data [13].

The most common technique for incorporating data with a majority of uncertain sampling dates is to assume that sampling occurred at the middle of the uncertainty range, such as all samples from 2020 being assigned 15 June 2020. Other approaches would include sampling a random day within 2020 using a probability distribution over the duration of 2020 for each sample. Both approaches introduce a degree of noise, which may cause bias because sampling dates often drive phylodynamic inference [14–16]. Understanding this bias has practical significance, as there are many examples of phylodynamic analyses conducted with reduced date resolution for a diverse array of pathogens. These include viral pathogens such as Rabies virus, Enterovirus, SARS-CoV-2, Dengue virus [17–20], and bacterial pathogens, such as Klebsiella pneumoniae, Streptococcus pneumoniae, and Mycobacterium tuberculosis [21–23].
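The two imputation strategies described above can be sketched as follows. This is a minimal illustration rather than the procedure used in any cited study: the function names are hypothetical, and the uniform distribution over the year is an assumed choice for the "probability distribution over the duration of 2020".

```python
import random
from datetime import date, timedelta


def midpoint_of_year(year: int) -> date:
    """Fixed mid-year date, matching the 15 June convention in the text."""
    return date(year, 6, 15)


def random_day_in_year(year: int, rng: random.Random) -> date:
    """Draw one day uniformly from 1 January to 31 December of `year`."""
    start = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - start).days  # 365 or 366
    return start + timedelta(days=rng.randrange(n_days))


rng = random.Random(1)  # seeded only so the sketch is reproducible
fixed = [midpoint_of_year(2020) for _ in range(3)]
drawn = [random_day_in_year(2020, rng) for _ in range(3)]
```

Under the midpoint strategy every sample from a year collapses onto one date, whereas random draws keep samples distinct but place them at dates unrelated to the true ones. Both lose the original ordering of samples within the year.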

Precision in sampling dates is also relevant to the design and curation of pathogen sequence databases because sampling dates are often considered as metadata, and thus recorded inconsistently throughout repositories [24]. For example, as of early September 2024, there were roughly 19.9M SARS-CoV-2 genome sequences available on GISAID with roughly 2.4% (382K) of these having incomplete date information, where sampling dates are absent or only given to the month or year. In other words, roughly 1 in 50 sequences lacked clear date resolution, reflecting global inconsistency in SARS-CoV-2 sampling time records.

In recognition of this issue, we characterised the conditions under which biases arise from reduced date resolution in phylodynamic inference. We analysed four empirical datasets of SARS-CoV-2, H1N1 Influenza, M. tuberculosis, and Staphylococcus aureus, and conducted a simulation study with parameters corresponding to each empirical dataset. We also included a supplementary H3N2 influenza dataset. These pathogens are key examples of candidates for genome surveillance, with SARS-CoV-2, H1N1, and H3N2 having caused pandemics and S. aureus and M. tuberculosis being global priority pathogens [25]. These data also have diverse infectious periods and molecular evolutionary rates, thus providing a broad representation of phylodynamics’ applicability to pathogens presenting human-health threats. For each empirical and simulated dataset, we studied the bias in estimated epidemiological parameters across treatments with sampling dates rounded to the day, month, or year. For example, 2021-10-11 would be specified as 2021-10-15 when rounding to the month and 2021-06-15 when the month and day are not provided.

We focused on inference of the reproductive number ($R_0$ or $R_e$ for the basic and effective reproductive number, respectively), defined as the average number of secondary infections stemming from an individual case (reviewed by [1,3,26]), the time to the most recent common ancestor (tMRCA), and the substitution rate (substitutions per site per year) in each dataset. Together, these parameters span much of the insight that phylodynamics offers through inferring when an outbreak started and how fast it proceeded. The evolutionary rate is also the central parameter relating evolutionary time to epidemiological time, so any resulting bias in this parameter is expected to have a pervasive effect throughout each phylodynamic model.

Fig 1. Graphical representation of the hypothesis.

The average time to accrue one substitution based on a fixed genome size and evolutionary rate, against the temporal resolution lost by date-rounding. We hypothesised, and then showed, that when analyses for a given pathogen round dates to an extent nearing or crossing the diagonal from left to right, biases are induced in the reproductive number, tMRCA, and substitution rate. Substitution rates are taken from each source for the empirical data. We do not report the numerical axis as this figure is designed to illustrate a concept rather than serve as a reference, in the same spirit as its inspiration in Fig 2 of [8].

https://doi.org/10.1371/journal.pcbi.1012900.g001

We hypothesised that reduced date resolution causes bias that compounds where the uncertainty in dates exceeds the average time for a substitution to arise in a given pathogen. That is, the point from which substitution events are conflated in time. We visualise the relationship between date resolution and average substitution time in Fig 1. For example, H1N1 influenza virus accumulates substitutions at a rate of about $4 \times 10^{-3}$ subs/site/year [27]. With a genome length of 13,158 bp, we then expect roughly 53 substitutions per year across the genome, or about one per week. Therefore, rounding dates to the month or year conflates molecular evolution in time and biases inference. Based on this, we expected the SARS-CoV-2 and H1N1 datasets to exhibit bias from month resolution onwards, the S. aureus dataset to exhibit bias at year resolution, and the M. tuberculosis dataset to not display bias up to and including year resolution (see Table 1 for average substitution times). Throughout this manuscript, we refer to bias where we recover error in estimated parameters with a consistent direction among replicates in our simulation study (consistently over- or underestimating). We do not chiefly consider the variance in posterior distributions of estimated parameters, but discuss this point in the results.
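The threshold reasoning above reduces to a short calculation, sketched here for the H1N1 example. The rate of 4 × 10⁻³ subs/site/year is the value consistent with one substitution per week over a 13,158 bp genome. The function names and the nominal widths assigned to each date resolution are assumptions made for this illustration, not values taken from the study.

```python
DAYS_PER_YEAR = 365.25

# Nominal width of each date resolution in days (assumed values for this sketch).
RESOLUTION_DAYS = {"day": 1.0, "month": DAYS_PER_YEAR / 12, "year": DAYS_PER_YEAR}


def days_per_substitution(rate: float, genome_length: int) -> float:
    """Average days for one substitution to accrue across the genome."""
    subs_per_year = rate * genome_length
    return DAYS_PER_YEAR / subs_per_year


def resolutions_expected_to_bias(rate: float, genome_length: int) -> list[str]:
    """Resolutions at least as coarse as the average substitution time."""
    threshold = days_per_substitution(rate, genome_length)
    return [name for name, width in RESOLUTION_DAYS.items() if width >= threshold]


h1n1_days = days_per_substitution(4e-3, 13_158)
print(round(h1n1_days, 1))                          # 6.9
print(resolutions_expected_to_bias(4e-3, 13_158))   # ['month', 'year']
```

The comparison is a rough guide rather than a sharp cut-off: it flags resolutions where several substitutions are expected to fall within a single rounding window.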

Table 1. Substitution rates and genome length for sequence simulation.

https://doi.org/10.1371/journal.pone.0313772.t001

Our results across the simulation study and analyses of empirical data support using the average substitution time as a rough threshold for when date-rounding causes bias. We also consider factors that modulate the extent of bias, in particular noting that it declines with longer sampling intervals, and varies in direction between datasets and tree prior. We finish by discussing future solutions that prioritise both patient confidentiality and accurate data sharing for routine phylodynamic analyses in public health.
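The date-translation idea raised in the abstract, shifting sampling dates uniformly by a random number, might look like the sketch below. It assumes that a single random offset is applied to every sample in a dataset. The function name, the bound on the shift, and the example dates are all hypothetical.

```python
import random
from datetime import date, timedelta


def translate_dates(dates: list[date], max_shift_days: int, rng: random.Random) -> list[date]:
    """Shift every date by the same random offset, preserving all intervals."""
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in dates]


# Hypothetical sampling dates, for illustration only.
original = [date(2020, 7, 28), date(2020, 8, 14), date(2020, 9, 12)]
shifted = translate_dates(original, max_shift_days=180, rng=random.Random())

# Intervals between samples are unchanged; only the absolute dates move.
gaps_before = [(b - a).days for a, b in zip(original, original[1:])]
gaps_after = [(b - a).days for a, b in zip(shifted, shifted[1:])]
```

Because every sample moves by the same amount, the relative timing that informs rate estimation is retained, while the absolute dates no longer match patient records.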

Methods

Overview

Our study is based on four empirical datasets including two viruses, H1N1 influenza and SARS-CoV-2, and two bacterial species, Staphylococcus aureus and Mycobacterium tuberculosis. We also conducted a simulation study with parameters tailored to each dataset. These data were chosen to span the usual parameter space for substitution rate and sampling duration in phylodynamics for epidemiology (roughly (subs/site/yr) for substitution rate and months-to-decades for duration of sampling). We also included a supplementary H3N2 influenza empirical dataset to illustrate the effects of date rounding on longer-term viral datasets.

To assess the effects of date-rounding, we conducted phylodynamic analyses for both the empirical and simulated datasets with sampling dates rounded to the day, month, or year. For example, two samples from 2000-05-29 and 2000-05-02 would both become 2000-05-15 if rounded to the month. We then measured the resulting bias in epidemiologically- or phylodynamically-important parameters: the reproductive number ($R_0$ or $R_e$), substitution rate (subs/site/year), and the tMRCA. The tMRCA gives a measure of the age of the pathogen population driving the outbreak and is often interpreted as the age of the outbreak. We also consider the tMRCA to facilitate comparison, because there is variability in which phylodynamic models include the length of the root branch in the age of the outbreak [28].
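The rounding treatment described above can be expressed as a small function. This is a sketch of the convention stated in the text (day 15 for month resolution, 15 June for year resolution). The function name is hypothetical and the code is not taken from the study's repository.

```python
from datetime import date


def round_date(d: date, resolution: str) -> date:
    """Replace the unreported part of a date with its nominal midpoint."""
    if resolution == "day":
        return d
    if resolution == "month":
        return d.replace(day=15)
    if resolution == "year":
        return d.replace(month=6, day=15)
    raise ValueError(f"unknown resolution: {resolution!r}")


samples = [date(2000, 5, 29), date(2000, 5, 2)]
print([round_date(d, "month").isoformat() for d in samples])  # ['2000-05-15', '2000-05-15']
print(round_date(date(2021, 10, 11), "year").isoformat())     # 2021-06-15
```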

The viral datasets consist of samples from the 2009 H1N1 pandemic (n = 161) from [27], and a cluster of early SARS-CoV-2 cases from Victoria, Australia in 2020 (n = 112) [29]. The bacterial datasets consist of S. aureus, with 104 samples from New York sampled over ≈ 2 years [30–32], and 30 M. tuberculosis samples from an ≈ 25 year outbreak studied by [33]. These data were chosen because they encompass a diversity of epidemiological dynamics, timescales, and variable substitution rates.

Simulation study

We simulated outbreaks as birth-death sampling processes using the ReMaster package in BEAST v2.7.6 [34,35]. Simulations employed four parameter settings corresponding to each empirical dataset (Table 2), with 100 replicates of each. All parameter sets include a proportion of sequenced cases (p), outbreak duration (T), a ‘becoming un-infectious’ rate (δ), and transmission rates via reproductive numbers. We matched the values of each parameter to those in the originating literature (Table 2). We fixed the sequencing proportion in the H1N1 simulations by dividing the sample size (n = 161) by the cumulative number of North American cases over the empirical data’s sampling interval, giving a proportion on the order of 1% [36].

For simulations corresponding to the viral datasets, transmission was modelled via $R_0$, the average number of secondary infections (assuming a fully susceptible population). For those corresponding to the bacterial datasets, we allowed the effective reproductive number to vary over two intervals ($R_{e_1}$ and $R_{e_2}$, respectively). For the S. aureus setting, the change time for $R_e$ was set at t = 22, with the sequencing proportion (p) also set to zero before this time to replicate the sampling effort in the empirical dataset. For the M. tuberculosis setting, the change time was fixed at halfway through the simulations (t = 12.5) with one fixed sequencing proportion throughout.

Table 2. Parameter sets for the simulation study corresponding to each empirical dataset.

δ is the ‘becoming un-infectious’ rate, which is the reciprocal of the duration of infection, in units of years⁻¹. $R_0$ is the basic reproductive number, describing the average number of secondary infections arising at the beginning of an outbreak where the susceptible population is greatest. $R_{e_1}$ and $R_{e_2}$ refer to the effective reproductive number over two successive intervals of an outbreak as the susceptible population varies. p is the proportion of sequenced cases. T is the duration of the outbreak.

https://doi.org/10.1371/journal.pone.0313772.t002
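Because δ is defined as the reciprocal of the duration of infection in years⁻¹, converting between the two is a one-line calculation. The sketch below applies it to durations and rates quoted for the empirical datasets in this article. The function names and the 365.25-day year are assumptions of the illustration.

```python
DAYS_PER_YEAR = 365.25


def delta_from_duration_days(duration_days: float) -> float:
    """Becoming un-infectious rate (per year) from infection duration in days."""
    return DAYS_PER_YEAR / duration_days


def duration_days_from_delta(delta: float) -> float:
    """Infection duration in days from a becoming un-infectious rate per year."""
    return DAYS_PER_YEAR / delta


print(round(delta_from_duration_days(4), 1))    # four-day infection (H1N1): 91.3
print(round(duration_days_from_delta(71), 1))   # delta = 71 (H3N2): 5.1 days
print(round(duration_days_from_delta(0.93)))    # delta = 0.93 (S. aureus): 393 days
```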

Simulations generated a total of 400 outbreaks, which we then used to simulate sequence data under a Jukes-Cantor model using Seq-Gen v1.3.4 [37] with fixed substitution rates (Table 1). We chose a simple substitution model to reduce parameter space and because substitution model mismatch has been widely explored elsewhere (e.g. [38]).

We then analysed each of the 400 simulated datasets under three date resolutions (day, month, and year), and two tree priors: the birth-death [28] and coalescent with exponential growth, referred to hereon as the ‘coalescent exponential’ [39]. This yielded 1800 analyses in total: 1200 for the birth-death (400 datasets × 3 resolutions) and 600 for the coalescent exponential, a count consistent with fitting it only to the 200 datasets simulated under the viral settings, as for the empirical data. We used identical model specifications and prior distributions as for the corresponding empirical datasets. We ran each MCMC chain for steps, sampling every step and discarding the first 50% as burnin. We then discarded all analyses that did not have effective sample sizes (ESS) of at least 200 (ESS ≥ 200) for every parameter, leaving a total of 1670 replicates incorporated in our results.
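The analysis counts and the ESS filter described above can be checked with a short sketch. Restricting the coalescent exponential to the two viral settings is an assumption made here because it reproduces the reported totals. The parameter names and ESS values in the example call are hypothetical.

```python
RESOLUTIONS = ("day", "month", "year")
N_REPLICATES = 100

# Tree priors per simulation setting; restricting the coalescent exponential
# to the viral settings is an assumption that reproduces the reported counts.
TREE_PRIORS = {
    "H1N1": ("birth-death", "coalescent exponential"),
    "SARS-CoV-2": ("birth-death", "coalescent exponential"),
    "S. aureus": ("birth-death",),
    "M. tuberculosis": ("birth-death",),
}

counts = {"birth-death": 0, "coalescent exponential": 0}
for priors in TREE_PRIORS.values():
    for prior in priors:
        counts[prior] += N_REPLICATES * len(RESOLUTIONS)


def passes_ess(ess_by_parameter: dict[str, float], threshold: float = 200.0) -> bool:
    """Keep an analysis only if every parameter reaches the ESS threshold."""
    return all(ess >= threshold for ess in ess_by_parameter.values())


print(counts, sum(counts.values()))
print(passes_ess({"R0": 512.3, "clockRate": 187.9}))  # hypothetical ESS values
```

With 1670 of 1800 analyses retained, roughly 93% passed the filter.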

Empirical data

We conducted Bayesian phylodynamic analyses using a birth-death skyline tree prior in BEAST v2.7.6 for all datasets [35]. We also fit a coalescent exponential tree prior to the viral datasets. We did not fit the coalescent exponential to the bacterial datasets because they capture transmission beyond the exponential phase, which would therefore result in model misspecification. We sampled from the posterior distribution using Markov chain Monte Carlo (MCMC), with steps ( for SARS-CoV-2 data), sampling every steps, and discarding the initial 10% as burnin. We assessed sufficient sampling from the stationary distribution by ensuring ESS ≥ 200 for all parameters and likelihoods.

H1N1.

The H1N1 data consist of 161 samples from North America during the 2009 H1N1 influenza virus pandemic, previously analysed by [27]. Samples originate from April to September 2009 and provide an example of a rapidly evolving pathogen sparsely sequenced during an emerging outbreak.

Under the birth-death model, we placed a Lognormal(μ = 0, σ = 1) prior on $R_0$, a β(1, 1) prior on p, and fixed the becoming-uninfectious rate (δ) to correspond to a four-day duration of infection. We also placed an improper U(0, ∞) prior on the age of the outbreak and a Gamma(shape = 2, rate = 400) prior on the substitution rate.

Under the coalescent exponential, we placed a Laplace(μ = 0, scale = 100) prior on the growth rate, which was later transformed to $R_0$ ($R_0 = rD + 1$, where r is the growth rate and D is the duration of infection). We also placed an improper prior ($1/x$) on the effective population size, which is the maximally uninformative Jeffreys prior for coalescent intervals [40]. We otherwise included the same priors as for the birth-death.
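The transformation from growth rate to reproductive number can be sketched as below, assuming the standard relation $R_0 = rD + 1$. This follows from the birth-death quantities: with transmission rate λ and becoming-uninfectious rate δ = 1/D, the growth rate is r = λ − δ and $R_0$ = λ/δ, so $R_0$ = r/δ + 1. The function name and the example growth rate are hypothetical.

```python
def r0_from_growth_rate(r: float, duration: float) -> float:
    """R0 = r * D + 1, with r per year and D in years."""
    return r * duration + 1.0


D = 4 / 365.25            # four-day duration of infection, in years
r = 45.0                  # hypothetical growth rate per year, for illustration only
print(round(r0_from_growth_rate(r, D), 2))
```

A growth rate of zero maps to $R_0$ = 1, the threshold between a growing and a declining outbreak, which is a quick sanity check on the transformation.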

SARS-CoV-2.

The SARS-CoV-2 data consist of 112 samples from a densely sequenced transmission cluster from Victoria, Australia, sampled from late July to mid-September 2020 [29]. These data are similar to the H1N1 dataset in presenting a quickly evolving viral pathogen, but differ in that a high proportion of cases were sequenced.

Under the birth-death, we placed a Lognormal(mean = 1, sd = 1.25) prior on $R_0$ and an Inv-Gamma(α = 5.807, β = 346.020) prior on the becoming-uninfectious rate (δ). The sampling proportion was fixed to p = 0.8 since the target was to sequence every known SARS-CoV-2 case in Victoria at this stage of the pandemic, with a roughly 20% sequencing failure rate. We also placed an Exp(mean = 0.019) prior on the origin, corresponding to a lag of up to one week between the index case and the first putative transmission event. In this case, the origin parameter corresponds to the length of the root branch. In the results we still report the age of the outbreak as the tMRCA for consistency with the other datasets. Lastly, we placed a Gamma(shape = 2, rate = 2000) prior on the substitution rate.
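To give a sense of scale for these priors, their means can be computed directly. The sketch assumes the usual parameterisations: Gamma by shape and rate, as stated in the text, and inverse-gamma by shape α and scale β, with all time units in years. The function names are hypothetical.

```python
DAYS_PER_YEAR = 365.25


def gamma_mean(shape: float, rate: float) -> float:
    """Mean of a Gamma distribution in the shape/rate parameterisation."""
    return shape / rate


def inverse_gamma_mean(alpha: float, beta: float) -> float:
    """Mean of an inverse-gamma distribution; defined only for alpha > 1."""
    if alpha <= 1:
        raise ValueError("mean is undefined for alpha <= 1")
    return beta / (alpha - 1)


clock_mean = gamma_mean(2, 2000)                 # subs/site/year
delta_mean = inverse_gamma_mean(5.807, 346.020)  # per year
origin_mean_days = 0.019 * DAYS_PER_YEAR         # Exp prior mean, in days

print(clock_mean, round(delta_mean, 1), round(origin_mean_days, 1))
```

Under these assumptions the substitution rate prior is centred on 10⁻³ subs/site/year, the prior mean of δ is about 72 per year (an infection lasting roughly five days), and the origin prior has a mean of about seven days, in line with the one-week lag described in the text.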

Under the coalescent exponential, we placed an improper prior ($1/x$) on the effective population size and a Laplace(μ = 0.01, scale = 0.5) prior on the growth rate. Other parameters were given the same priors as under the birth-death. Note that the coalescent exponential is not a natural choice of tree prior for the SARS-CoV-2 data because of its very high sequencing proportion [41]. We nevertheless include it for the SARS-CoV-2 data to provide a comparison to the coalescent exponential for the H1N1 data, as well as an example of how date-rounding may exacerbate error in conjunction with poorly fitting models. The model’s poor fit is reflected later in the results.

H3N2.

We included a supplementary H3N2 dataset to assess the effects of date rounding for a viral dataset with longer-term sampling. Using the multi-type birth-death model, we analysed 60 H3N2 influenza samples taken from 2000 to 2005 in Hong Kong and New Zealand, with demes corresponding to each location [42]. The data are a subset of those originally used in [43], and are also available from the structured birth-death ‘Taming the BEAST’ tutorial [44].

We placed a Lognormal(μ = 0, σ = 1) prior on the reproductive number for both demes and fixed the becoming un-infectious rate at δ = 71, corresponding to a roughly 5-day duration of infection. We also placed a Lognormal(μ = 0.001, σ = 1.25) prior on the substitution rate and sampling probability (p). The sampling probability was fixed to zero shortly before the oldest sample in all date treatments to reflect no sampling effort prior to this date.

Staphylococcus aureus.

The S. aureus dataset originates from [32], and we analysed a subset of the data later analysed in [30] and [31]. It consists of a single nucleotide polymorphism (SNP) alignment of 104 sequenced isolates sampled in New York from 2009 to 2011. Population growth is understood to have been driven by β-lactam antibiotic use beginning in the 1980s. These data therefore provide a comparison to the M. tuberculosis dataset in a briefer sampling span from an outbreak of similar duration.

To accommodate changing transmission dynamics, we included two intervals for $R_e$ with a Lognormal(μ = 0, σ = 1) prior on each. We also placed a β(1, 1) prior on the sampling proportion, which was otherwise fixed to 0 before the first sample to capture the lag in sampling. We also placed a U(0, 1000) prior on the origin, and fixed the becoming un-infectious rate at δ = 0.93, corresponding to a nearly year-long duration of infection following [31].

Mycobacterium tuberculosis.

The M. tuberculosis dataset consists of 36 sequenced isolates from a retrospectively recognised outbreak in California, USA, that originated in the Wat Tham Krabok refugee camp in Thailand. The data were originally analysed using the birth-death tree prior by [33]. We applied the same prior configurations as [33], with the exception of including two intervals for $R_e$ and fitting a strict molecular clock with a Gamma(shape = 0.001, rate = 1000.0) prior.