Abstract
The serial interval of an infectious disease is a key instrument for understanding transmission dynamics. Estimation of the serial interval distribution from illness onset data extracted from transmission pairs is challenging due to the presence of censoring, and state-of-the-art methods mostly rely on parametric models. We present a fully data-driven methodology to estimate the serial interval distribution based on interval-censored serial interval data. The proposed nonparametric estimator of the cumulative distribution function of the serial interval is based on the class of uniform mixtures. Closed-form solutions are available for point estimates of different serial interval features and the bootstrap is used to construct confidence intervals. Algorithms underlying our approach are simple, stable, and computationally inexpensive, making them easy to implement in the programming language most familiar to a potential user. The user-friendly nonparametric routine is included in the EpiDelays package for ease of implementation. Our method complements existing parametric approaches for serial interval estimation and makes it possible to analyze past, current, or future illness onset data streams following a set of best practices in epidemiological delay modeling.
Author summary
Epidemiological delay distributions play a key role in outbreak analyses and in modeling infectious diseases. The serial interval is the time from illness onset in a primary case to illness onset in a secondary case and ranks among the most important delay quantities as it can be used to infer transmission patterns in mathematical and statistical models. From a statistical perspective, estimation of the serial interval distribution is complicated by the fact that the exact timing of illness onset is usually unknown and the latter event is only known to have occurred between two time points; a phenomenon called interval censoring. We propose a new inferential method to estimate the serial interval distribution from interval-censored illness onset data without relying on a parametric model. The nonparametric methodology comes with a low degree of mathematical complexity and the underlying algorithms are simple, fast and stable. A user-friendly routine written in the R programming language is available in the EpiDelays package. The proposed data-driven method accounts for a set of best practices in epidemiological delay modeling and can be used to obtain point estimates and confidence intervals for often reported serial interval features.
Citation: Gressani O, Hens N (2025) Nonparametric serial interval estimation with uniform mixtures. PLoS Comput Biol 21(8): e1013338. https://doi.org/10.1371/journal.pcbi.1013338
Editor: Benjamin Peirce Holder, Grand Valley State University, UNITED STATES OF AMERICA
Received: November 16, 2024; Accepted: July 18, 2025; Published: August 4, 2025
Copyright: © 2025 Gressani, Hens. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Simulation results and real data applications underlying this article can be reproduced with the material provided on the GitHub repository (https://github.com/oswaldogressani/Serial_interval) based on the EpiDelays package version 0.0.1 (https://github.com/oswaldogressani/EpiDelays).
Funding: OG and NH were supported by the VERDI project (101045989) and the ESCAPE project (101095619), funded by the European Union. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them. OG and NH acknowledge the financial support of the Fondation Universitaire de Belgique (file nr. AS-0608). OG and NH were also supported by the BE-PIN project (contract nr. TD/231/BE-PIN) funded by BELSPO (Belgian Science Policy Office) as part of the POST-COVID programme. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
The serial interval (SI) is an epidemiological delay characterizing the duration between two well-defined events related to a disease. It represents the time between symptom onset in a primary case or infector and symptom onset in a secondary case or infectee [1]. This time delay can be negative, as nothing prevents the infectee from experiencing symptom onset earlier than the infector [2]. In the literature, this interval is also known as the clinical onset serial interval [3,4]. Epidemiological and biological factors introduce variation in the times between primary and secondary events [5], so that serial intervals can be represented by a time delay distribution [6]. A different, but closely related delay quantity is the generation interval, which is defined as the duration between infection events in an infector-infectee pair [7]. The timing of an infection event is typically less likely to be observed than the timing of a symptom event and it is common practice to approximate the distribution of generation times by the SI distribution [8,9]. Serving as a proxy for generation intervals, serial intervals can be used as an instrument to measure the time scale of disease transmission [10] and are therefore key in linking the epidemic growth rate with the reproduction number [11,12]. The crucial role played by the serial interval distribution in disease transmission models emphasizes the need for reliable, stable, and replicable statistical methodologies to estimate this quantity. Ideally, these methodologies should also follow best practices recently described in [5].
Different methods exist to estimate the distribution and features of the serial interval of an infectious disease based on data. When time intervals of illness onset between infectors and infectees are observed, the data are considered a random sample from the population. In that case, essential features of the serial interval are estimated either by directly computing summary statistics from empirical serial intervals (e.g. mean, median, standard deviation) or by fitting a parametric distribution to observed data [13,14]. Parametric methods are by far the most common and usually include the Lognormal, Weibull, Gamma or Gaussian distributions [15–20]. For instance, a systematic review and meta-analysis of serial interval estimates for COVID-19 [21] shows that a majority of studies rely on parametric models with a frequent use of Gamma and Gaussian distributions. Estimation of model parameters is typically carried out with the maximum likelihood principle or by using the Bayesian approach, and is often based on a relatively small sample size. To our knowledge, only a few attempts have been made to apply nonparametric methods to serial interval data analysis. For instance, [3] compute a nonparametric estimate of the cumulative distribution function of the serial interval of influenza based on the method of [22] to see whether different parametric models are in agreement with it, and [23] use the nonparametric bootstrap to compute confidence intervals for the clinical onset SI of SARS-CoV-2.
By definition, serial intervals involve transmission pairs. This means that a minimal requirement for SI estimation is to have data on symptom onset times for the infector and infectee. Such data can be extracted from contact tracing programmes, which make it possible to learn who infected whom and provide information on timings of symptoms in infector-infectee pairs [24,25]. Commonly, serial interval data are interval censored in that only lower and upper limits of illness onset timing are observed. This characteristic adds a layer of complexity to the estimation problem. If censoring concerns either the infector or infectee, data are said to be single interval-censored; and if censoring affects both actors in the transmission pair, data are called doubly interval-censored [26]. From a continuous time perspective, serial interval data are more often than not doubly interval-censored due to the time resolution of reporting. When the time resolution for reporting illness onset is a calendar day (as is often the case), then censoring is inherent to the calendar day, i.e. the precise timing of illness onset within the reported calendar day remains unknown. Therefore, even if exact calendar dates are observed, it is good practice to still consider the data as doubly interval-censored [5].
Despite the large number of studies conducted on the serial interval of different pathogens, most methods are difficult or impossible to reproduce in the sense that independent researchers are confronted with serious difficulties in applying existing procedures to new data [27]. The field of infectious disease modeling suffers from alarmingly low computational reproducibility rates [28], which hinders applicability and is at odds with pandemic preparedness objectives. This reproducibility conundrum has several causes. For instance, recent meta-epidemiological surveys found that very few publications share code or data [29,30]. Other potential causes are code incompleteness and complex dependencies among multiple scripts without clear guidelines regarding computation order [28]. The study of [31] highlights that finding evidence supporting frequently cited serial interval values in the literature is a challenging task.
Fortunately, more applicable tools and methods have recently emerged to estimate epidemiological delay distributions. Originally developed for estimation of incubation period distributions, the methodology of [26] is available in an R software package [32] and associated routines are embedded in the EpiEstim package of [33] to estimate the serial interval [34]. [31] reanalyze published serial interval data on different respiratory infections by using a common statistical method and provide R code and data sets for reproducibility. The epidist package [35] and the primarycensored package [36] are also operational for serial interval estimation and account for censoring and truncation. These tools rely on parametric methods imposing distributional assumptions on the serial interval distribution and leave no room for data-driven inference.
In an attempt to complement the above-mentioned parametric methods, we develop a nonparametric approach to estimate the serial interval distribution based on illness onset data. The proposed method is entirely data-driven and applicable to a wide range of serial interval data commonly analyzed in the literature. Its chief merits are its mathematical and computational simplicity. The proposed method also aligns with some of the best practices recommended by [5], namely: (1) adjusting for double interval censoring, (2) reporting guidelines for epidemiological delays, (3) accounting for negative serial intervals and (4) reproducibility guidelines. Since R is among the most popular programming languages used in the infectious disease modeling community [28,37], the code underlying our nonparametric methodology is written in the R language and available in the EpiDelays package (https://github.com/oswaldogressani/EpiDelays). Source code comes in a lightweight format and spans only a few lines. It can thus be easily translated into another programming language if needed (e.g. Python or C++).
Next, we present our nonparametric estimator and briefly discuss some of its theoretical properties. An entire section is dedicated to simulations in order to assess the performance of our data-driven approach. Applications to transmission pair data extracted from previous outbreaks for a diverse set of pathogens underline the wide, general, and straightforward applicability of our method. The article concludes with a discussion surrounding different aspects of the proposed nonparametric methodology for serial interval estimation.
2. Methods
2.1. The coarse structure of serial interval data
Datasets used to estimate serial interval features are usually obtained from line list information collected during epidemics [10,34,38]. The structure is such that a line in the data list conveys information about calendar dates for the infector and infectee [31,39]. Calendar dates are not well-suited for statistical analysis. Therefore, conversion from calendar time to analysis time is carried out through a mapping from the set of calendar dates to a set of real numbers, and more commonly to a set of integers. The precise calendar date of symptom appearance may be unknown and this uncertainty translates into a range of reported dates. In that case, serial interval data are referred to as coarse data following the terminology of [40] in the sense that the timing of symptom onset is only observed to lie within a time interval; a feature also known as interval censoring. Even when precise dates are reported, there is still uncertainty with respect to the exact timing of symptom onset within a day. As such, a calendar day can be coarsened to an interval of two consecutive calendar dates, where the reported day is the lower bound and the following day is the upper bound of the interval. This means that serial interval data are usually treated as doubly interval-censored [26], i.e. the data contain a range of symptom onset dates for each primary and secondary case. After conversion of calendar time to analysis time, denote by $\tau_i$ the illness onset time of the infector and by $t_i$ the illness onset time of the infectee in the $i$th transmission pair. These quantities are treated as interval-censored because the precise symptom onset time within a day is usually unknown. Let $\tau_{iL}$ and $\tau_{iR}$ denote the observed left and right bound, respectively, of the symptom onset time of the infector in the $i$th transmission pair and assume $\tau_{iL} < \tau_{iR}$. For the infectee, a similar notation is used and we assume $t_{iL} < t_{iR}$. The four time points $\tau_{iL}, \tau_{iR}, t_{iL}, t_{iR}$ can be used to compute the earliest possible SI time $s_{iL} = t_{iL} - \tau_{iR}$ and the latest possible SI time $s_{iR} = t_{iR} - \tau_{iL}$. The SI window width $w_i = s_{iR} - s_{iL}$ sheds light upon the degree of coarseness associated with the (unobserved) serial interval in the $i$th infector-infectee pair. A schematic representation of SI data and its underlying coarseness is shown in Fig 1.
(A) The timings of symptom onset in infector-infectee pairs are usually reported as calendar dates. (B) Conversion from calendar time to analysis time is done through a mapping from reported calendar dates to a set of numbers (usually integers). (C) To account for the uncertainty in the timing of symptom onset within a day when a single calendar day is reported by the infector or infectee, a one-day coarsening of the data is implemented by constructing an interval with endpoints corresponding to two numbers resulting from the mapping of two successive calendar days in analysis time. (D) Coarseness at the serial interval level is obtained by taking the difference between the right serial interval bound siR and the left serial interval bound siL.
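The conversion steps summarized in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not the EpiDelays routine: the function names, the choice of origin date, and the example dates are all hypothetical. It also assumes that only a single reported day is coarsened to a one-day interval, while a reported range of dates is kept as given.

```python
from datetime import date


def to_analysis_time(day, origin):
    """Map a calendar date to an integer number of days since `origin`."""
    return (day - origin).days


def onset_window(first_day, last_day, origin):
    """Return (left, right) bounds of an onset window in analysis time.

    Assumption: a single reported day (first_day == last_day) is coarsened
    to the interval [day, day + 1]; a reported range is kept as given.
    """
    left = to_analysis_time(first_day, origin)
    right = to_analysis_time(last_day, origin)
    if right == left:
        right = left + 1  # one-day coarsening
    return left, right


def si_window(infector, infectee, origin):
    """Return (s_L, s_R, width) for one transmission pair.

    `infector` and `infectee` are (first_day, last_day) tuples of dates.
    """
    inf_left, inf_right = onset_window(*infector, origin)
    sec_left, sec_right = onset_window(*infectee, origin)
    s_left = sec_left - inf_right    # earliest possible serial interval
    s_right = sec_right - inf_left   # latest possible serial interval
    return s_left, s_right, s_right - s_left


# Hypothetical pair: infector onset on 10 January, infectee onset on 13 January.
origin = date(2020, 1, 1)
pair = si_window((date(2020, 1, 10), date(2020, 1, 10)),
                 (date(2020, 1, 13), date(2020, 1, 13)), origin)
print(pair)  # (2, 4, 2)
```

With two single-day reports, the SI window has a width of two days, which is the minimal coarseness implied by double interval censoring at daily resolution.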
To address uncertainty about who infected whom in outbreak data, robust likelihood-based methods can be employed that explicitly account for missing or potentially incorrect transmission links (see e.g. [41]). These methods estimate the probability of transmission between individuals using contact information and the timing of symptom onset. An iterative approach is then used to reconstruct plausible transmission trees and to estimate key epidemiological delay distributions, while simultaneously identifying and mitigating the influence of unreliable data points. An alternative and simplified approach to mitigate the risk of misspecifying the infector is to concentrate on a subset of transmission pairs for which there is reasonable evidence about who the infector is [42], although this could introduce bias in the analysis.
2.2. A uniform mixture model
Let $\mathcal{S}$ be a real-valued random variable representing the serial interval of an infectious disease and denote by $F_{\mathcal{S}}(s) = P(\mathcal{S} \le s)$ the cumulative distribution function (cdf) of $\mathcal{S}$ with $s \in \mathbb{R}$. We adopt Laplace’s principle of insufficient reason [43,44] and assume that the interval-censored SI variable of the $i$th transmission pair is uniformly distributed over the censoring interval with endpoints $s_{iL}$ and $s_{iR}$, i.e. $\mathcal{S}_i \sim \mathcal{U}(s_{iL}, s_{iR})$. The resulting cdf associated with $\mathcal{S}_i$ is denoted by:

$$F_i(s) = \frac{s - s_{iL}}{s_{iR} - s_{iL}} \, \mathbb{I}(s_{iL} \le s \le s_{iR}) + \mathbb{I}(s > s_{iR}),$$

where $\mathbb{I}(\cdot)$ is the indicator function. The ordered pair $(s_{iL}, s_{iR})$ denotes the $i$th transmission pair SI window constructed from the observed data points $s_{iL}$ and $s_{iR}$. Also, let $\mathcal{D} = \{(s_{iL}, s_{iR})\}_{i=1}^{n}$ denote the set of ordered pairs representing the information set (or set of observables) constructed from serial interval data with $n$ transmission pairs. Following previous work on mixtures of uniform distributions (see e.g. [45,46]), we propose to estimate $F_{\mathcal{S}}$ by the $n$-component mixture of the uniform cdfs $F_1, \dots, F_n$ with weights $1/n$ for $i = 1, \dots, n$. The resulting data-driven estimate is:

$$\hat{F}_{\mathcal{S}}(s) = \frac{1}{n} \sum_{i=1}^{n} F_i(s). \tag{1}$$

The above estimate is a finite convex combination of continuous functions and is therefore itself a continuous function in $s$. Moreover, it is a non-decreasing function since it essentially accumulates probability mass over intervals when moving along the real line in the positive direction. It is also easy to verify that $\lim_{s \to -\infty} \hat{F}_{\mathcal{S}}(s) = 0$ and $\lim_{s \to +\infty} \hat{F}_{\mathcal{S}}(s) = 1$, so that $\hat{F}_{\mathcal{S}}$ is a bona fide cdf. Note also that $\hat{F}_{\mathcal{S}}$ is a piecewise-linear function with breakpoints or “bends” arising at observed data points. Piecewise-linear cumulative distribution functions are endowed with interesting properties that have for instance been studied in [47,48]. These properties will guide us in computing point estimates of different serial interval features.
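As a rough illustration of how light the estimator is, the uniform mixture cdf can be sketched in Python as follows. The function names and the two example windows are hypothetical and do not reproduce the EpiDelays interface; equal weights $1/n$ and windows with a left bound strictly below the right bound are assumed.

```python
def uniform_cdf(s, s_left, s_right):
    """Cdf of a uniform distribution on [s_left, s_right], with s_left < s_right."""
    if s <= s_left:
        return 0.0
    if s >= s_right:
        return 1.0
    return (s - s_left) / (s_right - s_left)


def mixture_cdf(s, windows):
    """Equal-weight mixture of uniform cdfs, one component per SI window."""
    return sum(uniform_cdf(s, a, b) for a, b in windows) / len(windows)


# Two hypothetical SI windows (s_L, s_R) in days.
windows = [(1, 3), (2, 5)]
print(mixture_cdf(2, windows))  # 0.25
```

Evaluating `mixture_cdf` on a grid of values traces the piecewise-linear curve, with bends at 1, 2, 3 and 5 for this toy example.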
In parametric approaches, it is customary to work with the estimated probability density function (pdf) of the serial interval distribution, while our methodology centers on the estimated cumulative distribution function $\hat{F}_{\mathcal{S}}$. This implies no loss of generality as the cdf gives a complete description of the underlying target distribution. For instance, our method can be used to compute an estimate of the basic reproduction number $R_0$, the average number of secondary cases generated by a primary case in a fully susceptible population [49]. The generation interval distribution provides a link between the exponential growth rate of an epidemic $r$ and the basic reproduction number via the Lotka-Euler equation [11], namely $1/R_0 = \int_{0}^{+\infty} e^{-rt} g(t) \, dt$, where $g(\cdot)$ is the pdf of the generation interval. Using the serial interval as a proxy for the generation interval, the latter equation becomes $1/R_0 = \int_{-\infty}^{+\infty} e^{-rs} f_{\mathcal{S}}(s) \, ds$, where $f_{\mathcal{S}}(\cdot)$ is the pdf of the serial interval. Relying on the Riemann-Stieltjes integral notation, the estimated basic reproduction number using our nonparametric method is $\hat{R}_0 = 1 / \int_{-\infty}^{+\infty} e^{-rs} \, d\hat{F}_{\mathcal{S}}(s)$, where the integral can be solved numerically. An alternative way to proceed in estimating $R_0$ without entirely leveraging our nonparametric cdf estimate is to work with a classic parametric distribution for the generation time and use a parameterization that aligns with our nonparametric estimate of the mean and variance of the serial interval.
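The Lotka-Euler link can be illustrated with a short Python sketch. It rests on an observation that is not stated in the text: for a uniform component on an interval $[a, b]$, the integral of $e^{-rs}$ against the component cdf equals $(e^{-ra} - e^{-rb}) / (r(b - a))$, so that for an equal-weight uniform mixture the integral reduces to an average of such terms and numerical integration can be avoided. Function names, the growth rate and the windows below are hypothetical.

```python
import math


def laplace_uniform(r, a, b):
    """E[exp(-r * S)] for S uniform on [a, b], with a < b."""
    if r == 0:
        return 1.0
    return (math.exp(-r * a) - math.exp(-r * b)) / (r * (b - a))


def r0_estimate(r, windows):
    """Plug-in R0 from the Lotka-Euler equation under an equal-weight
    uniform mixture over the SI windows (serial interval used as a proxy
    for the generation interval)."""
    m = sum(laplace_uniform(r, a, b) for a, b in windows) / len(windows)
    return 1.0 / m


# Hypothetical growth rate (per day) and SI windows (in days).
windows = [(1, 3), (2, 5), (0, 2)]
print(r0_estimate(0.1, windows))
```

A growth rate of zero returns exactly one, as expected from the Lotka-Euler equation, and positive growth rates combined with mostly positive serial intervals return values above one.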
2.3. Point estimation
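The uniform mixture model in (1) is mathematically appealing as it allows frequently reported point estimates of features of the SI distribution to be computed in closed form based on the information set $\mathcal{D}$. Using the Riemann-Stieltjes integral representation of the expected value, point estimates of the SI mean $\mu$ and standard deviation $\sigma$ are given by:

$$\hat{\mu} = \int_{-\infty}^{+\infty} s \, d\hat{F}_{\mathcal{S}}(s) = \frac{1}{n} \sum_{i=1}^{n} \frac{s_{iL} + s_{iR}}{2}, \qquad \hat{\sigma} = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{s_{iL}^2 + s_{iL} s_{iR} + s_{iR}^2}{3} - \hat{\mu}^2 \right)^{1/2}.$$

The estimated quantile function of the random variable $\mathcal{S}$ is $\hat{F}_{\mathcal{S}}^{-1}(p) = \inf\{s \in \mathbb{R} : \hat{F}_{\mathcal{S}}(s) \ge p\}$, where $\hat{q}_p = \hat{F}_{\mathcal{S}}^{-1}(p)$ is the $p$-quantile of $\hat{F}_{\mathcal{S}}$ for a given $p \in (0,1)$. Denote by $b_k$ and $b_{k+1}$ two neighboring breakpoints of $\hat{F}_{\mathcal{S}}$ satisfying $\hat{F}_{\mathcal{S}}(b_k) \le p \le \hat{F}_{\mathcal{S}}(b_{k+1})$. When $p$ is such that $\hat{F}_{\mathcal{S}}(b_k) = \hat{F}_{\mathcal{S}}(b_{k+1})$, then $\hat{F}_{\mathcal{S}}$ has a flat behavior between $b_k$ and $b_{k+1}$ and so $\hat{q}_p = b_k$ by definition of the quantile function. When $p$ is such that $\hat{F}_{\mathcal{S}}(b_k) < p < \hat{F}_{\mathcal{S}}(b_{k+1})$, the piecewise-linear property of $\hat{F}_{\mathcal{S}}$ can be used to compute the desired estimated quantile $\hat{q}_p$. In fact, by piecewise linearity, the slope of $\hat{F}_{\mathcal{S}}$ between $b_k$ and $\hat{q}_p$ is equal to the slope of $\hat{F}_{\mathcal{S}}$ between $b_k$ and $b_{k+1}$. This yields an equation that can be solved for the single unknown $\hat{q}_p$. Mathematically:

$$\frac{p - \hat{F}_{\mathcal{S}}(b_k)}{\hat{q}_p - b_k} = \frac{\hat{F}_{\mathcal{S}}(b_{k+1}) - \hat{F}_{\mathcal{S}}(b_k)}{b_{k+1} - b_k}. \tag{2}$$

Solving (2) for $\hat{q}_p$ yields:

$$\hat{q}_p = b_k + \frac{\left(p - \hat{F}_{\mathcal{S}}(b_k)\right)\left(b_{k+1} - b_k\right)}{\hat{F}_{\mathcal{S}}(b_{k+1}) - \hat{F}_{\mathcal{S}}(b_k)}.$$

If $p$ satisfies $p = \hat{F}_{\mathcal{S}}(b_k)$, then $\hat{q}_p = b_k$ and if $p = \hat{F}_{\mathcal{S}}(b_{k+1})$, then $\hat{q}_p = b_{k+1}$.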
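The closed-form point estimates of this section can be sketched in Python as follows. This is a minimal illustration under the assumption of an equal-weight uniform mixture: the mean is the average of the window midpoints, the second moment of a uniform component on $[a, b]$ is $(a^2 + ab + b^2)/3$, and quantiles follow from linear interpolation between neighboring breakpoints of the estimated cdf. All function names and the example windows are hypothetical and do not reproduce the EpiDelays interface.

```python
import math


def mixture_cdf(s, windows):
    """Equal-weight mixture of uniform cdfs over the SI windows."""
    total = 0.0
    for a, b in windows:
        total += min(max((s - a) / (b - a), 0.0), 1.0)
    return total / len(windows)


def si_mean(windows):
    """Mean of the mixture: average of the window midpoints."""
    return sum((a + b) / 2 for a, b in windows) / len(windows)


def si_sd(windows):
    """Standard deviation of the mixture from its first two moments."""
    second = sum((a * a + a * b + b * b) / 3 for a, b in windows) / len(windows)
    return math.sqrt(second - si_mean(windows) ** 2)


def si_quantile(p, windows):
    """p-quantile (0 < p < 1) by linear interpolation between breakpoints."""
    breakpoints = sorted({x for w in windows for x in w})
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        f_lo, f_hi = mixture_cdf(lo, windows), mixture_cdf(hi, windows)
        # First segment on which the cdf rises to (or through) level p.
        if f_lo < p <= f_hi:
            return lo + (p - f_lo) * (hi - lo) / (f_hi - f_lo)
    raise ValueError("p must lie strictly between 0 and 1")


windows = [(1, 3), (2, 5)]  # hypothetical SI windows in days
print(si_mean(windows))           # 2.75
print(si_quantile(0.5, windows))  # median of the mixture
```

Selecting the first segment with `f_lo < p <= f_hi` returns the left end of any flat stretch at level `p`, in line with the infimum definition of the quantile function.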
2.4. Quantification of uncertainty
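The generic notation $\theta$ is used to represent a given feature of the SI distribution, for instance $\theta = \mu$ if the mean is of interest or $\theta = \sigma$ if the focus is on the standard deviation of $\mathcal{S}$. The bootstrap method will be used to compute measures of accuracy associated with the estimate $\hat{\theta}$ [50,51]. Let $\mathcal{D}^*$ denote a bootstrap sample obtained by sampling randomly and with equal probability $n$ transmission pair SI windows with replacement from $\mathcal{D}$. With $n = 4$ transmission pairs, a possible realization is $\mathcal{D}^* = \{(s_{2L}, s_{2R}), (s_{4L}, s_{4R}), (s_{2L}, s_{2R}), (s_{1L}, s_{1R})\}$ and the bootstrap sample $\mathcal{D}^*$ is simply a set of $n$ ordered pairs. For the features of $\mathcal{S}$ presented in Sect 2.3, the bootstrap replication of $\hat{\theta}$ denoted by $\hat{\theta}^*$ can easily be computed based on $\mathcal{D}^*$. Generating $B$ bootstrap samples and computing their corresponding bootstrap replicate of $\hat{\theta}$ gives access to $\Theta^* = \{\hat{\theta}^*_1, \dots, \hat{\theta}^*_B\}$, which characterizes the bootstrap distribution of the statistic $\hat{\theta}$. The bootstrap estimate of the standard error of $\hat{\theta}$ can be used as a measure of accuracy of the estimate $\hat{\theta}$. It corresponds to the empirical standard deviation of the values in $\Theta^*$:

$$\widehat{\text{se}}_B = \left( \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2 \right)^{1/2}, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b.$$

A confidence interval for $\theta$ can be constructed from the empirical quantiles of the sample of bootstrap estimates in $\Theta^*$. Let $\hat{\theta}^*_{\alpha/2}$ and $\hat{\theta}^*_{1-\alpha/2}$ denote the $\alpha/2$ and $1-\alpha/2$ sample quantiles of the values in $\Theta^*$. Most software has readily available routines to compute these quantiles (e.g. the quantile function in R). The $100(1-\alpha)\%$ confidence interval for $\theta$ using the quantile method is denoted by $\text{CI}_{1-\alpha} = [\hat{\theta}^*_{\alpha/2}, \hat{\theta}^*_{1-\alpha/2}]$. Following [52], we recommend using at least $B = 2000$ bootstrap samples for confidence interval construction.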
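The bootstrap procedure described above can be sketched with the Python standard library alone. The sketch below resamples SI windows with replacement, recomputes a feature on each resample, and returns the bootstrap standard error together with a quantile-based confidence interval. Function names, the seed argument and the example windows are hypothetical. The sample quantile uses linear interpolation between order statistics, which is one of several possible conventions and is an assumption here.

```python
import random
import statistics


def si_mean(windows):
    """Mean of the equal-weight uniform mixture (average window midpoint)."""
    return sum((a + b) / 2 for a, b in windows) / len(windows)


def sample_quantile(values, p):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def bootstrap(windows, stat, B=2000, alpha=0.05, seed=None):
    """Return (standard error, (lower, upper)) for the feature `stat`."""
    rng = random.Random(seed)
    n = len(windows)
    replicates = []
    for _ in range(B):
        # Draw n SI windows with replacement and with equal probability.
        resample = [windows[rng.randrange(n)] for _ in range(n)]
        replicates.append(stat(resample))
    se = statistics.stdev(replicates)  # divisor B - 1
    ci = (sample_quantile(replicates, alpha / 2),
          sample_quantile(replicates, 1 - alpha / 2))
    return se, ci


windows = [(1, 3), (2, 5), (0, 2), (3, 6)]  # hypothetical SI windows
se, (lower, upper) = bootstrap(windows, si_mean, B=2000, seed=1)
print(se, lower, upper)
```

Any feature computed from a list of windows (standard deviation, quantiles) can be passed as `stat` in place of `si_mean`.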
3. Results
3.1. Simulations
3.1.1. Generating mechanism for artificial serial interval data.
To simulate artificial serial interval data, we assume that the target SI has a distribution with mean
and standard deviation
. At the transmission pair level, the interval-censoring mechanism is governed by a discrete random variable
with values cl = l for
and probability mass function
with
. Given a set of parameters
and
, a complete dataset for n transmission pairs is obtained by repeating the following four steps n times. 1. Draw
from a
distribution. 2. Sample
based on the chosen distribution. 3. Draw
from a uniform distribution U(0,1). 4. Compute the left bound
and right bound
of the SI window of a transmission pair as
and
, where
is the floor function returning the greatest integer less than or equal to its argument and
is the ceiling function returning the smallest integer greater than or equal to its argument. The distribution of
controls the degree of data coarseness, i.e. the width of the generated serial interval windows. This simple mechanism makes it possible to simulate the kind of serial interval data frequently encountered in the epidemiologic literature and properly takes into account the uncertainty regarding the timing of symptom onset within the day. Said differently, for the infector and infectee, symptom onset is only known to lie between two successive calendar days so that transmission pair data are doubly interval-censored. Mathematically this means that, under the common mapping of calendar dates to integers, infector coarseness (the width of the infector's onset window) and infectee coarseness (the width of the infectee's onset window) are both bounded below by one. This implies that SI coarseness, measured by the SI window width $s_{iR} - s_{iL}$, is bounded below by two. Fixing c1 = 1 in our data generating mechanism ensures that the SI window width
is at least equal to two days. Fig 2 illustrates two sets of simulated serial interval data with n = 15,
,
and censoring distribution p1 = 0.80, p2 = 0.15, p3 = 0.05.
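The four-step generating mechanism can be sketched in Python as shown below. Two points are assumptions rather than the authors' specification. First, the target distribution is taken to be Gaussian with hypothetical parameter values chosen only for illustration. Second, the rule used in step 4 (floor of $s - uc$ for the left bound, ceiling of $s + (1 - u)c$ for the right bound) is one rule that uses the floor and ceiling functions and produces window widths between 2 and 4 days under the censoring distribution p1 = 0.80, p2 = 0.15, p3 = 0.05; the exact rule in the original may differ.

```python
import math
import random


def simulate_windows(n, draw_si, probs, seed=None):
    """Simulate n interval-censored SI windows.

    `draw_si(rng)` returns one true serial interval; `probs` is the
    probability mass function of the censoring variable on 1, 2, ..., L.
    Returns a list of (s_left, s_right, true_si) tuples.
    """
    rng = random.Random(seed)
    levels = list(range(1, len(probs) + 1))  # c_l = l
    data = []
    for _ in range(n):
        s = draw_si(rng)                            # step 1: true SI
        c = rng.choices(levels, weights=probs)[0]   # step 2: censoring level
        u = rng.random()                            # step 3: uniform draw
        # Step 4 (ASSUMED rule, see lead-in): window of width c + 1 around s.
        s_left = math.floor(s - u * c)
        s_right = math.ceil(s + (1 - u) * c)
        data.append((s_left, s_right, s))
    return data


# Hypothetical Gaussian target (mean 3 days, sd 2 days), n = 15 pairs.
sim = simulate_windows(15, lambda rng: rng.gauss(3.0, 2.0),
                       (0.80, 0.15, 0.05), seed=42)
for s_left, s_right, s in sim[:3]:
    print(s_left, s_right, round(s, 2))
```

Under this assumed rule each window contains the true serial interval, and its width equals the sampled censoring level plus one day.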
3.1.2. First set of simulations.
The performance of our nonparametric method is first assessed by assuming two target SI distributions, namely a SI distribution inspired from the SARS-CoV-2 Omicron variant [20] and a SI distribution that imitates results obtained for smallpox
[31]. The distribution for the censoring mechanism is given by p1 = 0.80, p2 = 0.15, and p3 = 0.05, so that generated SI window widths vary between 2 and 4 days. For each target SI distribution, we simulate M = 1000 datasets with four different sample sizes
; covering frequently encountered numbers of transmission pairs in the literature [21,53]; yielding a total of $2 \times 4 = 8$ scenarios. The performance of our nonparametric approach is assessed on the following often reported features of the SI distribution: mean $\mu$, standard deviation $\sigma$ and quantiles $q_{0.05}$, $q_{0.25}$, $q_{0.50}$, $q_{0.75}$, $q_{0.95}$. We use bias, empirical standard error (ESE), root mean squared error (RMSE), coverage probability of
and
confidence intervals and median confidence interval width as performance criteria (formulas of these criteria are provided in S1 Text). Confidence intervals are constructed based on B = 2000 bootstrap samples.
Results for Scenarios 1-4 with underlying SARS-CoV-2 Omicron-like target SI distribution are shown in Table 1. Overall, our nonparametric method based on uniform mixtures exhibits fairly good performance with relatively low bias. The coverage of confidence intervals for all the chosen SI features is satisfactorily close to the nominal level starting from n = 50. Under smaller sample sizes, confidence intervals for the selected SI features tend to undercover, yet results for the mean and median remain reasonable given the underlying SI coarseness of at least two days and the small number of transmission pairs. The ESE, RMSE and width of confidence intervals tend to decrease as the sample size increases. It is also worth mentioning that estimation of remote quantiles, i.e. q0.05 and q0.95, is more challenging and the bias for these features is usually higher. Results for Scenarios 5-8 with a smallpox-like target SI distribution are given in Table 2 and the interpretation is the same as for Scenarios 1-4, with an overall good performance of our data-driven approach for all the considered SI features. Further simulations with higher average coarseness (Scenarios S1-S3) and scenarios with a Gamma target SI distribution inspired from measles (Scenarios S4-S7) are provided in S1 Text.
3.1.3. The impact of coarseness.
To illustrate the negative impact of coarseness on estimates of certain SI features, we consider a target serial interval distribution inspired from influenza A [31] and run four simulation scenarios (Scenarios 9-12) with censoring distribution p1 = 0.80, p2 = 0.15, p3 = 0.05 and sample size
. Results shown in Table 3 reveal how estimation performance is impacted by working with data having a coarseness degree of at least two days (i.e. doubly interval-censored SI data). Except for the mean and median, estimates of the chosen SI features tend to suffer from a larger bias as compared to the previous scenarios (Scenarios 1-8). This is because the standard deviation of the assumed influenza A serial interval target distribution
is smaller than its counterpart in the SARS-CoV-2 Omicron (
) and smallpox (
) settings. Such a small standard deviation coupled with a degree of coarseness of at least two days for the serial interval windows blurs the information conveyed by the variation of the true (and unobserved) serial interval realizations. Said differently, the degree of coarseness “dominates” or hides the rather small variations of the true (unobserved) SI values around the mean
. The price to pay for such a degree of coarseness is a larger bias and lower coverage, especially for estimates of
and q0.05, q0.25, q0.75, q0.95 as shown in Table 3.
To further stress the role played by the degree of coarseness in SI data, we consider the same influenza A setting but with a hypothetical coarseness degree that is close to zero. In particular, after generating , we assume that the left bound of the serial interval window is
and that the right bound is
with
so that the degree of coarseness of the SI window is equal to
and we refer to this censoring scheme as
-coarseness. Simulations under
-coarseness for the influenza A setting are implemented for a sample size
, yielding Scenarios 13-16 and results are shown in Table 4. Unsurprisingly, when coarseness is virtually zero, our nonparametric method shows good performance with negligible bias and confidence intervals that tend to have close to nominal coverage values starting from n = 50 for all the considered SI features.
3.1.4. A note on asymptotic bias.
Coarseness is responsible for introducing bias in estimates of SI features. As can be seen from Scenarios 1-12, when n increases, the bias does not necessarily decrease. This is because the sample size considered in these scenarios is not large enough to fully reveal how the underlying degree of coarseness impacts the estimates. To show the “asymptotic" impact of coarseness, we run simulations for the SARS-CoV-2 Omicron, smallpox and influenza A settings with n = 500. Results are shown in S1 Text (Scenarios S8-S10) and reveal good performance for the mean and median but undercoverage for the standard deviation and quantiles depending on the setting. To further highlight the asymptotic bias argument, we compare how estimates provided by our data-driven approach evolve with sample size when assuming a degree of coarseness of at least two days and when considering the hypothetical case of -coarseness in the SARS-CoV-2 Omicron setting. For a sequence of sample sizes between n = 6 and n = 500, we compute M = 50 estimates for each selected SI feature and analyze how the mean estimate evolves with n. Fig 3 shows results for the SARS-CoV-2 Omicron target SI distribution with a degree of coarseness of at least two days (panel A) and under the hypothetical case of
-coarseness (panel B). A coarseness degree of at least two days introduces bias in our estimates. This is especially visible for the standard deviation and quantiles q0.05 and q0.95. This bias reaches a limit as n grows large and is an unavoidable facet of coarseness that negatively impacts the confidence interval coverage performance. Under the hypothetical
-coarseness setting, our estimates exhibit good performance and stabilize around the true SI features as n increases.
Dotted lines indicate the true value of a SI feature. (A) Serial interval coarseness of at least two days generated by the censoring distribution p1 = 0.8, p2 = 0.15, p3 = 0.05. (B) Hypothetical -coarseness setting with
.
3.2. Applications
To further validate our nonparametric method, we consider different applications on serial interval data from past outbreaks that are publicly available. A textual analysis is provided for each individual dataset and results are summarized in Table 5.
3.2.1. Influenza A (2009 H1N1 influenza) at a New York City school.
We start by analyzing a dataset based on illness onset dates of n = 16 infector-infectee pairs obtained from the supplementary appendix of [15]. After fitting a Weibull distribution to the data, the authors obtain a median serial interval of 2.7 days (CI: 2.0-3.5) and a 95th quantile of 5.1 days (CI
: 3.6-6.5). Our nonparametric method estimates that the median SI is 2.8 days (
) and the 95th quantile estimate is 5.2 days (
). Fig 4A summarizes the observed serial interval windows. Fig 4B shows the estimated cdf
(black curve), point estimates (dots) and
CIs for selected quantiles of
.
(B) Nonparametric estimate (black curve), point estimates (dots) and
CIs (horizontal lines) for selected quantiles of
.
3.2.2. Influenza A (2009 H1N1 influenza) in San Antonio, Texas, USA.
We analyze another influenza dataset [34,54] containing doubly interval-censored serial interval data from the 2009 influenza A outbreak in San Antonio, Texas, USA [55]. Our methodology estimates the mean serial interval at 4.0 days (). The standard deviation is estimated at 1.9 days (
) and the 95th quantile is at 7.8 days (
). Serial interval windows and estimates of different features of
are shown in Fig 5.
(B) Nonparametric estimate (black curve), point estimates (dots) and
CIs (horizontal lines) for selected quantiles of
.
3.2.3. Illness onset data for COVID-19 in Wuhan, China.
[17] share data on illness onset dates of n = 6 infector-infectee pairs and estimate that the serial interval has a mean of 7.5 days (CI: 5.3-19) based on a parametric model involving a Gamma distribution. Raw data come as calendar dates of illness onset for infector-infectee pairs. We therefore apply a one-day coarsening of the data to recover the desired doubly interval-censored structure. Our nonparametric method gives a mean serial interval estimate of 6.3 days and a median SI of 6.7 days.
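The one-day coarsening step can be illustrated with a short Python sketch. It assumes that each reported calendar date stands for an onset time somewhere within that day, so that the serial interval window runs from the earliest possible infectee onset minus the latest possible infector onset to the latest possible infectee onset minus the earliest possible infector onset. The function name `si_window`, its `coarseness_days` argument and the example dates are hypothetical and are not taken from [17] or from the EpiDelays package.

```python
from datetime import date


def si_window(infector_onset, infectee_onset, coarseness_days=1):
    """Return (s_left, s_right), the serial interval window in days.

    Assumption: each onset time lies in
    [reported date, reported date + coarseness_days].
    """
    diff = (infectee_onset - infector_onset).days
    # Earliest infectee onset minus latest infector onset.
    s_left = diff - coarseness_days
    # Latest infectee onset minus earliest infector onset.
    s_right = diff + coarseness_days
    return s_left, s_right


# Hypothetical calendar dates, for illustration only.
pairs = [
    (date(2020, 1, 1), date(2020, 1, 8)),  # infectee onset 7 days later
    (date(2020, 1, 5), date(2020, 1, 4)),  # infectee onset 1 day earlier
]
windows = [si_window(a, b) for a, b in pairs]
print(windows)
```

Under these assumptions a pair with onsets seven calendar days apart yields a window of width two centered on seven days, and a pair in which the infectee shows symptoms first yields a window that covers negative serial interval values.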
3.2.4. Illness onset data for COVID-19 with n = 28 infector-infectee pairs.
A richer serial interval dataset on COVID-19 is provided by [18]. They obtained doubly interval-censored data on n = 28 infector-infectee pairs and estimated features of the serial interval based on a Bayesian parametric approach. The authors estimate the median serial interval to be 4.0 days (CrI: 3.1-4.9), where CrI denotes the credible interval. The mean and standard deviation of the serial interval are estimated at 4.7 days (CrI: 3.7-6.0) and 2.9 days (CrI: 1.9-4.9), respectively. Our nonparametric method estimates the median serial interval at 3.8 days. Estimates for the mean and standard deviation are 4.6 days and 2.6 days, respectively. A graphical output of the nonparametric results is shown in Fig 6.
(B) Nonparametric estimate (black curve), point estimates (dots) and CIs (horizontal lines) for selected quantiles of the serial interval distribution.
3.2.5. Illness onset data for COVID-19 in Belgium.
[20] report data on illness onset dates of n = 2161 transmission pairs for the Omicron variant of SARS-CoV-2 and n = 334 infector-infectee pairs for the Delta variant. Fitting a Gaussian distribution to the data using a Bayesian approach, the authors obtain a median serial interval of 2.75 days (CrI: 2.65-2.86) and a standard deviation of 2.54 days (CrI: 2.46-2.61) for Omicron. For Delta, they obtain a median serial interval of 3.00 days (CrI: 2.73-3.26) and a standard deviation of 2.49 days (CrI: 2.31-2.69). Treating the data as doubly interval-censored, our data-driven approach estimates the median SI at 2.62 days and the standard deviation at 2.60 days for Omicron. For the Delta variant, the nonparametric approach estimates the median SI at 3.06 days and the estimated standard deviation is 2.54 days.
4. Discussion
Our new data-driven methodology makes it possible to estimate serial interval features from coarse illness onset data without making parametric assumptions about the SI distribution. The proposed nonparametric estimates are based on uniform mixtures, and the resulting piecewise-linear structure of the cumulative distribution function allows point estimates of several SI features to be computed in closed form. This mathematical tractability implies a low computational cost when quantifying uncertainty via the bootstrap. Simulation results suggest that the proposed nonparametric methodology will provide a reasonable approximation to the true underlying SI distribution in a large number of real-world use cases, provided that the spread of the target distribution is not dominated too strongly by the degree of coarseness. A visual inspection of serial interval windows after adjusting for double interval censoring already gives an insightful assessment of whether coarseness dominates the spread of the underlying target SI distribution. The lower the frequency of overlapping SI windows in one-day intervals, the richer the signal conveying information about the spread of the underlying distribution, and hence the more confident we can be in estimates of the standard deviation and tail quantiles. Furthermore, we have shown that estimates of some SI features (mainly the standard deviation and tail quantiles) remain biased even under large sample sizes because of coarseness.
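To make the link between uniform mixtures, the piecewise-linear cdf and closed-form estimates concrete, the following Python sketch treats each SI window as a uniform component with equal weight. It is a minimal illustration and not the EpiDelays routine: the function names, the toy windows and the plain percentile bootstrap that resamples windows with replacement are assumptions made here, and the resampling scheme used in the paper may differ in detail.

```python
import math
import random


def mixture_cdf(s, windows):
    """cdf at s of an equally weighted mixture of U(left, right) components."""
    total = sum(min(1.0, max(0.0, (s - l) / (r - l))) for l, r in windows)
    return total / len(windows)


def mixture_mean(windows):
    """Mean of the mixture: the average of the window midpoints."""
    return sum((l + r) / 2 for l, r in windows) / len(windows)


def mixture_sd(windows):
    """Standard deviation from the closed-form second moment."""
    second = sum((r - l) ** 2 / 12 + ((l + r) / 2) ** 2 for l, r in windows)
    second /= len(windows)
    return math.sqrt(max(0.0, second - mixture_mean(windows) ** 2))


def mixture_quantile(p, windows):
    """Invert the piecewise-linear cdf; its kinks sit at the window bounds."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    knots = sorted({b for w in windows for b in w})
    values = [mixture_cdf(k, windows) for k in knots]
    for j in range(1, len(knots)):
        if values[j] >= p:
            slope = (values[j] - values[j - 1]) / (knots[j] - knots[j - 1])
            return knots[j - 1] + (p - values[j - 1]) / slope
    return knots[-1]


def bootstrap_ci(stat, windows, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap CI obtained by resampling windows (assumed scheme)."""
    rng = random.Random(seed)
    n = len(windows)
    reps = sorted(stat(rng.choices(windows, k=n)) for _ in range(n_boot))
    alpha = 1.0 - level
    lower = reps[math.floor(alpha / 2 * (n_boot - 1))]
    upper = reps[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Toy SI windows (in days), for illustration only.
toy_windows = [(1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]
print(mixture_mean(toy_windows), mixture_sd(toy_windows))
print(mixture_quantile(0.5, toy_windows))
print(bootstrap_ci(mixture_mean, toy_windows))
```

Each point estimate needs only a single pass over the windows, so the cost of the bootstrap is essentially the number of resamples times that pass. Every window must have strictly positive width for the cdf expression to be defined.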
While our method is specifically tailored for working with serial interval data that have been adjusted for double interval censoring, it is important to highlight that it does not adjust for right truncation. Right truncation means that SI windows are absent from the data because, at the time of data collection, the information required to build an SI window for an infector-infectee pair (i.e. two successive symptom onset times) is not yet available. The problem of right truncation appears in real-time settings and implies an overrepresentation of shorter serial intervals, which in turn can lead to underestimation of SI statistics [5,18]. Right truncation is accentuated during the early stage of an epidemic, while incidence is still growing [35]. In retrospective analyses, right truncation is usually not a problem if the surveillance period is long enough to provide a representative sample [5], and in that case, our data-driven approach does not require further adjustment. Methodological developments to correct for right truncation bias in estimating serial interval distributions have only recently emerged in parametric settings [6,18,35]. An interesting future research direction would be to extend existing right truncation adjustment approaches to our nonparametric setting.
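The overrepresentation of short serial intervals under right truncation can be seen in a small simulation. The Python sketch below is purely illustrative: the exponential growth rate, the observation time, the Gamma-distributed serial intervals and the function name `simulate_right_truncation` are all hypothetical choices and do not come from the paper. Primary onset times are drawn with a density proportional to exp(growth_rate * t) on the observation period, and a pair is kept only if the secondary onset occurs before data collection ends.

```python
import math
import random


def simulate_right_truncation(n_pairs=20000, growth_rate=0.2, t_obs=30.0,
                              shape=4.0, scale=1.5, seed=42):
    """Return (mean SI of all pairs, mean SI of pairs observed by t_obs)."""
    rng = random.Random(seed)
    all_si, observed_si = [], []
    for _ in range(n_pairs):
        # Inverse-cdf draw from a density proportional to exp(growth_rate * t)
        # on [0, t_obs], mimicking a growing epidemic.
        u = rng.random()
        primary_onset = math.log(
            1.0 + u * (math.exp(growth_rate * t_obs) - 1.0)
        ) / growth_rate
        si = rng.gammavariate(shape, scale)  # hypothetical SI distribution
        all_si.append(si)
        # The pair enters the data only if the secondary onset is observed.
        if primary_onset + si <= t_obs:
            observed_si.append(si)
    return sum(all_si) / len(all_si), sum(observed_si) / len(observed_si)


full_mean, truncated_mean = simulate_right_truncation()
print(full_mean, truncated_mean)
```

In this setup the mean over observed pairs falls below the mean over all pairs, because pairs with long serial intervals and recent primary onsets have not yet produced a second onset date.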
There is a surface-level similarity between the nonparametric approach proposed here and our previous work on incubation period estimation [56]; however, the two methods differ substantially in several ways. First, in our incubation period paper, we work from a Bayesian perspective and leverage the power and flexibility of Laplacian-P-splines [57,58] to estimate the incubation density. The nonparametric approach proposed here is not Bayesian and does not require the specification of a prior. Second, the distribution of incubation times is modeled in a semiparametric way and the model includes spline parameters, while our data-driven method for serial intervals is entirely parameter-free. Third, there is a non-negligible difference in terms of computational complexity. In [56], we use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of the model parameters, while here, the computational cost of obtaining estimates of SI features is drastically reduced and arises mostly from the resampling scheme of the bootstrap. Our nonparametric method for serial intervals is also mathematically less technical and thus perhaps more accessible to a broader set of users. For all these reasons, we believe that our approach for estimating incubation times and the newly proposed nonparametric method for estimating the serial interval distribution can be seen as complementary tools.
The proposed nonparametric method has several distinct strengths. First, being entirely data-driven, the method can be directly used to sketch the main characteristics of the SI distribution without imposing any parametric assumption. Moreover, the nonparametric estimate of the cumulative distribution function can be used as a benchmark to visually assess whether a chosen parametric model agrees with our data-driven fit, i.e. as an informal lack-of-fit test. Second, our method naturally deals with negative serial interval values and can thus be applied in a wide range of practical settings. Third, mathematical technicalities and computational complexity are kept minimal. This means that the algorithms underlying our approach are very simple and can be easily translated into whichever programming language the user prefers. We developed a user-friendly routine for the proposed nonparametric serial interval estimation methodology that is available in the EpiDelays package (https://github.com/oswaldogressani/EpiDelays). Fourth, our method aligns with some of the best practices recommended by [5]. For instance, it naturally accounts for doubly interval-censored data. Also, our method automatically provides an estimate of variability (standard deviation) along with an estimate of central tendency, and these estimates are accompanied by confidence intervals via the bootstrap. Furthermore, the fact that the underlying code has a small footprint means that the method is easily reproducible. This facilitates serial interval analyses on past, current or future illness onset data streams.
A limitation of our method is that the estimated cdf obtained with uniform mixtures tells us that there is zero probability below the smallest observed left SI bound and that the serial interval lies with probability one below the largest observed right SI bound. Allowing for more flexible tails that go beyond the range of the observation set may be more realistic.
As previously mentioned, a challenging future research direction would be to adjust the proposed nonparametric approach for right truncation. Alternatively, it could be interesting to investigate how the data-driven method behaves under different weighting schemes. Instead of attributing an equal weight of 1/n to each serial interval window, we can for instance think of a rule that puts more weight on SI windows with smaller widths (i.e. with a lower degree of coarseness), since those windows carry less uncertainty than wider serial interval windows. Finally, a more theoretical study of asymptotic properties and coarseness could provide interesting insights into the behavior of bias in our setting and give an indication of the quality of information that can be extracted when the underlying serial interval data are characterized by an overall high degree of coarseness.
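One way such a weighting rule could look is sketched below in Python. The inverse-width rule, the function names and the toy windows are hypothetical choices made only to illustrate the idea. The paper does not specify a particular scheme, and the statistical properties of this one have not been studied here.

```python
def inverse_width_weights(windows):
    """Weights proportional to 1 / width, normalized to sum to one."""
    raw = [1.0 / (r - l) for l, r in windows]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mixture_cdf(s, windows, weights):
    """cdf at s of a weighted mixture of U(left, right) components."""
    return sum(
        w * min(1.0, max(0.0, (s - l) / (r - l)))
        for (l, r), w in zip(windows, weights)
    )


def weighted_mixture_mean(windows, weights):
    """Mean of the weighted mixture: weighted average of window midpoints."""
    return sum(w * (l + r) / 2 for (l, r), w in zip(windows, weights))


# Toy SI windows (in days): one narrow window and one wide window.
toy_windows = [(2.0, 4.0), (1.0, 7.0)]
weights = inverse_width_weights(toy_windows)
equal = [1.0 / len(toy_windows)] * len(toy_windows)
print(weights)
print(weighted_mixture_mean(toy_windows, weights))
print(weighted_mixture_mean(toy_windows, equal))
```

With these toy windows the narrow window receives three times the weight of the wide one, which pulls the weighted mean toward the midpoint of the narrow window compared with the equal-weight mean.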
Supporting information
S1 Text.
Formulas of performance criteria used in the simulation study and additional simulation results.
https://doi.org/10.1371/journal.pcbi.1013338.s001
(PDF)
References
- 1. Simpson REH. The period of transmission in certain epidemic diseases; an observational method for its discovery. Lancet. 1948;2(6533):755–60. pmid:18100577
- 2. Madewell ZJ, Yang Y, Longini IM Jr, Halloran ME, Vespignani A, Dean NE. Rapid review and meta-analysis of serial intervals for SARS-CoV-2 Delta and Omicron variants. BMC Infect Dis. 2023;23(1):429. pmid:37365505
- 3. Cowling BJ, Fang VJ, Riley S, Malik Peiris JS, Leung GM. Estimation of the serial interval of influenza. Epidemiology. 2009;20(3):344–7. pmid:19279492
- 4. Te Beest DE, Henderson D, van der Maas NAT, de Greeff SC, Wallinga J, Mooi FR, et al. Estimation of the serial interval of pertussis in Dutch households. Epidemics. 2014;7:1–6. pmid:24928663
- 5. Charniga K, Park SW, Akhmetzhanov AR, Cori A, Dushoff J, Funk S, et al. Best practices for estimating and reporting epidemiological delay distributions of infectious diseases. PLoS Comput Biol. 2024;20(10):e1012520. pmid:39466727
- 6. Ward T, Christie R, Paton RS, Cumming F, Overton CE. Transmission dynamics of monkeypox in the United Kingdom: Contact tracing study. BMJ. 2022;379.
- 7. Svensson A. A note on generation times in epidemic models. Math Biosci. 2007;208(1):300–11. pmid:17174352
- 8. Lehtinen S, Ashcroft P, Bonhoeffer S. On the relationship between serial interval, infectiousness profile and generation time. J R Soc Interface. 2021;18(174):20200756. pmid:33402022
- 9. Chen D, Lau Y-C, Xu X-K, Wang L, Du Z, Tsang TK, et al. Inferring time-varying generation time, serial interval, and incubation period distributions for COVID-19. Nat Commun. 2022;13(1):7727. pmid:36513688
- 10. Park SW, Sun K, Champredon D, Li M, Bolker BM, Earn DJD, et al. Forward-looking serial intervals correctly link epidemic growth to reproduction numbers. Proc Natl Acad Sci U S A. 2021;118(2):e2011548118. pmid:33361331
- 11. Wallinga J, Lipsitch M. How generation intervals shape the relationship between growth rates and reproductive numbers. Proc Biol Sci. 2007;274(1609):599–604. pmid:17476782
- 12. Torneri A, Libin P, Scalia Tomba G, Faes C, Wood JG, Hens N. On realized serial and generation intervals given control measures: The COVID-19 pandemic case. PLoS Comput Biol. 2021;17(3):e1008892.
- 13. Boëlle P-Y, Ansart S, Cori A, Valleron A-J. Transmission parameters of the A/H1N1 2009 influenza virus pandemic: A review. Influenza Other Respir Viruses. 2011;5(5):306–16. pmid:21668690
- 14. Griffin J, Casey M, Collins Á, Hunt K, McEvoy D, Byrne A, et al. Rapid review of available evidence on the serial interval and generation time of COVID-19. BMJ Open. 2020;10(11):e040263. pmid:33234640
- 15. Lessler J, Reich NG, Cummings DAT, New York City Department of Health and Mental Hygiene Swine Influenza Investigation Team, Nair HP, Jordan HT, et al. Outbreak of 2009 pandemic influenza A (H1N1) at a New York City school. N Engl J Med. 2009;361(27):2628–36. pmid:20042754
- 16. Cowling BJ, Chan KH, Fang VJ, Lau LLH, So HC, Fung ROP, et al. Comparative epidemiology of pandemic and seasonal influenza A in households. N Engl J Med. 2010;362(23):2175–84. pmid:20558368
- 17. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N Engl J Med. 2020;382(13):1199–207. pmid:31995857
- 18. Nishiura H, Linton NM, Akhmetzhanov AR. Serial interval of novel coronavirus (COVID-19) infections. Int J Infect Dis. 2020;93:284–6. pmid:32145466
- 19. Ma Y, Jenkins HE, Sebastiani P, Ellner JJ, Jones-López EC, Dietze R, et al. Using cure models to estimate the serial interval of tuberculosis with limited follow-up. Am J Epidemiol. 2020;189(11):1421–6. pmid:32458995
- 20. Kremer C, Braeye T, Proesmans K, André E, Torneri A, Hens N. Serial intervals for SARS-CoV-2 Omicron and Delta variants, Belgium, November 19–December 31, 2021. Emerg Infect Dis. 2022;28(8):1699–702. pmid:35732195
- 21. Rai B, Shukla A, Dwivedi LK. Estimates of serial interval for COVID-19: A systematic review and meta-analysis. Clin Epidemiol Glob Health. 2021;9:157–61. pmid:32869006
- 22. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc Series B: Stat Methodol. 1976;38(3):290–5.
- 23. Mettler SK, Kim J, Maathuis MH. Diagnostic serial interval as a novel indicator for contact tracing effectiveness exemplified with the SARS-CoV-2/COVID-19 outbreak in South Korea. Int J Infect Dis. 2020;99:346–51. pmid:32771634
- 24. Yang L, Dai J, Zhao J, Wang Y, Deng P, Wang J. Estimation of incubation period and serial interval of COVID-19: Analysis of 178 cases and 131 transmission chains in Hubei province, China. Epidemiol Infect. 2020;148:e117. pmid:32594928
- 25. Müller J, Kretzschmar M. Contact tracing—Old models and new challenges. Infect Dis Model. 2020;6:222–31. pmid:33506153
- 26. Reich NG, Lessler J, Cummings DAT, Brookmeyer R. Estimating incubation period distributions with coarse data. Stat Med. 2009;28(22):2769–84. pmid:19598148
- 27. Gandrud C. Reproducible research with R and R studio. Chapman and Hall/CRC; 2018.
- 28. Henderson AS, Hickson RI, Furlong M, McBryde ES, Meehan MT. Reproducibility of COVID-era infectious disease models. Epidemics. 2024;46:100743. pmid:38290265
- 29. Collins A, Alexander R. Reproducibility of COVID-19 pre-prints. Scientometrics. 2022;127(8):4655–73. pmid:35813409
- 30. Zavalis EA, Ioannidis JPA. A meta-epidemiological assessment of transparency indicators of infectious disease models. PLoS One. 2022;17(10):e0275380. pmid:36206207
- 31. Vink MA, Bootsma MCJ, Wallinga J. Serial intervals of respiratory infectious diseases: A systematic review and analysis. Am J Epidemiol. 2014;180(9):865–75. pmid:25294601
- 32. Reich NG, Lessler J, Azman AS. coarseDataTools: A collection of functions to help with analysis of coarsely observed data. R package version 0.6-6; 2021. Available from: https://cran.r-project.org/package=coarseDataTools
- 33. Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am J Epidemiol. 2013;178(9):1505–12. pmid:24043437
- 34. Thompson RN, Stockwin JE, van Gaalen RD, Polonsky JA, Kamvar ZN, Demarsh PA, et al. Improved inference of time-varying reproduction numbers during infectious disease outbreaks. Epidemics. 2019;29:100356. pmid:31624039
- 35. Park SW, Akhmetzhanov AR, Charniga K, Cori A, Davies NG, Dushoff J, et al. Estimating epidemiological delay distributions for infectious diseases. Cold Spring Harbor Lab. 2024. https://doi.org/10.1101/2024.01.12.24301247
- 36. Abbott S, Brand S, Pearson C, Funk S, Charniga K. Primary event censored distributions. 2025. https://doi.org/10.5281/zenodo.13632839
- 37. Batra N, et al. The Epidemiologist R Handbook; 2021. https://epirhandbook.com/en/ [Accessed 16 October 2024].
- 38. Ryu S, Kim D, Lim J-S, Ali ST, Cowling BJ. Serial interval and transmission dynamics during SARS-CoV-2 delta variant predominance, South Korea. Emerg Infect Dis. 2022;28(2):407–10. pmid:34906289
- 39. Donnelly CA, Finelli L, Cauchemez S, Olsen SJ, Doshi S, Jackson ML, et al. Serial intervals and the temporal distribution of secondary infections within households of 2009 pandemic influenza A (H1N1): Implications for influenza control recommendations. Clin Infect Dis. 2011;52 Suppl 1(Suppl 1):S123-30. pmid:21342883
- 40. Heitjan D. Ignorability and coarse data. Ann Stat. 1996;19:207–13.
- 41. Hens N, Calatayud L, Kurkela S, Tamme T, Wallinga J. Robust reconstruction and analysis of outbreak data: Influenza A(H1N1)v transmission in a school-based population. Am J Epidemiol. 2012;176(3):196–203. pmid:22791742
- 42. McAloon CG, Wall P, Griffin J, Casey M, Barber A, Codd M, et al. Estimation of the serial interval and proportion of pre-symptomatic transmission events of COVID-19 in Ireland using contact tracing data. BMC Public Health. 2021;21(1):805. pmid:33906635
- 43. Birnbaum A. On the foundations of statistical inference. J Am Stat Assoc. 1962;57(298):269–306.
- 44. Kass RE, Wasserman L. The selection of prior distributions by formal rules. J Am Stat Assoc. 1996;91(435):1343–70.
- 45. Gupta AK, Miyawaki T. On a uniform mixture model. Biometrical J. 1978;20(7–8):631–7.
- 46. Craigmile PF, Titterington DM. Parameter estimation for finite mixtures of uniform distributions. Commun Stat – Theory Methods. 1997;26(8):1981–95.
- 47. Bratley P, Fox BL, Schrage LE. A guide to simulation. Springer New York; 1987.
- 48. Kaczynski W, Leemis L, Loehr N, McQueston J. Nonparametric random variate generation using a Piecewise-Linear cumulative distribution function. Commun Stat – Simul Computat. 2011;41(4):449–68.
- 49. Diekmann O, Heesterbeek JA, Metz JA. On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations. J Math Biol. 1990;28(4):365–82. pmid:2117040
- 50. Efron B. Bootstrap methods: Another look at the jackknife. Ann Stat. 1979;7(1):1–26.
- 51. Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall/CRC; 1994.
- 52. Efron B, Hastie T. Computer age statistical inference, student edition: Algorithms, evidence, and data science. Cambridge University Press; 2021.
- 53. Alene M, Yismaw L, Assemie MA, Ketema DB, Gietaneh W, Birhan TY. Serial interval and incubation period of COVID-19: A systematic review and meta-analysis. BMC Infect Dis. 2021;21(1):257. pmid:33706702
- 54. Cori A, Cauchemez S, Ferguson NM, Fraser C, Dahlqwist E, Demarsh PA. Package ‘EpiEstim’. Vienna, Austria: CRAN; 2020.
- 55. Morgan OW, Parks S, Shim T, Blevins PA, Lucas PM, Sanchez R. Household transmission of pandemic (H1N1) 2009, San Antonio, Texas, USA, April–May 2009. Emerg Infect Dis. 2010;16(4):631–7.
- 56. Gressani O, Torneri A, Hens N, Faes C. Flexible Bayesian estimation of incubation times. Am J Epidemiol. 2025;194(2):490–501. pmid:38988237
- 57. Gressani O, Lambert P. Fast Bayesian inference using Laplace approximations in a flexible promotion time cure model based on P-splines. Computat Stat Data Anal. 2018;124:151–67.
- 58. Gressani O, Wallinga J, Althaus CL, Hens N, Faes C. EpiLPS: A fast and flexible Bayesian tool for estimation of the time-varying reproduction number. PLoS Comput Biol. 2022;18(10):e1010618. pmid:36215319