Nonparametric serial interval estimation with uniform mixtures

  • Oswaldo Gressani,

    Roles Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    oswaldo.gressani@uhasselt.be

    Affiliation Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Data Science Institute, Hasselt University, Hasselt, Belgium

  • Niel Hens

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliations Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Data Science Institute, Hasselt University, Hasselt, Belgium, Centre for Health Economics Research and Modelling Infectious Diseases, Vaxinfectio, University of Antwerp, Antwerp, Belgium

Abstract

The serial interval of an infectious disease is a key instrument to understand transmission dynamics. Estimation of the serial interval distribution from illness onset data extracted from transmission pairs is challenging due to the presence of censoring, and state-of-the-art methods mostly rely on parametric models. We present a fully data-driven methodology to estimate the serial interval distribution based on interval-censored serial interval data. The proposed nonparametric estimator of the cumulative distribution function of the serial interval is based on the class of uniform mixtures. Closed-form solutions are available for point estimates of different serial interval features and the bootstrap is used to construct confidence intervals. Algorithms underlying our approach are simple, stable, and computationally inexpensive, so that users can easily implement them in the programming language they are most familiar with. A user-friendly routine for the nonparametric method is included in the EpiDelays package for ease of implementation. Our method complements existing parametric approaches for serial interval estimation and makes it possible to analyze past, current, or future illness onset data streams following a set of best practices in epidemiological delay modeling.

Author summary

Epidemiological delay distributions play a key role in outbreak analyses and in modeling infectious diseases. The serial interval is the time from illness onset in a primary case to illness onset in a secondary case and ranks among the most important delay quantities as it can be used to infer transmission patterns in mathematical and statistical models. From a statistical perspective, estimation of the serial interval distribution is complicated by the fact that the exact timing of illness onset is usually unknown and the latter event is only known to have occurred between two time points; a phenomenon called interval censoring. We propose a new inferential method to estimate the serial interval distribution from interval-censored illness onset data without relying on a parametric model. The nonparametric methodology comes with a low degree of mathematical complexity and the underlying algorithms are simple, fast and stable. A user-friendly routine written in the R programming language is available in the EpiDelays package. The proposed data-driven method accounts for a set of best practices in epidemiological delay modeling and can be used to obtain point estimates and confidence intervals for often reported serial interval features.

1. Introduction

The serial interval (SI) is an epidemiological delay characterizing a duration between two well-defined events related to a disease. It represents the time between symptom onset in a primary case or infector and symptom onset in a secondary case or infectee [1]. This time delay can be negative as nothing restrains the infectee to experience symptom onset earlier than the infector [2]. In the literature, this interval is also known as the clinical onset serial interval [3,4]. Epidemiological and biological factors are responsible for introducing variation in times between primary and secondary events [5], so that serial intervals can be represented by a time delay distribution [6]. A different, but closely related delay quantity is the generation interval, which is defined as the duration between infection events in an infector-infectee pair [7]. The timing of an infection event is typically less likely to be observed than the timing of a symptom event and it is common practice to approximate the distribution of generation times by the SI distribution [8,9]. Serving as a proxy for generation intervals, serial intervals can be used as an instrument to measure the time scale of disease transmission [10] and are therefore key in linking the epidemic growth rate with the reproduction number [11,12]. The crucial role played by the serial interval distribution in disease transmission models emphasizes the need to have reliable, stable, and replicable statistical methodologies to estimate this quantity. Ideally, these methodologies should also follow best practices recently described in [5].

Different methods exist to estimate the distribution and features of the serial interval of an infectious disease based on data. When time intervals of illness onset between infectors and infectees are observed, the data are considered as a random sample from the population. In that case, essential features of the serial interval are estimated by either directly computing summary statistics from empirical serial intervals (e.g. mean, median, standard deviation) or by fitting a parametric distribution to observed data [13,14]. Parametric methods are by far the most common and usually include the Lognormal, Weibull, Gamma or Gaussian distributions [15–20]. For instance, a systematic review and meta-analysis of serial interval estimates for COVID-19 [21] shows that a majority of studies rely on parametric models with a frequent use of Gamma and Gaussian distributions. Estimation of model parameters is typically carried out with the maximum likelihood principle or by using the Bayesian approach, and is often based on a relatively small sample size. To our knowledge, only a few attempts have been made to apply nonparametric methods to serial interval data analysis. For instance, [3] compute a nonparametric estimate of the cumulative distribution function of the serial interval of influenza based on the method of [22] to see whether different parametric models are in agreement with it, and [23] use the nonparametric bootstrap to compute confidence intervals for the clinical onset SI of SARS-CoV-2.

By definition, serial intervals involve transmission pairs. This means that a minimal requirement for SI estimation is to have data on symptom onset times for the infector and infectee. Such data can be extracted from contact tracing programmes, which make it possible to learn who infected whom and provide information on timings of symptoms in infector-infectee pairs [24,25]. Commonly, serial interval data are interval censored in that only lower and upper limits of illness onset timing are observed. This characteristic adds a layer of complexity to the estimation problem. If censoring concerns either the infector or infectee, data are said to be single interval-censored; and if censoring affects both actors in the transmission pair, data are called doubly interval-censored [26]. From a continuous time perspective, serial interval data are more often than not doubly interval-censored due to the time resolution of reporting. When the time resolution for reporting illness onset is a calendar day (as is often the case), then censoring is inherent to the calendar day, i.e. the precise timing of illness onset within the reported calendar day remains unknown. Therefore, even if exact calendar dates are observed, it is good practice to still consider the data as doubly interval-censored [5].

Despite the large number of studies conducted on the serial interval of different pathogens, most methods are difficult or impossible to reproduce in the sense that independent researchers are confronted with serious difficulties in applying existing procedures to new data [27]. The field of infectious disease modeling suffers from alarmingly low computational reproducibility rates [28], which hinders applicability and is at odds with pandemic preparedness objectives. This reproducibility conundrum has several causes. For instance, recent meta-epidemiological surveys found that very few publications share code or data [29,30]. Other potential causes are code incompleteness and complex dependencies among multiple scripts without clear guidelines regarding computation order [28]. The study of [31] highlights that finding evidence supporting frequently cited serial interval values in the literature is a challenging task.

Fortunately, more applicable tools and methods have recently emerged to estimate epidemiological delay distributions. Originally developed for estimation of incubation period distributions, the methodology of [26] is available in an R software package [32] and associated routines are embedded in the EpiEstim package of [33] to estimate the serial interval [34]. [31] reanalyze published serial interval data on different respiratory infections by using a common statistical method and provide R code and data sets for reproducibility. The epidist package [35] and the primarycensored package [36] are also operational for serial interval estimation and account for censoring and truncation. These tools rely on parametric methods imposing distributional assumptions on the serial interval distribution and leave no room for data-driven inference.

In an attempt to complement the above-mentioned parametric methods, we develop a nonparametric approach to estimate the serial interval distribution based on illness onset data. The proposed method is entirely data-driven and applicable to a wide range of serial interval data commonly analyzed in the literature. Its chief merits are its mathematical and computational simplicity. The proposed method also aligns with some of the best practices recommended by [5], namely: (1) adjusting for double interval censoring, (2) reporting guidelines for epidemiological delays, (3) accounting for negative serial intervals and (4) reproducibility guidelines. Since R is among the most popular programming languages used in the infectious disease modeling community [28,37], the code underlying our nonparametric methodology is written in the R language and available in the EpiDelays package (https://github.com/oswaldogressani/EpiDelays). Source code comes in a lightweight format and spans only a few lines. It can thus be easily translated into another programming language if needed (e.g. Python or C++).

Next, we present our nonparametric estimator and briefly discuss some of its theoretical properties. An entire section is dedicated to simulations in order to assess the performance of our data-driven approach. Applications to transmission pair data extracted from previous outbreaks for a diverse set of pathogens underline the wide, general, and straightforward applicability of our method. The article concludes with a discussion surrounding different aspects of the proposed nonparametric methodology for serial interval estimation.

2. Methods

2.1. The coarse structure of serial interval data

Datasets used to estimate serial interval features are usually obtained from line list information collected during epidemics [10,34,38]. The structure is such that a line in the data list conveys information about calendar dates for the infector and infectee [31,39]. Calendar dates are not well-suited for statistical analysis. Therefore, conversion from calendar time to analysis time is carried out through a mapping from the set of calendar dates to a set of real numbers, and more commonly to a set of integers. The precise calendar date of symptom appearance may be unknown and this uncertainty translates into a range of reported dates. In that case, serial interval data are referred to as coarse data following the terminology of [40] in the sense that the timing of symptom onset is only observed to lie within a time interval; a feature also known as interval censoring. Even when precise dates are reported, there is still uncertainty with respect to the exact timing of symptom onset within a day. As such, a calendar day can be coarsened to an interval of two consecutive calendar dates, where the reported day is the lower bound and the following day is the upper bound of the interval. This means that serial interval data are usually treated as doubly interval-censored [26], i.e. the data contain a range of symptom onset dates for each primary and secondary case. After conversion of calendar time to analysis time, denote by $\tau_i$ the illness onset time of the infector and by $t_i$ the illness onset time of the infectee in the ith transmission pair. These quantities are treated as interval-censored because the precise symptom onset time within a day is usually unknown. Let $\tau_{iL}$ and $\tau_{iR}$ denote the observed left and right bound, respectively, of the symptom onset time of the infector in the ith transmission pair and assume $\tau_{iL} < \tau_{iR}$. For the infectee, a similar notation is used and we assume $t_{iL} < t_{iR}$. The four time points can be used to compute the earliest possible SI time $s_{iL} = t_{iL} - \tau_{iR}$ and the latest possible SI time $s_{iR} = t_{iR} - \tau_{iL}$. The SI window width $s_{iR} - s_{iL}$ sheds light upon the degree of coarseness associated with the (unobserved) serial interval in the ith infector-infectee pair. A schematic representation of SI data and its underlying coarseness is shown in Fig 1.
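The conversion from reported calendar dates to an SI window can be sketched in a few lines. The snippet below is a minimal Python illustration and not the EpiDelays implementation; the function names `onset_interval` and `si_window`, the origin date, and the example dates are all hypothetical choices made for this sketch.

```python
from datetime import date

def onset_interval(first_day, last_day=None, origin=date(2020, 1, 1)):
    """Map a reported onset date (or date range) to analysis time.

    A single reported day is coarsened to [day, day + 1) because the
    exact onset time within the day is unknown.
    """
    if last_day is None:
        last_day = first_day
    left = (first_day - origin).days
    right = (last_day - origin).days + 1  # onset may occur late on the last day
    return left, right

def si_window(infector, infectee):
    """Earliest and latest possible serial interval for one transmission pair."""
    (inf_left, inf_right), (sec_left, sec_right) = infector, infectee
    earliest = sec_left - inf_right   # infectee as early, infector as late as possible
    latest = sec_right - inf_left     # infectee as late, infector as early as possible
    return earliest, latest

# Hypothetical pair: infector reports 2 March, infectee reports 5 to 6 March.
infector = onset_interval(date(2020, 3, 2))
infectee = onset_interval(date(2020, 3, 5), date(2020, 3, 6))
s_left, s_right = si_window(infector, infectee)
print(s_left, s_right, s_right - s_left)
```

In this sketch the window width is the sum of the two onset interval widths, which is why exact daily reports on both sides still give an SI window of two days.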

Fig 1. Schematic representation of the coarse structure of serial interval data.

(A) The timings of symptom onset in infector-infectee pairs are usually reported as calendar dates. (B) Conversion from calendar time to analysis time is done through a mapping from reported calendar dates to a set of numbers (usually integers). (C) To account for the uncertainty in the timing of symptom onset within a day when a single calendar day is reported by the infector or infectee, a one-day coarsening of the data is implemented by constructing an interval with endpoints corresponding to two numbers resulting from the mapping of two successive calendar days in analysis time. (D) Coarseness at the serial interval level is obtained by taking the difference between the right serial interval bound siR and the left serial interval bound siL.

https://doi.org/10.1371/journal.pcbi.1013338.g001

To address uncertainty about who infected whom in outbreak data, robust likelihood-based methods can be employed that explicitly account for missing or potentially incorrect transmission links (see e.g. [41]). These methods estimate the probability of transmission between individuals using contact information and the timing of symptom onset. An iterative approach is then used to reconstruct plausible transmission trees and to estimate key epidemiological delay distributions, while simultaneously identifying and mitigating the influence of unreliable data points. An alternative and simplified approach to mitigate the risk of misspecifying the infector is to concentrate on a subset of transmission pairs for which there is reasonable evidence about who the infector is [42], although this could introduce bias in the analysis.

2.2. A uniform mixture model

Let $\mathcal{S}$ be a real-valued random variable representing the serial interval of an infectious disease and denote by $F_{\mathcal{S}}(s) = P(\mathcal{S} \le s)$ the cumulative distribution function (cdf) of $\mathcal{S}$ with $s \in \mathbb{R}$. We adopt Laplace’s principle of insufficient reason [43,44] and assume that the interval-censored SI variable $\mathcal{S}_i$ of the ith transmission pair is uniformly distributed over the censoring interval with endpoints $s_{iL}$ and $s_{iR}$, i.e. $\mathcal{S}_i \sim \mathcal{U}(s_{iL}, s_{iR})$. The resulting cdf associated with $\mathcal{S}_i$ is denoted by:

$$F_i(s) = \frac{s - s_{iL}}{s_{iR} - s_{iL}}\, \mathbb{I}(s_{iL} \le s < s_{iR}) + \mathbb{I}(s \ge s_{iR}),$$

where $\mathbb{I}(\cdot)$ is the indicator function. The ordered pair $(s_{iL}, s_{iR})$ denotes the ith transmission pair SI window constructed from the observed data points $s_{iL}$ and $s_{iR}$. Also, let $\mathcal{I} = \{(s_{iL}, s_{iR})\}_{i=1}^{n}$ denote the set of ordered pairs representing the information set (or set of observables) constructed from serial interval data with n transmission pairs. Following previous work on mixtures of uniform distributions (see e.g. [45,46]), we propose to estimate $F_{\mathcal{S}}$ by the n-component mixture with weights $1/n$ for $i = 1, \dots, n$. The resulting data-driven estimate is:

$$\widehat{F}_{\mathcal{S}}(s) = \frac{1}{n} \sum_{i=1}^{n} F_i(s). \tag{1}$$

The above estimate is a finite convex combination of continuous functions and is therefore itself a continuous function in $s$. Moreover, it is a non-decreasing function since it essentially accumulates probability mass over intervals when moving along the real line in the positive direction. It is also easy to verify that $\lim_{s \to -\infty} \widehat{F}_{\mathcal{S}}(s) = 0$ and $\lim_{s \to +\infty} \widehat{F}_{\mathcal{S}}(s) = 1$, so that $\widehat{F}_{\mathcal{S}}$ is a bona fide cdf. Note also that $\widehat{F}_{\mathcal{S}}$ is a piecewise-linear function with breakpoints or “bends” arising at observed data points. Piecewise-linear cumulative distribution functions are endowed with interesting properties that have for instance been studied in [47,48]. These properties will guide us in computing point estimates of different serial interval features.
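To make the estimator concrete, the following minimal Python sketch evaluates the equally weighted uniform mixture cdf at a given point. It is an illustration under the definitions of this section rather than the package code; the function name `uniform_mixture_cdf` and the four example windows are hypothetical.

```python
def uniform_mixture_cdf(s, windows):
    """Evaluate the uniform mixture cdf estimate at s.

    `windows` is a list of (left, right) SI windows with left < right.
    Each window contributes the cdf of a uniform distribution on it,
    and all windows receive the same weight 1/n.
    """
    total = 0.0
    for left, right in windows:
        if s >= right:
            total += 1.0                             # whole window lies below s
        elif s > left:
            total += (s - left) / (right - left)     # linear part inside the window
    return total / len(windows)

# Hypothetical information set with n = 4 transmission pairs.
windows = [(1, 3), (2, 4), (0, 2), (3, 6)]
for s in (0, 1, 2, 3, 4, 6):
    print(s, uniform_mixture_cdf(s, windows))
```

Evaluating the function at the sorted window endpoints, as in the loop above, gives the breakpoints between which the estimate is linear.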

In parametric approaches, it is customary to work with the estimated probability density function (pdf) of the serial interval distribution, while our methodology centers on the estimated cumulative distribution function $\widehat{F}_{\mathcal{S}}$. This implies no loss of generality as the cdf gives a complete description of the underlying target distribution. For instance, our method can be used to compute an estimate of the basic reproduction number $R_0$, the average number of secondary cases generated by a primary case in a fully susceptible population [49]. The generation interval distribution provides a link between the exponential growth rate of an epidemic r and the basic reproduction number via the Lotka-Euler equation [11], namely $1/R_0 = \int_{0}^{\infty} e^{-rt} g(t)\,dt$, where $g(\cdot)$ is the pdf of the generation interval. Using the serial interval as a proxy for the generation interval, the latter equation becomes $1/R_0 = \int_{-\infty}^{+\infty} e^{-rs} f_{\mathcal{S}}(s)\,ds$, where $f_{\mathcal{S}}(\cdot)$ is the pdf of the serial interval. Relying on the Riemann-Stieltjes integral notation, the estimated basic reproduction number using our nonparametric method is $\widehat{R}_0 = \left(\int_{-\infty}^{+\infty} e^{-rs}\,d\widehat{F}_{\mathcal{S}}(s)\right)^{-1}$, where the integral can be solved numerically. An alternative way to estimate $R_0$ without entirely leveraging our nonparametric cdf estimate is to work with a classic parametric distribution for the generation time and use a parameterization that aligns with our nonparametric estimate of the mean and variance of the serial interval.
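As an illustration of the reproduction number calculation, the sketch below evaluates the Lotka-Euler integral for the uniform mixture. It relies on an observation that is ours and not stated in the text: because each mixture component is uniform, its contribution to the integral of exp(-r s) has an elementary antiderivative, so no numerical integration is needed in this special case. The function name `r0_from_windows`, the growth rate and the windows are hypothetical.

```python
import math

def r0_from_windows(r, windows):
    """Reproduction number implied by growth rate r and the uniform mixture.

    For a uniform component on (a, b), the integral of exp(-r * s) / (b - a)
    over (a, b) equals (exp(-r * a) - exp(-r * b)) / (r * (b - a)).
    Averaging over the n windows gives the integral with respect to the
    mixture cdf, and the reproduction number is its reciprocal.
    """
    if r == 0:
        return 1.0  # zero growth corresponds to a reproduction number of one
    integral = sum(
        (math.exp(-r * a) - math.exp(-r * b)) / (r * (b - a))
        for a, b in windows
    ) / len(windows)
    return 1.0 / integral

# Hypothetical SI windows (in days) and a hypothetical growth rate per day.
windows = [(1, 3), (2, 4), (0, 2), (3, 6)]
print(r0_from_windows(0.1, windows))
```

A numerical quadrature of the same integral should agree with this value up to discretization error, which offers a simple cross-check.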

2.3. Point estimation

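The uniform mixture model in (1) is mathematically appealing as it allows frequently reported point estimates of features of the SI distribution to be computed in closed form based on the information set $\mathcal{I}$. Using the Riemann-Stieltjes integral representation of the expected value, point estimates of the SI mean and standard deviation are given by:

$$\hat{\mu} = \int s\, d\widehat{F}_{\mathcal{S}}(s) = \frac{1}{n} \sum_{i=1}^{n} \frac{s_{iL} + s_{iR}}{2}, \qquad \hat{\sigma} = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{s_{iL}^2 + s_{iL} s_{iR} + s_{iR}^2}{3} - \hat{\mu}^2 \right)^{1/2}.$$

The estimated quantile function of the random variable $\mathcal{S}$ is $\hat{q}_p = \inf\{s : \widehat{F}_{\mathcal{S}}(s) \ge p\}$, where $\hat{q}_p$ is the p-quantile of $\widehat{F}_{\mathcal{S}}$ for a given $p \in (0,1)$. Denote by $b_k$ and $b_{k+1}$ two neighboring breakpoints of $\widehat{F}_{\mathcal{S}}$ satisfying $b_k < b_{k+1}$. When p is such that $p = \widehat{F}_{\mathcal{S}}(b_k) = \widehat{F}_{\mathcal{S}}(b_{k+1})$, then $\widehat{F}_{\mathcal{S}}$ has a flat behavior between $b_k$ and $b_{k+1}$ and so $\hat{q}_p = b_k$ by definition of the quantile function. When p is such that $\widehat{F}_{\mathcal{S}}(b_k) < p < \widehat{F}_{\mathcal{S}}(b_{k+1})$, the piecewise-linear property of $\widehat{F}_{\mathcal{S}}$ can be used to compute the desired estimated quantile $\hat{q}_p$. In fact, by piecewise linearity, the slope of $\widehat{F}_{\mathcal{S}}$ between $b_k$ and $\hat{q}_p$ is equal to the slope of $\widehat{F}_{\mathcal{S}}$ between $b_k$ and $b_{k+1}$. This yields an equation that can be solved for the single unknown $\hat{q}_p$. Mathematically:

$$\frac{p - \widehat{F}_{\mathcal{S}}(b_k)}{\hat{q}_p - b_k} = \frac{\widehat{F}_{\mathcal{S}}(b_{k+1}) - \widehat{F}_{\mathcal{S}}(b_k)}{b_{k+1} - b_k}. \tag{2}$$

Solving (2) for $\hat{q}_p$ yields:

$$\hat{q}_p = b_k + \frac{\left(p - \widehat{F}_{\mathcal{S}}(b_k)\right)(b_{k+1} - b_k)}{\widehat{F}_{\mathcal{S}}(b_{k+1}) - \widehat{F}_{\mathcal{S}}(b_k)}. \tag{3}$$

If p satisfies $p = \widehat{F}_{\mathcal{S}}(b_k)$, then $\hat{q}_p = b_k$ and if $p = \widehat{F}_{\mathcal{S}}(b_{k+1})$, then $\hat{q}_p = b_{k+1}$.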
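The closed-form point estimates of this section can be sketched as follows. This is a minimal Python illustration, not the EpiDelays routine: the mean and standard deviation use the first two moments of a uniform distribution on each window, and the quantile is found by linear interpolation between neighboring breakpoints of the estimated cdf. All function names and the example windows are hypothetical.

```python
import math

def mixture_cdf(s, windows):
    """Equally weighted uniform mixture cdf evaluated at s."""
    total = 0.0
    for a, b in windows:
        total += 1.0 if s >= b else max(0.0, (s - a) / (b - a))
    return total / len(windows)

def si_mean(windows):
    """Mean of the mixture: average of the window midpoints."""
    return sum((a + b) / 2 for a, b in windows) / len(windows)

def si_sd(windows):
    """Standard deviation from the mixture's second moment.

    A uniform distribution on (a, b) has second moment (a^2 + a*b + b^2) / 3.
    """
    second = sum((a * a + a * b + b * b) / 3 for a, b in windows) / len(windows)
    return math.sqrt(second - si_mean(windows) ** 2)

def si_quantile(p, windows):
    """p-quantile (0 < p < 1) by interpolation between neighboring breakpoints."""
    points = sorted({bound for window in windows for bound in window})
    values = [mixture_cdf(b, windows) for b in points]
    for k in range(len(points) - 1):
        low, high = values[k], values[k + 1]
        if low < p <= high:  # first segment on which the cdf reaches p
            return points[k] + (p - low) * (points[k + 1] - points[k]) / (high - low)
    raise ValueError("p must lie strictly between 0 and 1")

windows = [(1, 3), (2, 4), (0, 2), (3, 6)]  # hypothetical SI windows
print(si_mean(windows), si_sd(windows), si_quantile(0.5, windows))
```

Scanning segments from left to right and stopping at the first one that reaches p handles flat stretches of the cdf, since the returned value is then the left end of the flat stretch.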

2.4. Quantification of uncertainty

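The generic notation $\theta$ is used to represent a given feature of the SI distribution, for instance $\theta = \mu$ if the mean is of interest or $\theta = \sigma$ if the focus is on the standard deviation of $\mathcal{S}$. The bootstrap method will be used to compute measures of accuracy associated with the estimate $\hat{\theta}$ [50,51]. Let $\mathcal{I}^*$ denote a bootstrap sample obtained by sampling randomly and with equal probability n transmission pair SI windows with replacement from $\mathcal{I}$. With n = 4 transmission pairs, a possible realization is $\mathcal{I}^* = \{(s_{2L}, s_{2R}), (s_{4L}, s_{4R}), (s_{2L}, s_{2R}), (s_{1L}, s_{1R})\}$ and the bootstrap sample is simply a set of n ordered pairs. For the features of $\mathcal{S}$ presented in Sect 2.3, the bootstrap replication of $\hat{\theta}$ denoted by $\hat{\theta}^*$ can easily be computed based on $\mathcal{I}^*$. Generating B bootstrap samples and computing their corresponding bootstrap replicate of $\hat{\theta}$ gives access to $\{\hat{\theta}^*_b\}_{b=1}^{B}$, which characterizes the bootstrap distribution of the statistic $\hat{\theta}$. The bootstrap estimate of the standard error of $\hat{\theta}$ can be used as a measure of accuracy of the estimate $\hat{\theta}$. It corresponds to the empirical standard deviation of the values in $\{\hat{\theta}^*_b\}_{b=1}^{B}$:

$$\widehat{\mathrm{se}}_B = \left( \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \bar{\theta}^* \right)^2 \right)^{1/2}, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b.$$

A $100(1-\alpha)\%$ confidence interval for $\theta$ can be constructed from the empirical quantiles of the sample of bootstrap estimates in $\{\hat{\theta}^*_b\}_{b=1}^{B}$. Let $\hat{\theta}^*_{\alpha/2}$ and $\hat{\theta}^*_{1-\alpha/2}$ denote the $\alpha/2$ and $1-\alpha/2$ sample quantiles of the values in $\{\hat{\theta}^*_b\}_{b=1}^{B}$. Most software has readily available routines to compute these quantiles (e.g. the quantile function in R). The confidence interval for $\theta$ using the quantile method is denoted by $[\hat{\theta}^*_{\alpha/2}, \hat{\theta}^*_{1-\alpha/2}]$. Following [52], we recommend using a bootstrap sample size of at least B = 2000 for confidence interval construction.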
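The bootstrap procedure described above can be sketched as follows. This is a minimal Python illustration and not the package code: SI windows are resampled with replacement, the feature is recomputed on each resample, and the standard error and quantile-method interval are read off the replicates. The function names, the fixed seed and the example windows are hypothetical, and the quantile helper uses linear interpolation between order statistics, which is the default rule of the quantile function in R.

```python
import math
import random
import statistics

def si_mean(windows):
    """Closed-form mean of the uniform mixture: average window midpoint."""
    return sum((a + b) / 2 for a, b in windows) / len(windows)

def sample_quantile(values, p):
    """Sample quantile with linear interpolation between order statistics."""
    x = sorted(values)
    h = (len(x) - 1) * p
    low = math.floor(h)
    high = min(low + 1, len(x) - 1)
    return x[low] + (h - low) * (x[high] - x[low])

def bootstrap(windows, feature, B=2000, level=0.95, seed=2024):
    """Bootstrap standard error and quantile-method confidence interval."""
    rng = random.Random(seed)
    n = len(windows)
    # Each replicate resamples n windows with replacement and recomputes the feature.
    replicates = [feature(rng.choices(windows, k=n)) for _ in range(B)]
    se = statistics.stdev(replicates)  # divisor B - 1
    alpha = 1 - level
    lower = sample_quantile(replicates, alpha / 2)
    upper = sample_quantile(replicates, 1 - alpha / 2)
    return se, lower, upper

windows = [(1, 3), (2, 4), (0, 2), (3, 6)]  # hypothetical SI windows
se, lower, upper = bootstrap(windows, si_mean)
print(se, lower, upper)
```

Any other feature that maps a list of windows to a number, such as a standard deviation or quantile function, can be passed in place of `si_mean`.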

3. Results

3.1. Simulations

3.1.1. Generating mechanism for artificial serial interval data.

To simulate artificial serial interval data, we assume that the target SI has a distribution with mean $\mu$ and standard deviation $\sigma$. At the transmission pair level, the interval-censoring mechanism is governed by a discrete random variable $C$ with values $c_l = l$ for $l = 1, \dots, L$ and probability mass function $P(C = c_l) = p_l$ with $\sum_{l=1}^{L} p_l = 1$. Given a set of parameters $\mu$, $\sigma$ and $p_1, \dots, p_L$, a complete dataset for n transmission pairs is obtained by repeating the following four steps n times.
1. Draw $s_i$ from the target distribution.
2. Sample $c_i$ based on the chosen censoring distribution.
3. Draw $u_i$ from a uniform distribution U(0,1).
4. Compute the left bound and right bound of the SI window of a transmission pair as $s_{iL} = \lfloor s_i - u_i c_i \rfloor$ and $s_{iR} = \lceil s_i + (1 - u_i) c_i \rceil$, where $\lfloor \cdot \rfloor$ is the floor function returning the greatest integer less than or equal to its argument and $\lceil \cdot \rceil$ is the ceiling function returning the smallest integer greater than or equal to its argument.
The distribution of $C$ controls the degree of data coarseness, i.e. the width of the generated serial interval windows. This simple mechanism makes it possible to simulate serial interval data of the kind frequently encountered in the epidemiologic literature and properly takes into account the uncertainty regarding the timing of symptom onset within the day. Said differently, for the infector and infectee, symptom onset is only known to lie between two successive calendar days so that transmission pair data are doubly interval-censored. Mathematically this means that, under the common mapping of calendar dates to integers, infector coarseness and infectee coarseness are both bounded below by one. This implies that SI coarseness measured by $s_{iR} - s_{iL}$ is bounded below by two. Fixing $c_1 = 1$ in our data generating mechanism ensures that the SI window is at least equal to two days. Fig 2 illustrates two sets of simulated serial interval data with n = 15, given values of $\mu$ and $\sigma$, and censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$.
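The four-step generating mechanism can be sketched in Python as below. Several details are assumptions made for this illustration and are not confirmed by the text: the target distribution is taken to be Gaussian, the window bounds are built so that a window of real width c is placed at random around the true serial interval before rounding outward to integers, and the values of the mean, standard deviation and seed are arbitrary. The function name `simulate_si_windows` is hypothetical.

```python
import math
import random

def simulate_si_windows(n, mu, sigma, pmf=(0.80, 0.15, 0.05), seed=None):
    """Simulate n coarse SI windows (left, right) with integer bounds."""
    rng = random.Random(seed)
    coarseness_values = list(range(1, len(pmf) + 1))  # c_l = l
    windows = []
    for _ in range(n):
        s = rng.gauss(mu, sigma)                             # step 1 (Gaussian is an assumption)
        c = rng.choices(coarseness_values, weights=pmf)[0]   # step 2
        u = rng.random()                                     # step 3
        # Step 4 (assumed form): a window of real width c containing s,
        # positioned at random, then rounded outward to integers.
        left = math.floor(s - u * c)
        right = math.ceil(s + (1 - u) * c)
        windows.append((left, right))
    return windows

# Arbitrary illustrative parameters, not the values used in the article.
windows = simulate_si_windows(n=15, mu=3.0, sigma=2.0, seed=42)
widths = [right - left for left, right in windows]
print(windows)
print(sorted(set(widths)))
```

Under this assumed construction the window width equals c + 1, so the censoring distribution above produces widths between two and four days.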

Fig 2. Example of two coarse SI datasets of size $n = 15$ obtained with our data generating mechanism using given values of $\mu$ and $\sigma$, and the censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$.

https://doi.org/10.1371/journal.pcbi.1013338.g002

3.1.2. First set of simulations.

The performance of our nonparametric method is first assessed by assuming two target SI distributions, namely a SI distribution inspired from the SARS-CoV-2 Omicron variant [20] and a SI distribution that imitates results obtained for smallpox [31]. The distribution for the censoring mechanism is given by $p_1 = 0.80$, $p_2 = 0.15$, and $p_3 = 0.05$, so that generated SI window widths vary between 2 and 4 days. For each target SI distribution, we simulate M = 1000 datasets with four different sample sizes $n$, covering frequently encountered numbers of transmission pairs in the literature [21,53] and yielding a total of $2 \times 4 = 8$ scenarios. The performance of our nonparametric approach is assessed on the following often reported features of the SI distribution: mean $\mu$, standard deviation $\sigma$ and quantiles $q_{0.05}$, $q_{0.25}$, $q_{0.50}$, $q_{0.75}$, $q_{0.95}$. We use bias, empirical standard error (ESE), root mean squared error (RMSE), coverage probability of confidence intervals at two nominal levels and median confidence interval width as performance criteria (formulas of these criteria are provided in S1 Text). Confidence intervals are constructed based on B = 2000 bootstrap samples.
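For readers who want to reproduce this kind of assessment, the sketch below computes the listed criteria from M simulated estimates and their confidence intervals. It uses the usual textbook definitions, which may differ in detail from the formulas given in S1 Text; the function name `performance` and the toy inputs are hypothetical.

```python
import math
import statistics

def performance(estimates, intervals, truth):
    """Summarize M estimates and M confidence intervals against a known truth."""
    m = len(estimates)
    bias = statistics.fmean(estimates) - truth                # mean estimate minus truth
    ese = statistics.stdev(estimates)                         # spread of the estimates
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / m)
    coverage = sum(low <= truth <= high for low, high in intervals) / m
    median_width = statistics.median(high - low for low, high in intervals)
    return {
        "bias": bias,
        "ese": ese,
        "rmse": rmse,
        "coverage": coverage,
        "median_width": median_width,
    }

# Toy inputs with M = 3 replications and a true feature value of 2.0.
summary = performance(
    estimates=[1.0, 2.0, 3.0],
    intervals=[(0.0, 3.0), (2.5, 4.0), (1.0, 2.0)],
    truth=2.0,
)
print(summary)
```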

Results for Scenarios 1-4 with underlying SARS-CoV-2 Omicron-like target SI distribution are shown in Table 1. Overall, our nonparametric method based on uniform mixtures exhibits fairly good performance with relatively low bias. The coverage of confidence intervals for all the chosen SI features is satisfactorily close to the nominal level starting from n = 50. Under smaller sample sizes, confidence intervals for the selected SI features tend to undercover, yet results for the mean and median remain reasonable given the underlying SI coarseness of at least two days and the small number of transmission pairs. The ESE, RMSE and width of confidence intervals tend to decrease as the sample size increases. It is also worth mentioning that estimation of remote quantiles, i.e. $q_{0.05}$ and $q_{0.95}$, is more challenging and the bias for these features is usually higher. Results for Scenarios 5-8 with a smallpox-like target SI distribution are given in Table 2 and the interpretation is the same as for Scenarios 1-4 with an overall good performance of our data-driven approach for all the considered SI features. Further simulations with higher average coarseness (Scenarios S1-S3) and scenarios with a Gamma target SI distribution inspired from measles (Scenarios S4-S7) are provided in S1 Text.

Table 1. Results for Scenarios 1-4 with $M = 1000$ simulated datasets, censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$, and target SI distribution inspired from [20] that imitates the SI distribution of the SARS-CoV-2 Omicron variant. The first column contains the selected features of $\mathcal{S}$, namely the mean, standard deviation, 5th, 25th, 50th, 75th and 95th quantiles. Bias, ESE, RMSE, coverage probability (CP) and median confidence interval width are used as performance criteria.

https://doi.org/10.1371/journal.pcbi.1013338.t001

Table 2. Results for Scenarios 5-8 with $M = 1000$ simulated datasets, censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$, and target SI distribution inspired from [31] that imitates the SI distribution of smallpox. The first column contains the selected features of $\mathcal{S}$, namely the mean, standard deviation, 5th, 25th, 50th, 75th and 95th quantiles. Bias, ESE, RMSE, coverage probability (CP) and median confidence interval width are used as performance criteria.

https://doi.org/10.1371/journal.pcbi.1013338.t002

3.1.3. The impact of coarseness.

To illustrate the negative impact of coarseness on estimates of certain SI features, we consider a target serial interval distribution inspired from influenza A [31] and run four simulation scenarios (Scenarios 9-12), one per sample size, with censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$. Results shown in Table 3 reveal how estimation performance is impacted by working with data having a coarseness degree of at least two days (i.e. doubly interval-censored SI data). Except for the mean and median, estimates of the chosen SI features tend to suffer from a larger bias as compared to the previous scenarios (Scenarios 1-8). This is because the standard deviation of the assumed influenza A serial interval target distribution is smaller than its counterpart in the SARS-CoV-2 Omicron and smallpox settings. Such a small standard deviation coupled with a degree of coarseness of at least two days for the serial interval windows blurs the information conveyed by the variation of the true (and unobserved) serial interval realizations. Said differently, the degree of coarseness “dominates” or hides the rather small variations of the true (unobserved) SI values around the mean $\mu$. The price to pay for such a degree of coarseness is a larger bias and lower coverage, especially for estimates of $\sigma$ and $q_{0.05}$, $q_{0.25}$, $q_{0.75}$, $q_{0.95}$ as shown in Table 3.

Table 3. Results for Scenarios 9-12 with simulated datasets, censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$, and target SI distribution inspired from [31] that imitates the SI distribution of influenza A. The first column contains the selected features of $\mathcal{S}$, namely the mean, standard deviation, 5th, 25th, 50th, 75th and 95th quantiles. Bias, ESE, RMSE, coverage probability (CP) and median confidence interval width are used as performance criteria.

https://doi.org/10.1371/journal.pcbi.1013338.t003

To further stress the role played by the degree of coarseness in SI data, we consider the same influenza A setting but with a hypothetical coarseness degree that is close to zero. In particular, after generating $s_i$, we assume that the left bound $s_{iL}$ and the right bound $s_{iR}$ of the serial interval window enclose $s_i$ within a window whose width is governed by a small constant $\varepsilon > 0$, and we refer to this censoring scheme as $\varepsilon$-coarseness. Simulations under $\varepsilon$-coarseness for the influenza A setting are implemented for four sample sizes, yielding Scenarios 13-16, and results are shown in Table 4. Unsurprisingly, when coarseness is virtually zero, our nonparametric method shows good performance with negligible bias and confidence intervals that tend to have close to nominal coverage values starting from n = 50 for all the considered SI features.

Table 4. Results for Scenarios 13-16 with simulated datasets, $\varepsilon$-coarseness with $\varepsilon$ close to zero, and target SI distribution inspired from [31] that imitates the SI distribution of influenza A. The first column contains the selected features of $\mathcal{S}$, namely the mean, standard deviation, 5th, 25th, 50th, 75th and 95th quantiles. Bias, ESE, RMSE, coverage probability (CP) and median confidence interval width are used as performance criteria.

https://doi.org/10.1371/journal.pcbi.1013338.t004

3.1.4. A note on asymptotic bias.

Coarseness is responsible for introducing bias in estimates of SI features. As can be seen from Scenarios 1-12, when n increases, the bias does not necessarily decrease. This is because the sample size considered in these scenarios is not large enough to fully reveal how the underlying degree of coarseness impacts the estimates. To show the “asymptotic” impact of coarseness, we run simulations for the SARS-CoV-2 Omicron, smallpox and influenza A settings with n = 500. Results are shown in S1 Text (Scenarios S8-S10) and reveal good performance for the mean and median but undercoverage for the standard deviation and quantiles depending on the setting. To further highlight the asymptotic bias argument, we compare how estimates provided by our data-driven approach evolve with sample size when assuming a degree of coarseness of at least two days and when considering the hypothetical case of $\varepsilon$-coarseness in the SARS-CoV-2 Omicron setting. For a sequence of sample sizes between n = 6 and n = 500, we compute M = 50 estimates for each selected SI feature and analyze how the mean estimate evolves with n. Fig 3 shows results for the SARS-CoV-2 Omicron target SI distribution with a degree of coarseness of at least two days (panel A) and under the hypothetical case of $\varepsilon$-coarseness (panel B). A coarseness degree of at least two days introduces bias in our estimates. This is especially visible for the standard deviation and quantiles $q_{0.05}$ and $q_{0.95}$. This bias reaches a limit as n grows large and is an unavoidable facet of coarseness that negatively impacts the confidence interval coverage performance. Under the hypothetical $\varepsilon$-coarseness setting, our estimates exhibit good performance and stabilize around the true SI features as n increases.
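One way to see why the bias in the standard deviation does not vanish with n is to decompose the variance of the uniform mixture. The short derivation below is our own worked step, not taken from the article; it only uses the first two moments of a uniform distribution, with $m_i$ and $w_i$ introduced here as shorthand for the midpoint and width of the ith SI window.

```latex
\begin{align*}
m_i &= \tfrac{1}{2}\,(s_{iL} + s_{iR}), \qquad w_i = s_{iR} - s_{iL}, \\
\hat{\mu} &= \frac{1}{n}\sum_{i=1}^{n} m_i, \\
\hat{\sigma}^2 &= \frac{1}{n}\sum_{i=1}^{n}\Big(m_i^2 + \frac{w_i^2}{12}\Big) - \hat{\mu}^2
  = \frac{1}{n}\sum_{i=1}^{n}\big(m_i - \hat{\mu}\big)^2 + \frac{1}{12\,n}\sum_{i=1}^{n} w_i^2 .
\end{align*}
```

The second term depends only on the window widths and not on the sample size. With every width at least two days it is at least 1/3, so the estimated variance carries a positive contribution from coarseness however many transmission pairs are observed. This is one contributing mechanism; rounding of the window bounds also affects the spread of the midpoints.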

Fig 3. Mean estimates of selected SI features computed over $M = 50$ simulated datasets for a sequence of sample sizes ranging between $n = 6$ and $n = 500$ when the underlying target SI distribution mimics the SARS-CoV-2 Omicron setting [20].

Dotted lines indicate the true value of a SI feature. (A) Serial interval coarseness of at least two days generated by the censoring distribution $p_1 = 0.80$, $p_2 = 0.15$, $p_3 = 0.05$. (B) Hypothetical $\varepsilon$-coarseness setting with $\varepsilon$ close to zero.

https://doi.org/10.1371/journal.pcbi.1013338.g003

3.2. Applications

To further validate our nonparametric method, we consider different applications on serial interval data from past outbreaks that are publicly available. A textual analysis is provided for each individual dataset and results are summarized in Table 5.

Table 5. Nonparametric estimates obtained with our method and parametric estimates of SI features (mean, standard deviation, median, and 95th quantile) for different publicly available serial interval datasets. Values in round brackets correspond to confidence intervals for our method and confidence or credible intervals for parametric methods. The third column indicates the sample size. NR: Not Reported. The symbol * indicates that information was obtained by contacting the corresponding author of the article listed in the data source column.

https://doi.org/10.1371/journal.pcbi.1013338.t005

3.2.1. Influenza A (2009 H1N1 influenza) at a New York City school.

We start by analyzing a dataset based on illness onset dates of n = 16 infector-infectee pairs obtained from the supplementary appendix of [15]. After fitting a Weibull distribution to the data, the authors obtain a median serial interval of 2.7 days (CI: 2.0-3.5) and a 95th quantile of 5.1 days (CI: 3.6-6.5). Our nonparametric method estimates the median SI at 2.8 days () and the 95th quantile at 5.2 days (). Fig 4A summarizes the observed serial interval windows. Fig 4B shows the estimated cdf (black curve), point estimates (dots) and CIs for selected quantiles of the serial interval.

Fig 4. (A) Serial interval windows of influenza A for infector-infectee pairs at a New York City school [15].

(B) Nonparametric estimate of the cdf (black curve), point estimates (dots) and CIs (horizontal lines) for selected quantiles of the serial interval.

https://doi.org/10.1371/journal.pcbi.1013338.g004

3.2.2. Influenza A (2009 H1N1 influenza) in San Antonio, Texas, USA.

We analyze another influenza dataset [34,54] containing doubly interval-censored serial interval data from the 2009 influenza A outbreak in San Antonio, Texas, USA [55]. Our methodology estimates the mean serial interval at 4.0 days (). The standard deviation is estimated at 1.9 days () and the 95th quantile at 7.8 days (). Serial interval windows and estimates of different features of the serial interval are shown in Fig 5.

Fig 5. (A) Serial interval windows of influenza A for infector-infectee pairs in San Antonio, Texas, USA [55].

(B) Nonparametric estimate of the cdf (black curve), point estimates (dots) and CIs (horizontal lines) for selected quantiles of the serial interval.

https://doi.org/10.1371/journal.pcbi.1013338.g005

3.2.3. Illness onset data for COVID-19 in Wuhan, China.

[17] share data on illness onset dates of n = 6 infector-infectee pairs and estimate that the serial interval has a mean of 7.5 days (CI: 5.3-19) based on a parametric model involving a Gamma distribution. Raw data come as calendar dates of illness onset for infector-infectee pairs. We therefore apply a one-day coarsening of the data to recover the desired doubly interval-censored structure. Our nonparametric method gives a mean serial interval estimate of 6.3 days () and a median SI of 6.7 days ().
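The one-day coarsening step can be sketched as follows. This is a minimal illustration assuming the usual reading of calendar-date data, namely that each onset occurred at an unknown time within its reported day; the exact rule applied to the data of [17] is not spelled out here, and the function name and dates are hypothetical.

```python
from datetime import date


def si_window(infector_onset, infectee_onset, coarseness_days=1):
    """Turn two calendar onset dates into a doubly interval-censored SI window.

    If the infector onset lies in [d1, d1 + c] and the infectee onset in
    [d2, d2 + c], the serial interval lies in [d2 - d1 - c, d2 - d1 + c].
    """
    diff = (infectee_onset - infector_onset).days
    return (diff - coarseness_days, diff + coarseness_days)


# Hypothetical pairs (not taken from any dataset analyzed in this article).
print(si_window(date(2020, 1, 1), date(2020, 1, 8)))
print(si_window(date(2020, 1, 5), date(2020, 1, 4)))  # infectee onset first
```

The second call shows that a negative serial interval is handled without special treatment: the window simply extends below zero.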

3.2.4. Illness onset data for COVID-19 with n = 28 infector-infectee pairs.

A richer serial interval dataset on COVID-19 is provided by [18]. They obtained doubly interval-censored data on n = 28 infector-infectee pairs and estimated features of the serial interval based on a Bayesian parametric approach. The authors estimate the median serial interval to be 4.0 days (CrI: 3.1-4.9), where CrI denotes the credible interval. The mean and standard deviation of the serial interval are estimated at 4.7 days (CrI: 3.7-6.0) and 2.9 days (CrI: 1.9-4.9), respectively. Our nonparametric method estimates the median serial interval at 3.8 days (). Estimates for the mean and standard deviation are 4.6 days () and 2.6 days (), respectively. A graphical output of the nonparametric results is shown in Fig 6.

Fig 6. (A) Serial interval windows of COVID-19 for infector-infectee pairs [18].

(B) Nonparametric estimate of the cdf (black curve), point estimates (dots) and CIs (horizontal lines) for selected quantiles of the serial interval.

https://doi.org/10.1371/journal.pcbi.1013338.g006

3.2.5. Illness onset data for COVID-19 in Belgium.

[20] report data on illness onset dates of n = 2161 transmission pairs for the Omicron variant of SARS-CoV-2 and n = 334 infector-infectee pairs for the Delta variant. Fitting a Gaussian distribution to the data using a Bayesian approach, the authors obtain a median serial interval of 2.75 days (CrI: 2.65-2.86) and a standard deviation of 2.54 days (CrI: 2.46-2.61) for Omicron. For Delta, they obtain a median serial interval of 3.00 days (CrI: 2.73-3.26) and a standard deviation of 2.49 days (CrI: 2.31-2.69). Treating the data as doubly interval-censored, our data-driven approach estimates the median SI at 2.62 days () and the standard deviation at 2.60 days () for Omicron. For the Delta variant, the nonparametric approach estimates the median SI at 3.06 days () and the estimated standard deviation is 2.54 days ().

4. Discussion

Our new data-driven methodology makes it possible to estimate serial interval features from coarse illness onset data without making parametric assumptions about the SI distribution. The proposed nonparametric estimates are based on uniform mixtures, and the resulting piecewise-linear structure of the cumulative distribution function allows point estimates of several SI features to be computed in closed form. This mathematical tractability implies a low computational cost when quantifying uncertainty via the bootstrap. Simulation results suggest that the proposed nonparametric methodology will provide a reasonable approximation to the true underlying SI distribution in a large number of real-world use cases, provided the spread of the target distribution is not overly dominated by the degree of coarseness. A visual inspection of serial interval windows after adjusting for double interval censoring already gives an insightful assessment of whether or not coarseness dominates the spread of the underlying target SI distribution. The smaller the frequency of overlapping SI windows in one-day intervals, the richer the signal conveying information about the spread of the underlying distribution, and hence the more confident we can be in estimates of the standard deviation and tail quantiles. Furthermore, we have shown that estimates of some SI features (mainly the standard deviation and tail quantiles) will remain biased even under large sample sizes due to the presence of coarseness.
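To make the closed-form claim concrete, the following sketch computes the cdf, mean, standard deviation and quantiles of an equal-weight mixture of uniform distributions over a set of SI windows. It is an illustration written under that equal-weight assumption, not the EpiDelays routine; the function names and the example windows are hypothetical.

```python
import math


def mixture_cdf(x, windows):
    """Piecewise-linear cdf: average of the Uniform(l, r) cdfs at x."""
    total = 0.0
    for l, r in windows:
        if x >= r:
            total += 1.0
        elif x > l:
            total += (x - l) / (r - l)
    return total / len(windows)


def mixture_mean(windows):
    """Mean of the mixture: average of the window midpoints."""
    return sum((l + r) / 2 for l, r in windows) / len(windows)


def mixture_sd(windows):
    """SD of the mixture from the second moment (l^2 + l*r + r^2) / 3."""
    second = sum((l * l + l * r + r * r) / 3 for l, r in windows) / len(windows)
    return math.sqrt(second - mixture_mean(windows) ** 2)


def mixture_quantile(p, windows):
    """Invert the cdf by linear interpolation between window bounds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    knots = sorted({bound for window in windows for bound in window})
    values = [mixture_cdf(k, windows) for k in knots]
    for i in range(len(knots) - 1):
        f0, f1 = values[i], values[i + 1]
        if f0 <= p <= f1 and f1 > f0:
            return knots[i] + (p - f0) * (knots[i + 1] - knots[i]) / (f1 - f0)
    raise ValueError("no cdf segment contains p")


# Hypothetical SI windows in days (left bound, right bound), each with l < r.
windows = [(1, 3), (2, 4), (2, 5), (3, 6)]
print(mixture_mean(windows), mixture_sd(windows), mixture_quantile(0.5, windows))
```

Because the cdf is linear between consecutive window bounds, a quantile is found by locating the segment that contains the target probability and interpolating; no numerical optimization is needed.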

While our method is specifically tailored to serial interval data that have been adjusted for double interval censoring, it is important to highlight that it does not adjust for right truncation. Right truncation means that SI windows are absent from the data because, at the time of data collection, the information required to build an SI window for an infector-infectee pair (i.e. two successive symptom onset times) is not yet available. The problem of right truncation appears in real-time settings and implies an overrepresentation of shorter serial intervals, which in turn can lead to underestimation of SI statistics [5,18]. Right truncation is accentuated during the early stage of an epidemic, when it undergoes a growing phase [35]. In retrospective analyses, right truncation is usually not a problem if the surveillance period is long enough to provide a representative sample [5], and in that case our data-driven approach does not require further adjustment. Methodological developments to correct for right truncation bias in estimating serial interval distributions have only recently emerged in parametric settings [6,18,35]. An interesting future research direction would be to extend existing right truncation adjustment approaches to our nonparametric setting.
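The overrepresentation of shorter serial intervals under right truncation can be seen in a small simulation. The sketch below is purely illustrative and rests on assumptions that are not taken from this article: infector onsets follow an exponentially growing epidemic with a hypothetical growth rate of 0.2 per day, serial intervals follow a hypothetical Gamma distribution with mean 4 days, and a pair is observed only if the infectee onset falls before the data collection time.

```python
import math
import random


def simulate_pairs(n, growth_rate, horizon, rng):
    """Draw (infector onset, infectee onset) pairs during exponential growth."""
    pairs = []
    scale = math.exp(growth_rate * horizon) - 1.0
    for _ in range(n):
        # Inverse-cdf draw from a density proportional to exp(r * t) on [0, horizon].
        infector_onset = math.log(1.0 + rng.random() * scale) / growth_rate
        # Hypothetical serial interval: Gamma with shape 4 and scale 1 (mean 4 days).
        serial_interval = rng.gammavariate(4.0, 1.0)
        pairs.append((infector_onset, infector_onset + serial_interval))
    return pairs


HORIZON = 30.0  # assumed data collection time (days since the first onset)
rng = random.Random(7)
pairs = simulate_pairs(20000, 0.2, HORIZON, rng)

all_si = [t2 - t1 for t1, t2 in pairs]
# Right truncation: pairs whose infectee onset is after the horizon are missing.
observed_si = [t2 - t1 for t1, t2 in pairs if t2 <= HORIZON]

mean_all = sum(all_si) / len(all_si)
mean_observed = sum(observed_si) / len(observed_si)
print(f"mean of all serial intervals:      {mean_all:.2f}")
print(f"mean of observed serial intervals: {mean_observed:.2f}")
```

Long serial intervals that start close to the collection time are the ones most likely to be missing, so the observed mean falls below the mean of the full sample.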

There is a surface-level similarity between the nonparametric approach proposed here and our previous work on incubation period estimation [56]; however, the two methods differ fundamentally in several ways. First, in our incubation period paper, we work from a Bayesian perspective and leverage the power and flexibility of Laplacian-P-splines [57,58] to estimate the incubation density. The nonparametric approach proposed here is not Bayesian and does not require the specification of a prior. Second, the distribution of incubation times is modeled in a semiparametric way and the model includes spline parameters, while our data-driven method for serial intervals is entirely parameter-free. Third, there is a non-negligible difference in computational complexity. In [56], we use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of the model parameters, while here the computational cost of obtaining estimates of SI features is drastically reduced and lies mostly in the resampling scheme of the bootstrap. Our nonparametric method for serial intervals is also mathematically less technical and thus perhaps more accessible to a broader set of users. For all these reasons, we believe that our approach for estimating incubation times and the newly proposed nonparametric method for estimating the serial interval distribution can be seen as complementary tools.

The proposed nonparametric method has several distinct strengths. First, being entirely data-driven, the method can be directly used to sketch the main characteristics of the SI distribution without imposing any parametric assumption. Moreover, the nonparametric estimate of the cumulative distribution function can be used as a benchmark to visually assess whether a chosen parametric model agrees with our data-driven fit, i.e. as an informal lack-of-fit test. Second, our method naturally deals with negative serial interval values and can thus be applied in a wide range of practical settings. Third, mathematical technicalities and computational complexity are kept minimal. This means that the algorithms underlying our approach are very simple and can be easily translated into the user's preferred programming language. We developed a user-friendly routine for the proposed nonparametric serial interval estimation methodology that is available in the EpiDelays package (https://github.com/oswaldogressani/EpiDelays). Fourth, our method is aligned with some of the best practices recommended by [5]. For instance, it naturally accounts for doubly interval-censored data. Also, our method automatically provides an estimate of variability (standard deviation) along with an estimate of central tendency, and these estimates are accompanied by confidence intervals via the bootstrap. Furthermore, the small footprint of the underlying code means that the method is easily reproducible. This facilitates serial interval analyses on past, current or future illness onset data streams.
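As an indication of how little code the bootstrap step requires, the sketch below resamples SI windows with replacement and reports a percentile interval for the mean. The percentile construction, the number of bootstrap replicates, the function names and the example windows are all assumptions made for illustration; the interval type used by the EpiDelays routine may differ.

```python
import random


def mixture_mean(windows):
    """Mean of the equal-weight uniform mixture: average window midpoint."""
    return sum((l + r) / 2 for l, r in windows) / len(windows)


def bootstrap_ci(windows, feature, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for `feature`, resampling SI windows."""
    rng = random.Random(seed)
    n = len(windows)
    stats = sorted(
        feature([windows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Drop k draws from each tail so that roughly `level` of them remain.
    k = int((1.0 - level) / 2.0 * n_boot)
    return stats[k], stats[n_boot - 1 - k]


# Hypothetical SI windows in days; note that a negative left bound is allowed.
windows = [(-1, 1), (1, 3), (2, 4), (2, 5), (3, 5), (3, 6), (4, 7), (6, 9)]
point = mixture_mean(windows)
lower, upper = bootstrap_ci(windows, mixture_mean)
print(f"mean = {point:.2f}, 95% percentile interval = ({lower:.2f}, {upper:.2f})")
```

Any feature that maps a list of windows to a number (for example a median or standard deviation function) can be passed in place of `mixture_mean`, and fixing the seed makes the interval reproducible.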

A limitation of our method is that the estimated cdf obtained with uniform mixtures tells us that there is zero probability below the smallest observed left SI bound and that the serial interval lies with probability one below the largest observed right SI bound. Allowing for more flexible tails that go beyond the range of the observation set may be more realistic.

As previously mentioned, a challenging future research direction would be to adjust the proposed nonparametric approach for right truncation. Alternatively, it could be interesting to investigate how the data-driven method behaves under different weighting schemes. Instead of attributing an equal weight of 1/n to each serial interval window, we can for instance think of a rule that puts more weight on SI windows with smaller widths (i.e. with a lower degree of coarseness), since those windows carry less uncertainty than wider serial interval windows. Finally, a more theoretical study of asymptotic properties and coarseness could provide interesting insights into the behavior of bias in our setting and give a sense of the quality of information that can be extracted when the underlying serial interval data are characterized by an overall high degree of coarseness.
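One possible weighting rule of this kind is sketched below, with weights proportional to the inverse of the window width. This rule is only a hypothetical example of the idea and has not been evaluated here; the function names and windows are illustrative.

```python
def inverse_width_weights(windows):
    """Hypothetical rule: weight each SI window by 1 / width, then normalize."""
    inverse_widths = [1.0 / (r - l) for l, r in windows]
    total = sum(inverse_widths)
    return [v / total for v in inverse_widths]


def weighted_mixture_mean(windows, weights):
    """Mean of a uniform mixture with arbitrary weights that sum to one."""
    return sum(w * (l + r) / 2 for w, (l, r) in zip(weights, windows))


# Hypothetical data: one narrow window (1 day) and one wide window (4 days).
windows = [(1, 2), (2, 6)]
weights = inverse_width_weights(windows)
equal_weights = [1 / len(windows)] * len(windows)

print(weights)
print(weighted_mixture_mean(windows, weights))
print(weighted_mixture_mean(windows, equal_weights))
```

With these two windows the narrow one receives four times the weight of the wide one, which pulls the estimated mean toward the narrow window's midpoint relative to the equal-weight estimate.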

Supporting information

S1 Text.

Formulas of performance criteria used in the simulation study and additional simulation results.

https://doi.org/10.1371/journal.pcbi.1013338.s001

(PDF)

References

  1. Simpson REH. The period of transmission in certain epidemic diseases; an observational method for its discovery. Lancet. 1948;2(6533):755–60. pmid:18100577
  2. Madewell ZJ, Yang Y, Longini IM Jr, Halloran ME, Vespignani A, Dean NE. Rapid review and meta-analysis of serial intervals for SARS-CoV-2 Delta and Omicron variants. BMC Infect Dis. 2023;23(1):429. pmid:37365505
  3. Cowling BJ, Fang VJ, Riley S, Malik Peiris JS, Leung GM. Estimation of the serial interval of influenza. Epidemiology. 2009;20(3):344–7. pmid:19279492
  4. Te Beest DE, Henderson D, van der Maas NAT, de Greeff SC, Wallinga J, Mooi FR, et al. Estimation of the serial interval of pertussis in Dutch households. Epidemics. 2014;7:1–6. pmid:24928663
  5. Charniga K, Park SW, Akhmetzhanov AR, Cori A, Dushoff J, Funk S, et al. Best practices for estimating and reporting epidemiological delay distributions of infectious diseases. PLoS Comput Biol. 2024;20(10):e1012520. pmid:39466727
  6. Ward T, Christie R, Paton RS, Cumming F, Overton CE. Transmission dynamics of monkeypox in the United Kingdom: Contact tracing study. BMJ. 2022;379.
  7. Svensson A. A note on generation times in epidemic models. Math Biosci. 2007;208(1):300–11. pmid:17174352
  8. Lehtinen S, Ashcroft P, Bonhoeffer S. On the relationship between serial interval, infectiousness profile and generation time. J R Soc Interface. 2021;18(174):20200756. pmid:33402022
  9. Chen D, Lau Y-C, Xu X-K, Wang L, Du Z, Tsang TK, et al. Inferring time-varying generation time, serial interval, and incubation period distributions for COVID-19. Nat Commun. 2022;13(1):7727. pmid:36513688
  10. Park SW, Sun K, Champredon D, Li M, Bolker BM, Earn DJD, et al. Forward-looking serial intervals correctly link epidemic growth to reproduction numbers. Proc Natl Acad Sci U S A. 2021;118(2):e2011548118. pmid:33361331
  11. Wallinga J, Lipsitch M. How generation intervals shape the relationship between growth rates and reproductive numbers. Proc Biol Sci. 2007;274(1609):599–604. pmid:17476782
  12. Torneri A, Libin P, Scalia Tomba G, Faes C, Wood JG, Hens N. On realized serial and generation intervals given control measures: The COVID-19 pandemic case. PLoS Comput Biol. 2021;17(3):e1008892.
  13. Boëlle P-Y, Ansart S, Cori A, Valleron A-J. Transmission parameters of the A/H1N1 2009 influenza virus pandemic: A review. Influenza Other Respir Viruses. 2011;5(5):306–16. pmid:21668690
  14. Griffin J, Casey M, Collins Á, Hunt K, McEvoy D, Byrne A, et al. Rapid review of available evidence on the serial interval and generation time of COVID-19. BMJ Open. 2020;10(11):e040263. pmid:33234640
  15. Lessler J, Reich NG, Cummings DAT, New York City Department of Health and Mental Hygiene Swine Influenza Investigation Team, Nair HP, Jordan HT, et al. Outbreak of 2009 pandemic influenza A (H1N1) at a New York City school. N Engl J Med. 2009;361(27):2628–36. pmid:20042754
  16. Cowling BJ, Chan KH, Fang VJ, Lau LLH, So HC, Fung ROP, et al. Comparative epidemiology of pandemic and seasonal influenza A in households. N Engl J Med. 2010;362(23):2175–84. pmid:20558368
  17. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N Engl J Med. 2020;382(13):1199–207. pmid:31995857
  18. Nishiura H, Linton NM, Akhmetzhanov AR. Serial interval of novel coronavirus (COVID-19) infections. Int J Infect Dis. 2020;93:284–6. pmid:32145466
  19. Ma Y, Jenkins HE, Sebastiani P, Ellner JJ, Jones-López EC, Dietze R, et al. Using cure models to estimate the serial interval of tuberculosis with limited follow-up. Am J Epidemiol. 2020;189(11):1421–6. pmid:32458995
  20. Kremer C, Braeye T, Proesmans K, André E, Torneri A, Hens N. Serial intervals for SARS-CoV-2 Omicron and Delta variants, Belgium, November 19–December 31, 2021. Emerg Infect Dis. 2022;28(8):1699–702. pmid:35732195
  21. Rai B, Shukla A, Dwivedi LK. Estimates of serial interval for COVID-19: A systematic review and meta-analysis. Clin Epidemiol Glob Health. 2021;9:157–61. pmid:32869006
  22. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc Series B: Stat Methodol. 1976;38(3):290–5.
  23. Mettler SK, Kim J, Maathuis MH. Diagnostic serial interval as a novel indicator for contact tracing effectiveness exemplified with the SARS-CoV-2/COVID-19 outbreak in South Korea. Int J Infect Dis. 2020;99:346–51. pmid:32771634
  24. Yang L, Dai J, Zhao J, Wang Y, Deng P, Wang J. Estimation of incubation period and serial interval of COVID-19: Analysis of 178 cases and 131 transmission chains in Hubei province, China. Epidemiol Infect. 2020;148:e117. pmid:32594928
  25. Müller J, Kretzschmar M. Contact tracing – Old models and new challenges. Infect Dis Model. 2020;6:222–31. pmid:33506153
  26. Reich NG, Lessler J, Cummings DAT, Brookmeyer R. Estimating incubation period distributions with coarse data. Stat Med. 2009;28(22):2769–84. pmid:19598148
  27. Gandrud C. Reproducible research with R and RStudio. Chapman and Hall/CRC; 2018.
  28. Henderson AS, Hickson RI, Furlong M, McBryde ES, Meehan MT. Reproducibility of COVID-era infectious disease models. Epidemics. 2024;46:100743. pmid:38290265
  29. Collins A, Alexander R. Reproducibility of COVID-19 pre-prints. Scientometrics. 2022;127(8):4655–73. pmid:35813409
  30. Zavalis EA, Ioannidis JPA. A meta-epidemiological assessment of transparency indicators of infectious disease models. PLoS One. 2022;17(10):e0275380. pmid:36206207
  31. Vink MA, Bootsma MCJ, Wallinga J. Serial intervals of respiratory infectious diseases: A systematic review and analysis. Am J Epidemiol. 2014;180(9):865–75. pmid:25294601
  32. Reich NG, Lessler J, Azman AS. coarseDataTools: A collection of functions to help with analysis of coarsely observed data. R package version 0.6-6; 2021. Available from: https://cran.r-project.org/package=coarseDataTools
  33. Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. Am J Epidemiol. 2013;178(9):1505–12. pmid:24043437
  34. Thompson RN, Stockwin JE, van Gaalen RD, Polonsky JA, Kamvar ZN, Demarsh PA, et al. Improved inference of time-varying reproduction numbers during infectious disease outbreaks. Epidemics. 2019;29:100356. pmid:31624039
  35. Park SW, Akhmetzhanov AR, Charniga K, Cori A, Davies NG, Dushoff J, et al. Estimating epidemiological delay distributions for infectious diseases. Cold Spring Harbor Lab. 2024. https://doi.org/10.1101/2024.01.12.24301247
  36. Abbott S, Brand S, Pearson C, Funk S, Charniga K. Primary event censored distributions. 2025. https://doi.org/10.5281/zenodo.13632839
  37. Batra N, et al. The Epidemiologist R Handbook; 2021. https://epirhandbook.com/en/ [Accessed 16 October 2024].
  38. Ryu S, Kim D, Lim J-S, Ali ST, Cowling BJ. Serial interval and transmission dynamics during SARS-CoV-2 delta variant predominance, South Korea. Emerg Infect Dis. 2022;28(2):407–10. pmid:34906289
  39. Donnelly CA, Finelli L, Cauchemez S, Olsen SJ, Doshi S, Jackson ML, et al. Serial intervals and the temporal distribution of secondary infections within households of 2009 pandemic influenza A (H1N1): Implications for influenza control recommendations. Clin Infect Dis. 2011;52 Suppl 1(Suppl 1):S123–30. pmid:21342883
  40. Heitjan D. Ignorability and coarse data. Ann Stat. 1996;19:207–13.
  41. Hens N, Calatayud L, Kurkela S, Tamme T, Wallinga J. Robust reconstruction and analysis of outbreak data: Influenza A(H1N1)v transmission in a school-based population. Am J Epidemiol. 2012;176(3):196–203. pmid:22791742
  42. McAloon CG, Wall P, Griffin J, Casey M, Barber A, Codd M, et al. Estimation of the serial interval and proportion of pre-symptomatic transmission events of COVID-19 in Ireland using contact tracing data. BMC Public Health. 2021;21(1):805. pmid:33906635
  43. Birnbaum A. On the foundations of statistical inference. J Am Stat Assoc. 1962;57(298):269–306.
  44. Kass RE, Wasserman L. The selection of prior distributions by formal rules. J Am Stat Assoc. 1996;91(435):1343–70.
  45. Gupta AK, Miyawaki T. On a uniform mixture model. Biometrical J. 1978;20(7–8):631–7.
  46. Craigmile PF, Titterington DM. Parameter estimation for finite mixtures of uniform distributions. Commun Stat – Theory Methods. 1997;26(8):1981–95.
  47. Bratley P, Fox BL, Schrage LE. A guide to simulation. Springer New York; 1987.
  48. Kaczynski W, Leemis L, Loehr N, McQueston J. Nonparametric random variate generation using a piecewise-linear cumulative distribution function. Commun Stat – Simul Computat. 2011;41(4):449–68.
  49. Diekmann O, Heesterbeek JA, Metz JA. On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations. J Math Biol. 1990;28(4):365–82. pmid:2117040
  50. Efron B. Bootstrap methods: Another look at the jackknife. Ann Stat. 1979;7(1):1–26.
  51. Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall/CRC; 1994.
  52. Efron B, Hastie T. Computer age statistical inference, student edition: Algorithms, evidence, and data science. Cambridge University Press; 2021.
  53. Alene M, Yismaw L, Assemie MA, Ketema DB, Gietaneh W, Birhan TY. Serial interval and incubation period of COVID-19: A systematic review and meta-analysis. BMC Infect Dis. 2021;21(1):257. pmid:33706702
  54. Cori A, Cauchemez S, Ferguson NM, Fraser C, Dahlqwist E, Demarsh PA. Package ‘EpiEstim’. Vienna, Austria: CRAN; 2020.
  55. Morgan OW, Parks S, Shim T, Blevins PA, Lucas PM, Sanchez R. Household transmission of pandemic (H1N1) 2009, San Antonio, Texas, USA, April–May 2009. Emerg Infect Dis. 2010;16(4):631–7.
  56. Gressani O, Torneri A, Hens N, Faes C. Flexible Bayesian estimation of incubation times. Am J Epidemiol. 2025;194(2):490–501. pmid:38988237
  57. Gressani O, Lambert P. Fast Bayesian inference using Laplace approximations in a flexible promotion time cure model based on P-splines. Computat Stat Data Anal. 2018;124:151–67.
  58. Gressani O, Wallinga J, Althaus CL, Hens N, Faes C. EpiLPS: A fast and flexible Bayesian tool for estimation of the time-varying reproduction number. PLoS Comput Biol. 2022;18(10):e1010618. pmid:36215319