EpiLPS: A fast and flexible Bayesian tool for estimation of the time-varying reproduction number

In infectious disease epidemiology, the instantaneous reproduction number Rt is a time-varying parameter defined as the average number of secondary infections generated by an infected individual at time t. It is therefore a crucial epidemiological statistic that assists public health decision makers in the management of an epidemic. We present a new Bayesian tool (EpiLPS) for robust estimation of the time-varying reproduction number. The proposed methodology smooths the epidemic curve and allows to obtain (approximate) point estimates and credible intervals of Rt by employing the renewal equation, using Bayesian P-splines coupled with Laplace approximations of the conditional posterior of the spline vector. Two alternative approaches for inference are presented: (1) an approach based on a maximum a posteriori argument for the model hyperparameters, delivering estimates of Rt in only a few seconds; and (2) an approach based on a Markov chain Monte Carlo (MCMC) scheme with underlying Langevin dynamics for efficient sampling of the posterior target distribution. Case counts per unit of time are assumed to follow a negative binomial distribution to account for potential overdispersion in the data that would not be captured by a classic Poisson model. Furthermore, after smoothing the epidemic curve, a “plug-in’’ estimate of the reproduction number can be obtained from the renewal equation yielding a closed form expression of Rt as a function of the spline parameters. The approach is extremely fast and free of arbitrary smoothing assumptions. EpiLPS is applied on data of SARS-CoV-1 in Hong-Kong (2003), influenza A H1N1 (2009) in the USA and on the SARS-CoV-2 pandemic (2020-2021) for Belgium, Portugal, Denmark and France.


Introduction
The instantaneous reproduction number R t is a time-varying parameter defined as the average number of secondary cases generated by an infectious individual at time t. During epidemic outbreaks, R t provides a snapshot (often on a daily basis) that quantifies the extent to which a given infectious disease transmits in a population and is therefore an important tool that assists governmental organizations in the management of a public health crisis. The reproduction number is also a good proxy for measuring the real-time growth phase of an epidemic and as such, constitutes a key signal about the transmission potential of the outbreak and the required control effort. For this reason, having a robust, accurate and timely estimator of R t is a crucial matter that has attracted considerable interest in developing new statistical approaches during the last two decades as summarized in [1]. The paper of [2] compares several methods for estimating R t and gives clear insights about the main challenges and obstacles that have to be faced. They recommend the method of [3] and its associated EpiEstim package [4] as an appropriate and accurate tool for near real-time estimation of the instantaneous reproduction number. Another recent approach is proposed in [5], where a recursive Bayesian smoother based on Kalman filtering is used to derive a robust estimate of R t in periods of low incidence. The EpiNow2 package [6] also provides interesting extensions and implementations of current best practices for precise estimation and forecast of the reproduction number using a Bayesian latent variable framework. Spline based approaches have shown to be a useful tool for flexible modeling of the reproduction number. [7] use penalized radial splines for estimating R t under a Bayesian setting with misreported data and [8] accelerated the computational implementation by replacing the Markov chain Monte Carlo (MCMC) scheme with Laplace approximations. From a frequentist perspective, [9] uses truncated polynomials and radial basis splines to model the series of new infections and a derivative thereof as a candidate estimator for the reproduction number.
In this article, we propose a new Bayesian approach termed "EpiLPS" for estimating R t based on case incidence data and the serial interval (SI) distribution (the time elapsed between the onset of symptoms in an infector and the onset of symptoms in the secondary cases generated by that infector). Our estimator of R t is based on epidemic renewal equations [10,11] and Laplacian-P-splines smoothing of the mean number of incidence cases. Time series of new cases by day of reporting (or day of symptom onset) are assumed to follow a negative binomial distribution to account for potential excess variability as frequently encountered in epidemiological count data. Algorithms related to Laplace approximations and evaluations of B-spline bases are coded in C++ and embedded in the R language through the Rcpp package [12], making computational speed another key strength of EpiLPS as R t can be estimated in seconds. In addition, EpiLPS can also be used to obtain a smoothed estimate of the epidemic curve that can be of potential interest to further visualize an epidemic outbreak.
The proposed Bayesian methodology is based on a latent Gaussian model for the B-spline amplitudes and opens up two possible paths for inference. The first is called LPSMAP, a fully sampling-free approach based on Laplace approximations to the conditional posterior of Bspline coefficients. The hyperparameter vector is fixed at its maximum a posteriori and credible intervals of R t are computed via the "delta" method. The second path is called LPSMALA and is a MCMC approach based on the Langevin diffusion for efficient exploration of the posterior distribution of latent variables. The latter approach is computationally heavier than LPSMAP but has the merit of taking into account the uncertainty surrounding the hyperparameters. The underlying Metropolis-within-Gibbs structure keeps the practical implementation to a fairly simple level and the computational cost is reasonable even for long chains.
Compared to existing methods, EpiLPS resembles EpiEstim from a methodological point of view in the sense that R t is estimated from incidence time series and a serial interval distribution, yet the two approaches fundamentally differ in many aspects. First, the methodology of [3] assumes that incidence at time t is Poisson distributed, while EpiLPS assumes a negative binomial model. Second, as our approach uses penalized spline based approximations, prior specifications are imposed on the roughness penalty parameter and not directly on R t as in EpiEstim. Third and most importantly, EpiLPS is free of any sliding window specification, while EpiEstim relies on a user-defined time window. This subjective time window choice is the key driving force that determines how smooth the estimated R t trajectory will be. In EpiLPS, the optimal amount of smoothing is data-driven and objectively estimated (through the penalty parameter) within the Bayesian model. An R package for EpiLPS has been developed and is available at https://cran.r-project.org/package=EpiLPS. The software also allows to compute the Cori et al. (2013) [3] estimate of R t for the sake of comparison.
The manuscript is organized as follows. We first present the Laplacian-P-splines model for smoothing count data and show how the Laplace approximation applies to the conditional posterior of the B-spline amplitudes and also derive the (approximate) posterior of the hyperparameter vector to be optimized. This yields the maximum a posteriori (MAP) estimate of the spline vector via Laplacian-P-splines (LPSMAP). We then use LPSMAP to propose a "plug-in" estimate of R t based on renewal equations and proceed to the computation of credible intervals. An alternative path for estimation of R t based on MCMC is also presented. The latter approach uses Langevin dynamics for efficient sampling of the target posterior distribution and is termed LPSMALA for "Laplacian-P-splines with a Metropolis-adjusted Langevin algorithm". Next, we assess the performance of EpiLPS in various simulation scenarios and make comparisons with EpiEstim. Finally, we apply EpiLPS to real world epidemic outbreaks before concluding with a discussion.

Negative binomial model for case incidence data
Let D ¼ fy t ; t ¼ 1; . . . ; Tg be a time series of counts during an epidemic of T days with y t 2 N (set of non-negative integers) denoting the number of cases by reporting date or by date of symptom onset. We assume that the number of cases on day t follows a negative binomial distribution y t � NegBin(μ(t), ρ), with mðtÞ; r 2 R � þ ≔fx 2 Rjx > 0g and probability mass function (see e.g. [13,14]): pðy t jmðtÞ; rÞ ¼ Gðy t þ rÞ Gðy t þ 1ÞGðrÞ where Γ(�) is the gamma function. The above parameterization is frequently encountered in epidemiology [15] and yields a mean Eðy t Þ ¼ mðtÞ and variance Vðy t Þ ¼ mðtÞ þ mðtÞ 2 =r, so that ρ is the parameter responsible for overdispersion (variance larger than the mean) that is absent in a Poisson setting. In the limiting case lim r!þ1 Vðy t Þ ¼ mðtÞ ¼ Eðy t Þ and we recover the mean-variance equality of the Poisson model. The key argument in favor of a negative binomial distribution is thus its ability to capture the often encountered feature of overdispersion present in infectious disease count data [16]. We assume that μ(t) evolves smoothly over the time course of the epidemic and model it with cubic B-splines [17]: where θ = (θ 1 , . . ., θ K ) > is the vector of B-spline amplitudes to be estimated and b(�) = (b 1 (�), . . ., b K (�)) > is a cubic B-spline basis defined on the domain T ¼ ½r l ; T�, where r l is a lower bound on the time axis, typically the first day of the epidemic (i.e. r l = 1). The philosophy behind P-splines consists in specifying a "large" number K of basis functions together with a discrete roughness penalty λθ > Pθ as a counterforce to the induced flexibility of the fit. The parameter λ > 0 acts as a tuning parameter calibrating the "degree" of smoothness and P ¼ D > r D r þ εI K is a penalty matrix built from rth order difference matrices D r of dimension (K − r) × K perturbed by an ε-multiple (here ε = 10 −6 ) of the K-dimensional identity matrix I K to ensure full rankedness. There are several attractive reasons to use P-splines for smoothing the epidemic curve and R t . First, as the P-splines setting specifies an abundant number of B-spline basis functions coupled with a penalty on the spline coefficients to control for overfitting, the resulting μ(t) fit is smooth and estimates can be obtained for any t on the continuous time domain. Second, even if the number K of B-splines is free to choose, the shape of the fitted R t curve is actually regulated by the smoothing parameter λ and hence only negligibly affected by the arbitrary choice of K, provided it is large enough [18]. Third, the intrinsic sparseness of P and of the B-spline basis matrix is computationally appealing as it softens the algorithmic implementation and yields numerically stable routines [19,20]. Another key advantage of Psplines smoothers is their natural formulation in a Bayesian framework by translating difference penalties on contiguous B-spline coefficients into Gaussian random walk smoothness priors [21]. Following the latter reference, we impose a Gaussian prior on the vector of spline coefficients θjl � N dimðθÞ ð0; Q À 1 l Þ, with precision matrix Q λ = λP. For full Bayesian inference, the following priors are imposed on the model hyperparameters. Following [22], a robust Gamma prior is specified for the roughness penalty parameter ljd � Gð�=2; ð�dÞ=2Þ, where Gða; bÞ is a Gamma distribution with mean a/b and variance a/b 2 , ϕ = 2 and δ is an additional dispersion parameter with hyperprior d � Gða d ¼ 10; b d ¼ 10Þ. This prior specification favors "small" λ values and translates the belief that a wiggly R t fit is more inclined to arise during the epidemic period as opposed to an oversmoothed fit. Finally, the following uninformative prior is imposed on the overdispersion parameter r � Gða r ¼ 0:0001; b r ¼ 0:0001Þ. Let η ≔ (λ, ρ) > denote the vector of hyperparameters. The full Bayesian model is thus: y t jmðtÞ; r � NegBinðmðtÞ; rÞ;

Laplace approximation to the conditional posterior of θ
The Laplace approximation has two key roles in the proposed EpiLPS methodology. First, it determines the approximating distribution to the (conditional) posterior of the spline vector θ that will be used to estimate the average incidence of cases at time t, i.e. Eðy t Þ and hence R t via the renewal equation. Second, the variance-covariance matrix of the Laplace approximation is used to quantify the uncertainty of the instantaneous reproduction number through a "delta" method in LPSMAP and is also introduced in the proposal distribution of the LPSMALA algorithm to form the skeleton of the correlation structure for the spline components. The synergy between Laplace approximations and P-splines has already been shown to be very effective for modeling count data (see for instance [23], in the context of generalized additive models). The log-likelihood for the negative binomial model is given by: with g(y t , ρ) = log Γ(y t + ρ) − log Γ(ρ) and _ ¼ denoting equality up to an additive constant. The gradient of the log-likelihood with respect to the spline coefficients is: where: ðy t þ rÞ expðθ > bðtÞÞ ðexpðθ > bðtÞÞ þ rÞ b k ðtÞ; k ¼ 1; . . . ; K: The Hessian of the log-likelihood with respect to the B-spline amplitudes is: with entries: @ 2 'ðθ; r; DÞ @y k @y l ¼ À rðy t þ rÞ expðθ > bðtÞÞ ðexpðθ > bðtÞÞ þ rÞ 2 b k ðtÞb l ðtÞ; k; l ¼ 1; . . . ; K: Using Bayes' rule, the conditional posterior of θ for a given η is: pðθjη; DÞ / Lðθ; r; DÞpðθjlÞ where Lðθ; r; DÞ denotes the likelihood function. The gradient and Hessian of the log-likelihood (3) can be used to compute the gradient and Hessian of the (log-)conditional posterior (4), namely: r θ log pðθjη; DÞ ¼ r θ 'ðθ; r; DÞ À lPθ; r 2 θ log pðθjη; DÞ ¼ r 2 θ 'ðθ; r; DÞ À lP: The above two equations will be used iteratively in a Newton-Raphson algorithm to obtain the Laplace approximation to the conditional posterior of θ: where θ � (η) and S � (η) is the mode and variance-covariance respectively after convergence of the Newton-Raphson algorithm. The latter two quantities are functions of the hyperparameter vector η. An intuitive choice for η is to fix it at its maximum a posteriori. This is the option retained here, although it is also possible to work with a grid-based approach [23,24].

Hyperparameter optimization
The hyperparameter vector η = (λ, ρ) > will be calibrated by posterior optimization. Following [25] and [24], the hyperparameter vector can be approximated as follows: e pðη; djDÞ / Lðθ; r; DÞpðθjlÞpðljdÞpðdÞpðrÞ e p G ðθjη; DÞ j θ¼θ � ðηÞ : Approximation (6) can be written extensively as: where the K/2 power of λ comes from the determinant jQ À 1 l j is the kernel of a Gamma distribution for the dispersion parameter δ, the following integral can be analytically solved: Eq (7) is numerically optimized and yields e η � ¼ argmaxη log e pðe ηjDÞ. Plugging the latter vector into the Laplace approximation (5), we obtain the estimateθ ¼ θ � ðe η � Þ of the spline vector. The latter can be seen as a MAP estimate of θ. Thus, the approximated (conditional) posterior of the spline vector is: and can be used to construct credible intervals for functions that depend on θ, such as R t as shown in the following section.

Estimation of R t with LPSMAP
The renewal equation "plug-in" estimate. In this section, we show how the negative binomial model for smoothing incidence counts can be used to estimate R t through the renewal equation. Let φ = {φ 1 , . . ., φ k } be a known k-dimensional vector representing the serial interval (SI) distribution, where φ s is the probability that the SI is equal to s day(s), i.e. φ s ¼ PðSI ¼ sÞ. We also assume P k s¼1 φ s ¼ 1 and PðSI � 0Þ ¼ PðSI > kÞ ¼ 0. The renewal model [10,11] gives a mathematical statement of equality between the mean incidence of cases at time step t and a product between the reproduction number R t and a convolution involving antecedent cases and the serial interval distribution: where L t ¼ P tÀ 1 s¼1 φ s y tÀ s denotes the number of circulating cases that contribute to active transmission, also known as total infectiousness at time t [5]. Rearranging Eq (9) and taking the length k of the serial interval into account, we obtain an equation with the instantaneous reproduction number on the left-hand side: Eðy t Þð P k s¼1 φ s y tÀ s Þ À 1 for k < t � T: Our Bayesian "plug-in" estimator of R t at time step t is obtained by replacing the average number of cases Eðy t Þ ¼ mðtÞ by the estimated averagemðtÞ ¼ expðθ > bðtÞÞ and by replacing PLOS COMPUTATIONAL BIOLOGY y t−s bymðt À sÞ ¼ expðθ > bðt À sÞÞ: Note that the MAP estimate of the overdispersion parameter affects the estimatemðtÞ viaθ.
Using the indicator function Ið�Þ, i.e. IðAÞ ¼ 1 if condition A is true and IðAÞ ¼ 0 otherwise, the above estimator can be written in a single line: Credible intervals for R t . Using the functional relationship between R t and θ as in Eq (12), the log of the instantaneous reproduction number can be written as: Note that h(θ|t) is seen here as a function of the spline vector θ for a given time point t. A (1 − α) × 100% approximate credible interval for R t is obtained via a "delta" method. Consider a first-order Taylor expansion of h(θ|t) around θ � ðe η � Þ (henceforth θ � for the sake of a light notation), the mean of the Laplace approximated posterior of the spline vector in (8): where the kth entry of the gradient vector rh(θ|t) = (@h(θ|t)/@θ 1 , . . ., @h(θ|t)/@θ K ) > is: It follows that for k = 1, . . ., K, we have: The Taylor expansion in (13) is a linear combination of the vector θ that is a posteriori (approximately) Gaussian due to the Laplace approximation. As the family of Gaussian distributions is closed under linear combinations, it follows that h(θ|t) (and hence log R t ) is a posteriori also (approximately) Gaussian with mean EðhðθjtÞÞ � hðθ � jtÞ and variance VðhðθjtÞÞ � r > hðθjtÞj θ¼θ � S � rhðθjtÞj θ¼θ � , where S � ≔ S � ðe η � Þ is the covariance matrix of the Laplace approximation (8). This suggests to write: The accuracy of the variance approximation in (14) can be improved through a scaling of the covariance matrix S � by multiplying it with the scaling factor kr t ≔ð1 þr À 1m ðtÞÞ À 1 , corresponding to the estimated mean-to-variance ratio Eðy t Þ=Vðy t Þ at time step t (see S2 Appendix). The (approximate) posterior distribution for R t is thus given by

Estimation of R t with LPSMALA
In Bayesian statistics, posterior distributions obtained with Bayes' theorem often entail a high degree of complexity and are typically not analytically tractable. To circumvent this problem, MCMC methods have been developed for generating samples from (possibly unnormalized) target distributions [26]. One of the most popular MCMC methods together with the Gibbs sampler [27] is the Metropolis-Hastings (MH) algorithm originally proposed by [28] and later generalized by [29]. In this section, we propose to implement a modified version of the Metropolis-adjusted Langevin algorithm (MALA) [30] within the EpiLPS framework. The major advantage of MALA as compared to MH algorithms is that the proposal distribution is based upon a discretized approximation of the Langevin diffusion that uses the gradient of the target posterior distribution. These "smarter" proposals make use of additional information about the target density so that algorithms based on Langevin dynamics can converge at sub-geometric rates and tend to be more efficient than naive random-walk Metropolis algorithms [31,32]. This motivates our choice for embedding a MALA algorithm in EpiLPS as an efficient way of obtaining MCMC samples for inference on the instantaneous reproduction number R t via the renewal equation. The end-user will thus have a fully flexible choice regarding the underlying approach for estimating R t either via Laplacian-P-splines, where the uncertainty surrounding the parameter λ responsible for smoothing is ignored and λ is fixed at its maximum a posteriori (LPSMAP); or via a modified MALA algorithm, where the uncertainty surrounding the penalty (and overdispersion) parameter is fully taken into account (LPSMALA). The approach permits to obtain samples from the joint posterior of the spline vector and the penalty and overdispersion parameters. The latter can then be injected in functionals of the spline vector to obtain smooth estimates of the epidemic curve as well as the instantaneous reproduction number. Another advantage is that highest posterior density intervals can be easily calculated with LPSMALA.
Conditional posteriors for a "Metropolis-within-Gibbs". Joint posterior of (ζ, λ) Let ζ = (θ > , ρ) > be the (K + 1)-dimensional vector gathering the B-spline coefficients θ and the overdispersion parameter ρ. Using Bayes' theorem, the joint posterior distribution for ζ, λ and δ is: The analytical formulas of the chosen priors are: pðrÞ / r a r À 1 expðÀ b r rÞ: Injecting the above priors into (15) yields: Conditional posteriors of ζ, λ and δ The following conditional posterior distributions can be directly obtained from (16): pðζjl; d; DÞ / r a r À 1 expð'ðζ; DÞ À b r r À 0:5lθ > PθÞ; ð17Þ ðljζ; d; DÞ � Gð0:5ðK þ �Þ; 0:5ðθ > Pθ þ d�ÞÞ; ð18Þ Sampling from the joint posterior pðζ; l; djDÞ As the full conditionals pðζjl; d; DÞ, pðljζ; d; DÞ and pðdjζ; l; DÞ are available, we follow a "Metropolis-within-Gibbs" strategy to sample the joint posterior pðζ; l; djDÞ. In particular, the hyperparameters λ and δ will be sampled in a Gibbs step, while ζ will be sampled using a modified Langevin-Hastings algorithm. This approach is presented in [33] in the context of Bayesian density estimation (see also [34] for the use of MALA in a proportional hazards model and [35] for a recent implementation in mixture cure models). We adapt the algorithm of the latter reference to our EpiLPS methodology. In particular, the variance-covariance matrix in the Langevin diffusion will be replaced by the variance-covariance matrix obtained with LPSMAP. The correlation structure borrowed from LPSMAP improves convergence and chain mixing. with the following log-likelihood under the reparameterization: Let us denote by e ζ ðmÀ 1Þ 2 R ðKþ1Þ the state of the chain at iteration (m − 1). In the Langevin-Hastings algorithm, the proposal for the vector e ζ at iteration m is a draw from the following multivariate Gaussian distribution: where % > 0 is a tuning parameter that has to be carefully chosen in order to reach a desired acceptance rate and S LH is the following block-diagonal variance-covariance matrix: where S � is the K-dimensional covariance matrix obtained with LPSMAP. The gradient of log pð e ζjl; d; DÞ ¼ 'ð e ζ; DÞ À 0:5lθ > Pθ À b r exp ðwÞ þ a r w can be decomposed as follows: rζ log pð e ζjl; d; DÞ ¼ r > θ log pð e ζjl; d; DÞ; @ log pð e ζjl; d; DÞ @w and is analytically available (see S1 Appendix for more details). All the quantities related to the Langevin-Hastings proposal have been analytically derived, so that the draw in (22) can be obtained (for a given value of λ and δ). As in a classic MH algorithm, the next step consists in computing the acceptance probability: where q(�, �) denotes the (Gaussian) proposal distribution and pð�jl; d; DÞ the target (conditional) posterior distribution. Finally, we generate a uniform random variable u � Uð0; 1Þ and accept the proposed vector e ζ ðpropÞ if u � pð e ζ ðmÀ 1Þ ; e ζ ðpropÞ Þ and reject it otherwise. While iterating through the Metropolis-within-Gibbs algorithm, the tuning parameter % is automatically adapted to reach the optimal acceptance rate of 0.57 [31,36,37]. The pseudo-code below summarizes the LPSMALA algorithm. LPSMALA algorithm to sample the posterior pðθ; r; l; djDÞ. 14: end for The adaptive tuning part (line 13) involves the step function H ðzÞ ¼ �Iðz < �Þ þ zIð� � z � A Þþ A Iðz > A Þ, with � = 10 −4 and A ¼ 10 4 , see [33] for details. Finally, the ratio qð e ζ ðpropÞ ; e ζ ðmÀ 1Þ Þq À 1 ð e ζ ðmÀ 1Þ ; e ζ ðpropÞ Þ entering the computation of the acceptance probability (line 6) is derived in S1 Appendix.
Posterior inference with LPSMALA. can be viewed as random draws from the target posterior distribution pðθ; r; l; djDÞ. Note that a convenient starting point for the initial values of the parameters might be to fix them at their LPSMAP estimate. Given the sample S , inference on quantities that are functions of θ becomes straightforward in the sense that point estimates and credible intervals can be easily obtained. A point estimate for the mean number of incidence counts at time t is taken to be the posterior mean (after discarding the burn-in phase): Note also that S can be used to compute highest posterior density intervals of μ(t) at any point t. Using the renewal equation and the MCMC sample, one can apply the "plug-in" method and recover the following estimate of the instantaneous reproduction number at time point t: Also, using S , one can compute a highest posterior density interval of R t at time step t.

Setting of the simulation study
In this section, a numerical study is implemented with nine epidemic scenarios to assess the accuracy with which EpiLPS is able to track the target reproduction number over time. EpiLPS results are compared with the instantaneous reproduction number estimate from the EpiEstim package [3] using three sliding windows options (the default weekly windows, three days windows and daily windows). In addition, we disentangle between comparisons of EpiLPS against EpiEstim with R t estimates reported on the last day of a window following the convention of [3] and R t estimates reported at the midpoint of a smoothing window following the best practice recommendation of [2]. For EpiLPS, K = 40 (cubic) B-splines are specified with a secondorder penalty and a chain length of 3 000 for LPSMALA (including a burn-in of size 1 000). In each scenario, S = 100 incidence time series of T days are simulated (initiated with 10 index cases). The epidemic data generating process computes the mean incidence at a given day t, i.e. μ(t) according to the renewal equation and the incidence of cases at time point t is sampled from the negative binomial distribution y t � NegBin(μ(t), ρ). The simulation study also accounts for varying degrees of overdispersion by using different values of ρ in the considered scenarios. Furthermore, the incidence data are generated according to three different serial interval distributions, namely φ FLU , φ SARS and φ MERS corresponding to an influenza, a SARS-CoV-1 and a MERS-CoV like serial interval, respectively. The discretized version of the SI distributions are computed by using the Cori et al. (2013) [3] discretization formula assuming a (shifted) Gamma distribution. In Scenario 1, a constant instantaneous reproduction number R t ¼ 1:3 is considered. Scenario 2 imitates an intervention strategy, so that R t ¼ 2 until a sudden drop to R t ¼ 0:9 occurs at day t = 20. The latter scenario allows to check whether EpiLPS is able to quickly react to a sudden change in R t . Scenario 3 is characterized by a more wiggly structure for R t and Scenario 4 considers the case of a vanishing epidemic with a monotonic decreasing reproduction number. Scenarios 5-8 assume the same functional form for R t as in Scenarios 1-4 but with a different serial interval distribution. In Scenario 9, the R t function is chosen in such a way that there is a single large wave in the early phase of the epidemic and a more stable pattern (with smaller waves) in the late phase. Table 1 summarizes the time domain of the epidemic curve, the target R t function, the serial interval distribution and its associated source(s) in the literature. The simulation study is organized as follows. First, we compare EpiLPS with EpiEstim using the convention of Cori and colleagues, namely reporting the R t estimate at the end of the smoothing window, which is well suited for real-time estimation. The latter approach reports the R t estimate computed in the window [t − ω; t], where ω denotes the window width. Next, the Gostic et al. (2020) [2] recommendation is used, where the R t estimate is reported at the center of the window, i.e. [t − ω/2; t + ω/2]. Concentrating on the window midpoint avoids lagged R t estimates at the cost of ruling out estimates at the last ω/2 time points as, in that case, the upper bound of the window reaches future calendar days for which data is not yet available. Fig 1 summarizes the two window structures used in the simulation study.

Comparing EpiLPS with EpiEstim at window boundary
The performance indicators computed for each scenario include the average bias, mean square error (MSE), coverage probability (CP) and width (CI Δ ) of 90% and 95% credible intervals for the R t estimator (see detailed formulas in S2 Appendix) with EpiLPS and EpiEstim, respectively. Estimates obtained during the first week of the epidemic are ignored as they may be subject to serious bias due to the poor information carried by the (few) incident cases in such an early phase. Therefore, the performance measures are computed as an average over days  Fig 1). A detailed description of the data generating process and the figures of the estimated R t trajectories for all scenarios are provided in S2 Appendix. The performance measures given in Tables 2 and 3 provide interesting insights into the behavior of EpiLPS and EpiEstim across the considered scenarios. In terms of bias, EpiLPS is really competitive against EpiEstim as both LPSMAP and LPSMALA outperform EpiEstim (no matter the time window size) in Scenarios 4-8. For the remaining scenarios, the bias between the two competing methods is more or less similar. Regarding the MSE, EpiLPS exhibits smaller values as compared to EpiEstim with three days and daily windows respectively across all scenarios. Moreover, specifying smaller time windows in EpiEstim leads (generally) to an increase in MSE and also an increase in bias. A close inspection of the coverage probability of credible intervals reveals that EpiLPS has close to nominal coverage in almost all scenarios. This is however not the case for EpiEstim, especially for weekly and three days windows, where severe to mild undercoverage is observed. Also, EpiEstim tends to show more severe undercoverage in scenarios where data is more overdispersed (see e.g. Scenario 4). More importantly, even when EpiEstim approaches the nominal coverage probability (with daily windows), it has much wider credible interval width (and so less precision) as compared to EpiLPS in almost all scenarios.
Figs 2 to 4 summarize the epidemic curves and the trajectories obtained for the estimated R t with LPSMAP (blue curves) and EpiEstim under weekly sliding windows (green curves) for selected scenarios. These figures highlight the flexibility and the precision with which Laplacian-P-splines are able to capture the reproduction number over the time course of the PLOS COMPUTATIONAL BIOLOGY epidemic. The dashed (dotted) curves represent the pointwise median (computed over the S = 100 estimates) of R t with LPSMAP (EpiEstim). For LPSMAP, it closely follows the true pattern of R t even under strong nonlinearities as in Fig 4. The EpiEstim trajectories appear shifted to the right of the target R t curve. This lag is due to the fact that for weekly sliding windows, the R t estimate provided by EpiEstim at the end of the window is entirely based on data

PLOS COMPUTATIONAL BIOLOGY
from past days and is therefore lagged compared to the target (instantaneous) R t . This shift effect can be corrected by decreasing the time window (e.g., using daily windows) at the cost of more "noisy" trajectories. Even then, the median R t estimates of EpiEstim appear to capture less precisely the target R t function as compared to LPSMAP/LPSMALA (see S2 Appendix) across most of the considered scenarios. To summarize, this simulation study sheds light on the trade-off faced by the Cori method when estimating the instantaneous reproduction number. Choosing a weekly sliding window as a default option in EpiEstim can lead to a forward shifted (and so inaccurate) estimate of R t . Smaller time windows in EpiEstim alleviate the lag effect, but the price to pay is that the fitted R t trajectory is wiggly (undersmoothing) as it captures more variation than necessary [2].
EpiLPS does not suffer from such a trade-off as the latter is naturally solved by P-splines. In fact, one could say that the time window size in EpiEstim is analogue to the smoothing parameter λ in EpiLPS as these quantities will be key for the resulting smoothness of the fit. The major advantage with EpiLPS is that λ is estimated naturally within the Bayesian model (either via maximum a posteriori estimation or MCMC), while the choice of the time window in EpiEstim is chosen freely outside the model.

Comparing EpiLPS with EpiEstim at window midpoint
To correct for the lag effect in EpiEstim resulting from reporting the reproduction number estimate at the end of the window, Gostic and colleagues recommend to report it at the center of the window to obtain an estimate that is more accurately oriented in time. It is therefore important to compare the performance of EpiLPS against this "corrected" EpiEstim output as it is considered a best practice for a retrospective usage and, as such, is a legitimate candidate against EpiLPS which is by nature only partially real-time (see next section). We therefore run the entire simulations for all scenarios one more time accounting for the corrected EpiEstim output under weekly windows ω = 6 and three days windows ω = 2, where the estimated R t is computed at the window midpoint. Results for a daily window (ω = 0) are identical to those reported in Tables 2 and 3, as sliding windows become degenerate intervals at each time step [t, t]. The performance measures are reported in Table 4 and Figs 5 to 7 summarize the estimated trajectories for the same scenarios as in the previous section for the sake of comparison. As expected, the resulting R t trajectories for EpiEstim are now closer to the target and the lag effect has disappeared. Despite this improvement, the performance indicators clearly highlight that EpiLPS outperforms EpiEstim in all scenarios except Scenario 1, where the numbers are of a similar order of magnitude. In general, the EpiLPS approach is less biased and provides credible intervals with close to nominal coverage. Even when correcting for the reporting of R t at the middle of the window, EpiEstim results are less accurate, especially regarding credible intervals with weekly windows that can strongly undercover. This has important implications regarding the recommendation of using EpiLPS in practice and detailed recommendation guidelines are provided below.

Real-time considerations
EpiEstim is a powerful tool to estimate R t in real-time and is probably the best tool currently available to deliver timely estimates of the reproduction number [41]. EpiLPS can be Table 4. Simulation results for EpiLPS and EpiEstim in Scenarios 1-9 for S = 100 simulated epidemics. The performance indicators in Scenarios 1-8 for R t are averaged over days t = 8,. . .,37 for LPSMAP, LPSMALA and weekly windows (EpiEstim) and over days t = 8,. . .,39 for 3 days windows under EpiEstim. In Scenario 9, the performance indicators for R t are averaged over days t = 8,. . .,57 for LPSMAP, LPSMALA and weekly windows (EpiEstim) and over days t = 8,. . .,59 for 3 days windows under EpiEstim. For EpiEstim, R t is reported at the window midpoint.

PLOS COMPUTATIONAL BIOLOGY
EpiLPS: A Bayesian tool for estimation of the time-varying reproduction number PLOS COMPUTATIONAL BIOLOGY considered a real-time approach only to a certain degree, where the real-time concept is partially present but fundamentally different from the one proposed in EpiEstim. By real-time method, we mean a method for which an estimate of the reproduction number at time t uses data up to (and including) time t. Let us assume that EpiLPS is applied on epidemic data over a specific period, say, T ¼ ½1; T�. For time points t = 1, . . ., T − 1, EpiLPS is clearly non realtime as the global smoothing of R t on the "bandwidth" T will be computed based on past, current and future data values. However, at the domain boundary (time point T), the EpiLPS estimate of R t will exclusively make use of data up to time T and is therefore real-time (in the same sense as EpiEstim). The EpiLPS real-time characteristic for this last time point is however only retained temporarily, as if applied (the next day) over the period T � ¼ ½1; T þ 1�, the estimate of the reproduction number at time T is not real-time anymore since it will be computed based on data up to time T and the "future" data value available at time point T+ 1. For EpiEstim, the real-time characteristic of the R t estimate is retained for any time point t and is therefore more suitable for timely estimation. The real-time properties of EpiLPS and EpiEstim are compared and illustrated in Fig 8. The extensive simulation results provided here, suggest that EpiLPS imposes itself as a robust retrospective estimation method. In particular, it seriously addresses a challenge faced by many existing methods, namely that R t estimates typically lead or lag the true value [2]. EpiLPS is therefore a powerful retrospective tool to estimate the reproduction number during and/or after epidemic outbreaks. It is however less preferable than EpiEstim for real-time estimation and should therefore be used with care for timely purposes.

Computing time and sensitivity analyses
The computational time of the EpiLPS algorithm is mainly affected by the number K of Bsplines specified in the basis and the total number of days T of the epidemic. Table 5 gives an overview of the real elapsed time in seconds required to run the EpiLPS routines for different (T, K) couples. Obviously, LPSMAP requires far less computational resources as it is a completely sampling-free approach relying on the MAP estimate of the hyperparameter vector. Even with an epidemic of roughly two months and K = 60, LPSMAP is extremely fast and delivers results in a fraction of a second. LPSMALA needs a larger computational budget as the algorithm relies on an iterative sampling scheme (MCMC). However, even for (T = 60, https://doi.org/10.1371/journal.pcbi.1010618.g008 K = 60), LPSMALA requires less than 10 seconds, which is a relatively reasonable time given the number of parameters involved in the model.
We assessed the sensitivity of the EpiLPS estimated reproduction number with respect to model inputs that are free to choose in order to check whether EpiLPS is robust with respect to different parameter choices. In particular, we focus on the sensitivity of the R t fit (with LPSMAP) to the number K of B-splines and to the parameters a δ and b δ of the Gamma hyperprior on δ. The sensitivity analyses are implemented in S2 Appendix and reveal a negligible sensitivity of the estimated R t curve with respect to the above-mentioned parameters. We also discuss the sensitivity of the reproduction number estimates when computed over time domains of increasing width, for instance on [1, T 1 ] and [1, T 2 ] with T 2 > T 1 . This gives an idea of the magnitude of variation in the estimated R t in the domain [1, T 1 ] when EpiLPS is actually fitted on the wider domain [1, T 2 ]. Results show that despite having values of R t that vary (in the past) when applied to larger time domains due to the global smoothing approach inherent in EpiLPS, the estimated values of the reproduction number remain reasonably close to the target. S3 Appendix provides ancillary results on the estimation performance of the overdispersion parameter ρ and sensitvity analyses of the computed credibles intervals for R t with respect to different couples (a δ , b δ ).

Application to observed case counts in infectious disease epidemics
Epidemics of SARS-CoV-1 and influenza A H1N1. In this section, the LPSMALA algorithm is applied on two historical outbreak datasets presented in [3]. In particular, we consider the 2003 SARS outbreak in Hong Kong and the 2009 pandemic influenza in a school in Pennsylvania (USA). We use K = 40 B-splines with a second-order penalty and the serial interval distributions provided in the EpiEstim package [4]. The LPSMALA algorithm is implemented with a chain of length 25 000. Acceptance rates for the generated chains are close to the optimal value of 57% and the posterior samples have converged according to the Geweke (1992) [42] diagnostic test (at the 1% level of significance). Fig 9 shows the smoothed epidemic curves and the estimated R t for the two outbreaks. Results for the SARS data show that the reproduction number reaches a first peak during the third week, whereR t ¼ 9:67 (95% CI: 5.19-16.47) and a second more moderate peak around week 6 withR t ¼ 2:78 (95% CI: 1.82-3.82). After day t = 43, the epidemic is under control and R t smoothly decays below 1. For the pandemic influenza in Pennsylvania, in the end of the second week R t is around 2.05 (95% CI: 1.21-3.06). During the middle of the third week, the situation is less severe and R t points below 1. As noted in [3], a few cases appeared in the last days of the epidemic generating an upward trend in R t estimates. Application on the SARS-CoV-2 pandemic. The EpiLPS methodology is illustrated on the SARS-CoV-2 pandemic using publicly available data from the Covid-19 Data Hub [43] and its associated COVID19 package on CRAN (https://cran.r-project.org/package= COVID19). Country-level data on hospitalizations for Belgium, Denmark, Portugal and France from April 5th, 2020 to October 31st, 2021 is used and a serial interval distribution with a mean of 3 days (and standard deviation of 2.48 days) is assumed [44] discretized as φ = {0.344, 0.316, 0.168, 0.104, 0.068}. In Fig 10, the estimated reproduction number obtained with EpiLPS and EpiEstim respectively, is shown for the four countries. Results are obtained with the LPSMAP algorithm using K = 30 B-splines and a second-order penalty. The gray shaded surface corresponds to 95% (pointwise) credible intervals for R t with LPSMAP and the dashed curves are for EpiEstim. From a computational perspective, it takes less than 3 seconds to fit the EpiLPS model for the four countries. The fitted reproduction numbers reflect the different waves of the COVID-19 pandemic and the rise in infections in the beginning of September 2021. We also see that EpiLPS tends to follow the same trend as the estimates provided by EpiEstim, the only difference is that LPSMAP estimates appear globally smoother with credible intervals that are less wide for Belgium, Denmark and Portugal.

Discussion
EpiLPS (an acronym for Epidemiological modeling with Laplacian-P-Splines) is a fast and flexible tool for Bayesian estimation of the instantaneous reproduction number R t during epidemic outbreaks. The tool is flexible in the sense that (penalized) spline based approximations provide smoothed estimates of R t with little computational effort and without the constraint of imposing any sliding window assumption that could potentially affect the timing and accuracy of the estimator. Moreover, the end user has the choice between a fully sampling-free approach (LPSMAP) or an efficient MCMC gradient-based approach with Langevin diffusions (LPSMALA) for inference. The available EpiLPS package (https://cran.r-project.org/package= EpiLPS) allows public health policy makers to analyze incoming data faster than existing methods relying on classic MCMC samplers, thus permitting them to be better informed when taking decisions on control measures for infectious disease outbreaks. Simulation studies in this paper provide encouraging results and support EpiLPS as being a robust tool capable of a precise tracking of R t over time. The EpiLPS software package and the early website version (https://epilps.com) provide additional guiding material about the proposed methodology.
EpiLPS cannot be termed a real-time method in the same sense as in the Cori method and is therefore less preferred than EpiEstim for real-time analysis. Conceptually, EpiLPS and EpiEstim both use data from the past (EpiLPS also uses data from the future) to estimate the PLOS COMPUTATIONAL BIOLOGY instantaneous reproduction number, but the mechanisms underlying the use of past observations differ. The method of Cori looks back in time only as far as the width of the chosen time window in terms of infected individuals. EpiLPS on the contrary has a stronger reach as the Psplines smoother approximates the reproduction number globally (or blockwise), over the entire domain of the epidemic curve, i.e. retrospectively and also including future values (except for the estimate of R t at the last day of the domain of the epidemic curve which makes use of the current day value and past values). This difference has important consequences and implies advantages as well as disadvantages. The advantage of working with a time window option as in EpiEstim is that one can control how far back in time to look in order to compute the desired R t estimate. This is not an option in EpiLPS as the penalty parameter, the key driver of the degree of smoothness of the fitted R t curve, is estimated within the model and is not fixed by the user. There is however no free lunch and the downside of having a time window choice in EpiEstim implies to face a trade-off between potential oversmoothing (with a wide time window) and undersmoothing (with a narrow time window). This trade-off is virtually absent in the EpiLPS setting as P-splines internally deal with the smoothing problem.
It is evident that when applying EpiLPS sequentially over time on epidemic curves with wider and wider domain length such as [1, T 1 ], [1, T 2 ], [1, T 3 ] with 1 < T 1 < T 2 < T 3 , the R t estimate over past days (for instance t 2 [1, T 1 ]) will inevitably change as EpiLPS is by nature a global smoother. This past variability should not be seen as a drawback as it is essentially an "update" taking into account the fact that the method works with an epidemic curve with a longer domain. The real question is whether the past variability of the R t estimate remains in a close neighborhood of the "true" value of the reproduction number for past days. On that side, the complete simulation study is rather convincing as it shows that EpiLPS is an accurate method that is successful in capturing the evolution of R t over time.
There are also other aspects with respect to which EpiEstim and EpiLPS differ. For instance, prior specification in EpiEstim assumes a Gamma distributed prior on the reproduction number which is conjugate to the Poisson likelihood (EpiEstim assumes that incidence at time step t is Poisson distributed), so that the posterior of R t also has a Gamma distribution. In EpiLPS, the prior(s) are not directly imposed on the reproduction number, but on the spline parameters (and hyperparameters) and the resulting posterior distribution of R t with LPSMAP is approximated by a lognormal distribution. Regarding computational complexity, EpiEstim and LPSMAP deliver estimates almost instantly, while LPSMALA requires a larger computing budget as it is a MCMC algorithm. We therefore recommend using LPSMALA over shorter epidemic durations and LPSMAP on longer outbreaks over several months. Our analysis suggests that EpiLPS might be more accurate than EpiEstim in presence of overdispersed epidemiological data, especially when it comes to quantify the uncertainty of R t as EpiLPS is shown to have narrower credible intervals with good coverage performance. A main limitation is that EpiLPS is more prone to numerical instability (e.g. during hyperparameter optimization or in the Newton-Raphson algorithm for the Laplace approximation) than EpiEstim, although such problems were not encountered here. Finally, it is also worth mentioning that R t estimates delivered by EpiLPS (and EpiEstim) are prone to potential biasing effects [2,45] since the serial interval is used as a surrogate for the generation interval (time elapsed between infection events of an infector and an infectee) as the latter is less easily observed.
The EpiLPS project opens up several future research directions. A possible extension would be to formulate the EpiLPS model within a zero-inflated (Poisson) framework to cope with incidence time series characterized by an excess of zero counts. Another interesting extension would be to adapt the model to allow for regional variation and imported cases. Moreover, akin to EpiEstim, the EpiLPS methodology could be further developed to explicitly account for uncertainty in the serial interval distribution. Finally, in face of long-lasting epidemic scenarios involving several variants characterized by different levels of virulence, it would be useful to extend the EpiLPS methodology to allow for smooth transitions of the estimated reproduction number accompanying the evolution of variants.