Fast estimation of time-varying infectious disease transmission rates

Compartmental epidemic models have been used extensively to study the historical spread of infectious diseases and to inform strategies for future control. A critical parameter of any such model is the transmission rate. Temporal variation in the transmission rate has a profound influence on disease spread. For this reason, estimation of time-varying transmission rates is an important step in identifying mechanisms that underlie patterns in observed disease incidence and mortality. Here, we present and test fast methods for reconstructing transmission rates from time series of reported incidence. Using simulated data, we quantify the sensitivity of these methods to parameters of the data-generating process and to mis-specification of input parameters by the user. We show that sensitivity to the user’s estimate of the initial number of susceptible individuals—considered to be a major limitation of similar methods—can be eliminated by an efficient, “peak-to-peak” iterative technique, which we propose. The method of transmission rate estimation that we advocate is extremely fast, for even the longest infectious disease time series that exist. It can be used independently or as a fast way to obtain better starting conditions for computationally expensive methods, such as iterated filtering and generalized profiling.


Contents
S0 Preliminaries 1 S1 Example of β(t) estimation using the FC, S, and SI methods 2 S1. 1  This supplement to the manuscript is compiled from the source file S1_Text.Rnw using R 2 version 4.0.2 (2020-06-22) [ Our primary aim here is to make our results entirely reproducible by the reader. Our 4 secondary aim is to make our methods available to potential users. To this end, we 5 introduce our R package fastbeta, which implements 6 -the FC, S, and SI methods for estimating time-varying transmission rates β(t) from 7 time series data; and 8 -peak-to-peak iteration (PTPI) for estimating the initial number of susceptible 9 individuals S 0 from time series data. 10 All methods are based on the SIR model with time-varying rates of birth, death, and 11 transmission: where γ = 1/t gen .

13
The most recent version of fastbeta is located in our GitHub repository and can be 14 installed using install_github() from the remotes package.

15
if (!require(remotes)) install.packages("remotes") remotes::install_github("davidearn/fastbeta") However, readers attempting to compile this document from source must install fastbeta 16 from the plos branch of the repository, which houses the version used at the time of this 17 writing.
18 if (!require(remotes)) install.packages("remotes") remotes::install_github("davidearn/fastbeta", ref = "plos") In the sections to follow, we annotate the R code needed to reproduce results reported 23 in the manuscript. Plotting commands have been suppressed, but are preserved in 24 S1_Text.Rnw . Since our work involves millions of simulations, we have retained certain 25 output in the directory RData/ to significantly reduce compilation time.
26 S1 Example of β(t) estimation using the FC, S, and SI Doing so, we obtain observations of reported incidence at equally spaced time points 33 t k = t 0 + k∆t , k = 0, . . . , n .
We then apply each algorithm (FC, S, SI) to reconstruct the seasonally forced transmission 34 rate from the data. 35 As this simulate-estimate routine is our main investigative approach going forward, we Above, we "omitted" beta_mean from the function call (by supplying a value NA ) and 52 obtained the desired value in the output. 53 Values for N 0 , S 0 , and I 0 (arguments N0 , S0 , and I0 ) were also set internally. This   (2)), expressed per unit ∆t per susceptible per infected.
In this example, there is no environmental noise, simply because par_list specified 68 epsilon = 0 . Hence beta and beta_phi are identical in the returned data frame. 69 S1.3 Estimating the time-varying transmission rate 70 We apply the FC, S, and SI methods to estimate incidence Z, susceptibles S, infecteds I, 71 and the seasonally forced transmission rate β from reported incidence and vital data. We 72 have implemented these methods in our functions estimate_beta_FC() , 73 estimate_beta_S() , and estimate_beta_SI() . We will refer to these collectively as 74 estimate_beta() , but note that there is no function by that name.

75
The first argument of estimate_beta() expects a data frame df with columns t , 76 C , B , and (for the S and SI methods) mu . These specify equally spaced observation times 77 t k (Eq (3)) in units of the observation interval ∆t (i.e., t k /∆t) and, at those times, 78 reported incidence C k , observed births B k , and the observed per capita natural mortality 79 rate µ k expressed per unit ∆t.

80
The second argument expects a list par_list with elements prep , trep , tgen , 81 S0 , and (for the SI method) I0 . These specify values for input parameters p rep , t rep , t gen , 82 S 0 , and I 0 .

125
The next code chunk applies the S and SI methods, without input error, to each 126 simulated reported incidence time series.

## Appending the loess object from earlier fits$SI_loess <-SI_loess
Before proceeding, we should verify that the β k time series contain 1000 1-year cycles. We must also specify the true 1-year cycle that each average 1-year cycle should estimate. 150 We specify the true cycle with a reference time indicating the initial phase. For simplicity, 151 we consider the cycle between times t0 and t0 + period to be the true cycle. Phases of 152 this cycle are specified by times s between 0 and period . To estimate the value of the 153 true cycle at phase s , we evaluate each fit in fits (2 linear interpolants and 1 loess 154 curve) at times t0 + s + (0:(m-1))* period and average the resulting m values.

## First and last time points
get_phase_average() computes this estimate for any s , for a given fit f .
Note that, whereas linear interpolants are just functions that we can evaluate at desired 157 times, loess objects must be passed to predict() to obtain fitted values. Note also that 158 the modulo operation s %% period makes get_phase_average() a periodic function of can lead to substantial improvement in both accuracy and interpretability.

170
In practice, we would like to choose the value q opt of the loess smoothing parameter q 171 that minimizes the error in β loess (t; q) relative to β(t). However, we cannot calculate (and 172 therefore cannot minimize) the error in β loess (t; q) when β(t) is not known. In this 173 situation, we can still estimate q opt using statistical methods (most notably time series 174 cross-validation [2]) or by direct inspection of β loess (t; q) for each value of q on a grid.

175
In our simulated data setting, we do know β(t) and can therefore determine q opt exactly. In this setting, it is instructive to quantify the reduction in error that is achieved when the optimal loess estimate β loess (t; q opt ) is chosen over the raw time series estimate β k . For each of these 41 × 100 simulations, we carry out the following steps. We estimate the 10 and 110, we fit a loess curve β loess (t; q) to the raw time series estimate β k . Finally, we Hence, for each value of p rep , we obtain 100 values for each of RRMSE raw , RRMSE loess,min , 189 and q opt . We can preallocate space for this output.

## Standardize missing values in the raw estimates. loess() ## handles NA but complains about NaN and Inf
For a given time series, this setting may not be optimal (q * = q opt ), but can be justified, as 204 we explain in the sections to follow.

205
S5 Sensitivity to data-generating parameters 206 Error in estimates of the seasonally forced β(t) (Eq (5)) from simulated reported incidence 207 data is a function of data-generating parameters, given by In order to measure the sensitivity of β(t) estimation error to θ, we must define grids of 209 parameter values. For this task, it is helpful to associate with each parameter a reference (1) with constant vital rates ν c and µ c and a seasonally forced transmission rate (Eq (5)).

234
For comparison, this is done using both the S and SI methods. We fix q = q * (Eq (7) parameter q used when fitting loess curves to raw transmission rate estimates β k . 247 loess_par [1] is used with the S method. loess_par [2] is used with the SI 248 method. and method ( "S" or "SI" ). It returns an array whose [i, j] th entry is the median 262 RRMSE for the (i, j)th (R 0 , α) grid point.  We estimate β(t) from each simulated reported incidence time series, without input error, 278 fit a loess curve β loess (t; q) to the raw estimate β k , and record the RRMSE in β loess (t k ; q).

303
As with test_s2dgbeta() , we apply get_rrmse_50pct() to the output of

S5.3 A note on smoothing 312
To generate Figs 5 and 6, we fixed q = q * (Eq (7)) when fitting loess curves β loess (t; q) to 313 raw transmission rate estimates β k . For a given β k time series, this setting may not have 314 been optimal (q * = q opt ), meaning that the RRMSE calculated for β loess (t k ; q) was greater 315 with q = q * than it would have been had we found q opt and used q = q opt .

316
This is potentially problematic, because sensitivity to data-generating parameters is 317 mediated by propagation of noise from the simulated reported incidence data to β k . We 318 may have observed less sensitivity to a parameter (for example, t gen in Fig 6) had we 319 smoothed more when there was extreme noise in β k (i.e., had we set q = q opt when 320 q opt > q * ). We did not do this, because finding q opt for each of the 5 × 10 6 time series We see that a period-doubling bifurcation occurs between t gen = 13 days and 337 t gen = 2 −1.5 · 13 days, with C k attaining a much smaller minimum in the time series with a 338 2-year cycle (generated by t gen = 2 −1.5 · 13 days).

339
Due to much closer approaches to zero by incidence and prevalence with 340 t gen = 2 −1.5 · 13 days, noise in C k is amplified to a much greater extent in the raw 341 transmission rate estimate β k . We show this by applying the SI method without input 342 error to estimate the underlying, seasonally forced transmission rate β(t) (Eq (5))-which 343 was the same across simulations-from each C k time series.

## List of loess objects encoding the fitted loess curves
The RRMSE in each of these estimates is calculated as before.

362
Comparing β k , β loess (t; q * ), and β loess (t; q opt ) for each value of t gen , we find that when noise 363 in β k is severe (in this example, when t gen = 2 −1.5 · 13 days), even an optimal degree of 364 smoothing cannot recover the true β(t) from the noise, due to underlying bias. No amount 365 of variance reduction can correct the error due to bias. For this reason, smoothing β k using 366 q * for the loess smoothing parameter q was never much worse than smoothing using the 367 optimal value q opt , even when q * q opt (as was the case with t gen = 2 −1.5 · 13 days): Error in estimates of the seasonally forced transmission rate (Eq (5)) from simulated 373 reported incidence data is also a function of the user-specified values of input parameters, 374 given by Input error arises when the user's input mischaracterizes the data-generating process. In 376 our simulated data setting, this occurs when the specified value of an input parameter 377 differs from the value used to simulate data (e.g., when S 0 = S 0 , and so on). of the S and SI methods. We fit loess curves β loess (t; q) to each raw transmission rate 387 estimate β k generated in this process, fixing q = q * (Eq (7)), and record the RRMSE in 388 β loess (t k ; q * ) (See §S6.1 for discussion of the consequences of using fixed q in this analysis.)

389
The above algorithm is implemented in our function test_s2inpars() ("sensitivity to parameter q used when fitting loess curves to raw transmission rate estimates β k .

403
loess_par [1] is used with the S method. loess_par [2] is used with the SI  There are no issues with RRMSE being assigned NA in this analysis, so calculating median 425 RRMSE is more straightforward. We estimate true incidence Z from reported incidence C k as in the SI method: We do this using the true (data-generating) values of p rep and t rep , so that Z k estimates Z with the first peak (and is therefore a subset of all ).

482
Before we construct a call to get_peak_times() , we must ascertain that our time 483 series Z k of estimated incidence is roughly periodic and determine the period. Plotting Z k , 484 it is clear that it is periodic with a 1-year cycle.

Z(t)
In general, it may be helpful to inspect the power spectrum of Z k to determine the period, 486 but we do not do this here. 487 We locate the peaks in Z k with the following call to get_peak_times() . Above, we assigned period the value of 1 year in units of the observation interval. We 489 chose bandwidths bw_mavg = 6 and bw_peakid = 8 using an simple tuning procedure. 490 First, we chose the smallest value of bw_mavg that eliminated noise near peaks in Z k . This 491 was determined by visual inspection of the moving average Z k ( peaks$x_mavg ). Next, we 492 chose an arbitrary value of bw_peakid greater than 5 and less than half of period . This 493 ensured that the definition of a peak was meaningful (a point greater than many of its 494 nearest neighbours) and that peaks were not being compared against other peaks. The 495 exact choice of bw_peakid tends not to be critical provided Z k is smooth near the peaks.

496
Note that the two index vectors all and phase returned by get_peak_times() are 497 identical. In this example, all peaks in Z k are in phase, because the time between peaks is 498 precisely the period (1 year). This is not true in general. For example, a 2-year cycle can 499 have major and minor peaks that are out of phase. In this case, all would index both 500 major and minor peaks, but phase would index either minor peaks or major peaks (but 501 not both).

502
Plotting true incidence Z ( df$Z ), estimated incidence Z k ( Z ), and the central moving 503 average Z k ( peaks$x_mavg ), as well as indicators of the times of peaks in Z k in phase 504 with the first peak ( peaks$phase ), we reproduce Fig 8A (see below). Fig 8A verifies that   505 all of the peaks of interest were identified by get_peak_times() . 506 We conclude the truncation step of the PTPI algorithm by retrieving the index of the 507 first peak in Z k and the index of the last peak occurring at the same phase of the cycle. 508 ## Index of first peak a <-with(peaks, phase [1]) ## Index of last peak in phase with first peak b <-with(peaks, phase[length(phase)]) The precise times t a and t b of these peaks are given by df$ t[c(a, b)] .

S7.2 Iteration step 510
The goal of the iteration step is to use times series Z k , B k , and µ k of (estimated) incidence, 511 births, and the per capita natural mortality rate to iteratively update an initial estimate of 512 S 0 = S(t 0 ). This updating procedure depends on the result of the truncation step. The 513 iteration step is implemented in our function ptpi() , which takes as arguments 514 df , a data frame with columns Z , B , and mu specifying time series Z k , B k and µ k 515 of (estimated) incidence, births, and the per capita natural morality rate;

526
We carry out the iteration step with the following call to ptpi() . For this example, 527 we suppose that our initial guess of S 0 is 4 times greater than its true value, and ask for 528 our estimate to be updated 25 times. We provide the peak indices a and b obtained in 529 the truncation step (see above). Finally, in the first argument, we specify our incidence 530 time series and nothing else, and in the second argument, we specify the data-generating 531 values of N 0 , ν c , and µ c . This means that ptpi() will construct mock vital data without 532 any systematic error and use it in conjunction with the supplied incidence time series.   of every other input parameter. We fit a loess curve β loess (t; q * ) to each raw transmission 555 rate estimate β k , and record β loess (t k ; q * ) as a column in a matrix beta_mat . Finally we 556 plot the columns of beta_mat , scaled by with(par_list, 1/beta_mean) .
In the manuscript, we report the median and 5th and 95th percentiles of the relative 572 error in the estimate of S 0 obtained in the last iteration (for each initial guess The SI method modifies a method presented in [3] by deJonge. Here, we cast the SI 576 method and deJonge's method as two possible algorithms from a set of nine, differing 577 according to (i) how Our function estimate_beta_SI() takes a third argument method , which must be 590 assigned a vector of length 2. method [1] has options "forward" , "backward" , and 591 "trapezoid" (default), telling estimate_beta_SI() how to numerically integrate 592 Eqs (14). method [2] has options "forward" , "backward" , and "both" , (default) 593 telling estimate_beta_SI() how to numerically integrate Eq (15).
For every choice of method [1] (panel title), the best choice of method [2] (legend 618 label) was typically "both" . On the other hand, for a given choice of method [2] , the 619 best choice of method [1] (by a small margin) was typically the one that avoided discretize Eq (15), RRMSE was typically smallest when forward and backward Euler, 622 respectively, were used to discretize Eqs (14). Similarly, when both forward and backward 623 Euler were used to discretize Eq (15), RRMSE was typically smallest when the trapezoidal 624 method was used to discretize Eqs (14). This combination, with 625 method = c("trapezoid", "both") , gave the best performance overall.
Following the pattern of Fig 10, Fig 11 shows  likely accounts for the difference in their performance shown in Fig 10.