Estimating the epidemic reproduction number from temporally aggregated incidence data: A statistical modelling approach and software tool

The time-varying reproduction number (Rt) is an important measure of epidemic transmissibility that directly informs policy decisions and the optimisation of control measures. EpiEstim is a widely used open-source software tool that uses case incidence and the serial interval (SI, the time between symptom onset in a case and in their infector) to estimate Rt in real time. The incidence and the SI distribution must be provided at the same temporal resolution, which can limit the applicability of EpiEstim and other similar methods, e.g. for contexts where the time window of incidence reporting is longer than the mean SI. In the EpiEstim R package, we implement an expectation-maximisation algorithm to reconstruct daily incidence from temporally aggregated data, from which Rt can then be estimated. We assess the validity of our method using an extensive simulation study and apply it to COVID-19 and influenza data. For all datasets, the influence of intra-weekly variability in reported data was mitigated by using aggregated weekly data. Rt estimated on weekly sliding windows using incidence reconstructed from weekly data was strongly correlated with estimates from the original daily data. The simulation study revealed that Rt was well estimated in all scenarios and regardless of the temporal aggregation of the data. In the presence of weekend effects, Rt estimates from reconstructed data were more successful at recovering the true value of Rt than those obtained from reported daily data. These results show that this novel method allows Rt to be successfully recovered from aggregated data using a simple approach with very few data requirements. Additionally, by removing administrative noise when daily incidence data are reconstructed, the accuracy of Rt estimates can be improved.


Introduction
As infectious disease outbreaks become more common, it is increasingly important to rapidly characterise the threat of emerging and re-emerging pathogens [1]. Transmissibility, i.e. a pathogen's ability to spread through a population, can be quantified using the time-varying reproduction number, R t, defined as the average number of infections that are caused by a primary case at time t of an outbreak. R t signals whether an outbreak is growing (R t > 1) or declining (R t < 1), and whether current interventions are sufficient to control the spread of the disease.
One of the most popular tools for real-time R t estimation, the R package EpiEstim, relies on observing the incidence data and supplying an estimated serial interval (SI) distribution, i.e. the distribution of the time between symptom onset in a case and in their infector. EpiEstim requires that the SI distribution and incidence data are supplied using the same time units. This can be problematic when daily incidence data are not reported, which is common for many diseases, such as influenza, Zika virus disease, and most notifiable diseases in countries such as the UK and the US [2][3][4][5]. Additionally, several studies intentionally aggregate data to reduce the impact of daily reporting variability; administrative noise such as "weekend effects" is characterised by a drop in reported cases over weekends, due to reduced care seeking and longer delays in reporting, followed by a peak on Mondays [6,7]. A commonly used workaround is to aggregate the SI distribution to match the frequency of incidence reporting [8,9]; however, this is not possible if the SI is shorter than the aggregation window of the data. For example, influenza-like illness is typically reported on a weekly basis, but influenza has an estimated mean SI of 2-4 days [10,11]. Similarly, reporting of COVID-19, which has an estimated SI of 3-7 days, has typically moved from daily to weekly [12,13]. Therefore, enabling estimation of R t from temporally aggregated data is critical to ensure methods such as EpiEstim are widely applicable [14].
In this study, we combine an expectation-maximisation (EM) algorithm with the renewal equation approach implemented in EpiEstim to reconstruct daily incidence from aggregated data and estimate R t. We assess the performance of the method using influenza and COVID-19 data, in addition to an extensive simulation study.

Competing interests: I have read the journal's policy and the authors of this manuscript have the following competing interests: AC has received payment from Pfizer for teaching on a course for mathematical modelling of infectious disease transmission and vaccination.

EpiEstim
EpiEstim uses the renewal equation (Eq 1), a form of branching process model [15]. In this formulation, the incidence of new symptomatic cases at time t (I t) is approximated by a Poisson process:
$$I_t \sim \text{Poisson}\left(R_t \sum_{s=1}^{t} I_{t-s}\, g_s\right) \tag{1}$$
where I t-s is the past incidence, and g s is the probability mass function of the serial interval.
With EpiEstim, R t can be assumed to remain constant within user-defined time windows, which smooth out estimates.
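As a minimal sketch of the renewal equation described above, the expected incidence on day t is R t multiplied by the past incidence weighted by the serial interval distribution. The Python below is illustrative only and is not EpiEstim code; the function names and the toy serial interval are assumptions made for the example.

```python
def total_infectiousness(incidence, si_pmf):
    """Return Lambda_t = sum_{s>=1} I_{t-s} * g_s for each day t (0-based).

    si_pmf[s] is the probability g_s of a serial interval of s days,
    with si_pmf[0] = 0 (no same-day transmission in this sketch).
    """
    max_s = len(si_pmf) - 1
    return [
        sum(incidence[t - s] * si_pmf[s] for s in range(1, min(t, max_s) + 1))
        for t in range(len(incidence))
    ]


def expected_incidence(incidence, si_pmf, r_t):
    """Mean of the Poisson distribution for I_t: R_t * Lambda_t."""
    lam = total_infectiousness(incidence, si_pmf)
    return [r * l for r, l in zip(r_t, lam)]


# Hypothetical toy inputs: SI of 1 or 2 days with equal probability.
si_pmf = [0.0, 0.5, 0.5]
incidence = [10, 10, 10, 10]
means = expected_incidence(incidence, si_pmf, [2.0] * 4)
print(means)  # [0.0, 10.0, 20.0, 20.0]
```

The first day has no past incidence, so its expected value is zero; this is one reason estimation cannot start on the very first day of the data.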

Extending EpiEstim for coarsely aggregated data
We extended EpiEstim to estimate R t from aggregated incidence data, where each aggregation window (w) is >1 day, whilst still conditioning on an assumed serial interval distribution (g s ).
We use an EM algorithm to iteratively reconstruct daily incidence from aggregated data, and in turn estimate R t. We present the method with weekly data in mind, but the method and software can be applied to any temporal aggregation (Fig 1 and S1 Appendix pp 23-24). We define:
• $\mathbf{I} = \{I_t\}_{t=1,\dots,T}$, the vector of unobserved daily incidence,
• $\mathbf{A} = \{A_w\}_{w=1,\dots,W}$, the vector of observed aggregated incidence, so that for each aggregation window w, $A_w = \sum_{t=t_{w-1}+1}^{t_w} I_t$,
• $\mathbf{R}^* = \{R^*_w\}_{w=1,\dots,W}$, the vector of reproduction numbers corresponding to each incidence aggregation window. The * indicates that this is only used in the EM algorithm to reconstruct the daily incidence I, and is distinct from the final estimated R.
We use the following indexes:
• t for days (t = 1, ..., T),
• w for aggregation windows (w = 1, ..., W),
• i for iterations of the EM algorithm (i = 0, ..., 10).

Fig 1. Schematic of the EM algorithm approach used to reconstruct daily incidence (I) from temporally aggregated incidence data (in this case weekly, A). The algorithm is initialised with a naive disaggregation of the weekly incidence (assuming constant daily incidence throughout the aggregation window, left panel). The resulting daily incidence is then used to estimate the reproduction number for each aggregation window, in this case for each week, R* (expectation step, central panel). R* is converted into a growth rate (see Eq 7), which is in turn used to reconstruct daily incidence data, whilst ensuring that if I were to be reaggregated it would still sum to the original weekly totals (maximisation step, right panel). The process cycles between the expectation and maximisation steps until convergence. https://doi.org/10.1371/journal.pcbi.1011439.g001
In the following, the bold notation signifies vectors. Our overall goal is to maximise the marginal likelihood function. This marginal likelihood seeks to compute daily incidence while marginalising over the conditional probability distribution of the reproduction number. Loosely, the statistical goal is to produce a series of reproduction numbers that can reconstruct daily incidence while still being consistent with the observed aggregated incidence. We use an expectation maximisation scheme to approximate this marginal likelihood at low computational cost. The algorithm involves three steps: initialisation, expectation, and maximisation.

Initialisation
The algorithm is initialised (step i = 0) by disaggregating the aggregated incidence by piecewise constant functions (constant daily incidences across each aggregation window). That is, for aggregation window w covering days $t = t_{w-1}+1, \dots, t_w$:
$$\hat{I}^{(0)}_t = \frac{A_w}{t_w - t_{w-1}}$$
Note that this allows non-integer incidence counts.
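The initialisation can be sketched as follows, assuming equal-length aggregation windows. The function name `naive_disaggregate` and its arguments are hypothetical and chosen only for this illustration.

```python
def naive_disaggregate(aggregated, window_length=7):
    """Spread each aggregated count evenly over the days of its window.

    Returns a flat list of (possibly non-integer) daily incidence values.
    """
    daily = []
    for a_w in aggregated:
        daily.extend([a_w / window_length] * window_length)
    return daily


# Two weeks with 14 and 7 cases become 2 cases/day, then 1 case/day.
daily = naive_disaggregate([14, 7], window_length=7)
print(daily[:7], daily[7:])
```

By construction, re-aggregating the output over each window returns the original totals.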
For iteration i ≥ 1 of the algorithm, we iterate over two steps: expectation and maximisation.

Expectation
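First, we compute the expectation
$$R^{*(i)}_w = E\left[R_w \mid \mathbf{A}, \hat{\mathbf{I}}^{(i)}\right]$$
This expectation computes the average reproduction number over each data aggregation window, given the original observed aggregated incidence and the previous (i th) iteration of the estimate of daily incidence. To compute this, we use the renewal approach from EpiEstim, where the posterior distribution of reproduction numbers is found analytically as
$$R_w \mid \mathbf{A}, \hat{\mathbf{I}}^{(i)} \sim \text{Gamma}\left(a + A_w,\ \left[\frac{1}{b} + \sum_{t=t_{w-1}+1}^{t_w} \sum_{s} \hat{I}^{(i)}_{t-s}\, g_s\right]^{-1}\right)$$
where a and b are the shape and scale of the Gamma prior distribution for R w [15]. The expected value for the reproduction number is therefore calculated as:
$$R^{*(i)}_w = \frac{a + A_w}{\frac{1}{b} + \sum_{t=t_{w-1}+1}^{t_w} \sum_{s} \hat{I}^{(i)}_{t-s}\, g_s} \tag{5}$$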
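The expectation step can be sketched in a few lines, assuming the standard conjugate Gamma-Poisson result used by EpiEstim [15]: the posterior for each window has shape a plus the window's case count, and rate 1/b plus the summed infectiousness over the window, so its mean is the ratio of the two. The names below are hypothetical, windows are assumed to be of equal length, and this is not the package's own code.

```python
def total_infectiousness(daily, si_pmf):
    """Lambda_t = sum_{s>=1} I_{t-s} * g_s, with si_pmf[s] = g_s and si_pmf[0] = 0."""
    max_s = len(si_pmf) - 1
    return [
        sum(daily[t - s] * si_pmf[s] for s in range(1, min(t, max_s) + 1))
        for t in range(len(daily))
    ]


def expected_r_per_window(daily, aggregated, si_pmf, dt=7, shape=1.0, scale=5.0):
    """Posterior mean of R for each aggregation window of length dt.

    shape and scale are the Gamma prior parameters (a and b in the text).
    """
    lam = total_infectiousness(daily, si_pmf)
    means = []
    for w, a_w in enumerate(aggregated):
        lam_sum = sum(lam[w * dt:(w + 1) * dt])
        means.append((shape + a_w) / (1.0 / scale + lam_sum))
    return means


# Toy example: flat incidence of 1 case/day and a serial interval of exactly 1 day.
r_star = expected_r_per_window([1.0] * 14, [7, 7], si_pmf=[0.0, 1.0])
print(r_star)  # second window: (1 + 7) / (0.2 + 7), close to 1.11
```

The first window has little or no past incidence to draw on, so its value is dominated by the prior and should be read with caution.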

Maximisation
The maximisation step consists of recovering the most likely daily incidence from the expected R*, i.e. maximising $P(\mathbf{I}, \mathbf{R}^* \mid \mathbf{A})$, or maximising $P(\mathbf{I}, \mathbf{R}^*) \propto \prod_{t \ge 1} P(I_t \mid I_0, \dots, I_{t-1}, \mathbf{R}^*)$, subject to the constraint that daily incidence sums to the aggregated incidence, i.e. $A_w = \sum_{t=t_{w-1}+1}^{t_w} I_t$. In our renewal equation context, $I_t \mid I_0, \dots, I_{t-1}, \mathbf{R}$ follows a Poisson distribution with mean $R_w \sum_s I_{t-s}\, g_s$ (where w is such that $t_{w-1} < t \le t_w$), and therefore has mode
$$\hat{I}^{(i)}_t = \left\lfloor R^{*(i)}_w \sum_s \hat{I}^{(i)}_{t-s}\, g_s \right\rfloor$$
which we approximate as:
$$\hat{I}^{(i)}_t = R^{*(i)}_w \sum_s \hat{I}^{(i)}_{t-s}\, g_s \tag{6}$$
Wallinga and Lipsitch [16] demonstrate, conditional on the generation time distribution, analytical correspondence between reproduction number R and growth rate r through the link function:
$$R = \frac{1}{\sum_s g_s\, e^{-rs}} \tag{7}$$
We therefore assume local exponential growth so that Eq 6 is equivalent to:
$$\hat{I}^{(i)}_t = k_w\, e^{r^{*(i)}_w t} \tag{8}$$
where $r^{*(i)}_w$ is the exponential growth rate over aggregation window w, obtained from $R^{*(i)}_w$ using the link function in Eq 7. $k_w$ is calculated to ensure the sum of daily incidence values adds up to the observed weekly totals:
$$k_w = \frac{A_w}{\sum_{t=t_{w-1}+1}^{t_w} e^{r^{*(i)}_w t}}$$
We then feed the estimate of I from Eq 8 back into the expectation step and iterate, thus completing the algorithm (Eq 5 for the expectation step and Eq 8 for the maximisation step). At this point, I can be used to estimate the full posterior distribution of R over any time window using EpiEstim (hereafter, this final R is referred to as R t).
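A minimal sketch of the maximisation step for a single window follows. It assumes the discrete form of the Wallinga and Lipsitch link, R = 1 / Σ g_s exp(−r s), solved for r by bisection on a fixed bracket; the bracket of [−1, 1], the function names and the toy serial interval are assumptions for illustration, not the package's implementation.

```python
import math


def growth_rate_from_r(r_number, si_pmf, lo=-1.0, hi=1.0, tol=1e-10):
    """Solve R = 1 / sum_s g_s * exp(-r * s) for the growth rate r by bisection.

    The implied R increases with r, so bisection applies. Values of R that
    fall outside the bracket are clamped to the nearest bound.
    """
    def implied_r(rate):
        return 1.0 / sum(g * math.exp(-rate * s) for s, g in enumerate(si_pmf))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if implied_r(mid) < r_number:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def reconstruct_window(a_w, rate, dt=7):
    """Daily incidence k_w * exp(rate * d), d = 1..dt, scaled to sum to a_w."""
    weights = [math.exp(rate * d) for d in range(1, dt + 1)]
    k_w = a_w / sum(weights)
    return [k_w * x for x in weights]


si_pmf = [0.0, 0.2, 0.5, 0.3]             # hypothetical serial interval pmf
rate = growth_rate_from_r(1.5, si_pmf)    # R > 1 gives a positive growth rate
window = reconstruct_window(70, rate)
print(round(sum(window), 6))              # 70.0: the weekly total is preserved
```

Measuring the day index from the start of the window rather than from the start of the epidemic changes only the constant k_w, not the reconstructed values.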
Given the rapid computational time and convergence (see S1 Appendix pp 10 and 22), the default number of iterations was set to 10 in the R package. However, a convergence check ensures that the final iteration of the reconstructed daily incidence does not differ from the previous iteration beyond a tolerance of $10^{-6}$, and the number of iterations can be modified by the user.
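The iteration control described here can be sketched generically: run a fixed number of updates, then compare the last two iterates against the tolerance. In the sketch below `update` stands in for one combined expectation and maximisation pass; the function name `run_fixed_iterations` and the toy update rule are hypothetical.

```python
def run_fixed_iterations(update, initial, max_iter=10, tol=1e-6):
    """Apply `update` max_iter times and report whether the final step moved
    every element by no more than `tol`.

    Returns (final_estimate, converged).
    """
    current = list(initial)
    last_change = float("inf")
    for _ in range(max_iter):
        new = update(current)
        last_change = max(abs(n - o) for n, o in zip(new, current))
        current = new
    return current, last_change <= tol


# Toy update with fixed point 2.0, used only to exercise the check.
def toy_update(xs):
    return [0.5 * x + 1.0 for x in xs]


_, ok_after_10 = run_fixed_iterations(toy_update, [0.0], max_iter=10)
estimate, ok_after_50 = run_fixed_iterations(toy_update, [0.0], max_iter=50)
print(ok_after_10, ok_after_50)  # False True
```

If the check fails, increasing the number of iterations is the natural remedy, which mirrors the user-adjustable setting described above.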

Case studies
We chose datasets where incidence data was available daily, and then artificially aggregated them to weekly counts. R t was estimated from daily incidence that was reconstructed from weekly aggregated data using our new approach, and compared to R t estimates obtained from the reported daily incidence using the original EpiEstim R package. All R t estimates were made using both daily and weekly sliding time windows, and we refer to those estimates as daily R t estimates and weekly R t estimates respectively.
We considered three characteristics: 1) mean R t estimates, 2) uncertainty in the R t estimates, and 3) the classification of R t as increasing, uncertain or declining (S1 Appendix pp 8-9).To compare the performance of this approach to the original method, we assessed the correlations between each of the three characteristics when using the reported and reconstructed incidence.For the mean R t estimates and uncertainty in R t estimates, we assessed the linear relationships using the Pearson correlation coefficient (where values closer to +1 are indicative of a strong positive correlation).
The gamma distributed priors for R* and R t were set to a mean and standard deviation of 5 (shape = 1, scale = 5), which is the default prior parameterisation used in EpiEstim.The rationale behind this choice is that it ensures that one will not conclude R < 1 unless the data strongly supports that.The user can set the prior themselves.
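The correspondence between the prior's mean and standard deviation and its shape and scale follows from the Gamma moments (mean = shape × scale, variance = shape × scale²). A small sketch, with a hypothetical helper name:

```python
import math


def gamma_shape_scale(mean, sd):
    """Convert a Gamma mean and standard deviation to (shape, scale)."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return shape, scale


shape, scale = gamma_shape_scale(5.0, 5.0)
print(shape, scale)  # 1.0 5.0

# With shape = 1 the Gamma prior is an exponential distribution, so the
# prior probability that R < 1 has the closed form 1 - exp(-1 / scale).
prior_prob_below_one = 1.0 - math.exp(-1.0 / scale)
print(round(prior_prob_below_one, 3))  # 0.181
```

Under this prior, roughly 82% of the prior mass lies above 1, which is consistent with the stated rationale that R < 1 is concluded only when the data support it.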

Influenza
We obtained a five-week subset of a dataset (11th December 2009 to 14th January 2010) on US active component military personnel (those employed by the military as their full-time occupation) who made an outpatient visit to a permanent military treatment facility for a respiratory-related illness. These daily incidence data, by date of presentation at a clinic, were originally obtained by Riley et al. from the Armed Forces Health Surveillance Center and were digitally extracted for use here [17]. We used a mean SI of 3.6 days and SD of 1.6 days [10].

COVID-19
Incidence data for UK COVID-19 cases and deaths were taken from the UK government website [18]. For COVID-19 cases, we obtained ninety-seven weeks of data (21st February 2020 to 30th December 2021) for incidence by date of specimen, which is the date that a sample was taken from an individual who later tested positive. For COVID-19 deaths, we used ninety-six weeks of data (2nd March 2020 to 2nd January 2022) for incidence by date of death within twenty-eight days of a positive test. We assumed a mean SI of 6.3 days and SD of 4.2 days [12].
In the S1 Appendix, we also apply the EM algorithm to weekly incidence data for Zika virus disease to assess the performance of the method on a non-respiratory pathogen.

Simulation study
We considered scenarios where R t either remained constant or varied over time, with a stepwise or gradual change. For each scenario, one hundred 70-day epidemic trajectories were simulated using a Poisson branching process as implemented in the R package projections [19]. Daily datasets were aggregated weekly and used to estimate R t using the proposed method; these values were compared to R t estimates obtained from simulated daily data using the original EpiEstim R package. We explored the impact of weekend effects on R t estimates, the ability to supply alternative temporal aggregations of data (e.g. three-day, ten-day, or two-weekly aggregations), the ability to detect mid-aggregation variations in transmissibility, and finally, the number of iterations required to reach convergence when reconstructing daily incidence data. The full simulation study description and details can be found in S1 Appendix.
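To make the simulation design concrete, the sketch below simulates one trajectory from a Poisson branching process and aggregates it weekly. It is a standard-library stand-in written for illustration, not the projections package; the seed incidence, serial interval and the normal approximation used for large Poisson means are all assumptions.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw from Poisson(mean): Knuth's method for small means and a rounded
    normal approximation for large ones (an approximation, for brevity)."""
    if mean <= 0:
        return 0
    if mean > 30:
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_daily(r_t, si_pmf, seed_cases, rng):
    """Simulate daily incidence with I_t ~ Poisson(R_t * sum_s I_{t-s} g_s)."""
    daily = [seed_cases]
    max_s = len(si_pmf) - 1
    for t in range(1, len(r_t)):
        lam = sum(daily[t - s] * si_pmf[s] for s in range(1, min(t, max_s) + 1))
        daily.append(poisson_draw(rng, r_t[t] * lam))
    return daily


def aggregate(daily, dt=7):
    """Sum daily counts over consecutive complete windows of dt days."""
    n_complete = len(daily) - len(daily) % dt
    return [sum(daily[i:i + dt]) for i in range(0, n_complete, dt)]


rng = random.Random(1)                       # fixed seed for reproducibility
daily = simulate_daily([1.2] * 70, [0.0, 0.2, 0.5, 0.3], seed_cases=10, rng=rng)
weekly = aggregate(daily, dt=7)
print(len(daily), len(weekly))               # 70 10
```

Repeating the simulation with different seeds gives a set of trajectories per scenario, and changing `dt` gives the alternative aggregations mentioned above.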

Results
Hereafter, we refer to reported and reconstructed incidence data; these are the reported daily incidence and the daily incidence that has been reconstructed from weekly aggregated data, respectively.

Influenza
The reconstructed incidence of influenza was much smoother than the reported incidence, which showed clear weekend effects and lower reported cases on two public holidays, both occurring on Fridays (Fig 2A and S1 Appendix p 8). Considering weekly sliding R t first, there was a high correlation in both the mean R t estimates derived from each dataset (R² = 0.91, Fig 2C and S1 Appendix p 2) and their associated uncertainty (R² = 0.93, Fig 2D). The overall agreement in the classification of R t reached 81.8% (see methods and S1 Appendix p 9). In contrast, mean daily R t estimates differed markedly depending on whether the reported or reconstructed data were used, with an R² of 0.13 and much higher mean R t and uncertainty in estimates obtained from reported data (Fig 2E and 2F). Higher mean R t estimates coincided with large peaks in the reported daily incidence (typically on Mondays), as daily R t estimates were not smoothed and therefore more affected by intra-weekly variability (S1 Appendix p 2). The overall agreement in the classification of daily R t estimates was much lower, with only 44.4% agreement (S1 Appendix p 9).
In this case study, the greatest differences in R t estimates tended to correspond to time periods when the reported and reconstructed incidence data were most dissimilar (Fig 2B and S1 Appendix p 3). There was no apparent pattern in the estimates with regard to the outbreak phase, i.e. early, mid or late-phase, but this is likely due to this dataset being a snapshot of incidence taken from within an established epidemic (Fig 2).

COVID-19 cases
The reconstructed incidence of COVID-19 smoothed out intra-weekly variability, caused by factors such as weekend effects (Fig 3A and S1 Appendix pp 7-8). Weekly sliding R t estimates obtained from reconstructed and reported incidence were similar, both in their means (R² = 0.98) and their level of uncertainty (R² = 0.99, Fig 3C and 3D and S1 Appendix p 4). Mean daily R t estimates were less well correlated (R² = 0.67), although the difference is less marked than in the influenza case study (Fig 3E), and the uncertainty in the estimates was similar across both approaches (R² = 0.97, Fig 3F). Most of the discrepant R t estimates and higher levels of uncertainty coincide with the early phase of the outbreak when incidence was lower (Fig 3E and 3F). Outside of periods of low incidence, the largest differences in R t estimates tended to correspond to time periods with greater disparities between the reported and reconstructed incidence data (Fig 3B and S1 Appendix p 5). The overall agreement in the classification of R t estimates was higher than for influenza, with 74.4% and 94.9% agreement for daily and weekly sliding R t estimates respectively (S1 Appendix p 9).

Fig 2. R t estimates from daily incidence that was either reported or reconstructed from weekly aggregated influenza data. A) The reported (grey) and reconstructed (green) daily incidence of influenza by date of presentation at a military clinic. B) Squared error of the daily (orange) and weekly sliding (pink) R t estimates that were made from reconstructed daily data compared to those obtained from the reported daily data. R t estimation starts on the first day of the second aggregation window (day 8, 18th December 2009) and is plotted on the last day of the time window used for estimation (i.e., starting on day 9 (19th December) for daily estimates and day 14 (24th December) for weekly estimates). Note: the x-axis is shared with the incidence plot above. C & E) Correlation between the weekly sliding (C) and daily (E) mean R t estimates using reconstructed data (y-axis) and reported daily data (x-axis). Vertical and horizontal lines depict the 95% credible intervals (95% CrIs) and dotted lines show the threshold of R t = 1. D & F) Correlation between the uncertainty in the weekly sliding (D) and daily (F) R t estimates, defined as the width of the 95% credible intervals, using the reconstructed (y-axis) and reported (x-axis) daily data. The colour of the points in panels C-F corresponds to the epidemic phase, i.e. the early (19th-30th December for daily estimates, or 24th-30th December for weekly sliding estimates), middle (31st December-6th January) or late (7th-14th January) phase of the data, shown by the strip in panel A. Solid lines show the linear model fit with 95% confidence intervals (grey shading). Dashed lines represent the x = y line. https://doi.org/10.1371/journal.pcbi.1011439.g002

COVID-19 deaths
The reported incidence of COVID-19 deaths was much less influenced by day-to-day variation. The reconstructed daily incidence was more similar to the observed daily data than in the previous case studies (Fig 4A). Both weekly and daily R t estimates obtained from weekly data were highly consistent with those obtained from daily observations (R² = 0.98 and R² = 0.80 respectively, Fig 4C and 4E). The overall agreement in R t classifications for daily estimates was the highest of all case studies at 85.8%, and 93.3% for weekly R t estimates (S1 Appendix p 9).
Discrepancies between the two mostly coincide with periods of particularly low incidence of deaths (Fig 4B and S1 Appendix p 7). The overall lower incidence of COVID-19 deaths compared to COVID-19 cases means there is greater uncertainty in R t estimates in this case study (Fig 4D and 4F and S1 Appendix p 6). However, there was minimal difference in the uncertainty of estimates obtained from daily and weekly data (Fig 4D and 4F).
In all case studies, incidence reconstructions converged within 10 iterations of the EM algorithm. The overall process of R t estimation from weekly aggregated data took three seconds or less to run on MacOS (2 GHz Quad-Core Intel Core i5, 16GB RAM) (S1 Appendix p 10); the influenza scenario, with over 57,000 cases, took two seconds to run, whilst the COVID-19 deaths and cases scenarios, with an overall incidence of over 149,000 and 13 million respectively, took three seconds to run.

Fig 4. A) The reported (grey) and reconstructed (green) daily incidence of COVID-19 by date of death within 28 days of a positive test. B) Squared error of the daily (orange) and weekly sliding (pink) R t estimates that were made from reconstructed data compared to those obtained from the reported daily data. R t estimation starts on the first day of the second aggregation window (day 8, 9th March 2020) and is plotted on the last day of the time window used for estimation (i.e., starting on day 9 (10th March) for daily estimates and day 14 (15th March) for weekly estimates). Note: the x-axis is shared with the incidence plot above and the y-axis has been limited to 0.5 for clarity. C & E) Correlation between the weekly sliding (C) and daily (E) mean R t estimates using reconstructed (y-axis) and reported daily data (x-axis), starting on day 30 due to low incidence. Vertical and horizontal lines depict the 95% credible intervals (95% CrIs) and dotted lines show the threshold of R t = 1. D & F) Correlation between the uncertainty in the weekly sliding (D) and daily (F) R t estimates, defined as the width of the 95% credible intervals, using the reconstructed (y-axis) and reported daily (x-axis) data. The colour of the points in panels C-F corresponds to the epidemic phase, i.e. the early (31st ...

... reconstructed incidence is most similar to the reported daily incidence of any case study. Therefore, the greatest differences in R t estimates from death data coincide with periods of low incidence (S1 Appendix p 7) when uncertainty increases. Weekly sliding R t estimates are as strongly correlated as those from COVID-19 case data, but daily R t estimates are the most strongly correlated of any dataset (Fig 4). Additionally, there is very high overall agreement in the classification of daily and weekly R t (S1 Appendix p 9). This provides further support that differences between daily R t estimates for influenza and COVID-19 cases are likely due to the reconstructed incidence smoothing out weekly periodicity in reporting.
To investigate further, weekend effects were artificially introduced to data in the simulation study (S1 Appendix p 20). We have shown that, when using reported incidence, R t estimates are all strongly influenced by weekend effects (regardless of the smoothing time-window). Reconstructing daily incidence from weekly data completely removes the effect of noise from resulting R t values, greatly improving the accuracy of estimates. This demonstrates that it may be beneficial to artificially aggregate daily data, as has been done in previous studies [6,7]. However, we did assume quite an extreme level of administrative noise, so in instances where the pattern is less prominent, it may have less of an impact on estimates. Furthermore, this smoothing effect could disguise genuine variations in transmissibility that occur mid-aggregation window, for instance, increased/decreased transmission over weekends (S1 Appendix section 3g). Disentangling important temporal trends in R t from noise in the data can be difficult, and if aggregated data is used it will be at the cost of reduced temporal resolution in R t estimates.
This can be seen when the method is applied to data aggregated over longer timescales, such as ten to fourteen days (S1 Appendix pp 23-25). This approach requires two layers of smoothing: 1) the incidence is smoothed over each aggregation window during the reconstruction process and 2) R t estimates are smoothed by the sliding window chosen by the user. If a change in R t occurs at the end of an aggregation window (i.e. on the last day), such as a sudden decrease in R t due to a strict lockdown, that change is detected with a lag, corresponding to the length of the sliding window used for R t estimation (S1 Appendix p 24). However, if the event occurs mid-aggregation window, then in addition to the usual lag caused by the sliding window, estimates will be affected by the smoothing of the incidence within the aggregation window during reconstruction (S1 Appendix p 25). The change in R t will seem more gradual over the period that data are aggregated over and will appear to start earlier (corresponding to the first day of the aggregation window). It is important for users to be aware of this, particularly when using longer aggregations of data.
Another consideration is that the reconstructed incidence can have discontinuities at the borders between aggregation windows (S1 Appendix pp 11-12). This occurs because in reconstructing daily incidence we impose that, if it were to be re-aggregated, it would match the original data. Methods that simply fit smoothing splines to weekly data, inferring daily case counts from the daily difference in cumulative counts, are not affected by this [23,24]. To circumvent this problem, we recommend that sliding windows used to estimate R t are at least as long as the aggregation windows to reduce the impact of discontinuities on estimates (S1 Appendix pp 23-25). Alternative approaches include simple smoothing splines or LOESS to reconstruct daily incidence from aggregated data (see S1 Appendix section 4), and modelling frameworks implemented in the Epidemia and EpiNow2 R packages [6,21,25]. In these frameworks, daily infections are modelled as a latent process, back-calculated from observed data on cases or deaths, depending on an appropriate infection to observation distribution. In addition, Epidemia integrates further information, such as the infection ascertainment rate (for cases) or the infection fatality rate (for deaths) [21]. This facilitates a 'nowcasting' approach, allowing users to estimate R t directly from the unobserved infections, but these frameworks typically require more data (e.g. incidence of deaths and cases), more assumptions (e.g. delay distributions and ascertainment rates), and are much more computationally intensive, which can be a barrier to the adoption of such methods by users [14].
Here, R t estimates are based on a single daily incidence reconstruction, meaning R t can be estimated very rapidly from aggregated data, which is particularly desirable during real-time outbreak analysis [14].A potential downside is that uncertainty in R t estimates could be underestimated.However, the simulation study showed that the 95% credible interval of estimates encompassed the correct value of R t the majority of the time, and we found no substantial indication that this approach detrimentally affected our characterisation of the uncertainty.
Given that this method is directly derived from EpiEstim, it relies on similar assumptions and caveats [15,26]. As time of infection is more difficult to observe than symptom onset, the SI is typically used as an approximation of the generation time in the renewal equation, which may introduce bias [27]. The SI, the level of undetected cases, and the reporting rate are assumed to remain constant, which is often not the case in practice. Factors such as changes in population immunity, and the introduction of interventions, can alter the SI throughout an epidemic [28]. Meanwhile, changing case definitions, new testing practices, and increased healthcare-seeking behaviour can all affect case ascertainment [15]. Parameters chosen by users can also influence estimation accuracy, for instance, the time window length for temporal smoothing and the prior for R t [26]. Finally, EpiEstim's assumption of a Poisson likelihood may be a limitation in instances when data is substantially overdispersed [29,30].
To make the method simple to implement for current and future users of EpiEstim, this extension has been fully integrated with the 'estimate_R()' function in the original R package on GitHub [31]. Just one additional parameter is required: the number of days data are aggregated over (with some other optional parameters). The reconstructed daily incidence is also generated as an output, so it is possible to use it in other analysis pipelines involving alternative R estimation methods, which may perform better than EpiEstim in certain contexts, e.g. in retrospective analysis (S1 Appendix pp 29-30) [30], or in the presence of delays in reporting [25]. More details regarding the applications of this method can be found in the package vignette and associated examples [31].

Conclusion
We extended the widely used R t estimation approach proposed by Cori et al. [15] and implemented in the R package EpiEstim, to incorporate a new feature which allows R t to be easily estimated from any temporal aggregation of incidence data. We have demonstrated that the method performs well using both simulated and real-world data, recovering or even improving upon the estimates that would have been made from reported daily data. This extension is easy to use and computationally efficient, which will enable epidemiologists and other public health professionals to apply EpiEstim to a wider range of diseases and epidemic contexts.

Fig 2A and S1 Appendix p 8). Considering weekly sliding R t first, there was a high correlation in both the mean R t estimates derived from each dataset (R² = 0.91, Fig 2C and S1 Appendix p 2) and their associated uncertainty (R² = 0.93, Fig 2D). The overall agreement in the classification of R t reached 81.8% (see methods and S1 Appendix p 9).


Fig 3. Rt estimates from daily incidence that was either reported or reconstructed from weekly aggregated COVID-19 case data. A) The reported (grey) and reconstructed (green) daily incidence of COVID-19 by date of specimen. B) Squared error of the daily (orange) and weekly sliding (pink) Rt estimates made from reconstructed data compared to those obtained from the reported daily data. Rt estimation starts on the first day of the second aggregation window (day 8; 28th February 2020) and is plotted on the last day of the time window used for estimation (i.e., starting on day 9 (29th February) for daily estimates and day 14 (5th March) for weekly estimates). Note: the x-axis is shared with the incidence plot above and the y-axis has been limited to 0.5 for clarity. C & E) Correlation between the weekly sliding (C) and daily (E) mean Rt estimates using reconstructed (y-axis) and reported (x-axis) daily data, starting on day 30 due to low incidence. Vertical and horizontal lines depict the 95% credible intervals (95% CrIs) and dotted lines show the threshold of Rt = 1. D & F) Correlation between the uncertainty in the weekly sliding (D) and daily (F) Rt estimates, defined as the width of the 95% credible intervals, using the reconstructed (y-axis) and reported (x-axis) daily data. The colour of the points in panels C-F corresponds to the epidemic phase, i.e. the early (21st March to 12th October 2020), middle (13th October 2020 to 22nd May 2021) or late (23rd May to 30th December 2021) phase of the data, shown by the strip in panel A. Solid lines show the linear model fit with 95% confidence intervals (grey shading). Dashed lines represent the x = y line. https://doi.org/10.1371/journal.pcbi.1011439.g003

Fig 4. Rt estimates from daily incidence that was either reported or reconstructed from weekly aggregated COVID-19 death data. A) The reported (grey) and reconstructed (green) daily incidence of COVID-19 by date of death within 28 days of a positive test. B) Squared error of the daily (orange) and weekly sliding (pink) Rt estimates that were made from reconstructed data compared to those obtained from the reported daily data. Rt estimation starts on the first day of the second aggregation window (day 8; 9th March 2020) and is plotted on the last day of the time window used for estimation (i.e., starting on day 9 (10th March) for daily estimates and day 14 (15th March) for weekly estimates). Note: the x-axis is shared with the incidence plot above and the y-axis has been limited to 0.5 for clarity. C & E) Correlation between the weekly sliding (C) and daily (E) mean Rt estimates using reconstructed (y-axis) and reported (x-axis) daily data, starting on day 30 due to low incidence. Vertical and horizontal lines depict the 95% credible intervals (95% CrIs) and dotted lines show the threshold of Rt = 1. D & F) Correlation between the uncertainty in the weekly sliding (D) and daily (F) Rt estimates, defined as the width of the 95% credible intervals, using the reconstructed (y-axis) and reported (x-axis) daily data. The colour of the points in panels C-F corresponds to the epidemic phase, i.e. the early (31st March to 20th October 2020), middle (21st October 2020 to 28th May 2021) or late (29th May 2021 to 2nd January 2022) phase of the data, shown by the strip in panel A. Solid lines show the linear model fit with 95% confidence intervals (grey shading). Dashed lines represent the x = y line. https://doi.org/10.1371/journal.pcbi.1011439.g004

Partnership (grant reference MR/N014103/1). AC acknowledges the Academy of Medical Sciences Springboard, funded by the Academy of Medical Sciences, Wellcome Trust, the Department for Business, Energy and Industrial Strategy, the