COVID-19: Short term prediction model using daily incidence data

Background Prediction of the dynamics of new SARS-CoV-2 infections during the current COVID-19 pandemic is critical for public health planning of efficient health care allocation and monitoring the effects of policy interventions. We describe a new approach that forecasts the number of incident cases in the near future given past occurrences using only a small number of assumptions. Methods Our approach to forecasting future COVID-19 cases involves 1) modeling the observed incidence cases using a Poisson distribution for the daily incidence number, and a gamma distribution for the series interval; 2) estimating the effective reproduction number assuming its value stays constant during a short time interval; and 3) drawing future incidence cases from their posterior distributions, assuming that the current transmission rate will stay the same, or change by a certain degree. Results We apply our method to predicting the number of new COVID-19 cases in a single state in the U.S. and for a subset of counties within the state to demonstrate the utility of this method at varying scales of prediction. Our method produces reasonably accurate results when the effective reproduction number is distributed similarly in the future as in the past. Large deviations from the predicted results can imply that a change in policy or some other factors have occurred that have dramatically altered the disease transmission over time. Conclusion We presented a modelling approach that we believe can be easily adopted by others, and immediately useful for local or state planning.

To model the number of cases, we use the simplest time-since-infection model which is also known as the "Kermack-McKendrick" model [1]. Following similar setup from Fraser [2] and Cori et al. [3], we define a function β(t, τ ) which is the transmission probability at a calendar time t, after being infected τ time ago. Therefore, between time t and t + δ, someone infected at time τ ago will consequently infect someone else with a probability of β(t, τ )δ. As a result, the number of cases at time t, I(t), is related back to previous cases through the following renewal equation: Suppose that β(t, τ ) is separable, that is: where, with no loss of generality, we assume This function R e (t) therefore estimates the number of people someone infected at time t would infect if the conditions of the epidemic remains constant. It was called "instantaneous reproduction number" by Cori et al. [3]. In this paper, we refer to this parameter as the effective reproduction number in order to distinguish it from the basic reproduction number R 0 -the average number of people infected by one infectious person under a natural situation when no intervention is implemented. The effective reproduction number R e is a very important parameter, since it can be shown that when R e is greater than one, infections will increase and lead to an outbreak where more and more people will be infected. If R e is less than one, then the outbreak will die down eventually. The effective reproduction number can be influenced by four major factors [4]: 1) transmissibility of the disease; 2) duration of the infectious period; 3) number of contacts between infectious and susceptible individuals each day; and 4) the percentage of people that are immune or no longer susceptible. We will show that this parameter plays a major role in our forecasting model for COVID-19. The normalized function w(τ ) represents the relative transmissivity at a time τ after infection. The distribution function w(τ ), also called the distribution of the serial interval, can be obtained only if the infection date for each case is known. In a real world application, we usually only have the information of the reported dates of each case, instead of the infection date, therefore, we will rely on either the external source of data or we will make some assumptions about the serial distribution w(τ ).
Plugging in the factorization for β(t, τ ), the renewal function (1) becomes In practice, incidence data are obtained usually at daily intervals, and thus are discrete. Therefore, the discrete version of the above renewal equation is where T indicates the last time point of the incidence series. Even though we could obtain an estimate of R e from the above equation, the estimate can be highly variable, due to reporting errors (for example, under-reporting and delayed reporting) associated with the incidence case series. Therefore, we follow the setup of Cori et al. [3] and view the cases as arising from a Poisson Process, where the mean parameter is given by the renewal equation (2). In order to obtain a more stable estimate of R e (t), we also assume that R e (t) is the same for a total of τ days during the period [t − τ + 1, t]. Explicitly, the likelihood function of the cases during the time period [t − τ + 1, t] is as follows: Following Cori et al. [3], we will use a Bayesian framework and assume R e (t) has a prior Gamma distribution with shape parameter a and scale parameter b. Since we have a Poisson likelihood, a gamma prior is conjugate, and should result in a gamma posterior for R e (t) with shape parameter a * and scale parameter b * , where For now, we set a and b to be hyper parameters specified by the user. We defer future discussion of these parameters and the serial interval function w(s) to the next section. This procedure can now be used to get the posterior distribution of R e (t) for any time interval [t − τ + 1, t]. The interval length τ is chosen such that there is a large enough data set to provide stable estimates, and small enough to capture the time-varying nature of R e (t). Eventually, we will use the posterior distribution of R e (T ) to predict future cases, where T is the last time point of the observed incidence series.
For short-term forecasting of new cases, we propose a method that uses very few assumptions and is therefore straightforward to model. We simply assume that in the near future after time T , R e (t) will 1) stay the same; 2) increase 5%; 3) decrease 5%. For the first scenario that R e (t) will stay the same, we perform the following procedures for making predictions: 1. Draw a new R * e from the posterior gamma distribution of R e (T ), as specified by (3), where T is the end time point of the observed incidence series.
2. Then we draw I T +1 from a Poisson distributing with a mean function obtained by the renewal function (2).
3. Similarly, we draw sequentially the values for I T +2 , I T +3 , · · · , I T +K , where K is predefined forecasting period. Thus we obtain one complete forecast series, I T +1 , I T +2 , · · · , I T +K . δ = 5%, or δ = −5%. Either way, we borrow the posterior distribution obtained from R e (T ), and assume that the mean parameter changes by δ, but the variance stays the same. This is equivalent to setting the posterior distribution for R e to be a gamma distribution with shape parameter a * δ and scale parameter b * δ , where a * δ = a * δ 2 , b * δ = b * /δ. Then we follow the same steps outlined above to predict the means and the 95% confidence intervals for I T +1 , I T +2 , · · · , I T +K .

Appendix B. Choosing Model Parameters
The goal of our proposed method is to make as few approximations as possible, so that we can make reasonable predictions without having to rely on the correct specification of the assumptions. However, given only the incidence data, the reproduction number R e (t) can not be determined uniquely [4], so some basic assumptions are necessary. The most important assumption is the serial interval distribution. Based on literature reviews, we adopted a discretized gamma distribution [3]. The mean of 3·95 days and a standard deviation of 4·24 days were based on the work of Ganyani et al. [5] using COVID-19 data from Tianjin, China. If desired, a prior can be placed on the parameters for the mean and standard deviation of the serial interval, and this approach was discussed in estimating R e (t) values in Cori et al. [3].
There are other hyper parameters that must be supplied. In our method, we assume that each R e (t) has a gamma prior distribution, with shape and scale parameters a and b. It is common knowledge that the posterior distribution of R e (t) depends both on the parameters in the prior distribution, and the observed data. When the incidence number is relatively high, as experienced in Texas and other states during the Summer and Fall of 2020, the prior parameters have very little effect on the final posterior distribution. Our experience suggests that a wide range of choices worked for a and b as hyper-parameters. Choosing a single a and b and using the same prior in general for all R e (t) work well. We have chosen a value of a = 1, b = 5 following the work of Cori et al. [3]. Another strategy is to choose the prior parameters so that the prior has a mean that is equal to the previous estimate R e (t − τ ), and a standard deviation resembling the posterior standard deviation of the previous R e (t − τ ).
The last parameter we need to decide is the interval parameter τ , since we based our prediction on the estimated R e (T ) during the interval [T − τ + 1, T ], where T is the end time of a time series upon which we wish to make a forecast. The choice of τ is a compromise between the variability and the accuracy of the predictions. In general, choosing τ to be small will result in highly variable estimates of R e (t), but will be more accurate due to the fact that R e (t) may be different at different times. In contrast, a large τ is less accurate, but also less variable, as it uses more data to fit R e (t). In our experience, a choice between 7 and 12 days worked well when we applied them to the real data sets.
We end this section on a note of using a Poisson distribution for the likelihood of the observed incidence sequence. Despite the over-dispersion issue associated with the Poisson distribution assumptions (i.e. the Poisson distribution assumes that the variance is the same as the mean, but in reality, the variance is often larger than the mean), our experience is that the predicted cases can vary a lot due to the change of the underlying reproduction number R e (t). By allowing R e (t) to take different values in the prediction, we will capture the uncertainty of the future incidence series in a different way. Our prediction interval will be determined by different transmission rate scenarios: the upper bound is given by the upper 95% confidence limit when the transmission rate increase 5%, and the lower bound is given by the lower 95% confidence interval limit when the transmission rate decreases 5%.