Evaluating data-driven methods for short-term forecasts of cumulative SARS-CoV2 cases

Background
The WHO declared the SARS-CoV2 epidemic a public health emergency of international concern on 30th January 2020. To date, it has spread to more than 200 countries and has been declared a global pandemic. For appropriate preparedness, containment, and mitigation response, stakeholders and policymakers require prior guidance on the propagation of SARS-CoV2.
Methodology
This study aims to provide such guidance by forecasting the cumulative COVID-19 cases up to 4 weeks ahead for 187 countries, using four data-driven methodologies: autoregressive integrated moving average (ARIMA), exponential smoothing model (ETS), and random walk forecasts (RWF) with and without drift. For these forecasts, we evaluate the accuracy and systematic errors using the Mean Absolute Percentage Error (MAPE) and Mean Percentage Error (MPE), respectively.
Findings
The results show that the ARIMA and ETS methods outperform the other two forecasting methods. Additionally, using these forecasts, we generate heat maps to provide a pictorial representation of the countries at risk of having an increase in cases in the coming 4 weeks of February 2021.
Conclusion
Due to limited data availability during the ongoing pandemic, less data-hungry short-term forecasting models, like ARIMA and ETS, can help in anticipating future outbreaks of SARS-CoV2.

Introduction
Table 1 only provides the descriptive statistics for the 29 countries with the highest cumulative COVID-19 cases as of 1st February 2021 and at the aggregated level for the entire world. The descriptive statistics for the entire sample are provided in S1 Table.
Ethics
No ethics approval was required for the study as secondary data analysis was performed on the publicly available COVID-19 dataset.

Forecasting methodology and evaluation
To forecast the cumulative COVID-19 cases, we use four different forecasting methods. Three of the forecasts are based on the autoregressive integrated moving average process, which is usually denoted as ARIMA(p, d, q), where p is the order of the autoregressive model, d is the degree of differencing, and q is the order of the moving average model. The ARIMA model has been used for forecasting and assessing seasonality in infectious disease outbreaks [12][13][14][15].
The ARIMA model is a generalization of the autoregressive moving average (ARMA) model with an ability to address the potential non-stationarity of the variable of interest. To test for stationarity of the cumulative COVID-19 cases, we used the Augmented Dickey-Fuller (ADF) and Phillips-Perron (PP) unit root tests. The null hypothesis of these tests is that the variable contains a unit root and is hence non-stationary, whereas the alternative is that the time series was generated by a stationary process. Table 2 reports the p-values of the unit root tests for the 29 countries with the highest cumulative COVID-19 cases and at the aggregated level for the entire world. S2 Table provides the results of the unit root tests for the 187 countries. The test results suggest non-stationarity, which justifies the use of the ARIMA model.

PLOS ONE
Data-driven methods for short-term forecasts of SARS-CoV2 cases

Let $X_t$ denote the cumulative COVID-19 cases on the $t$-th day for the country being analyzed. Then, the ARIMA(p, d, q) equation can be given as follows [16]:

$$D^{d} X_t = \alpha + \sum_{i=1}^{p} \beta_i D^{d} X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \gamma_j \varepsilon_{t-j} \tag{1}$$

In Eq (1), $D$ is the difference operator, $\alpha$ is the constant term, the $\beta$'s and $\gamma$'s are the coefficients of the autoregressive and the moving average components of the ARIMA model, respectively, and $\varepsilon$ is the error term, which is assumed to be independently and identically distributed from a normal distribution with zero mean. Eq (1) shows that the AR component allows the variable to be determined based on its prior values, whereas the MA component shows that the error term is a linear combination of the current and prior values of $\varepsilon$. The latter accounts for the autocorrelation in the variable of interest.
A particularly naïve attempt is to fit ARIMA(0,1,0) which is commonly referred to as random walk and its forecasts are termed as random walk forecasts (RWF). We generate the RWF with and without drift for the variable of interest.
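The two random walk variants can be illustrated with a minimal Python sketch. It assumes the common convention that the drift is estimated as the average historical change per day, (X_T − X_1)/(T − 1); the function name `rwf` and the sample counts are hypothetical and are not taken from the study.

```python
def rwf(series, horizon, drift=False):
    """Random walk forecasts for 1..horizon steps ahead.

    Without drift, every forecast equals the last observation. With drift,
    the average historical change per step is added for each step ahead.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    last = series[-1]
    # Assumed drift estimate: mean of the first differences.
    slope = (series[-1] - series[0]) / (len(series) - 1) if drift else 0.0
    return [last + h * slope for h in range(1, horizon + 1)]


cases = [10, 14, 19, 25, 30]  # hypothetical cumulative counts
print(rwf(cases, 3))               # flat forecasts at the last observed value
print(rwf(cases, 3, drift=True))   # forecasts that continue the average daily increase
```

For a cumulative series, which can never decrease, the forecast without drift implies no new cases at all, so the drift variant is usually the more meaningful naive benchmark.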
A more systematic approach for fitting the ARIMA model follows these steps [17]:
1. To ensure stationarity, the differencing order (d) is selected by using the Kwiatkowski-Phillips-Schmidt-Shin test [18].
2. The lags, p and q, are determined by using the Akaike Information Criterion corrected for small sample sizes.
Aside from the ARIMA model, we also used the exponential smoothing method (ETS) for generating forecasts. ETS is a forecasting method for univariate data that handles systematic trend and seasonality and can be used as an alternative to the ARIMA models [19].
To evaluate the performance of forecasts, the data is divided into two mutually exclusive sets, the training and test sets. The training set is used to fit the model (without using any data from the test set) whereas the test set is kept for evaluating the forecast accuracy. We use a variant of the time series cross-validation which is a more sophisticated version of the usual training-test set methodology [16]. In this method, there is a series of test sets, and each test set is accompanied by a corresponding training set consisting of observations before the test set. Therefore, a series of training-test sets are constructed, and for each training-test set forecast accuracy is determined. This method is more sophisticated than the usual training-test set methodology because it allows more comparisons of the forecasted and actual data values.
The time-series cross-validation method is also referred to as evaluation on a rolling forecasting origin because the origin of the test set is rolled forward in time. In simpler words:
1. An origin for the first test set is selected.
2. Forecasts are determined for the test set using the corresponding training set.
3. The origin is rolled forward by one period, generating a new training-test set for which forecasts can be evaluated, and so on.
In this study, we take the 45th day since the first reported case in the country as the origin, which is then rolled forward one day at a time. The variation in our methodology is that, instead of taking each test set as a single observation, we take four different test sets for each training set: 1 week, 2 weeks, 3 weeks, and 4 weeks into the future. This allows us to ascertain the accuracy of the forecasting method up to 4 weeks ahead for each training set. Therefore, we include countries with at least 73 (45+28) observations to ensure that there is at least one available test set for the 4 weeks ahead forecasts for each country included in the forecast evaluation.
Suppose the country under consideration has data available for $t \in \{1, 2, \ldots, T\}$. The following steps explain the methodology:
1. Use the data available till $t = 45$, and forecast the values of $X_{t+\tau}$ for $\tau \in \{1, 2, \ldots, 28\}$, i.e., obtain the forecasts for the next 28 days or 4 weeks.
3. Increase the data sample by one day, i.e., take the data till the $(t + 1)$-th day and obtain 28-day ahead forecasts, and repeat this process until we reach the end of the data, i.e., we reach the $T$-th day.
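The rolling-origin procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's implementation: the forecaster is a simple random walk with drift standing in for ARIMA or ETS, each test set is read as the first 7, 14, 21, or 28 days after the origin, and a test set is scored only when all of its days are observed. All function names and the sample series are hypothetical.

```python
def drift_forecast(train, horizon):
    """Stand-in forecaster (random walk with drift); swap in ARIMA or ETS here."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + h * slope for h in range(1, horizon + 1)]


def window_mape(actual, forecast):
    """MAPE (in percent) over one test window."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rolling_origin_mape(series, first_origin=45, windows=(7, 14, 21, 28),
                        forecaster=drift_forecast):
    """Average MAPE per test window over all rolling origins.

    Assumption: a test window is scored only when all of its days are observed.
    """
    scores = {w: [] for w in windows}
    for origin in range(first_origin, len(series)):
        train = series[:origin]                      # days 1..origin
        forecasts = forecaster(train, max(windows))  # days origin+1..origin+28
        for w in windows:
            test = series[origin:origin + w]
            if len(test) == w:
                scores[w].append(window_mape(test, forecasts[:w]))
    return {w: sum(s) / len(s) for w, s in scores.items() if s}


# Hypothetical, smoothly accelerating cumulative series with 100 daily observations.
cases = [100 + t * t for t in range(100)]
for window, score in rolling_origin_mape(cases).items():
    print(f"{window:2d}-day test window: average MAPE = {score:.2f}%")
```

With fewer than 73 observations no complete 28-day test set exists, so the sketch returns no score for that window, which mirrors the inclusion rule of at least 73 (45+28) observations.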
There are several methods to determine the accuracy of the forecasted values. We used the Mean Absolute Percentage Error (MAPE) for this purpose, which is defined as follows [16]:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right| \tag{2}$$

In Eq (2), $A_i$ and $F_i$ denote the actual and forecasted values, respectively, and $n$ is the number of forecasted values for which a corresponding actual data value exists. Forecasting accuracy increases as MAPE approaches zero. Since the forecasted variable of this study is the cumulative COVID-19 cases, MAPE represents the forecasting error as a percentage of cumulative COVID-19 cases. Based on our methodology, there is a series of training-test sets, and MAPE can be determined for each of these. Therefore, the forecasting accuracy is calculated by averaging MAPE over the series of training-test sets [16].
We also estimate the Mean Percentage Error (MPE), which is defined as follows [20]:

$$\mathrm{MPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{A_i - F_i}{A_i} \tag{3}$$

Since MAPE uses the absolute values of the forecasting errors, it is unable to determine whether the forecasting model is systematically under- or over-predicting. In this regard, MPE can prove useful as it does not use the absolute values of the forecasting errors [20].
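The difference between the two measures can be seen in a small Python sketch. It assumes the usual definitions, with the error taken as actual minus forecast and expressed as a percentage of the actual value; the sign convention and the sample numbers are assumptions made for illustration.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    pairs = list(zip(actual, forecast))
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


def mpe(actual, forecast):
    """Mean Percentage Error, in percent (assumed sign: actual minus forecast)."""
    pairs = list(zip(actual, forecast))
    return 100.0 * sum((a - f) / a for a, f in pairs) / len(pairs)


# Hypothetical values: one over-prediction, one under-prediction, one exact forecast.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
print(f"MAPE = {mape(actual, forecast):.2f}%")
print(f"MPE  = {mpe(actual, forecast):.2f}%")
```

In this example the over- and under-predictions cancel in MPE while MAPE stays positive, which is why MPE indicates the direction of systematic errors rather than their size.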

Results
This section presents our results: the forecasting accuracy of the four forecasting methodologies, the forecasted values, and the heat maps.

Forecasting evaluation
Figs 1 and 2 show the MAPE for the forecasted values of the 29 countries with the highest cumulative COVID-19 cases and at the aggregated level for the entire world. For all of the countries, the MAPE values are provided in S1-S6 Figs. As expected, the MAPE increases as we increase the forecasting horizon from 1 week to 4 weeks ahead. This suggests that shorter-term forecasts are more accurate compared to longer-term forecasts. Overall, forecast evaluation shows that the ARIMA and ETS forecasts outperform RWF with and without drift.

In each boxplot, the box ranges from the first quartile (Q1) to the third quartile (Q3), and the black notch in the box represents the median of the data. Each boxplot also has a vertical line that encompasses the non-outliers of the data. The bottom and top limits of the vertical line are determined as Q1 − 1.5 IQR and Q3 + 1.5 IQR, respectively, where IQR = Q3 − Q1 is the interquartile range. Any observation beyond the vertical line is referred to as an outlier because, for normally distributed data, about 99.3% of the observations lie within its limits. Fig 3 shows that, among the non-outliers, the maximum MAPE values for 1 week, 2 weeks, 3 weeks, and 4 weeks ahead ARIMA forecasts are 4.97%, 8.00%, 11.89%, and 15.41%, respectively. Moreover, the median values for 1 week, 2 weeks, 3 weeks, and 4 weeks ahead forecasts are 1.88%, 3.47%, 5.20%, and 7.07%, respectively. The performance of ARIMA forecasts is marginally better than ETS forecasts and significantly better than RWF with and without drift. We also used alternative measures for evaluating the forecasted values: Mean Absolute Error (MAE) and Mean Error (ME). Based on the results of these measures, presented in S1 File, the conclusions drawn from Figs 3 and 4 remain unchanged.

For the forecasted values, we use only the ARIMA and ETS forecasts, as these outperform the RWF with and without drift, as established in the previous subsection. Figs 5 and 6 use the data from the entire sample period, and the values are forecasted for 4 weeks into the future. These figures show that the ARIMA and ETS forecasts perform similarly in the depicted cases.
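The boxplot limits used in the forecast evaluation (Q1 − 1.5 IQR and Q3 + 1.5 IQR) can be computed with a short Python sketch. The quartile method chosen here (`method="inclusive"` in the standard `statistics` module) is an assumption, since plotting software differs in how it interpolates quartiles, and the MAPE values are hypothetical.

```python
import statistics


def boxplot_limits(values):
    """Return the lower limit, the upper limit, and the observations outside them."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical MAPE values (in percent) for nine countries.
mape_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0]
lower, upper, outliers = boxplot_limits(mape_values)
print(f"limits: [{lower}, {upper}], outliers: {outliers}")
```

Because the limits depend only on the quartiles, a single extreme country does not widen them, which keeps the reported non-outlier maxima comparable across forecasting methods.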

Forecasted scenario
We also use ARIMA forecasts to generate heat maps for 8th February (Fig 7), 15th February (Fig 8), 22nd February (Fig 9), and 1st March (Fig 10) in 2021. To generate the heat maps, we selected all those countries which had at least 73 (45+28) observations and a population larger than 1 million. Moreover, we only use the ARIMA forecasts since their performance is comparable to that of the ETS forecasts. To ensure comparability of the forecasts, we divided the forecasted values of the cumulative COVID-19 cases by population in millions. Overall, the heat


Discussion
Our results show that the ARIMA and ETS methods perform well in forecasting cumulative COVID-19 cases. Additionally, using these forecasts, we generated heat maps to provide a pictorial representation of the countries at risk of having an increase in cases in the 4 weeks of February 2021.

Globally, uncertainty exists around the spread and transmission of SARS-CoV2. To address this uncertainty, many mathematical modeling and simulation-based techniques have been used, especially compartmental model techniques, to better understand the transmission of COVID-19. Among these, the most used is the Susceptible-Exposed-Infectious-Recovered (SEIR) model [21][22][23][24]. The SEIR model makes assumptions on the population belonging to the different compartments based on $R_0$. However, for these assumptions to be reliable, large datasets are required, and solely relying on $R_0$ can be misleading as COVID-19 outbreaks may be possible even when $R_0$ is lower than one [6][21][22][23][24].
During a pandemic, limited data are available to run the aforementioned models reliably. Moreover, some of the models for infectious diseases were designed for determining long-term, instead of short-term, dynamics and projections [25]. In comparison, the data-driven methods considered in this paper are less data-hungry, perform well for short-term forecasts (based on evaluation of 4-week ahead forecasts), and do not require as much detail in the datasets. Another advantage of these data-driven techniques is the simplicity of estimation, which can be performed using the open-source statistical software R 4.0.3.

Different countries and regions have different health systems and capacities in place, which determine their testing capabilities. The SEIR model can capture the propagation of the disease, which means that it would be able to predict the true number of cases considering the susceptible and asymptomatic individuals. However, data for asymptomatic cases is largely unavailable for SARS-CoV2 due to limited testing capabilities and a large proportion of asymptomatic cases not being detected, making it challenging to verify the predictions from the compartmental models. On the other hand, data-driven techniques can provide information on the confirmed number of cases with high accuracy. For this study, we focused on the cumulative COVID-19 cases. However, these forecasting methods can be used for other indicators such as cumulative deaths, cumulative recoveries, etc. The forecasts of the confirmed number of cases are sensitive to the number of tests performed. However, since the confirmed number of cases is an indicator of the anticipated burden on the healthcare system and professionals, the projections by the data-driven techniques might be insightful for policymakers. This is important because the availability of health service resources during COVID-19 is an issue faced by many countries [26]. Even with lockdown measures enacted, the peak demand for healthcare services during the COVID-19 pandemic exceeded capacity irrespective of the capacity of the healthcare infrastructure and resources, especially during the second wave [27,28].
Globally, our forecasting results reveal that the number of cases will increase in most of the countries. Additionally, the forecasted scenarios for February 2021 indicate an increase in the cumulative cases of COVID-19 in Canada, Europe, and South America (Figs 7-10). The future of the global pandemic greatly depends on the vaccine rollout coupled with the implementation of mitigation and containment measures. Strict measures such as worldwide lockdowns, travel restrictions, school closures, non-essential business closures, social distancing, isolation of infected populations, as well as heightened hygiene measures can potentially reduce the risk of spread [26]. However, the effectiveness of interventions is far from homogeneous and depends on how well people comply, the presence of enforcement, how well the testing, contact tracing, and quarantine efforts that run alongside the lockdown are performed, etc. Yet, hopes of curtailing the pandemic have proven elusive, with many countries forced by their economies to relax the quarantine measures, which can potentially lead to an exponential increase in the number of cases. With effective vaccine rollout, close monitoring of COVID-19 cases should be considered before easing the mitigation and containment strategies.
Although the novel coronavirus pandemic is associated with many uncertainties, we believe that short-term forecasting and predictive modeling can be an effective tool in targeted vaccine rollout and intervention strategies. Model-based predictions can help policymakers to make the right decisions in a timely way [29].


Conclusion
Results of the study indicate that the ARIMA and ETS models perform well in forecasting the short-term cumulative COVID-19 cases. We ran the models for 187 countries with varying health system resources and infrastructure, and at the aggregated level for the entire world. The results suggest that the ARIMA and ETS models can be used for SARS-CoV2 forecasting in different countries and regions with a high level of accuracy. Since these models rely on past observations of the cumulative COVID-19 cases, they can also be used for forecasting provincial, district, or state level cases and other COVID-19 indicators.