
Downscaling epidemiological time series data for improving forecasting accuracy: An algorithmic approach

Abstract

Data scarcity and discontinuity are common in healthcare and epidemiological datasets, which must nevertheless often support decision-making and forecasting of upcoming scenarios. To avoid these problems, such data are frequently processed as monthly/yearly aggregates, on which prevalent forecasting tools like the Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. The paper proposes a novel algorithm, named the Stochastic Bayesian Downscaling (SBD) algorithm, based on the Bayesian approach, that can regenerate downscaled time series of varying time lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (Dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agrees with the original data regarding its statistical properties, trend, seasonality, and residuals. In terms of forecasting performance, using the last 12 years of Dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.

Introduction

Any process that involves deriving high-resolution data from low-resolution variables is referred to as downscaling. This method relies on dynamical or statistical approaches and is extensively utilized in the fields of meteorology, climatology, and remote sensing [1, 2]. Significant exploration of downscaling methods has been done in geology and climatology to enhance the output of existing models like the General Circulation Model (GCM) [3–8], Regional Climate Model (RCM) [9], Integrated Grid Modeling System (IGMS) [10], and System Advisor Model (SAM) [10], and to make them usable for forecasts over geographically significant regions and times. Several methods have been used to downscale these data, such as BCC/RCG-Weather Generators (BCC/RCG-WG) [11–13], the Statistical Downscaling Model (SDSM) [11, 14–19], and Bayesian Model Averaging (BMA) [20]. Machine learning methods have also been used, such as the Genetic Algorithm (GA) [9], K Nearest Neighbour Resampling (KNNR) [9], and Support Vector Machines (SVM) [11, 21–23]. Except for the machine learning algorithms, which are methods finding their applications in new domains, these methods are tailored to suit the outputs of the models mentioned earlier.

This class of methods has recently been applied to the disaggregation of spatial epidemiological data [24, 25]. Nevertheless, significant work has yet to be done on the temporal downscaling of epidemiological data. The temporal downscaling techniques in use are often classical interpolation techniques that do not do justice to aggregated data. This phenomenon is well illustrated with an example. Consider the monthly Dengue infection data of 2017 from Fig 1, which has been downscaled in Fig 2 using linear interpolation by treating each aggregated value as the value on the end date of its month. In this case, the monthly aggregate of the downscaled data does not match the original aggregate. Downscaled data that differs from the original in such statistical measures will result in decisions and knowledge that can be far from the truth.

thumbnail
Fig 1. Monthly data of Dengue 2017.

The monthly aggregate of the DENV infection in Bangladesh in the year 2017. The data has been aggregated to monthly scale to avoid the discontinuity observed in the daily scale.

https://doi.org/10.1371/journal.pone.0295803.g001

thumbnail
Fig 2. Prior distribution for daily data of 2017.

The figure depicts the data downscaled using linear interpolation by treating each aggregated value as the value on the end date of its month, using the data illustrated in Fig 1. In this case, the monthly aggregate of the downscaled data does not match the original aggregate. Downscaled data that differs from the original in such statistical measures will result in decisions and knowledge that can be far from the truth.

https://doi.org/10.1371/journal.pone.0295803.g002

The paper aims to achieve the following:

  • To propose a novel algorithm, named the Stochastic Bayesian Downscaling (SBD) algorithm, based on the Bayesian approach, that can regenerate downscaled time series of varying time lengths from aggregated data while preserving most of the statistical characteristics and the aggregated sum of the original data.
  • To present two downscaling case studies of epidemiological time series data (namely Dengue and COVID-19 data of Bangladesh) to showcase the workflow and efficacy of the algorithm.
  • To present a comparison between the forecasting performance of aggregated data and algorithm-generated synthetic data to showcase the improvement achieved (for synthetic data over aggregated data) in terms of scale-independent error.

The paper is organized as follows. The Materials and methods section describes the data used in the paper and its sources, and presents the methodology at length along with the proposed SBD algorithm. The section titled “Comparison of the synthesized data with the real data” compares the synthesized data with actual data from two different epidemiological cases (Dengue and COVID-19) in Bangladesh, shows how the SBD algorithm could generate a statistically accurate approximation of the actual data with very little input in both cases, and discusses the benchmark metrics used for evaluating the output. The section titled “Improvements in forecasting accuracy” shows the improvement in forecasting accuracy obtained by using synthesized data over aggregated data with a statistical forecasting toolbox in the Dengue scenario of Bangladesh, using the last 12 years of monthly aggregated data, together with the forecasting model selection procedures and residuals. Finally, the conclusion section gives an overview of the paper, its contribution to the existing literature, scopes for improvement, and fields of application of the SBD algorithm.

Materials and methods

Data

The dengue data from Bangladesh used in this paper span January 2010 to July 2022 and were collected from the DGHS [26] and the IEDCR [27]. The COVID-19 data of Bangladesh span 8 March 2020 to December 2020 and were collected from the WHO data repository [28].

Methodology

The SBD algorithm can be segmented into three sequential parts, as exhibited in Fig 3. Initially, the algorithm considers a prior distribution to generate synthetic downscaled data. The SBD algorithm considers the aggregated data as the prior distribution of the downscaled data. For example, if we have monthly epidemiological data of dengue for the year 2017, then to attain the prior distribution for the downscaled daily data, we divide the monthly data by 30. This is well illustrated in Figs 1 and 2: Fig 1 depicts the monthly distribution of DENV (Dengue Virus) infection in Bangladesh for the year 2017, and Fig 2 represents the prior distribution obtained by the method described above.
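This prior-construction step can be sketched as follows. The function and variable names are illustrative, not from the paper, and the sketch spreads each month over its actual day count rather than a flat 30; the monthly counts below are made-up numbers:

```python
import numpy as np

def prior_distribution(monthly_totals, days_per_month):
    """Spread each monthly aggregate uniformly over its days to form
    the prior distribution of the downscaled daily series."""
    prior = []
    for total, days in zip(monthly_totals, days_per_month):
        prior.extend([total / days] * days)
    return np.array(prior)

# Hypothetical monthly dengue counts for three months (illustrative only).
monthly = [92, 58, 35]
days = [31, 28, 31]
daily_prior = prior_distribution(monthly, days)
```

Each month's slice of the prior sums back to its monthly aggregate, which is the invariant the rest of the algorithm preserves.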

thumbnail
Fig 3. Flow diagram of Stochastic Bayesian Downscaling algorithm.

The diagram depicts the flow of the novel proposed algorithm, which works in three unique phases. The first phase (Initial Data Generator) generates an initial approximation based on the prior distribution; the second phase (Overthrow Correction) removes any abrupt fluctuations introduced during the de-aggregation in the first step; and the final phase (Volume Correction) rectifies any displacement of data-point volume across aggregation units caused by the second step, so that the aggregation of the downscaled data agrees with the initial data.

https://doi.org/10.1371/journal.pone.0295803.g003

Based on the prior distribution, the initial statistical properties of the synthetic data are obtained, except for the standard deviation (σ). Because the scaling method used to obtain the prior distribution from the monthly aggregate leaves σ tied to the monthly aggregate, we instead consider, (1) where σ0 is the standard deviation of the distribution to be fitted when generating the downscaled data and σpriordistribution is the standard deviation of the obtained prior distribution. Later, in the section titled “Comparison of the synthesized data with the real data”, we will see that the initial assumption of the standard deviation considered in (1) is a good approximation for the downscaled data.

Initial data generation.

The “Initial Data Generator” phase takes the aggregated data, the length of the aggregation interval, and σ0, and produces initial downscaled data via a “Distribution Generator”. Based on the prior distribution, a suitable statistical probability distribution (PD) is chosen to be fitted to generate the data. The “Distribution Generator” fits the selected PD to the prior distribution based on the statistical properties obtained in the initial phase. The challenge in this scenario, and at every step of the algorithm, is ensuring that the synthetic data produced in every step consists of non-negative integers, as we are dealing with epidemiological data. Thus specific measures have been deployed to tackle these challenges:

  • To ensure non-negativity, consider the transformation:
  • To ensure that the data points are integers irrespective of the selection of PD, we round off the data to the nearest integer and subtract one from randomly selected data points in each aggregated unit so that the synthesized data has the same sum as the aggregated unit.

By imposing these measures, the “Distribution Generator” generates a synthetic distribution for each aggregated unit, and looping over the entire aggregated timeline generates the initial distribution of the downscaled data with respect to the aggregated data. This initial distribution is a suitable approximation of the actual data but can be improved with further refinement. The synthetic data sums exactly back to the aggregated data from which it is generated.
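A minimal sketch of this generator phase for a single aggregated unit, assuming a normal underlying distribution and the two measures above (clipping for non-negativity, then ±1 nudges at random indices to restore the unit sum); all names are illustrative, not the paper's implementation:

```python
import numpy as np

def distribution_generator(total, length, sigma, rng):
    """Generate `length` non-negative integer counts summing to `total`,
    drawn around the uniform prior mean total/length."""
    mean = total / length
    draws = rng.normal(mean, sigma, size=length)
    draws = np.clip(draws, 0, None)        # enforce non-negativity
    vals = np.rint(draws).astype(int)      # enforce integer counts
    diff = total - vals.sum()
    while diff != 0:                       # restore the aggregated sum
        i = rng.integers(length)
        if diff > 0:
            vals[i] += 1
            diff -= 1
        elif vals[i] > 0:                  # never push a count below zero
            vals[i] -= 1
            diff += 1
    return vals

rng = np.random.default_rng(0)             # seeded, as the paper recommends
unit = distribution_generator(92, 31, 2.0, rng)
```

Running this per aggregated unit and concatenating the results gives the initial downscaled series.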

Overthrow correction.

This step is often necessary for time series data with abrupt changes in gradient, or when the initial approximation contains abnormally large overthrows, as the approximations are probabilistic. For data with abrupt changes in gradient, the initial approximation is often left with a staircase-like structure, as exhibited in Fig 4. The problem can be corrected using the overthrow correction measure, as demonstrated in Fig 5.

thumbnail
Fig 4. 2019 Dengue infected cases daily, pre correction.

The initial approximation without overthrow correction exhibits a staircase-like property due to the high gradient change of the prior distribution.

https://doi.org/10.1371/journal.pone.0295803.g004

thumbnail
Fig 5. 2019 Dengue infected cases daily, post correction tol = 176.4, iter = 30.

The initial approximation with overthrow correction is a much better approximation of the real scenario, preserving its original trend.

https://doi.org/10.1371/journal.pone.0295803.g005

The overthrow correction step takes a tolerance δ, an iteration limit n, and a radius r of an open interval. The step first identifies overthrows using the tolerance between neighbouring points: if y_i − y_{i−1} > δ or y_i − y_{i+1} > δ, then y_i is an overthrow. After identifying an overthrow, we consider an open interval of radius r around the overthrow point and execute the distribution generator on that interval. This redistributes the sample within the open interval, diminishing the overthrow to some extent. The process is iterated n times over the entire time series to ensure satisfactory results. The strength of the overthrow correction is dictated by the two parameters δ and n: it is directly proportional to n and inversely proportional to δ. Selecting the correct parameter values can ensure a good approximation of the real-life scenario.
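The step above can be sketched as follows, assuming a normal redraw that conserves each window's sum (a stand-in for re-running the distribution generator on the open interval); the helper names and the spiky demo series are illustrative:

```python
import numpy as np

def _redistribute(total, length, sigma, rng):
    """Redraw non-negative integers of a given length summing to `total`."""
    vals = np.rint(np.clip(rng.normal(total / length, sigma, length), 0, None)).astype(int)
    diff = total - vals.sum()
    while diff != 0:
        i = rng.integers(length)
        if diff > 0:
            vals[i] += 1
            diff -= 1
        elif vals[i] > 0:
            vals[i] -= 1
            diff += 1
    return vals

def overthrow_correction(y, delta, n_iter, r, sigma, rng):
    """Detect points exceeding a neighbour by more than `delta` and
    redistribute the window of radius `r` around them, `n_iter` times."""
    y = y.copy()
    for _ in range(n_iter):
        for i in range(1, len(y) - 1):
            if y[i] - y[i - 1] > delta or y[i] - y[i + 1] > delta:
                lo, hi = max(0, i - r), min(len(y), i + r + 1)
                y[lo:hi] = _redistribute(int(y[lo:hi].sum()), hi - lo, sigma, rng)
    return y

rng = np.random.default_rng(1)
spiky = np.array([2, 3, 40, 3, 2, 3, 2])
corrected = overthrow_correction(spiky, delta=10, n_iter=5, r=2, sigma=2.0, rng=rng)
```

Because each window redraw conserves the window sum, the total of the series is unchanged, though sums of individual aggregation units near window boundaries may shift, which is what the volume correction step repairs.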

Volume correction.

Because of its local correction property, the overthrow correction disrupts the property of the synthesized time series that its aggregated sum equals the given aggregated distribution. Table 1 best illustrates this scenario. The problem is addressed in this step: to keep the aggregated sum equal to the original data, we consider each aggregated unit and adjust its sum accordingly, adding/subtracting 1 at randomly chosen indices until the sum equates as required.
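A sketch of this unit-by-unit repair, with illustrative names and made-up numbers (two 5-day "months" whose targets differ from the current sums):

```python
import numpy as np

def volume_correction(y, agg_values, unit_lengths, rng):
    """For each aggregation unit, add/subtract 1 at random indices
    until the unit's sum matches the original aggregate."""
    y = y.copy()
    start = 0
    for target, length in zip(agg_values, unit_lengths):
        unit = y[start:start + length]   # NumPy view: edits write through to y
        diff = target - unit.sum()
        while diff != 0:
            i = rng.integers(length)
            if diff > 0:
                unit[i] += 1
                diff -= 1
            elif unit[i] > 0:            # keep counts non-negative
                unit[i] -= 1
                diff += 1
        start += length
    return y

rng = np.random.default_rng(3)
daily = np.array([3, 5, 4, 2, 6, 7, 1, 0, 2, 4])
fixed = volume_correction(daily, agg_values=[21, 18], unit_lengths=[5, 5], rng=rng)
```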

thumbnail
Table 1. Comparison of actual vs SDB algorithm generated synthetic data using 2019 Dengue data of Bangladesh.

https://doi.org/10.1371/journal.pone.0295803.t001

The Stochastic Bayesian Downscaling (SBD) algorithm.

The algorithm calls for a unique name; from now on, we shall address it as the Stochastic Bayesian Downscaling (SBD) algorithm. The structural parts of the algorithm have been discussed at length in the first three segments of this methodology subsection. The pseudocode of the SBD algorithm is as follows:

Algorithm 1. Stochastic Bayesian Downscaling (SBD) Algorithm

Require: Aggregated value vector, v
 Overthrow tolerance, δ
 Iteration limit, n
 Radius of the open interval, r
 Standard deviation, σ
Ensure: Downscaled time series, x̂

for elem in v do
  append Distribution Generator(elem, σ) to x̂
end for
for i from 1 to n do
  find the vector of coordinates of overthrow points in x̂
  for elem in overthrow points do
    replace the open interval of radius r centred at elem with Distribution Generator(sum of the elements of the open interval, σ)
  end for
end for
for each element v_i of v do
  if v_i ≠ sum of the equivalent aggregate in x̂ then
    d = v_i − sum of the equivalent aggregate in x̂
    while d ≠ 0 do
      if d > 0 then
        add 1 to a randomly chosen element of the equivalent aggregate in x̂
        d −= 1
      else
        subtract 1 from a randomly chosen non-zero element of the equivalent aggregate in x̂
        d += 1
      end if
    end while
  end if
end for

Algorithm 2. Distribution Generator

Require: Total sum of the downscaled distribution, s
 Standard deviation, σ
Ensure: Downscaled approximation over the length of the aggregate, x̂

x̂ = fit the selected distribution over the given downscaled time frame
if any elements of x̂ are negative then
  apply the non-negativity transformation to x̂
end if
if the elements of x̂ are not integers then
  round each element of x̂ to the nearest integer
end if
if sum of x̂ ≠ s then
  d = s − sum of x̂
  while d ≠ 0 do
    if d > 0 then
      add 1 to a randomly chosen element of x̂
      d −= 1
    else
      subtract 1 from a randomly chosen non-zero element of x̂
      d += 1
    end if
  end while
end if

The SBD algorithm depends heavily on random number selection and is therefore prone to generating non-reproducible results. Thus, seeding the random number generator is highly recommended to ensure reproducible results.
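For example, with NumPy's generator API (an illustrative sketch, not the paper's R workflow), two generators built from the same seed produce identical draws, so every stochastic step of the algorithm replays exactly:

```python
import numpy as np

# Identical seeds yield identical streams of random numbers,
# making the synthetic downscaled data reproducible run-to-run.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)
```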

The novelty of the SBD algorithm is its consideration of the prior distribution as initialization and its deployment of the underlying distribution to generate synthesized downscaled data that is non-negative and conserves the aggregated value of the given data.

Comparison of the synthesized data with the real data

To determine the accuracy of the SBD algorithm, we test it against some real-world data. Here, we have taken 2020 COVID-19 data on infected individuals in Bangladesh and 2022 (January to July) Dengue data on infected individuals in Bangladesh. The aforementioned data are daily counts of newly infected individuals nationwide. We convert these data to monthly aggregates and feed the aggregated data to the algorithm to generate downscaled daily data; hence we can compare the accuracy of the synthetic daily data against the actual daily data. To determine the accuracy of the approximation, we use two error measures and perform component analysis on the real and synthetic data to see whether the synthetic data approximates the underlying properties of the real data well. For the component decomposition, we use the additive model y_t = T_t + S_t + R_t, (2) where T_t, S_t, and R_t denote the trend, seasonal, and residual components; the procured data contain some zero values, for which the multiplicative model y_t = T_t × S_t × R_t (3) is not suitable in this scenario.
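The additive decomposition in (2) can be sketched with a minimal moving-average routine (the paper's analysis was done in R; this pure-NumPy version, with its illustrative demo series, only shows the structure of the model):

```python
import numpy as np

def additive_decompose(y, period):
    """Additive decomposition y_t = T_t + S_t + R_t:
    trend via centred moving average, seasonality via period-wise means
    of the detrended series, residual as the remainder."""
    y = np.asarray(y, dtype=float)
    kernel = np.ones(period) / period
    trend = np.convolve(y, kernel, mode="same")
    detrended = y - trend
    season_profile = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(season_profile, len(y) // period + 1)[: len(y)]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Illustrative series: linear trend plus a weekly (period-7) pattern.
t = np.arange(28)
y = 0.5 * t + np.tile([2.0, -1.0, 0.0, -1.0, 1.0, 0.0, -1.0], 4)
trend, seasonal, resid = additive_decompose(y, period=7)
```

By construction the three components recompose the series exactly, which is the property the trend/seasonality/residual comparisons below rely on.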

Error measures for benchmark

To compare the results with the real-world data, we use two error terms that describe the overall error of the approximation. These are as follows:

  • Root Mean Square Error: RMSE = sqrt((1/n) Σ_{i=1}^{n} (x_i − x̂_i)²)
  • Mean Absolute Error: MAE = (1/n) Σ_{i=1}^{n} |x_i − x̂_i|, where x_i is the actual data and x̂_i is the predicted data.

Since many of the data points in both the actual and synthesized cases are populated with 0, the Mean Absolute Percentage Error (MAPE) and Symmetric Mean Absolute Percentage Error (SMAPE) are undefined in this scenario.
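The two error measures above are straightforward to compute; a small sketch (function names are ours, and the inputs in the test are toy values, not the paper's data):

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Square Error: sqrt of the mean squared difference."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(a - p)))
```

Unlike MAPE/SMAPE, both remain well defined when the series contains zeros, which is why the paper relies on them here.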

Dengue

Preprocessing and result.

For this simulation, we took Bangladesh’s 2022 daily Dengue infection data from January to July. To feed these data into the SBD algorithm, we convert the daily data to a monthly aggregate, as illustrated in Fig 6. The majority of the statistical work in this paper was done in R.

thumbnail
Fig 6. Monthly data of Dengue (2022).

Monthly aggregate of 2022 Dengue data from January to July.

https://doi.org/10.1371/journal.pone.0295803.g006

We feed in these data considering:

  • Initial standard deviation, .
  • Overthrow tolerance, δ = 0.6 × (range of the initial distribution).
  • Iteration limit, n = 100.
  • Radius of open interval, r = 3.
  • Underlying distribution to be normal.

and generate the synthesized data. Fig 7 illustrates the synthesized data, which can be said to be a good approximation of the actual data (Fig 8), given the aggregated prior distribution.

thumbnail
Fig 7. Synthetic data of Dengue.

SBD algorithm generated synthesized daily number of infected cases of Dengue in 2022 from January to July.

https://doi.org/10.1371/journal.pone.0295803.g007

thumbnail
Fig 8. Actual data of Dengue.

Daily number of infected cases of Dengue in 2022 from January to July.

https://doi.org/10.1371/journal.pone.0295803.g008

Error metrics and statistical measures.

The calculated error measures are:

  • MAE = 6.60664, which implies that the average error between the actual and synthesized data is 6.60664.
  • RMSE = 12.64499 which implies that the standard deviation of the residuals/errors is 12.64499. The fact is well illustrated in Fig 14.

The error metrics show satisfactory results. Table 2 verifies that the synthesized data honours the aggregated sum of the prior distribution.

thumbnail
Table 2. Aggregation comparison of actual vs. SBD algorithm generated synthetic data in the case study of Dengue.

https://doi.org/10.1371/journal.pone.0295803.t002

The total number of cases in each scenario has been maintained. As discussed earlier, the initial distribution holds the monthly sum consistently; this is disrupted in the overthrow correction phase and later restored in the volume correction phase.

We shall now explore the basic statistical properties of the synthetic data with respect to the actual data.

It is to be noted that the mean of the synthesized data equates to that of the original data, although it was not plugged into the SBD algorithm in any manner, as illustrated in Table 3. As previously discussed, σ0 is a good approximation of the original σ. All the remaining measures are somewhat close, but the maximum varies by a lot. The maximum is hard to anticipate from the aggregated data; hence it is an avenue that demands further exploration.

thumbnail
Table 3. Statistical measure comparison of 2022 actual Dengue data vs. SBD algorithm generated data.

https://doi.org/10.1371/journal.pone.0295803.t003

Component decomposition and comparison.

We now perform component decomposition of both the actual and synthetic data based on the model mentioned in (2). Component decomposition is not itself a benchmark for accuracy, but the SBD algorithm aims to improve the outcome of forecasting techniques, which are highly influenced by the components within time series data. Thus, comparing these components can answer the question of whether the component-based characteristics of the original time series are present within the synthesized data.

In the case of the trend component (Figs 9 and 10), the actual and the synthesized data show similar results, and the trend of the actual data is well approximated by the trend of the synthesized data.

thumbnail
Fig 9. Trend of Dengue data of 2022 (Actual).

Trend of the actual dengue data.

https://doi.org/10.1371/journal.pone.0295803.g009

thumbnail
Fig 10. Trend of Dengue data of 2022 (Predicted).

Trend of the synthetic dengue data.

https://doi.org/10.1371/journal.pone.0295803.g010

In the case of the seasonality component (Figs 11 and 12), both the actual and the synthesized data show major weekly and minor sub-weekly seasonality. The synthesized data’s seasonality approximates the actual data’s seasonality well.

thumbnail
Fig 11. Seasonality of Dengue data of 2022 (Actual).

Seasonality of the actual dengue data.

https://doi.org/10.1371/journal.pone.0295803.g011

thumbnail
Fig 12. Seasonality of Dengue data Of 2022(Predicted).

Seasonality of the synthetic dengue data.

https://doi.org/10.1371/journal.pone.0295803.g012

In the case of the residual component (Figs 13 and 14), the actual and the synthesized data show similar results. Although the residual of the synthetic data may look noisy at first glance, upon closer inspection it is evident that it shows less deviation from the standard value than the actual data. The synthesized data’s residual approximates the actual data’s residual well.

thumbnail
Fig 13. Residual of Dengue data of 2022(Actual).

Residual of the actual dengue data.

https://doi.org/10.1371/journal.pone.0295803.g013

thumbnail
Fig 14. Residual of Dengue data of 2022(Predicted).

Residual of the synthetic dengue data.

https://doi.org/10.1371/journal.pone.0295803.g014

As mentioned earlier, the key takeaway from the discussion is that the SBD algorithm could generate an excellent approximation of the dengue data from the monthly aggregated data based on some statistical properties of the prior distribution. In the following section, we test the SBD algorithm’s efficacy in another epidemiological scenario.

COVID-19

Preprocessing and result.

For this simulation, we took Bangladesh’s 2020 daily COVID-19 infection data from March to December [29, 30]. To feed these data into the SBD algorithm, we convert the daily data to a monthly aggregate, as illustrated in Fig 15.

thumbnail
Fig 15. Monthly data of Covid-19 (2020).

Monthly aggregate of 2020 COVID-19 infected data of Bangladesh from March to December.

https://doi.org/10.1371/journal.pone.0295803.g015

We feed in these data considering:

  • Initial standard deviation, .
  • Overthrow tolerance, δ = 0.2 × (range of the initial distribution).
  • Iteration limit, n = 100.
  • Radius of open interval, r = 3.
  • Underlying distribution to be normal.

and generate the synthesized data. Fig 16 illustrates the synthesized data, which can be said to be a good approximation of the actual data (Fig 17), given the aggregated prior distribution.

thumbnail
Fig 16. Synthetic data of Covid-19.

Synthesized daily number of infected cases of COVID-19 in 2020 from March to December.

https://doi.org/10.1371/journal.pone.0295803.g016

thumbnail
Fig 17. Daily data of Covid-19 (2020).

Daily number of infected cases of COVID-19 in 2020 from March to December.

https://doi.org/10.1371/journal.pone.0295803.g017

Error metrics and statistical measures.

The calculated error measures are:

  • MAE = 257.41806, which implies that the average error between the actual and synthesized data is 257.41806, which is reasonable considering the mean of the data is 1717.424749.
  • RMSE = 346.6241, which implies that the standard deviation of the residuals/errors is 346.6241. The fact is well illustrated in Fig 23.

It is to be noted that the error terms of this scenario must not be compared with those of the previous case, as they are of varying scale. Compared to the scale of the data, the error metrics show satisfactory results. Table 4 verifies that the synthesized data honours the aggregated sum of the prior distribution.

thumbnail
Table 4. This table illustrates that the synthetic data agrees with the monthly sum of the actual data.

https://doi.org/10.1371/journal.pone.0295803.t004

We shall now explore the basic statistical properties of the synthetic data with respect to the actual data.

It is to be noted that the mean of the synthesized data equates to that of the original data, although it was not plugged into the SBD algorithm in any manner, as illustrated in Table 5. As previously discussed, σ0 is a good approximation of the original σ. All the remaining measures are somewhat close, but the maximum varies by a lot. The maximum is hard to anticipate from the aggregated data; hence it is an avenue that demands further exploration.

thumbnail
Table 5. Statistical measure comparison of actual vs. SBD algorithm generated synthetic data.

https://doi.org/10.1371/journal.pone.0295803.t005

Component decomposition and comparison.

We now perform component decomposition of both the actual and synthetic data based on the model mentioned in (2). Component decomposition is not itself a benchmark for accuracy, but the SBD algorithm aims to improve the outcome of forecasting techniques, which are highly influenced by the components within time series data. Thus, comparing these components can answer the question of whether the original time series’s component-based characteristics are present in the synthesized data.

In the case of the trend component (Figs 18 and 19), the actual and the synthesized data show similar results, and the trend of the actual data is well approximated by the trend of the synthesized data.

thumbnail
Fig 18. Trend of Covid-19 data of 2020(Actual).

Trend of the actual COVID-19 data.

https://doi.org/10.1371/journal.pone.0295803.g018

thumbnail
Fig 19. Trend of Covid-19 data of 2020(Predicted).

Trend of the synthetic COVID-19 data.

https://doi.org/10.1371/journal.pone.0295803.g019

In the case of the seasonality component (Figs 20 and 21), both the actual and the synthesized data show major weekly seasonality. The seasonality of the synthesized data approximates that of the actual data well.

thumbnail
Fig 20. Seasonality of Covid-19 data of 2020(Actual).

Seasonality of the actual COVID-19 data.

https://doi.org/10.1371/journal.pone.0295803.g020

thumbnail
Fig 21. Seasonality of Covid-19 data of 2020(Predicted).

Seasonality of the synthetic COVID-19 data.

https://doi.org/10.1371/journal.pone.0295803.g021

In the case of the residual component (Figs 22 and 23), the actual and the synthesized data show similar results. Although the residual of the synthetic data may look a bit noisy at first glance, upon closer inspection it is evident that it shows less deviation from the standard value than the actual data. The residual of the synthesized data approximates the residual of the actual data well.

thumbnail
Fig 22. Residual of Covid-19 data of 2020(Actual).

Residual of the actual COVID-19 data.

https://doi.org/10.1371/journal.pone.0295803.g022

thumbnail
Fig 23. Residual of Covid-19 data of 2020(Predicted).

Residual of the synthetic COVID-19 data.

https://doi.org/10.1371/journal.pone.0295803.g023

The key takeaway from the discussion above is that the algorithm could generate an excellent approximation of the COVID-19 data from the monthly aggregated data based on some statistical properties of the prior distribution. We shall also test the SBD algorithm’s efficacy in a forecasting scenario in the following section.

Improvements in forecasting accuracy

In this section, we forecast Dengue infection cases in Bangladesh using statistical forecasting tools. Statistical modelling is one of the helpful approaches that may be utilized for forecasting dengue outbreaks [31, 32]. Previous research carried out in China [33], India [34], Thailand [35], the West Indies [36], Colombia [37], and Australia [38] made substantial use of time series techniques in epidemiologic research on infectious diseases. A number of earlier studies looked at the Autoregressive Integrated Moving Average (ARIMA) model as a potential forecasting tool [39–44]. In addition, ARIMA models have seen widespread use for dengue forecasting [42, 45–47]. When establishing statistical forecasting models, these are frequently paired with Seasonal Autoregressive Integrated Moving Average (SARIMA) models, which have proven suitable for assessing time series data with ordinary or seasonal patterns [34, 36, 38, 48–50]. Developing a dengue incidence forecasting model based on knowledge from previous outbreaks and environmental variables might thus be an extremely helpful tool for anticipating the severity and frequency of potential epidemics.

The idea of modelling seasonality using Fourier coefficients, named the Fourier ARIMA model, was introduced by [51, 52]: y_t = δ0 + Σ_{k=1}^{K} [α_k sin(2πt/ω_k) + β_k cos(2πt/ω_k)] + N_t, (4) where δ0 is the constant term, ω_k is the periodicity of the data, α_k and β_k are the Fourier coefficients, and N_t is an ARIMA process.

We aim to forecast the monthly and synthesized daily data using the forecasting techniques mentioned above and compare the forecast accuracy based on error measures. We use SARIMA and Fourier-ARIMA models to forecast the monthly and synthesized data, respectively. The model in each case is chosen based on the lowest value of Akaike’s Information Criterion (AIC), corrected Akaike’s Information Criterion (AICc), and Bayesian Information Criterion (BIC).

Model selection method

The Box-Jenkins method is a generalized model selection pathway that works for time series irrespective of their stationarity or seasonality. The method is illustrated in Fig 24.

thumbnail
Fig 24. Box-Jenkin’s method of model selection.

Flow chart of Box-Jenkin’s Method.

https://doi.org/10.1371/journal.pone.0295803.g024

Error measures of model

The error measure used for comparison is the Mean Absolute Scaled Error (MASE), defined as the forecast MAE scaled by the in-sample MAE of the naive forecast. We use this metric because it is scale-independent and hence well suited for comparison [53, 54]. We could also have taken MAPE as a metric, but MAPE is undefined in such cases because the data are populated with zero values. We also use RMSE and MAE to gauge the error in the forecast [55, 56].
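A small sketch of the MASE computation under the standard non-seasonal definition (the paper does not spell out its exact variant, so this form is an assumption, and the test values are toy numbers):

```python
import numpy as np

def mase(actual, forecast, train):
    """Mean Absolute Scaled Error: out-of-sample MAE divided by the
    in-sample MAE of the one-step naive (previous-value) forecast."""
    actual, forecast, train = (np.asarray(x, float) for x in (actual, forecast, train))
    scale = np.mean(np.abs(np.diff(train)))   # naive forecast MAE on training data
    return float(np.mean(np.abs(actual - forecast)) / scale)
```

A MASE below 1 means the model beats the naive forecast on average, which makes the metric comparable across the monthly and daily series despite their different scales.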

Forecast on the aggregated data

The actual data are monthly Dengue infection data of Bangladesh from 2010 to July 2022. Following the Box-Jenkins method, we first check the stationarity of the data using the Augmented Dickey-Fuller (ADF) test. The ADF test returns a value of -4.7906 with p-value = 0.01, which implies that the data is stationary.

We run multiple SARIMA models, calculate their AIC, AICc, and BIC, and choose the best model based on the minimum value of the criteria. We present the top 5 results in Table 6.

thumbnail
Table 6. Selection of best model based on criteria for aggregated data.

https://doi.org/10.1371/journal.pone.0295803.t006
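The minimum-criterion selection itself can be sketched generically; the AIC form below (for Gaussian residuals, up to an additive constant) and the candidate models with their residuals are illustrative stand-ins, not the paper's actual SARIMA fits:

```python
import numpy as np

def gaussian_aic(resid, n_params):
    """AIC for a model with Gaussian residuals: n*ln(RSS/n) + 2k,
    up to an additive constant that cancels when comparing models."""
    resid = np.asarray(resid, float)
    n = len(resid)
    rss = float(np.sum(resid ** 2))
    return n * np.log(rss / n) + 2 * n_params

# Hypothetical candidates: (residuals, number of parameters).
candidates = {
    "m1": ([0.1, -0.2, 0.1], 2),
    "m2": ([1.0, -1.5, 2.0], 1),
}
best = min(candidates, key=lambda name: gaussian_aic(*candidates[name]))
```

The same pattern (loop over candidate orders, keep the minimum criterion) underlies the Table 6 selection, with AICc and BIC substituting their own penalty terms.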

Here, the best model to use is SARIMA (1, 0, 0)(0, 1, 1)12. We fit the given model, which gives us the coefficients presented in Table 7:

To check the goodness of fit of the model, we use the Ljung-Box test, which returns p-value = 0.9996 > 0.05, i.e. we accept the null hypothesis: “The model does not show lack of fit / the residuals are not autocorrelated / the residuals are random white noise.”

With everything in place, we forecast the infection for the rest of 2022, i.e. from August to December. The forecast is illustrated in Fig 25.

Fig 25. Dengue infection forecast(Monthly).

The figure illustrates the forecast generated by SARIMA (1, 0, 0)(0, 1, 1)12 from actual aggregated data.

https://doi.org/10.1371/journal.pone.0295803.g025

To validate the goodness of the fit, we analyze the model residuals, illustrated in Fig 26. Here, the top graph shows the residuals over the timeline of the original data. The bottom left graph shows the Autocorrelation Function (ACF) with respect to the lag of the data; almost all the values are within the significance level. The bottom right figure shows the distribution of the model’s residuals, which implies that the residuals are approximately normally distributed with zero mean.

Fig 26. Residuals of the SARIMA (1, 0, 0)(0, 1, 1)12.

The bottom left graph illustrates that the ACF values for different choices of lag are all contained within the significance level (the dotted blue lines), and the bottom right graph shows that the residuals are normally distributed with mean about 0.

https://doi.org/10.1371/journal.pone.0295803.g026

To calculate the accuracy of the given forecast, we calculate the aforementioned error measures presented in Table 8.

Table 8. Error measures for the forecast of the SARIMA (1, 0, 0)(0, 1, 1)12 of the actual aggregated data.

https://doi.org/10.1371/journal.pone.0295803.t008

The error measures are acceptable given the magnitude of the data, but there is room for improvement, as shall be demonstrated in the following subsection.

Forecast on the synthesized data

The synthesized data is the daily Dengue infection data of Bangladesh from 2010 to July 2022. Following the Box-Jenkins method, we first check the stationarity of the data using the Augmented Dickey-Fuller (ADF) test. The ADF test returns a statistic of -6.6531 with p-value = 0.01, which implies that the data is stationary.

We run multiple Fourier-ARIMA models and calculate their AIC, AICc and BIC. The best model is chosen based on the minimum value of these criteria. We present the top 5 results in Table 9. In each case of the Fourier transformation, we used one pair of trigonometric terms, where each pair comprises a sine and a cosine term as defined in (4), and the periodicity of the Fourier term is set to 365.25. Prior to this, we applied a Box-Cox transformation with λ = 0.49.

Here, the best model to use is ARIMA(7,0,7). We fit the given model, which gives us the coefficients in Table 10.

To check the goodness of fit of the model, we use the Ljung-Box test, which returns p-value = 0.07749 > 0.05, i.e. we fail to reject the null hypothesis: “The model does not show lack of fit; the residuals are not autocorrelated; the residuals are random white noise.”

With everything in place, we forecast the infection for the rest of 2022, i.e. from August to December. The forecast is illustrated in Fig 27.

Fig 27. Dengue infection forecast(Daily).

The figure illustrates the forecast generated by ARIMA(7,0,7) from the synthesized daily data.

https://doi.org/10.1371/journal.pone.0295803.g027

To validate the goodness of the fit, we analyze the model residuals, illustrated in Fig 28. Here, the top graph shows the residuals over the timeline of the original data. The bottom left graph shows the Autocorrelation Function (ACF) with respect to the lag of the data; almost all the values are within the significance level. The bottom right figure shows the distribution of the model’s residuals, which implies that the residuals are normally distributed with zero mean.

Fig 28. Residuals from regression with ARIMA (7, 0, 7) errors.

The bottom left graph illustrates that the ACF values for different choices of lag are mostly contained within the significance level (the dotted blue lines), and the bottom right graph shows that the residuals are normally distributed with mean about 0.

https://doi.org/10.1371/journal.pone.0295803.g028

To calculate the accuracy of the given forecast, we calculate the aforementioned error measures presented in Table 11.

Table 11. Error measures for the forecast of the ARIMA (7, 0, 7) of the synthetic daily data.

https://doi.org/10.1371/journal.pone.0295803.t011

The error measures are acceptable given the magnitude of the data. Compared to the error measures of the actual data in Table 8, Table 11 shows a clear improvement. Comparing the MASE terms of the two tables shows about a 72.76% decrease in error using the synthetic data over the actual data.

Conclusion

In this paper, a novel temporal downscaling algorithm named the Stochastic Bayesian Downscaling (SBD) algorithm has been proposed that can generate downscaled/disaggregated time series data of varying time lengths from aggregated data. We have presented two case studies of Bangladesh, using 2022 Dengue data ranging from January to July and 2020 COVID-19 data, to exhibit the workflow of the algorithm. In both case studies, the algorithm-generated synthetic data managed to replicate the mean of the actual data without ever being provided with it. In the case of the other statistical measures, the synthetic data could approximate them closely, except for the maximum value. A way around this issue is still an open question for research. Finally, we have tested how classical statistical forecasting methods respond to the synthetic data with respect to actual aggregated data, using monthly Dengue data of Bangladesh for the last 12 years. Our findings show that, using synthetic data over actual aggregated data, we were able to reduce the scale-free error measure by 72.76%.

The SBD algorithm presented in this paper is designed to handle integer data by imposing certain restrictions but can be generalized to handle real numbers upon relaxing those restrictions. Hence, exploring diverse use cases in public health, epidemiology, economics, and finance is a future direction of research. In this paper, we have only studied how statistical forecasting models respond to synthetic data compared to actual data. Repeating similar studies for the predictive class of machine learning models, such as Long Short-Term Memory (LSTM) networks and XGBoost, is a further scope of research.

Downscaling algorithms have predominantly been used in geology to facilitate the outputs of the prevalent models in that field. Very few applications have been made in epidemiology, and most of those are spatial downscaling. This paper contributes to the current body of knowledge by proposing a parametric, probabilistic, one-dimensional downscaling algorithm for aggregated data in the field of epidemiology that enables the existing classical statistical forecasting toolbox to generate better forecasts than the aggregated data allows. As forecasting models like ARIMA and SARIMA are sensitive to data discontinuity and outliers, the SBD algorithm can be implemented as a pre-model cleaning step to curate better results on a finer scale. Since the SBD algorithm can increase data volume substantially (e.g. downscaling monthly data to daily data increases the number of data points about 30-fold on average) while preserving key statistics and properties of the data, such downscaled data can open an avenue for exploration using state-of-the-art neural network models, which often require large volumes of data to generate fruitful outcomes.

References

  1. Ribalaygua J., Torres L., Pórtoles J., Monjo R., Gaitán E. & Pino M. Description and validation of a two-step analogue/regression downscaling method. Theoretical And Applied Climatology. 114, 253–269 (2013)
  2. Peng J., Loew A., Merlin O. & Verhoest N. A review of spatial downscaling of satellite remotely sensed soil moisture. Reviews Of Geophysics. 55, 341–366 (2017)
  3. Kim S., Kim J. & Bae D. Optimizing Parameters for the Downscaling of Daily Precipitation in Normal and Drought Periods in South Korea. Water. 14, 1108 (2022)
  4. Bae D., Koike T., Awan J., Lee M. & Sohn K. Climate change impact assessment on water resources and susceptible zones identification in the Asian monsoon region. Water Resources Management. 29, 5377–5393 (2015)
  5. Lee M., Im E. & Bae D. Impact of the spatial variability of daily precipitation on hydrological projections: A comparison of GCM- and RCM-driven cases in the Han River basin, Korea. Hydrological Processes. 33, 2240–2257 (2019)
  6. Kim J., Im E. & Bae D. Intensified hydroclimatic regime in Korean basins under 1.5 and 2°C global warming. International Journal Of Climatology. 40, 1965–1978 (2020)
  7. Gangopadhyay S., Clark M. & Rajagopalan B. Statistical downscaling using K-nearest neighbors. Water Resources Research. 41 (2005)
  8. Fowler H., Blenkinsop S. & Tebaldi C. Linking climate change modelling to impacts studies: recent advances in downscaling techniques for hydrological modelling. International Journal Of Climatology: A Journal Of The Royal Meteorological Society. 27, 1547–1578 (2007)
  9. Lee T. & Jeong C. Nonparametric statistical temporal downscaling of daily precipitation to hourly precipitation and implications for climate change scenarios. Journal Of Hydrology. 510 pp. 182–196 (2014)
  10. Buster G., Rossol M., Maclaurin G., Xie Y. & Sengupta M. A physical downscaling algorithm for the generation of high-resolution spatiotemporal solar irradiance data. Solar Energy. 216 pp. 508–517 (2021)
  11. Liu J., Yuan D., Zhang L., Zou X. & Song X. Comparison of three statistical downscaling methods and ensemble downscaling method based on Bayesian model averaging in upper Hanjiang River Basin, China. Advances In Meteorology. 2016 (2016)
  12. Yaoming L., Qiang Z. & Deliang C. Stochastic modeling of daily precipitation in China. Journal Of Geographical Sciences. 14, 417–426 (2004)
  13. Liao Y. Change of parameters of BCC/RCG-WG for daily non-precipitation variables in China: 1951–1978 and 1979–2007. Journal Of Geographical Sciences. 23, 579–594 (2013)
  14. Dibike Y. & Coulibaly P. Hydrologic impact of climate change in the Saguenay watershed: comparison of downscaling methods and hydrologic models. Journal Of Hydrology. 307, 145–163 (2005)
  15. Wetterhall F., Halldin S. & Xu C. Seasonality properties of four statistical-downscaling methods in central Sweden. Theoretical And Applied Climatology. 87, 123–137 (2007)
  16. Khan M., Coulibaly P. & Dibike Y. Uncertainty analysis of statistical downscaling methods. Journal Of Hydrology. 319, 357–382 (2006)
  17. Wilby R., Dawson C. & Barrow E. SDSM—a decision support tool for the assessment of regional climate change impacts. Environmental Modelling & Software. 17, 145–157 (2002)
  18. Harpham C. & Wilby R. Multi-site downscaling of heavy daily precipitation occurrence and amounts. Journal Of Hydrology. 312, 235–255 (2005)
  19. Wilby R. & Dettinger M. Streamflow changes in the Sierra Nevada, California, simulated using a statistically downscaled general circulation model scenario of climate change. Linking Climate Change To Land Surface Change. pp. 99–121 (2000)
  20. Raftery A. & Zheng Y. Discussion: Performance of Bayesian model averaging. Journal Of The American Statistical Association. 98, 931–938 (2003)
  21. Tripathi S., Srinivas V. & Nanjundiah R. Downscaling of precipitation for climate change scenarios: a support vector machine approach. Journal Of Hydrology. 330, 621–640 (2006)
  22. Yu X. & Liong S. Forecasting of hydrologic time series with ridge regression in feature space. Journal Of Hydrology. 332, 290–302 (2007)
  23. Ghosh S. & Mujumdar P. Statistical downscaling of GCM simulations to streamflow using relevance vector machine. Advances In Water Resources. 31, 132–146 (2008)
  24. Matisziw T., Grubesic T. & Wei H. Downscaling spatial structure for the analysis of epidemiological data. Computers, Environment And Urban Systems. 32, 81–93 (2008)
  25. Mahmud M., Kamrujjaman M., Adan M., Hossain M., Rahman M., Islam M., et al. Vaccine efficacy and sars-cov-2 control in california and us during the session 2020–2026: A modeling study. Infectious Disease Modelling. 7, 62–81 (2022) pmid:34869959
  26. DGHS DENV Press Releases. (2022), https://dashboard.dghs.gov.bd/webportal/pages/heoc_dengue.php
  27. IEDCR Dengue Surveillance Report. https://iedcr.gov.bd/surveillances/
  28. WHO COVID-19 dashboard. (2022), https://covid19.who.int/data
  29. Kamrujjaman M., Mahmud M. & Islam M. Coronavirus outbreak and the mathematical growth map of Covid-19. Annual Research & Review In Biology. pp. 72–78 (2020)
  30. Islam M., Ira J., Kabir K. & Kamrujjaman M. Effect of lockdown and isolation to suppress the COVID-19 in Bangladesh: an epidemic compartments model. J Appl Math Comput. 4, 83–93 (2020)
  31. Wong L., Shakir S., Atefi N. & AbuBakar S. Factors affecting dengue prevention practices: nationwide survey of the Malaysian public. PloS One. 10, e0122890 (2015) pmid:25836366
  32. Husin N., Salim N. & Others Modeling of dengue outbreak prediction in Malaysia: a comparison of neural network and nonlinear regression model. 2008 International Symposium On Information Technology. 3 pp. 1–4 (2008)
  33. Lu L., Lin H., Tian L., Yang W., Sun J. & Liu Q. Time series analysis of dengue fever and weather in Guangzhou, China. BMC Public Health. 9, 1–5 (2009) pmid:19860867
  34. Bhatnagar S., Lal V., Gupta S., Gupta O. & Others Forecasting incidence of dengue in Rajasthan, using time series analyses. Indian Journal Of Public Health. 56, 281 (2012) pmid:23354138
  35. Wongkoon S., Pollar M., Jaroensutasinee M. & Jaroensutasinee K. Predicting DHF incidence in Northern Thailand using time series analysis technique. International Journal Of Medical And Health Sciences. 1, 484–488 (2007)
  36. Gharbi M., Quenel P., Gustave J., Cassadou S., Ruche G., Girdary L., et al. Time series analysis of dengue incidence in Guadeloupe, French West Indies: forecasting models using climate variables as predictors. BMC Infectious Diseases. 11, 1–13 (2011) pmid:21658238
  37. Torres C., Barguil S., Melgarejo M. & Olarte A. Fuzzy model identification of dengue epidemic in Colombia based on multiresolution analysis. Artificial Intelligence In Medicine. 60, 41–51 (2014) pmid:24388398
  38. Hu W., Clements A., Williams G. & Tong S. Dengue fever and El Nino/Southern Oscillation in Queensland, Australia: a time series predictive model. Occupational And Environmental Medicine. 67, 307–311 (2010) pmid:19819860
  39. Hossian M. & Abdulla F. A Time Series analysis for the pineapple production in Bangladesh. Jahangirnagar University Journal Of Science. 38, 49–59 (2015)
  40. Hossain M. & Abdulla F. Jute production in Bangladesh: a time series analysis. Journal Of Mathematics And Statistics. 11, 93–98 (2015)
  41. Abdulla F. & Hossain M. Forecasting of Wheat Production in Kushtia District & Bangladesh by ARIMA Model: An Application of Box-Jenkin’s Method. Journal Of Statistics Applications & Probability. 4, 465 (2015)
  42. Hossain M., Abdulla F. & Others Forecasting the tea production of Bangladesh: Application of ARIMA model. (2015)
  43. Hossain M., Abdulla F. & Majumder A. Forecasting of banana production in Bangladesh. American Journal Of Agricultural And Biological Sciences. 11, 93–99 (2016)
  44. Hossain M. & Abdulla F. Forecasting potato production in Bangladesh by ARIMA model. Journal Of Advanced Statistics. 1, 191–198 (2016)
  45. Earnest A., Tan S., Wilder-Smith A. & Machin D. Comparing Statistical Models to Predict Dengue Fever Notifications. (2012)
  46. Wu P., Guo H., Lung S., Lin C. & Su H. Weather as an effective predictor for occurrence of dengue fever in Taiwan. Acta Tropica. 103, 50–57 (2007) pmid:17612499
  47. Eastin M., Delmelle E., Casas I., Wexler J. & Self C. Intra- and interseasonal autoregressive prediction of dengue outbreaks using local weather and regional climate for a tropical environment in Colombia. The American Journal Of Tropical Medicine And Hygiene. 91, 598 (2014) pmid:24957546
  48. Luz P., Mendes B., Codeço C., Struchiner C., Galvani A. & Others Time series analysis of dengue incidence in Rio de Janeiro, Brazil. (American Society of Tropical Medicine, 2008)
  49. Martinez E., Silva E. & Fabbro A. A SARIMA forecasting model to predict the number of cases of dengue in Campinas, State of São Paulo, Brazil. Revista Da Sociedade Brasileira De Medicina Tropical. 44 pp. 436–440 (2011) pmid:21860888
  50. Brownlee J. Introduction to time series forecasting with python: how to prepare data and develop models to predict the future. (Machine Learning Mastery, 2017)
  51. Nachane D. & Clavel J. Forecasting interest rates: a comparative assessment of some second-generation nonlinear models. Journal Of Applied Statistics. 35, 493–514 (2008)
  52. Iwok I. & Udoh G. A Comparative Study between the ARIMA-Fourier Model and the Wavelet model. American Journal Of Scientific And Industrial Research. 7 pp. 137–144 (2016)
  53. Hyndman R. & Koehler A. Another look at measures of forecast accuracy. International Journal Of Forecasting. 22, 679–688 (2006)
  54. Pontius R., Thontteh O. & Chen H. Components of information for multiple resolution comparison between maps that share a real variable. Environmental And Ecological Statistics. 15, 111–142 (2008)
  55. Willmott C. & Matsuura K. On the use of dimensioned measures of error to evaluate the performance of spatial interpolators. International Journal Of Geographical Information Science. 20, 89–102 (2006)
  56. Hyndman R. & Athanasopoulos G. Forecasting: principles and practice. (OTexts, 2018)