Downscaling epidemiological time series data for improving forecasting accuracy: An algorithmic approach

Data scarcity and discontinuity are common in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are frequently processed as monthly/yearly aggregates, on which prevalent forecasting tools such as the Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. This paper proposes a novel algorithm, the Mahadee-Kamrujjaman Downscaling (MKD) algorithm, based on a Bayesian approach, that can regenerate downscaled time series of varying lengths from aggregated data while preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time series case studies from Bangladesh (Dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data in terms of statistical properties, trend, seasonality, and residuals. Regarding forecasting performance, using the last 12 years of Dengue infection data in Bangladesh, we were able to decrease error terms by up to 72.76% using synthetic data over actual aggregated data.


Introduction
Any process that derives high-resolution data from low-resolution variables is referred to as downscaling. This method relies on dynamical or statistical approaches and is extensively utilized in meteorology, climatology, and remote sensing [1,2]. Significant exploration of downscaling methods has been done in geology and climatology to enhance the output of existing models such as the General Circulation Model (GCM) [3,4,5,6,7,8], Regional Climate Model (RCM) [9], Integrated Grid Modeling System (IGMS) [10], and System Advisor Model (SAM) [10], and to make them usable for forecasts over geographically significant regions and times. Several methods have been used to downscale these data, such as BCC/RCG-Weather Generators (BCC/RCG-WG) [11,12,13], the Statistical Downscaling Model (SDSM) [11,14,15,16,17,18,19], and Bayesian Model Averaging (BMA) [20]. Machine learning methods have also been used, such as the Genetic Algorithm (GA) [9], K Nearest Neighbour Resampling (KNNR) [9], and the Support Vector Machine (SVM) [11,21,22,23]. Except for the machine learning algorithms, which are general-purpose methods finding applications in new domains, the rest are tailored to suit the outputs of the models mentioned earlier.
This class of methods has recently been applied to the disaggregation of spatial epidemiological data [24], but significant work has yet to be done on the temporal downscaling of epidemiological data. Often, the temporal downscaling techniques used are classical interpolation techniques that do not do justice to aggregated data. This can be illustrated with an example. Consider the monthly Dengue infection data of 2017 from Figure 15, downscaled using linear interpolation by treating the aggregated value as the value on the end date of each month (Figure 16). In this case, the monthly aggregate of the downscaled data does not match the original aggregate. Downscaled data that differs from the original in such basic statistical measures will lead to decisions and knowledge that may be far from the truth.
This paper aims to propose a novel algorithm named Mahadee-Kamrujjaman Downscaling (MKD) algorithm based on the Bayesian approach that can regenerate downscaled temporal time series of varying time lengths from aggregated data preserving most of the statistical characteristics and the aggregated sum of the original data.
The paper is organized as follows. Section 2 describes the data used in the paper and its sources. Section 3 discusses the methodology at length along with the proposed MKD algorithm. Section 4 compares the synthesized data with the actual data in two different epidemiological cases (Dengue and COVID-19) in Bangladesh, shows how the MKD algorithm can generate a statistically accurate approximation of the actual data with very little input in both cases, and discusses the benchmark metrics used for evaluating the output. Section 5 shows the improvement in forecasting accuracy when using synthesized data over aggregated data with a statistical forecasting toolbox in the Dengue scenario of Bangladesh, using the last 12 years of monthly aggregated data, along with the forecasting model selection procedures and residuals. Finally, Section 6 concludes with an overview of the paper, its contributions to the existing literature, scopes for improvement, and fields of application of the MKD algorithm.

Data
The Dengue data from Bangladesh used in this paper span January 2010 to July 2022 and were collected from the DGHS [25] and the IEDCR [26]. The COVID-19 data of Bangladesh span 8 March 2020 to December 2020 and were collected from the WHO data repository [27].

Methodology
The MKD algorithm can be segmented into three sequential parts, as exhibited in Figure 1. Initially, the algorithm considers a prior distribution from which to generate synthetic downscaled data; the MKD algorithm treats the scaled aggregated data as the prior distribution of the downscaled data. For example, given the monthly epidemiological data of Dengue for the year 2017, we obtain the prior distribution for the downscaled daily data by dividing the monthly data by 30. This is illustrated in Figures 15 and 16 in Appendix A: Figure 15 depicts the monthly distribution of DENV (Dengue virus) infection in Bangladesh for 2017, and Figure 16 represents the prior distribution obtained by the method described above. From the prior distribution, the initial statistical properties of the synthetic data are obtained, except for the standard deviation (σ): the scaling used to obtain the prior distribution from the monthly aggregate leaves σ tied to the scale of the monthly aggregate rather than that of the daily data. To overcome this problem, we consider,

σ0 = σprior distribution / 30, (3.1)
where σ0 is the standard deviation of the distribution to be fitted by the algorithm to generate the downscaled data, and σprior distribution is the standard deviation of the obtained prior distribution. Later, in Section 4, we will see that the initial assumption for the standard deviation in (3.1) is a good approximation for the downscaled data.
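As a concrete sketch of this initialization (a minimal Python rendering with hypothetical monthly counts, not the paper's dataset), the prior distribution and the σ0 of (3.1) can be computed as:

```python
import numpy as np

# Hypothetical monthly aggregates; illustrative only, not the paper's data
monthly_totals = np.array([126.0, 20.0, 23.0, 163.0, 737.0, 1571.0, 937.0])
days_per_month = 30

# Prior distribution of the daily series: each month's total spread evenly
prior = np.repeat(monthly_totals / days_per_month, days_per_month)

# Equation (3.1): rescale the aggregate-level standard deviation to the daily scale
sigma_0 = monthly_totals.std(ddof=0) / days_per_month
```

Note that the prior conserves the total count by construction, and its population standard deviation coincides with the σ0 obtained from (3.1) when every month has equal length.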

Initial Data Generation
The "Initial Data Generator" phase takes the aggregated data, the length of the aggregation interval, and σ0, and produces initial downscaled data through a "Distribution Generator". Based on the prior distribution, a suitable statistical probability distribution (PD) is selected and fitted to generate the data. The challenge, not only at this stage but at every step of the algorithm, is to ensure that the synthetic data consists of non-negative integers, as we are dealing with epidemiological data. Specific measures are therefore deployed:
• To ensure non-negativity, a transformation is applied to the generated values.
• To ensure that the data points are integers irrespective of the selected PD, we round the data to the nearest integer and subtract one from randomly selected data points in each aggregated unit so that the synthesized data has the same sum as the aggregated unit.
Imposing these measures, the "Distribution Generator" produces a synthetic distribution for each aggregated unit, and looping over the entire aggregated timeline yields the initial distribution of the downscaled data with respect to the aggregated data. This initial distribution is a suitable approximation of the actual data but can be improved with further refinement; by construction, the synthetic data aggregates exactly to the data from which it is generated.
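The step above can be sketched as follows (a minimal Python illustration, not the authors' implementation; the normal PD mirrors the choice made in the case studies below, the absolute-value transform stands in for the unspecified non-negativity transformation, and the sum adjustment generalizes the "subtract one from random points" rule to both directions):

```python
import numpy as np

def distribution_generator(total, n_points, sigma, rng):
    """Generate n_points non-negative integers summing exactly to `total`.

    Sketch of the 'Distribution Generator': sample from a normal distribution
    centred on the prior mean, take absolute values to enforce non-negativity,
    round to integers, then nudge randomly chosen points until the synthetic
    sum matches the aggregated unit.
    """
    mean = total / n_points
    sample = np.abs(rng.normal(mean, sigma, n_points))  # non-negativity transform
    sample = np.rint(sample).astype(int)                # integer counts
    diff = total - sample.sum()
    while diff != 0:                                    # conserve the aggregate
        i = rng.integers(n_points)
        step = 1 if diff > 0 else -1
        if sample[i] + step >= 0:                       # never go negative
            sample[i] += step
            diff -= step
    return sample

rng = np.random.default_rng(42)  # seeded, as recommended later in the text
month = distribution_generator(total=737, n_points=30, sigma=18.6, rng=rng)
```

Running the generator once per aggregated unit and concatenating the results gives the initial downscaled series.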

Overthrow Correction
This step is often necessary for time series with abrupt changes in gradient, or when the initial approximation contains abnormally large overthrows, as the approximations are probabilistic. For data with abrupt gradient changes, the initial approximation is often left with a staircase-like structure, as exhibited in Figure 2. The problem can be corrected using the overthrow correction measure demonstrated in Figure 3. The overthrow correction step takes a tolerance δ, an iteration limit n, and a radius r of an open interval. It first identifies overthrows using the tolerance between neighboring points: if y_i − y_{i−1} > δ or y_i − y_{i+1} > δ, then y_i is an overthrow. After identifying an overthrow, we consider an open interval of radius r around the overthrow point and execute the distribution generator on that interval, which redistributes the sample within the interval and diminishes the overthrow to some extent. This process is iterated n times over the entire time series to ensure satisfactory results. The strength of the overthrow correction is dictated by the two parameters δ and n: it is directly proportional to n and inversely proportional to δ. Selecting the correct parameter values ensures a good approximation of the real-life scenario.
Figure 3: Initial approximation with overthrow correction exhibits a much better approximation of the real-case scenario, preserving its original trend.
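The detection-and-redistribution loop can be sketched as follows (a hedged Python illustration; the random-weight redistribution stands in for a re-run of the distribution generator on the interval, and all parameter values are hypothetical):

```python
import numpy as np

def overthrow_correction(series, delta, n_iter, r, rng):
    """Sketch of the overthrow-correction step (parameter names follow the text).

    A point y_i is an overthrow when it exceeds either neighbour by more than
    the tolerance delta; the mass inside the interval of radius r around it
    is then redistributed with a fresh random draw.
    """
    y = series.copy()
    for _ in range(n_iter):
        for i in range(1, len(y) - 1):
            if y[i] - y[i - 1] > delta or y[i] - y[i + 1] > delta:
                lo, hi = max(0, i - r), min(len(y), i + r + 1)
                window_sum = int(y[lo:hi].sum())
                weights = rng.random(hi - lo)
                redistributed = np.floor(window_sum * weights / weights.sum()).astype(int)
                redistributed[0] += window_sum - redistributed.sum()  # keep the interval's sum
                y[lo:hi] = redistributed
    return y

rng = np.random.default_rng(0)  # seeded for reproducibility
spiky = np.array([2, 3, 60, 3, 2, 2, 3, 4])
smoothed = overthrow_correction(spiky, delta=20, n_iter=2, r=2, rng=rng)
```

Because each redistribution conserves its own interval's sum, the global total is preserved, though per-unit sums may drift, which motivates the volume correction that follows.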

Volume Correction
Due to its local correction property, the overthrow correction disrupts the property of the synthesized time series that its aggregated sums equal the given aggregated distribution; the scenario is best illustrated in Table 1. This problem is addressed in this step. To keep the aggregated sums equal to those of the original data, we consider each aggregated unit and adjust its sum by adding/subtracting 1 at randomly chosen indices until the sum equates as required.
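A minimal sketch of this step, assuming equal-length aggregation units (function and variable names are illustrative, not the authors'):

```python
import numpy as np

def volume_correction(synth, monthly_totals, days_per_month, rng):
    """Sketch of the volume-correction step.

    For each aggregated unit, add or subtract 1 at randomly chosen indices
    (never going below zero) until the unit's synthetic sum matches the
    original aggregate again.
    """
    y = synth.copy()
    for m, total in enumerate(monthly_totals):
        lo, hi = m * days_per_month, (m + 1) * days_per_month
        diff = total - y[lo:hi].sum()
        while diff != 0:
            i = rng.integers(lo, hi)
            step = 1 if diff > 0 else -1
            if y[i] + step >= 0:
                y[i] += step
                diff -= step
    return y

rng = np.random.default_rng(0)
drifted = np.array([3] * 30 + [2] * 30)          # per-month sums 90 and 60
corrected = volume_correction(drifted, [100, 50], 30, rng)
```

After this pass, every aggregation unit again sums to its original aggregate, restoring the conservation property of the initial generation phase.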
Table 1: Comparison of the number of cases each month when executing the MKD algorithm on the 2019 Dengue data of Bangladesh against the actual data. The total number of infected individuals is the same at every step of the algorithm. In the monthly sums, we see some anomalies in the overthrow correction step, which are fixed in the volume correction step.

The Mahadee-Kamrujjaman Downscaling (MKD) Algorithm
The algorithm calls for a unique name of its own; from now on, we shall address it as the Mahadee-Kamrujjaman Downscaling (MKD) algorithm. The structural parts of the algorithm have been discussed at length in the first three subsections of the methodology section; its pseudocode combines these three phases in sequence. The MKD algorithm depends heavily on the random selection of numbers, which is prone to generating non-reproducible results; seeding the random number generator is therefore highly recommended to ensure reproducible results.
The novelty of the MKD algorithm lies in its use of the prior distribution as initialization and its deployment of the underlying distribution to generate synthesized downscaled data that is non-negative and conserves the aggregated values of the given data.

Comparison of the Synthesized Data with the Real Data
To determine the accuracy of the MKD algorithm, we test it against some real-world data. Here, we have taken the 2020 COVID-19 data on infected individuals in Bangladesh and the 2022 (January to July) Dengue data on infected individuals in Bangladesh. Both are daily counts of newly infected individuals across the country. We convert these data to monthly aggregates and feed the aggregated data to the algorithm to generate downscaled daily data, so that we can compare the accuracy of the synthetic daily data against the actual daily data. To determine the accuracy of the approximation, we use two error measures, and we perform component analysis on the real and synthetic data to see whether the synthetic data approximates the underlying properties of the real data well. For the component decomposition, we use the additive model in (4.1), as the procured data contains zero values, for which the multiplicative model in (4.2) is not suitable.

Error Measures for Benchmark
To compare the results with the real-world data, we use two error terms that describe the overall error of the approximation. These are as follows: • Root Mean Square Error: The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a commonly used measure of the discrepancies between the values predicted by a model or estimator and the actual values. RMSD is the square root of the second sample moment of the differences between predicted and observed values, i.e., the quadratic mean of these differences. When the computations are executed over the data set used for estimation, these deviations are known as residuals; when computed out-of-sample, they are known as errors (or prediction errors). The RMSD aggregates the magnitudes of the prediction errors for various data points into a single measure of predictive ability. RMSD is a measure of accuracy used to compare the forecasting errors of different models for a specific dataset, and not between datasets, as it is scale-dependent [28].
RMSD is always non-negative, and a value of 0 would indicate a perfect fit to the data, which is almost never attained in practice. A smaller RMSD is preferable to a greater one. However, because the metric depends on the magnitude of the numbers involved, comparisons between different kinds of data are invalid.
The RMSD is the square root of the mean of the squared errors. The influence of each error on the RMSD is proportional to the magnitude of its square; therefore, larger errors have an outsized effect on the RMSD. As a result, the RMSD is extremely sensitive to outliers [29,30].
Instead of the RMSD, the Mean Absolute Error (MAE) has been suggested as a useful statistical tool by a number of scholars. The MAE has certain advantages over the RMSD with regard to interpretability: the MAE is simply the average of the absolute values of the errors, whereas the square root of the average of squared errors is more difficult to grasp.
In addition, each error affects the MAE in direct proportion to its absolute value, which is not the case for the RMSD [29].
RMSE can be defined using the following formula:

RMSE = √( (1/N) Σ_{i=1}^{N} (x_i − x̂_i)² )

where x_i is the actual data and x̂_i is the predicted data.
• Mean Absolute Error: In statistics, the mean absolute error (MAE) is a measure of the errors between paired observations expressing the same phenomenon. The mean absolute error uses the same scale as the data being measured. Because it is a scale-dependent accuracy measure, it cannot be used to compare series with different scales, as such comparisons would be invalid [31].
In time series analysis, the mean absolute error is a common way to quantify the accuracy of forecasts [28], occasionally leading to confusion with the more traditional definition of mean absolute deviation; the same confusion exists more generally.
The mean absolute error is one of many methods for comparing forecasts with the outcomes that actually transpired. Well-established alternatives include the mean absolute scaled error (MASE) and the mean squared error. These all summarize performance in a way that disregards the direction of over- or under-prediction; a measure that does put emphasis on direction is the mean signed difference.
When fitting a prediction model with a chosen performance metric, the counterpart of the mean absolute error is least absolute deviations, while the least squares approach corresponds to the mean squared error.
Although some academics report and interpret them that way, the mean absolute error (MAE) and root-mean-square error (RMSE) are not the same concept. The MAE is conceptually simpler and easier to interpret than the RMSE: it is just the average absolute vertical (or horizontal) distance between each point in a scatter plot and the Y = X line. In other words, the MAE is the average absolute difference between X and Y. Furthermore, each error contributes to the MAE in proportion to its absolute value, whereas the RMSE involves squaring the differences, so a few large differences have a greater impact on the RMSE than on the MAE [29].
Since many of the data points in both the actual and synthesized cases are zero, the Mean Absolute Percentage Error (MAPE) and the symmetric Mean Absolute Percentage Error (sMAPE) are undefined in this scenario.
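The two benchmark measures are straightforward to compute directly; a short sketch (function names are ours, not the paper's):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared deviation."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute deviations."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted))

# MAPE divides by the actual values, so zero-valued observations make it
# undefined here, which is why the text restricts itself to RMSE and MAE.
```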

Preprocessing and Result
For this simulation, we took Bangladesh's 2022 daily Dengue infection data from January to July. To feed these data into the MKD algorithm, we convert the daily data to a monthly aggregate, as illustrated in Figure 4. We feed in this data considering: • Initial standard deviation, σ0 = σprior distribution/30 = 556.6431703/30 = 18.55477234.
• Underlying distribution to be normal.
and generate the synthesized data. Figure 6 illustrates the synthesized data, which can be said to be a good approximation of the actual data given the aggregated prior distribution.

Error Metrics and Statistical Measures
The calculated error measures are: • MAE = 6.60664, which implies that the average error between the actual and synthesized data is 6.60664.
• RMSE = 12.64499, which implies that the standard deviation of the residuals/errors is 12.64499. This is well illustrated in Figure 22.
The error metrics show satisfactory results. The total number of cases is maintained in every scenario. As discussed earlier, the initial distribution holds the monthly sums consistently; these get disrupted in the overthrow correction phase and are later corrected in the volume correction phase.
We shall now explore the basic statistical properties of the synthetic data with respect to the actual data.

Measures
It is to be noted that the mean of the synthesized data equals that of the original data, although it was never supplied to the MKD algorithm. As previously discussed, σ0 is a good approximation of the original σ. All the remaining measures are reasonably close, but the maximum varies considerably. The maximum is hard to anticipate from the aggregated data; hence it is an avenue that demands further exploration.

Component Decomposition and Comparison
We now perform component decomposition of both the actual and synthetic data based on the model in (4.1). Component decomposition is in no way a benchmark for accuracy; however, the MKD algorithm aims to improve the outcome of forecasting techniques, which are highly influenced by the components within a time series. Comparing these components can therefore answer whether the component-based characteristics of the original time series are present in the synthesized data.
For the trend component (Figures 17 and 18 in Appendix A), the actual and synthesized data show similar results, and the trend of the actual data is well approximated by the trend of the synthesized data.
For the seasonality component (Figures 19 and 20 in Appendix A), both the actual and synthesized data show major weekly and minor sub-weekly seasonality. The synthesized data's seasonality approximates the actual data's seasonality well.
For the residual component (Figures 21 and 22 in Appendix A), the actual and synthesized data show similar results; although the residual of the synthetic data may look a bit noisy at first glance, closer inspection shows that it deviates less from the reference value than that of the actual data. The actual data's residual is well approximated by the synthesized data's residual.
The key takeaway from the discussion above is that the MKD algorithm can generate an excellent approximation of the Dengue data from the monthly aggregated data based on a few statistical properties of the prior distribution. We test the MKD algorithm's efficacy in another epidemiological scenario in the following section.

Preprocessing and Result
For this simulation, we took Bangladesh's 2020 daily COVID-19 infection data from March to December. To feed these data into the MKD algorithm, we convert the daily data to a monthly aggregate, as illustrated in Figure 7. We feed in this data considering: • Initial standard deviation, σ0 = σprior distribution/30 = 32021.87439/30 = 1067.395813.
• Underlying distribution to be normal.
and generate the synthesized data. Figure 9 illustrates the synthesized data, which can be said to be a good approximation of the actual data given the aggregated prior distribution.

Error Metrics and Statistical Measures
The calculated error measures are: • MAE = 257.41806, which implies that the average error between the actual and synthesized data is 257.41806, which is reasonable considering the mean of the data is 1717.424749.
• RMSE = 346.6241, which implies that the standard deviation of the residuals/errors is 346.6241. This is well illustrated in Figure 28.
It is to be noted that the error terms of this scenario must not be compared with those of the previous case, as they are of different scales. Compared to the scale of the data, the error metrics show satisfactory results. The following table validates that the synthesized data honours the aggregated sums of the prior distribution. The total number of cases is maintained in every scenario. As discussed earlier, the initial distribution holds the monthly sums consistently; these get slightly disrupted in the overthrow correction phase and are later corrected in the volume correction phase.

We shall now explore the basic statistical properties of the synthetic data with respect to the actual data. It is to be noted that the mean of the synthesized data equals that of the original data, although it was never supplied to the MKD algorithm. As previously discussed, σ0 is a good approximation of the original σ. All the remaining measures are reasonably close, but the maximum varies considerably. The maximum is hard to anticipate from the aggregated data; hence it is an avenue that demands further exploration.

Component Decomposition and Comparison
We now perform component decomposition of both the actual and synthetic data based on the model in (4.1). Component decomposition is in no way a benchmark for accuracy; however, the MKD algorithm aims to improve the outcome of forecasting techniques, which are highly influenced by the components within a time series. Thus, comparing these components can answer whether the original time series's component-based characteristics are present in the synthesized data.
For the trend component (Figures 23 and 24 in Appendix A), the actual and synthesized data show similar results, and the trend of the actual data is well approximated by the trend of the synthesized data.
For the seasonality component (Figures 25 and 26 in Appendix A), both the actual and synthesized data show major weekly seasonality. The seasonality of the synthesized data approximates that of the actual data well.
For the residual component (Figures 27 and 28 in Appendix A), the actual and synthesized data show similar results; although the residual of the synthetic data may look a bit noisy at first glance, closer inspection shows that it deviates less from the reference value than that of the actual data. The residual of the actual data is well approximated by that of the synthesized data.
The key takeaway from the discussion above is that the algorithm can generate an excellent approximation of the COVID-19 data from the monthly aggregated data based on a few statistical properties of the prior distribution. We test the MKD algorithm's efficacy in a forecasting scenario in the following section.

Improvements in Forecasting Accuracy
In this section, we forecast Dengue infection cases in Bangladesh using statistical forecasting tools. Statistical modelling is one of the helpful approaches that may be utilized for forecasting dengue outbreaks [32,33]. Previous research carried out in China [34], India [35], Thailand [36], the West Indies [37], Colombia [38], and Australia [39] made substantial use of time series techniques in epidemiologic research on infectious diseases [39]. A number of earlier studies examined the Autoregressive Integrated Moving Average (ARIMA) model as a potential forecasting tool [40,41,42,43,44], and ARIMA models have seen widespread use for dengue forecasting [45,46,43,47]. When establishing statistical forecasting models, these are frequently paired with Seasonal Autoregressive Integrated Moving Average (SARIMA) models, which have proven suitable for assessing time series with ordinary or seasonal patterns [35,37,39,48,49]. A dengue incidence forecasting model built on knowledge from previous outbreaks and environmental variables could therefore be an extremely helpful tool for anticipating the severity and frequency of potential epidemics.
ARIMA is a well-known statistical model predominantly used to forecast and analyze time series data [50]. An autoregression (AR) of order p can be defined as

y_t = c + φ_1 y_{t−1} + φ_2 y_{t−2} + ⋯ + φ_p y_{t−p} + e_t, (5.1)

where the e_t are white noise with mean 0 and variance σ²_e. The Moving Average (MA) model of order q is defined as

y_t = c + e_t + θ_1 e_{t−1} + θ_2 e_{t−2} + ⋯ + θ_q e_{t−q}. (5.2)

The ARMA model is formed by the union of (5.1) and (5.2); hence an ARMA model of order (p, q) is defined, where p and q are the corresponding orders of the AR and MA parts. A development of the ARMA model for non-stationary time series is the Box-Jenkins model, also known as the ARIMA model, which integrates AR and MA with the successive difference/lag operator ∇^d. Hence an ARIMA model of order (p, d, q) is defined, where p and q have the previously mentioned definitions and d is the order of nonseasonal successive differencing required to make the time series stationary, i.e.

∇y_t = y_t − y_{t−1},
∇²y_t = ∇y_t − ∇y_{t−1},

and so on.
The idea of incorporating seasonality through Fourier terms, giving the Fourier ARIMA model, was introduced by [51]:

y_t = δ_0 + Σ_{k=1}^{K} [ α_k sin(2π ω_k t) + β_k cos(2π ω_k t) ] + η_t, (5.5)

where δ_0 is the constant term, ω_k is the periodicity of the data, and η_t is modelled as an ARIMA process.
We aim to forecast the monthly data and the synthesized daily data using the aforementioned techniques and to compare forecast accuracy based on error measures. We use a SARIMA model for the monthly data and a Fourier-ARIMA model for the synthesized data. In each case, the model is chosen based on the lowest values of Akaike's Information Criterion (AIC), the corrected Akaike's Information Criterion (AICc), and the Bayesian Information Criterion (BIC).

Model Selection Method
The Box-Jenkins method is a generalized model-selection pathway that works for time series irrespective of their stationarity or seasonality. The method is illustrated in Figure 10.

Error Measures
The error measure used for comparison is the Mean Absolute Scaled Error (MASE), defined as

MASE = mean(|e_t|) / ( (1/(T−1)) Σ_{t=2}^{T} |y_t − y_{t−1}| ),

i.e., the forecast MAE scaled by the in-sample MAE of the one-step naive forecast. We use this metric because it is scale-independent, which makes it well suited for comparison. We could also have taken MAPE as a metric, but MAPE is undefined in such cases because the data contain zero values. We also use RMSE and MAE to gauge the error in the forecasts.
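A compact sketch of the metric (the lag-m generalization and the function signature are our additions; m = 1 is the non-seasonal convention used above):

```python
import numpy as np

def mase(actual, forecast, train, m=1):
    """Mean Absolute Scaled Error.

    Scales the forecast MAE by the in-sample MAE of the naive lag-m forecast
    on the training series, making the measure scale-free.
    """
    actual, forecast, train = (np.asarray(a, float) for a in (actual, forecast, train))
    scale = np.mean(np.abs(train[m:] - train[:-m]))   # naive-forecast MAE
    return np.mean(np.abs(actual - forecast)) / scale

score = mase(actual=[6, 7], forecast=[5, 7], train=[1, 2, 3, 4, 5])
```

A MASE below 1 means the forecast beats the naive in-sample benchmark on average, which is what makes it comparable across the monthly and daily scales.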

Forecast on the Aggregated Data
The actual data are the monthly Dengue infection counts of Bangladesh from 2010 to July 2022. Following the Box-Jenkins method, we first check the stationarity of the data with the Augmented Dickey-Fuller (ADF) test. The ADF test returns a statistic of -4.7906 with p-value = 0.01, which implies that the data are stationary.
We run multiple SARIMA models, calculate their AIC, AICc, and BIC, and choose the best model based on the minimum value of the criteria. We present the top 5 results in Table 6. Here, the best model is SARIMA (1, 0, 0)(0, 1, 1)12. Fitting this model gives the following coefficients:

             ar1      ar2      sma1
Coefficient  0.6000   -0.0919  -0.8324
S.E.         0.0879   0.0877   0.0948

Table 7: Coefficients of the SARIMA (1, 0, 0)(0, 1, 1)12 model used to fit and forecast the actual monthly Dengue infection data of Bangladesh from 2010 to July 2022. Here, ar implies autoregressive, ma implies moving average, sma implies seasonal moving average, and the trailing number enumerates the coefficient ordering. S.E. implies the standard error of the mean.
To check the goodness of fit of the model, we use the Ljung-Box test, which returns p-value = 0.9998 > 0.05, i.e., we accept the null hypothesis: "The model does not show lack of fit / the residuals are not autocorrelated / the residuals are random white noise." With everything in place, we forecast the infections for the rest of 2022, i.e., from August to December. The forecast is illustrated in the given figure.
To calculate the accuracy of the given forecast, we calculate the aforementioned error measures.
The error measures are acceptable given the magnitude of the data, but there is room for improvement, as shall be demonstrated in the following subsection.

Forecast on the Synthesized Data
The synthesized data are the daily Dengue infection counts of Bangladesh from 2010 to July 2022. Following the Box-Jenkins method, we first check the stationarity of the data with the Augmented Dickey-Fuller (ADF) test. The ADF test returns a statistic of -6.6531 with p-value = 0.01, which implies that the data are stationary.
We run multiple Fourier ARIMA models, calculate their AIC, AICc, and BIC, and choose the best model based on the minimum value of the criteria. We present the top 5 results in Table 9. In each case of the Fourier transformation, we used one pair of trigonometric terms, where each pair comprises a sine and a cosine term as defined in (5.5). Table 10 lists the coefficients of the Fourier ARIMA (7, 0, 7) model used to fit and forecast the synthesized daily Dengue infection data of Bangladesh from 2010 to July 2022. Here, ar implies autoregressive, ma implies moving average, s and c represent the coefficients of the sine and cosine Fourier terms, intercept implies the constant term, and the trailing number enumerates the coefficient ordering. S.E. implies the standard error of the mean.
To check the goodness of fit of the model, we use the Ljung-Box test, which returns p-value = 0.07749 > 0.05, i.e., we accept the null hypothesis: "The model does not show lack of fit / the residuals are not autocorrelated / the residuals are random white noise."
With everything in place, we forecast the infections for the rest of 2022, i.e., from August to December. The forecast is illustrated in the given figure. To calculate the accuracy of the forecast, we calculate the aforementioned error measures.
Data    RMSE       MAE        MASE
Daily   18.71255   6.593062   0.1115845

Table 11: Error measures for the forecast of the ARIMA (7, 0, 7) model on the synthetic daily data.
The error measures are acceptable given the magnitude of the data. Compared to the error measures for the actual data in Table 8, Table 11 shows a clear improvement: comparing the MASE terms of the two tables reveals roughly a fourfold improvement in forecast accuracy using the synthetic data over the actual data.

Figure 1 :
Figure 1: Flow diagram of the SBD algorithm.

Figure 2 :
Figure 2: The initial approximation without overthrow correction exhibits a staircase-like pattern due to the higher gradient change of the prior distribution.
Comparisons of what was predicted versus what was actually observed, of a subsequent time versus a beginning time, and of one measurement technique versus an alternative technique are all examples of Y versus X. The mean absolute error (MAE) is determined by taking the sum of all absolute errors and dividing it by the total number of samples: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |x_i - \hat{x}_i|$. It is therefore the arithmetic mean of the absolute errors $|e_i| = |x_i - \hat{x}_i|$, where $\hat{x}_i$ represents the forecast and $x_i$ the actual value. Note that some formulations use relative frequencies as weight factors.

Figure 4 :
Figure 4: Monthly aggregate of 2022 Dengue data from January to July.

Figure 6 :
Figure 6: Synthesized daily number of infected cases of Dengue from January to July.

Figure 7 :
Figure 7: Monthly aggregate of 2020 COVID-19 infected data of Bangladesh from March to December.

Figure 8 : Figure 9 :
Figure 8: Daily number of infected cases of COVID-19 in 2020 from March to December collected from DGHS.

Figure 11 :Figure 12 :
Figure 11: The figure illustrates the forecast generated by SARIMA (1, 0, 0)(0, 1, 1) 12 from the actual aggregated data. To validate the goodness of fit, we analyze the model residuals, illustrated in Figure 12. The top graph shows the residuals over the timeline of the original data. The bottom-left graph shows the Autocorrelation Function (ACF) with respect to lag; almost all values lie within the significance level. The bottom-right graph shows the distribution of the model's residuals, implying that they are normally distributed with zero mean.

Figure 13 :Figure 14 :
Figure 13: The figure illustrates the forecast generated by ARIMA (7, 0, 7) from the actual aggregated data. To validate the goodness of fit, we analyze the model residuals, illustrated in Figure 14. The top graph shows the residuals over the timeline of the original data. The bottom-left graph shows the Autocorrelation Function (ACF) with respect to lag; almost all values lie within the significance level. The bottom-right graph shows the distribution of the model's residuals, implying that they are normally distributed with zero mean.

Figure 16 :
Figure 16: The prior distribution of the DENV infection of Bangladesh in the year 2017, generated from the monthly aggregate distribution exhibited in Figure 15.

Figure 17 :
Figure 17: Trend of the actual dengue data.

Figure 18 :
Figure 18: Trend of the synthetic dengue data.

Figure 21 :
Figure 21: Residual of the actual dengue data.

Figure 22 :
Figure 22: Residual of the synthetic dengue data.

Figure 27 :
Figure 27: Residual of the actual dengue data.

Table 2 :
The following table validates whether the synthesized data honours the aggregated sum of the prior distribution. It illustrates that the synthetic data agrees with the monthly sums of the actual data.

Table 3 :
This table illustrates the comparison of the basic statistical measures of the synthesized data with respect to the actual data.

Table 4 :
This table illustrates that the synthetic data agrees with the monthly sum of the actual data.

Table 5 :
This table illustrates the comparison of the basic statistical measures of the synthesized data with respect to the actual data.

Table 9 :
Selection of the best model based on the criteria. The periodicity of the Fourier term is taken to be 365.25. Prior to fitting, a Box-Cox transformation with λ = 0.49 was applied.

Table 10 :
Coefficients of the ARIMA (7, 0, 7) model.