Prediction of dengue incidents using hospitalized patients, metrological and socio-economic data in Bangladesh: A machine learning approach

Dengue fever is a severe disease caused by dengue viruses (DENVs), which are spread by Aedes mosquitoes in tropical areas such as Bangladesh. Since its outbreak in the 1960s, dengue fever has been endemic in Bangladesh, with the highest concentration of infections in the capital, Dhaka. This study aims to develop a machine learning model that predicts dengue cases from relevant information about the factors that drive dengue outbreaks within a geographic region. To predict dengue cases in 11 districts of Bangladesh, we created the DengueBD dataset and employed two machine learning algorithms, Multiple Linear Regression (MLR) and Support Vector Regression (SVR). This research also explores the correlation between environmental factors such as temperature, rainfall, and humidity and the rise and decline of dengue cases in different cities of Bangladesh. The dataset was split in an 80:20 ratio, with 80% used for training and 20% for testing. Both the MLR model (67% accuracy, Mean Absolute Error (MAE) of 4.57) and the SVR model (75% accuracy, MAE of 4.95) indicate that, over the ten-month prediction period from August 2021 to May 2022, the number of dengue cases falls during the winter season and rises mainly during the rainy season. Importantly, Dhaka, Bangladesh's capital, is predicted to see the largest number of dengue patients during this period. Overall, the results of this data-driven analysis show that machine learning algorithms have enormous potential for predicting dengue epidemics.

Introduction
At different stages of its life, the dengue virus lives as a parasite in the bodies of different hosts. The virus is initially found in the body of the Aedes mosquito, from which it can be transmitted to humans (Fig 1(B)). When a mosquito bites a human, the virus, which has been living in the mosquito's body for several days, is transmitted through the mosquito's proboscis as it penetrates the skin. Afterward, the virus continues to replicate and multiply in the human body.
The rainy monsoon season greatly benefits the reproduction of the Aedes mosquito [5]. Rainfall, temperature, and humidity are essential factors in its reproduction [6]. In Bangladesh, the weather is favorable for the reproduction of the Aedes mosquito during certain months, and mosquitoes of the genus Aedes can lay eggs at any time of the year [7]. Those eggs hatch when the monsoon arrives or when it rains [8]. As a result, even a small amount of clean water can breed Aedes mosquitoes, which are responsible for spreading the dengue virus.
This study attempts to determine when and where the dengue virus spreads by examining the number of dengue patients admitted to hospitals in past years. For dengue patient data, this research relied on the information officially released by the Directorate General of Health Services (DGHS), the agency under the Ministry of Health and Family Welfare responsible for health services in Bangladesh. Along with the recorded dengue cases, this research used daily weather data from the Bangladesh Meteorological Department (www.bmd.gov.bd) and an updated population census from the Bangladesh Bureau of Statistics (www.bbs.gov.bd). Of the 64 districts of Bangladesh, 11 districts that are significant in terms of population and geographic conditions were chosen for predicting the trend of dengue spread. More importantly, according to the records, the largest numbers of dengue cases have been reported from these 11 cities (Dhaka, Faridpur, Mymensingh, Chittagong, Cox's Bazar, Khulna, Jessore, Rajshahi, Barishal, Bhola, and Sylhet). Two regression-based machine learning models, Multiple Linear Regression (MLR) and Support Vector Regression (SVR), were explored and applied to the developed DengueBD dataset. The contributions of this article are the design and development of the DengueBD dataset and the prediction of dengue cases in 11 cities of Bangladesh for the next 10 months using these two models. The performance of the two models is evaluated using the Mean Absolute Error (MAE) on the DengueBD dataset, and SVR achieved higher accuracy in predicting dengue cases across the country.
Dengue fever is a severe disease that affects individuals across the world. As a result, several countries have been striving to solve the virus's mystery. Researchers worldwide have been attempting to determine the virus's origins and the optimal environment in which it can thrive.
Karim et al. [6] proposed a model to forecast dengue outbreaks using multiple linear regression. The authors used dengue case data from the Directorate General of Health Services (DGHS) and climatic data from the Meteorological Department of Dhaka, Bangladesh, covering 2000 to 2008. They found that changes in the climate can significantly affect the spread of the virus in Dhaka city. Their model accurately forecasted a dengue outbreak (≥200 cases), with an area under the ROC curve of 0.89, 95%.
Mutsuddy et al. [7] designed a study based on the available dengue cases that focused solely on the epidemiological elements of dengue-related morbidity and mortality in terms of climate conditions and seasonal change in Bangladesh. Their research revealed a shift in dengue incidence, with more cases reported during the pre-monsoon season due to climate change and other urban factors. However, the authors did not use any prediction approach or machine learning techniques in their investigation.
In other research, Bhatt et al. [9] combined published literature and internet sources to create a database of 8,309 geo-located occurrence reports from 1960 to 2012. The authors built a model using the boosted regression tree (BRT) framework and identified a connection between the probability of dengue occurrence and inapparent infections using longitudinal data from 54 dengue cohort studies.
Lambrechts et al. [10] investigated the effects of daily temperature changes on DENV (dengue virus) transmission potential using thermodynamic models. They observed that at a mean temperature of 26˚C, the percentage of infection is projected to be 99%, 88%, and 76% for diurnal temperature ranges (DTRs) of 0˚C, 10˚C, and 20˚C, respectively. The authors also stated that the infection risk appears to peak once mean temperatures approach 26˚C.
Liu et al. [11] analyzed the spread of the dengue virus from a different perspective. In their experiments, Aedes albopictus mosquitoes were orally infected with dengue virus 2 (DENV-2) and reared under constant temperatures (18, 23, 28, and 32˚C) and a variable temperature (28-23-18˚C). Their findings revealed that, compared to the quantity of DENV-2 in salivary glands at 28˚C, the amount of DENV-2 in salivary glands at 28-23-18˚C was dramatically reduced.
Igarashi et al. [12] of Nagasaki University in Japan conducted a study to observe the impact of the dengue virus on people and its control. Their work includes a map of the geographical spread of the dengue virus. According to the study's conclusions, using a mosquito net in the house can considerably reduce the transmission of this virus.
In an experiment conducted by Benelli et al. [13] of the University of Pisa in Italy, a novel method of mosquito control was presented, but no forecasting approach was designed. The authors observed that humans could prevent the spread of the virus by controlling the mosquito population and reducing mosquito reproduction.
Althouse et al. [14] compared step-down linear regression, generalized boosted regression, and negative binomial regression models for predicting the incidence of dengue. To predict a binary outcome defined by whether dengue incidence exceeded a set threshold, logistic regression and Support Vector Machine (SVM) models were used. The authors obtained dengue fever incidence data from Singapore (weekly incidence, 2004-2011) and Bangkok (monthly incidence, 2004-2011). They reported that SVM models outperformed logistic regression in predicting high-incidence periods in both Singapore and Bangkok. With the 75th percentile cutoff, the Area Under the Curve (AUC) for SVM models is 0.906 in Singapore and 0.960 in Bangkok.
Uno et al. [15] conducted a study in the United States of America to observe the effects of the dengue virus on the human body. The research depicts the life cycle of the dengue virus in the human body, and the authors found that dengue vaccination can be used to reduce the number of dengue patients. The dengue vaccine, on the other hand, has been heavily criticized for its low effectiveness in preventing the disease, at just 60%.
According to a recent editorial letter [16], dengue fever transmission season could potentially extend all year, with epidemics occurring at any time in Bangladesh. The authors calculated the change in vectorial capacity (VC) of Aedes aegypti mosquitoes at a seasonal level for all regions in Bangladesh to see how climate change affects dengue illnesses. The study's main contribution is estimating and identifying changes in VC; however, other environmental factors such as rainfall and humidity, which could be critical drivers of dengue risk, were not considered.
Another study [17] was conducted in Bangladesh to predict dengue outbreaks between 2000 and 2008, emphasizing the impact of seasonal climate data. They created a generalized linear model based on monthly minimum temperatures, rainfall, and sunshine before the dengue season to forecast the number of yearly dengue cases. Variable selection and leave-one-out cross-validation were used to develop the best prediction model and evaluate its performance.
However, no machine learning technique was used in that study to anticipate the future trend of dengue incidence in Bangladesh.
Dourjoy et al. [18] designed a machine learning-based comparison study to predict dengue incidence in Bangladesh using Support Vector Machine (SVM) and Random Forest (RF). The authors used survey data from 600 patients, collected online, for their trial. According to the findings, SVM and RF achieved accuracies of 68% and 64%, respectively. However, their study concentrated solely on patterns in the patient data, disregarding other factors such as meteorological data, socioeconomic influence, geographic location, and the correlation of weather with monthly cases. Furthermore, their results have not been validated.
Martheswaran et al. [19] analyzed case report data from 2012 to 2020 to estimate dengue incidence in Singapore and Honduras using the random-sampling-based susceptible-infected-removed (SIR) model. In addition, the suggested model was fitted using the Bayesian Markov Chain Monte Carlo (MCMC) technique. The study's findings suggested that their method may be used in other outbreaks, such as the ongoing COVID-19 pandemic, to better comprehend the outbreak's future dynamics.

Materials and methods
A dataset named DengueBD [20] was developed for this research from the daily press releases of the Directorate General of Health Services (DGHS) [14] of Bangladesh. The DGHS provided daily dengue case data, while the Bangladesh Meteorological Department provided daily rainfall and temperature data from 2012 to 2019. The population census (2011) was collected from the Bangladesh Bureau of Statistics. The dataset includes the number of patients admitted to hospitals each month in every major city in Bangladesh. Data were obtained from 11 of Bangladesh's most important cities in terms of population and geographic location: Dhaka, Faridpur, Mymensingh, Chittagong, Cox's Bazar, Khulna, Jessore, Rajshahi, Barishal, Bhola, and Sylhet. Apart from the meteorological department's data, this study also explored data from [21]. The collected data played a vital role in analyzing whether there is any correlation between weather patterns and the number of dengue patients admitted to hospitals. Fig 2 visualizes the recorded rainfall, maximum temperature, maximum humidity, and number of hospitalized patients for September 2019. Table 1 presents the structure of the DengueBD dataset, listing its attributes and their descriptions.
In this section, different research methods are highlighted, along with their working procedure. Fig 3 illustrates the workflow diagram of how this study is conducted from the initial data collection stage to prediction and visualization. Along with the data preprocessing technique, the article discusses the machine learning models that have been utilized to predict dengue cases for the upcoming years.

Data preprocessing
The DengueBD dataset was developed specifically to predict upcoming dengue cases in Bangladesh. Researchers from different domains have emphasized the importance of data preprocessing [22]. Three core Python libraries were used for preprocessing: NumPy, Pandas, and Matplotlib. Missing values were identified and replaced with the mean of the corresponding feature. For feature scaling, this research used standardization, which brings the independent variables of a dataset onto a common scale. The standardization equation is shown in Eq (1), where mean(x) returns the mean of feature x and std(x) returns the standard deviation of x.
$$x_{\text{stand}} = \frac{x - \operatorname{mean}(x)}{\operatorname{std}(x)} \qquad (1)$$
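The mean-based handling of missing values and the standardization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the feature rows are made-up values rather than DengueBD records, and the helper names `impute_with_mean` and `standardize` are hypothetical.

```python
import numpy as np

def impute_with_mean(X):
    """Replace each NaN with the mean of its column (means ignore NaNs)."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def standardize(X):
    """Apply (x - mean(x)) / std(x) to every column, as in Eq (1)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical rows: rainfall (mm), max temperature (deg C), max humidity (%)
raw = np.array([
    [310.0, 33.1, 88.0],
    [np.nan, 32.4, 90.0],
    [120.0, 30.2, np.nan],
    [15.0, 26.8, 74.0],
])

filled = impute_with_mean(raw)   # no NaNs remain
scaled = standardize(filled)     # each column: mean 0, std 1
```

After this step every feature has zero mean and unit standard deviation, so rainfall in millimetres does not dominate temperature or humidity purely because of its larger numeric range.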
Finally, for splitting the dataset into 80% training and 20% test data, we have imported the

Proposed prediction model
This study used two regression algorithms: Support Vector Regression (SVR) and Multiple Linear Regression (MLR).
Support Vector Regression (SVR). SVR [23] is one of the machine learning algorithms applied in this paper for predicting dengue incidents. In this regression technique, two lines act as decision boundaries, and the line midway between them is referred to as the hyperplane. When applying SVR, the points that lie within the decision boundaries are considered, and the hyperplane is the best-fit line that contains the maximum number of points. Suppose the decision lines are located at a distance α from the hyperplane; these lines then lie at distances +α and -α from the hyperplane.
Assume that the equation of the hyperplane is
$$Y = wx + b$$
The equations of the decision boundaries then become
$$wx + b = +\alpha, \qquad wx + b = -\alpha$$
As a result, any hyperplane that fits SVR should satisfy
$$-\alpha < Y - (wx + b) < +\alpha$$
The primary objective of SVR is to determine decision boundaries at a distance α from the hyperplane such that the data points remain as close to the hyperplane as possible, that is, within the boundary lines.
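To make the role of the distance α concrete, the sketch below checks which points fall inside the band around a given line and computes the ε-insensitive loss commonly used to train SVR, with α playing the role of ε. The slope `w`, intercept `b`, the data points, and the function names are illustrative assumptions, not values from this study.

```python
import numpy as np

def inside_tube(x, y, w, b, alpha):
    """True where the residual y - (w*x + b) lies strictly within +/- alpha."""
    residual = y - (w * x + b)
    return np.abs(residual) < alpha

def tube_loss(x, y, w, b, alpha):
    """Epsilon-insensitive loss: zero inside the tube, linear outside it."""
    residual = y - (w * x + b)
    return np.maximum(0.0, np.abs(residual) - alpha)

# Illustrative one-dimensional data and a candidate hyperplane y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.9, 5.0, 8.5])
w, b, alpha = 2.0, 1.0, 0.5

mask = inside_tube(x, y, w, b, alpha)   # last point lies outside the tube
loss = tube_loss(x, y, w, b, alpha)     # only the last point is penalized
```

Points inside the band contribute nothing to the loss, which is why SVR tolerates small deviations and is driven mainly by the points on or outside the boundaries.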
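Multiple Linear Regression (MLR). Derived from linear regression, MLR is applied here to time series data, where it forecasts the values of one or more response variables using a set of independent variables [24]. MLR is one of the models applied in this research to predict dengue cases. Because the prediction problem involves multiple independent variables that correlate considerably with the dependent variable, a multiple linear regression algorithm was used. In multiple linear regression, the model is assumed to be
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m + \varepsilon$$
where Y is the target variable, $\beta_0, \beta_1, \beta_2, \ldots, \beta_m$ are the coefficients of the model, $X_1, X_2, \ldots, X_m$ are the feature (independent) variables, and $\varepsilon$ is the random error. In multiple linear regression models, the data are assumed to be normally distributed. Thus, for n observations, it holds that
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_m x_{im} + \varepsilon_i, \qquad i = 1, \ldots, n \qquad (6)$$
Eq (6) can be written as
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ 1 & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
or
$$y = X\beta + \varepsilon \qquad (7)$$
where, in Eq (6) and Eq (7), $y = (y_1, y_2, \ldots, y_n)^T$ is an n×1 vector of n observations; X is the augmentation, with a leading column of ones, of $\tilde{X} = (x_{ij})$, an n×m matrix of n observations on each of the m explanatory variables; $\beta = (\beta_0, \beta_1, \ldots, \beta_m)^T$ is a k×1 vector of regression coefficients, with k = m + 1; and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$ is an n×1 vector of random errors. In this research, the target variable is the number of dengue-suspected hospitalized patients, and the feature variables are rainfall, temperature, and humidity.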
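The matrix form of the MLR model can be illustrated with a short sketch. It assumes ordinary least squares as the estimator (the text does not state how the coefficients were fitted) and uses synthetic, noise-free data with made-up coefficients; `fit_mlr` and `predict_mlr` are hypothetical helper names.

```python
import numpy as np

def augment(X_tilde):
    """Prepend a column of ones so the intercept is part of beta."""
    X_tilde = np.asarray(X_tilde, dtype=float)
    return np.column_stack([np.ones(len(X_tilde)), X_tilde])

def fit_mlr(X_tilde, y):
    """Least-squares estimate of beta in y = X beta + eps."""
    beta, *_ = np.linalg.lstsq(augment(X_tilde), y, rcond=None)
    return beta

def predict_mlr(X_tilde, beta):
    """Predicted response X beta for new feature rows."""
    return augment(X_tilde) @ beta

# Synthetic features: rainfall, temperature, humidity (illustrative ranges)
rng = np.random.default_rng(0)
features = rng.uniform([0.0, 20.0, 60.0], [400.0, 36.0, 95.0], size=(50, 3))

# Made-up coefficients (intercept first); no noise, so the fit is exact
true_beta = np.array([5.0, 0.1, 1.5, -0.2])
cases = true_beta[0] + features @ true_beta[1:]

beta_hat = fit_mlr(features, cases)
predicted = predict_mlr(features[:5], beta_hat)
```

Because the synthetic response has no noise, `beta_hat` matches `true_beta` up to floating-point error; with real data the estimate would only approximate the underlying relationship.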

Results and discussion
This research used various attributes to develop the DengueBD dataset, which is used to predict the number of dengue patients in 11 cities of Bangladesh. The experiments were conducted in Jupyter Notebook on a local machine whose specifications are given in Table 2. The experiments produced two sets of results, one with MLR and another with SVR. With both regression models, this research predicted dengue cases for the 10 months from August 2021 to May 2022. MLR achieved a prediction accuracy of 67% (Mean Absolute Error (MAE): 4.57), whereas SVR achieved a higher prediction accuracy of 75% (MAE: 4.95). In MLR, the weather variables are the independent variables, and the number of dengue patients is the dependent variable used to predict future dengue patients. Fig 4 shows the predicted outcome for August 2021 to December 2021 for both the SVR and MLR models.
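The evaluation protocol, an 80:20 split followed by the MAE on the held-out part, can be sketched as below. Whether the original split was random or chronological is not stated, so the random shuffle here is an assumption, and the arrays are toy values rather than DengueBD records.

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Shuffle the rows and return an 80% training / 20% test partition."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = (len(X) * 4) // 5
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy data: 10 rows with 2 features each
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_train, X_test, y_train, y_test = split_80_20(X, y)

# Toy predictions: absolute errors are 2, 2 and 3, so the MAE is 7/3
mae = mean_absolute_error([10, 20, 30], [12, 18, 33])
```

MAE is expressed in the same unit as the target, so a value of 4.57 means the predictions deviate from the observed counts by about 4.57 on average over the test set.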
To date, Dhaka, the capital of Bangladesh, has been considered the epicenter of dengue, and in recent times it has experienced a significant rise in dengue cases. SVR, a powerful machine learning approach, has performed well in time series prediction [23]. SVR was examined for tracking dengue dynamics and was compared to another regression model. An optimum cost parameter C was established to reduce overfitting and improve prediction accuracy. To choose the best SVR model, we used a cross-validation method with the Mean Absolute Error (MAE) as the performance measure. Although SVR showed good generalization performance compared to the other model in this study, it can be very slow in large-scale applications due to its extensive memory requirements [25]. Another key practical concern in SVR is kernel selection: the most appropriate kernel function for the dengue data should be considered when constructing the SVR model.
Fig 4(A) shows that the capital city was highly affected by dengue in August 2021, with around 7000 cases registered in Dhaka city for the entire month. The SVR model predicted that by the end of December 2021, dengue cases in the different districts would gradually decrease. Fig 4(B) shows that, according to the MLR model, the number of dengue patients in Dhaka city also decreases over time. Consistent with this, the amount of rainfall, the maximum temperature, and the humidity gradually decrease every month. Elsewhere in the country, Jessore, a city in the southwestern region of Bangladesh, shows an upward trend and a sudden spike in the number of dengue patients. A map visualization of the MLR predictions for the 11 observed districts of Bangladesh is provided in an accompanying figure. Interestingly, the SVR predictions indicate that at the end of May 2022, Dhaka, the capital of Bangladesh, will be the epicenter of dengue cases.
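The model-selection step mentioned above, choosing the cost parameter C by cross-validation with MAE as the score, might look like the following sketch. It assumes scikit-learn is used; the RBF kernel, the candidate values of C, the 5-fold setting, and the synthetic data are all illustrative assumptions, since the text does not report them.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for three weather features and a case count
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = 50 * X[:, 0] + 20 * np.sin(6 * X[:, 1]) + rng.normal(0.0, 1.0, size=120)

# Standardize inside the pipeline so each fold is scaled on its own training part
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {"svr__C": [0.1, 1, 10, 100]}  # hypothetical candidate values

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the sign
    cv=5,
)
search.fit(X, y)

best_C = search.best_params_["svr__C"]
best_cv_mae = -search.best_score_  # flip the sign back to a plain MAE
```

The same grid could be extended with other kernels (for example `"linear"` or `"poly"`) to address the kernel-selection concern raised in the text, at the cost of a longer search.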
Overall, a dengue patient prediction trend for the upcoming 10 months is also presented in an accompanying figure. This study aims to predict the number of dengue patients for the next 10 months based on the DengueBD dataset using two regression-based machine learning algorithms. Multiple Linear Regression (MLR) and Support Vector Regression (SVR) were used to predict dengue cases in 11 cities of Bangladesh. Apart from predicting the number of dengue patients, this research also predicted the rainfall, humidity, and maximum temperature of each region from August 2021 to May 2022. Notable differences were observed between the cases predicted by MLR and those predicted by SVR. Using MLR, around 162272 dengue cases are predicted in the 11 cities of Bangladesh for the 10 months (Aug 2021-May 2022). August 2021 accounts for around 59244 (36%) of these cases, the most of any month. The number of cases then decreases gradually with the changes in environment and season in Bangladesh, and in January 2022 only 791 (0.48%) cases are predicted for the entire country, the lowest among all monthly predictions. However, a sudden spike is predicted for May 2022, when the country may observe around 33158 (20.04%) dengue cases. According to the SVR prediction, around 149485 people across the country will be infected by the dengue virus from Aug 2021 to May 2022. SVR also indicates that the dengue case curve will keep falling until January 2022 (1834, 1.22%) and then rise quickly from February 2022. The sudden increase of dengue patients in Bangladesh around March is closely related to meteorological conditions.
The summer season in Bangladesh usually starts in March, and with both models this study observed and predicted an upward trend in dengue cases across the country from that point. Based on the predicted data, Dhaka, the capital city, will experience the highest number of dengue cases among the 11 cities in the next 10 months: around 40% (65955) of total cases (162272) using MLR and 43% (65747) of total cases (149485) using SVR. Another observation is that the number of dengue patients increases significantly during a specific part of the year. As the temperature and rainfall decreased, the number of dengue patients decreased in October. This correlation indicates a strong relationship between the weather and the spread of dengue fever in the country. Managing a growing number of patients outside urban areas is challenging due to the high population density and limited capacity. In addition, many areas lack well-equipped healthcare facilities. Therefore, this study provides a broader view for the country's policymakers to observe and understand the future dengue situation. The outcome of this research can help the authorities take necessary steps well ahead of time and make people aware of the situation, allowing them to protect themselves from this deadly virus. Moreover, attempts to undertake risk assessments based on early warning indications must include social elements and mass movement events in addition to climate factors.
To clarify the contribution of this study, we have further designed a detailed comparison table. Table 3 summarizes the parameters and outcomes of previously suggested models for predicting dengue outbreaks in various countries using machine learning-based methodologies. The comparative study demonstrates that researchers around the world have employed various data sources based on their study aims and presented projections based on their proposed models. Only studies that were specifically performed with distinct ML-based or epidemic modeling-based learning algorithms were selected for the comparison.
The proposed strategy is primarily based on a regression-based machine learning model that forecasts dengue outbreaks over the following ten months. However, the optimal solution cannot be guaranteed by the proposed model. As a result, the future focus of this research will be on the use of cutting-edge optimization methods on the same DengueBD dataset. Nature-inspired optimization techniques are currently receiving greater attention. Some nature-inspired optimization algorithms that we are targeting for future research are Particle Swarm Optimization (PSO), inspired by a flock of birds or a school of fish [31]; the Aquila Optimizer (AO), inspired by the Aquila's behavior in nature while catching prey [32]; the Ebola Optimization Search Algorithm (EOSA), inspired by the propagation mechanism of the Ebola virus disease [33]; the Dwarf Mongoose Optimization algorithm (DMO) [34]; and the Reptile Search Algorithm (RSA), motivated by the hunting behavior of crocodiles [35].

Conclusion
Dengue epidemic containment is one of the most pressing public health issues in tropical and semi-tropical nations such as Bangladesh. After a dengue epidemic has started, countries often strive to control the mosquitoes that carry the disease. However, more sickness and death might be avoided if countries periodically monitor the circulation of viruses in mosquitoes and implement control measures before massive outbreaks occur. As machine learning techniques are beneficial for classifying and forecasting dengue fever outbreaks, this study used two distinct regression-based models to predict dengue cases in 11 districts of Bangladesh for 10 months. This study contributes a comprehensive dengue dataset named DengueBD and establishes a relation between climatic conditions and the number of dengue cases in different geographic regions of Bangladesh. Furthermore, the findings of this study will help the concerned authorities, the government, healthcare organizations, and other responsible communities understand the epidemic pattern of dengue in the country and act accordingly. Currently, no standard dataset is available for this research; therefore, this study cannot reflect the complete spectrum of dengue cases across the country. In the future, the prediction period can be extended by standardizing the dataset, collaborating with other organizations, and employing more machine learning approaches to obtain more accurate predictions.

Table 3. Comparative analysis of state-of-the-art research work in predicting dengue incidents globally using different data sources.