Abstract
Background
Cardiovascular disease (CVD) encompasses a group of disorders that affect the heart and blood vessels, making it one of the leading causes of death globally, including in Bangladesh. Applying predictive modeling for the early identification and detection of CVD holds significant promise for saving lives by enhancing prediction precision through machine learning algorithms. Therefore, this study aimed to predict high-risk individuals for CVD using machine learning algorithms and identify its influencing predictors by association mining rules among individuals in Bangladesh.
Materials and methods
This study utilized the most recent Bangladesh Demographic and Health Survey (BDHS) 2022 data, which encompassed 2,221 respondents. A Boruta-based feature selection method is employed to determine the important features associated with the high risk of CVD. Different machine learning algorithms, including logistic regression, Naïve Bayes, artificial neural network, random forest, and extreme gradient boosting (XGB), are adopted to predict the high-risk individuals for CVD in the training dataset. The predictive performance of the models is evaluated using accuracy, precision, recall, F1-score, and area under the curve (AUC) in the testing set. Additionally, the most significant rules are analyzed using the association mining technique to identify the influencing predictors of high risk of CVD.
Results
The Boruta method indicated that age, residence, marital status, wealth, having an air conditioner (AC), and body mass index (BMI) are important predictors of high risk of CVD. The XGB-based predictive model outperformed the other models, with an accuracy of 68.22%, precision of 69.70%, F1-score of 79.54%, and AUC of 0.721. The association rules identified that being aged 65 or older, living in an urban area, having the richest wealth status, having AC, and being widowed are the influencing predictors of high risk of CVD.
Conclusions
This study emphasizes the potential of XGB in predicting high-risk individuals for CVD and enhances the investigation of key factors contributing to CVD risk in this population, thereby facilitating the development of targeted prevention strategies that can effectively mitigate the high CVD risk.
Citation: Islam MM, Kumar S, Salam MA, Roy DC, Karim MR (2025) Cardiovascular risk prediction and influencing predictors identification among Bangladeshi individuals using machine learning algorithms and association rule mining. PLoS One 20(10): e0333913. https://doi.org/10.1371/journal.pone.0333913
Editor: Saifur R. Chowdhury, McMaster University, CANADA
Received: October 16, 2024; Accepted: September 21, 2025; Published: October 7, 2025
Copyright: © 2025 Islam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data set used and analyzed in this study is freely available at the Demographic and Health Surveys (DHS) program website https://dhsprogram.com/data/available-datasets.cfm. Interested researchers can obtain the data freely by registering on the website at https://dhsprogram.com/data/new-user-registration.cfm. The step-by-step instructions on how to register and download the data are provided at: https://dhsprogram.com/data/Access-Instructions.cfm.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Cardiovascular diseases (CVD) represent a major global health concern, being one of the leading causes of morbidity and mortality worldwide [1]. These encompass a variety of conditions affecting the heart and blood vessels, such as heart failure, stroke, and coronary artery disease [2–4]. This category of diseases is particularly prevalent among middle-aged and older individuals, contributing to one-third of all deaths. Unfortunately, low-income and middle-income nations, such as Bangladesh, bear a disproportionately high incidence of deaths attributed to CVD [5]. The rise in deaths from CVD in Bangladesh over the past few decades is very worrying. In 1986, the death rate from CVD was 11 per 10,000 people, but by 2006, it had jumped to 411 per 10,000 people [6]. In 2018, around 0.256 million people in Bangladesh died from CVD [7]. The burden of CVD is not just a public health challenge but also an economic one [8]. It imposes significant financial strain on patients and their families due to the high costs of long-term treatment and care. Additionally, healthcare systems face immense pressure as they struggle to manage the rising number of CVD cases. This includes costs related to hospital admissions, medication, surgeries, and ongoing care, which further stretch already limited healthcare resources, particularly in low- and middle-income countries like Bangladesh. The economic impact of CVD makes it essential to prioritize investment in prevention and early detection to reduce the health and financial burden of these conditions. Early detection and prediction are key to reducing morbidity and mortality, particularly in high-risk groups such as patients with type-2 diabetes (T2D), who are more prone to developing CVD. Predicting high-risk individuals for CVD can transform healthcare by allowing providers to identify the individuals at risk before the disease fully develops. 
This enables timely interventions, such as lifestyle modifications, medication, or monitoring, which can significantly lower the chances of disease progression and lead to better patient outcomes. By focusing on prevention and early treatment, healthcare systems can not only improve the quality of life for patients but also reduce the long-term costs associated with treating advanced CVD. However, to effectively address the rising burden of CVD, advanced techniques must concentrate on identifying the key contributing factors. Furthermore, developing a predictive model that can predict the disease’s status early is crucial. Utilizing machine learning (ML) algorithms presents a highly promising approach for enhancing the precision and accuracy of CVD prediction. These algorithms are particularly well-suited for analyzing large and complex datasets, as they can effectively model non-linear relationships among variables. Moreover, ML techniques can detect subtle patterns and interactions that conventional statistical methods often overlook.
In recent years, numerous studies have demonstrated the application of various ML techniques for diagnosing and predicting high-risk individuals for CVD [9–18]. While these approaches have shown considerable potential, their effectiveness has not been consistent across different populations and datasets. Importantly, despite the global rise in the adoption of ML in healthcare, there remains a notable research gap in Bangladesh. To the best of our knowledge, no existing study has comprehensively addressed the early detection of CVD among individuals using ML algorithms applied to the most recent nationally representative dataset. Additionally, previous research has largely overlooked the use of association rule mining—a powerful data mining technique capable of uncovering hidden relationships among variables—to explore the underlying interactions between key risk factors of CVD. This analytical gap limits our understanding of the combinatorial effects of predictors that may jointly influence CVD risk. Therefore, the present study aims to address these gaps through a twofold objective: (1) to develop the most suitable predictive model for predicting high-risk individuals for CVD, using machine learning algorithms; and (2) to identify influential predictors of CVD through association mining techniques. By integrating predictive modeling with association rule mining, this study not only enhances individual-level risk prediction but also contributes to identifying actionable predictors. This combined approach can inform more effective risk stratification and targeted public health interventions for reducing the burden of CVD in Bangladesh.
The remainder of this paper is organized as follows: Section 2 introduces the materials and methods. The data analysis results are presented in Section 3, and a detailed discussion is provided in Section 4. Finally, the conclusions are given in Section 5.
2. Materials and methods
2.1. Data source
The dataset for this study was derived from the 2022 Bangladesh Demographic and Health Survey (BDHS). The survey employed a two-stage stratified cluster sampling design for data collection [19]. In the first stage, enumeration areas (EAs) were selected using probability proportional to size sampling, consisting of 237 urban and 438 rural areas. In the second stage, a fixed number of households (typically around 30 per EA) were systematically selected from the designated EAs. Data were collected through face-to-face interviews using a structured questionnaire on various topics, including demographic characteristics, health indicators, nutrition, family planning, maternal and child health, and non-communicable diseases. The BDHS 2022 collected data from 30,330 households, with interviews completed in 30,018 of them, encompassing 132,463 individuals. Among these individuals, 118,167 were excluded due to missing data on systolic and diastolic blood pressure (SBP and DBP), resulting in a sample of 14,296 respondents. Subsequently, 333 individuals were excluded due to missing, not present, or other issues related to plasma blood glucose, resulting in 13,963 respondents with complete blood glucose data. Of these, 11,643 individuals without hypertension were excluded, yielding 2,315 respondents identified as having hypertension (stage 1, stage 2, or stage 3). Finally, 94 respondents were excluded due to missing, not present, or other issues related to any predictor variable, resulting in a study sample of 2,221 respondents. The sample selection procedure is presented in Fig 1.
2.2. Variable description
2.2.1. Outcome variable.
The outcome variable of interest in this study is the high risk of CVD, which was constructed by combining information on hypertension and diabetes, using the WHO/ISH risk prediction guidelines [20,21]. First, hypertension was defined as a systolic blood pressure (SBP) ≥140 mmHg and/or a diastolic blood pressure (DBP) ≥90 mmHg [22,23]. Hypertension was further classified into three stages: Stage 1 (SBP 140–159 mmHg or DBP 90–99 mmHg), Stage 2 (SBP 160–179 mmHg or DBP 100–109 mmHg), and Stage 3 (SBP ≥180 mmHg or DBP ≥110 mmHg) [24,25]. Individuals with fasting plasma glucose levels of 7.0 mmol/L or higher were considered diabetic, and those with levels below this threshold were considered non-diabetic. We then categorized hypertensive patients into three risk groups (low, medium, and high) based on the presence or absence of diabetes (Das, et al., 2024), as follows.
- Type-2 diabetes absent:
  - Stage 1 hypertension (SBP 140–159 or DBP 90–99): Low risk
  - Stage 2 hypertension (SBP 160–179 or DBP 100–109): Medium risk
  - Stage 3 hypertension (SBP ≥180 or DBP ≥110): High risk
- Type-2 diabetes present:
  - Stage 1 hypertension (SBP 140–159 or DBP 90–99): Medium risk
  - Stage 2 hypertension (SBP 160–179 or DBP 100–109): High risk
  - Stage 3 hypertension (SBP ≥180 or DBP ≥110): High risk
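This high-risk group includes those with diabetes and/or those falling into the higher blood pressure categories (Stage 2 or Stage 3). Individuals who meet either or both of these conditions are classified as high-risk for CVD, and the outcome is coded as a binary variable as follows:
$$Y = \begin{cases} 1, & \text{high risk of CVD} \\ 0, & \text{otherwise (low or medium risk)} \end{cases}$$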
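The staging thresholds and the risk grid listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names `hypertension_stage` and `cvd_risk_group` are hypothetical, readings of exactly 180/110 mmHg are assumed to fall in Stage 3, and when SBP and DBP point to different stages the higher stage is assumed to apply.

```python
def hypertension_stage(sbp, dbp):
    """Return 0 (no hypertension) or stage 1-3 from blood pressure in mmHg."""
    # Assumption: the higher of the SBP-based and DBP-based stages applies.
    if sbp >= 180 or dbp >= 110:
        return 3
    if sbp >= 160 or dbp >= 100:
        return 2
    if sbp >= 140 or dbp >= 90:
        return 1
    return 0


def cvd_risk_group(sbp, dbp, fasting_glucose_mmol_l):
    """Map a hypertensive respondent to 'low', 'medium', or 'high' risk."""
    stage = hypertension_stage(sbp, dbp)
    if stage == 0:
        return None  # non-hypertensive respondents are outside the study sample
    diabetic = fasting_glucose_mmol_l >= 7.0
    # Risk grid copied from the list in the text: (diabetic, stage) -> group
    grid = {
        (False, 1): "low", (False, 2): "medium", (False, 3): "high",
        (True, 1): "medium", (True, 2): "high", (True, 3): "high",
    }
    return grid[(diabetic, stage)]


def high_risk_indicator(sbp, dbp, fasting_glucose_mmol_l):
    """Binary outcome: 1 for the high-risk group, 0 for low or medium risk."""
    return 1 if cvd_risk_group(sbp, dbp, fasting_glucose_mmol_l) == "high" else 0
```

For example, a respondent with a blood pressure of 165/95 mmHg and a fasting glucose of 7.2 mmol/L falls in Stage 2 with diabetes, so the grid places them in the high-risk group.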
2.2.2. Explanatory variables.
This study included various explanatory variables as predictors for a high risk of CVD, based on earlier studies and data availability in the BDHS 2022 database [26–32]. The variables are age, sex, division, residence, marital status, education, wealth, watching television, having a mobile, having a computer, having AC, coffee, smoking, and body mass index (BMI). BMI is categorized into four groups: underweight (BMI < 18.5 kg/m²), normal weight (18.5–24.9 kg/m²), overweight (25.0–29.9 kg/m²), and obese (BMI ≥ 30 kg/m²) [33].
Ethical approval: As the data is available to the public on their website, it is not required to obtain ethical review and approval for this research involving human participants in compliance with local laws and institutional regulations.
2.3. Statistical analysis
The characteristics of the study participants are reported as frequencies and percentages (%). For the bivariate analysis, Pearson's chi-square test was used to assess the association between the outcome and each explanatory variable. The test was two-tailed, and a p-value < 0.05 was considered statistically significant. The dataset was then randomly split into training and testing sets in an 8:2 ratio [34]. The training set consisted of 1,778 respondents, while the testing set comprised 443 respondents. Statistical analysis was carried out using SPSS software (Version 27). The class labels were imbalanced (high-risk for CVD: 35.0% and non-high-risk for CVD: 65.0%). To address this problem, we applied the adaptive synthetic (ADASYN) oversampling technique, an extension of the synthetic minority oversampling technique (SMOTE) that generates more synthetic samples for minority-class observations that are harder to learn, according to a weighted distribution [35].
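To make the weighted-distribution idea behind ADASYN concrete, the following is a simplified Python sketch on toy numeric data. It is not the implementation used in the study: the function name `adasyn`, the parameters `k`, `beta`, and `seed`, and the choice to interpolate only toward the nearest minority-class neighbours are all assumptions made for illustration.

```python
import math
import random


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Return synthetic minority points (lists of floats), ADASYN-style."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    # Total number of synthetic samples needed to balance the classes
    target = int((len(majority) - len(minority)) * beta)

    # r_i: share of majority points among the k nearest neighbours of each
    # minority point; a larger r_i marks a harder-to-learn point
    ratios = []
    for p in minority:
        others = [(q, lab) for q, lab in labelled if q is not p]
        others.sort(key=lambda item: math.dist(p, item[0]))
        ratios.append(sum(1 for _, lab in others[:k] if lab == 0) / k)

    total = sum(ratios)
    if total == 0 or target <= 0:
        return []  # classes already balanced or perfectly separated

    synthetic = []
    for p, r in zip(minority, ratios):
        n_new = round(r / total * target)  # weighted share of new samples
        neighbours = sorted(
            (q for q in minority if q is not p),
            key=lambda q: math.dist(p, q),
        )[:k]
        for _ in range(n_new):
            q = rng.choice(neighbours)
            lam = rng.random()
            # Interpolate between the point and a minority neighbour
            synthetic.append([a + lam * (b - a) for a, b in zip(p, q)])
    return synthetic
```

Because each count is rounded, the number of generated points can differ slightly from the balancing target. The study's predictors are categorical, so a real application would also need an encoding step that this sketch omits.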
2.3.1. Feature selection.
Feature selection—also referred to as variable/attribute selection in statistics and machine learning—plays a critical role in creating an efficient prediction model by selecting the important features. Additionally, it can lead to shorter computation times, better generalization, higher interpretability, and improved model performance. To identify the important predictors of high risk of CVD, we executed the well-established Boruta feature selection method. The Boruta method is a wrapper-based feature selection technique that employs a random forest classifier to assess the importance of each feature [36,37]. In Boruta, the algorithm iteratively removes irrelevant features by comparing the significance of real features against random shadow features. Features that consistently perform worse than the shadow features are deemed irrelevant and are removed.
2.3.2. Machine learning algorithms.
This study applied five popular machine learning (ML) algorithms to predict individuals at high risk for CVD. Below is a brief description of them:
2.3.3. Logistic regression.
Logistic regression (LR) is a machine learning technique that predicts the probability of a binary outcome (i.e., yes/no) based on the input predictors [38]. It estimates the probability that a given input vector is associated with a specific class (i.e., class 1) through the use of the logistic function. The logistic function is as follows:
$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)}}$$
where $P(Y = 1 \mid X)$ is the probability that the outcome $Y$ is 1 given the input features $X = (X_1, X_2, \dots, X_p)$, $\beta_0$ is the intercept, and $\beta_1$, $\beta_2$, …, $\beta_p$ are the coefficients of the input predictors $X_1, X_2, \dots, X_p$, respectively. If $P(Y = 1 \mid X) \geq 0.5$, the predicted class is 1; otherwise, it is 0.
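The prediction rule of logistic regression can be illustrated with a short Python sketch. The coefficients are passed in rather than estimated, the names `predict_proba` and `predict_class` are hypothetical, and the 0.5 cut-off is the conventional default, assumed here for illustration.

```python
import math


def predict_proba(x, intercept, coefs):
    """P(Y = 1 | x) from the logistic function applied to the linear score."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, intercept, coefs, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, intercept, coefs) >= threshold else 0
```

With an intercept of -1.0 and coefficients (0.8, 0.9), an input of (1, 1) gives a linear score of 0.7 and a probability of about 0.67, so the predicted class is 1.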
2.3.4. Naïve Bayes.
Naïve Bayes (NB) is a probabilistic classification algorithm that utilizes Bayes’ theorem, assuming that features are conditionally independent given the class label. Despite this strong assumption of feature independence, NB often performs well in many real-world applications, such as medical diagnostics like diabetes prediction [39]. The formula for NB classification is given as:
$$P(Y = c \mid X_1, \dots, X_p) = \frac{P(Y = c)\, P(X_1, \dots, X_p \mid Y = c)}{P(X_1, \dots, X_p)}$$
where $Y$ is the outcome variable; $X_1, \dots, X_p$ are the input predictors; $P(Y = c \mid X_1, \dots, X_p)$ is the posterior probability of class $c$ given the predictors $X_1, \dots, X_p$; $P(Y = c)$ is the prior probability of class $c$; $P(X_1, \dots, X_p \mid Y = c)$ is the conditional probability of predictors $X_1, \dots, X_p$ given class $c$; $P(X_1, \dots, X_p)$ is the marginal probability of predictors $X_1, \dots, X_p$. To classify a new instance, NB picks the class $c$ that maximizes the posterior probability:
$$\hat{y} = \arg\max_{c} P(Y = c) \prod_{j=1}^{p} P(X_j \mid Y = c)$$
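A minimal Python sketch of this decision rule for categorical predictors follows. The function names `fit_nb` and `predict_nb` are hypothetical, and the Laplace smoothing term `alpha` is an added assumption (not mentioned in the text) that keeps unseen category values from producing a zero probability.

```python
from collections import Counter, defaultdict


def fit_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(j, c)][v]: number of class-c rows whose j-th feature is v
    value_counts = defaultdict(Counter)
    levels = defaultdict(set)  # distinct values observed for each feature
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, c)][v] += 1
            levels[j].add(v)
    return class_counts, value_counts, levels, alpha


def predict_nb(model, row):
    """Return the class with the largest prior-times-likelihood score."""
    class_counts, value_counts, levels, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n  # prior P(Y = c)
        for j, v in enumerate(row):
            # Smoothed estimate of P(X_j = v | Y = c)
            score *= (value_counts[(j, c)][v] + alpha) / (n_c + alpha * len(levels[j]))
        scores[c] = score
    return max(scores, key=scores.get)
```

The marginal probability in the denominator is the same for every class, so it can be dropped when only the maximizing class is needed.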
2.3.5. Artificial neural network.
Artificial neural network (ANN) is a widely used machine learning algorithm capable of performing various tasks, including classification. It consists of interconnected nodes, called neurons, which are arranged into three types of layers: an input layer, one or more hidden layers, and an output layer. During training, the neural network adjusts the weights and biases of each neuron to minimize the difference between predicted and actual outputs. This is achieved using an optimization algorithm, such as gradient descent, which iteratively updates the weights and biases, with each neuron's output passed through the sigmoid activation function [40]. The sigmoid function, also known as the logistic function, is a widely used activation function in ANNs for binary classification. The sigmoid function is as follows:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z$ is the input. The procedure continues until the minimum error is reached or the values no longer change between iterations.
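As an illustration of how the sigmoid activation is used inside a network, the sketch below computes the forward pass of a network with one hidden layer. The weights are supplied by hand and the training step (gradient descent) is omitted. The function names and argument layout are hypothetical and do not reflect the configuration used in the study.

```python
import math


def sigmoid(z):
    """Logistic activation: maps any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: input -> sigmoid hidden layer -> sigmoid output neuron."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```

The output lies between 0 and 1 and can be read as the predicted probability of the positive class.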
2.3.6. Random forest.
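Random forest (RF) is a widely utilized machine learning algorithm for classification and regression tasks. It is an ensemble learning technique combining multiple decision trees by creating multiple subsets of the original training data through bootstrapping [41]. Let $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ be the training dataset, then the bootstrap sample $D_b$ for the $b$-th tree can be represented as
$$D_b = \{(x_1^{*}, y_1^{*}), (x_2^{*}, y_2^{*}), \dots, (x_n^{*}, y_n^{*})\},$$
where each pair $(x_i^{*}, y_i^{*})$ is drawn at random with replacement from $D$. Each decision tree in the forest outputs a predicted class label $\hat{y}_b(x)$. The final prediction $\hat{y}(x)$ for an input $x$ is made by majority voting:
$$\hat{y}(x) = \operatorname{mode}\{\hat{y}_1(x), \hat{y}_2(x), \dots, \hat{y}_B(x)\}$$
where $B$ is the total number of trees in the forest.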
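The two building blocks described here, bootstrap sampling and majority voting, can be sketched in Python as follows. Tree fitting itself is left out, and the helper names `bootstrap_sample` and `majority_vote` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) records from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most frequent class label among the trees' predictions."""
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(predictions).most_common(1)[0][0]
```

Each tree would be trained on its own bootstrap sample, and `majority_vote` would then be applied to the list of labels the trees return for a given input.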
2.3.7. Extreme gradient boosting.
Extreme gradient boosting (XGB) is a popular variant of gradient boosting that is used extensively for prediction due to its efficiency and predictive power. It works by iteratively improving the model’s predictions using decision trees as weak classifiers, each aiming to correct the errors of the preceding trees [42]. XGB typically uses logistic loss (log loss) to measure the difference between predicted probabilities and actual binary labels:
$$L = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
where $y_i$ is the true class label of the $i$th sample and $p_i$ is the predicted probability that the $i$th sample belongs to the positive class. In XGB, the raw prediction score (also called the logit) is computed as the sum of the outputs from all trees. For an input $x$, if there are $K$ trees in the model, the raw prediction score $z$ is given by:
$$z = \sum_{k=1}^{K} f_k(x)$$
where $f_k(x)$ is the output of the $k$-th tree for the input $x$. The logistic function converts the raw prediction score into a probability. The logistic function is
$$P(Y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$
where $P(Y = 1 \mid x)$ is the probability that the outcome $Y$ is 1, given the input predictors $x$. If $P(Y = 1 \mid x) \geq 0.5$, the predicted class is 1; otherwise, it is 0.
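The scoring side of a boosted model (summing tree outputs, converting the sum to a probability, and measuring log loss) can be sketched in Python as below. This does not implement tree construction or the XGBoost library itself. The "trees" in the usage note are hand-written stand-in functions, all names are hypothetical, and the loss is written as a sum over samples, which is an assumption about the intended form.

```python
import math


def log_loss(y_true, p_pred):
    """Summed logistic loss over all samples (lower is better)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )


def raw_score(x, trees):
    """Raw prediction score (logit): the sum of all tree outputs for x."""
    return sum(tree(x) for tree in trees)


def predict(x, trees, threshold=0.5):
    """Return (probability, class) after passing the logit through the logistic function."""
    p = 1.0 / (1.0 + math.exp(-raw_score(x, trees)))
    return p, int(p >= threshold)
```

As a usage illustration, two stand-in stumps such as `lambda x: 0.4 if x["age"] >= 65 else -0.3` and `lambda x: 0.2 if x["urban"] else -0.1` give a raw score of 0.6 for an urban respondent aged 70, which corresponds to a probability of about 0.65 and a predicted class of 1.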
2.3.8. Hyper-parameter tuning.
Hyperparameters are used to control the learning process of machine learning models. Unlike model parameters, which are learned during training, hyperparameters are set before the training process and can significantly influence the model’s performance. It is not easy to know what values to use for the hyper-parameters of a given algorithm on a given dataset, so random or grid searches for hyper-parameter values are commonly used techniques. In this study, hyperparameter tuning was performed using grid search methods, and the model was trained on the training data through 10-fold cross-validation.
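Grid search combined with k-fold cross-validation can be sketched generically in Python. The study performed this step in R, so the code below is only a conceptual illustration: `grid_search_cv`, `k_fold_indices`, and the toy one-parameter "model" (`fit_cutoff`, `accuracy`) are all hypothetical names.

```python
import itertools
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search_cv(X, y, grid, fit, score, k=10):
    """Try every hyperparameter combination; keep the best mean CV score."""
    folds = k_fold_indices(len(X), k)
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(score(model, [X[i] for i in fold], [y[i] for i in fold]))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy stand-in model: predicts class 1 when the single feature reaches `cutoff`.
def fit_cutoff(X, y, cutoff):
    return cutoff


def accuracy(cutoff, X, y):
    return sum(int(x >= cutoff) == label for x, label in zip(X, y)) / len(y)
```

Calling `grid_search_cv(X, y, {"cutoff": [40, 60, 80]}, fit_cutoff, accuracy)` evaluates each candidate cutoff on ten held-out folds and returns the candidate with the highest mean accuracy.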
2.3.9. Model evaluation.
After training the model, model evaluation determines how the model performs by measuring the performance on previously reserved test datasets. The confusion matrix is a simple table that compares actual and predicted categories for the outcome variable (Table 1). True Positives (TP) are the positive cases correctly predicted as positive. True Negatives (TN) are the negative cases accurately predicted as negative. False Negatives (FN) are the positive cases incorrectly predicted as negative, while False Positives (FP) are the negative cases incorrectly predicted as positive.
Accuracy: Accuracy is the ratio of the total number of correctly classified positive and negative instances to the overall number of instances. This can be expressed mathematically as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision: Precision is the ratio of true positive predictions, accurately identified positive instances, to the total number of positive predictions generated by the model. This can be expressed mathematically as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Specificity: Specificity is the proportion of correctly predicted negative instances to the total actual negative instances. This can be expressed mathematically as follows:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Recall: Recall is the ratio of correctly predicted positive instances to the total actual positive instances. It is computed as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-score: F1-score is the harmonic mean of precision and recall. This can be expressed mathematically as follows:
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
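The five metrics can be computed directly from the four cells of the confusion matrix, as in the Python sketch below. The function name `classification_metrics` and the counts in the usage note are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "specificity": tn / (tn + fp),
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```

For a hypothetical confusion matrix with TP = 40, TN = 30, FP = 20, and FN = 10, this gives an accuracy of 0.70, a precision of about 0.667, a specificity of 0.60, a recall of 0.80, and an F1-score of about 0.727.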
2.3.10. ROC curve.
The ROC curve is a graphical representation used to show the diagnostic effectiveness of a binary classifier as its discrimination threshold is varied [43]. It plots sensitivity against 1 − specificity across different thresholds. The AUC, or area under the ROC curve, quantifies the performance of the model. A higher AUC reflects a more effective model, indicating better discrimination between positive and negative cases. We used the caret package (version 7.0-1) in R to implement each classifier for model training, tuning, and performance evaluation.
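The construction of the ROC curve and its area can be sketched in Python as follows. This is a plain illustration of the definition (sweep the threshold, record sensitivity and 1 − specificity, integrate with the trapezoidal rule), not the routine used in the study. The names `roc_points` and `auc` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (1 - specificity, sensitivity) pairs as the threshold is lowered."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Classify as positive every case whose score is at least t
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

An AUC of 0.5 corresponds to chance-level ranking of positives over negatives, and 1.0 to perfect separation.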
2.3.11. Association rules.
Association rule mining is a data mining technique that discovers interesting relationships or patterns in a dataset. It focuses on identifying associations or dependencies between items or events that occur frequently together in a given dataset and was first introduced by Agrawal et al. [44]. Association rules are typically represented as “if-then” statements or implications. The “if” part is called the antecedent or left-hand side (LHS) of the rule, while the “then” part is called the consequent or right-hand side (RHS) of the rule. The strength or quality of an association rule is usually measured in terms of support, confidence, and lift. Support indicates the proportion of transactions in the dataset that contain both the antecedent and the consequent of a rule. Higher support values imply that the rule occurs more frequently. Confidence measures the conditional probability of the consequent given the antecedent. A confidence of 1 indicates that the consequent always occurs when the antecedent is present. Lift compares the observed co-occurrence of the antecedent and the consequent with the co-occurrence expected if the two were independent; a lift greater than 1 indicates a positive association. To complement the machine learning classifiers, this study used the Apriori algorithm [45] to uncover the associations between the identified predictors and the outcome variable. A minimum support of 0.0011 and a minimum confidence level of 90% were adopted to identify all potential association rules. A rule is considered reliable if its confidence level exceeds 80% [46]. This study focused on rules in which the predictors form the antecedent and the outcome variable is the consequent (Antecedent => Consequent), which is a method for classifying all predictors contributing to high risk of CVD. These are commonly known as classification association rules [47]. This method determined which predictor categories contributed to the outcome variable.
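For a specific rule $A \Rightarrow B$, the equations for support, confidence, and lift can be defined as follows:
$$\text{Support}(A \Rightarrow B) = P(A, B)$$
$$\text{Confidence}(A \Rightarrow B) = \frac{P(A, B)}{P(A)}$$
$$\text{Lift}(A \Rightarrow B) = \frac{P(A, B)}{P(A)\, P(B)}$$
Here, $P(A, B)$ is the proportion of records that contain both $A$ and $B$. In this case, the feature sets represented by $A$ and $B$ are mutually exclusive.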
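The three rule-quality measures can be computed from raw records with a few lines of Python. This sketch does not implement the Apriori search itself, only the scoring of one given rule. The function name `rule_metrics` and the toy records in the usage note are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # records containing A
    n_b = sum(1 for t in transactions if b <= t)          # records containing B
    n_ab = sum(1 for t in transactions if (a | b) <= t)   # records containing both
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    return support, confidence, lift
```

As a toy example, suppose 8 records contain 4 with both "urban" and "AC", 3 of which are also "high_risk", and 4 "high_risk" records overall. The rule {urban, AC} => {high_risk} then has a support of 0.375, a confidence of 0.75, and a lift of 1.5.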
We used the arules (Version: 1.7–7) package in R to perform association rule mining. The overall workflow of the study is illustrated in Fig 2.
3. Data analysis results
3.1. Basic characteristics
The basic characteristics of the respondents are presented in Table 2. The overall prevalence of high-risk of CVD is 35.0%. As shown in Table 2, the rural areas had the most significant representation, with 62.8% of respondents living there. The majority of respondents were female, comprising 62.9% of the sample. Of the respondents, 77.4% were married, and around 36.0% had no education. The participants aged ≥ 65 years showed the highest prevalence (41.5%) of high risk of CVD, while the age group ranging from 18 to 34 years showed the lowest prevalence (19.5%). The richest respondents had the highest prevalence (42.5%) of high-risk for CVD, while respondents with middle-class and lower-class socioeconomic status had lower prevalence. Respondents who watched television exhibited a larger proportion (38.0%) of high-risk for CVD. Among the categories related to body weight, overweight respondents exhibited a higher prevalence of high-risk for CVD at 36.8%, while the category of underweight respondents showed a lower prevalence (24.5%). Age, residence, marital status, wealth, watching television, having a mobile, having a computer, having AC, and BMI are associated with a high risk of CVD (p-value < 0.05).
3.2. Important predictors of high-risk for CVD
Fig 3 illustrates the outcomes of the Boruta feature selection method. The Boruta method identified age, residence, marital status, wealth, having AC, and BMI as important predictors of high risk of CVD. These predictors were then used as input features in the ML models for predicting high-risk individuals for CVD.
3.3. Performance comparison of the models
The comparative performance of the ML-based models is presented in Table 3. The results indicated that the XGB model achieved an accuracy of 68.22% (95% confidence interval (CI): 66.03–70.30), precision of 69.70%, specificity of 55.39%, and F1-score of 79.54%. However, the RF-based model showed the highest recall of 96.74%.
The corresponding ROC curves for the machine learning models are presented in Fig 4. The ROC analysis also reveals that the XGB model attained the highest AUC value of 0.721, surpassing all other models. Thus, the findings demonstrated that the XGB-based model outperformed other models in predicting high-risk individuals for CVD.
3.4. Validation of the proposed model
To validate the performance of the proposed model, we utilized the well-established Framingham Heart Study dataset, which contains 4,240 patient records with 15 explanatory variables [48]. The dataset contained missing values; after excluding these, the final dataset comprised 3,658 individuals, including 557 CVD cases (15.2%) and 3,101 non-CVD cases (84.8%). We applied the same analytical protocol as used with the BDHS, 2022 data. The model’s performance on the Framingham dataset is presented in Table 4. Notably, the XGB model achieved the highest performance, with an accuracy of 86.25%, an F1-score of 89.03%, and an AUC of 0.801.
The corresponding ROC curves are depicted in Fig 5. The XGB model showed the largest area under the ROC curve compared to other models, indicating superior discriminatory ability. The XGB-based model therefore performed best on both the BDHS 2022 and Framingham datasets, supporting its potential generalizability and applicability in broader settings.
3.5. Association rules of high-risk for CVD
This study employed association rule mining using the Apriori algorithm to uncover meaningful patterns and relationships among the most relevant predictors, which were initially selected using the Boruta feature selection method. This approach was chosen to identify combinations of factors that frequently co-occur with high-risk individuals for CVD, offering deeper insights beyond individual predictor effects. As a result, 13 association rules with a confidence level of over 90% and the highest lift values were identified. Of all the rules, the top five most important rules were selected for predicting high-risk individuals for CVD, as illustrated in Table 5.
Rule 1 states that participants who were urban residents, belonged to the richest wealth group, and had AC were at high risk of CVD with 96.3% confidence.
Rule 2 indicates that participants who were urban residents, belonged to the richest wealth group, had AC, and were widowed were at high risk of CVD with 95.7% confidence.
Rule 3 states that participants who had AC and were overweight were at high risk of CVD with 95.5% confidence.
Rule 4 states that participants who belonged to the richest wealth group, had AC, and were overweight were at high risk of CVD with 95.5% confidence.
Rule 5 shows that participants who lived in urban areas, belonged to the richest wealth group, were aged ≥ 65 years, and were overweight were at high risk of CVD with 94.4% confidence.
The association rules suggest that being aged 65 or older, urban living, higher socioeconomic status, having AC, and being widowed are important predictors of high risk of CVD (lift value > 1.5). The discovery of strong association rules enhances understanding of underlying relationships between variables, which can inform clinical practices and policy decisions.
4. Discussion
CVD, a condition with high heritability, is a leading cause of death globally. Despite the high global prevalence of CVD, awareness of this condition remains low [49]. This study sought to predict high-risk individuals for CVD and identify co-occurring influencing predictors using machine learning and association rule mining. Initially, the Boruta feature selection technique was employed to determine predictors of high risk of CVD. The analysis revealed that age, place of residence, marital status, wealth, having an AC, and BMI are the important predictors. We subsequently applied several machine learning algorithms to predict high-risk individuals for CVD. Among them, the XGB model exhibited superior performance compared to the other models. Numerous studies have developed models to predict high-risk individuals for CVD across different countries, utilizing diverse datasets and employing various statistical and machine learning algorithms [50–55] (S1 Table). The findings of this study differ from those reported in earlier studies due to variations in sample size, predictor variables, geographical and demographic contexts, and other factors. Finally, association rule mining was employed to identify combinations of predictors that frequently appear together and are associated with the outcome. This helps to better understand the interactions between multiple factors rather than just single predictors. The association rules identified the most significant predictors of high CVD risk as being aged 65 years or older, living in an urban area, being widowed, belonging to the wealthiest families, having access to AC, and being overweight. The results of this study suggest that age, especially older age, is the most crucial predictor of CVD, consistent with earlier studies [56,57].
This relationship is attributed to its association with obesity, chronic inflammation, and oxidative stress, all of which may elevate the risk of heart-related conditions. The highest prevalence of CVD has been observed in urban regions, where rapid urbanization often leads to a more sedentary lifestyle [58]. In urban settings, individuals are more likely to engage in desk-bound jobs and rely on vehicles for transportation, which results in reduced physical activity. This shift towards a sedentary lifestyle is compounded by factors such as increased availability of fast food, higher levels of stress, and environmental pollutants, all of which contribute to elevated risks for CVD. Moreover, the lack of green spaces and recreational areas in many urban environments further limits opportunities for exercise, making it essential to address these lifestyle changes through public health initiatives and urban planning to mitigate the impact of CVD in these populations. This study found that marital status is an associated predictor of CVD, which is consistent with findings from previous studies [57]. Studies have shown that unmarried individuals, especially those who are widowed, often exhibit a higher risk of CVD [59,60]. This association may be attributed to factors such as increased emotional stress, social isolation, and lifestyle changes that can occur following significant relationship changes, ultimately impacting cardiovascular health. Another important predictor identified in our study is wealth, particularly among the richest families, which is positively associated with a high risk of CVD [61–63]. While the richest wealth status is often associated with better access to healthcare and healthier lifestyle choices, it can also lead to increased stress levels, unhealthy dietary habits, and sedentary behaviors that contribute to a high risk of CVD. 
Given their busy lifestyles, wealthier individuals may rely more on processed foods and engage in less physical activity. Furthermore, the psychological stress associated with maintaining a high social and economic status can adversely affect cardiovascular health. Understanding this association is crucial for developing targeted interventions to address high CVD risk across different socioeconomic groups.
We found that having AC is an important predictor of CVD, consistent with previous research indicating the role of environmental factors in cardiovascular health [64–66]. Access to AC can help regulate indoor temperatures, reduce humidity, and improve air quality, all of which contribute to a healthier living environment. However, urban environments, where AC use is more common, are frequently characterized by more sedentary lifestyles due to limited opportunities for outdoor physical activity, increased reliance on mechanized transportation, and occupational structures that involve prolonged sitting. Sedentary behavior has been robustly linked to elevated risks of hypertension, obesity, and other cardiometabolic conditions, all of which are established precursors to CVD [67,68]. Moreover, prolonged use of AC may increase the time spent indoors, which could reduce both exposure to natural ventilation and physical activity. Urban dwellers who rely on AC may also be exposed to higher levels of indoor air pollutants, particularly in settings where ventilation is poor or where AC systems are not regularly maintained. Additionally, AC access may indicate greater exposure to urban outdoor air pollution, including nitrogen dioxide (NO₂) and ozone (O₃), which have been consistently associated with increased cardiovascular morbidity and mortality [69,70]. These pollutants are known to promote endothelial dysfunction, oxidative stress, and systemic inflammation, pathways that play critical roles in the pathogenesis of CVD [71]. This study has also shown that BMI, particularly being overweight, is a significant predictor of high CVD risk. Elevated BMI is associated with increased body fat, which can lead to various health issues, including hypertension, dyslipidemia, and insulin resistance, all of which are major risk factors for CVD. 
Overweight individuals often experience increased strain on the cardiovascular system, contributing to higher rates of heart disease and related complications [72,73]. Addressing overweight and obesity through lifestyle modifications, such as improved diet and increased physical activity, is crucial for reducing high CVD risk and promoting overall cardiovascular health. The identified important features can guide the development of targeted interventions. Raising awareness about healthy lifestyles, especially for older adults, is essential in mitigating the high risk of CVD. Targeted initiatives in key regions can strengthen health systems and increase awareness, which helps reduce the incidence of high CVD risk in these communities.
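To make the association-rule reasoning in this discussion concrete, the following minimal sketch computes the three standard rule metrics (support, confidence, and lift) for a rule of the form {age ≥ 65, urban} ⇒ {high risk}. The records, item labels, and function names are hypothetical and chosen only for illustration; they are not BDHS data and do not reproduce the rules or thresholds used in this study.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical toy records (NOT BDHS data): each respondent is a set of items.
records: List[FrozenSet[str]] = [
    frozenset({"age>=65", "urban", "high_risk"}),
    frozenset({"age>=65", "urban", "high_risk"}),
    frozenset({"age>=65", "rural", "high_risk"}),
    frozenset({"age<65", "urban", "low_risk"}),
    frozenset({"age<65", "rural", "low_risk"}),
    frozenset({"age>=65", "urban", "low_risk"}),
    frozenset({"age<65", "urban", "high_risk"}),
    frozenset({"age<65", "rural", "low_risk"}),
]


def support(itemset: Iterable[str], data: List[FrozenSet[str]]) -> float:
    """Fraction of records that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= row for row in data) / len(data)


def rule_metrics(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    data: List[FrozenSet[str]],
) -> Dict[str, float]:
    """Support, confidence, and lift of the rule antecedent => consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    supp_a = support(a, data)
    supp_c = support(c, data)
    if supp_a == 0 or supp_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    supp_rule = support(a | c, data)   # P(antecedent and consequent)
    confidence = supp_rule / supp_a    # P(consequent | antecedent)
    lift = confidence / supp_c         # > 1 suggests a positive association
    return {"support": supp_rule, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"age>=65", "urban"}, {"high_risk"}, records)
print(metrics)
```

A lift above 1 indicates that the antecedent combination co-occurs with the outcome more often than would be expected if the two were independent; like the cross-sectional analysis itself, it describes association and not causation.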
4.1. Limitations and future work
This study has several limitations that should be considered. First, the BDHS data used in this study are cross-sectional, which prevents us from establishing causal relationships between the predictors and the outcome variable. Second, the data were primarily based on self-reported responses, which may be subject to recall bias and social desirability bias. Participants might have underreported or overreported certain behaviors or health conditions, leading to potential misclassification. Third, the findings may not be generalizable to populations outside the survey setting or to individuals not included in the BDHS sample, such as those who are institutionalized or homeless. Fourth, some variables, such as having AC or watching television, were used as proxies for a sedentary lifestyle, which may not fully capture actual physical inactivity and could introduce measurement bias. To address these limitations, future studies should utilize longitudinal data to explore cause-and-effect relationships and changes over time in high-risk individuals for CVD. Incorporating reliable and standardized measures would allow a more accurate evaluation of a sedentary lifestyle. Additionally, incorporating more clinical variables, such as cholesterol levels, blood glucose, HbA1c, waist-hip ratio, alcohol use, and family history of CVD, could improve the model’s accuracy and provide a better understanding of what influences high CVD risk.
5. Conclusion
Several models were applied to predict the high risk of CVD. Among them, the XGB model, which utilized predictors selected by the Boruta method, demonstrated superior performance compared to the other models. The selected predictors were then used in association rule mining to investigate their patterns and interrelations with high CVD risk. The analysis identified age ≥ 65 years, urban residence, richest wealth status, having AC, and overweight BMI as the predictors associated with high CVD risk at higher rule confidence. These findings highlight the potential of such predictive models and association rules as decision-support tools, enabling healthcare professionals, policymakers, and other stakeholders to make informed decisions for early intervention and personalized treatment strategies. Ultimately, this can improve cardiovascular health outcomes and help reduce the growing mortality rate and healthcare burden in Bangladesh.
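For readers comparing model performance, the sketch below shows how the threshold-based evaluation metrics used in studies of this kind (accuracy, precision, recall, and F1-score) follow from a confusion matrix. The counts and the function name are hypothetical and serve only as an illustration; they are not the confusion matrix of the XGB model. AUC is omitted because it requires predicted scores, not just counts.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not results from this study).
scores = classification_metrics(tp=80, fp=20, fn=10, tn=40)
print(scores)
```

Because the F1-score is the harmonic mean of precision and recall, an F1-score that exceeds precision implies that recall is higher than precision, as in the hypothetical counts above.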
Supporting information
S1 Table. Comparative model performance with existing risk models.
https://doi.org/10.1371/journal.pone.0333913.s001
(DOCX)
References
- 1. Roth GA, Mensah GA, Johnson CO, Addolorato G, Ammirati E, Baddour LM, et al. Global Burden of Cardiovascular Diseases and Risk Factors, 1990-2019: Update From the GBD 2019 Study. J Am Coll Cardiol. 2020;76(25):2982–3021. pmid:33309175
- 2. Di Cesare M, Perel P, Taylor S, Kabudula C, Bixby H, Gaziano TA, et al. The Heart of the World. Glob Heart. 2024;19(1):11. pmid:38273998
- 3. Sacco RL, Roth GA, Reddy KS, Arnett DK, Bonita R, Gaziano TA, et al. The Heart of 25 by 25: Achieving the Goal of Reducing Global and Regional Premature Deaths From Cardiovascular Diseases and Stroke: A Modeling Study From the American Heart Association and World Heart Federation. Circulation. 2016;133(23):e674-90. pmid:27162236
- 4. Gaziano T, Reddy KS, Paccaud F, Horton S, Chaturvedi V. Cardiovascular disease. Disease Control Priorities in Developing Countries. 2nd edition. 2006.
- 5. WHO CVD Risk Chart Working Group. World Health Organization cardiovascular disease risk charts: revised models to estimate risk in 21 global regions. Lancet Glob Health. 2019;7(10):e1332–45. pmid:31488387
- 6. Rahman M, Nakamura K, Seino K, Kizuki M. Sociodemographic factors and the risk of developing cardiovascular disease in Bangladesh. Am J Prev Med. 2015;48(4):456–61. pmid:25498549
- 7. Chowdhury MZI, Haque MA, Farhana Z, Anik AM, Chowdhury AH, Haque SM, et al. Prevalence of cardiovascular disease among Bangladeshi adult population: a systematic review and meta-analysis of the studies. Vasc Health Risk Manag. 2018;14:165–81. pmid:30174432
- 8. Gheorghe A, Griffiths U, Murphy A, Legido-Quigley H, Lamptey P, Perel P. The economic burden of cardiovascular disease and hypertension in low- and middle-income countries: a systematic review. BMC Public Health. 2018;18(1):975. pmid:30081871
- 9. Baghdadi NA, Farghaly Abdelaliem SM, Malki A, Gad I, Ewis A, Atlam E. Advanced machine learning techniques for cardiovascular disease early detection and diagnosis. J Big Data. 2023;10(1).
- 10. Ogunpola A, Saeed F, Basurra S, Albarrak AM, Qasem SN. Machine Learning-Based Predictive Models for Detection of Cardiovascular Diseases. Diagnostics (Basel). 2024;14(2):144. pmid:38248021
- 11. DeGroat W, Abdelhalim H, Patel K, Mendhe D, Zeeshan S, Ahmed Z. Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine. Sci Rep. 2024;14(1):1. pmid:38167627
- 12. Mohi Uddin KM, Ripa R, Yeasmin N, Biswas N, Dey SK. Machine learning-based approach to the diagnosis of cardiovascular vascular disease using a combined dataset. Intelligence-Based Medicine. 2023;7:100100.
- 13. Ullah T, Ullah SI, Ullah K, Ishaq M, Khan A, Ghadi YY, et al. Machine Learning-Based Cardiovascular Disease Detection Using Optimal Feature Selection. IEEE Access. 2024;12:16431–46.
- 14. Patil PB, Shastry PM, Ashokumar PS. Machine learning based algorithm for risk prediction of cardio vascular disease. Journal of Critical Reviews. 2020;7(9):836–44.
- 15. Chinnasamy P, Arun Kumar S, Navya V, Lakshmi Priya K, Sruthi Boddu S. Machine learning based cardiovascular disease prediction. Materials Today: Proceedings. 2022;64:459–63.
- 16. Abdar M, Nasarian E, Zhou X, Bargshady G, Wijayaningrum VN, Hussain S. Performance improvement of decision trees for diagnosis of coronary artery disease using multi filtering approach. In: 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), 2019. 26–30.
- 17. Nasirzadeh F, Mir M, Hussain S, Tayarani Darbandy M, Khosravi A, Nahavandi S, et al. Physical Fatigue Detection Using Entropy Analysis of Heart Rate Signals. Sustainability. 2020;12(7):2714.
- 18. Sharifrazi D, Alizadehsani R, Hoseini Izadi N, Roshanzamir M, Shoeibi A, Khozeimeh F, et al. Hypertrophic cardiomyopathy diagnosis based on cardiovascular magnetic resonance using deep learning techniques. Colour Filtering. 2021.
- 19. National Institute of Population Research and Training (NIPORT) and ICF. Bangladesh Demographic and Health Survey 2022: Final Report. Dhaka, Bangladesh, and Rockville, Maryland, USA: NIPORT and ICF; 2024.
- 20. World Health Organization, International Society of Hypertension Writing Group. 2003 World Health Organization (WHO)/International Society of Hypertension (ISH) statement on management of hypertension. Journal of Hypertension. 2003;21(11):1983–92.
- 21. Das S, Rahman R, Talukder A. Determinants of developing cardiovascular disease risk with emphasis on type-2 diabetes and predictive modeling utilizing machine learning algorithms. Medicine (Baltimore). 2024;103(49):e40813. pmid:39654201
- 22. World Health Organization. Guideline for the pharmacological treatment of hypertension in adults. World Health Organization; 2021.
- 23. Zhou B, Perel P, Mensah GA, Ezzati M. Global epidemiology, health burden and effective interventions for elevated blood pressure and hypertension. Nat Rev Cardiol. 2021;18(11):785–802. pmid:34050340
- 24. World Health Organization. Hypertension control: report of a WHO Expert Committee. World Health Organization; 1996.
- 25. Lamptey P, Laar A, Adler AJ, Dirks R, Caldwell A, Prieto-Merino D, et al. Evaluation of a community-based hypertension improvement program (ComHIP) in Ghana: data from a baseline survey. BMC Public Health. 2017;17(1):368. pmid:28454523
- 26. Khanam F, Hossain MB, Mistry SK, Afsana K, Rahman M. Prevalence and Risk Factors of Cardiovascular Diseases among Bangladeshi Adults: Findings from a Cross-sectional Study. J Epidemiol Glob Health. 2019;9(3):176–84. pmid:31529935
- 27. Behera S, Sharma R, Yadav K, Chhabra P, Das M, Goel S. Prevalence and predictors of risk factors for cardiovascular diseases among women aged 15-49 years across urban and rural India: findings from a nationwide survey. BMC Womens Health. 2024;24(1):77. pmid:38281909
- 28. Gupta RD, Tamanna RJ, Hashan MR, Akonde M, Haider SS, Chakraborty PA, et al. Prevalence and Associated Factors with Ideal Cardiovascular Health Metrics in Bangladesh: Analysis of the Nationally Representative STEPS 2018 Survey. Epidemiologia (Basel). 2022;3(4):533–43. pmid:36547257
- 29. Avesta L, Rasoolzadeh S, Naeim M, Kamran A. Prevalence of Cardiovascular Disease Risk Factors in the Women Population Covered by Health Centers in Ardabil. Int J Hypertens. 2022;2022:2843249. pmid:35321055
- 30. Kundu J, Kundu S. Cardiovascular disease (CVD) and its associated risk factors among older adults in India: Evidence from LASI Wave 1. Clinical Epidemiology and Global Health. 2022;13:100937.
- 31. Saki N, Karandish M, Cheraghian B, Heybar H, Hashemi SJ, Azhdari M. Prevalence of cardiovascular diseases and associated factors among adults from southwest Iran: Baseline data from Hoveyzeh Cohort Study. BMC Cardiovasc Disord. 2022;22(1):309. pmid:35804295
- 32. Dhungana RR, Thapa P, Devkota S, Banik PC, Gurung Y, Mumu SJ, et al. Prevalence of cardiovascular disease risk factors: A community-based cross-sectional study in a peri-urban community of Kathmandu, Nepal. Indian Heart J. 2018;70 Suppl 3(Suppl 3):S20–7. pmid:30595258
- 33. Weir CB, Jan A. BMI classification percentile and cut off points. 2019.
- 34. Joseph VR, Vakayil A. SPlit: An Optimal Method for Data Splitting. Technometrics. 2021;64(2):166–76.
- 35. He H, Bai Y, Garcia EA, Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008. 1322–8. doi: https://doi.org/10.1109/ijcnn.2008.4633969
- 36. Kursa MB, Rudnicki WR. Feature Selection with the Boruta Package. J Stat Soft. 2010;36(11).
- 37. Chen R-C, Dewi C, Huang S-W, Caraka RE. Selecting critical features for data classification based on machine learning methods. J Big Data. 2020;7(1).
- 38. Ranganathan P, Pramesh CS, Aggarwal R. Common pitfalls in statistical analysis: Logistic regression. Perspect Clin Res. 2017;8(3):148–51. pmid:28828311
- 39. Berrar D. Bayes’ theorem and naive Bayes classifier.
- 40. Hassoun MH. Fundamentals of artificial neural networks. MIT Press; 1995.
- 41. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
- 42. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. 785–94.
- 43. Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med. 2013;4(2):627–35. pmid:24009950
- 44. Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993. 207–16. doi: https://doi.org/10.1145/170035.170072
- 45. Harahap M, Husein AM, Aisyah S, Lubis FR, Wijaya BA. Mining association rule based on the diseases population for recommendation of medicine need. J Phys: Conf Ser. 2018;1007:012017.
- 46. Altaf W, Shahbaz M, Guergachi A. Applications of association rule mining in health informatics: a survey. Artif Intell Rev. 2016;47(3):313–40.
- 47. Khare S, Gupta D. Association rule analysis in cardiovascular disease. In: 2016 Second International Conference on Cognitive Computing and Information Processing (CCIP), 2016. 1–6. doi: https://doi.org/10.1109/ccip.2016.7802881
- 48. Suhatril RJ, Syah RD, Hermita M, Gunawan B, Silfianti W. Evaluation of Machine Learning Models for Predicting Cardiovascular Disease Based on Framingham Heart Study Data. Ilk J Ilm. 2024;16(1):68–75.
- 49. Roth GA, Johnson C, Abajobir A, Abd-Allah F, Abera SF, Abyu G, et al. Global, Regional, and National Burden of Cardiovascular Diseases for 10 Causes, 1990 to 2015. J Am Coll Cardiol. 2017;70(1):1–25. pmid:28527533
- 50. Sianga BE, Mbago MC, Msengwa AS. Predicting the prevalence of cardiovascular diseases using machine learning algorithms. Intelligence-Based Medicine. 2025;11:100199.
- 51. Shah P, Shukla M, Dholakia NH, Gupta H. Predicting cardiovascular risk with hybrid ensemble learning and explainable AI. Sci Rep. 2025;15(1):17927. pmid:40410273
- 52. Hossain S, Hasan MK, Faruk MO, Aktar N, Hossain R, Hossain K. Machine learning approach for predicting cardiovascular disease in Bangladesh: evidence from a cross-sectional study in 2023. BMC Cardiovasc Disord. 2024;24(1):214. pmid:38632519
- 53. Bukaita W. Cardiovascular Disease Prediction Using Machine Learning. AJBSR. 2025;27(2):327–40.
- 54. Theerthagiri P. Predictive analysis of cardiovascular disease using gradient boosting based learning and recursive feature elimination technique. Intelligent Systems with Applications. 2022;16:200121.
- 55. Dinesh KG, Arumugaraj K, Santhosh KD, Mareeswari V. Prediction of cardiovascular disease using machine learning algorithms. In: 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), 2018. 1–7.
- 56. Mamgai A, Halder P, Behera A, Goel K, Pal S, Amudhamozhi KS, et al. Cardiovascular risk assessment using non-laboratory based WHO CVD risk prediction chart with respect to hypertension status among older Indian adults: insights from nationally representative survey. Front Public Health. 2024;12:1407918. pmid:39301516
- 57. Dhindsa DS, Khambhati J, Schultz WM, Tahhan AS, Quyyumi AA. Marital status and outcomes in patients with cardiovascular disease. Trends Cardiovasc Med. 2020;30(4):215–20. pmid:31204239
- 58. Park JH, Moon JH, Kim HJ, Kong MH, Oh YH. Sedentary Lifestyle: Overview of Updated Evidence of Potential Health Risks. Korean J Fam Med. 2020;41(6):365–73. pmid:33242381
- 59. Humbert X, Rabiaza A, Fedrizzi S, Alexandre J, Menotti A, Touzé E, et al. Marital status and long-term cardiovascular risk in general population (Gubbio, Italy). Sci Rep. 2023;13(1):6723. pmid:37185571
- 60. Molloy GJ, Stamatakis E, Randall G, Hamer M. Marital status, gender and cardiovascular mortality: behavioural, psychological distress and metabolic explanations. Soc Sci Med. 2009;69(2):223–8. pmid:19501442
- 61. Schultz WM, Kelli HM, Lisko JC, Varghese T, Shen J, Sandesara P, et al. Socioeconomic Status and Cardiovascular Outcomes: Challenges and Interventions. Circulation. 2018;137(20):2166–78. pmid:29760227
- 62. Nghiem N, Atkinson J, Nguyen BP, Tran-Duy A, Wilson N. Predicting high health-cost users among people with cardiovascular disease using machine learning and nationwide linked social administrative datasets. Health Econ Rev. 2023;13(1):9. pmid:36738348
- 63. Murphy A, Palafox B, O’Donnell O, Stuckler D, Perel P, AlHabib KF, et al. Inequalities in the use of secondary prevention of cardiovascular disease by socioeconomic status: evidence from the PURE observational study. Lancet Glob Health. 2018;6(3):e292–301. pmid:29433667
- 64. Münzel T, Hahad O, Sørensen M, Lelieveld J, Duerr GD, Nieuwenhuijsen M, et al. Environmental risk factors and cardiovascular diseases: a comprehensive expert review. Cardiovasc Res. 2022;118(14):2880–902. pmid:34609502
- 65. Liu J, Varghese BM, Hansen A, Zhang Y, Driscoll T, Morgan G, et al. Heat exposure and cardiovascular health outcomes: a systematic review and meta-analysis. Lancet Planet Health. 2022;6(6):e484–95. pmid:35709806
- 66. Bhatnagar A. Environmental Determinants of Cardiovascular Disease. Circ Res. 2017;121(2):162–80. pmid:28684622
- 67. Lee IM, Shiroma EJ, Lobelo F, Puska P, Blair SN, Katzmarzyk PT. Effect of physical inactivity on major non-communicable diseases worldwide: an analysis of burden of disease and life expectancy. The Lancet. 2012;380(9838):219–29.
- 68. Lavie CJ, Ozemek C, Carbone S, Katzmarzyk PT, Blair SN. Sedentary Behavior, Exercise, and Cardiovascular Health. Circ Res. 2019;124(5):799–815. pmid:30817262
- 69. Brook RD, Rajagopalan S, Pope CA 3rd, Brook JR, Bhatnagar A, Diez-Roux AV, et al. Particulate matter air pollution and cardiovascular disease: An update to the scientific statement from the American Heart Association. Circulation. 2010;121(21):2331–78. pmid:20458016
- 70. Newby DE, Mannucci PM, Tell GS, Baccarelli AA, Brook RD, Donaldson K, et al. Expert position paper on air pollution and cardiovascular disease. European Heart Journal. 2015;36(2):83–93.
- 71. Rajagopalan S, Al-Kindi SG, Brook RD. Air Pollution and Cardiovascular Disease: JACC State-of-the-Art Review. J Am Coll Cardiol. 2018;72(17):2054–70. pmid:30336830
- 72. Ortega FB, Lavie CJ, Blair SN. Obesity and Cardiovascular Disease. Circ Res. 2016;118(11):1752–70. pmid:27230640
- 73. Csige I, Ujvárosy D, Szabó Z, Lőrincz I, Paragh G, Harangi M, et al. The Impact of Obesity on the Cardiovascular System. J Diabetes Res. 2018;2018:3407306. pmid:30525052