Predicting the risk of hypertension using machine learning algorithms: A cross sectional study in Ethiopia

  • Md. Merajul Islam ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Visualization, Writing – original draft, Writing – review & editing

    jahangir_statru63@yahoo.com (MJA); merajul.stat4811@gmail.com (MMI)

    Affiliations Department of Statistics, Jatiya Kabi Kazi Nazrul Islam University, Trishal, Mymensingh, Bangladesh, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh

  • Md. Jahangir Alam ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    jahangir_statru63@yahoo.com (MJA); merajul.stat4811@gmail.com (MMI)

    Affiliations Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh, Mainanalytics GmbH, Sulzbach/Taunus, Germany

  • Md Maniruzzaman,

    Roles Conceptualization, Supervision, Validation, Writing – review & editing

    Affiliation Statistics Discipline, Khulna University, Khulna, Bangladesh

  • N. A. M. Faisal Ahmed,

    Roles Writing – review & editing

    Affiliation Institute of Education and Research, University of Rajshahi, Rajshahi, Bangladesh

  • Md Sujan Ali,

    Roles Writing – review & editing

    Affiliation Department of Computer Science and Engineering, Jatiya Kabi Kazi Nazrul Islam University, Trishal, Mymensingh, Bangladesh

  • Md. Jahanur Rahman,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh

  • Dulal Chandra Roy

    Roles Supervision, Writing – review & editing

    Affiliation Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh

Abstract

Background and objectives

Hypertension (HTN), a major global health concern, is a leading cause of cardiovascular disease, premature death, and disability worldwide. It is important to develop an automated system to diagnose HTN at an early stage. Therefore, this study devised a machine learning (ML) system for predicting patients at risk of developing HTN in Ethiopia.

Materials and methods

The HTN data were collected in Ethiopia and included 612 respondents with 27 factors. We employed the Boruta-based feature selection method to identify the important risk factors of HTN. Four well-known models [logistic regression, artificial neural network, random forest, and extreme gradient boosting (XGB)] were developed on the training set to predict HTN patients using the selected risk factors. The performance of the models was evaluated by accuracy, precision, recall, F1-score, and area under the curve (AUC) on the testing set. Additionally, the SHapley Additive exPlanations (SHAP) method, one of the explainable artificial intelligence (XAI) methods, was used to investigate the associated predictive risk factors of HTN.

Results

The overall prevalence of HTN was 21.2%. The XGB-based model was the most appropriate model for predicting patients at risk of HTN, achieving an accuracy of 88.81%, precision of 89.62%, recall of 97.04%, F1-score of 93.18%, and AUC of 0.894. The XGB with SHAP analysis revealed that age, weight, fat, income, body mass index, diabetes mellitus, salt, history of HTN, drinking, and smoking were the associated risk factors for developing HTN.

Conclusions

The proposed framework provides an effective tool for accurately predicting individuals in Ethiopia who are at risk for developing HTN at an early stage and may help with early prevention and individualized treatment.

Introduction

Hypertension (HTN), defined as blood pressure elevated beyond its normal range, is a major public health concern whose prevalence and effects among adults have been rising over time worldwide [1–3]. It is one of the most common serious chronic non-communicable diseases. Hypertensive people are affected by different types of cardiovascular diseases (CVDs), e.g., coronary heart disease, stroke, peripheral arterial disease, aortic disease, and myocardial infarction [4–7], which are the leading causes of disability, morbidity, and mortality and increase the economic burden of out-of-pocket expenditures (OOPE) [8–10]. As reported by the World Health Organization (WHO), around 9.4 million people worldwide die due to HTN every year [10]. According to Belay et al. (2022), the global prevalence of HTN was 26% in 2000 and was projected to reach around 1.56 billion people (29.2%) by 2025 [11]. The latest WHO estimate in 2021 revealed that about one-third (31.1%) of the world’s adult population had HTN (1.39 billion), of whom two-thirds were from low- and middle-income countries (LMICs) [12]. Also, a systematic analysis of population-based studies from 90 countries, including Ethiopia, estimated that HTN among adults was more prevalent in LMICs (31.5%) than in high-income countries (28.5%) [13]. Different epidemiological studies in Ethiopia reported that the prevalence of HTN ranged from 7.7% to 41.9% [14]. Moreover, HTN is disproportionately prevalent and is increasing alarmingly in resource-poor countries like Ethiopia [11]. Identifying HTN patients and their interpretable risk factors at an early stage could help mitigate and control the risk of HTN by giving patients timely prevention and intervention. It is therefore essential to detect HTN and identify its interpretable risk factors at an early stage.

Many research and empirical studies have determined several risk factors associated with HTN in LMICs, including Ethiopia [15–21]. Nevertheless, existing association studies have several limitations. Most importantly, previous studies relied on traditional linear models, such as logistic regression (LR) and the Cox proportional hazards model, to identify the risk factors significantly associated with HTN [22–24]. Moreover, real data with high-dimensional non-linear patterns present a challenge to traditional linear models, and the low precision of linear models impedes patient-level use. To overcome these limitations with complex real data, machine learning (ML), which is widely used in current public health research, might be the right choice. ML is a subset of artificial intelligence (AI) in which the algorithms that execute the prediction process collect the necessary information from previous experience and/or detect patterns in data to accomplish a task, typically classification or identification [25–28]. It can provide several advantages, including process automation, reliable probabilistic estimation for uncovering hidden patterns or relationships with high accuracy, lower labor costs and time for large amounts of data, support for decision-making or inference, and model interpretability [29–31]. Among the different types of learning algorithms in ML, supervised learning is the most popular and widely applicable. The goal of a supervised learning algorithm is to use the dataset to build a model that can predict the system’s output given new inputs. The two major types of supervised learning are regression and classification. Examples of regression include linear regression and logistic regression [32]. Examples of classification include ensemble methods, decision trees (DT), k-nearest neighbors (kNN), support vector machine (SVM), Naïve Bayes (NB), artificial neural network (ANN), and so on [32, 33]. The ensemble method is a machine learning technique that combines multiple models built with the same learning algorithm to achieve better predictive performance [34]. Ensemble methods include eXtreme gradient boosting (XGB), AdaBoost, histogram-based gradient boosting classification tree, and random forest (RF) [25]. Some researchers have previously developed multivariable prediction models using several ML and explainable artificial intelligence methods [35–37]. However, most of the existing risk prediction models were developed with a limited number of risk factors, which yielded lower accuracy for predicting HTN patients [35, 38]. Although DT and ensemble approaches have attracted great attention in recent years for identifying individuals at risk of HTN, there is no evidence that these algorithms have been successfully applied in Ethiopian clinical settings.

To the best of our knowledge, this is the first study to build a predictive model using ML algorithms for predicting the individual risk of HTN in Ethiopia. Thus, the objective of this study was to develop an efficient ensemble-based explainable ML framework for predicting patients at risk of HTN in Ethiopia.

Furthermore, we employed under-sampling and adaptive synthetic (ADASYN) class balancing strategies to enhance the confidence score of the developed prediction models. For model interpretation, we identified the key risk factors of HTN and the direction of the relationship between each risk factor and HTN using SHapley Additive exPlanations (SHAP), a post hoc model interpretation technique theoretically based on the Shapley value. The overall pipeline of the explainable machine learning based framework is displayed in Fig 1.

Fig 1. Workflow of the proposed ML-based methodology for predicting risk of HTN.

https://doi.org/10.1371/journal.pone.0289613.g001

The layout of this paper is as follows: Materials and methods covers the data source, statistical analysis, feature selection, machine learning algorithms, performance evaluation criteria, and model interpretability. The results are then presented in the Results section and discussed in the Discussion section. Finally, the conclusions are presented in the Conclusions section.

Materials and methods

Data source

The community-based cross-sectional data used in this investigation were collected in 2017 by the Hawassa city administration and made publicly available by Paulose et al. [39]. The data were collected through multistage random sampling and comprised a total of 633 respondents, aged 31 to 90 years, who had resided in the city for at least six months. The sample size was determined using a sample size formula that assumed a design effect of 1.5, a 95% confidence interval, a 5% margin of error, 80% power, a proportion of 50% (to maximize the sample size), and a 10% non-response rate [39]. Different levels of explanatory variables were included as individual risk factors of HTN, and the quantitative variables were categorized based on previous settings [18–20, 39]. A brief explanation of the included risk factors is presented in Table 1. In this study, a patient was classified as having HTN based on the WHO cutoff (systolic blood pressure ≥140 mmHg and/or diastolic blood pressure ≥90 mmHg and/or being on HTN medication at the time of data collection) [40]. Finally, a total of 612 respondents were included in this study after eliminating all records with missing values.

Table 1. Name, description, and categorization of the selected factors.

https://doi.org/10.1371/journal.pone.0289613.t001

Statistical analysis

The baseline and demographic characteristics of the patients were presented as percentages (%) for categorical data and as mean ± SD (standard deviation) for continuous data. The Pearson χ2-test was employed to determine the association between categorical risk factors and HTN, whereas for normally distributed continuous risk factors, the independent-samples t-test was used to examine the mean difference between the groups (HTN vs. non-HTN). Two-sided tests were performed, and a p-value of <0.05 was considered statistically significant for all tests. Data analysis was performed with SPSS (version 27.0) and R (version 4.2.2).

Feature selection

Feature selection (FS), or risk factor identification, is also known as variable selection or subset selection in statistics and ML. It selects the relevant features by removing irrelevant or redundant features from the dataset. In this study, the Boruta-based feature selection method (FSM) was adopted to identify the relevant features. Boruta is a wrapper-based feature selection method built around the random forest classifier. This method has a wide range of applications and has been reported to perform better than others because it is unbiased and stable [41].

Machine learning algorithms

This study used three different types of supervised ML algorithms for predicting patients with the risk of HTN (Table 2).

Table 2. Different machine learning algorithms with types.

https://doi.org/10.1371/journal.pone.0289613.t002

Logistic regression

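Logistic regression (LR) is one of the most popular supervised ML algorithms; it leverages the idea of probability and is mainly used for classification tasks [42]. The LR model employs the logistic function to estimate the probability of the response variable (HTN and non-HTN) in terms of one or more input features. The logistic function can be represented as follows

\[ p_j = \frac{\exp\left(\beta_0 + \sum_{k} \beta_k X_{kj}\right)}{1 + \exp\left(\beta_0 + \sum_{k} \beta_k X_{kj}\right)} \tag{1} \]

where $p_j$ denotes the probability of HTN and $(1 - p_j)$ denotes the probability of non-HTN for the jth individual; $X_{kj}$ is the kth input feature of the jth individual; $\beta_k$ is the kth regression coefficient; $\beta_0$ is the intercept; and the sum runs over the input features.

Eq (1) can be expressed as

\[ \log\left(\frac{p_j}{1 - p_j}\right) = \beta_0 + \sum_{k} \beta_k X_{kj} \tag{2} \]

and the odds as

\[ \frac{p_j}{1 - p_j} = \exp\left(\beta_0 + \sum_{k} \beta_k X_{kj}\right) \tag{3} \]

If $p_j \geq 0.5$, then we classify the jth individual as HTN, while if $p_j < 0.5$, then we classify the individual as non-HTN.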
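The logistic model described above can be illustrated with a minimal sketch. The coefficients, feature values, and the 0.5 cutoff below are hypothetical values chosen for illustration; they are not estimates from this study.

```python
import math

def logistic_probability(features, coefficients, intercept):
    """Return p = 1 / (1 + exp(-(intercept + sum_k beta_k * x_k)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))

def classify(probability, cutoff=0.5):
    """Label as HTN when the probability reaches the cutoff (0.5 is an assumption)."""
    return "HTN" if probability >= cutoff else "non-HTN"

# Hypothetical intercept and coefficients for two standardized inputs.
beta0, betas = -1.0, [0.8, 0.5]
p = logistic_probability([1.5, 0.2], betas, beta0)
odds = p / (1.0 - p)  # equals exp(beta0 + sum_k beta_k * x_k)
print(round(p, 3), round(odds, 3), classify(p))
```

Because the odds equal the exponential of the linear predictor, a one-unit increase in a feature multiplies the odds by the exponential of its coefficient, which is what makes LR coefficients directly interpretable.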

Artificial neural network

Artificial neural network (ANN) is a non-linear modeling algorithm inspired by the structure and function of the human brain. It consists of interconnected processing nodes organized into three types of layers: input, hidden, and output. The input layer is connected to the hidden layer through adjustable weights, and the hidden layer is connected to the output layer. In this method, X = (x1,…,xk) is used as the input vector in the back propagation (BP) algorithm for learning and mapping the relationship between the input features and the outcome variable. The BP algorithm propagates the error between the predicted and observed outcome backward through the network, adjusting the weights of the hidden layers with a non-linear sigmoid activation function [43]. The sigmoid activation function is defined as

\[ f(x) = \frac{1}{1 + e^{-x}} \tag{4} \]

This procedure is repeated iteratively until the weights no longer change between iterations or the error reaches its minimum.
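The iterative weight adjustment can be sketched for the simplest possible case: a single sigmoid unit trained by gradient descent. This is a deliberate simplification of a multi-layer network with back propagation, and the learning rate, loss function (log loss), inputs, and iteration count are all illustrative assumptions.

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def update_weights(weights, inputs, target, learning_rate=0.1):
    """One gradient step for a single sigmoid unit under log loss."""
    output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    error = output - target  # gradient of log loss w.r.t. the pre-activation
    return [w - learning_rate * error * x for w, x in zip(weights, inputs)]

# Hypothetical single training example with target class 1.
weights = [0.0, 0.0]
for _ in range(200):
    weights = update_weights(weights, [1.0, 2.0], target=1)
output = sigmoid(weights[0] * 1.0 + weights[1] * 2.0)
print(output)
```

In a full network the same error signal is passed backward layer by layer through the chain rule, so each hidden weight receives its own share of the correction.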

Random forest

Random forest (RF) is a popular machine learning algorithm that was developed by Leo Breiman and is widely used in classification and regression problems [44]. It is based on the concept of ensemble learning and trains multiple decision trees on random subsets of the data. The RF-based model is constructed using the following steps:

  1. Step 1: From the given training dataset (Xij, i = 1, 2… k, j = 1, 2… n), randomly select a subset using the bootstrap sampling procedure.
  2. Step 2: Build a decision tree (DT) on the new subset.
  3. Step 3: Repeat Step 1 and Step 2 until many trees have been constructed to form a forest.
  4. Step 4: Collect the prediction from each DT and select the final prediction by majority voting.
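The four steps above can be sketched in a few lines. To keep the example short, each tree is replaced by a one-split stump on a single feature, which is a stand-in and not a full decision tree; the toy data, the number of trees, and the seed are hypothetical. A real random forest also samples a random subset of features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_label(labels, default=None):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(rows):
    """Step 2 (stand-in for a decision tree): split feature 0 at the sample mean."""
    threshold = sum(x[0] for x, _ in rows) / len(rows)
    overall = majority_label([y for _, y in rows])
    left = majority_label([y for x, y in rows if x[0] <= threshold], overall)
    right = majority_label([y for x, y in rows if x[0] > threshold], overall)
    return lambda x: left if x[0] <= threshold else right

def fit_forest(rows, n_trees=25, seed=0):
    """Step 3: repeat sampling and tree building to form the forest."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def predict(forest, x):
    """Step 4: majority vote over the individual tree predictions."""
    return majority_label([tree(x) for tree in forest])

# Hypothetical toy data: (age,) -> 1 for HTN, 0 for non-HTN.
rows = [((30,), 0), ((35,), 0), ((40,), 0), ((60,), 1), ((65,), 1), ((70,), 1)]
forest = fit_forest(rows)
print(predict(forest, (68,)), predict(forest, (32,)))
```

Individual stumps trained on unlucky bootstrap samples can be wrong, but the vote across many of them is stable, which is the main motivation for the ensemble.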

Extreme gradient boosting

Extreme gradient boosting (XGB) is an efficient ensemble-based machine learning algorithm that combines decision trees with the gradient boosting algorithm. It is highly adaptable and works well in most classification problems, including HTN disease prediction [45]. Boosting is a learning algorithm that attempts to create a strong classifier from weak learners or classifiers. Here, weak and strong refer to how closely the predicted class correlates with the actual class. By adding classifiers on top of each other iteratively, each new classifier can correct the errors of the earlier ones. This procedure is repeated until the model accurately predicts the class label of the target variable on the training dataset.
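The idea of each new learner correcting the errors of the earlier ones can be sketched with plain gradient boosting under squared-error loss, where every round fits a one-split stump to the current residuals. This is a simplified illustration and not the XGBoost algorithm itself, which adds regularization and second-order information; the toy data, number of rounds, and learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # excluding the maximum keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additive model: base prediction plus shrunken stumps fitted to successive residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    learners = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in learners)

# Hypothetical toy data: age -> 1 for HTN, 0 for non-HTN.
ages = [30, 35, 40, 60, 65, 70]
labels = [0, 0, 0, 1, 1, 1]
model = boost(ages, labels)
print(model(68), model(32))
```

Each round halves the remaining residual on this toy data, so the fitted scores move steadily toward the observed labels.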

Data partition and balancing

We randomly divided the whole dataset into a 70% training set [HTN: 91 (21.2%), non-HTN: 338 (78.8%)] and a 30% testing set [HTN: 39 (21.3%), non-HTN: 144 (78.7%)] using a stratified sampling procedure [46]. The class labels of the data were imbalanced, i.e., the class distribution of observations was skewed. In a classification task, class imbalance biases the results toward the majority class of the response variable [47, 48]. Several data balancing strategies are widely used to deal with this problem. Among them, under-sampling and adaptive synthetic (ADASYN) balancing strategies were applied to the training set to balance the data. ADASYN is a generalized version of the synthetic minority oversampling technique (SMOTE) and generates new samples for the minority class using a weighted distribution [49].
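A stratified split of this kind can be sketched as follows. The function, the seed, and the choice to round the per-class test size down are assumptions made for illustration and need not match the procedure actually used; with 130 HTN and 482 non-HTN labels, this particular rounding happens to give the same class counts as reported above.

```python
import random

def stratified_split(labels, test_percent=30, seed=42):
    """Return (train_indices, test_indices) with similar class proportions in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = len(indices) * test_percent // 100  # integer arithmetic, rounds down
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

labels = ["HTN"] * 130 + ["non-HTN"] * 482
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class separately is what keeps the HTN share near 21% in both sets; a plain random split could leave the smaller testing set with too few HTN cases. Balancing methods such as under-sampling or ADASYN would then be applied to the training indices only.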

Cross validation and tune hyperparameters

The four ML algorithms mentioned above (LR, ANN, RF, and XGB) have additional parameters, called hyperparameters. Hyperparameters are parameters that the user explicitly defines before the learning process to improve model performance. The grid search method with a repeated 10-fold (K10) cross-validation protocol was used to tune the hyperparameter values in the training set. To perform the K10 protocol, the training dataset is divided in a 9:1 ratio into a training subset and a validation set. The caret package (version 6.0-93) in R was used to obtain the optimal hyperparameter values for the four models, which are displayed in Table 3.
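The 9:1 partitioning behind 10-fold cross-validation can be sketched with a small index generator. This is an illustrative Python stand-in, not the caret implementation; the function name and seed are hypothetical, and 429 is the training set size reported above.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(429, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In a grid search, every candidate hyperparameter combination would be fitted on each training subset and scored on the matching validation fold, and the combination with the best average score would be kept. Repeating the procedure with different seeds gives the repeated variant.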

Table 3. The value of hyperparameter for ML-based models.

https://doi.org/10.1371/journal.pone.0289613.t003

Performance evaluation criteria

The performance of the four selected ML models was evaluated by five popular evaluation criteria: accuracy, precision, recall, F1-score, and area under the curve (AUC). The values of the performance evaluation criteria were calculated from the four cells of the confusion matrix (Table 4):

  1. True positive (tp): model predicted the disease group as HTN where actual group was HTN,
  2. False positive (fp): model predicted the disease group as HTN where actual group was non-HTN,
  3. False negative (fn): model predicted the control group non-HTN where actual class was HTN,
  4. True negative (tn): model predicted the control group non-HTN where actual group was non-HTN.

Accuracy.

It is used to assess the overall correctness of the models. It is defined as the ratio of correctly classified cases (tp and tn) to the total number of cases. Accuracy is defined mathematically as

\[ \text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \tag{5} \]

Precision.

It is the ratio of tp cases to all predicted positive (HTN) cases. It is also called the positive predictive value and is used to assess how reliable the model’s positive predictions are. Precision is defined mathematically as

\[ \text{Precision} = \frac{tp}{tp + fp} \tag{6} \]

Recall.

It is the ratio of tp cases to the actual positive (HTN) cases. A model with high recall has a low fn count. It is also called sensitivity or true positive rate (TPR). Recall is defined mathematically as

\[ \text{Recall} = \frac{tp}{tp + fn} \tag{7} \]

F1-score.

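It is the harmonic mean of precision and recall. The F1-score is defined mathematically as

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8} \]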
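The four criteria defined so far follow directly from the confusion-matrix counts, as the sketch below shows. The counts passed to the function are hypothetical and are not taken from this study; the final lines only check that the reported precision (89.62%) and recall (97.04%) are consistent with the reported F1-score (93.18%).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion-matrix counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, fn=9, tn=134)
print({name: round(value, 4) for name, value in metrics.items()})

# Consistency check on the reported XGB results: harmonic mean of precision and recall.
reported_precision, reported_recall = 0.8962, 0.9704
reported_f1 = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(reported_f1, 4))
```

Because the F1-score is a harmonic mean, it is pulled toward the smaller of precision and recall, so a model cannot score well by maximizing only one of them.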

Area under the curve

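The AUC is defined as the integral of the receiver operating characteristic (ROC) function over the given range and is used to assess the quality of the built predictive model. The mathematical formula of AUC is as follows

\[ \text{AUC} = \int_{0}^{1} \text{ROC}(u)\, du \tag{9} \]

where $\text{ROC}(u)$ denotes the true positive rate at false positive rate $u$.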

A ROC curve is a plot of TPR or sensitivity on the y axis against the false positive rate (FPR) or 1-specificity on the x axis for different cutoff values. The AUC derived from the ROC curve is broadly used in medical diagnosis as a single-number measure for evaluating the predictive validity of ML-based models [50]. AUC values range from 0 to 1.
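As a minimal sketch of how the integral is evaluated in practice, the ROC points can be traced by lowering the cutoff through every observed score and the area summed with the trapezoidal rule. The scores and labels below are hypothetical, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the cutoff through every score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true labels (1 = HTN, 0 = non-HTN).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_trapezoid(roc_points(scores, labels))
print(round(auc, 4))
```

The same number can be read as the probability that a randomly chosen HTN case receives a higher score than a randomly chosen non-HTN case; here 8 of the 9 positive-negative pairs are ordered correctly.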

Model interpretability

Shapley additive explanations (SHAP) is an interpretability and visualization approach constructed on Shapley values. This method was introduced by Lundberg and Lee (2017) and is widely used to explain local and global importance by computing the contribution of each risk factor in the ML-based prediction model [51]. The explanation value of SHAP originates from coalitional game theory, where each predictor is treated as an individual player in a game or coalition. The SHAP framework offers a fair allocation of the model outcome among the players and provides a series of desirable properties/axioms, including consistency, efficiency, dummy, and additivity [52]. The efficiency property of the SHAP method provides more reliable results than other methods, for example local interpretable model-agnostic explanations [53]. Risk factors contribute to the model’s prediction with different magnitudes and signs, which are accounted for by Shapley values. Accordingly, Shapley values represent estimates of both the magnitude of a feature’s contribution and its direction (sign). Risk factors with positive SHAP values push the model toward predicting a patient as HTN, whereas risk factors with negative SHAP values push it toward predicting non-HTN (control). In particular, the importance of each risk factor, say the kth risk factor, is measured by the Shapley value defined by the following formula

\[ \phi_k(v) = \sum_{S \subseteq M \setminus \{k\}} \frac{|S|!\,\left(|M| - |S| - 1\right)!}{|M|!} \left[ v(S \cup \{k\}) - v(S) \right] \tag{10} \]

where $S$ denotes a subset of risk factors that does not include the risk factor for which we are calculating $\phi_k(v)$; $S \cup \{k\}$ is the subset that includes the risk factors in $S$ together with the kth risk factor; $v(S)$ is the outcome of the ML-based model explained using the risk factors in $S$; and $S \subseteq M \setminus \{k\}$ ranges over all subsets of the full set $M$ of risk factors, excluding the kth risk factor.
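The Shapley value can be computed exactly for a small number of risk factors by enumerating every subset, as in the sketch below. The value function is a hypothetical toy model (age and weight contribute individually and jointly, salt contributes nothing) and is not the fitted XGB model; practical SHAP implementations rely on faster model-specific or approximate algorithms because exact enumeration grows exponentially.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    result = {}
    for k in players:
        others = [p for p in players if p != k]
        total = 0.0
        for size in range(n):  # subset sizes 0 .. n-1 drawn from the other players
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {k}) - value(s))  # marginal contribution of k
        result[k] = total
    return result

def toy_value(subset):
    """Hypothetical model output when only the risk factors in `subset` are known."""
    score = 0.3 * ("age" in subset) + 0.1 * ("weight" in subset)
    if "age" in subset and "weight" in subset:
        score += 0.2  # interaction shared equally between the two factors
    return score

phi = shapley_values(["age", "weight", "salt"], toy_value)
print({name: round(v, 3) for name, v in phi.items()})
```

The result illustrates two of the axioms named above: the values sum to the difference between the full-model output and the empty-set output (efficiency), and a factor that never changes the output receives zero (dummy).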

Results

Baseline characteristics

This study enrolled 612 participants (HTN: 130, 21.2% and non-HTN: 482, 78.8%) with 27 HTN-related predictor variables (Table 5). About 53.4% of the respondents were male, and more than half of the respondents lived in urban areas. The average age of the participants was 47.56.20±13.40 years, height 165.20±8.87 cm, and weight 66.589±8.769 kg. Obese respondents showed a higher prevalence of HTN than normal-weight respondents (50.0% vs. 13.4%). HTN was more prevalent among patients with diabetes (47.5% vs. 30.0%) and among smokers (50.4% vs. 23.8%). The prevalence of HTN was greater among respondents who had a family history of diabetes (41.8% vs. 11.2%) and of HTN (60.3% vs. 21.9%). The association analysis showed that residence, sex, age, occupation, income, PA, walking, diabetes, height, weight, BMI, smoking, drinking, vegetable, fat, salt, transport, HD, wealth, and HHTN were significantly associated with HTN (P-value<0.005).

Risk factors selection using Boruta

The result of Boruta based feature selection method is presented in Fig 2. The method showed that age, occupation, PA, walking, diabetes, height, weight, BMI, smoking, drinking, vegetable, fat, transport, HD, wealth, and HHTN were the important risk factors of HTN. The selected risk factors were included to construct the ML-based model for prediction of HTN status (HTN or non-HTN).

Fig 2. Risk factors selection using Boruta based feature selection method.

https://doi.org/10.1371/journal.pone.0289613.g002

Performance comparisons of ML-based models

The performance of the four ML-based models with under-sampling and ADASYN is shown in Table 6 and S1 Fig. Notably, the XGB model with the ADASYN balancing method achieved the highest predictive discrimination ability, with an accuracy of 88.81% (95% CI: 85.44–91.63), precision of 89.62%, recall of 97.04%, F1-score of 93.18%, and AUC of 0.894 (95% CI: 0.827–0.961), compared to the others.

Table 6. Performance of four models with two class balancing methods.

https://doi.org/10.1371/journal.pone.0289613.t006

The corresponding ROC curves and precision-recall curves of the four predictive models with ADASYN are displayed in Fig 3. These curves also indicated that the XGB model performed better than the other models (LR, ANN, and RF). Therefore, our results showed that the XGB-based model with ADASYN performed best among the compared models.

Fig 3.

(a) ROC curves and (b) Precision vs. recall curves of four predictive models.

https://doi.org/10.1371/journal.pone.0289613.g003

Interpretable risk factors of hypertension

SHAP analysis was carried out to determine the interpretable predictive risk factors of HTN for the best-performing prediction model (XGB) based on the SHAP values. Fig 4(A) shows the global importance of each risk factor in the XGB-based model. The importance plot shows only the global influence of each feature on the prediction; it does not indicate which risk factors push the prediction positively (HTN) or negatively (non-HTN). For that reason, summary plots were produced, which provide a global macro-level explanation of how the input risk factors contribute to the prediction. Fig 4(B) presents the summary plot indicating the importance, impact, original value, and correlation of the risk factors with a high risk of HTN. In particular, the effect [positive (HTN) vs. negative (non-HTN)] is shown on the x-axis. The color signifies the value of a specific risk factor, where red indicates a high value and blue indicates a low value. Overall, the XGB-based model showed that age, weight, fat, income, BMI, diabetes, salt, HHTN, drinking, and smoking were the most influential interpretable risk factors for the prediction of HTN.

Fig 4. Importance of risk factors based on SHAP values.

(A) Mean absolute SHAP values, to explain global risk factor importance, (B) Local explanation summary, to reveal the direction of the relationship between a risk factor and the predicted outcome.

https://doi.org/10.1371/journal.pone.0289613.g004

Discussion

In this study, we investigated several ML-based algorithms to propose an explainable framework for predicting the risk of HTN in Ethiopia. We trained four ML algorithms (LR, ANN, RF, and XGB) to predict HTN using 16 risk factors obtained from the Boruta feature selection method. The performance of the developed models was compared by accuracy, precision, recall, F1-score, and the ROC curve with its AUC value on the testing set. Based on these performance measurements, we proposed the XGB model as the most appropriate candidate classifier for predicting HTN.

Several studies have used ML frameworks to predict the risk of HTN. A comparison of the present study with the existing studies is presented in Table 7. Chowdhury et al. [54] proposed a system based on 18,322 respondents with 24 candidate risk factors in Canada. Before constructing the models, they applied five FSMs to select the significant risk factors and adopted five ML algorithms, including LASSO, Elastic Net, random survival forest (RSF), and gradient boosting, alongside the conventional Cox proportional hazards model for predicting HTN. They measured the performance of each model by the C-index. Pratiwi [35] applied four ML algorithms (DT, RF, GB, and LR) for predicting the individual risk of HTN in Indonesia. The models were developed with the K10 protocol on the training set, and their prediction performance was measured on the testing set in terms of accuracy, precision, recall, F1-score, and AUC. LR was reported as the marginally best performer, with an AUC of 0.829. Oanh and Tung [55] suggested an ML-based model to predict patients at risk of HTN in Vietnam. The model was developed on the training set using Naïve Bayes (NB), multilayer perceptron (MLP), decision tree (DT), k-nearest neighbors (kNN), SVM, and ensemble algorithms: bagging (RF), boosting, and voting. The performance of the models was assessed on the testing set in terms of F1-score, precision, and recall. Islam et al. [38] conducted a study on three countries: Bangladesh, Nepal, and India. They included 818,603 respondents with seven risk factors and applied the DT, RF, GBM, XGB, LR, and LDA algorithms for predicting HTN patients. They reported that XGB achieved the best performance score. Chai et al. [56] used Malaysian data with 2461 respondents and 11 covariates to develop a system for diagnosing HTN patients with three types of algorithms: neural network (MLP), classical models (LR, DT, NB, k-NN), and ensemble models (RF, SVM, GB, XGB, LightGBM, CatBoost, AdaBoost, and LogitBoost). Before building the model, they adopted a correlation-based FSM to select a set of leading features and used the SMOTE technique to balance the class labels of the data. They evaluated the predictive ability of the models by sensitivity, specificity, accuracy, precision, F1-score, misclassification rate, and AUC on the testing set and found that the LightGBM-based model achieved the best accuracy, 74.39%. Islam et al. [57] used nationally representative HTN data from Bangladesh. The data consisted of 6965 subjects with 13 risk factors. They determined the prominent risk factors of HTN with two popular FSMs, LASSO and SVMRFE. They then used the K10 protocol to construct models with four ML algorithms on the training set and measured the performance of the models on the testing set using accuracy, precision, recall, F1-score, and AUC. Overall, the experimental results demonstrated that the gradient boosting model attained the best AUC (0.669). Zheng et al. [58] explored a system for predicting HTN patients using several ML techniques in the USA. No feature selection method was used to select the prominent features of HTN before constructing the ML-based system. They found that the ANN model reached the maximum performance score. Alkaabi et al. [59] used HTN data from Qatar. The dataset comprised 987 respondents with 12 risk factors. They adopted three ML-based algorithms: DT, RF, and LR. Overall, the experimental results indicated that the RF model provided better generalization and predictive ability than the others.

Table 7. Comparative performance of the proposed study with the existing studies.

https://doi.org/10.1371/journal.pone.0289613.t007

Thus, the comparative results suggested that our proposed XGB framework can predict HTN with higher AUC (Table 7). Moreover, SHAP analysis with the proposed method revealed that age, weight, fat, income, diabetes, BMI, height, salt, smoking, and HHTN were the associated risk factors for developing HTN. Local explanation summary plot showed that age is the 1st leading risk factor of HTN in Ethiopia. A study conducted by Belay et al., [2022] in Ethiopia found that a patient with age>60 years was two times more likely to have HTN than those with age 18–40 years [11]. This result also supported by several systematic review and meta-analysis studies [60, 61]. The vascular system of our body changes in arteries, particularly with large artery stiffness caused by older age. Weight and fat are the 2nd and 3rd leading drivers of HTN. This finding supports the conclusions of earlier investigations [62]. Excess body weight increases visceral and retroperitoneal fat, which can contribute to the development of HTN. Household income is linked to the risk of HTN, which was in line with the prior investigations [63]. Due to a number of reasons, including the ongoing nutritional transition, rising trends in sedentary lifestyle, and other modifiable risk factors, people from low-income families may have a greater burden from the disease [64]. BMI is another gradient of HTN which is corroborated with the earlier studies [65]. BMI might be a cause of HTN and other cardiovascular disease by stimulating the renin-aldosterone system and endothelial dysfunction [66]. Diabetes is another important marker of HTN. The two medical conditions diabetes and HTN may cause each other and share common risk factors. HHTN is another important covariate of HTN. This result is also coincided with the previous studies conducted in Ethiopia and other countries [67]. 
This might be because family members share the same genetic factors, behaviors, broadly similar lifestyles, and environmental factors that could influence the risk of HTN. Additionally, other risk factors such as salt, drinking alcohol, and smoking were found to be important contributing risk factors of HTN, which is consistent with other studies in the literature [68, 69]. Although this work has many strengths, it also has some limitations. The sample included only permanent residents of the city administration who had lived in the area for more than six months and were older than 30 years. Additionally, the study did not record the amounts of alcohol, cigarettes, fruits, vegetables, fats, and salt consumed in measurable units.
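The risk-factor ranking discussed above rests on Shapley-value attributions [51, 52]. As a minimal illustration of the idea, and not the pipeline used in this study, the sketch below computes exact Shapley values for a single individual under a small hypothetical risk score. The function `toy_risk`, its coefficients, the three features, and the baseline and patient profiles are all assumptions made up for illustration. In practice, SHAP uses faster model-specific approximations for tree ensembles such as XGB rather than enumerating every feature ordering.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Averages each feature's marginal contribution over every ordering in
    which features are switched from their baseline value to the
    instance's value. Only feasible for a handful of features.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)           # start from the reference profile
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(x):
    """Hypothetical risk score; NOT the fitted XGB model from the study."""
    score = 0.02 * x["age"] + 0.01 * x["weight"] + 0.3 * x["diabetes"]
    if x["age"] > 60 and x["diabetes"] == 1:
        score += 0.2                        # illustrative interaction term
    return score


# Hypothetical reference profile and individual (illustrative values only).
baseline = {"age": 40, "weight": 60, "diabetes": 0}
patient = {"age": 65, "weight": 80, "diabetes": 1}

phi = shapley_values(toy_risk, patient, baseline)

# Rank features by the magnitude of their contribution for this individual.
ranking = sorted(phi, key=lambda name: abs(phi[name]), reverse=True)
print(ranking)
print(phi)
```

The attributions sum to the difference between the individual's score and the baseline score, which is the property that lets per-feature contributions be read as shares of a prediction. A global ranking, as in a SHAP summary plot, is typically obtained by averaging the absolute attributions over all individuals.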

Conclusions

In this study, we adopted four different machine learning algorithms to build the most appropriate predictive model for the classification of HTN. The overall experimental results indicated that, among the four models, XGB is the most appropriate for predicting patients at risk of HTN. The SHAP analysis revealed that age, weight, fat, income, BMI, diabetes, salt, HHTN, drinking, and smoking are the highest-contributing risk factors for developing HTN. Therefore, the proposed integrated system can be used as a tool in clinical settings to accurately identify patients at risk of HTN at an early stage. With this information, a doctor can make decisions that reduce healthcare costs and time while also enabling individualized interventions and targeted treatment to minimize the burden of HTN in Ethiopia.

Supporting information

S1 Fig.

ROC curve of four models with two class balancing methods, (a) under-sampling and (b) ADASYN.

https://doi.org/10.1371/journal.pone.0289613.s001

(DOCX)
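Each ROC curve of this kind is usually summarized by its AUC, the quantity used to compare models in this study. As a hedged sketch of how that summary can be computed, the code below evaluates AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels, the scores, and the names `model_a` and `model_b` are hypothetical and are not results from this study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC requires at least one case of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels (1 = hypertensive) and predicted risks.
labels = [1, 1, 1, 0, 0, 0, 0]
model_scores = {
    "model_a": [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1],
    "model_b": [0.6, 0.4, 0.3, 0.7, 0.5, 0.2, 0.1],
}

aucs = {name: auc(labels, scores) for name, scores in model_scores.items()}
best = max(aucs, key=aucs.get)   # model with the largest AUC
print(aucs, best)
```

Because AUC depends only on how cases are ranked, it remains comparable across class-balancing schemes as long as it is computed on the same untouched test set.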

Acknowledgments

The authors would like to thank the PLOS ONE editor and reviewers for their valuable comments and suggestions, which improved the quality of the manuscript.

References

  1. Mills KT, Stefanescu A, He J. The global epidemiology of hypertension. Nature Reviews Nephrology. 2020;16(4):223–37. pmid:32024986
  2. GBD 2017 Risk Factor Collaborators. Global, regional, and national comparative risk assessment of 84 behavioural, environmental and occupational, and metabolic risks or clusters of risks for 195 countries and territories, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. 2018;392:1923–94. pmid:30496105
  3. GBD 2017 Causes of Death Collaborators. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. 2018;392:1736–88. pmid:30496103
  4. Gupta R, Xavier D. Hypertension: the most important non communicable disease risk factor in India. Indian heart journal. 2018;70(4):565–72. pmid:30170654
  5. Fuchs FD, Whelton PK. High blood pressure and cardiovascular disease. Hypertension. 2020;75(2):285–92. pmid:31865786
  6. Roth GA, Mensah GA, Johnson CO, Addolorato G, Ammirati E, Baddour LM, et al. Global burden of cardiovascular diseases and risk factors, 1990–2019: update from the GBD 2019 study. Journal of the American College of Cardiology. 2020;76(25):2982–3021. pmid:33309175
  7. Rapsomaniki E, Timmis A, George J, Pujades-Rodriguez M, Shah AD, Denaxas S, et al. Blood pressure and incidence of twelve cardiovascular diseases: lifetime risks, healthy life-years lost, and age-specific associations in 1·25 million people. The Lancet. 2014;383(9932):1899–911.
  8. Sorato MM, Davari M, Kebriaeezadeh A, Sarrafzadegan N, Shibru T. Societal economic burden of hypertension at selected hospitals in southern Ethiopia: a patient-level analysis. BMJ open. 2022;12(4):e056627. pmid:35387822
  9. Mehta R, Mantri N, Goel AD, Gupta MK, Joshi NK, Bhardwaj P. Out-of-pocket spending on hypertension and diabetes among patients reporting in a health-care teaching institute of the Western Rajasthan. Journal of Family Medicine and Primary Care. 2022;11(3):1083. pmid:35495832
  10. Berek PA, Irawati D, Hamid AY. Hypertension: A global health crisis. Ann Clin Hypertens. 2021;5:8–11.
  11. Belay DG, Fekadu H, Molla MD, Chekol HA, Adugna DG, Melese E, et al. Prevalence and associated factors of hypertension among adult patients attending the outpatient department at the primary hospitals of Wolkait tegedie zone, Northwest Ethiopia. Frontiers in Neurology. 2022;13:943595. pmid:36034276
  12. Mamdouh H, Alnakhi WK, Hussain HY, Ibrahim GM, Hussein A, Mahmoud I, et al. Prevalence and associated risk factors of hypertension and pre-hypertension among the adult population: findings from the Dubai Household Survey, 2019. BMC Cardiovascular Disorders. 2022;22(1):18. pmid:35090385
  13. Tesfa E, Demeke D. Prevalence of and risk factors for hypertension in Ethiopia: A systematic review and meta‐analysis. Health Science Reports. 2021;4(3):e372. pmid:34589614
  14. Anjulo U, Haile D, Wolde A. Prevalence of Hypertension and Its Associated Factors Among Adults in Areka Town, Wolaita Zone, Southern Ethiopia. Integrated Blood Pressure Control. 2021;14:43–54. pmid:33758539
  15. Damtie D, Bereket A, Bitew D, Kerisew B. The prevalence of hypertension and associated risk factors among secondary school teachers in Bahir Dar City administration, Northwest Ethiopia. International Journal of Hypertension. 2021;2021:525802. pmid:33953969
  16. Asresahegn H, Tadesse F, Beyene E. Prevalence and associated factors of hypertension among adults in Ethiopia: a community based cross-sectional study. BMC research notes. 2017;10:1–8.
  17. Khanam R, Ahmed S, Rahman S, Al Kibria GM, Syed JR, Khan AM, et al. Prevalence and factors associated with hypertension among adults in rural Sylhet district of Bangladesh: a cross-sectional study. BMJ open. 2019;9(10):e026722. pmid:31662350
  18. Matsuzaki M, Sherr K, Augusto O, Kawakatsu Y, Ásbjörnsdóttir K, Chale F, et al. The prevalence of hypertension and its distribution by sociodemographic factors in Central Mozambique: a cross sectional study. BMC public health. 2020;20:1–9.
  19. Sharma JR, Mabhida SE, Myers B, Apalata T, Nicol E, Benjeddou M, et al. Prevalence of hypertension and its associated risk factors in a rural black population of Mthatha town, South Africa. International Journal of Environmental Research and Public Health. 2021;18(3):1215. pmid:33572921
  20. Manios Y, Androutsos O, Lambrinou CP, Cardon G, Lindstrom J, Annemans L, et al. A school-and community-based intervention to promote healthy lifestyle and prevent type 2 diabetes in vulnerable families across Europe: design and implementation of the Feel4Diabetes-study. Public Health Nutrition. 2018;21(17):3281–90. pmid:30207513
  21. Hong K, Yu ES, Chun BC. Risk factors of the progression to hypertension and characteristics of natural history during progression: A national cohort study. Plos one. 2020;15(3):e0230538. pmid:32182265
  22. Chowdhury MZ, Naeem I, Quan H, Leung AA, Sikdar KC, O’Beirne M, et al. Prediction of hypertension using traditional regression and machine learning models: A systematic review and meta-analysis. Plos one. 2022;17(4):e0266334. pmid:35390039
  23. Chowdhury MZ, Leung AA, Sikdar KC, O’Beirne M, Quan H, Turin TC. Development and validation of a hypertension risk prediction model and construction of a risk score in a Canadian population. Scientific Reports. 2022;12(1):12780. pmid:35896590
  24. Ghosh S, Kumar M. Prevalence and associated risk factors of hypertension among persons aged 15–49 in India: a cross-sectional study. BMJ open. 2019;9(12):e029714. pmid:31848161
  25. Baştanlar Y, Özuysal M. Introduction to machine learning. miRNomics: MicroRNA biology and computational analysis. Humana Press. 2014:105–28.
  26. Ghaderzadeh M, Asadi F, Hosseini A, Bashash D, Abolghasemi H, Roshanpour A. Machine learning in detection and classification of leukemia using smear blood images: a systematic review. Scientific Programming. 2021;2021:1–4.
  27. Ghaderzadeh M, Rebecca FE, Standring A. Comparing performance of different neural networks for early detection of cancer from benign hyperplasia of prostate. Applied Medical Informatics. 2013;33(3):45–54.
  28. Salehnasab C, Hajifathali A, Asadi F, Parkhideh S, Kazemi A, Roshanpoor A, et al. An Intelligent Clinical Decision Support System for Predicting Acute Graft-versus-host Disease (aGvHD) following Allogeneic Hematopoietic Stem Cell Transplantation. Journal of Biomedical Physics & Engineering. 2021;11(3):345. pmid:34189123
  29. Kruppa J, Liu Y, Biau G, Kohler M, Koenig IR, Malley JD, et al. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biometrical Journal. 2014;56(4):534–63. pmid:24478134
  30. Garavand A, Salehnasab C, Behmanesh A, Aslani N, Zadeh AH, Ghaderzadeh M. Efficient model for coronary artery disease diagnosis: a comparative study of several machine learning algorithms. Journal of Healthcare Engineering. 2022;2022. pmid:36304749
  31. Nadim K, Ragab A, Ouali MS. Data-driven dynamic causality analysis of industrial systems using interpretable machine learning and process mining. Journal of Intelligent Manufacturing. 2023;34(1):57–83.
  32. Géron A. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc. 2022.
  33. Rezaianzadeh A, Dastoorpoor M, Sanaei M, Salehnasab C, Mohammadi MJ, Mousavizadeh A. Predictors of length of stay in the coronary care unit in patient with acute coronary syndrome based on data mining methods. Clinical Epidemiology and Global Health. 2020;8(2):383–8.
  34. Kumar A, Mayank J. Ensemble learning for AI developers. BA press: Berkeley, CA, USA. 2020.
  35. Kurniawan R, Utomo B, Siregar KN, Ramli K, Besral B, Suhatril RJ, et al. Hypertension prediction using machine learning algorithm among Indonesian adults. IAES International Journal of Artificial Intelligence. 2023;12(2):776–84.
  36. Visco V, Izzo C, Mancusi C, Rispoli A, Tedeschi M, Virtuoso N, et al. Artificial Intelligence in Hypertension Management: An Ace up Your Sleeve. Journal of Cardiovascular Development and Disease. 2023;10(2):74. pmid:36826570
  37. Alsaleh MM, Allery F, Choi JW, Hama T, McQuillin A, Wu H, et al. Prediction of disease comorbidity using explainable artificial intelligence and machine learning techniques: A systematic review. International Journal of Medical Informatics. 2023;175:105088. pmid:37156169
  38. Islam SM, Talukder A, Awal MA, Siddiqui MM, Ahamad MM, Ahammed B, et al. Machine Learning Approaches for Predicting Hypertension and Its Associated Factors Using Population-Level Data from Three South Asian Countries. Frontiers in Cardiovascular Medicine. 2022;9:839379. pmid:35433854
  39. Paulose T, Nkosi ZZ, Endriyas M. Prevalence of hypertension and its associated factors in Hawassa city administration, Southern Ethiopia: Community based cross-sectional study. Plos one. 2022;17(3):e0264679. pmid:35231073
  40. Park S. Ideal target blood pressure in hypertension. Korean Circulation Journal. 2019;49(11):1002–9. pmid:31646769
  41. Pudjihartono N, Fadason T, Kempa-Liehr AW, O’Sullivan JM. A review of feature selection methods for machine learning-based disease risk prediction. Frontiers in Bioinformatics. 2022;2:927312. pmid:36304293
  42. Ranganathan P, Pramesh CS, Aggarwal R. Common pitfalls in statistical analysis: logistic regression. Perspectives in clinical research. 2017;8(3):148. pmid:28828311
  43. Montesinos López OA, Montesinos López A, Crossa J. Fundamentals of Artificial Neural Networks and Deep Learning. In: Multivariate Statistical Machine Learning Methods for Genomic Prediction. Cham: Springer International Publishing. 2022 (pp. 379–425).
  44. Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
  45. Guang P, Huang W, Guo L, Yang X, Huang F, Yang M, et al. Blood-based FTIR-ATR spectroscopy coupled with extreme gradient boosting for the diagnosis of type 2 diabetes: A STARD compliant diagnosis research. Medicine. 2020;99(15). pmid:32282717
  46. May RJ, Maier HR, Dandy GC. Data splitting for artificial neural networks using SOM-based stratified sampling. Neural Networks. 2010;23(2):283–94. pmid:19959327
  47. Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020;513:429–441.
  48. Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks. 2018;106:249–259. pmid:30092410
  49. He H, Bai Y, Garcia EA, Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE. 2008 (pp. 1322–1328).
  50. Hajian-Tilaki K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian journal of internal medicine. 2013;4(2):627. pmid:24009950
  51. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017;30.
  52. Shapley LS. A value for n-person games. In: Contributions to the Theory of Games (AM-28). Princeton University Press. 2016:307–318.
  53. Palatnik de Sousa I, Maria Bernardes Rebuzzi Vellasco M, Costa da Silva E. Local interpretable model-agnostic explanations for classification of lymph node metastases. Sensors. 2019;19(13):2969. pmid:31284419
  54. Chowdhury MZ, Leung AA, Walker RL, Sikdar KC, O’Beirne M, Quan H, et al. A comparison of machine learning algorithms and traditional regression-based statistical modeling for predicting hypertension incidence in a Canadian population. Scientific Reports. 2023;13(1):1–3.
  55. Oanh TT, Tung NT. Predicting Hypertension Based on Machine Learning Methods: A Case Study in Northwest Vietnam. Mobile Networks and Applications. 2022;27(5):2013–23.
  56. Chai SS, Goh KL, Cheah WL, Chang YH, Ng GW. Hypertension Prediction in Adolescents Using Anthropometric Measurements: Do Machine Learning Models Perform Equally Well? Applied Sciences. 2022;12(3):1600.
  57. Islam MM, Rahman MJ, Roy DC, Tawabunnahar M, Jahan R, Ahmed NF, et al. Machine learning algorithm for characterizing risks of hypertension, at an early stage in Bangladesh. Diabetes & Metabolic Syndrome: Clinical Research & Reviews. 2021;15(3):877–884. pmid:33892404
  58. Zheng J, Yu Z. A novel machine learning-based systolic blood pressure predicting model. Journal of Nanomaterials. 2021;2021:1–8.
  59. AlKaabi LA, Ahmed LS, Al Attiyah MF, Abdel-Rahman ME. Predicting hypertension using machine learning: Findings from Qatar Biobank Study. Plos One. 2020;15(10):e0240370. pmid:33064740
  60. Legese N, Tadiwos Y. Epidemiology of hypertension in Ethiopia: a systematic review. Integrated blood pressure control. 2020;13:135–43. pmid:33116810
  61. Koya SF, Pilakkadavath Z, Chandran P, Wilson T, Kuriakose S, Akbar SK, et al. Hypertension control rate in India: Systematic review and meta-analysis of population-level non-interventional studies, 2001–2022. The Lancet Regional Health-Southeast Asia. 2023;9:100113. pmid:37383035
  62. Solomon M, Shiferaw BZ, Tarekegn TT, GebreEyesus FA, Mengist ST, Mammo M, et al. Prevalence and Associated Factors of Hypertension Among Adults in Gurage Zone, Southwest Ethiopia, 2022. SAGE Open Nursing. 2023;9:2377960823115347. pmid:36761364
  63. Qin Z, Li C, Qi S, Zhou H, Wu J, Wang W, et al. Association of socioeconomic status with hypertension prevalence and control in Nanjing: a cross-sectional study. BMC Public Health. 2022;22(1):1–9.
  64. Ranzani OT, Kalra A, Di Girolamo C, Curto A, Valerio F, Halonen JI, et al. Urban-rural differences in hypertension prevalence in low-income and middle-income countries, 1990–2020: A systematic review and meta-analysis. Plos Medicine. 2022;19(8):e1004079. pmid:36007101
  65. Hall JE, do Carmo JM, da Silva AA, Wang Z, Hall ME. Obesity, kidney dysfunction and hypertension: mechanistic links. Nature reviews nephrology. 2019;15(6):367–85. pmid:31015582
  66. Imai Y. A personal history of research on hypertension from an encounter with hypertension to the development of hypertension practice based on out-of-clinic blood pressure measurements. Hypertension Research. 2022;45(11):1726–42. pmid:36075990
  67. Mayl JJ, German CA, Bertoni AG, Upadhya B, Bhave PD, Yeboah J, et al. Association of alcohol intake with hypertension in type 2 diabetes mellitus: The ACCORD Trial. Journal of the American Heart Association. 2020;9(18):e017334. pmid:32900264
  68. Nguyen TT, Nguyen MH, Nguyen YH, Nguyen TT, Giap MH, Tran TD, et al. Body mass index, body fat percentage, and visceral fat as mediators in the association between health literacy and hypertension among residents living in rural and suburban areas. Frontiers in Medicine. 2022;9. pmid:36148456
  69. Choi JW, Han E, Kim TH. Risk of Hypertension and Type 2 Diabetes in Relation to Changes in Alcohol Consumption: A Nationwide Cohort Study. International Journal of Environmental Research and Public Health. 2022;19(9):4941. pmid:35564335