Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up

  • Aref Andishgar,

    Roles Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation USERN Office, Fasa University of Medical Sciences, Fasa, Iran

  • Sina Bazmi,

    Roles Conceptualization, Investigation, Validation, Writing – original draft, Writing – review & editing

    Affiliation Student Research Committee, Fasa University of Medical Sciences, Fasa, Iran

  • Reza Tabrizi,

    Roles Conceptualization, Project administration, Resources, Supervision

    kmsrc89@gmail.com

    Affiliation Noncommunicable Diseases Research Center, Fasa University of Medical Science, Fasa, Iran

  • Maziyar Rismani,

    Roles Conceptualization, Data curation, Methodology

    Affiliation Student Research Committee, Fasa University of Medical Sciences, Fasa, Iran

  • Omid Keshavarzian,

    Roles Supervision, Validation

    Affiliation School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran

  • Babak Pezeshki,

    Roles Supervision, Validation

    Affiliation Clinical Research Development Unit, Valiasr Hospital, Fasa University of Medical Sciences, Fasa, Iran

  • Fariba Ahmadizar

    Roles Supervision, Validation

    Affiliation Department of Data Science and Biostatistics, Julius Global Health, University Medical Center Utrecht, Utrecht, The Netherlands

Abstract

Background

Factors contributing to the development of hypertension exhibit significant variations across countries and regions. Our objective was to predict individuals at risk of developing hypertension within a 5-year period in a rural Middle Eastern area.

Methods

This longitudinal study utilized data from the Fasa Adults Cohort Study (FACS). The study initially included 10,118 participants aged 35–70 years in rural districts of Fasa, Iran, with a follow-up of 3,000 randomly sampled participants after 5 years. A total of 160 variables were included in the machine learning (ML) models, and feature scaling and one-hot encoding were employed for data processing. Ten supervised ML algorithms were utilized, namely logistic regression (LR), support vector machine (SVM), random forest (RF), Gaussian naive Bayes (GNB), linear discriminant analysis (LDA), k-nearest neighbors (KNN), gradient boosting machine (GBM), extreme gradient boosting (XGB), cat boost (CAT), and light gradient boosting machine (LGBM). Hyperparameter tuning was performed using various combinations of hyperparameters to identify the optimal model. The Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the training data, and feature selection was conducted using SHapley Additive exPlanations (SHAP).

Results

Out of 2,288 participants who met the criteria, 251 individuals (10.9%) were diagnosed with new hypertension. The LGBM model (determined to be the optimal model) with the top 30 features achieved an AUC of 0.67, an f1-score of 0.23, and an AUC-PR of 0.26. The top three predictors of hypertension were baseline systolic blood pressure (SBP), gender, and waist-to-hip ratio (WHR), with AUCs of 0.66, 0.58, and 0.63, respectively. Hematuria in urine tests and family history of hypertension ranked fourth and fifth.

Conclusion

ML models have the potential to be valuable decision-making tools in evaluating the need for early lifestyle modification or medical intervention in individuals at risk of developing hypertension.

Introduction

Hypertension, a prevalent chronic multifactorial disease, remains a significant challenge in the modern world [1]. In 2021, the World Health Organization estimated that approximately one-third of the global population has hypertension, with two-thirds of those affected living in low- and middle-income countries [2]. Despite advancements in diagnosis and treatment, the prevalence of hypertension in these countries continues to rise [3]. Iran, for instance, reports a 25% prevalence of hypertension [4]. According to World Health Organization reports, hypertension contributes to an annual toll of 9.4 million deaths. In low- and middle-income countries, hypertension was responsible for approximately 8.5 million deaths in 2015, accounting for 88% of global hypertension-related mortality [5]. Specifically, hypertension stands as a primary cause of mortality in the Middle East [6]. Referred to as a silent killer, hypertension often becomes apparent only at a dangerous stage, leading to events such as heart attacks or strokes [7]. Despite being controllable through cost-effective medications and timely interventions [8], many hypertensive patients remain undiagnosed due to insufficient awareness of screening and risk factors [9]. Moreover, care episodes for hypertension in low- to middle-income countries incur costs ranging from $500 to $1500, with monthly treatment expenses averaging around $22 [10], and hypertension commonly develops among middle-aged individuals, impacting productivity and imposing additional burdens on economic systems [11]. Given the high costs and complications associated with this chronic disease, studies have aimed to estimate hypertension risk for more effective prevention and management of complications [1]. Among the most renowned risk assessment tools is the Framingham Risk Score for predicting cardiovascular diseases [12]. However, such models were derived from populations with limited ethnic diversity, necessitating the continued development of risk prediction models tailored to specific populations [13].

Machine learning (ML), an integral component of artificial intelligence (AI), has gained significant traction in recent years due to its superior performance in risk classification tools compared to conventional statistical techniques [14]. This technology enables computers to learn without direct programming and adeptly analyze intricate data interactions [15]. Typically, ML surpasses traditional statistical methods by reducing bias, autonomously handling missing variables with minimal intervention in original data, managing distorted variables, and ensuring balanced data, thereby yielding superior outcomes [15]. Furthermore, ML models have the capacity to represent nonlinear relationships and enhance overall predictive accuracy [16]. Consequently, ML methods serve as a valuable tool for automating disease prediction [15].

While the precise origins of hypertension remain elusive, factors such as genetics, excessive salt intake, reduced physical activity, and obesity are known contributors to its progression [17]. These factors, along with variables such as educational level and income, vary significantly across countries and regions [8], underscoring the need for further research to develop location-specific risk assessment tools. Numerous studies have sought to predict hypertension using AI-based ML models. However, the data from these studies have been primarily cross-sectional, and there is no evidence of successful implementation of these algorithms in clinical settings in rural areas of the Middle East. Additionally, to date, no longitudinal hypertension prediction model has been established for the general population of these regions.

In this investigation, we aim to assess and compare the performance of various ML methods on a longitudinal rural Middle Eastern dataset to identify individuals likely to develop hypertension within 5 years, and hence those with a higher probability of benefiting from early intervention. We compare ten ML techniques to derive the optimal model for predicting hypertension risk. The models are assessed with multiple metrics, using a range of validation techniques and evaluation criteria.

Methods

1. Data source

This is a retrospective longitudinal study based on data from the Fasa Adults Cohort Study (FACS). The FACS includes 10,118 participants aged 35–70 years from the Sheshdeh and Qarabolagh districts of Fasa. FACS was established to assess the risk factors that predispose Fasa’s rural residents to non-communicable diseases (NCDs), including cardiovascular diseases. Enrollment began in October 2014 and ended in September 2016 in an area with 84% rural residents. Since September 2021, when the fourth follow-up was completed, the cohort has entered a re-evaluation phase covering the same variables as the registration phase, with 3,000 of the first-phase participants scheduled to take part. Random sampling was used to select participants for this phase. The re-evaluation phase repeats all of the steps, clinical examinations, biological sampling, and questionnaires of the registration phase [18].

2. Study population

Our study sample was drawn from the FACS using a census approach. Four inclusion criteria were applied: (1) 5 years of follow-up; (2) availability of 5-year data; (3) no hypertension at the first phase (according to the diagnostic criteria described in the Final outcome section); and (4) being alive at the end of follow-up. In total, 2,288 participants were included. The study steps are summarized in the flowchart in Fig 1.

3. Data preparation and preprocessing

Most variables had no missing data, and the remainder had less than 10% missing. Missing values were replaced by mean imputation for continuous variables and by median imputation for categorical variables. In total, 160 variables were included in the ML models; the full list is given in S1 Table. Before the ML models were fitted, the data were preprocessed with two methods: feature scaling for continuous variables and one-hot encoding for variables with more than two categories. Standard scaling was used, which rescales each continuous variable to zero mean and unit variance. One-hot encoding produces dummy variables that take only the value 0 or 1.
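The preprocessing steps described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the authors' pipeline (which used scikit-learn): the helper names `impute_mean`, `standardize`, and `one_hot` and the toy columns are hypothetical, and "standard scaling" is assumed to mean z-score standardization.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Map a categorical column to one 0/1 dummy column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical toy columns for illustration only (not FACS data).
sbp = [110.0, None, 130.0, 120.0]
education = ["primary", "none", "secondary", "primary"]

sbp_scaled = standardize(impute_mean(sbp))
education_dummies = one_hot(education)
```

After standardization the column has mean 0 and standard deviation 1, but individual values are not confined to a fixed interval; in this toy column the smallest and largest scaled values lie beyond -1 and +1.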

4. Final outcome

In the FACS, a participant was classified as hypertensive if they had systolic blood pressure (SBP) ≥140 mmHg or diastolic blood pressure (DBP) ≥90 mmHg on at least two measurements taken 15 minutes apart, or if they were taking anti-hypertensive medication because of a previous diagnosis [17]. In the current study, hypertension status after 5 years of follow-up was analyzed as a binary outcome (hypertensive / non-hypertensive).

5. Splitting data

To guard against overfitting, the dataset was divided into training (80%) and test (20%) sets. The training set was used for model fitting, hyper-parameter tuning, and 5-fold cross-validation. The test set was kept separate from training and was used only for the final evaluation and internal validation of the ML models. Participants in both sets were followed for 5 years until the final outcome was ascertained (Fig 2).
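A minimal sketch of an 80/20 split is given below. It assumes a stratified split (each class is split separately so that both parts keep roughly the same share of cases); the paper does not state whether stratification was used, and the function name `stratified_split` and the seed are hypothetical. The labels reuse only the class counts reported in the paper (251 cases among 2,288 participants).

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the reported class counts; no participant data.
labels = [1] * 251 + [0] * (2288 - 251)
train_idx, test_idx = stratified_split(labels)

n_test_cases = sum(labels[i] for i in test_idx)
n_train_cases = sum(labels[i] for i in train_idx)
```

Because the test indices are never used during tuning or resampling, any later step that modifies the data (such as over-sampling) should be applied to `train_idx` rows only.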

Fig 2. Procedure of splitting dataset into training (80%) and test (20%) parts.

https://doi.org/10.1371/journal.pone.0300201.g002

6. Machine learning algorithms

In this study, ten supervised ML algorithms were used: logistic regression (LR), support vector machine (SVM), random forest (RF), Gaussian naive Bayes (GNB), linear discriminant analysis (LDA), k-nearest neighbors (KNN), gradient boosting machine (GBM), extreme gradient boosting (XGB), cat boost (CAT) and light gradient boosting machine (LGBM). We used a variety of ML methods to ensure that the dataset was thoroughly explored. Each algorithm has distinct advantages and disadvantages, and our objective was to evaluate the performance of each one independently. LR was implemented for its simplicity. SVM was selected for its capability in handling high-dimensional data and finding complex relationships. RF was leveraged as an ensemble learning model. GNB provided a computationally efficient approach. LDA offered interpretability. KNN was implemented to detect local patterns. GBM refines model performance sequentially. XGB can achieve high accuracy on large datasets. CAT optimizes the handling of categorical features, and LGBM manages larger datasets efficiently with fast training. This multifaceted strategy sought to capitalize on the distinct advantages of each model. By not relying on a single model, we aimed to reduce model-specific bias and obtain a comprehensive understanding of the data.

All ML algorithms were implemented in Python (version 3.9.12) using Anaconda (version 4.12.0) and Visual Studio Code (version 1.76.2). The algorithms were run using the scikit-learn module (version 1.1.3) [19].

7. Model development

First, 5-fold cross-validation and hyper-parameter tuning were applied to the training data, using all features, to find the optimal hyper-parameters. The 5-fold approach splits the training data into five equal parts; in each round one part serves as validation data while the model is trained on the other four, the validation accuracy is recorded, and the five accuracies are then averaged. Because each model’s accuracy depends on its hyper-parameters, various combinations were evaluated with a grid search to find the best one [20] (S2 Table).
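The combination of 5-fold cross-validation and grid search can be sketched as follows. This is an illustration only: the one-dimensional k-nearest-neighbors classifier, the grid `[1, 5, 15]`, the synthetic data, and all function names are assumptions chosen to keep the example self-contained, not the models or grids listed in S2 Table.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points (1-D)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in order[:n_neighbors])
    return 1 if 2 * votes > n_neighbors else 0


def cv_accuracy(xs, ys, n_neighbors, folds):
    """Mean validation accuracy over the folds for one hyper-parameter value."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        hits = sum(knn_predict(tx, ty, xs[i], n_neighbors) == ys[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


def grid_search(xs, ys, grid, folds):
    """Score every candidate on the same folds and return the best one."""
    results = {k: cv_accuracy(xs, ys, k, folds) for k in grid}
    return max(results, key=results.get), results


# Synthetic two-class data: class 1 is shifted upward by two units.
rng = random.Random(1)
ys = [i % 2 for i in range(100)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

folds = k_fold_indices(len(xs), k=5)
best_k, cv_results = grid_search(xs, ys, grid=[1, 5, 15], folds=folds)
```

Scoring every candidate on the same folds keeps the comparison fair; in scikit-learn the same pattern is provided by `GridSearchCV`.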

Second, over-sampling was employed to balance the outcome classes in the training data, using the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE oversamples the minority class by creating synthetic instances: it selects a minority-class sample and generates new samples along the line segments joining it to some or all of its k nearest minority-class neighbors [21]. Hypertensive participants were the minority class, and SMOTE generated 1,428 synthetic instances to equalize the numbers of hypertensive and non-hypertensive individuals.
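The interpolation step at the core of SMOTE can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: the function name `smote`, the toy points, and the random choice of base sample are assumptions. To equalize two classes, the number of synthetic samples requested would be the majority count minus the minority count.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points in a two-feature space.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1), (2.2, 1.9)]
new_points = smote(minority, n_new=10, k=3)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies within the range spanned by the observed minority data. Applying the procedure to the training set only keeps the test set free of synthetic records.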

Finally, ML models were trained using balanced training data and the best hyper-parameters.

8. Model evaluation

All trained ML models were applied to the test data. For the final evaluation and comparison of the ML models, three metrics were used: the area under the receiver operating characteristic curve (AUC), the f1-score, and the area under the precision-recall curve (AUC-PR) (Fig 3A, S3 Table). For more detail, S3 Table also reports accuracy, sensitivity, and specificity. The evaluation metrics are defined by the following equations:

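$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN}$$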

TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Finally, the LGBM model was chosen as the best ML model based on AUC, f1-score, and AUC-PR.
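As a worked illustration, the sketch below computes these metrics from confusion-matrix counts using their standard definitions. The counts are hypothetical and are not the study's results; they were chosen only to show how, with an imbalanced outcome, accuracy can look acceptable while the f1-score stays low.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts (not rates)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall for the positive class
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for a test set of 450 people with 50 true cases.
m = classification_metrics(tp=20, tn=300, fp=100, fn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is about 0.71, yet the f1-score is only about 0.24, because most positive predictions are false positives. This is one reason f1-score and AUC-PR are more informative than accuracy when the positive class is rare.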

Fig 3. Comparative analysis of model performance indicators and diagnostic visualizations for full-featured and simplified models across ten algorithms.

(A), Comparison of the AUC, f1-score and AUC-PR among the models with ten algorithms using all features. (B), Comparison of the AUC, f1-score and AUC-PR among simplified models and the model with all features. (C-D), ROC curve and confusion matrix of the LGBM model with top-30 features. LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat boost, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

https://doi.org/10.1371/journal.pone.0300201.g003

9. Feature selection

Feature selection can be used to reduce the data efficiently, which helps to identify more accurate ML models and reduces computational cost. There are three types of feature selection: wrapper, filter, and embedded methods [22]. SHapley Additive exPlanations (SHAP) was used as a wrapper method. SHAP is a unified approach to explaining the output of any ML model. It connects game theory with local explanations, unifying several earlier approaches, and represents the only consistent and locally accurate additive feature attribution method based on expectations. It has been adopted as a feature selection method in recent years [23, 24].

SHAP was then combined with the LGBM model to determine the optimal number of features among 10, 15, 20, 25, 30, and 35. The best performance was achieved with a subset of 30 features (Fig 3B, S4 Table). Fig 4B shows the top 30 features and their importance in predicting hypertension.
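The idea of ranking features by Shapley-based importance and keeping the top k can be sketched as follows. This is a brute-force illustration under stated assumptions, not the SHAP library or the study's LGBM model: the linear `toy_model`, the feature names, the data rows, and the use of the column means as the background (reference) point are all hypothetical. Exact enumeration is exponential in the number of features, so it is practical only for tiny examples.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values for one row x; features outside a coalition
    are replaced by their background (reference) value."""
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else background[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def rank_features(model, rows, background, names):
    """Order features by mean absolute Shapley value across rows."""
    importance = [0.0] * len(names)
    for x in rows:
        for j, p in enumerate(shapley_values(model, x, background)):
            importance[j] += abs(p) / len(rows)
    return sorted(zip(names, importance), key=lambda t: t[1], reverse=True)


def toy_model(z):
    """Hypothetical risk score; the third feature has no effect."""
    sbp, whr, noise = z
    return 0.05 * sbp + 2.0 * whr + 0.0 * noise


names = ["sbp", "whr", "noise"]
rows = [[130.0, 0.95, 3.0], [110.0, 0.85, 7.0], [125.0, 1.00, 1.0]]
background = [sum(col) / len(rows) for col in zip(*rows)]

ranking = rank_features(toy_model, rows, background, names)
top_k = [name for name, _ in ranking[:2]]  # keep the two most important
```

In a selection loop like the one described above, the model would be refitted on each candidate subset size and the size with the best validation performance retained.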

Fig 4. Interpretation of the LGBM model with top-30 features and its performance.

(A), SHAP beeswarm plot for top features. The plot sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature value (red high, blue low). This reveals, for example, that a high systolic blood pressure raises the predicted risk of hypertension. (B), The SHAP values of top-30 variables. The input features on the y-axis are arranged in descending importance, and the values on the x-axis represent the mean influence of each feature on the size of the model output based on SHAP analysis. (C), Receiver operating characteristic (ROC) curves of top-3 features. LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, SHAP; Shapley Additive exPlanations, SBP; Systolic Blood Pressure, WHR; Waist-to-Hip Ratio.

https://doi.org/10.1371/journal.pone.0300201.g004

Table 1 presents a descriptive analysis of the selected characteristics. The independent-samples t-test, chi-square test, and Mann-Whitney test were used. Statistical significance was defined as a P-value less than 0.05. The data were analyzed using SPSS version 18 (IBM Corp., Armonk, N.Y., USA).

Table 1. General and top-30 important characteristics of the participants according to hypertension after 5 years of follow-up (Total number of participants = 2288).

https://doi.org/10.1371/journal.pone.0300201.t001

10. Model interpretation

The ROC curve and confusion matrix of the LGBM model with the top-30 features are displayed in Fig 3C and 3D. SHAP analysis was used to interpret the LGBM model. SHAP values for the top features were determined in detail (Fig 4B). The beeswarm plot gives an information-dense summary of how the top features in the dataset affect the model’s output (Fig 4A).

11. Ethics approval and consent to participate

Our study protocol was approved by the Fasa University of Medical Sciences Research Council and Ethics Committee (approval code: IR.FUMS.REC.1402.133) and adhered to Helsinki guidelines. Furthermore, all subjects provided written informed consent before participating. The authors have no access to information that could identify individual participants during or after data collection.

Results

1. Compare performance of machine learning algorithms

Fig 3A and S3 Table display the performance of all ML models based on various metrics. To select the best ML model, the AUC, f1-score and AUC-PR metrics were considered. The highest AUC was achieved by RF (0.65). GBM and LGBM were ranked second and third, respectively, with AUC = 0.63. Among the ML models, CAT obtained the highest f1-score. GNB and SVM came in second and third, with f1-scores of 0.21 and 0.20, respectively. LGBM had the highest AUC-PR (0.20), while RF was second (AUC-PR = 0.19). Ultimately, the optimal model was determined to be LGBM. Finally, the 30 most important features were selected as the best subset for predicting hypertension with the LGBM model. The LGBM model with the top-30 features had AUC = 0.67, f1-score = 0.23 and AUC-PR = 0.26.

2. Descriptive analytics of top-30 variables of participants

In this study, 251 participants (10.9%) were diagnosed with hypertension (Table 1). Women were more likely than men to develop hypertension (p-value<0.05). In hypertensive individuals, SBP, DBP, waist-to-hip ratio, and waist-to-height ratio were higher (p-value<0.05). Physical activity, blood in urine test, iron intake, sodium intake, cholesterol intake, grain products consumption, meat consumption, history of oral aphthous, history of chronic headaches, past medical history of hospitalization, family history of hypertension, family history of epilepsy, family history of pelvic or femoral fracture, family history of stroke, and having a job were all higher in hypertensive patients (p-value<0.05). There were no statistically significant differences in alkaline phosphatase level, ascorbic acid in urine test, vegetable consumption, dairy products consumption, history of joint pain, history of heartburn, history of back stiffness and history of urinary problems (p-value>0.05).

3. Feature importance

Fig 4B shows the top-30 features in order of importance. The top three predictors of hypertension were SBP, gender, and waist-to-hip ratio, with AUCs of 0.66, 0.58, and 0.63, respectively. Blood in the urine test and family history of hypertension ranked fourth and fifth, respectively. Fig 4A is a beeswarm plot in which features are ordered by their SHAP values. In this plot, binary variables take the values zero and one, where one indicates a positive value. The plot indicates that higher SBP and waist-to-hip ratio increase the risk of future hypertension, as do female gender and the presence of blood in the urine test.
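A single-feature AUC such as those quoted above can be read as the probability that a randomly chosen participant who developed hypertension has a higher value of the feature than a randomly chosen participant who did not, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc` and the eight toy records are hypothetical and are not study data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical baseline SBP (mmHg) and 5-year outcome (1 = hypertension).
sbp = [118, 132, 125, 110, 128, 121, 135, 115]
hypertension = [0, 1, 0, 0, 1, 1, 1, 0]

sbp_auc = auc(sbp, hypertension)
print(sbp_auc)  # 0.9375 for these toy values
```

An AUC of 0.5 corresponds to a feature with no discriminative ability, which gives some context for values such as 0.58 for a single binary predictor.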

Discussion

This longitudinal study, based on the FACS with 10,118 participants aged 35–70, aimed to predict 5-year hypertension risk using ML. After selecting 2288 participants meeting specific criteria, 160 variables were processed and prepared for analysis. Various ML algorithms were employed, and Light Gradient Boosting Machine (LGBM) emerged as the optimal model. The study eventually introduced the top 30 features, highlighting the top 5 factors of SBP, gender, waist to hip ratio, hematuria, and family history of hypertension significantly associated with hypertension development. The model achieved an AUC of 0.67.

So far, three relatively robust studies have used ML to predict hypertension in the Middle East. AlKaabi et al. [1] implemented three supervised ML algorithms in a cross-sectional study of 987 individuals aged over 18 in Qatar, in which the random forest model performed best, with an AUC of 0.869. Compared with that study, ours had a much larger sample size and a stronger methodology, used more models and variables, and incorporated feature selection. Additionally, that study’s population was not representative of the entire region, as it included only individuals residing in Qatar within a specific timeframe. Sakr et al. [6] conducted a longitudinal study of 23,095 patients with suspected cardiovascular disease who were referred for exercise testing and followed for 10 years, implementing six ML models. The RTF model achieved the best performance, with an AUC of 0.93. That study was longitudinal, had a larger sample size than ours, and attained a higher AUC. However, it did not cover the general population, as it evaluated only patients referred for exercise testing, and it focused mainly on factors related to cardiovascular diseases and exercise test results. Furthermore, Namatollahi et al. [25] designed a predictive model for hypertension based on body composition factors in a cross-sectional study using data from the same adult cohort in Fasa. That study was cross-sectional and focused exclusively on factors related to body structure. In contrast, using the follow-up phase of this cohort, we conducted the current longitudinal study and incorporated a broader range of factors. Most studies of ML models for predicting hypertension [8, 26–28], including those by AlKaabi and Namatollahi [1, 25], were based on cross-sectional data. First, cross-sectional studies cannot determine when hypertension will develop in the future. Second, cross-sectional data often include hypertension-related complications in patients’ records, which give the ML models an unfair advantage and artificially inflate their accuracy scores. This issue, known as data leakage, undermines the predictive reliability of the results and limits their generalizability to real-world scenarios. In contrast, longitudinal data such as ours begin with participants who are initially free of hypertension and its complications. Consequently, results derived from longitudinal data have greater validity, and even lower scores are more informative than the misleadingly elevated scores from cross-sectional models.

In our model, an interesting predictive factor whose relation to hypertension has received little attention in the literature was positive hematuria. Previously, only three studies [29–31] directly examined this connection, all exclusively in hemophiliac patients, a population with higher occurrences of hematuria and hypertension than the general population, and all with small sample sizes. Holme et al. [30], in their cross-sectional study, did not find a significant correlation between the presence of hematuria and hypertension in these patients. Sun et al. [31], in their prospective study focusing solely on men, concluded that despite the high prevalence of hematuria and hypertension in hemophiliac patients, these two factors are not related, and hematuria is unlikely to lead to hypertension in the long term. Renal insufficiency was also rare in these patients during follow-up, calling into question renal damage as an intermediary for this relationship. However, that study was conducted solely on hemophiliac male patients and had a small sample size. Nonetheless, Qvistad et al. [29], in a recent study, found that the connection between hematuria and hypertension becomes significant in patients with a family history of hypertension. Our study results were adjusted for a family history of hypertension, yet hematuria was selected as one of the top 5 predictive factors for hypertension in a 5-year model. Hematuria could be a sign of underlying kidney damage or dysfunction which, although mild and overlooked, could in the long term alter blood pressure regulation by affecting sodium balance, increasing fluid retention, and disrupting hormonal equilibrium, such as the renin-angiotensin-aldosterone system, ultimately leading to hypertension [32–34]. Additionally, factors causing hematuria might trigger an inflammatory response and endothelial dysfunction. If chronic, this inflammation and dysfunction could increase vascular resistance, subsequently raising blood pressure and leading to hypertension [35]. Hypertension and hematuria do share common risk factors, such as obesity and smoking, but these factors were adjusted for in the models. Longitudinal studies based on this hypothesis are needed to examine and confirm the relationship between hematuria and the likelihood of developing hypertension over time.

Anthropometric indices have repeatedly been identified as risk factors for cardiovascular diseases [36], and various studies have reported a strong correlation between WHR (waist-to-hip ratio) and hypertension. However, WHR as a specific predictor of incident hypertension has been less discussed. Initially, a cross-sectional study by Feldstein et al. [37] demonstrated that WHR might predict the risk of hypertension better than other anthropometric indices. A meta-analysis of cross-sectional studies indicated that WHR is a better biomarker for cardiovascular diseases and hypertension risk [38]. Choi et al., in a large longitudinal study, concluded that WHR has a significant and strong relationship with the occurrence of hypertension over time [39]. The use of WHR, compared to popular anthropometric indices like BMI and WC, could be more useful as it is easier to measure, does not have a linear relationship with other indices, and has shown consistency across different age and ethnic groups [40]. In our study, WHR was the third top predictive factor for hypertension in the next 5 years, consistent with the literature cited above and similar studies.

A family history of hypertension, as with other diseases, is associated with a higher chance of developing the condition. Wang and colleagues’ extensive 54-year longitudinal cohort study demonstrated that family history of hypertension, from either the father or the mother, has an independent and strong correlation with the occurrence of hypertension over time [41]. Similarly, a recent longitudinal study by Kunnas et al. [42] with a 15-year follow-up and a more precise design showed similar results. In our study, a family history of hypertension was the fifth top predictive factor for hypertension in the next 5 years, consistent with the literature cited above and similar studies.

The incidence and prevalence of hypertension differ between men and women [43]. Hypertension prevalence is generally higher in men than in women, but our model identified female gender as the second top predictive factor for hypertension in the next 5 years. Our study cohort included individuals aged 35 to 70 years. As age increases, especially beyond the sixth decade of life, the incidence of hypertension in women rises steeply [44]. Moreover, at older ages, hypertension risk factors specific to women, such as pregnancy-induced hypertension and menopause, become evident and prevalent, increasing the chances of developing hypertension at these ages [45]. Additionally, socioeconomic disadvantage is more strongly associated with hypertension in women [46], which is plausible given that our study population lives in rural areas and is predominantly of lower socioeconomic status. Considering that hypertension in women is a stronger risk factor for cardiovascular diseases [45], this result seems crucial.

Two factors, WHR and SBP, among the top predictive factors in our study, were in line with the top predictive factors in a similar and robust study conducted in Canada [47]. This conformity could indicate a percentage of similarity among different populations in predicting future hypertension.

Based on our ML model, individuals at high risk for developing hypertension can be recommended to modify their lifestyles and behaviors (such as physical activity, dietary changes, smoking cessation, and alcohol consumption) to avoid hypertension and prevent all associated dangerous complications and costs [47]. It is further recommended to employ new ML models in various geographical regions where there is a wide diversity in hypertension risk factors, as each model may reveal new predictive factors for hypertension [28].

Our study had several strengths that set it apart from previous research. Firstly, it was a longitudinal study with a 5-year follow-up period, providing valuable insights into the long-term development of hypertension. Additionally, we employed feature selection approaches and utilized ten supervised ML algorithms, enhancing the robustness of our analysis. Furthermore, we conducted hyperparameter tuning to optimize the performance of our models. Moreover, our study has a significantly larger number and scope of variables compared to most studies conducted in this field, including the Canadian study [47].

Despite its methodological strengths, our study faced severe data limitations. Of the 3,000 individuals followed, only 251 developed hypertension. Because of this small number of events, our model’s final F1 score was low, and the AUC did not reach a high value. Even so, the model achieved an AUC of approximately 0.67, which we consider to reflect the strength of our methodology. Once data collection for the Fasa cohort follow-up phase is completed in the coming years, we will be able to strengthen our models. Furthermore, we were unable to externally validate our models because of limited access to complete datasets from other cohorts.
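To put the reported metrics in context, the following minimal Python sketch works through the arithmetic implied by the event count above. It is illustrative only and not part of the study's analysis: it assumes that the 3,000 followed participants are the relevant denominator and that the evaluation data contained the same proportion of events, neither of which is confirmed by the text. The helper `f1_score` and all variable names are hypothetical.

```python
# Illustrative arithmetic only. The two counts and the AUC-PR come from the
# article text; treating 3,000 as the denominator is an assumption.
followed = 3000          # participants followed up (from the text)
incident = 251           # participants who developed hypertension (from the text)
reported_auc_pr = 0.26   # AUC-PR reported in the Conclusion

# Share of positives among followed participants (assumed class prevalence).
prevalence = incident / followed


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A trivial "flag everyone" rule has recall 1 and precision equal to prevalence.
f1_flag_all = f1_score(prevalence, 1.0)

# A no-skill classifier has an expected AUC-PR close to the prevalence,
# so the ratio below shows how far the reported value sits above that baseline.
lift_over_baseline = reported_auc_pr / prevalence

print(f"prevalence        = {prevalence:.3f}")
print(f"F1 (flag all)     = {f1_flag_all:.3f}")
print(f"AUC-PR / baseline = {lift_over_baseline:.1f}x")
```

Under these assumptions, about 8.4% of followed participants developed hypertension, a rule that flags everyone would reach an F1 score of about 0.15, and the reported AUC-PR of 0.26 is roughly three times the no-skill baseline.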

Conclusion

ML models demonstrated effective performance in predicting hypertension and its related factors in our rural population. LGBM emerged as the optimal model. It identified the top 30 features, of which the top 5 were higher baseline SBP, female gender, higher WHR, positive hematuria, and family history of hypertension, all associated with future development of hypertension. The model achieved an AUC of 0.67, an F1 score of 0.23, and an AUC-PR of 0.26. Individuals identified as high risk can be advised to modify their lifestyles and behaviors to prevent hypertension and its associated complications and costs.
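As an aid to interpreting the reported AUC, the sketch below computes the ROC AUC as the proportion of case/non-case pairs in which the case receives the higher predicted risk, with ties counted as one half. The labels and scores are made-up toy values, not study data, and `roc_auc` is a hypothetical helper rather than the function used in the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above non-case
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 2 cases and 4 non-cases.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.3, 0.8, 0.2, 0.1, 0.4]

toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 6 of the 8 case/non-case pairs are ranked correctly: 0.75
```

Read this way, an AUC of 0.67 means that a randomly chosen participant who later developed hypertension would be ranked above a randomly chosen participant who did not in about two out of three such pairs.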

Supporting information

S1 Table. All characteristics and clinical features of participants.

https://doi.org/10.1371/journal.pone.0300201.s001

(DOCX)

S2 Table. Finding the appropriate hyper-parameter values for each algorithm after hyper-parameter tuning.

#Abbreviations, LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat Boost, LGBM; Light Gradient Boosting Machine.

https://doi.org/10.1371/journal.pone.0300201.s002

(DOCX)

S3 Table. Performance of the ten machine learning algorithms using all features.

#Abbreviations, LR; Logistic Regression, SVM; Support Vector Machine, RF; Random Forest, GNB; Gaussian Naive Bayes, LDA; Linear Discriminant Analysis, KNN; K-Nearest Neighbors, GBM; Gradient Boosting Machine, XGB; Extreme Gradient Boosting, CAT; Cat Boost, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver Operating Characteristic, AUC-PR; Area Under the Precision-Recall Curve.

https://doi.org/10.1371/journal.pone.0300201.s003

(DOCX)

S4 Table. Performance of the LGBM model with different numbers of features.

#Abbreviations, LGBM; Light Gradient Boosting Machine, AUC; Area Under the ROC Curve, ROC; Receiver operating characteristic, AUC-PR; Area Under the Precision-Recall curve.

https://doi.org/10.1371/journal.pone.0300201.s004

(DOCX)

Acknowledgments

This project has been approved by the National Institutes for Medical Research Development (NIMAD), Tehran, Iran under code "4021292".

References

  1. AlKaabi LA, Ahmed LS, Al Attiyah MF, Abdel-Rahman ME. Predicting hypertension using machine learning: Findings from Qatar Biobank Study. PLoS One. 2020;15(10):e0240370. Epub 20201016. pmid:33064740; PubMed Central PMCID: PMC7567367.
  2. Mamdouh H, Alnakhi WK, Hussain HY, Ibrahim GM, Hussein A, Mahmoud I, et al. Prevalence and associated risk factors of hypertension and pre-hypertension among the adult population: findings from the Dubai Household Survey, 2019. BMC Cardiovascular Disorders. 2022;22(1):18. pmid:35090385
  3. Tang C, Jiang H, Zhao B, Lin Y, Lin S, Chen T, et al. The association between bilirubin and hypertension among a Chinese ageing cohort: a prospective follow-up study. Journal of Translational Medicine. 2022;20(1):108. pmid:35246141
  4. Oori MJ, Mohammadi F, Norozi K, Fallahi-Khoshknab M, Ebadi A, Gheshlagh RG. Prevalence of HTN in Iran: meta-analysis of published studies in 2004–2018. Current hypertension reviews. 2019;15(2):113–22. pmid:30657043
  5. Berek PA, Irawati D, Hamid AYS. Hypertension: A global health crisis. Ann Clin Hypertens. 2021;5:8–11.
  6. Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, Keteyian S, et al. Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project. PLoS One. 2018;13(4):e0195344. Epub 20180418. pmid:29668729; PubMed Central PMCID: PMC5905952.
  7. Rapport R. Hypertension Silent killer. New Jersey medicine: the journal of the Medical Society of New Jersey. 1999;96(3):41–3.
  8. Islam SMS, Talukder A, Awal MA, Siddiqui MMU, Ahamad MM, Ahammed B, et al. Machine Learning Approaches for Predicting Hypertension and Its Associated Factors Using Population-Level Data From Three South Asian Countries. Front Cardiovasc Med. 2022;9:839379. Epub 20220331. pmid:35433854; PubMed Central PMCID: PMC9008259.
  9. Mills KT, Stefanescu A, He J. The global epidemiology of hypertension. Nature Reviews Nephrology. 2020;16(4):223–37. pmid:32024986
  10. Gheorghe A, Griffiths U, Murphy A, Legido-Quigley H, Lamptey P, Perel P. The economic burden of cardiovascular disease and hypertension in low-and middle-income countries: a systematic review. BMC public health. 2018;18(1):1–11. pmid:30081871
  11. Wang G, Grosse SD, Schooley MW. Conducting research on the economics of hypertension to improve cardiovascular health. American journal of preventive medicine. 2017;53(6):S115–S7. pmid:29153111
  12. D’Agostino RB Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, Kannel WB. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. pmid:18212285
  13. Goff DC Jr, Lloyd-Jones DM, Bennett G, Coady S, D’Agostino RB, Gibbons R, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation. 2014;129(25_suppl_2):S49–S73. pmid:24222018
  14. Beunza J-J, Puertas E, García-Ovejero E, Villalba G, Condes E, Koleva G, et al. Comparison of machine learning algorithms for clinical event prediction (risk of coronary heart disease). Journal of biomedical informatics. 2019;97:103257. pmid:31374261
  15. Bi Q, Goodman KE, Kaminsky J, Lessler J. What is machine learning? A primer for the epidemiologist. American journal of epidemiology. 2019;188(12):2222–39. pmid:31509183
  16. Wang P, Li Y, Reddy CK. Machine learning for survival analysis: A survey. ACM Computing Surveys (CSUR). 2019;51(6):1–36.
  17. Bolívar JJ. Essential hypertension: an approach to its etiology and neurogenic pathophysiology. International journal of hypertension. 2013;2013:547809. Epub 2014/01/05. pmid:24386559; PubMed Central PMCID: PMC3872229.
  18. Homayounfar R, Farjam M, Bahramali E, Sharafi M, Poustchi H, Malekzadeh R, et al. Cohort Profile: The Fasa Adults Cohort Study (FACS): a prospective study of non-communicable diseases risks. International journal of epidemiology. 2023. Epub 2023/01/03. pmid:36592077.
  19. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research. 2011;12:2825–30.
  20. Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A. Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecological Modelling. 2019;406:109–20.
  21. Nakamura M, Kajiwara Y, Otsuka A, Kimura H. LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data. BioData mining. 2013;6(1):16. Epub 2013/10/04. pmid:24088532; PubMed Central PMCID: PMC4016036.
  22. Jović A, Brkić K, Bogunović N, editors. A review of feature selection methods with applications. 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO); 2015 25–29 May 2015.
  23. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017;30.
  24. Effrosynidis D, Arampatzis A. An evaluation of feature selection methods for environmental data. Ecological Informatics. 2021;61:101224.
  25. Nematollahi MA, Jahangiri S, Asadollahi A, Salimi M, Dehghan A, Mashayekh M, et al. Body composition predicts hypertension using machine learning methods: a cohort study. Sci Rep. 2023;13(1):6885. Epub 20230427. pmid:37105977; PubMed Central PMCID: PMC10140285.
  26. Islam MM, Rahman MJ, Chandra Roy D, Tawabunnahar M, Jahan R, Ahmed N, Maniruzzaman M. Machine learning algorithm for characterizing risks of hypertension, at an early stage in Bangladesh. Diabetes Metab Syndr. 2021;15(3):877–84. Epub 20210420. pmid:33892404.
  27. Islam MM, Alam MJ, Maniruzzaman M, Ahmed N, Ali MS, Rahman MJ, Roy DC. Predicting the risk of hypertension using machine learning algorithms: A cross sectional study in Ethiopia. PLoS One. 2023;18(8):e0289613. Epub 20230824. pmid:37616271; PubMed Central PMCID: PMC10449142.
  28. Guo S, Ge JX, Liu SN, Zhou JY, Li C, Chen HJ, et al. Development of a convenient and effective hypertension risk prediction model and exploration of the relationship between Serum Ferritin and Hypertension Risk: a study based on NHANES 2017-March 2020. Front Cardiovasc Med. 2023;10:1224795. Epub 20230906. pmid:37736023; PubMed Central PMCID: PMC10510409.
  29. Qvigstad C, Sørensen LQ, Tait RC, de Moerloose P, Holme PA. Macroscopic hematuria as a risk factor for hypertension in ageing people with hemophilia and a family history of hypertension. Medicine (Baltimore). 2020;99(9):e19339. pmid:32118768; PubMed Central PMCID: PMC7478422.
  30. Holme PA, Combescure C, Tait RC, Berntorp E, Rauchensteiner S, de Moerloose P. Hypertension, haematuria and renal functioning in haemophilia – a cross-sectional study in Europe. Haemophilia. 2016;22(2):248–55. Epub 20151216. pmid:27880029.
  31. Sun HL, Yang M, Sait AS, von Drygalski A, Jackson S. Haematuria is not a risk factor of hypertension or renal impairment in patients with haemophilia. Haemophilia. 2016;22(4):549–55. Epub 20160331. pmid:27030081.
  32. Orlandi PF, Fujii N, Roy J, Chen H-Y, Lee Hamm L, Sondheimer JH, et al. Hematuria as a risk factor for progression of chronic kidney disease and death: findings from the Chronic Renal Insufficiency Cohort (CRIC) Study. BMC nephrology. 2018;19:1–11.
  33. Remuzzi G, Perico N, Macia M, Ruggenenti P. The role of renin-angiotensin-aldosterone system in the progression of chronic kidney disease. Kidney Int Suppl. 2005;(99):S57–65. pmid:16336578.
  34. Te Riet L, van Esch JH, Roks AJ, van den Meiracker AH, Danser AH. Hypertension: renin-angiotensin-aldosterone system alterations. Circ Res. 2015;116(6):960–75. pmid:25767283.
  35. Patrick DM, Van Beusecum JP, Kirabo A. The role of inflammation in hypertension: novel concepts. Curr Opin Physiol. 2021;19:92–8. Epub 20201013. pmid:33073072; PubMed Central PMCID: PMC7552986.
  36. Fuchs FD, Gus M, Moreira LB, Moraes RS, Wiehe M, Pereira GM, Fuchs SC. Anthropometric indices and the incidence of hypertension: a comparative analysis. Obes Res. 2005;13(9):1515–7. pmid:16222051.
  37. Feldstein CA, Akopian M, Olivieri AO, Kramer AP, Nasi M, Garrido D. A comparison of body mass index and waist-to-hip ratio as indicators of hypertension risk in an urban Argentine population: a hospital-based study. Nutr Metab Cardiovasc Dis. 2005;15(4):310–5. pmid:16054556.
  38. Browning LM, Hsieh SD, Ashwell M. A systematic review of waist-to-height ratio as a screening tool for the prediction of cardiovascular disease and diabetes: 0·5 could be a suitable global boundary value. Nutr Res Rev. 2010;23(2):247–69. Epub 20100907. pmid:20819243.
  39. Choi JR, Koh SB, Choi E. Waist-to-height ratio index for predicting incidences of hypertension: the ARIRANG study. BMC Public Health. 2018;18(1):767. Epub 20180619. pmid:29921256; PubMed Central PMCID: PMC6008942.
  40. Ashwell M, Gunn P, Gibson S. Waist-to-height ratio is a better screening tool than waist circumference and BMI for adult cardiometabolic risk factors: systematic review and meta-analysis. Obes Rev. 2012;13(3):275–86. Epub 20111123. pmid:22106927.
  41. Wang NY, Young JH, Meoni LA, Ford DE, Erlinger TP, Klag MJ. Blood pressure change and risk of hypertension associated with parental hypertension: the Johns Hopkins Precursors Study. Arch Intern Med. 2008;168(6):643–8. pmid:18362257.
  42. Kunnas T, Nikkari ST. Family history of hypertension enhances age-dependent rise in blood pressure, a 15-year follow-up, the Tampere adult population cardiovascular risk study. Medicine (Baltimore). 2023;102(39):e35366. pmid:37773803; PubMed Central PMCID: PMC10545328.
  43. Connelly PJ, Casey H, Montezano AC, Touyz RM, Delles C. Sex steroids receptors, hypertension, and vascular ageing. Journal of human hypertension. 2022;36(2):120–5. pmid:34230581
  44. Connelly PJ, Azizi Z, Alipour P, Delles C, Pilote L, Raparelli V. The importance of gender to understand sex differences in cardiovascular disease. Canadian Journal of Cardiology. 2021;37(5):699–710. pmid:33592281
  45. Connelly PJ, Currie G, Delles C. Sex Differences in the Prevalence, Outcomes and Management of Hypertension. Curr Hypertens Rep. 2022;24(6):185–92. Epub 20220307. pmid:35254589; PubMed Central PMCID: PMC9239955.
  46. Neufcourt L, Deguen S, Bayat S, Zins M, Grimaud O. Gender differences in the association between socioeconomic status and hypertension in France: A cross-sectional analysis of the CONSTANCES cohort. PLoS One. 2020;15(4):e0231878. pmid:32311000
  47. Chowdhury MZI, Leung AA, Walker RL, Sikdar KC, O’Beirne M, Quan H, Turin TC. A comparison of machine learning algorithms and traditional regression-based statistical modeling for predicting hypertension incidence in a Canadian population. Sci Rep. 2023;13(1):13. Epub 20230102. pmid:36593280; PubMed Central PMCID: PMC9807553.