
Machine learning-driven Diabetes Health Tracer (DHT): Optimizing prognosis using RaSK_GraDe and RaSK_GraDeL models

  • Muhammad Noman ,

    Contributed equally to this work with: Muhammad Noman, Maria Hanif, Abdul Hameed, Muhammad Babar, Basit Qureshi

    Roles Data curation, Investigation, Visualization

    Affiliation Department of Software Engineering and Artificial Intelligence, Iqra University, H-9, Islamabad, Pakistan

  • Maria Hanif ,

    Contributed equally to this work with: Muhammad Noman, Maria Hanif, Abdul Hameed, Muhammad Babar, Basit Qureshi

    Roles Conceptualization, Data curation, Investigation, Methodology, Supervision, Validation, Visualization, Writing – original draft

    Affiliation Department of Software Engineering and Artificial Intelligence, Iqra University, H-9, Islamabad, Pakistan

  • Abdul Hameed ,

    Contributed equally to this work with: Muhammad Noman, Maria Hanif, Abdul Hameed, Muhammad Babar, Basit Qureshi

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    hameed01@gmail.com

    Affiliation Department of Computer Science, University of Management and Technology, Lahore, Pakistan

  • Muhammad Babar ,

    Contributed equally to this work with: Muhammad Noman, Maria Hanif, Abdul Hameed, Muhammad Babar, Basit Qureshi

    Roles Funding acquisition, Project administration, Supervision, Writing – review & editing

    Affiliation Robotics and Internet of Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia

  • Basit Qureshi

    Contributed equally to this work with: Muhammad Noman, Maria Hanif, Abdul Hameed, Muhammad Babar, Basit Qureshi

    Roles Funding acquisition, Project administration, Supervision, Writing – review & editing

    Affiliation College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia

Editorial Note

The PLOS One Editors issue this Editorial Note to inform readers that the work cited in this article [1] as Reference 32 was retracted before the publication date of [1]. PLOS considers that this reference is not crucial in supporting the research or conclusions reported in [1], but the following sentences are no longer supported:

  • Section 2 Literature Review, paragraph 15, sentence 1: “In [32], the author developed the intelligent diabetes mellitus prediction framework (IDMPF), a framework for diabetes prediction. The Pima dataset was utilised by them. The achieved accuracy was 83%. However, the model gives result but not so good, so their is a way to make it better.”

We regret that the issues were not identified prior to the article’s publication.

30 Mar 2026: The PLOS One Editors (2026) Editorial Note: Machine learning-driven Diabetes Health Tracer (DHT): Optimizing prognosis using RaSK_GraDe and RaSK_GraDeL models. PLOS ONE 21(3): e0346001. https://doi.org/10.1371/journal.pone.0346001 View editorial note

Abstract

Diabetes mellitus presents a significant global health challenge, particularly in regions like Pakistan, India, and Bangladesh. Machine learning (ML) techniques offer promising solutions for diabetes prediction, surpassing traditional methods in reliability and efficiency. This research conducts a comparative analysis of ML algorithms including Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), K-nearest neighbors (KNN), Gradient Boosting (GB), RaSK_GraDe (proposed voting), and RaSK_GraDeL (proposed stacking). Evaluation is performed using several datasets: PIMA Indian, Frankfurt Hospitals Diabetes, RTML with Insulin, and the proposed Diabetes Health Tracer (DHT) dataset comprising 2877 observations with nine features. Data pre-processing techniques address missing values, outliers, normalization, and class balancing (SMOTE), enhancing model robustness. Hyperparameter tuning via cross-validation and Random Search optimizes model performance. Additionally, ensemble methods—the Voting Classifier (RaSK_GraDe) and the Stacking Model (RaSK_GraDeL with Logistic Regression)—are applied, achieving notable accuracies of 98.03% and 98.55%, respectively, on the DHT dataset. The study underscores ML's potential in diabetes prediction, advocating for advances in personalized treatment and healthcare management.

1 Introduction

Diabetes mellitus (DM) is a chronic medical condition caused by high blood glucose levels due to the body's inability to produce enough insulin or to use the insulin it produces effectively. Diabetes currently stands as one of the deadliest global diseases. It has two main types: type 1, which commonly occurs in children and is mediated by immune mechanisms, and type 2, which occurs later in life due to malfunctioning or disease of the pancreas [1]. Two further types are gestational diabetes, which occurs during pregnancy and whose symptoms disappear after the pregnancy ends, and prediabetes, in which the blood glucose level stays above the normal range.

This health epidemic extends its influence worldwide, presenting a significant challenge for nations, particularly developing ones such as Pakistan, India, and Bangladesh [2]. According to certain research findings, compared to non-immigrant populations, which had a prevalence of 11.6%, South Asian countries had a high rate of diabetes, with patients from Sri Lanka having the highest prevalence (26.8%), followed by those from Bangladesh (22.3%), Pakistan (19.6%), India (18.3%), and Nepal (16.5%) [2]. Diabetes affects an estimated 537 million adults worldwide aged 20 to 79. This figure is projected to rise to 643 million by 2030 and to 783 million by 2045 [3]. According to the World Health Organization (WHO), diabetes directly caused 6.7 million deaths in 2021. Moreover, health-organization spending on diabetes has increased by around 316% over the last 15 years [4].

The short-term symptoms of diabetes caused by high glucose levels include polyuria, polydipsia, weight loss, blurred vision, and sometimes polyphagia. Long-term complications of diabetes include heart attack, partial paresis, foot ulcers, loss of vision, sexual dysfunction, and cerebrovascular disease [5]. Diabetes is one of the leading causes of chronic kidney disease (CKD): around 40% of people with diabetes develop CKD, and the number of new cases of CKD in people with type 2 diabetes increased by up to 74% from 1990 to 2017 [5]. Global healthcare expenditure due to diabetes is projected to grow from 966 billion U.S. dollars to just over one trillion U.S. dollars between 2021 and 2025 [6].

A well-known way of treating diabetes, first used three decades ago, is self-monitoring of blood glucose (SMBG) using finger-stick blood samples [7,8]. With this method, diabetics use finger-stick glucose metres to prick their finger skin three or four times a day to monitor their blood glucose levels in an invasive manner. The idea is to measure blood glucose concentrations at various intervals and modify insulin dosage, food, and exercise to keep blood glucose levels within normal ranges. Nevertheless, if the estimation of insulin intake is based on a small number of SMBG samples, this method may be misleading in addition to being difficult and uncomfortable. As a result, there is a chance that plasma glucose levels will rise above normal. Continuous glucose monitoring (CGM), which offers the most information about variations in blood glucose concentration throughout the day and helps diabetes patients make the best treatment decisions, was introduced as a solution to this issue. This method uses tiny wearable sensors or systems to continually track blood glucose levels throughout the day. These systems may be non-invasive, minimally invasive, or invasive. Moreover, CGM systems can be classified into two groups: real-time systems and retrospective systems [9]. The arrival and accessibility of numerous cutting-edge continuous glucose monitor (CGM) devices and systems present new opportunities for diabetic individuals to manage their blood sugar levels easily. The majority of contemporary CGMs use a minimally invasive technique that continuously measures the interstitial fluid (ISF) to calculate and record the patient's current glycemic status every minute. These systems/devices merely breach the skin's outer layer without actually piercing any blood vessels and are therefore considered minimally invasive. 
Moreover, there exist non-invasive techniques, such as using electromagnetic radiation to measure blood glucose levels by passing it through the skin and into the body's blood vessels [10]. Diagnosis of diabetes may not always be possible using the traditional techniques, due to factors such as poverty and the distance of hospitals from people living in villages. Therefore, artificial intelligence techniques, alongside improvements in technology, play an increasingly important role in healthcare centres. Machine learning (ML) algorithms in particular, owing to their reliability and robustness, deliver persuasive results in far less time than classical methods.

Therefore, the objectives of this research are to create a new dataset by merging different datasets, to propose a system that can easily predict diabetes in real time, and to conduct a comparative analysis of the performance of different ML algorithms such as Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), K-nearest neighbors (KNN), and Gradient Boosting (GB). The algorithms are implemented on each dataset to perform binary classification of diabetes. Their performance is assessed using various evaluation metrics, namely accuracy, precision, recall, F1-score, ROC curve, and AUC score.

The remainder of this paper is structured as follows: Sect 2 presents the background and a review of related work. Sect 3 outlines the proposed methodology in detail. Sect 4 provides experimental results and a comparative analysis with state-of-the-art approaches. A concise discussion of the system’s performance is provided in Sect 5, and concluding remarks are given in Sect 6. Table 1 lists the abbreviations used throughout the paper for clarity and reference.

2 Literature review

Different types of machine learning (ML) strategies have been applied to diabetes prediction and classification to achieve high accuracy. Some of them are explained below. In [11], the author compares the accuracy and performance of five supervised machine learning algorithms for predicting diabetes. The predictive power of the DT, LR, KNN, RF, and SVM methods is evaluated. Furthermore, two different datasets (PIMA and Frankfurt) are used to investigate the effect of dataset size on model accuracy, with the RF approach yielding 97% accuracy. The impact of underfitting and overfitting on predicted results is also investigated in this work.

In [12], the author presents a comprehensive guide on diabetes prediction utilizing ML models such as LR, SVM, NB, and RF, together with ensemble techniques such as XGBoost, LightGBM, CatBoost, AdaBoost, and Bagging. Among the ensemble methods, CatBoost emerges as the most effective, boasting an impressive accuracy rate of 95.4% compared to XGBoost's 94.3%; CatBoost also achieves a higher AUC-ROC score of 0.99. The metrics used in this study are accuracy, sensitivity, and the confusion matrix.

In [13], the author focuses on feature selection using KNN, RF, J48, and NB. The study's findings demonstrate that feature selection improves models by avoiding overfitting and eliminating unnecessary data. The SMOTE class balancing technique was used. After being assessed using metrics such as the F-measure, the Precision-Recall curve, and the area under the Receiver Operating Characteristic curve, the study's results demonstrate a better outcome than earlier research. This finding may influence clinical practice when medical professionals try to identify diabetes at an early stage.

In [14], the author proposed a study of early diabetes prediction using feature selection on a diabetes dataset of 2500 records with 15 attributes. DT, RF, and NB algorithms were used in the study; the highest accuracy was achieved by Naïve Bayes (82.30%). In [15], the author presents an ensemble machine learning model for early diabetes prediction using the PIMA dataset. Preparation techniques handle outliers and missing data, while shuffle split improves accuracy. In terms of accuracy, XGBoost (XG) outperforms Random Forest (RF) and AdaBoost (AB) (0.961 ± 0.014). XGBoost also achieves strong AUC (96.1%), False Negative Rate, False Positive Rate, Precision (86.6%), Sensitivity (79.8%), Specificity (94.2%), and Accuracy (89.6%) values.

In [16], the author used both deep learning (ANN) and machine learning algorithms (RF, KNN) to classify diabetes. The PIMA Indian dataset was used for the study. A feature extraction technique was carried out, with which the best accuracy was achieved by the ANN (75.7%).

In [17], Yadav et al.’s study from 2023 shows a substantial development in the study of diabetes mellitus and fractional-order modelling. Through the use of the Atangana-Baleanu Caputo (ABC) operator, the authors have produced a model of diabetes dynamics that is more precise and thorough. In addition to adding to the body of knowledge already in existence, this work opens the door for more studies and therapeutic applications in the area.

In [18], Parveen et al. used real-world clinical data from CPCSSN containing records of 172,168 unique patients. This review of the literature offers a thorough summary of the main topics relevant to managing longitudinal data with irregular sampling and applying machine learning approaches to prognostic modelling of diabetes. It draws attention to the difficulties, both conventional and contemporary methods, assessment criteria, case studies, and various possibilities for future research in this area.

In [19], Manarvi et al. conducted a questionnaire-based survey to collect information about diabetes patients. One such questionnaire was modified and translated into Arabic for use in the current study in order to survey patients at a nearby hospital; nineteen hundred and one patients took part. The outcomes of the tests are examined based on patient demographics, diagnoses, tests, and other elements of their diabetes self-management. This review of the literature offers a thorough analysis of the body of knowledge regarding diabetes treatment procedures in Arabic-speaking nations, emphasising significant discoveries and suggesting areas in need of more research.

Smith et al. (2019) used random forest and logistic regression models to predict diabetes in their study. The random forest model performed better because it could handle non-linear correlations, with logistic regression and random forest achieving accuracies of 78% and 85%, respectively. PCA and RFE were two of the methods used for feature selection, which improved model performance by identifying important predictors. Despite these developments, the research highlighted problems with complex model interpretability, generalisability, and data quality [20].

In [21], the author proposed a study on diabetes prediction and classification using the PIMA Indian dataset. Three machine learning algorithms (Decision Tree, SVM, and Naïve Bayes) were used. The results showed that the highest accuracy was achieved by Naïve Bayes (74.28%), along with precision (75.7%), recall (76.1%), and F1-measure (75.8%). In [22], the author presents the Diabetes Expert System, which improves diabetes prediction through the use of Machine Learning Analytics (DESMLA). Five class balancing techniques were used by DESMLA to address the imbalance in diabetes datasets. DESMLA improves predictive accuracy by using Random Forest (RF) and Decision Tree (DT) classifiers in conjunction with thorough data preprocessing procedures. Notably, DESMLA works best when using the Gaussian SMOTE and K-Means SMOTE approaches.

In [23], the author carried out diabetes prediction using supervised machine learning on the PIMA Indian dataset. Two ML algorithms (KNN, Naïve Bayes) were used, of which Naïve Bayes performed better with an accuracy of 76.07%. In [24], the author uses different algorithms to predict diabetes after a rigorous experimental analysis. Compared to other algorithms, logistic regression with all features produces a better result (ACC = 84.70%) for diabetes prediction. The model performs better than the other approaches due to feature engineering that selects the right characteristics; that is, an accuracy of about 85% is obtained with the best logistic regression model using selected features. In [25], a dataset including 340 instances and 26 attributes is used to compare two ensemble machine learning methods for diabetes classification. The Bagging and Decorate ensemble methods were applied using the WEKA software. While Decorate's accuracy was 98.53%, Bagging's was 95.59%. With a Kappa statistic of 0.9214, an MAE of 0.0482, and an RMSE of 0.1546, Bagging correctly classified 95.5882% of the instances. Furthermore, for Bagging, the TP rate was 0.956, the FP rate was 0.032, and the specificity was 94.9%.

In [26], the author proposed a study of diabetes data classification using a deep learning approach on the 130-US-hospitals dataset. Five machine learning algorithms were used, including NB, RF, DT, SVM, and ensemble learning. The best accuracy among the ML algorithms was 86%, while deep learning achieved 85.61%. In [27], the author uses the Pima dataset to accurately predict diabetes using machine learning techniques. Preprocessing methods such as feature selection, imputation of null values, scaling, and uniformity are used in combination with a number of classification algorithms, including Decision Tree (J48), NB, SVM, LR, Multilayer Perceptron, KNN, Logistic Model Tree, RF, and others. At 80.869% accuracy, the RF model gives the highest result. In [28], the author examines the use of predictive analytics in healthcare, highlighting how it can support practitioners in data-driven patient care decisions. Using a dataset of patient medical information, six machine learning methods are examined: SVM, KNN, RF, DT, LR, and NB are applied to the PIMA dataset, and each model's accuracy and performance are compared and assessed. By determining the best ML model for the prediction, the study seeks to help medical practitioners anticipate diabetes early on. KNN performed best among all the algorithms.

In [29], the author proposed a study of gestational diabetes on the PIMA Indian dataset using a parameter-tuned KNN, applying the grid-search hyperparameter optimization technique. The accuracy was improved by 5.29%, reaching a best of 82.5%. In [30], the author examines the prediction of diabetes through an analysis of five supervised ML models: SVM, NB, DT, RF, and KNN. After post-classification and cross-validation, the author observes steady accuracy when taking into account all risk factors in the dataset. KNN achieves the best accuracy of 76%, while the other classifiers also retain accuracy above 70%. Examining training and testing accuracy visualizations for indications of model overfitting and underfitting, the author investigates why some ML classifiers are unstable and inaccurate. The main goal of the research is to determine the best outcomes for diabetes prediction in terms of computing time and accuracy.

In [31], the author proposed a study of diabetes prediction using data mining techniques on the PIMA Indian dataset. Four methods were used: RF, SVM, LR, and NB. Performance was measured using the confusion matrix, sensitivity, and accuracy metrics. Logistic Regression achieved the highest accuracy (82.46%).

In [32], the author developed the intelligent diabetes mellitus prediction framework (IDMPF), a framework for diabetes prediction. They utilised the Pima dataset, and the achieved accuracy was 83%; however, the model's results leave room for improvement. In [33], the study's methodology comprises parameter evaluation, prediction, ML algorithm selection, pre-processing, cross-validation, and dataset selection. Additionally, the study used 10-fold cross-validation to divide the data into training and testing sets using the WEKA software. To ensure that every instance in the dataset had the same weight, a class balancer technique was used. For the PID dataset, SVM had the highest accuracy of 74.3%, while KNN and RF had the highest accuracy of 98.7% for the Germany diabetes dataset.

In [34], the author used machine learning in order to diagnose and predict diabetes, which facilitates decision-making on the management of the condition. The multilayer perceptron algorithm is the most predictively accurate of these, with an excellent area under the curve of 86%, a low mean square error of 0.19, and low rates of false positives and false negatives. The methodology includes analyzing diabetes datasets using both neural network-based and conventional classification algorithms. Performance metrics are evaluated to ascertain the efficacy of the algorithms in terms of prediction accuracy, false positive and false negative rates, and overall area under the curve. The overall related work has been shown in Table 2.

3 Proposed work

In this section, we created a unique dataset by combining three distinct datasets that share similar characteristics but vary in the number of observations. Fig 2 visually depicts the framework of this methodology, containing several key steps: dataset merging, data preprocessing, dataset splitting, utilization of machine learning models like RF, DT, SVM, KNN and GB, ensemble techniques, and evaluation using various performance metrics i.e. accuracy, F1-score, recall, and precision.

3.1 Dataset creation

In our research, we created a new dataset by merging three distinct datasets: a) PIMA Indian [35], b) Frankfurt Hospitals Diabetes [36], and c) RTML with Insulin [37], which was obtained from 103 female individuals of Rownak Textile Mills Ltd, Dhaka, Bangladesh. These datasets are publicly accessible on platforms like Kaggle and GitHub. Initially, we imported these datasets and examined their contents individually to understand their characteristics and patterns.

The PIMA Indian dataset comprises 768 observations with 9 features, while the Frankfurt Hospitals Diabetes dataset includes 2000 observations with the same features as the PIMA Indian dataset. The RTML with Insulin dataset contains 110 observations with 8 features. Notably, one feature was missing in the RTML with Insulin dataset, which we addressed with median imputation.

After removing any redundant variables, we merged these datasets to create a proposed dataset called Diabetes Health Tracer (DHT) as shown in Fig 1. Subsequently, we utilized this dataset for training models to predict diabetes as described in Algorithm 1.

Algorithm 1. Dataset creation.

Require: Read Datasets (PIMA, FHD, RTML-I)
 1: for all d in Datasets do
 2:   for all feature_name in d do
 3:     if feature_name is not common or feature_name is Unnamed then
 4:       Remove feature_name from d
 5:     end if
 6:   end for
 7: end for
 8: Declare variable Combined_dataset
 9: for all d in Datasets do
10:   Combined_dataset = merge(d, Combined_dataset)
11: end for
12: return Combined_dataset

The process of creating proposed DHT dataset has been explained below:

  1. Reading the datasets: We read all three datasets (PIMA, FHD, RTML with Insulin) for the merging process.
  2. Inspection: In the second step, we inspected the datasets one by one to check their behaviours and patterns.
  3. Remove unmatched features: In this step, we removed the features that are not aligned or not common across all three datasets. There was one unnamed feature in the “RTML with Insulin” dataset, so we removed it.
  4. Merging datasets: After removing the unnamed feature, we merged all three datasets into the DHT (proposed) dataset using concatenation, and saved the DHT dataset for future use.
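Under stated assumptions, the steps above can be sketched in pandas. The tiny frames and the "Unnamed: 0" column below are hypothetical stand-ins: the real files have more rows and the full feature set.

```python
import pandas as pd

# Hypothetical stand-ins for the three source files (PIMA, FHD, RTML with Insulin)
pima = pd.DataFrame({"Glucose": [148, 85], "BMI": [33.6, 26.6], "Outcome": [1, 0]})
fhd = pd.DataFrame({"Glucose": [120, 90], "BMI": [30.1, 25.0], "Outcome": [0, 0]})
rtml = pd.DataFrame({"Glucose": [110], "BMI": [28.4], "Outcome": [1],
                     "Unnamed: 0": [0]})  # stray unnamed column, as in step 3

datasets = [pima, fhd, rtml]

# Step 3: keep only the features common to all three datasets
common = set.intersection(*(set(d.columns) for d in datasets))
datasets = [d[sorted(common)] for d in datasets]

# Step 4: concatenate into the combined DHT dataset
dht = pd.concat(datasets, ignore_index=True)
print(dht.shape)  # (5, 3)
```

The `set.intersection` step drops the unnamed column automatically, so the same code works whichever dataset carries the stray feature.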

The proposed DHT dataset comprises 9 attributes, including one target variable. The Diabetes Health Tracer (DHT) dataset is publicly available under the Apache License 2.0 and can be accessed at Diabetes-Health-Tracer-DHT-Dataset; the repository link is also given in the reference section at [38]. It consists only of numeric data, with a total of 2877 observations. The target variable indicates two classes (0, 1). Table 3 gives a summary of the attributes, types, and values of the proposed DHT dataset. A descriptive analysis of DHT using measures of central tendency, frequency, and standard deviation is shown in Fig 3.

Table 3. Diabetes health tracer dataset description.

https://doi.org/10.1371/journal.pone.0327661.t003

3.2 Data preprocessing

Data preprocessing is required to prepare raw data for further processing [39]. Preparing data is a prerequisite to developing a reliable predictive model: it is the process of making the data suitable for training a machine learning model. The following preprocessing techniques were applied to the data to make it suitable for the ML models and ensemble techniques:

3.2.1 Imputation of missing values.

We observed that the dataset has missing values, as shown in Fig 4. Missing data causes machine learning models to malfunction, for example through overfitting or underfitting, which affects the accuracy of the algorithms. The features containing missing values are listed in Table 4. As Fig 4 clearly shows, the features Glucose, Insulin, GlucosePedgreeFunction, SkinThickness, and BMI have some missing values. We used the imputation method to handle this problem, filling missing values with the median. The median is the middle value of a dataset when it is ordered from lowest to highest; if there is an even number of values, the median is the average of the two middle values. Unlike the mean, the median is not affected by outliers, making it a more reliable measure for skewed distributions. The formulas for median imputation are given below.

\[
\text{median}(x) = x_{\left(\frac{n+1}{2}\right)} \tag{1}
\]
\[
\text{median}(x) = \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) \tag{2}
\]

where $\hat{x}_i$ is the imputed value for observation $i$ and $x_i$ is the original value: $\hat{x}_i = x_i$ if $x_i$ is present, and $\hat{x}_i = \text{median}(x)$ otherwise. The median is calculated from the available non-missing values, ordered as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$. Equation (1) is used when the number of values $n$ is odd, while Equation (2) is used when $n$ is even.
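As a minimal illustration of median imputation (the Insulin values below are hypothetical), pandas computes the median over the non-missing entries only and fills each gap with it:

```python
import numpy as np
import pandas as pd

# Hypothetical Insulin column with missing entries (NaN)
insulin = pd.Series([94.0, np.nan, 168.0, 88.0, np.nan, 543.0])

# Median of the available (non-missing) values: sorted [88, 94, 168, 543],
# even count, so (94 + 168) / 2 = 131.0
median = insulin.median()

imputed = insulin.fillna(median)
print(imputed.tolist())  # [94.0, 131.0, 168.0, 88.0, 131.0, 543.0]
```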

Fig 4. Missing value matrix: Missing values of overall dataset.

https://doi.org/10.1371/journal.pone.0327661.g004

Table 4. Features with number of missing values.

https://doi.org/10.1371/journal.pone.0327661.t004

3.2.2 Normalization.

The Min-Max scaling technique is applied to the non-binary features in the dataset in order to scale them. The Min-Max scaler maps the data into the range [0, 1]. The formula is given below.

\[
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{3}
\]

Equation (3), adapted from [40], is most commonly used for normalization of the data. A normalised feature value can be interpreted as the position of the original value between the initial minimum and maximum, from 0 to 100 percent [40].
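A minimal NumPy sketch of this scaling, with hypothetical glucose values:

```python
import numpy as np

def min_max_scale(x):
    """Scale a feature into [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

glucose = [80, 120, 200]  # hypothetical raw glucose values
scaled = min_max_scale(glucose)
print(scaled)  # [0.0, 0.333..., 1.0]
```

In practice scikit-learn's `MinMaxScaler` does the same computation while remembering the training minima and maxima so that test data can be scaled consistently.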

3.2.3 Outlier detection and removal.

Some of the features in the dataset contain outliers. The presence of outliers biases a model and affects its predictive performance. In this study, the inter-quartile range (IQR) method is applied to remove outliers from the dataset. In this method, data points that fall outside the interval bounded by Q1 − c·IQR and Q3 + c·IQR, where Q1 and Q3 are the 1st and 3rd quartiles, are considered outliers and are removed by replacing their values with the median of the specific column. In the IQR method, the thresholds are calculated by the following equations:

\[
IQR = Q_3 - Q_1 \tag{4}
\]
\[
\text{Lower} = Q_1 - c \times IQR \tag{5}
\]
\[
\text{Upper} = Q_3 + c \times IQR \tag{6}
\]

Eqs (4), (5), and (6), derived from [41], are utilised to identify and eliminate outliers from the data. Here Q1 and Q3 refer to the 1st and 3rd quartiles respectively, and c is the threshold factor, usually set to 1.5 [41]. In Fig 5, (a) shows the outlier values of all features (Glucose, BloodPressure, Insulin, SkinThickness, BMI, GlucosePedgreeFunction, and Age), while in (b) the outlier values have been removed using the IQR method.
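The IQR procedure can be sketched as follows; the BMI values are hypothetical, and c = 1.5 as stated above:

```python
import pandas as pd

def replace_outliers_iqr(col, c=1.5):
    """Replace values outside [Q1 - c*IQR, Q3 + c*IQR] with the column median."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1                      # Eq. (4)
    lower = q1 - c * iqr               # Eq. (5)
    upper = q3 + c * iqr               # Eq. (6)
    return col.where(col.between(lower, upper), col.median())

bmi = pd.Series([22.0, 25.0, 27.0, 30.0, 95.0])  # 95.0 is an obvious outlier
cleaned = replace_outliers_iqr(bmi)
print(cleaned.tolist())  # [22.0, 25.0, 27.0, 30.0, 27.0]
```

Here Q1 = 25.0 and Q3 = 30.0, so the bounds are 17.5 and 37.5; the 95.0 entry falls outside and is replaced by the median, 27.0.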

Fig 5. Features of DHT dataset with outliers.

https://doi.org/10.1371/journal.pone.0327661.g005

3.2.4 Imbalanced dataset.

The term “imbalance” describes an uneven class distribution in the dataset. Imbalanced data biases the classification results, and this problem is apparent in our dataset. We apply the SMOTE class balancing technique, explained below.

Synthetic Minority Over-sampling Technique (SMOTE):

In [42], the authors presented a technique called SMOTE (“Synthetic Minority Over-sampling Technique”) to increase the decision area of the minority class samples and address the problem of over-fitting. This method uses the feature space, not the data space, to create synthetic samples: the minority class is oversampled by creating artificial data rather than by replacement or randomised resampling. It was the first strategy to add new data points to the learning dataset in order to enrich the data space and address the lack of data in the sample distribution [42]. Oversampling is a standard procedure when classifying imbalanced data (such as minority classes) [43], and machine learning researchers have put a great deal of work into it over the last ten years. Algorithm 2 presents the working of SMOTE.

Algorithm 2. Synthetic Minority Oversampling Technique (SMOTE).

Require: Training data Tr
1: Tr is the input training set
2: p is the number of closest neighbours
3: k is the number of closest neighbours after data cleansing
Ensure: Training set Tr′ after SMOTE
4: Begin
5: for i = 1 to N do
6:   Create artificial samples from the minority class and add them to Tr′
7: end for
8: End
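A minimal NumPy sketch of the core SMOTE idea follows: each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbours. This is an illustrative re-implementation on toy data, not the library routine used for the experiments.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(X_min, n_synthetic, k=3):
    """Generate synthetic minority samples by interpolating in feature space."""
    synth = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Distances to all minority samples; skip index 0 (the point itself)
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()              # random point on the segment x -> nb
        synth.append(x + gap * (nb - x))
    return np.array(synth)

# Toy minority-class points in a 2-D feature space
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
X_new = smote_sample(X_minority, n_synthetic=4)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the minority region of feature space, which is exactly the "decision area" enlargement described above.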

The class distribution before and after the balancing technique is shown in Table 5. The total number of observations increased from 2877 to 3788 due to the oversampling technique. Fig 6 visualizes the class distribution of the imbalanced and balanced datasets respectively.

A popular visualisation technique for revealing patterns hidden in the data is the heatmap [44]. Fig 7 shows the correlation of the features. The heatmap shows that the target variable ‘Outcome’ depends most strongly on the Glucose, BMI, Age, and Insulin features.
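The correlation matrix such a heatmap visualises can be computed with pandas. The data below are synthetic stand-ins in which Outcome is deliberately constructed to track Glucose, so a positive correlation appears by design:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
# Outcome loosely tracks Glucose, mimicking the dependence seen in the heatmap
outcome = (glucose + rng.normal(0, 30, n) > 130).astype(int)

df = pd.DataFrame({"Glucose": glucose,
                   "BMI": rng.normal(32, 6, n),
                   "Outcome": outcome})

corr = df.corr()  # the matrix a correlation heatmap renders as colours
print(corr.loc["Outcome", "Glucose"])
```

Passing `corr` to a plotting library (e.g. seaborn's `heatmap`) then yields a figure of the kind shown.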

3.3 Dataset splitting

In this study, the train-test splitting technique was utilised to achieve better accuracy. Train-test splitting is a classic approach in which the dataset is split into two parts: one part is used to train the models while the other is used for testing. In this study, we split the dataset 80/20: 80% was used for training and 20% for testing.
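An 80/20 split of this kind can be sketched with scikit-learn. The arrays are toy stand-ins, and since the paper does not state whether stratification was used, the `stratify` argument here is an assumption (it keeps the class ratio equal in both parts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 hypothetical observations, 2 features
y = np.array([0, 1] * 10)         # balanced binary target

# 80/20 split as used in the study; stratify (an assumption) preserves
# the class proportions in both the training and testing parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 16 4
```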

3.4 Machine learning algorithms

Different types of machine learning (ML) models were used in this study. The following supervised machine learning (ML) models have been used in this work.

3.4.1 Decision tree classifier.

Decision trees (DTs) are a type of supervised machine learning technique that can be used for regression and classification tasks. A DT is a tree-structured classifier in which every leaf node delivers the classification result and every internal node represents a property of the dataset [45]. A decision tree consists of two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to conduct tests and have outgoing branches, whereas leaf nodes show the results of those decisions and do not contain any further branches. The technique gets its name from the fact that, just like a tree, it starts at the root and spreads outward on successive branches to form a tree-like structure, with the leaves standing for the possible outcomes. The decision nodes divide up the data. Two kinds of metrics are used in building a decision tree: information gain (IG) and the entropy/Gini index. The formulas for these metrics are shown below, where $p_i$ is the proportion of samples belonging to class $i$ at a node.

\[ E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{7} \]

\[ IG(S, A) = E(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, E(S_v) \tag{8} \]

\[ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} p_i^2 \tag{9} \]

Here \(p_i\) is the proportion of samples in \(S\) belonging to class \(i\), and \(S_v\) is the subset of \(S\) for which attribute \(A\) takes value \(v\).
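A worked computation of the entropy, information gain, and Gini index metrics described above, in pure Python (the label lists are illustrative):

```python
# Entropy, Gini index, and information gain for a candidate split.
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, subsets):
    # IG = entropy(parent) - weighted entropy of the child subsets
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]          # a perfect split
print(entropy(parent))                       # 1.0
print(gini(parent))                          # 0.5
print(information_gain(parent, [left, right]))  # 1.0
```

A perfect split drives the child entropies to zero, so the information gain equals the parent's entropy.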

3.4.2 Random forest classifier.

Random forest (RF) is a supervised machine learning classifier and a type of ensemble learning. A forest is formed by combining an ensemble of DTs, most of which are trained using the “bagging” technique. The core tenet of the bagging approach is that combining several learning strategies can lead to superior outcomes [46]. This supervised learning system predicts the outcome using voting: the random forest (RF) predicts class 1 as the final output if the majority of the forest’s trees predict that class [47]. In other words, random forest (RF) applies decision tree classifiers to many resamples of the dataset and then aggregates their results to improve prediction accuracy and avoid overfitting. When bootstrap = True, the max_samples parameter sets the size of the resamples; otherwise, each tree is generated using the entire dataset [48]. Random forests (RF) use the same metrics as decision tree (DT) classifiers; the equations for those metrics are given in the decision tree (DT) section above.
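The bootstrap/max_samples behaviour described above can be sketched with scikit-learn's `RandomForestClassifier` (toy data; the parameter values are illustrative, not the study's tuned settings):

```python
# Bagged random forest: each tree sees a bootstrap resample of the data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# bootstrap=True resamples the training set for each tree; max_samples
# controls the fraction of rows drawn for each resample.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_samples=0.8, random_state=42)
rf.fit(X, y)
print(rf.score(X, y))
```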

3.4.3 Support vector machine.

Support vector machines (SVMs) are a class of supervised learning approaches that address regression analysis, outlier identification, and classification problems. What sets SVMs apart from other classification methods is their ability to choose a decision boundary that maximises the margin to the nearest data points of each category. This maximum-margin decision boundary produced by the SVM is referred to as a hyperplane. Kernel SVM and basic SVM are the two types of SVMs [49,50]. This research used a kernel SVM with the linear kernel. Compared to the linear kernel, most other kernel functions have more parameters to optimise and are slower. The equation used by the linear kernel SVM is described below [49].

\[ f(X) = K \cdot X + b \tag{10} \]

In this formula, K, X, and b represent the weight matrix to be optimised, the data to be interpreted, and the linear coefficient learned from the training dataset, respectively.
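For a linear kernel, scikit-learn exposes the learned weights and intercept directly, so the decision function of equation (10) can be reproduced by hand (toy data; the sketch assumes a binary problem):

```python
# Reproducing the linear SVM decision function f(X) = K·X + b manually.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
svm = SVC(kernel="linear").fit(X, y)

# coef_ holds the weight vector K and intercept_ the bias b.
manual = X @ svm.coef_.ravel() + svm.intercept_[0]
print(np.allclose(manual, svm.decision_function(X)))  # True
```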

3.4.4 K-Nearest Neighbors (KNN).

The k-nearest neighbour (KNN) algorithm is a supervised machine learning technique that is mostly used for classification and has been widely used to forecast illnesses. The supervised KNN method predicts the class of unlabelled data by using the features and labels of the training data. The KNN technique classifies a testing query by considering the k training data points (neighbours) closest to that query [51]. The algorithm then chooses the final classification using a majority voting rule. The KNN method is one of the most fundamental machine learning algorithms and is often used in classification problems due to its flexible and easy-to-understand architecture. It is commonly recognised that the technique can solve regression and classification problems with data of different sizes, label counts, noise levels, distances, and contexts [51]. The distance formula (d) below is used to compute the distance measure.

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{11} \]
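The neighbour vote and the Euclidean distance of equation (11) can be sketched as follows (the points are illustrative):

```python
# KNN with Euclidean distance: the k closest training points vote.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
query = np.array([[1.1, 1.0]])
print(knn.predict(query))  # [0] — two of the three nearest points are class 0

# The same distance computed by hand with equation (11):
d = np.sqrt(((X - query) ** 2).sum(axis=1))
print(d.argmin())  # 0 — the closest training point
```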

3.4.5 Gradient boosting classifier.

The Gradient Boosting algorithm was originally created for regression challenges. By repeatedly merging weak learners, such as decision trees, into an additive approximation of a target function, it seeks to produce a strong learner. The algorithm trains each new model on the computed pseudo-residuals in order to minimise the expected value of a particular loss function [52]. Because overfitting is a risk, regularisation is performed by shrinkage, by reducing tree complexity (e.g., depth), and by adding hyper-parameters. To fine-tune the model, it is important to examine parameters such as the learning rate, maximum tree depth, sub-sampling rate, number of features considered for splitting, and minimum samples for node splitting. By adding these regularisation hyper-parameters, Gradient Boosting mitigates the risk associated with fitting the training data perfectly. Because of the algorithm’s versatility, hyper-parameters such as the learning rate, tree depth, feature selection, and sub-sampling rate can be used for optimisation and better generalisation [52]. The mathematical equation is given below.

\[ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \tag{12} \]

Here \(h_m\) is the weak learner fitted to the pseudo-residuals at iteration \(m\), and \(\gamma_m\) is its shrinkage-scaled weight.
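The regularisation hyper-parameters discussed above map directly onto scikit-learn's `GradientBoostingClassifier` (toy data; the values are illustrative, not tuned):

```python
# Gradient boosting with the regularisation knobs named in the text.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

gb = GradientBoostingClassifier(
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    max_depth=3,           # limits tree complexity
    subsample=0.8,         # sub-sampling rate of the training rows
    min_samples_split=4,   # minimum samples required to split a node
    n_estimators=100,
    random_state=1,
).fit(X, y)
print(gb.score(X, y))
```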

3.5 Ensemble learning techniques

Ensemble methods, sometimes referred to as multi-classifier systems, are one of the most significant ML research fields because they address a weakness of classic ML approaches. Ensemble classifiers are based on the basic principle that improved classification outcomes can be achieved by combining the predictions of several different base classifiers. Combining the predictions made by several base classifiers can correct the errors made by each individual classifier and result in more accurate predictions than any single classifier [53]. We employed two distinct ensemble strategies in this work, which are covered in more detail below.

3.5.1 Voting classifier.

The voting classifier is a machine learning model that trains a collection of other models. The voting classifier uses the results from each classifier to forecast the output class based on the largest vote majority. Voting ensemble techniques aggregate the predictions from the different models. As demonstrated in Fig 8, the voting method, which we utilised in our research, determines the class with the most votes based on the combined predictions of all classifiers.

thumbnail
Fig 8. (a). Proposed voting classifier (RaSK_GraDe), (b). Proposed Stacking Model (RaSK_GraDeL).

https://doi.org/10.1371/journal.pone.0327661.g008

A more accurate and balanced prediction can be obtained by the voting ensemble classifier by integrating the predictions of several classifiers. Five base classifiers are used by the voting ensemble model in this study. Initially, the base classifiers were trained on the entire training input data set. Each base model’s prediction was then weighted, and the final prediction was made using a soft voting mechanism. The adaptive voting ensemble classifier also improves robustness to noisy data and outliers, because the same dataset is used to train the different classifiers [54]. Moreover, RaSK_GraDe (the voting classifier) in this diabetes prediction study can yield improved accuracy and robustness, making it a very useful tool for the prediction of diabetes. Algorithm 3 explains the working of RaSK_GraDe (voting classifier).

Algorithm 3. Proposed Voting classifier: RaSK_GraDe.

Require: Dataset split into Training set and Testing set

1: Start

2: Establish base classifiers:

3:   Decision Tree (DT) at level 0

4:   Random Forest (RF) at level 0

5:   Support Vector Machine (SVM) at level 0

6:   K-Nearest Neighbors (KNN) at level 0

7:   Gradient Boosting (GB) at level 0

8: for n = 1 to N do

9:   Train the model fn using dataset

10: end for

11: Collect predictions from each base classifier for the test set

12: for each test sample xi in the Testing set do

13:   Collect predictions from classifiers

14:   Aggregate predictions

15:   Store the final prediction for xi

16: end for

17: Calculate the performance of the voting classifier using the true labels

18: END


Fig 8 presents the working of the voting classifier, which takes the predictions from the base models and aggregates them to make the final prediction.
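The five-model soft-voting scheme of RaSK_GraDe described above can be sketched with scikit-learn's `VotingClassifier` (toy data; the hyper-parameters are illustrative defaults, not the study's tuned values):

```python
# Soft-voting ensemble over DT, RF, SVM, KNN, and GB base classifiers.
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

voter = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=7)),
        ("rf", RandomForestClassifier(random_state=7)),
        ("svm", SVC(kernel="linear", probability=True)),  # soft voting needs probabilities
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("gb", GradientBoostingClassifier(random_state=7)),
    ],
    voting="soft",   # average predicted probabilities, then take the argmax
).fit(X_tr, y_tr)
print(voter.score(X_te, y_te))
```

With `voting="soft"`, class probabilities are averaged across the base models before the final class is chosen, which matches the weighted soft-voting mechanism described above.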

3.5.2 Stacking model.

Stacking approaches build an ensemble model by combining base learners, just like voting classifiers, but there are two key distinctions between the two. First, unlike voting classifiers, which often use homogeneous learners (the same classification algorithm), stacking typically uses heterogeneous base learners (various classification algorithms). Second, in voting classifiers the base learners are combined in a deterministic manner, usually by a weighted sum or by voting. In stacking, on the other hand, a meta-learner combines the base learners through a non-deterministic, learned mechanism [55].

Stacking can use two layers or multiple layers. In this study, we use a two-layer stacking technique: a base layer that uses five machine learning models (RF, SVM, KNN, GB, and DT) and a meta-layer that uses logistic regression (LR) as the meta-model. First, we trained the five machine learning models as base learners on the training dataset. Second, we provided the predictions of each base learner as input features to the meta-learner model. Lastly, the meta-learner model gives the final prediction. Algorithm 4 explains the working of RaSK_GraDeL (stacking model).

Algorithm 4. Stacking Model: RaSK_GraDeL.

Require: Dataset split into Training set and Testing set

1: Start

2: Establish base classifiers:

3:   Decision Tree (DT) at level 0

4:   Random Forest (RF) at level 0

5:   Support Vector Machine (SVM) at level 0

6:   K-Nearest Neighbors (KNN) at level 0

7:   Gradient Boosting (GB) at level 0

8: for n = 1 to N do

9:   Train the model fn using dataset

10: end for

11: Generate a new dataset based on the predictions

12: for j = 1 to m do

13:   Add (x′j, yj) to dataset Xf, where x′j = (f1(xj), …, fN(xj))

14: end for

15: Develop a meta-classifier:

16:   LR at the secondary level

17: Train the meta-classifier F using dataset Xf

18: Train the complete model

19: Fit the complete model using the Training set

20: Perform the final prediction

21: END

Fig 8 visualises the working of the proposed stacking model (RaSK_GraDeL), which takes the predictions of the base models as input and then gives the final prediction using the meta-model (LR).
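The two-layer RaSK_GraDeL structure described above can be sketched with scikit-learn's `StackingClassifier` (toy data; the hyper-parameters are illustrative defaults, not the study's tuned values):

```python
# Two-layer stacking: five base learners, logistic regression meta-learner.
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=7)),
        ("rf", RandomForestClassifier(random_state=7)),
        ("svm", SVC(kernel="linear")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("gb", GradientBoostingClassifier(random_state=7)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base predictions
    cv=5,  # base predictions fed to the meta-learner come from CV folds
).fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```

The `cv` argument makes the meta-learner train on out-of-fold base predictions, which reduces the leakage that would occur if the base learners predicted on their own training data.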

4 Result

4.1 Comparative analysis of individual datasets with proposed (DHT) dataset

We applied the machine learning algorithms to each dataset to check their performance and then compared the results with our own dataset. The comparative analysis on the basis of accuracy is shown in Table 6 below. As the table shows, the ensemble models (RaSK_GraDe and RaSK_GraDeL) give the highest accuracy on all four datasets. Furthermore, the proposed (DHT) dataset achieves the highest accuracy score among all the datasets.

Fig 9 shows the ROC_AUC curves for all four datasets along with the AUC scores. They clearly show that the ROC_AUC curves of the ensemble models (RaSK_GraDe and RaSK_GraDeL) are better than those of all the base models. Fig 10 shows the confusion matrices of all models for the proposed (DHT) dataset. The confusion matrices for the ensemble models (RaSK_GraDe and RaSK_GraDeL) show that they give highly accurate predictions.

thumbnail
Fig 10. Confusion Matrices of proposed (DHT) Dataset.

https://doi.org/10.1371/journal.pone.0327661.g010

4.2 Performance evaluation metrics

To determine the best-fitting ML model out of all the classifiers that were used, accuracy and additional statistical evaluation indicators were taken into account. Every applicable supervised machine learning classifier was compared with the others based on the standards employed to assess their efficiency. The most common methods for evaluating machine learning models are Precision, Recall, F1_Score, and Accuracy, which are derived from a confusion matrix. The ratio of correctly classified samples to all samples is known as classification accuracy. When the target feature categories in the data are fairly balanced, accuracy is a useful statistic [56].

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13} \]

To assess each algorithm’s performance more precisely, a number of additional evaluation measures were taken into account besides accuracy: Recall, Precision, the F1-measure, and the ROC curve. The area under the ROC curve (AUC) takes a value between 0 and 1. Recall quantifies how many of the actual positive outcomes a machine learning (ML) system identifies [57]. The F1-score is the harmonic mean of precision and recall [57]. Precision is the ratio of true positives to all predicted positives [57]. In the equations, TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative, respectively.

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{14} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{15} \]

\[ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{16} \]

\[ \mathrm{AUC} = \int_0^1 \mathrm{TPR}\; d(\mathrm{FPR}) \tag{17} \]
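A worked example of accuracy, precision, recall, and F1 computed from confusion-matrix counts (the counts are illustrative, not the study's results):

```python
# Metrics derived from illustrative confusion-matrix counts.
TP, FP, TN, FN = 80, 10, 95, 15

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.875 0.889 0.842 0.865
```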

Fig 11 shows the comparative analysis of the proposed (DHT) dataset with the other three datasets in terms of Accuracy, Precision, Recall, and F1_Score. As can be seen from Fig 11, both ensemble techniques give the highest results among all the machine learning models. Fig 11 also depicts that the proposed (DHT) dataset yields higher results compared to the PIMA, FHD, and RTML_I datasets.

thumbnail
Fig 11. Comparative analysis of the proposed (DHT) dataset with other datasets in terms of performance metrics.

https://doi.org/10.1371/journal.pone.0327661.g011

The performance of all machine learning models and ensemble models using class balancing and hyperparameter tuning is shown in Tables 7, 8, 9, and 10. The Diabetes Health Tracer (DHT) dataset consistently produced the top results for all performance criteria, demonstrating the usefulness of ensemble approaches in diabetes prediction as well as the promise of the combined dataset. These tables clearly state that our ensemble models RaSK_GraDe and RaSK_GraDeL performed better on the larger data than on the smaller data; the tables also report the other performance metrics, namely Precision, Recall, and F1_Score.

4.3 Hyperparameter tuning

During the model setup of the machine learning algorithms, certain strategies are employed to set the parameter values in the most appropriate manner. The random search method was used to identify the most appropriate values for the hyperparameter settings on these datasets. Table 11 provides the hyperparameter values for the machine learning methods used with the diabetes dataset.

thumbnail
Table 11. Hyperparameter tuning of models on (DHT) dataset.

https://doi.org/10.1371/journal.pone.0327661.t011

4.3.1 Random search.

We employed random search as our strategy. Given that the hyperparameter combinations are sampled at random, it is an excellent option for large search spaces. Grid search, on the other hand, increases the computational cost by evaluating every conceivable combination [58]. The mathematical expression of random search is given by equations (18) and (19). In equation (18), S is an n-dimensional feasible region, x is a vector, and f is a real-valued function defined over S. The objective is to find an x value in S that minimises f. The global optimal solution and its objective value are denoted by x* and y*.

\[ y^{*} = \min_{x \in S} f(x) \tag{18} \]

\[ x^{*} = \arg\min_{x \in S} f(x) \tag{19} \]

Algorithm 5 presents the working of random search using cross validation techniques.

Algorithm 5. Random search algorithm.

Require: y is the new sample point; x is a random position.

1: Start

2: Let x be a random position in the search space.

3: while termination requirement not satisfied do

4:   Generate a new location y by sampling the hyper-sphere with a given radius around the current point x.

5:   if f(y) < f(x) then

6:    Update x = y to move to the new location.

7:   end if

8: end while

9: END
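Randomised search with cross-validation, as described above, can be sketched with scikit-learn's `RandomizedSearchCV` (toy data; the parameter grid is illustrative, not the grid reported in Table 11):

```python
# Randomised hyper-parameter search with 3-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=8, random_state=3)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 4, 8],
}

# Only n_iter random combinations are evaluated instead of the full grid,
# which keeps the computational cost low on large search spaces.
search = RandomizedSearchCV(RandomForestClassifier(random_state=3),
                            param_distributions, n_iter=5, cv=3,
                            random_state=3).fit(X, y)
print(search.best_params_)
```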

4.4 SHapley Additive exPlanations (SHAP)

SHapley Additive exPlanations (SHAP) is a visualization tool used to improve the interpretability of the output produced by ML models. By calculating the relative contribution of each feature to the forecast, it can be used to explain the prediction of any model. For SHAP to work, a model’s output is split into the sum of the impacts of each of its features. The value that SHAP yields represents the contribution of each feature to the model result. These values can be used to understand the significance of each feature and to explain the model’s output, which particularly benefits businesses and teams who answer to clients or management [59].
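The additive decomposition can be illustrated exactly for a linear model, where the SHAP value of feature i reduces to w_i · (x_i − E[x_i]) (pure NumPy; the model and data are illustrative, not the study's ensembles):

```python
# SHAP's additive property on a linear model: per-feature contributions
# sum to the gap between this prediction and the average prediction.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w, b = np.array([2.0, -1.0, 0.5]), 0.3

def f(X):
    return X @ w + b   # a linear "model"

x = X[0]                               # the instance to explain
shap_values = w * (x - X.mean(axis=0)) # exact SHAP values for a linear model

# The SHAP values sum to f(x) minus the average model output.
print(np.isclose(shap_values.sum(), f(x) - f(X).mean()))  # True
```

For the tree ensembles used in this study, the `shap` library computes the analogous per-feature contributions numerically.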

In Fig 14, the feature names are arranged from top to bottom along the Y-axis. The SHAP value, which represents the amount of change in log odds, is displayed on the X-axis. Each point on the graph is coloured to indicate the value of the relevant attribute: red denotes high values and blue denotes low values. Each point represents one row of data from the original dataset. Age, Glucose, Skin Thickness, and Insulin are typically high and have positive SHAP values, indicating that they have a beneficial impact on the result.

thumbnail
Fig 12. SHAP value impact on the Proposed (DHT) dataset using summary plot of ensemble model 1 (a).

(RaSK_GraDe), and (b). (RaSK_GraDeL)

https://doi.org/10.1371/journal.pone.0327661.g012

thumbnail
Fig 13. SHAP value impact on the Proposed (DHT) dataset using bar plot of ensemble model 1 (a).

(RaSK_GraDe), and (b). (RaSK_GraDeL)

https://doi.org/10.1371/journal.pone.0327661.g013

thumbnail
Fig 14. SHAP value impact on the Proposed (DHT) dataset using (RaSK_GraDe).

https://doi.org/10.1371/journal.pone.0327661.g014

Similarly, Fig 15 shows the same features ranked highest, which means that both RaSK_GraDe and RaSK_GraDeL depend significantly on these four features.

thumbnail
Fig 15. SHAP value impact on the Proposed (DHT) dataset using (RaSK_GraDeL).

https://doi.org/10.1371/journal.pone.0327661.g015

Figs 14 and 15 display the dependence of all features on the Outcome variable. We can see from Figs 12 and 13 that the Glucose and Insulin characteristics have a high impact on the result: as their values increase in combination, their impact on the result also increases. This suggests that higher values of both features drive the model’s forecast.

5 Discussion

Although a great deal of research has been done on diabetes prediction, there is still room for further development in this area. As previously mentioned in the dataset information section, we use a merged dataset to predict diabetes in our work. We pre-processed the dataset once it was collected to prepare it for further analysis. To predict diabetes, we used five supervised machine learning algorithms, RF, SVM, KNN, GB, and DT, as base models for the ensemble models (RaSK_GraDe and RaSK_GraDeL). After applying the ML techniques, we evaluated the outcomes using various performance metrics, including accuracy, precision, recall, and F1-Score. The dataset we have used has never been used in any other study before. Table 8 shows that our suggested ensemble models (RaSK_GraDe and RaSK_GraDeL) are quite good at predicting diabetes based solely on demographic characteristics. Furthermore, compared to the existing models, the suggested ensemble models (RaSK_GraDe and RaSK_GraDeL) are supported by additional validation metrics. The suggested ensemble models have real-world applications in a variety of areas, including community health programs, telehealth and remote monitoring, personalised diabetes prevention plans, early diabetes risk assessment, and others. Furthermore, this research will aid the development of personalized treatment and apps for managing diabetes. In summary, this study shows a wide range of possible practical uses in the medical field.

6 Conclusion and future directions

Diabetes remains a major global health concern. Beyond traditional clinical testing, data mining and machine learning are increasingly being used for early prediction. This study aimed to develop an automated model for diabetes prediction using five machine learning classifiers—Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Gradient Boosting (GB), and Decision Tree (DT)—alongside two ensemble models, RaSK_GraDe and RaSK_GraDeL.

The ensemble models outperformed individual classifiers, achieving accuracies of 98.03% and 98.55%, respectively, making them the most effective for diabetes prediction. The study also emphasized the importance of hyperparameter tuning via Random Search, as optimal hyperparameters significantly enhance model performance and generalization.

Future work should explore advanced ensemble techniques, deep learning approaches, dynamic hyperparameter tuning, and the integration of domain knowledge. Additionally, developing a user-friendly web and Android application for real-time diabetes prediction is recommended to support practical deployment.

References

  1. 1. Azeem S, Khan U, Liaquat A. The increasing rate of diabetes in Pakistan: a silent killer. Ann Med Surg (Lond). 2022;79:103901. pmid:35860160
  2. 2. Banerjee AT, Shah BR. Differences in prevalence of diabetes among immigrants to Canada from South Asian countries. Diabet Med. 2018;35(7):937–43. pmid:29663510
  3. 3. Kumar A, Gangwar R, Zargar AA, Kumar R, Sharma A. Prevalence of diabetes in India: a review of IDF Diabetes Atlas 10th Edition. Curr Diabetes Rev. 2024;20(1):e130423215752. pmid:37069712
  4. 4. Doğru A, Buyrukoğlu S, Arı M. A hybrid super ensemble learning model for the early-stage prediction of diabetes risk. Med Biol Eng Comput. 2023;61(3):785–97. pmid:36602674
  5. 5. Federation I. IDF diabetes atlas, 10th ed. International Diabetes; 2021.
  6. 6. Sun H, Saeedi P, Karuranga S, Pinkepank M, Ogurtsova K, Duncan BB, et al. IDF Diabetes Atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Res Clin Pract. 2022;183:109119. pmid:34879977
  7. 7. Olansky L, Kennedy L. Finger-stick glucose monitoring: issues of accuracy and specificity. Diabetes Care. 2010;33(4):948-9. pmid:20351231
  8. 8. Buse JB, Wexler DJ, Tsapas A, Rossing P, Mingrone G, Mathieu C, et al. 2019 Update to: Management of Hyperglycemia in Type 2 Diabetes 2018 . A Consensus Report by the American Diabetes Association (ADA) and the European Association for the Study of Diabetes (EASD). Diabetes Care. 2020;43(2):487–93. pmid:31857443
  9. 9. Langendam M, Luijf YM, Hooft L, Devries JH, Mudde AH, Scholten RJPM. Continuous glucose monitoring systems for type 1 diabetes mellitus. Cochrane Database Syst Rev. 2012;1(1):CD008101. pmid:22258980
  10. 10. Choleau C, Klein JC, Reach G, Aussedat B, Demaria-Pesce V, Wilson GS, et al. Calibration of a subcutaneous amperometric glucose sensor. Part 1. Effect of measurement uncertainties on the determination of sensor sensitivity and background current. Biosens Bioelectron. 2002;17(8):641–6. pmid:12052349
  11. 11. Sidana K. Prediction of Diabetes using Machine Learning Algorithms. 2023 11th International Conference on Internet of Everything, Microwave Engineering, Communication and Networks (IEMECON). 2023. pp. 1–6.
  12. 12. Modak SKS, Jha VK. Diabetes prediction model using machine learning techniques. Multimed Tools Appl. 2024;83(13):38523–49.
  13. 13. Oladimeji OO, Oladimeji A, Oladimeji O. Classification models for likelihood prediction of diabetes at early stage using feature selection. Appl Comput Inform. 2024;20(3/4):279–86.
  14. 14. Sneha N, Gangil T. Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data. 2019;6(1):1–19.
  15. 15. Kumar PM, Haswanth KVS, Swaroop GM, Priyadarsini MJP. Diabetes prediction using different ensemble learning classifiers in machine learning. Int J Res Appl Sci Eng Technol. 2022.
  16. 16. Mahboob Alam T, Iqbal MA, Ali Y, Wahab A, Ijaz S, Imtiaz Baig T. A model for early prediction of diabetes. Inform Med Unlocked. 2019;16(1):1–6.
  17. 17. Yadav P, Jahan S, Shah K, Peter OJ, Abdeljawad T. Fractional-order modelling and analysis of diabetes mellitus: Utilizing the Atangana-Baleanu Caputo (ABC) operator. Alex Eng J. 2023;81:200–9.
  18. 18. Perveen S, Shahbaz M, Saba T, Keshavjee K, Rehman A, Guergachi A. Handling irregularly sampled longitudinal data and prognostic modeling of diabetes using machine learning technique. IEEE Access. 2020;8:21875–85.
  19. 19. Manarvi IA, Matta NM, Yassin A. Investigating the Practices of Patients and Hospitals in Treatment of Diabetes - A Survey Questionnaire for Arabic Speaking Countries. Curr Diabetes Rev. 2018;14(5):451–7. pmid:28748753
  20. 20. Larabi-Marie-Sainte S, Aburahmah L, Almohaini R, Saba T. Current techniques for diabetes prediction: review and case study. Appl Sci. 2019;9(21):4604.
  21. 21. Shafi S, Ansari GA. Early prediction of diabetes disease & classification of algorithms using machine learning approach. In:Proceedings of the International Conference on Smart Data Intelligence (ICSMDI 2021); 2021.
  22. 22. Reshmi S, Biswas SK, Boruah AN, Thounaojam DM, Purkayastha B. Diabetes Prediction Using Machine Learning Analytics.2022 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COM-IT-CON). 2022;1:108–12.
  23. 23. Febrian ME, Ferdinan FX, Sendani GP, Suryanigrum KM, Yunanda R. Diabetes prediction using supervised machine learning. Procedia Comput Sci. 2023;216:21–30.
  24. 24. Ram AR, Vishwakarma H. Diabetes Prediction using Machine learning and Data Mining Methods. IOP Conference Series: Materials Science and Engineering. 2021. pp. 1116.
  25. 25. Islam MT, Raihan M, Akash SRI, Farzana FD, Aktar N. Diabetes Mellitus Prediction Using Ensemble Machine Learning Techniques. Advances in Computational Intelligence, Security and Internet of Things. 2019.
  26. 26. Abdollahi J, Nouri Moghaddam B, Mirzaei A. Diabetes data classification using deep learning approach and feature selection based on genetic. 2023.
  27. 27. Kırğıl ENH, Erkal B, Ayyildiz TE. Predicting Diabetes Using Machine Learning Techniques. 2022 International Conference on Theoretical and Applied Computer Science and Engineering (ICTASCE). 2022; pp. 137–41.
  28. 28. Sarwar MA, Kamal N, Hamid W, Shah MA. Prediction of Diabetes Using Machine Learning Algorithms in Healthcare.2018 24th International Conference on Automation and Computing (ICAC). 2018; pp. 1–6.
  29. 29. Assegie TA, Suresh T, Purushothaman R, Ganesan S, Kumar NK. Early prediction of gestational diabetes with parameter-tuned K-Nearest Neighbor Classifier. J Robot Control. 2023;4(4):452–7.
  30. 30. Lyngdoh AC, Choudhury NA, Moulik S. Diabetes Disease Prediction Using Machine Learning Algorithms. 2020 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES). 2021; pp. 517–21.
  31. 31. Rastogi R, Bansal M. Diabetes prediction model using data mining techniques. Measurement: Sensors. 2023;25:100605.
  32. 32. Krishnamoorthi R, Joshi S, Almarzouki HZ, Shukla PK, Rizwan A, Kalpana C, et al. A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques. J Healthc Eng. 2022;2022:1684017. pmid:35070225
  33. 33. Kangra K, Singh J. Comparative analysis of predictive machine learning algorithms for diabetes mellitus. Bull Electr Eng Inform. 2023;12(3):1728–37.
  34. 34. Theerthagiri P, Ruby U. Diagnosis and Classification of the Diabetes Using Machine Learning Algorithms; 2021. Available from: https://api.semanticscholar.org/CorpusID:249888003
  35. 35. Repository UML. PIMA Indian Diabetes Dataset; 1990. https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
  36. 36. Maretva L. Frankfurt. Hospital Diabetes Dataset with LGBM Classifier. 2024. https://www.kaggle.com/code/linggarmaretva/frankfurt-hospital-diabetes-with-lgbmclassifier
  37. 37. Nabil T. RTML with Insulin Dataset; 2021. https://github.com/tansin-nabil/Diabetes-Prediction-Using-Machine-Learning/blob/main/RTML%20with%20Insulin.csv
  38. 38. Noman M. Diabetes Health Tracer Dataset; 2024. https://github.com/MuhammadNoman2/Diabetes-Health-Tracer-DHT-dataset
  39. 39. Iliou T, Anagnostopoulos CN, Nerantzaki M, Anastassopoulos G. A novel machine learning data preprocessing method for enhancing classification algorithms performance. In: Proceedings of the 16th International Conference on Engineering Applications of Neural Networks (INNS). 2015. pp. 1–5.
  40. 40. Yang J, Rahardja S, Fränti P. Outlier detection: how to threshold outlier scores? In: Proceedings of the international conference on artificial intelligence, information processing and cloud computing; 2019. pp. 1–6.
  41. 41. Pandey A, Jain A. Comparative analysis of KNN algorithm using various normalization techniques. Int J Comput Netw Inform Secur. 2017;11(11):36.
  42. 42. Chawla NV. Data mining for imbalanced datasets: An overview. Data mining and knowledge discovery handbook. 2010. pp. 875–86.
  43. 43. Wang J, Xu M, Wang H, Zhang J. Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In: 2006 8th international Conference on Signal Processing. vol. 3. IEEE; 2006.
  44. 44. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32(18):2847–9. pmid:27207943
  45. 45. Sheth V, Tripathi U, Sharma A. A comparative analysis of machine learning algorithms for classification purpose. Procedia Comput Sci. 2022;215:422–31.
  46. 46. Azar AT, Elshazly HI, Hassanien AE, Elkorany AM. A random forest classifier for lymph diseases. Comput Methods Programs Biomed. 2014;113(2):465–73. pmid:24290902
  47. 47. Song Y-Y, Lu Y. Decision tree methods: applications for classification and prediction. Shanghai Arch Psychiatry. 2015;27(2):130–5. pmid:26120265
  48. 48. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22.
  49. 49. Zhang Y. Support vector machine classification algorithm, its application. In: Information Computing, Applications: Third International Conference and ICICA 2012, Chengde, China, September 14-16, 2012. Proceedings, Part II 3. Springer; 2012. pp. 179–86.
  50. 50. Hanif M, Shahzad MK, Mehmood V, Saleem I. EPFG: electricity price forecasting with enhanced Gans neural network. IETE J Res. 2023;69(9):6473–82.
  51. 51. Uddin S, Haque I, Lu H, Moni MA, Gide E. Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci Rep. 2022;12(1):6256. pmid:35428863
  52. 52. Bentéjac C, Csörgő A, Martfnez-Muñoz G. A comparative analysis of gradient boosting algorithms. Artif Intell Rev. 2021;54:1937–67.
  53. 53. Chakir O, Rehaimi A, Sadqi Y, Abdellaoui Alaoui EA, Krichen M, Gaba GS, et al. An empirical assessment of ensemble methods and traditional machine learning techniques for web-based attack detection in industry 5.0. J King Saud Univ - Comput Inform Sci. 2023;35(3):103–19.
  54. 54. Batool A, Byun YC. Towards improving breast cancer classification using an adaptive voting ensemble learning algorithm. IEEE Access. 2024.
  55. 55. Shafieian S, Zulkernine M. Multi-layer stacking ensemble learners for low footprint network intrusion detection. Complex Intell Syst. 2023;9(4):3787–99.
  56. 56. Sanni RR, Guruprasad H. Analysis of performance metrics of heart failured patients using Python and machine learning algorithms. Glob Transitions Proc. 2021;2(2):233–7.
  57. 57. Erickson BJ, Kitamura F. Magician’s corner: 9. Performance metrics for machine learning models. Radiol Artif Intell. 2021;3(3):e200126.
  58. 58. Nayyer N, Javaid N, Akbar M, Aldegheishem A, Alrajeh N, Jamil M. A new framework for fraud detection in bitcoin transactions through ensemble stacking model in smart cities. IEEE Access. 2023.
  59. 59. Noor A, Javaid N, Alrajeh N, Mansoor B, Khaqan A, Bouk SH. Heart disease prediction using stacking model with balancing techniques and dimensionality reduction. IEEE Access. 2023.