Predicting long-term type 2 diabetes with support vector machine using oral glucose tolerance test


Abstract
Diabetes is a large healthcare burden worldwide. There is substantial evidence that lifestyle modifications and drug intervention can prevent diabetes; therefore, early identification of high-risk individuals is important for designing targeted prevention strategies. In this paper, we present an automatic tool that uses machine learning techniques to predict the development of type 2 diabetes mellitus (T2DM). Data generated from an oral glucose tolerance test (OGTT) were used to develop a predictive model based on the support vector machine (SVM). We trained and validated the models using the OGTT and demographic data of 1,492 healthy individuals collected during the San Antonio Heart Study. This study collected plasma glucose and insulin concentrations before glucose intake and at three time-points thereafter (30, 60 and 120 min). Personal information such as age, ethnicity and body-mass index was also part of the data-set. Using 11 OGTT measurements, we derived 61 features, ranked them with the minimum redundancy maximum relevance feature selection algorithm, and shortlisted the top ten. All possible combinations of the 10 best ranked features were used to generate SVM based prediction models. This research shows that an individual's plasma glucose levels, and the information derived therefrom, have the strongest predictive performance for the future development of T2DM. Significantly, insulin and demographic features do not provide additional performance improvement for diabetes prediction. The results of this work identify the parsimonious clinical data that need to be collected for an efficient prediction of T2DM. Our approach shows an average accuracy of 96.80% and a sensitivity of 80.09% obtained on a holdout set.

Introduction
The number of people living with diabetes worldwide was estimated at 422 million in 2014, and its prevalence among the adult population increased from 4.7% in 1980 to 8.5% in 2014 [1]. In 2015 alone, about 1.6 million deaths worldwide were attributed to diabetes. In addition to the high mortality rate, an individual with diabetes is at a greater risk of developing cardiovascular disease (CVD), visual impairment and limb amputations than a non-diabetic individual. Due to the substantial socio-economic burdens associated with diabetes, its early detection, prevention, and management have become a top-level health concern worldwide.
There is experimental evidence that the development of diabetes can be delayed or even prevented provided an individual undertakes a lifestyle change that includes diet management, adopting exercise, and adhering to a pharmacological treatment [2]. The early identification of individuals at high risk of diabetes is therefore essential for targeted prevention strategies [3]. Even though the number of clinical studies aimed at diagnosing diabetes has been growing recently, studies predicting the risk of developing diabetes are limited. This subject has lately received an increased amount of research interest [4]. However, the clinical significance of such predictions largely depends on the type and quality of the data collected. There are studies that assign a probability to the future risk of diabetes using socio-demographic characteristics such as age, ethnicity, body-mass index (BMI) and genealogical information collected across the population [5,6]. Because this kind of data collection is unreliable, such techniques can be misleading. The collection of blood samples, on the other hand, provides more reliable data and is a first step towards a disease prognosis with deeper clinical insight [7]. The oral glucose tolerance test (OGTT) is commonly used to screen for diabetes [8] and to provide a critical understanding of its future evolution [9]. In an OGTT, the plasma glucose and insulin levels are measured at regular intervals over a 2-hr period after orally administering a standard dose of glucose [9]. Glucose tolerance and insulin resistance are two of the most significant parameters deduced from the OGTT, and they are widely regarded as the major factors in the development of type 2 diabetes mellitus (T2DM).
A precursory stage of diabetes, commonly referred to as prediabetes, exists before overt T2DM, and is described by impaired fasting glucose (IFG), along with impaired glucose tolerance (IGT). According to the World Health Organization (WHO) diagnostic criteria, IFG is defined as a fasting plasma glucose level of 100 mg/dL to 125 mg/dL. IGT, which describes an abnormally raised glucose level, is defined as a 2-hour plasma glucose level in the range of 140 mg/dL to 199 mg/dL, measured during the OGTT [10]. Although prediabetes is considered an intermediate stage in the natural progression of T2DM [11], it has been reported that only 50% of the subjects diagnosed with IGT developed diabetes within 10 years [12,13]. Moreover, long-term population studies have also shown that around 50% of diabetic patients did not exhibit IGT at any time prior to the diagnosis [14]. This suggests that the fasting and 2-hour plasma glucose levels, used in and of themselves, cannot accurately predict the future development of T2DM.
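The thresholds quoted above can be expressed as a small rule-based check. This is only an illustrative sketch: the function names are hypothetical, the ranges are taken as inclusive, and a subject is treated as prediabetic when either impairment is present, which is an assumption about how the two criteria are combined.

```python
def glucose_status(fasting_mg_dl, two_hour_mg_dl):
    """Return the set of impairments implied by the thresholds in the text."""
    status = set()
    # IFG: fasting plasma glucose of 100 to 125 mg/dL (assumed inclusive).
    if 100 <= fasting_mg_dl <= 125:
        status.add("IFG")
    # IGT: 2-hour plasma glucose of 140 to 199 mg/dL (assumed inclusive).
    if 140 <= two_hour_mg_dl <= 199:
        status.add("IGT")
    return status


def is_prediabetic(fasting_mg_dl, two_hour_mg_dl):
    """Assumption: either IFG or IGT is enough to flag prediabetes."""
    return bool(glucose_status(fasting_mg_dl, two_hour_mg_dl))
```

Such a rule looks at only two of the OGTT measurements, which is the limitation the paragraph above points to.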
The availability of big data in the healthcare sector has made machine learning (ML) a viable instrument for disease prediction [15][16][17][18]. In contrast to traditional diagnostic techniques employing population based statistics, ML methods develop models that are trained using large amounts of data. In a pilot study, Maeta et al developed an ML algorithm to predict the risk of developing glucose metabolism disorder using the OGTT data [19]. Barakat et al used socio-demographic information, and point-of-care testing from blood and urine, to develop diagnostic models of diabetes [20]. This approach uses a support vector machine (SVM) along with a rule-based explanation to make the results comprehensible to clinicians. The plasma glucose levels at baseline and 2-hr were among the features used. Han et al employed ensemble SVM and random forest learning approaches to develop a decision making algorithm for the diagnosis of diabetes [21]. However, investigations that are designed to identify individuals at high risk of developing T2DM in the long-term future are limited. The San Antonio diabetes prediction model (SADPM) [22] uses a logistic regression supported by physiological parameters such as systolic blood pressure and cholesterol level. The underlying causes of T2DM in the form of insulin resistance and insulin secretion were studied to develop a prediction model in [14]. In another study, multivariate logistic models using the plasma glucose values measured in the OGTT were used to predict the future risk of developing T2DM [23,24]. The predictive power of different bio-markers such as the fasting plasma glucose level, BMI and haemoglobin A1C (HbA1c) for T2DM onset was assessed in [25]. This study focused on individuals with metabolic syndrome, a complex and serious health condition that greatly increases the risk of CVD and diabetes.
The standard ML algorithms are designed to yield optimal performance in terms of accuracy over the full data-set. However, medical applications such as the diagnosis and prediction of a disease require a biased decision-making mechanism that favours one of the classes. The objective in such applications is therefore to design a classifier that improves the accuracy of the class that is more relevant in clinical terms. Additionally, the data are often highly skewed, with the clinically relevant class in a small minority. There are various ways to obtain accurate classifier performance in this scenario. One is sampling [26], in which the class distribution is artificially balanced by under-sampling the majority class, over-sampling the minority class, or both. Furthermore, feature weighting schemes assign distinct costs to training examples [27] in order to introduce a certain bias. Other techniques introduce evaluation metrics such as the geometric mean (g-mean) [28], which concurrently optimises the positive class accuracy (sensitivity) and the negative class accuracy (specificity) [29].
We hypothesised that the features extracted from the OGTT would be able to predict the future onset of T2DM. In this paper, we therefore propose a screening tool that identifies the most relevant features extracted from the OGTT data that strongly correlate with the future development of T2DM. We then use an SVM to develop a prediction model by utilising these relevant features estimated from the longitudinal cohort study, the San Antonio Heart Study (SAHS) [30,31].
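To make the contrast between accuracy and the g-mean concrete, the following sketch computes both from confusion-matrix counts. The function names and the example counts are hypothetical and chosen only for illustration; the code assumes that both classes are present, so that neither denominator is zero.

```python
import math


def sensitivity(tp, fn):
    """Positive class accuracy: fraction of positives that are detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Negative class accuracy: fraction of negatives correctly rejected."""
    return tn / (tn + fp)


def accuracy(tp, fn, tn, fp):
    """Overall fraction of correct decisions across both classes."""
    return (tp + tn) / (tp + fn + tn + fp)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


# Hypothetical skewed test set: 10 positives and 90 negatives.
# A classifier that labels everyone as negative still looks accurate,
# but its g-mean collapses to zero because no positive is detected.
trivial_accuracy = accuracy(tp=0, fn=10, tn=90, fp=0)
trivial_g_mean = g_mean(tp=0, fn=10, tn=90, fp=0)
```

Because the g-mean is a product of the two class-wise accuracies, it cannot be raised by favouring the majority class alone, which is why it suits skewed data.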

San Antonio Heart Study
The SAHS is a population-based epidemiological study that was conducted to assess the risk factors of diabetes and cardiovascular diseases in a healthy population [30,31]. In total, 5,158 Mexican American (MA) and non-Hispanic White (NHW) men and non-pregnant women residing in San Antonio, Texas participated in the study in two cohorts. The age of individuals at the time of recruitment was between 25 and 64 years. As a part of the data collection, plasma glucose and serum insulin concentrations were collected during the OGTT at the baseline and after an average follow-up of 7.5 years. The BMI was also recorded for each individual at the baseline. In this study, we analysed only the data generated from the second cohort of the SAHS, which comprised 1,492 subjects.
T2DM was diagnosed at the follow-up using the WHO criteria, i.e. fasting glucose level ≥ 126 mg/dL or 2-hr glucose level ≥ 200 mg/dL [10]. Furthermore, all individuals taking anti-diabetic medications were also classified as having T2DM. Individuals who self-reported any cardiovascular event, such as a heart attack, stroke or angina, were labelled as having CVD at the follow-up. All other participants without T2DM or self-reported CVD were labelled as healthy for the purposes of this study. During the course of this longitudinal study, a total of 171 individuals developed T2DM, with 10 of them also reporting at least one cardiovascular event. The incidence rate of T2DM in the second cohort of the SAHS population was 10.79%. Table 1 shows the population distribution in terms of the four classes. The distribution in terms of ethnicity shows that the T2DM prevalence among MA individuals was more than double that of the NHW population.
The data used in this study consists of plasma glucose and serum insulin concentrations sampled at the baseline, and at 30, 60 and 120 min thereafter. The individuals are labelled at the SAHS follow-up using the current standard of care [30]. Fig 1 shows the distributions of the data used in this study.

Machine learning framework
In this paper, we implemented an SVM to construct the models for the prediction of future T2DM. The SVM develops models from a given training data-set such that they generalise well to a new data-set while minimising the empirical risk associated with misclassification of samples in the training set [32,33]. A model constructed by the SVM minimises the overlap between classes in the training set by optimising the separating hyper-plane. For problems that may not be amenable to linear separation between the two classes, the SVM technique is very attractive because the input feature space can be transformed to a higher dimensional space in which a linear boundary can then be determined. This approach generally provides a better training performance, but the computational complexity can grow excessively as the dimensionality of the input feature space increases [34]. The introduction of a kernel removes the need to determine the transformation explicitly: the kernel evaluates the required inner product directly from the coordinates of the input feature space. In this paper, we used the Gaussian radial basis function (RBF) as the kernel. The performance of the SVM can be optimised by tuning the kernel's free parameter σ and specifying a cost that controls the rigidity of the class margin. This process is normally carried out through a grid search.
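As an illustration of the kernel idea, the sketch below evaluates a Gaussian RBF kernel between two feature vectors and builds a candidate grid of cost and σ values. The parameterisation exp(-||x - z||² / (2σ²)) is one common convention and is assumed here; the grid values and all names are hypothetical and are not the settings used in the study.

```python
import math
from itertools import product


def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel, assuming the form exp(-||x - z||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical logarithmic grids for the cost and the kernel width.
costs = [2.0 ** k for k in range(-3, 8, 2)]
sigmas = [2.0 ** k for k in range(-3, 4)]

# A grid search would train and score one model per (cost, sigma) pair.
parameter_grid = list(product(costs, sigmas))
```

The kernel value depends only on the distance between the two points in the input space, so the higher dimensional transformation never has to be computed explicitly.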

Feature extraction
We extracted all the features from the SAHS data acquired at the baseline. The data-set consists of plasma glucose and insulin concentrations recorded before glucose intake and at three time-points thereafter (30, 60, and 120 min). The labels (healthy and diabetes) were generated at the 7.5 years follow-up using the current standard of care diagnostics [30]. From the glucose and insulin concentrations, we computed the slope and area under the curve between all the possible combinations of a pair of measurements. In addition, we also calculated three empirical markers that describe the relationship between the glucose intake and insulin response. The first is the insulinogenic index (IGI) [35], which is a direct measure of the insulin response to glucose. It is calculated as the ratio of the slope of the insulin curve to the slope of the glucose curve between any two time points in the OGTT. The second marker, the Matsuda index (M), evaluates the insulin sensitivity from the OGTT using a product of the weighted averages of the glucose and insulin concentrations [36],

$$ M = \frac{10{,}000}{\sqrt{G_{0}\, I_{0} \cdot \dfrac{G_{0} + 2G_{30} + 2G_{60} + 2G_{90} + G_{120}}{8} \cdot \dfrac{I_{0} + 2I_{30} + 2I_{60} + 2I_{90} + I_{120}}{8}}} \qquad (1) $$

where G and I denote the glucose and insulin concentrations and the subscripts depict the time point of the OGTT. When the value at 90 min is not available, the average of the values at 60 and 120 min is used instead [36]. The third marker, homeostatic model assessment-insulin resistance (HOMA-IR) [37], evaluates the beta-cell function. It is defined as the product of the fasting plasma glucose concentration and the fasting blood insulin concentration divided by 22.5. These markers have been used to estimate abnormalities in the insulin sensitivity. A total of 61 features (illustrated in Fig 2) are used in this study. The prefix AuC denotes the area under the curve and the slope is denoted by the symbol Δ. The term T half represents the linearly interpolated value between any two intervals.
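The pairwise slope and area features, together with the IGI and HOMA-IR, might be computed along the following lines. This is a sketch under stated assumptions rather than the study's implementation: the function names are hypothetical, each curve is passed as a mapping from time in minutes to concentration, and the area between two non-adjacent time points is assumed to be the sum of the trapezoids over the sampled points in between.

```python
from itertools import combinations

TIMES = (0, 30, 60, 120)  # OGTT sampling times in minutes


def pairwise_slopes(values):
    """Slope between every pair of time points, keyed by (start, end)."""
    return {
        (start, end): (values[end] - values[start]) / (end - start)
        for start, end in combinations(TIMES, 2)
    }


def pairwise_auc(values):
    """Trapezoidal area between every pair of time points (assumed scheme)."""
    areas = {}
    for start, end in combinations(TIMES, 2):
        inside = [t for t in TIMES if start <= t <= end]
        areas[(start, end)] = sum(
            0.5 * (values[a] + values[b]) * (b - a)
            for a, b in zip(inside, inside[1:])
        )
    return areas


def insulinogenic_index(glucose, insulin, start, end):
    """Ratio of the insulin slope to the glucose slope between two time points."""
    glucose_change = glucose[end] - glucose[start]
    if glucose_change == 0:
        return float("nan")  # ratio is undefined for a flat glucose curve
    return (insulin[end] - insulin[start]) / glucose_change


def homa_ir(glucose, insulin):
    """Fasting glucose times fasting insulin divided by 22.5, as in the text."""
    return glucose[0] * insulin[0] / 22.5
```

With four sampling times there are six pairs, so each curve contributes six slopes and six areas under this scheme.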

Feature selection
Before constructing the SVM model to predict a future diabetes occurrence, we search for the most effective subset of features in terms of relevance to the classifier output, i.e. incidence of T2DM at the follow-up. As a first step, we selected the ten most relevant features from the 61 available features using the minimum redundancy maximum relevance (mRMR) algorithm [38], which selects the most relevant features with minimum correlation among them. The mRMR algorithm determines the relevance between a feature (x as a continuous random variable) and the class label (y as a discrete random variable) in terms of the mutual information I, defined as [39]

$$ I(x, y) = -\sum_{i} p_{i} \log p_{i} \;-\; \sum_{j} p_{j} \log p_{j} \;+\; \sum_{i}\sum_{j} p_{ij} \log p_{ij} \qquad (2) $$

where p_i and p_j are the probabilities of the random variables x and y taking the particular values x_i and y_j ∈ {−1, 1} ∀ j, respectively. The term p_ij denotes the joint probability P{x = x_i, y = y_j}. The three terms in Eq (2) represent the continuous, discrete and joint entropies of the random variables, in that order. The features that are most relevant to the class label are the ones that maximise I. A heuristic approach is to keep only a single feature from a correlated set of features that provides similar relevance information, and to discard the remaining features. To ensure this, the mRMR algorithm minimises the mutual correlation among the features, expressed as a redundancy term R that is computed from the mutual information I, as defined in Eq (2), between pairs of features. This procedure, which yields maximum I with respect to the diabetic class along with minimal R, shortlists a set of ten features that are potentially strong predictors of the future development of T2DM.
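A greedy selection of this kind can be sketched as follows. The code is an assumption-laden illustration, not the implementation used in the study: it expects features that have already been discretised, estimates probabilities from counts, and scores each candidate by relevance minus mean redundancy (the difference form of mRMR), which the text does not specify. All names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits, with probabilities estimated from counts."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(x, y) as the two marginal entropies minus the joint entropy."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def mrmr_rank(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = mutual_information(features[name], labels)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[chosen])
            for chosen in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy case where one feature is an exact copy of an already selected one, the copy is ranked below a weaker but independent feature, which is the behaviour the redundancy term is meant to produce.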

Classification
We developed a supervised learning scheme using the baseline SAHS data-set and the labels (healthy, T2DM) obtained at the follow-up after an average of 7.5 years. In each experiment, we used a kernel-based binary SVM method to train, test and validate the performance of the diabetes prediction models. We excluded the 44 CVD entries, as this class was defined only by self-reporting and not by quantitative assessment. Furthermore, we also removed all entries with any information missing. That resulted in a total of 1,492 instances that were used in this study, out of which 171 were from the minority class and 1,321 were majority instances. As shown in Table 1, the SAHS data-set is intrinsically unbalanced, with the class distribution skewed toward the majority class with a ratio of 7.5:1. We considered the minority class of diabetic subjects as the positive class with a label of 1, whereas the majority class consisting of healthy persons was termed the negative class, marked by a '-1' label. To standardise the feature range prior to training, each feature was scaled to unit variance around its mean. To ensure that a model was unbiased, robust, and generalised well to new data, we performed 10-fold cross-validation (CV). We refrained from balancing the data-set to equalise the prevalence of the majority and minority classes, as we believe this measure artificially boosts the classification performance.
For each CV, we first randomly selected a hold-out set consisting of 11 minority and 83 majority instances. We evaluated each model 100 times, with the data randomly re-partitioned on each occasion. We compared the performances of linear and non-linear SVM for all 1,023 possible combinations of the 10 most relevant features, i.e. every non-empty subset of one to ten features. The optimal hyper-plane parameters of the kernel were determined through a grid search. To select the best feature set, we used the geometric mean of sensitivity and specificity [28]. All experiments were performed with in-house software developed in Matlab (v9.2.0, MathWorks Inc., Natick, Massachusetts, USA).
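The feature scaling step can be illustrated with a minimal z-score sketch. The function names are hypothetical, the sample standard deviation (n - 1 denominator) is an assumption, and fitting the scaler on the training portion only before applying it to held-out data is a common precaution assumed here rather than a detail stated in the text.

```python
import math


def fit_scaler(column):
    """Return the mean and sample standard deviation of one feature column."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
    return mean, math.sqrt(variance)


def apply_scaler(column, mean, std):
    """Centre on the mean and scale to unit variance; constant columns map to 0."""
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


# Hypothetical usage: fit on training values, then reuse on held-out values.
train_column = [1.0, 2.0, 3.0]
mean, std = fit_scaler(train_column)
scaled_train = apply_scaler(train_column, mean, std)
scaled_holdout = apply_scaler([4.0], mean, std)
```

Scaling matters for the RBF kernel in particular, because a feature with a large numeric range would otherwise dominate the distance between two samples.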
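The exhaustive search over feature combinations can be sketched as below. Ten features give 2^10 - 1 = 1,023 non-empty subsets, which matches the count in the text. The helper names are hypothetical, and `score_fn` stands in for whatever routine trains the SVM on a subset and returns its mean g-mean; it is not part of the original software.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset, from single features up to the full set."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def best_subset_per_size(features, score_fn):
    """For each subset size, keep the subset with the highest score."""
    best = {}
    for subset in all_feature_subsets(features):
        score = score_fn(subset)
        size = len(subset)
        if size not in best or score > best[size][1]:
            best[size] = (subset, score)
    return best


# Ten ranked features, represented here by their rank positions.
number_of_models = sum(1 for _ in all_feature_subsets(range(10)))
```

Keeping the best subset for each size mirrors the way the results are later reported, with one winning combination per number of features.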

Results and discussion
The mRMR algorithm produces a sequential list of ten ranked features, shown in Table 2. Apart from ethnicity (ranked fourth), all the features are derived from OGTT measurements. The list contains six features derived from plasma glucose concentrations, while only three features are deduced from insulin concentrations.
In all the classification experiments, we aimed to maximise the ability to correctly predict the diabetic class without compromising the classifier accuracy. The bar plots in Fig 3 show the g-mean of the sensitivity and specificity obtained from the linear and RBF kernels. For each number of features used, we selected the combination that generated the maximum g-mean.
All the results presented here are averaged over 100 iterations of the respective classifiers. The g-mean obtained from the linear SVM ranges from 0.8711 to 0.8742. As observed from  Notably, all four features are derived from the plasma glucose concentrations. We note that the glucose derived features are ranked the highest during feature selection. Moreover, combinations of glucose-only features generate the best SVM models when fewer than four features are used. The accuracy and sensitivity of the same feature combinations are separately shown in Fig 4. The best model, obtained using a combination of four glucose derived features and the RBF kernel, has an accuracy of 96.80% and a sensitivity of 80.09%. Table 3 presents a comparison of the generated SVM models with the results obtained in other studies using the SAHS data-set. We compared our results with the SADPM [22], in which a person's age, gender, ethnicity, fasting glucose level, family history, blood pressure, and cholesterol level were used to construct a logistic regression. It is notable that the SADPM has the highest sensitivity (88.80%); however, this comes with a very low accuracy of 56.33%. In [23], a two-step approach was introduced that first used the SADPM risk score and then augmented it with the 1-hour plasma glucose concentration measured in the OGTT. This strategy resulted in an improved accuracy, but the sensitivity dropped to 77.70%.
In the SAHS data-set, the prevalence of IFG and IGT was 8.91% (133 instances) and 22.52% (336 instances) respectively. Out of the 399 subjects diagnosed with prediabetes showing IFG or IGT at the baseline, only 120 (30.08%) actually developed diabetes between the baseline and the follow-up. Furthermore, 120 (25.67%) subjects diagnosed with diabetes at the follow-up did not show any symptoms of either IGT or IFG at the baseline. Our investigation shows that features derived from insulin have less predictive value for T2DM than glucose based features. Indices such as Matsuda and HOMA-IR, which are commonly used to assess the insulin function, also did not yield a high correlation with the future development of T2DM.

Conclusion
In this paper, we developed a non-linear SVM based prediction model that accurately identifies persons at a higher risk of developing T2DM in the future. To develop the model, we first assessed the predictive power of features that were derived from the OGTT data and were augmented by personal information such as age, ethnicity, and BMI. Using a feature selection algorithm, we demonstrated that the features deduced from the plasma glucose concentrations provide the optimal feature subset and have the strongest predictive power for the future development of T2DM. Moreover, the performance of the presented prediction model is significantly better in terms of combined accuracy and sensitivity than that of other T2DM prediction models. In order to address the unbalanced nature of the SAHS data-set, we chose the g-mean of sensitivity and specificity as the performance evaluation criterion. Our prediction model outperforms other similar models by more than 12% in terms of the g-mean of sensitivity and specificity. The mean accuracy, specificity and sensitivity achieved after 100 iterations were 96.80%, 99.02%, and 80.09%.
The principal contribution of this study is a T2DM prediction model based on features derived only from the plasma glucose concentrations measured during an OGTT. The findings of this paper provide a complementary and cost-effective tool for clinicians to screen individuals who are at an increased risk of developing T2DM in the future.
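As a rough consistency check on these figures, the accuracy implied by the reported mean sensitivity and specificity can be recomputed for a hold-out set of 11 positive and 83 negative instances, the composition described in the Classification section. The function names are hypothetical, and the g-mean computed from averaged rates is only indicative, since it need not equal the average of the per-iteration g-means.

```python
import math


def accuracy_from_rates(sens, spec, n_pos, n_neg):
    """Accuracy as the class-size weighted average of the two rates."""
    return (sens * n_pos + spec * n_neg) / (n_pos + n_neg)


def g_mean_from_rates(sens, spec):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sens * spec)


implied_accuracy = accuracy_from_rates(0.8009, 0.9902, n_pos=11, n_neg=83)
implied_g_mean = g_mean_from_rates(0.8009, 0.9902)
```

Under these assumptions the implied accuracy comes out at roughly 0.968, in line with the reported 96.80%, and the implied g-mean at roughly 0.89.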