Evaluation of the impact of body mass index on venous thromboembolism risk factors
PLOS ONE | https://doi.org/10.1371/journal.pone.0235007 | July 9, 2020

In this paper, we investigate how body mass index (BMI) interacts with the other important risk factors for venous thromboembolism (VTE), using deep venous thrombosis (DVT) patient data from the International Warfarin Pharmacogenetics Consortium (IWPC). We apply eight machine learning techniques: naive Bayes classifier (NB), support vector machine (SVM), elastic net regression (ENET), logistic regression (LR), lasso regression (LAR), multivariate adaptive regression splines (MARS), boosted regression tree (BRT) and random forest (RF). The RF method is selected as the best classification model. Out of 33 features considered in this study, we identify 12 variables as relatively important risk factors for VTE. Finally, we examine the interaction impacts of BMI on these important VTE risk factors. We conclude that the impacts of risk factors on VTE incidence vary across BMI groups, and that the size of this variation differs from one risk factor to another. Therefore, the interaction impacts of BMI on the other risk factors have to be taken into account in order to better understand the incidence of VTE.


Introduction
Venous thromboembolism (VTE), a term referring to blood clots in the veins, is a disorder that includes deep vein thrombosis (DVT) and pulmonary embolism (PE). It is the third most common vascular disorder in Caucasians after myocardial infarction and stroke [1,2]. VTE also causes morbidity and mortality in cancer patients and in patients after major surgery, especially hip or knee arthroplasty [3][4][5]. Warfarin is one of the main oral anticoagulant treatments used to prevent VTE [6,7], and it requires international normalized ratio (INR) monitoring.
One leading risk factor for VTE is obesity [1,3,8,9]. In particular, Yang et al. [10] classified obesity as a moderate risk factor for VTE incidence, and found that it can interact with other risk factors in VTE development and increase the risk of VTE. Obesity not only increases VTE incidence, but also causes other chronic diseases such as diabetes, hypertension, coronary heart disease, and ischemic stroke [10]. The prevalence of obesity has increased dramatically during the last decades [11,12]. According to the World Health Organization, at least 700 million adults and more than 20% of young children aged 6-11 years in the world were obese in 2015 [13]. Due to the high prevalence of obesity, there is increasing interest in measuring body fat. Body mass index (BMI), calculated as the ratio of an individual's weight (in kg) to the square of his/her height (in meters), is one of the most popular measures of body mass because of its simplicity. BMI, as an indicator of obesity, has been identified as a VTE risk factor by previous studies [14][15][16][17][18]. Previous studies have also found an effect of BMI on warfarin treatment. For example, Routledge et al. [19] identified body weight as one of the key factors affecting warfarin therapy. Tellor [20] showed that morbidly obese patients required a higher total weekly dose to maintain a therapeutic INR. Wallace et al. [21] compared the average warfarin dose needed to reach a therapeutic INR in hospital across patients with different BMI, and found that the warfarin response dose decreased as BMI increased. Wells et al. [22] predicted warfarin dose from BMI, age and some other predictors for patients with a history of VTE.
However, there is not enough evidence on the impact of BMI on VTE occurrence for warfarin-treated patients with a history of DVT, or on the interaction between BMI and other risk factors for VTE.
To assess VTE risk factors, previous studies have widely used traditional statistical techniques such as the linear regression model [17,22], the logistic regression model [5] and the Cox regression model [23]. Although these statistical techniques are powerful tools for prediction and description, various issues affect classification, i.e. prediction of discrete values, by traditional regression models [24,25]. In particular, highly correlated predictors and small sample sizes are two common statistical issues, which may lead to collinearity and over-fitting, respectively [26]. Machine learning (ML) techniques provide an alternative way to address these issues. Significant advantages of ML techniques include high power and accuracy, the ability to model non-linear effects, and the capacity to deal with large data sets [27,28].
The objective of an ML algorithm is to optimize its performance at a particular task using past experience (input data). Supervised classification techniques are ML algorithms that learn patterns in data to predict associated discrete classes [29]. In medical science, supervised classification techniques have been employed to identify risk factors for a specific disease or to predict the occurrence of a disease such as VTE. Among the large number of available supervised ML techniques, kernel machine learning [30], various decision trees [31,32], artificial neural networks [33][34][35], random forests [36,37], support vector machines [38,39], Bayesian decision rules [40,41], supervised principal component analysis [42], and penalized regression models [43] have been applied in medical science. Although the choice of ML technique is often based on minimizing a loss function, it is difficult to make an informed decision on the most appropriate method.
The main objective of this study is to investigate the interaction effect of BMI on the other risk factors for VTE. To fulfill this objective, we first answer two questions: (1) which ML algorithm is most suitable for classification, and (2) which variables are important risk factors for VTE. In particular, we use the DVT patient validation group of the International Warfarin Pharmacogenetics Consortium (IWPC) data, obtained from the PharmGKB website, to identify the risk factors for VTE, predict VTE occurrence, and examine the interaction impact of BMI on the other risk factors for VTE. We perform a two-step procedure to select important risk factors for VTE. First, we apply eight ML methods, including naive Bayes classifier (NB), support vector machine (SVM), elastic net regression (ENET), logistic regression model (LR), lasso regression (LAR), multivariate adaptive regression splines (MARS), boosted regression tree (BRT) and random forest classification (RF), and select the best classification method among them. Second, we use the selected classification model to identify the important risk factors for VTE. Finally, we examine the interaction impacts of BMI on the selected risk factors for VTE.

Methods
In this study, we apply eight supervised classification ML methods. We compare their performance and select the best model to identify the important risk factors for VTE. A brief introduction of each ML technique is given below.

Naïve Bayesian classifier
The Naïve Bayesian classifier (NB) is a simple probabilistic classifier based on Bayes' theorem [29], which was introduced by Maron [44]. NB assumes conditional independence between every pair of predictors given the class [45]. It predicts membership probabilities for each class, i.e. the probability that a given record or data point belongs to a particular class. The class with the highest posterior probability is taken as the predicted class, which is known as the maximum a posteriori rule [46].
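The decision rule described above can be written compactly. The notation is introduced here for illustration and is not taken from the paper: c denotes a class (VTE or no VTE), x_j the j-th predictor, and p the number of predictors.

```latex
\hat{y} \;=\; \arg\max_{c} \; P(c) \prod_{j=1}^{p} P(x_j \mid c)
```

The product form is a direct consequence of the conditional independence assumption; without it, the joint likelihood P(x_1, ..., x_p | c) would not factorize.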

Support Vector Machine (SVM)
Introduced by Vapnik [47], the Support Vector Machine (SVM) is based on inductive learning [48]. SVM is used in both classification and regression. It uses the principle of the maximum margin classifier to separate data. For d-dimensional data, SVM uses a (d-1)-dimensional hyperplane for data separation. The advantages of SVM lie in identifying nonlinear effects from small samples and high dimensional data [49]. In addition, its training problem is solved by a global optimization, hence SVM is not prone to getting trapped in a local optimum [50].

Elastic-net regression
Elastic-net regression (ENET), introduced by Zou [51], combines the lasso regression (LAR) and the ridge regression model. It penalizes both the L1 and L2 norms of the coefficients, each with its own tuning parameter, in order to combine the strengths of LAR and ridge regression. ENET is robust to extreme correlations among the predictors [52].
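As a sketch of the penalized objective described above, the following uses notation introduced here for illustration: β is the coefficient vector, ℓ(β) is the unpenalized loss (for a binary outcome, typically the negative log-likelihood), and λ_1, λ_2 ≥ 0 are the two tuning parameters.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \Big\{ \ell(\beta) + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \Big\}
```

Setting λ_2 = 0 recovers the lasso penalty, and setting λ_1 = 0 recovers the ridge penalty.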

Logistic regression model
Pearl et al. [53] introduced the logistic regression model (LR), which is based on the logistic function, to model a binary dependent variable. It has since been adopted in machine learning for binary classification problems.
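For reference, the logistic function underlying LR can be written as follows. The symbols are notation assumed here: y is the binary outcome, x the vector of predictors, β_0 the intercept and β the coefficient vector.

```latex
P(y = 1 \mid x) \;=\; \frac{1}{1 + \exp\!\big(-(\beta_0 + x^{\top}\beta)\big)}
```

A record is then assigned to the positive class when this probability exceeds a chosen threshold, commonly 0.5.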

Lasso regression
Lasso (least absolute shrinkage and selection operator) regression (LAR) performs both variable selection and regularization. It was introduced by Tibshirani [54] in order to improve the prediction accuracy of regression models by selecting only a subset of the provided covariates for use in the final model rather than using all of them. LAR relies on the L1 penalty both to shrink the coefficients and to set some of them exactly to zero.

Multivariate adaptive regression splines
Multivariate adaptive regression splines (MARS) is a non-parametric regression technique introduced by Friedman [55]. MARS uses a spline-based method that allows different functions (linear or nonlinear) over different intervals to model the nonlinear relationship between the input and the output variables (x, y), in order to improve the goodness of fit [56].
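The idea of fitting different functions over different intervals can be illustrated with hinge basis functions, the standard building block of MARS. The following is a minimal one-dimensional sketch; the function names, the knot and the coefficients are hypothetical and chosen only for illustration.

```python
def hinge(x, knot, direction=1):
    """Hinge basis: max(0, x - knot) for direction=+1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate an additive model; terms is a list of (coefficient, knot, direction)."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)


# Hypothetical model: constant below the knot at 2.0, slope 1.5 above it.
example_terms = [(1.5, 2.0, 1)]
below_knot = mars_predict(1.0, 0.5, example_terms)  # 0.5
above_knot = mars_predict(4.0, 0.5, example_terms)  # 3.5
```

A full MARS fit also searches over knots and prunes terms, which this sketch does not attempt.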

Boosted regression tree model
Boosted Regression Tree (BRT) models combine two techniques, i.e. decision tree algorithms and boosting methods [57]. In particular, a decision tree algorithm relates a response to its predictors by recursive binary splits, and a boosting method is an adaptive method for combining many simple models to give improved predictive performance.

Random forests classifier
The Random Forests (RF) classifier is an ensemble tree-based learning algorithm [58]. Generally speaking, RF constructs a multitude of decision trees at training time and, for classification, outputs the class that is the mode of the classes predicted by the individual trees. The first algorithm for random forests was created by Ho [59], and random forests proper were first introduced by Breiman [60]. RF is robust to overfitting, and more stable in the presence of outliers and in high dimensional parameter spaces than other machine learning algorithms [61].
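The aggregation step described above (taking the mode of the individual tree predictions) can be sketched as follows. The "trees" here are hypothetical one-rule stand-ins, not fitted decision trees, and the function names are introduced for illustration only.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    """Ask every tree for a class label and aggregate by majority vote."""
    return majority_vote([tree(x) for tree in trees])


# Three hypothetical stumps acting on a two-feature record.
stumps = [
    lambda x: int(x[0] > 1),
    lambda x: int(x[1] > 0),
    lambda x: 0,
]
predicted = forest_predict(stumps, (2, 1))  # votes are [1, 1, 0], so the mode is 1
```

In an actual random forest, each tree is grown on a bootstrap sample with a random subset of features considered at each split; only the voting step is shown here.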

Simulation setup
Feature (or variable) selection helps to explain the data in the simplest way, avoids the unnecessary cost of measuring redundant predictors, and mitigates the issue of collinearity. It also helps to improve the prediction accuracy [62]. Therefore, finding a proper feature selection method is crucial. In this paper, we apply the eight aforementioned ML methods and compare their performance in order to select the classification model best suited to our data. To evaluate the performance of the eight ML methods, we apply bootstrap sampling and cross validation [63]. In order to generate reliable results, we use 100 bootstrap samples for the bootstrap sampling and choose K = 10 for the K-fold cross validation.
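The two resampling schemes can be sketched at the level of row indices. This is a minimal illustration using only the Python standard library; the function names and the seed are assumptions made here, and the sketch treats bootstrap sampling and 10-fold cross validation as two separate schemes because the text does not specify how they are combined.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; also return the out-of-bag rows."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def kfold_indices(n, k, rng):
    """Shuffle row indices, split them into k folds, and yield (train, test) pairs."""
    order = list(range(n))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


rng = random.Random(0)  # fixed seed (arbitrary) so the resamples are reproducible
boot_samples = [bootstrap_indices(376, rng) for _ in range(100)]
cv_splits = list(kfold_indices(376, 10, rng))
```

Each model would then be fitted on the in-bag (or training) rows and scored on the out-of-bag (or test) rows, with the scores averaged over resamples.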
Except for LR, all the other models rely on specific assumptions or tuning parameters, which need to be selected in advance. For example, for the NB model we need to assume a distribution or generate nonparametric models for the explanatory variables from the training set. For the SVM, we are required to specify the kernel and its hyperparameters. For ENET and LAR, we need to select the penalty parameters in the loss functions. The MARS model requires the pruning method to be specified. The tuning parameters for BRT include the loss function (distribution), the tree complexity, the K interaction depth (K-folds), the learning rate parameter λ (shrinkage) and the subsampling rate (bag.fraction). In the case of RF, the hyperparameters include the number of decision trees in the forest and the number of features considered by each tree when splitting a node.
We use accuracy and Cohen's Kappa to compare the predictive performance of the models. Models with higher accuracy and Cohen's Kappa values are considered to have better predictive performance and fit. We select the best of the eight aforementioned methods as our classification method. From the selected model, we identify the relatively important risk factors for VTE as those variables with a p-value less than α (α = 0.05).
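For concreteness, the two measures can be computed as in the following minimal sketch. The labels are made-up toy values, not study data, and the function names are introduced here for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class equals the observed class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = accuracy(y_true, y_pred)
    # Chance agreement from the marginal class frequencies of both label lists.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when only one class occurs
    return (p_o - p_e) / (1 - p_e)


# Toy labels (1 = VTE, 0 = no VTE), invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)       # 0.75
kappa = cohen_kappa(y_true, y_pred)  # 0.5
```

Kappa is lower than accuracy here because half of the observed agreement would be expected by chance given the balanced class frequencies.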

Data
In this study, we use the International Warfarin Pharmacogenetics Consortium (IWPC) data obtained from PharmGKB (http://www.pharmgkb.org) to identify important risk factors for VTE and to predict its incidence among DVT patients. The IWPC dataset includes information on 6256 warfarin-treated patients from 22 research groups in nine countries on four continents [64]. The dataset contains patients' non-genetic and genetic information such as their clinical characteristics, personalized medications, warfarin therapeutic doses, and genotypes. In particular, two very important genotypes, i.e. CYP2C9 and VKORC1, which are significantly related to warfarin therapy [65][66][67][68], are available in the dataset. The IWPC data have been used in several recent studies to model the warfarin dose with individual patients' clinical characteristics and genetic information [64,[69][70][71][72][73].
In this study, we choose 376 DVT patients as the cohort validation group to predict VTE incidence (262 of the 376 DVT patients have VTE). These DVT patients use warfarin on a daily basis. Therefore, we take their warfarin dose, INR and the genotypes of CYP2C9 and VKORC1 as potential risk factors to examine. We also examine other risk factors including demographic factors, BMI and clinical characteristics. Except for INR and warfarin dose, which are continuous, all the risk factors considered in this study are categorical variables. Table 1 presents the demographic features used in this study, including age, gender, race, and genotypes (CYP2C9 and VKORC1), as well as warfarin dose, INR and BMI, together with the corresponding frequency and variable names for each feature.
The clinical characteristics considered in this study include the DVT patients' comorbidities and concomitant medications.
In Table 2, we present the frequencies of the comorbidities and concomitant medications we examine, as well as their corresponding variable names in the dataset. All the comorbidities and concomitant medications we examine in this study occur in more than 5% of the DVT patients in this dataset.
We check the correlations across all variables in the DVT patient dataset. Fig 2 presents the correlation visualization. Note that the labels in Fig 2 are the variable names listed in Tables 1 and 2, arranged in alphabetical order from left to right and top to bottom. It is common to see high negative correlations across different categories of the same feature, for example "bl" (Black race) and "wh" (White race). Across different features, we see high positive correlations between "Han" and "AA", and between "bl" and "GG". It has been found that the AA genotype of VKORC1 is the most common genotype in Asians, and that the most common genotype in African Americans is GG [74], which can explain the high correlations across the VKORC1 and race categories in our data.
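The negative correlation between categories of the same feature can be seen in a small sketch. It assumes that the categorical variables are coded as 0/1 indicators and that the Pearson coefficient is used; the text does not state which correlation measure underlies Fig 2, and the indicator values below are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # fails if either sequence is constant


# Hypothetical indicators for two mutually exclusive categories of one feature.
bl = [1, 0, 0, 1, 0]
wh = [0, 1, 1, 0, 1]
r = pearson(bl, wh)  # close to -1, since the two indicators are exact complements
```

With more than two categories per feature, as in the race variable, the indicators are no longer exact complements, so the correlation is negative but weaker than -1.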

Results and discussion
As shown in Fig 1, to evaluate the interaction impact of BMI on the other risk factors for VTE, we conduct a three-step analysis. First, we select the best classification model from the eight ML methods presented in the Methods section; second, we identify the important risk factors for VTE using the selected model; third, we examine the interaction impact of BMI on the other risk factors for VTE.

Selection of the best classification model
First of all, we apply the eight ML methods introduced in the Methods section, in order to select the best classification method among them. As discussed in the Simulation setup subsection, we have to select the required tuning parameters for these models in advance. To compare the performance of these ML models, we apply five measures: accuracy, Cohen's Kappa, precision, recall and F1 score. For all performance measures, a higher value indicates better predictive performance. We perform bootstrap sampling with 100 replications and K-fold (K = 10) cross validation to calculate these measures. The results are presented in Fig 3. As shown in Fig 3, the RF model has the best prediction performance among the eight ML models, followed by the MARS model, and the NB model has the poorest performance. Therefore, we choose RF as the best model to conduct the following analysis.
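Precision, recall and F1 score can be computed from the counts of true positives, false positives and false negatives, as in this minimal sketch. The labels are toy values invented for illustration, the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (1 = VTE, 0 = no VTE), invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)  # 0.75 each
```

Here there are 3 true positives, 1 false positive and 1 false negative, so all three measures equal 0.75.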
In Fig 4, we plot the Receiver Operating Characteristic (ROC) curve for the RF model. Its area under the curve (AUC) is 0.78207, indicating good classification performance.
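The AUC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the scores are invented, the function name is hypothetical, and this is one standard way to obtain the AUC rather than necessarily the routine used for Fig 4.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy outcomes and predicted probabilities, invented for illustration.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.6]
auc = roc_auc(toy_labels, toy_scores)  # 7.5 correctly ordered pairs out of 9
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.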
Next, we implement a simple Monte Carlo simulation to examine the stability of our prediction using the RF model. In this Monte Carlo simulation, we generate a simulation envelope through a number of randomizations (for example, n = 50) of the VTE risk factors. From each randomization, we obtain the probability density function (PDF) of the predicted probability of VTE incidence. Together, these PDFs constitute the simulation envelope. We compare the PDF of the predicted probability of VTE incidence from our dataset with the simulation envelope to examine the stability of the RF model prediction. Fig 5 presents the PDF of the predicted probability of VTE incidence from our dataset (indicated by the red line) and the simulation envelope (indicated by the grey lines).
As shown in Fig 5, the prediction from the RF model (observed) matches well with the simulation envelope (perturbed). In particular, if we define the perturbed error as the difference between the prediction from the RF model and the simulation probability, the error mean is 0.1012 and the standard deviation is 0.0064, which demonstrates the stability of the RF prediction.

Identification of the important risk factors
We apply the RF model to rank the importance of risk factors for VTE. Table 3 presents the mean decrease accuracy (MDA) for all risk factors used in this study and the corresponding p-values from permutation tests. Variables with larger MDA values are more important for VTE incidence, and the permutation p-values indicate the significance of the importance metrics for the RF model.
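The idea behind a mean decrease in accuracy can be sketched as follows: shuffle one feature at a time and record how much the prediction accuracy drops. This is a simplified, model-agnostic illustration on invented data; the MDA reported by random forest software is typically computed tree by tree on out-of-bag samples and may be scaled, and the permutation p-values are not reproduced here. All names below are hypothetical.

```python
import random


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)

    def acc(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = acc(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the outcome
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - acc(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the outcome depends on feature 0 only; feature 1 is irrelevant.
X_toy = [[i % 2, (i // 2) % 2] for i in range(40)]
y_toy = [row[0] for row in X_toy]
toy_model = lambda row: row[0]  # stand-in for a fitted classifier
importance = permutation_importance(toy_model, X_toy, y_toy)
```

In this toy setting the irrelevant feature has an importance of exactly zero, while shuffling the informative feature lowers the accuracy.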

Impact of BMI
Even though our results from the RF model show that BMI is not an important risk factor for VTE, previous studies have found that obesity may interact with other risk factors in VTE development and change the impacts of other risk factors on VTE [22]. However, there is very limited evidence on the exact interaction between BMI and the other VTE risk factors. This study aims to fill this gap in the literature. To examine the interaction effects, we first group our sample into four BMI categories: underweight and normal weight (BMI < 25), overweight (25 ≤ BMI < 30), obese (30 ≤ BMI < 40), and morbidly obese (BMI ≥ 40). We then apply the RF model to each BMI group separately. In this way, we can obtain the varying impacts of the other important risk factors, including comorbidities and demographic features, on VTE occurrence for individuals from different BMI groups. Table 4 presents the MDA of the important risk factors (identified by the RF model for the whole sample) for each BMI group. We also present the significance level of the permutation test for importance for each risk factor across the BMI groups. As shown in Table 4, all important risk factors for VTE, except for two features (infarction and stroke), remain significantly important for VTE across all BMI groups. Infarction and stroke, however, are significantly important only for some specific BMI groups. For example, infarction is a moderate and a weak VTE risk factor for morbidly obese people and overweight people, respectively, with a significance level for importance of 10%, but it is not significantly important for people in the other two BMI groups. Stroke is significantly important for overweight and obese people but not for normal-weight or underweight people, in terms of VTE occurrence. It is worth mentioning that the importance of the AA genotype of VKORC1 is significant only at the 10% level for morbidly obese people, while for all other people it is significant at the 5% level.
On the other hand, depression is significantly important at the 5% level for overweight people, but only at the 10% level for the other BMI groups. In terms of MDA values, all risk factors, except for three features (the AA genotype of VKORC1, infarction and stroke), qualify as at least moderate (i.e. MDA > 2) risk factors for VTE. In particular, the AA genotype of VKORC1 and infarction are two weak VTE risk factors (MDA < 2) for overweight individuals but moderate ones for people from the other BMI groups, while stroke is a weak risk factor for every BMI group. It can be seen that the importance of the same risk factor differs across BMI groups. We present the visualized results of MDA across the BMI groups in Fig 6. The horizontal axis of Fig 6 indicates the important VTE risk factors identified from the whole sample, ordered by their MDA from highest to lowest. For example, "Han" has the highest MDA of 10.7470 and "dep" has the lowest MDA of 2.6900, as presented in Table 3. Fig 6 presents four lines of MDA, one estimated for each of the four BMI groups. These lines do not trend strictly downward and are not parallel.
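The stratification step used in this analysis can be sketched as follows, combining the BMI definition (weight in kg divided by squared height in meters) with the four categories defined above. The function names, the record layout with a "bmi" field, and the example values are hypothetical.

```python
from collections import defaultdict


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2


def bmi_group(value):
    """Map a BMI value to one of the four categories used in the analysis."""
    if value < 25:
        return "underweight/normal"
    if value < 30:
        return "overweight"
    if value < 40:
        return "obese"
    return "morbidly obese"


def split_by_bmi_group(records):
    """Partition records (dicts with a hypothetical 'bmi' field) by BMI category."""
    groups = defaultdict(list)
    for rec in records:
        groups[bmi_group(rec["bmi"])].append(rec)
    return dict(groups)


# Invented example records; a classifier would then be fitted within each group.
patients = [{"id": 1, "bmi": bmi(70, 1.75)}, {"id": 2, "bmi": bmi(95, 1.75)}]
by_group = split_by_bmi_group(patients)
```

Fitting the model separately within each group, rather than adding BMI as one more predictor, is what allows the importance of every other risk factor to differ between groups.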

Two conclusions emerge from Fig 6. First, the importance order of these risk factors for a particular BMI group is different from that of the whole population. For example, although "Han" is the most important VTE risk factor for DVT patients regardless of BMI category, the second most important risk factor in each BMI group differs from that in the whole population. In particular, without taking into account the interaction effect of BMI, hyperlipidemia is the second most important risk factor, whereas it is "bl" for the overweight and morbidly obese groups, and "GG" for the obese group. Similar differences can be found for the other risk factors.
Second, the same VTE risk factor may play a different role across BMI groups. Take the race variable "Han" as an example: although it is the most important risk factor in all BMI categories, its importance is highest in the obese group, followed by the morbidly obese group, and lowest in the overweight group (as indicated by the points in the dashed oval in Fig 6). For some risk factors, such as "AF", we see more variation in importance across BMI groups. For other variables, such as "INR", the variation is much smaller. These results further demonstrate the interaction effect of BMI on the other risk factors for VTE incidence.

Conclusions
In this study, we investigate the interaction impact of BMI on the other important VTE risk factors. First, we apply eight ML methods, including NB, SVM, ENET, LR, LAR, MARS, BRT and RF. According to five performance measures, i.e. accuracy, Cohen's Kappa, precision, recall and F1 score, we choose the RF model as the best classification model among the eight ML methods. Second, by applying the RF model, we identify twelve important risk factors according to their MDA and the permutation test for importance. Last, we run the RF model separately for each BMI group to examine the interaction impact of BMI on the other important VTE risk factors. From this three-step analysis, we draw two conclusions. First, the importance of VTE risk factors may vary across BMI groups. For example, we find that the AA genotype of VKORC1 plays a more important role in determining VTE occurrence for obese or morbidly obese individuals than for people from the other BMI categories. Second, the variation of a risk factor's importance for VTE incidence across the four BMI groups differs between risk factors. For example, we see large variation in the importance of atrial fibrillation across BMI groups, whereas the variation in the importance of INR is much smaller. Therefore, in order to determine the risk of VTE and how risk factors affect VTE, the interaction impact of BMI on the risk factors has to be taken into account.