Ensemble of machine learning algorithms using the stacked generalization approach to estimate the warfarin dose

Warfarin dosing remains challenging due to narrow therapeutic index and highly individual variability. Incorrect warfarin dosing is associated with devastating adverse events. Remarkable efforts have been made to develop the machine learning based warfarin dosing algorithms incorporating clinical factors and genetic variants such as polymorphisms in CYP2C9 and VKORC1. The most widely validated pharmacogenetic algorithm is the IWPC algorithm based on multivariate linear regression (MLR). However, with only a single algorithm, the prediction performance may reach an upper limit even with optimal parameters. Here, we present novel algorithms using stacked generalization frameworks to estimate the warfarin dose, within which different types of machine learning algorithms function together through a meta-machine learning model to maximize the prediction accuracy. Compared to the IWPC-derived MLR algorithm, Stack 1 and 2 based on stacked generalization frameworks performed significantly better overall. Subgroup analysis revealed that the mean of the percentage of patients whose predicted dose of warfarin within 20% of the actual stable therapeutic dose (mean percentage within 20%) for Stack 1 was improved by 12.7% (from 42.47% to 47.86%) in Asians and by 13.5% (from 22.08% to 25.05%) in the low-dose group compared to that for MLR, respectively. These data suggest that our algorithms would especially benefit patients requiring low warfarin maintenance dose, as subtle changes in warfarin dose could lead to adverse clinical events (thrombosis or bleeding) in patients with low dose. Our study offers novel pharmacogenetic algorithms for clinical trials and practice.


Introduction
Warfarin is the most widely used oral anticoagulant worldwide. Due to its narrow therapeutic window and large interpatient variability, warfarin dosing remains challenging [1]. The PLOS  consequence of incorrect warfarin dosing can be devastating, predisposing the patient to thrombosis in the case of under-dosing or bleeding in the case of overdosing. Because of these challenges associated with warfarin use, it is one of the leading causes in emergency department visits and the most often cited cause of drug-related mortality [2]. Clinical factors, demographic variables and genetic variants significantly contribute to interpatient variability in dose requirement for warfarin. While non-genetic factors, such as age, height, weight, race and drug interaction, have been reported to explain 15-20% of the variability in dose, polymorphisms in cytochrome p450, family 2, subfamily C, polypeptide 9 (CYP2C9), and vitamin K epoxide reductase complex, subunit1 (VKORC1) independently correlate with warfarin therapeutic dose [3][4][5], and combined polymorphisms of those two genes account approximately 30% (20%-25% for VKORC1 rs9923231; 5%-10% for CYP2C9) of the interpatient warfarin dose variability [3][4][5][6].
Remarkable efforts have been made to estimate the appropriate warfarin dose to improve the patient care. Many pharmacogenetic algorithms integrating clinical, demographic and genetic variables have been created to predict the dose requirement in individual patients [7][8][9][10][11]. Most of the dosing algorithms are based on multivariate linear regression (MLR). One of the most widely used and tested algorithms is the IWPC pharmacogenetic algorithm [8], which has proved accuracy in multiple studies. Other advanced machine learning approaches such as deep learning (neural networks), tree-based algorithms and support vector machines have also been used to predict warfarin dose [7,10,11], but those studies use a single machine learning algorithm to maximize the accuracy of predicting warfarin dose. In a mathematical point of view, a machine learning algorithm is a sophisticated fit to a non-linear function, and a single machine learning model may fit well to a certain subset of patients, but may overfit or underfit to the rest of patients with different genetic and racial background. As a result, the prediction accuracy of a single model may reach an upper limit even with optimal parameters. It is not surprising that the models produced by MLR and support vector regression with a linear kernel were statistically indistinguishable and significantly outperformed all the other approaches in the IWPC cohort [8]. One way to overcome the limitation of a single algorithm is to combine the advantages of several algorithms to break through the upper limit of a single machine learning algorithm (i.e. ensemble method). Recently, the ensemble method "bagging" has been applied to predict warfarin dose [12,13]. Stacked generalization is another ensemble method that uses a higher-level model to combine lower-level models to achieve higher prediction accuracy [14,15]. Unlike the "bagging" and "boosting" approaches which can only combine machine learning algorithms of the same type, stacked generalization can combine different types of algorithms through a meta-machine learning model to maximize the generalization accuracy.
In this study, we created novel regression models to estimate warfarin stable dose utilizing stacked generalization frameworks that combine the advantages of distinct machine learning algorithms and significantly improved the prediction accuracy compared to MLR.

Materials and methods
The International Warfarin Pharmacogenetic Consortium (IWPC) Cohort IWPC Cohort has been described previously [8]. Expanded Data set was downloaded from the PharmGKB website (http://www.pharmgkb.org/downloads/), which contains pooled data on 6256 chronic warfarin users recruited through collaborative efforts of 22 research groups from 4 continents. This data set includes detailed de-identified curated data on demographic factors, clinical features, such as age, weight, height and concomitant use of amiodarone, as well as CYP2C9 and VKORC1 genotypes. Missing values for height and weight were imputed with multivariate linear regression models. Specifically, weight, race, and sex were used for the imputation of the height variable, while height, race, and sex were used for the weight variable. For missing values of the VKORC1 rs9923231, the imputation strategy has been described [8], which is based on linkage disequilibrium in VKORC1 and race. We excluded 17 subjects with CYP2C9 � 5, � 6, � 11, � 13 and � 14, due to low allele frequency and an outlier subject with warfarin stable dose 315 mg/week. A total of 5743 subjects were included in this study.

Stacked generalization framework
Predicting warfarin maintenance dose is a regression task, which requires a function or model f that maps an input vector x 2 d onto the corresponding continuous label value y 2 . Since a training data set {(x 1 , y 1 ), . . ., (x n , y n )} is utilized to find f, the task falls into the category of a supervised learning problem. The machine learning problem is formulated as a minimization problem of the form: The first term of the objective is the empirical risk described by a loss function S which measures the quality of the function f. A specific case is the squared loss l(ŷ, y) = (ŷ -y) 2 . The second term of the objective in Eq (1) is a regularization term which measures the complexity or roughness of the function f, which usually is a norm of f or its derivatives. L 2 regularization was used in this study.
Stacked generalization is utilized to ensemble different machine learning algorithms, which can be viewed as a means of collectively using several models to estimate their own generalizing biases with respect to a particular learning set, and then filter out those biases [16,17]. There are two kinds of models in a stacked generalization framework: several base models (level-0 models) and one meta-model (level-1 model). The essence of stacked generalization is to use the level-1 model to learn from the predictions of level-0 models. Generally, a stacked generalization framework can obtain more accurate prediction compared to the best level-0 model [16].
One of the key points is to obtain the training data for level-1 model (D cv ) from cross-validation technique. Given an original data set D = {(y n , x n ), n = 1, . . ., N}, where y n is the target value and x n represents feature vectors of the nth instance, randomly split the data into K almost equal folds D 1 At the end of the entire cross-validation process of each M j , the data assembled from the outputs of the J models is D cv ¼ fðy n ; z 1n ; . . . ; z Jn Þ; n ¼ 1; 2; . . . ; Ng: ð3Þ D cv is the training set of level-1 model M meta . To complete the training process, level-0 models M j (j = 1, 2, . . ., J) are trained using original dataset D, and M meta is trained by D cv . Now we consider the prediction process, which uses the models M j , j = 1, 2, . . ., J, in conjunction with M meta . Given a new instance, models M j produce a vector (z 1 , . . ., z J ). This vector is input to the level-1 model M meta , whose output is the final prediction result for that instance.
In this study, 80% of the eligible patients were randomly chosen as the training set (N = 4594), which was used to train the machine learning models. The rest 20% of the patients (N = 1149) were utilized as the hold-out test set. Various models were then built on the training set and the hold-out test set was used to evaluate the model performance by two metrics. The mean absolute error (MAE), the absolute difference between the predicted and actual maintenance doses, was used to evaluate each model's predictive accuracy. The mean of the percentage of patients whose predicted dose of warfarin within 20% of the actual stable therapeutic dose (mean percentage within 20%) was utilized to evaluate the clinical significance of each algorithm, as a difference in warfarin dose greater than 20% is likely to be considered to clinically relevant by clinicians. The features/variables in single models were identified based on reported IWPC pharmacogenetic dosing algorithm [8], including height, weight, race, age, enzyme inducer and use of amiodarone. Additional features used in other warfarin stable dose prediction algorithms [7,9,10,20] such as diabetes mellitus, heart failure, valve replacement, smoking, use of statins, smoking status, and other VKORC1 genotypes (S1 Table) were also included to train the stacked generalization models.

Warfarin dose subgroup analysis
To assess the performance of the algorithms in different dose ranges, warfarin stable dose was divided into three subgroups based on the 25% and 75% quantiles according to race: low dose

Statistical analysis
Due to the skewed (with a longer tail at high doses) distribution of warfarin dose, we transformed raw dose into square root of the dose. To obtain robust statistics, 100 rounds of resampling were performed from the IWPC cohort. All the quantitative data are presented as means with 95% confidence intervals (CI). P values < 0.05 were considered to be statistically significant. The MAE and differences of the mean percentage within 20% of the actual stable dose among the algorithms were compared with unpaired Student's t-test. All statistical analyses were conducted with R (version 3.4.4).

Basic characteristics of the IWPC cohort
The characteristics of the patients are shown in Table 1 and S1 Table. In the IWPC cohort, 5743 patients were included for analyses with a median warfarin stable dose of 28.0 mg/week. The target range for International Normalized Ratio (INR) fell within 1.7 to 3.3, with the majority of being prespecified between 2 and 3. The proportions of patient age less than 50, 50 to 80, and older than 80 were 17.0%, 71.0%, and 12.1%, respectively. There were 4.9% of patients concomitantly taking amiodarone, 19.6% of patients taking statins, 1.1% of patients using enzyme inducers (including phenytoin, carbamazepine, and rifampin). There were 8.4% of patients smoking. Patients with diabetes mellitus, cardiomyopathies and heart failure, and valve replacement were 10.8%, 12.5% and 17.0%, respectively.

Overall performance of the single algorithms
To establish the stacked generation frameworks to ensemble different models, we first examined the predictive performance of several single models. Eight different machine learning algorithms including MLR, SV, RR, NN, GBT, RF, ET, and K nearest neighbors (KN) were trained by the training set and then prediction was made on the hold-out set. The performance of each model measured by the MAE and percentage within 20% is shown in Table 2. The algorithms SV, RR, and MLR performed statistically indistinguishable and were the best among eight algorithms, followed by GBT and NN. The algorithms RF, ET and KN resulted in the least favorable results.

Configuration and performance of stacked generalization framework 1 and 2 (Stack 1 and 2)
To determine whether ensemble predictors constructed using stacked generalization improve the prediction accuracy for warfarin stable dose, we constructed two different stacked generalization frameworks Stack 1 and 2 using the exactly same parameters in individual algorithms. The frameworks in this study consisted of three level-0 models and one level-1 model (meta model).  Table 2. To improve the prediction accuracy, we included diabetes mellitus, heart failure, valve replacement, smoking, use of statins, and other VKORC1 genotypes as additional features to train the frameworks (Stack1 and 2). The MAEs for the Stack1 and 2 were 8.31 and 8.31 mg/ week, which were significantly better than 8.53 mg/week in the MLR ( Table 3). The mean percentage within 20% produced by Stack 1 and 2 (47.85% and 47.81%) were significantly higher than that (46.31%) of MLR (Table 3).

Evaluating the performance of the stacks by race
Given that genetic diversity is known to be great in different races [21], to further examine the performance of the Stacks in different races, Stack 1 and Stack 2 were evaluated in Asians, whites and blacks. The MAE of prediction was highest in blacks and lowest in Asians across the three algorithms (Table 4). In Asians, Stack 1 and Stack 2 performed significantly better than MLR for both MAE and mean percentage within 20% ( Table 4). The mean percentage within 20% for Stack 1 and 2 had a 12.7% improvement (from 42.47% to 47.86% and 47.66%) compared to that for MLR. In whites, the MAEs for Stack 1 and 2 were lower than that of MLR (P = 0.002). The performance of Stack 1 and Stack2 in blacks was better than MLR, but there was no statistical significance (Table 4).

Evaluating the performance of the stacks by warfarin dose range
We next to test the predictive ability of Stack 1 and 2 in three different warfarin dose ranges. The MAE of prediction was highest in high dose and lowest in intermediate dose across the three algorithms (Table 5). In the low-dose group, Stack 1 and Stack 2 provided a significantly

Discussion
The individual response to warfarin is highly variable, being influenced by clinical factors and genetic variants such as polymorphisms in CYP2C9 and VKORC1. Given that incorrect warfarin dosing can potentially increase the risk of thrombosis or bleeding, estimating appropriate warfarin dose for individual patient has been an active research area. Recently, personalized genotype-guided warfarin dosing has demonstrated clinical benefits and superior clinical outcomes in major clinical trials [22][23][24]. In these trials, predicted warfarin maintenance doses were based on MLR. However, due to the complex and non-linear association between warfarin dose and clinical and genetic variables, such limitations related to MLR-based algorithms may affect the prediction accuracy. Because of that, in this study, we set out to develop novel algorithms using the stacked generalization approach to estimate the warfarin dose, where different types of machine learning algorithms function together through a meta-machine learning model to maximize the prediction accuracy. This strategy has been successfully applied to and significantly improved the prediction of molecular atomization energies [25].
To be able to directly compare the performance among different algorithms, we used the IWPC data set and applied this data set to eight different machine learning algorithms with the same covariates in the IWPC algorithm. In line with the MAE (8.5 mg/week, 95% CI (8.0-9.0)) in the IWPC study [8], the MAE for MLR in this study was 8.53 mg/week (95% CI (8.08-8.99)). These data validated our data processing and allowed us to directly compare the performance of algorithms developed in this study to the IWPC algorithms. Moreover, we found that MLR provided a comparable performance as SV and RR and outperformed all other algorithms, including NN and GBT, which is consistent with the results reported in the IWPC study [8]. In order to take advantage of different algorithms, we constructed ensemble predictor Stack 1 and 2 using SV, RR, NN and GBT. Overall, Stack 1 and 2 performed significantly better than MLR. Subgroup analysis revealed that the mean percentage within 20% for Stack 1 and 2 had a 12.7% improvement (from 42.47% to 47.86% and 47.66%) compared to that for MLR in Asians. Similarly, the mean percentage within 20% for Stack 1 was improved by 13.5% (from 22.08% to 25.05%) compared to that for MLR in the low-dose group. The median warfarin weekly dose for Asians was lower compared to whites and blacks [21]. These data suggest that our algorithm would especially benefit patients requiring low warfarin dose, as subtle changes in warfarin dose could result in adverse clinical events such as thrombosis or bleeding.
Many studies developed the pharmacogenetic algorithms for warfarin dosing [7][8][9][10][11], but the performance of these algorithms can hardly be compared due to different data sets with different patient ethnicity and prevalence of CYP2C9 and VKORC1 polymorphisms. For example, we have reported that Asians, whites and blacks have different warfarin sensitivity, resulting from different polymorphisms of CYP2C9 and VKORC1 in the IWPC cohort [21]. For the IWPC data set, additional algorithms such as multivariate adaptive regression splines (MARS), lasso regression (LAR) and Bayesian additive regression trees (BART) have also been tested along with MLR [10]. Interestingly, the MAE for MLR was 9.28 mg/week in that study and they showed that BART, MARS and SVR performed superior than MLR in the prediction of warfarin dose. However, MLR and SVR performed better than MARS in the IWPC study. This discrepancy could be owing to different data processing and independent variables selected. In addition, the ensemble method "bagging" has been used to predict warfarin dose [12,13], which is a popular method to assemble base models to decrease the variance, but not to improve the predictive force. Therefore, in terms of the coefficient of determination (R 2 ), R 2 for the IWPC algorithm in Asians is 0.46 [26], which is not inferior to the "bagging" models (Ms+G) ranging from 0.402 to 0.441 for Chinese in the previous study [13]. "Bagging" method requires designing various feature functions for base models in order to achieve diversity, which is a key requirement for base models [13]. The novel regression models we proposed here can take advantage of distinct machine learning algorithms to achieve high diversity of base models. Designing feature functions is not necessary and all base models can use the same feature space, which can avoid extra information extraction efforts. Moreover, "voting" has been combined with "bagging" to assemble the base models to predict warfarin dose [12]. In contrast to "stacking", no learning takes place at the meta-level when combining base models by a voting scheme. Of note, "stacking" has been applied to combine different base models to predict warfarin dosing adjustment based on time-series data with patient's history of warfarin dose and INR [27], whereas in this study we focused on the development of pharmacogenetic models to predict warfarin maintenance dose using clinical and genetic factors. Taken together, the performance of our algorithms based on the stacked generalization framework appears superior to other published models [10]. This is probably attributable to the power of our model to identify complex, nonlinear relationships among genetic and clinical variables and to deal with many sources of inferential trouble such as outliers and collinearity among variables compared to the MLR in the IWPC study.
Our study has several limitations. First, due to the missing values in the IWPC cohort, we imputed missing genotypes of VKORC1 for some patients based on linkage disequilibrium [5]. We also imputed missing values for height and weight using multivariate linear regression models. These imputation strategies, which are generally reliable, could have introduced some errors. Second, many patients required very high doses of warfarin in the IWPC cohort (> 70 mg/week), especially in warfarin normal groups. The polymorphisms explored in VKORC1 and CYP2C9 to classify warfarin sensitivity primarily explain increased sensitivity to warfarin, not increased resistance. Mutations of VKORC1 have been associated with resistance to warfarin [28,29]. Third, our algorithms are more complex than MLR, which may not be very easy to be implemented in clinical setting.
In conclusion, we created novel regression models using the stacked generalization approach to estimate the warfarin maintenance dose. The performance of our algorithms was superior to the algorithm developed by IWPC, especially in Asians and the low-dose group. Our study offers alternative pharmacogenetic algorithms for clinical trials and practice.
Supporting information S1