
Survival prognostic factors in patients with acute myeloid leukemia using machine learning techniques


Keyvan Karami, Mahboubeh Akbari, Mohammad-Taher Moradi, Bijan Soleymani, Hossein Fallahi


Abstract

This paper identifies prognostic factors for survival in patients with acute myeloid leukemia (AML) using machine learning techniques. We have integrated machine learning with feature selection methods and compared their performances to identify the most suitable factors for assessing the survival of AML patients. Six data mining algorithms, including Decision Tree, Random Forest, Logistic Regression, Naive Bayes, W-Bayes Net, and Gradient Boosted Tree (GBT), are employed for the detection model and implemented using the common data mining tool RapidMiner and open-source R packages. To improve the predictive ability of our model, a set of features was selected by employing multiple feature selection methods. Classification accuracy was obtained using 10-fold cross-validation for the various combinations of feature selection methods and machine learning algorithms. The performance of the models was assessed by various measurement indexes including accuracy, kappa, sensitivity, specificity, positive predictive value, negative predictive value, and area under the ROC curve (AUC). Our results showed that GBT with an accuracy of 85.17%, an AUC of 0.930, and feature selection via the Relief algorithm has the best performance in predicting the survival rate of AML patients.


Introduction

Acute myeloid leukemia (AML) is a clonal disorder associated with reduced differentiation of the myeloid lineage and accumulation of immature progenitors in the bone marrow, resulting in hematopoietic failure [1]. The peripheral blood is the most involved organ in this disease, while infiltration of other organs such as the brain and/or the lung is uncommon and found mostly in cases with high blast counts in the blood [2].

The criterion for AML according to the World Health Organization (WHO) is the observation of at least 20% myeloblasts of myeloid lineage in the marrow (or blood) [3]. Exceptions to the ≥20% criterion are cases of core-binding factor (CBF)-AML, nucleophosmin 1 (NPM1)-mutated AML, and acute promyelocytic leukemia (APL); in each of these, the AML diagnosis is independent of blast percentage. Cases with more than 20% blasts but without lineage markers are classified as acute undifferentiated leukemia (AUL) and are mostly treated like AML [2].

Multiple studies have suggested the contribution of both genetic factors and clinical variables in predicting overall survival (OS) and event-free survival (EFS). Besides aging, which is considered an independent prognostic factor, mutations of RUNX1, ASXL1, and TP53 are associated with poor prognosis and a lower chance of survival. TP53 mutation and complex karyotype give independent prognostic information, and their combination results in the worst outcome [4]. Generally, it has been suggested that 75% of variation is related to genomic instability, and the other 25% is associated with clinical, treatment, and demographic variables. So far, models combining all of these factors could predict remission or life expectancy in only 75% to 80% of cases [4]. This emphasizes the need to find other prognostic factors with higher accuracy.

Machine learning (ML) techniques have become popular tools for the prediction of disease outcomes. For example, they can identify patterns and relationships between data points to predict the outcome of cancer [5]. Machine learning is a branch of artificial intelligence that develops computer programs able to accommodate new data as they become available [6]. Computer models trained on previous data can thus be used for classification, prediction, and detection. Among ML techniques, feature selection methods select a subset of features from the original feature set [7]. These techniques are convenient to use and relatively accurate for prediction tasks.

Numerous studies have used prediction models for cancer survivability. For example, Delen et al. used artificial neural networks (ANN), decision trees (DT), and logistic regression (LR) in breast cancer patients [8]. Chao et al. used several models, such as support vector machine (SVM), LR, and a C5.0 decision tree, for the prediction of breast cancer survival [9]. Another study introduced a different method for predicting the survival of breast cancer patients, using Naive Bayes (NB), Trees Random Forest (TRF), 1-Nearest Neighbor (1NN), AdaBoost (AD), SVM, RBF Network (RBFN), and Multilayer Perceptron (MLP) machine learning techniques alongside 10-fold cross-validation [10]. Similar work was conducted by Delen et al. for prostate cancer patients using SVM, DT, ANN, and LR [11]. Interestingly, in most clinical studies Random Forest (RF) appears to be the most preferred algorithm [12, 13].

In the current study, we have applied multiple machine learning techniques to an AML dataset to evaluate their power in predicting the survival outcome (alive or dead) of patients. The aim is to improve subsequent therapy and patient management in order to increase survivability. Thus, the application of machine learning models for accurate prediction of survival in patients with acute myeloid leukemia based on clinical data was assessed in our study.

Materials and methods

Several variables, numerical and categorical, have been registered for patients in the current database. The patients were classified according to the French-American-British (FAB) system. The data were prepared for analysis using data mining tools and algorithms. Feature selection was done via feature weighting methods, after which the 25 highest-weighted features were selected for further analysis. As listed in the following sections, multiple classifiers were trained and evaluated for their ability to predict the survival of patients.


Data used in this study were obtained from the Leukemia Sample Bank at the University of Texas M. D. Anderson Cancer Center and were collected between January 15, 1998, and March 9, 2006. The primary dataset contained information on 249 patients with AML. The variables evaluated for the patients are shown in Tables 1 and 2, including the categorical and some of the numerical variables used in the study. A list of protein features used in the analysis can also be found in S1 Table. Missing values in categorical and numerical features were replaced by the mode and average, respectively, of that feature within each class, as imported into the RapidMiner software (RapidMiner 7.5.003) [14].
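The per-class mode/mean imputation described above can be sketched in plain Python; the column names and data layout here are hypothetical, not the study's, and the study itself performed this step inside RapidMiner:

```python
import statistics

def impute(rows, labels, num_cols, cat_cols):
    """Replace missing values (None) with the per-class mean (numerical
    columns) or mode (categorical columns), mirroring the preprocessing
    described above. `rows` is a list of dicts; `labels` gives the class
    of each row. Column names are illustrative."""
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        for col in num_cols:
            vals = [rows[i][col] for i in idx if rows[i][col] is not None]
            if not vals:
                continue  # nothing to impute from in this class
            fill = statistics.mean(vals)
            for i in idx:
                if rows[i][col] is None:
                    rows[i][col] = fill
        for col in cat_cols:
            vals = [rows[i][col] for i in idx if rows[i][col] is not None]
            if not vals:
                continue
            fill = statistics.mode(vals)
            for i in idx:
                if rows[i][col] is None:
                    rows[i][col] = fill
    return rows
```

Imputing within each class (rather than over the whole column) preserves class-conditional differences in the feature distributions.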

Table 1. Statistical description of categorical variables in the AML patients.

Table 2. Statistical description of numerical variables in the AML patients.

Data cleaning

Preprocessing by cleaning and formatting the data is often crucial for obtaining a good model fit and better predictive ability. Therefore, correlated attributes with a Pearson correlation coefficient greater than 0.95 were removed from the list. Additionally, numerical attributes with a standard deviation less than or equal to a given threshold (0.1) were assumed to be useless and removed from the initial dataset. The remaining data were treated as the processed dataset and used for conducting this study [15].
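A minimal sketch of this cleaning step, assuming numerical columns stored as plain Python lists and using the thresholds stated above; which member of a highly correlated pair is dropped is an implementation choice not specified in the text (here, the later column is dropped):

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def clean(columns, corr_cut=0.95, sd_cut=0.1):
    """Drop near-constant columns (sample SD <= sd_cut), then for each
    pair with |r| > corr_cut drop the later column.
    `columns` maps column name -> list of numerical values."""
    kept = [c for c in columns if statistics.stdev(columns[c]) > sd_cut]
    dropped = set()
    for i, a in enumerate(kept):
        if a in dropped:
            continue
        for b in kept[i + 1:]:
            if b not in dropped and abs(pearson(columns[a], columns[b])) > corr_cut:
                dropped.add(b)
    return [c for c in kept if c not in dropped]
```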

Feature selection

The performance of many algorithms decreases in the presence of irrelevant features, so the selection of relevant features is an important step in data mining. Here, feature selection was conducted through feature weighting approaches including information gain, information gain ratio, Gini index, chi-squared, correlation, relief, and uncertainty [16]. We then selected the 25 features with the highest weighting scores (more than 30% of all features analyzed) for the subsequent model evaluation steps. In the following, we give a short description of the feature selection algorithms used.

  1. Information gain
    Information gain is an entropy-based feature evaluation method widely used in machine learning. When applied to feature selection, it is defined as the amount of information a feature provides about the class: it measures how much a term contributes to classification, and hence the importance of that item for the classification task [17, 18].
  2. Information gain ratio
    The information gain ratio for splitting according to some feature "A" is the gain divided by the entropy of A. One advantage of the gain ratio is that it is less biased toward finer partitions. This measure is described by Jia et al. [19].
  3. Gini index
    The Gini index is a split measure of total variance across the K classes. It is used for choosing the most appropriate splitting attribute at each node of a decision tree [20, 21].
  4. Chi-squared
    The chi-square test is a statistical technique for testing the independence of two events. In feature selection, the two events are the occurrence of the term and the occurrence of the class.
  5. Correlation
    Correlation is a statistical measure of the linear relationship between two variables. Highly correlated features are more linearly dependent and hence have almost the same effect on the dependent variable; when two features are highly correlated, one of them can be dropped.
  6. Relief
    Relief is considered one of the most successful feature selection algorithms; it estimates the quality of features according to how well their values distinguish between instances of the same and different classes that are near each other. Relief was originally defined for two-class problems and was later extended (ReliefF) to handle noise and multiclass datasets [22, 23].
  7. Uncertainty
    Symmetric uncertainty is obtained by normalizing mutual information by the entropies of the two variables. This nonlinear measure indicates the correlation between two variables; in feature selection, it evaluates an attribute by measuring its symmetric uncertainty with respect to the class [24].
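Two of the entropy-based weights above, information gain and symmetric uncertainty, can be sketched for categorical features as follows. This is an illustrative implementation, not the RapidMiner operators used in the study:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """IG = H(class) - H(class | feature) for a categorical feature."""
    n = len(labels)
    cond = 0.0
    for v, cnt in Counter(feature).items():
        subset = [y for f, y in zip(feature, labels) if f == v]
        cond += cnt / n * entropy(subset)
    return entropy(labels) - cond

def symmetric_uncertainty(feature, labels):
    """SU = 2 * IG / (H(feature) + H(class)), the normalized measure above."""
    return 2 * info_gain(feature, labels) / (entropy(feature) + entropy(labels))
```

A feature that perfectly determines the class gets IG equal to the class entropy and SU of 1; an irrelevant feature gets 0 for both.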

Model evaluation

Based on what has been reported in the literature, six machine learning techniques were used to study the data: RF, DT, LR, Naive Bayes, W-Bayes Net, and GBT. We performed 10-fold cross-validation for all datasets: in each fold, the models are fitted to the training data and the held-out test observations are used for validation. The algorithms were assessed by their accuracy, kappa, sensitivity, specificity, positive predictive value, negative predictive value, and AUC (area under the ROC curve). The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, and the AUC is the area under this curve; a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.

The following abbreviations are used for the empirical quantities from which the indicators are computed: P (# positive samples), N (# negative samples), TP (# true positives), TN (# true negatives), FP (# false positives), FN (# false negatives). Accuracy was estimated as (TP+TN)/(P+N). Positive predictive value (PPV) was estimated as TP/(TP+FP), and negative predictive value (NPV) as TN/(TN+FN). Sensitivity was estimated as TP/P and specificity as TN/N. Accuracy was used to select the optimal model (largest value). Sensitivity in this context is also referred to as the true positive rate or recall, and PPV as precision. In the following, a short description of the data mining algorithms used in model selection is given:

  1. Random Forest
    A random forest is a classifier consisting of a number of random trees (specified by the number-of-trees parameter) used for classification, regression, and other tasks. RF creates multiple classification and regression (CART) trees, each trained on a bootstrap sample of the original training data, and searches across a randomly selected subset of input variables to determine each split [25-27]. Random decision forests correct for decision trees' habit of overfitting to their training set [28]. In this study, the default number of trees (ntree = 500) in RF was employed to assess model accuracy.
  2. Decision tree
    Decision trees are among the most popular learning methods for data mining and come closest to meeting the requirements of an off-the-shelf procedure, because they are relatively fast to construct and produce interpretable models. They are invariant under scaling and various other transformations of feature values, are robust to the inclusion of irrelevant features, and produce inspectable models [28].
  3. Logistic regression
    Logistic regression is a technique borrowed by machine learning from the field of statistics. This method is appropriate to conduct a regression analysis with a dichotomous (binary) dependent variable. The logistic regression model is very popular due to its simplicity and ability to make inferential statements about model terms [29, 30].
  4. Naive Bayes
    The Naive Bayes algorithm is a predictive model and classification technique based on the Bayes theorem. It simplifies the probabilities of the predictor values by assuming that all of the predictors are independent of the others. Naive Bayes Classifier is one of the simple and most effective classification algorithms which can be computed quickly and performs competitively in many cases [30].
  5. Bayes Net
    Bayesian networks, also known as belief networks (or Bayes nets for short), are probabilistic graphical models that represent a set of variables and their probabilistic relationships via a directed acyclic graph. Bayes nets are well suited to taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms: given symptoms, the network can compute the probabilities of the presence of various diseases [31, 32].
  6. Gradient Boosted Tree
    The Gradient Boosted Trees operator trains a model by iteratively improving a single tree model. After each iteration step, the training examples are reweighted based on their previous prediction, and the final model is a weighted sum of all created models. Training parameters are optimized based on the gradient of the error function. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification [33].
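The performance indices defined in the Model evaluation section can be computed directly from the confusion-matrix counts; a minimal sketch:

```python
def metrics(tp, tn, fp, fn):
    """Performance indices from confusion-matrix counts, as defined in
    the Model evaluation section (P = TP+FN, N = TN+FP)."""
    p, n = tp + fn, tn + fp
    total = p + n
    acc = (tp + tn) / total          # accuracy
    sens = tp / p                    # sensitivity (recall, TPR)
    spec = tn / n                    # specificity
    ppv = tp / (tp + fp)             # positive predictive value (precision)
    npv = tn / (tn + fn)             # negative predictive value
    # Cohen's kappa: agreement beyond what chance alone would produce
    p_yes = ((tp + fp) / total) * (p / total)
    p_no = ((fn + tn) / total) * (n / total)
    pe = p_yes + p_no                # expected chance agreement
    kappa = (acc - pe) / (1 - pe)
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "ppv": ppv, "npv": npv, "kappa": kappa}
```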


Results

Patient characteristics

The mean age of the 249 patients included in the study was 60.12±16.29 years. The categorical and some numerical variables are summarized in Tables 1 and 2, respectively. Table 1 shows the frequencies and percentages of the categorical variables in each class. Table 2 reports the minimum (Min), maximum (Max), mean, and standard deviation (Std. Deviation) of the numerical variables.

Feature selection

The 25 features (about 30% of all features) with the highest weighting scores were selected using the various feature selection operators and are presented in Table 3. A visual representation of all selected features is shown in Fig 1A, which highlights the features with higher weight scores across multiple feature selection algorithms. The most important protein features across all feature selection algorithms are given in Fig 1B. The size of each word reflects the importance of that term, i.e., how frequently it appears across all feature selection techniques. According to the figures, the most frequent non-protein features are CD19 and CR_duration, while the more frequent protein features include Albumin, HGB, STAT5.p431, BAD.p112, and P70S6K. Fig 2 presents the contribution of the protein and non-protein features in each dataset produced by the various feature selection algorithms.

Fig 1. The most important features (A) and the most important proteins (B) in all feature selection algorithms.

Fig 2. Distribution of protein features among studied datasets.

Table 3. Top 25 most important features selected by various feature selection methods.

Assessing the predictive ability of the model

In this section, we evaluate the ability of the selected models to predict the survival status of AML patients. Table 4 gives the accuracy percent of the proposed prediction models using a 10-fold cross-validation procedure over various feature selection algorithms. Table 5 reports the Accuracy, Kappa value, Specificity, Sensitivity, PPV (precision), NPV, and AUC of the proposed prediction models.

Table 4. Accuracy (%) of the prediction algorithms using a 10-fold cross-validation procedure for each of the datasets.

Table 5. Performance of the prediction algorithms based on the best feature selection approach.

In the RF method, the dataset derived from the information gain criterion outperformed the others, with an accuracy of 84.33% and an AUC of 0.874. Fig 3 shows the decision tree pattern produced by the RF model with the Information Gain dataset. In the DT model, the Information Gain Ratio and Chi-Squared datasets achieved close accuracy rates of 83.52% and 83.12%, with precision rates of 81.36% and 76.47%, respectively. The LR method with the Information Gain and Gini Index criteria reached similar accuracy rates of approximately 82%, while Information Gain gave the higher precision (71.79%). Naive Bayes produced approximately the same accuracy with the datasets from the Chi-Squared, Correlation, and Uncertainty criteria (82.33%, 82.70%, and 82.73%, respectively). The W-Bayes Net method produced its highest accuracy and precision with the Chi-Squared dataset (83.17% and 73.08%, respectively). For the GBT algorithm, the accuracy of the Relief and Chi-Squared datasets is approximately the same (85.17% versus 85.15%), while the Relief criterion gives higher precision (78.87% versus 78.08%). The GBT model with the Relief dataset yields a kappa of 0.644, sensitivity of 71.97%, specificity of 91.23%, PPV of 78.87%, NPV of 87.64%, and AUC of 0.930. More details are given in S2 Table.

Fig 3. The decision tree pattern performed by RF model with Information Gain dataset.

On the other hand, Table 4 shows that prediction based on all features yields the lowest accuracy rates for the RF, LR, and Naive Bayes models, so the datasets produced by feature selection give better accuracy overall. The RF and GBT models outperform the others with accuracies of 84.33% and 85.17%, respectively, while RF has the higher precision (81.97%) and GBT the higher AUC (0.930) (Table 5). Considering both precision and sensitivity is useful because the observations are imbalanced between the two classes, alive and dead.

Because the CR_duration variable may not be available at disease onset, we carried out the prognosis prediction again without this feature. Table 6 presents the accuracy (%) of the prediction algorithms using a 10-fold cross-validation procedure for each of the datasets. As shown in the table, the accuracy decreases after removing the CR_duration variable from the model. Here again, the GBT algorithm produced better accuracy than the other algorithms, this time on the Information Gain dataset. We next summarize the GBT model output with a variable importance table and a plot of the model. Fig 4 shows the model deviance as a function of the number of trees N (the number of gradient boosting iterations); the cross-validated deviance is shown in green. As can be seen, the deviance increases after a certain number of trees, and the value of N that minimizes the cross-validated deviance is taken as the optimal number of trees, here 318. Fig 5 provides a visual representation of the important features in the final GBT model. Table 7 reports the rank of each feature in the GBT model by relative influence, a measure of the relative importance of each variable in the final model. Age, LDH, FAB, ERk2.p, and HGB are the five most important variables in our GBT model.

Table 6. Accuracy (%) of the prediction algorithms using a 10-fold cross-validation procedure for each of the datasets.

Table 7. The relative influence of each feature in the GBT model.

Comparing with cytogenetics

The distribution of patients with favorable, intermediate, or unfavorable cytogenetics across vital status was significantly uneven (χ2 test = 29.177 on 2 degrees of freedom, P-value<0.0001; Table 8). Patients with favorable and intermediate cytogenetics were significantly overrepresented among the alive and underrepresented among the dead, while patients with unfavorable cytogenetics were overrepresented among the dead and underrepresented among the alive.
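A chi-squared test of independence like the one reported above is computed from a contingency table of observed counts; the counts in the test below are illustrative, not the study's Table 8 data:

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an
    r x c contingency table (list of rows of observed counts)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total  # expected under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df
```

For the study's 3 risk groups by 2 vital statuses, the table has (3-1)×(2-1) = 2 degrees of freedom, matching the reported test.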

Table 8. Distribution of cytogenetic risk groups across the vital status.

The three major cytogenetic risk groups are unevenly distributed across the three age groups (χ2 test = 24.05 on 4 degrees of freedom, P-value<0.0001; Table 9). The percentage of patients with favorable cytogenetics dropped from 85.7% in those younger than 56 to 9.5% in those between 56 and 75 years old, whereas the proportion of patients with intermediate (or unfavorable) cytogenetics increased from 35.7% (30.1%) in those younger than 56 to 45.2% (46%) in patients between 56 and 75 years old.

Table 9. Distribution of cytogenetic risk groups across the three age groups.


Discussion

In the present study, several machine learning methods were used and compared to predict survival outcomes in patients with AML. Six data mining algorithms were employed: RF, DT, LR, Naive Bayes, W-Bayes Net, and GBT. Based on the obtained accuracy measures, all the classification methods performed roughly similarly in classifying AML survival, with accuracies ranging between 66.27% and 85.17%. The GBT method produced slightly better accuracy (77.55%) in comparison to the other methods in model evaluation. In addition, all classification methods were efficient in predicting the classes for AML survival status.

Previous studies have used prediction models for cancer survivability. For example, Delen et al. used ANN, DT, and LR methods for the prediction of survival in breast cancer [8]. They showed that DT was the best predictor with 93.6% accuracy, while ANN and LR had 91.2% and 89.2% accuracy, respectively. Similarly, Delen et al. used SVM, DT, ANN, and LR algorithms for the prediction of survival in prostate cancer, in which SVM yielded the best accuracy (92.85%) [11]. Noohi et al. evaluated models for predicting survival of colon cancer patients, showing that ANN was the most accurate model [34].

Notably, most clinical studies have reported RF as the preferred algorithm [12, 13]. Ganggayah et al. showed that the RF algorithm produced slightly better accuracy (82.7%) than the other evaluated algorithms in predicting factors for survival of breast cancer patients [35]. Chebouba et al. proposed a stochastic local search meta-heuristic as a feature selection method combined with a random forest classifier to classify AML patients' response to treatment [36]. They used the BAC and AUC scores as evaluation criteria and three types of data: clinical data alone, proteomics data alone, and clinical and proteomics data combined. Their numerical results showed that the highest scores were obtained using clinical data alone and the lowest using proteomics data alone; furthermore, their method found promising results compared to the methods presented in the DREAM challenge. Wang et al. proposed a Bayesian nonparametric variable selection approach to identify prognostic genes for the clinical prognosis prediction of AML [37]. In their study, the overall survival times of AML were dichotomized with a one-year cutoff, and the patients were classified into high-risk and low-risk subcategories. There were 53 candidate genes identified from 14,892 genes. With the same parameters and iteration counts as in the previous procedures, they separately identified 12, 16, and 18 prognostic genes as features for model construction. The top 18 genes were AADACL1, ABCB10, AIM1, APP, ATF3, BNIP3, DAPK1, DYM, FHL1, GMPR, GUCY1A3, LEF1, MKRN1, MXRA7, NPAL3, SOCS2, TESC, and TM4SF1.

Ganggayah et al. obtained an accuracy between 79.8 to 82.7% in predicting factors for survival of breast cancer patients. The factors that were important in their study were cancer stage, tumor size, number of total axillary lymph nodes removed, number of positive lymph nodes, types of primary treatment, and methods of diagnosis [35].

Based on the results shown in Tables 4 and 5, among the datasets derived from feature selection, GBT on the Relief dataset had the best performance compared to the other algorithms and datasets (with an accuracy of 85.17% and AUC of 0.930). The features in this dataset are CR_duration, Age, TP.AFTER.SAMPLE_DATE, CD19, SEX, RACE, PRIOR_MAL, PRIOR_XRT, FIBRINOGEN, PRIOR_CHEMO, PKCA.p, ALBUMIN, CD34, HGB, PTEN.p, FAB, PKCA, STAT5.p431, ITD, LDH, INFECTION, ZUBROD.S, P70S6K, and PB_Blast. These features could therefore be better predictors of AML survival in this analysis.

Similar to our findings, Walter et al. identified albumin as one of the factors determining the risk of treatment-related mortality (TRM) in AML patients in multivariate models [38]. Jabbour et al., using multivariate analysis, showed that low albumin is an independent adverse factor for CR in AML patients [39]. Also, Liu et al. showed that in gastric cancer patients, poorer survival was observed with lower levels of BMI (P = 0.028), albumin (P = 0.004), and triglyceride (P = 0.043). Based on ROC curve analyses, they suggested that BMI, albumin, and triglyceride have survival-predictive power similar to that of the TNM staging system [40]. Other studies have shown an inverse association between blood albumin levels and mortality in the general population [41-45].

Our results highlighted CD19 as one of the main deterministic factors for survival outcomes. CD19 is a cell-surface marker for the diagnosis of B-lineage mixed-phenotype acute leukemia (MPAL) [4]. Wang et al. used multivariate analysis of clinical features of 188 patients with AML-M2 and showed that CD19 expression is one of the main factors impacting the prognosis of patients. Therefore, the level of CD19 might be a useful indicator of survival rate. CD19 expression is regulated by paired box transcription factor 5 (PAX5) [46]. Tiacci et al. reported that PAX5 was not expressed in most cases of T-cell acute lymphoblastic leukemia and AML, while it was expressed in RUNX1-RUNX1T1-positive AML cases (15 of 42 (35.7%) AML cases with RUNX1-RUNX1T1). Thus, PAX5 might be exceptionally expressed in RUNX1-RUNX1T1-positive AML, resulting in CD19 surface expression [47]. Inappropriate PAX5 expression, with simultaneous CD19 expression, in RUNX1-RUNX1T1-positive AML cases induces bi-phenotypic features and blocks myeloid differentiation [48, 49].

As our results showed, the level of HGB could be another factor determining survival. It has been reported that AML patients with a monosomal karyotype were significantly older and had lower HGB concentrations and lower WBC counts [50]. Xu et al. reported that low levels of hemoglobin, albumin, lymphocytes, and platelets could serve as significant risk factors for recurrence-free survival and overall survival in patients with resected pancreatic cancer, and that these low levels were associated with lymph node metastasis, poor tumor differentiation, and high TNM staging [51]. Also, Sweiss et al. suggested that hemoglobin and creatinine clearance are important predictors of treatment-free survival after autologous stem cell transplantation for multiple myeloma: lower hemoglobin, lower creatinine clearance, and a combined low hemoglobin and creatinine clearance were strongly associated with improved treatment-free survival [52].

Furthermore, Zhang et al. found that female sex, HGB ≥ 100 g/L, FLT3-ITD mutation negativity, and 10 mg/m2 Ida were favorable factors for CR [53]. Our results also show an association between CR and survival rate. Most patients with newly diagnosed AML achieve CR with induction chemotherapy, although the majority of patients relapse despite intensive consolidation chemotherapy. The prognostic factor predicting the duration of a second CR is the duration of the first CR [54]. Ferrara et al. stated that the duration of the first CR and cytogenetics are the most applicable prognostic factors in relapsed AML [55].

Similar to what we found regarding the most common protein features of AML patients, Ruvolo et al. showed that p-GSK3α/β, as an indicator of AKT activation, positively correlated with phosphorylation of AKT, BAD, and P70S6K [56]. They suggested AKT-mediated phosphorylation of GSK3α/β as a determinant of the overall survival of AML patients.

Age is generally the most important prognostic factor in AML [57]. Utkarsh et al. (2018) confirmed that aging confers inferior survival in AML: with every 5-year increase in age, the hazard ratio rises by 22%. This may be the result of poor performance status, multi-drug resistance, and complicated disease biology, which accompany the disease with increasing age [58-60].

Comparing AML patients younger than 56 with patients older than 75, multidrug resistance was found in 33% versus 57%, respectively. The percentage of patients with favorable cytogenetics dropped from 17% to 4%, whereas the proportion of patients with unfavorable cytogenetics increased from 35% to 51%. Marked increases in abnormalities of chromosomes 5, 7, and 17 were also observed in the elderly. The increased incidence of unfavorable cytogenetics in this risk group contributed to their poorer outcome, and therapy outcomes decline markedly with age [59]. Consistent with these studies, as shown in Table 7, Age, LDH, and FAB were the most influential features in the GBT model. Table 8 showed that the proportions of favorable and unfavorable cytogenetic groups differ significantly between the two vital statuses and affect the survival of patients. Table 9 showed that with increasing age, the proportion of patients with favorable cytogenetics decreased while the proportion with unfavorable cytogenetics increased. Thus, the feature encoding the cytogenetic classification of AML patients was among the most important features in the Relief dataset and among the influential features in the GBT model (Tables 3 and 7).

Several studies have used the cytogenetic category to predict the survival of AML patients [61–63]. Cytogenetic analysis is of important value in the clinical management of patients with AML. Many kinds of chromosome changes may occur in the AML cells of patients, some of which can affect a person's prognosis. The cytogenetic abnormalities most commonly associated with AML have been characterized at the molecular level, and the identification of recurrent chromosomal translocations and inversions associated with this disease has enabled the molecular characterization of the chromosomal breakpoint regions [64].

Manola reported that cytogenetic and molecular abnormalities are involved in the pathogenesis of childhood AML, with clonal chromosome abnormalities in 70–85% of cases. He suggested that cytogenetic analysis should be performed in all pediatric AML patients at diagnosis and during the course of the disease, as additional chromosome abnormalities could predict an imminent relapse. Patients may be stratified to different therapies based on the results of conventional cytogenetic analysis, molecular cytogenetic analysis, and/or molecular genetic investigations [65].

Grimwade et al. likewise demonstrated the importance of diagnostic cytogenetics as an independent prognostic factor in AML, providing the framework for a stratified treatment approach to the disease [66]. Fröhling et al. reported the clinical importance of cytogenetics and age in AML patients older than 60 years. They showed that a large subgroup of patients, characterized by age 70 or older, high-risk cytogenetics, or both, had very unfavorable long-term outcomes, with patients above age 70 with high-risk cytogenetics showing a particularly poor prognosis [67].

LDH was another important feature in the Relief dataset model (Tables 3 and 7). In myelodysplastic syndromes (MDS), various mechanisms may cause an increase in LDH. One is the increased turnover and degradation of myeloid cells in the bone marrow, spleen, and other tissues; another is ineffective hematopoiesis. Additional cofactors may be infiltration of the liver and spleen by immature myeloid cells or iron overload. The exact biochemical basis of increasing LDH in these patients remains unknown [68].

Carobbio et al. reported LDH values >1.5 times the upper limit of normal in 40% of AML patients [69]. Aul et al. showed that an LDH level >200 U/l indicates significantly shorter survival than a lower LDH level (≤200 U/l) [70]. Wimazal et al. showed that an elevated LDH level (≥300 U/l) in myelodysplastic syndromes is associated with reduced survival as well as an increased risk of transformation to secondary AML [68]; LDH may therefore be considered an additional useful parameter in MDS. They further showed that an increase in LDH over time is associated with a higher probability of progression to AML and a reduced probability of survival, and recommended using LDH as a prognostic follow-up parameter in MDS [71].

Another important feature in the Relief dataset was the FAB classification (Tables 3 and 7). Canaani et al. suggested that the FAB classification contributes to and enhances the prognostic capacity of the molecular data provided by FLT3-ITD and NPM1 mutational status, and supported using the FAB classification alongside molecular data in AML patients undergoing allogeneic stem cell transplantation [72].


Rapid prediction of patient survival at the beginning of treatment would be beneficial for selecting a better strategy and for changing it as soon as conditions worsen. This study introduced a methodology for analyzing data on cancer and other diseases: we applied a combination of feature weighting and machine learning techniques to determine the survival chance of patients. The results indicate that the five most important features that could serve as a predictive measure of AML patient survival are Age, LDH, FAB, Cytogenetic classification, and CD19. We also found that protein features including Albumin, HGB, TP53, STAT3, STAT5.p431, BAD.p112, ERk2.p, and P70S6K.p might be used to predict the survival rate of AML patients.
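The feature-weighting-plus-classifier pipeline with 10-fold cross-validation described above can be sketched as follows. This is a minimal illustration, not the paper's RapidMiner/R implementation: scikit-learn has no built-in Relief algorithm, so a univariate F-score filter stands in for the feature-weighting step (an assumption), and the data are synthetic rather than the AML cohort.

```python
# Minimal sketch of the pipeline: feature weighting/selection followed by a
# gradient-boosted tree classifier, evaluated with 10-fold cross-validation
# on accuracy and AUC. SelectKBest(f_classif) is a stand-in for Relief.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the patient table (500 cases, 30 candidate features).
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

pipe = make_pipeline(
    SelectKBest(f_classif, k=10),              # feature-weighting step
    GradientBoostingClassifier(random_state=0),  # GBT classifier
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
print(f"10-fold accuracy: {acc:.3f}, AUC: {auc:.3f}")
```

Placing the selector inside the pipeline ensures feature selection is refit on each training fold, avoiding the leakage that would arise from selecting features on the full dataset before cross-validation.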

Supporting information

S1 Table. Complete list of protein features in the present study.


S2 Table. The Kappa value, sensitivity, specificity, PPV, NPV and AUC of the prediction algorithms using a 10-fold cross-validation procedure for each of the datasets.



We thank the Medical Biology Research Center, Kermanshah University of Medical Sciences, for providing research facilities to conduct this study.


1. Pollyea D.A., Kohrt H.E., and Medeiros B.C., Acute myeloid leukaemia in the elderly: a review. British journal of haematology, 2011. 152(5): p. 524–542. pmid:21314823
2. Estey E.H., Acute myeloid leukemia: 2019 update on risk‐stratification and management. American Journal of Hematology, 2018. 93(10): p. 1267–1291. pmid:30328165
3. Arber D.A., et al., The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood, 2016. 127(20): p. 2391–2405. pmid:27069254
4. Döhner H., et al., Diagnosis and management of AML in adults: 2017 ELN recommendations from an international expert panel. Blood, 2017. 129(4): p. 424–447. pmid:27895058
5. Kourou K., et al., Machine learning applications in cancer prognosis and prediction. Computational and structural biotechnology journal, 2015. 13: p. 8–17. pmid:25750696
6. Omondiagbe, D.A., S. Veeramani, and A.S. Sidhu. Machine Learning Classification Techniques for Breast Cancer Diagnosis. in IOP Conference Series: Materials Science and Engineering. 2019. IOP Publishing.
7. Dubey, V.K. and A.K. Saxena. Hybrid classification model of correlation-based feature selection and support vector machine. in 2016 IEEE International Conference on Current Trends in Advanced Computing (ICCTAC). 2016. IEEE.
8. Delen D., Walker G., and Kadam A., Predicting breast cancer survivability: a comparison of three data mining methods. Artificial intelligence in medicine, 2005. 34(2): p. 113–127. pmid:15894176
9. Chao C.-M., et al., Construction the model on the breast cancer survival analysis use support vector machine, logistic regression and decision tree. Journal of medical systems, 2014. 38(10): p. 106. pmid:25119239
10. Montazeri M., et al., Machine learning models in breast cancer survival prediction. Technology and Health Care, 2016. 24(1): p. 31–42. pmid:26409558
11. Delen D., Analysis of cancer data: a data mining approach. Expert Systems, 2009. 26(1): p. 100–112.
12. Mosca E., et al., A multilevel data integration resource for breast cancer study. BMC systems biology, 2010. 4(1): p. 76. pmid:20525248
13. Genuer, R., J.-M. Poggi, and C. Tuleau-Malot, VSURF: an R package for variable selection using random forests. 2015.
14. RapidMiner Technical Support
15. KayvanJoo A.H., Ebrahimi M., and Haqshenas G., Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms. BMC research notes, 2014. 7(1): p. 1–11. pmid:25150834
16. Beiki A.H., Saboor S., and Ebrahimi M., A new avenue for classification and prediction of olive cultivars using supervised and unsupervised algorithms. PloS one, 2012. 7(9): p. e44164. pmid:22957050
17. Lee C. and Lee G.G., Information gain and divergence-based feature selection for machine learning-based text categorization. Information processing & management, 2006. 42(1): p. 155–165.
18. Lei, S. A feature selection method based on information gain and genetic algorithm. in 2012 International Conference on Computer Science and Electronics Engineering. 2012. IEEE.
19. Jia P., et al., Novel algorithm for attribute reduction based on mutual-information gain ratio. Journal-Zhejiang university engineering science, 2006. 40(6): p. 1041.
20. Breiman L., et al., Classification and regression trees. 1984: CRC press.
21. Shang W., et al., A novel feature selection algorithm for text categorization. Expert Systems with Applications, 2007. 33(1): p. 1–5.
22. Kira K. and Rendell L.A., A practical approach to feature selection, in Machine learning proceedings 1992. 1992, Elsevier. p. 249–256.
23. Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. in European conference on machine learning. 1994. Springer.
24. Witten I.H., et al., Practical machine learning tools and techniques. Morgan Kaufmann, 2005: p. 578.
25. Breiman L., Random forests. Machine learning, 2001. 45(1): p. 5–32.
26. Liaw, A. and M. Wiener, Documentation for R package randomForest. PDF). Retrieved, 2013. 15: p. 191.
27. Izmirlian G., Application of the random forest classification algorithm to a SELDI‐TOF proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences, 2004. 1020(1): p. 154–174. pmid:15208191
28. Hastie T., Tibshirani R., and Friedman J., The elements of statistical learning: data mining, inference, and prediction. 2nd edition. Springer, New York, 2009.
29. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. New York: Springer; 2013.
30. Kuhn M. and Johnson K., Applied predictive modeling. Vol. 26. 2013: Springer.
31. Neapolitan R.E., Learning bayesian networks. Vol. 38. 2004: Pearson Prentice Hall Upper Saddle River, NJ.
32. Ben‐Gal I., Bayesian networks. Encyclopedia of statistics in quality and reliability, 2008. 1.
33. Li P., Wu Q., and Burges C., Mcrank: Learning to rank using multiple classification and gradient boosting. Advances in neural information processing systems, 2007. 20: p. 897–904.
34. Noohi N.A., Ahmadzadeh M., and Fardaer M., Medical data mining and predictive model for colon cancer survivability. International Journal of Innovative Research in Engineering & Science, 2013. 2.
35. Ganggayah M.D., et al., Predicting factors for survival of breast cancer patients using machine learning techniques. BMC medical informatics and decision making, 2019. 19(1): p. 48. pmid:30902088
36. Chebouba L., Boughaci D., and Guziolowski C., Proteomics versus clinical data and stochastic local search based feature selection for acute myeloid leukemia patients’ classification. Journal of medical systems, 2018. 42(7): p. 129. pmid:29869179
37. Wang H., et al., Identifying oncogenes as features for clinical cancer prognosis by Bayesian nonparametric variable selection algorithm. Chemometrics and Intelligent Laboratory Systems, 2015. 146: p. 464–471.
38. Walter R.B., et al., Prediction of early death after induction therapy for newly diagnosed acute myeloid leukemia with pretreatment risk scores: a novel paradigm for treatment assignment. Journal of clinical oncology, 2011. 29(33): p. 4417. pmid:21969499
39. Jabbour E., et al., Acute myeloid leukemia after myelodysplastic syndrome and failure of therapy with hypomethylating agents: an emerging entity with a poor prognosis. Clinical Lymphoma Myeloma and Leukemia, 2014. 14(2): p. 93–97. pmid:24447728
40. Liu B.Z., et al., Preoperative body mass index, blood albumin and triglycerides predict survival for patients with gastric cancer. PLoS One, 2016. 11(6): p. e0157401. pmid:27309531
41. Okamura T., et al., A combination of serum low albumin and above-average cholesterol level was associated with excess mortality. Journal of clinical epidemiology, 2004. 57(11): p. 1188–1195. pmid:15612140
42. Kawut S.M., et al., New predictors of outcome in idiopathic pulmonary arterial hypertension. The American journal of cardiology, 2005. 95(2): p. 199–203. pmid:15642552
43. Tateishi R., et al., Proposal of a new prognostic model for hepatocellular carcinoma: an analysis of 403 patients. Gut, 2005. 54(3): p. 419–425. pmid:15710994
44. Schindl M., et al., Prognostic scoring in colorectal cancer liver metastases: development and validation. Archives of surgery, 2005. 140(2): p. 183–189. pmid:15724001
45. Greipp P.R., et al., International staging system for multiple myeloma. Journal of clinical oncology, 2005. 23(15): p. 3412–3420. pmid:15809451
46. Sakamoto K., et al., Negative CD19 expression is associated with inferior relapse‐free survival in children with RUNX1‐RUNX1T1–positive acute myeloid leukaemia: results from the Japanese Paediatric Leukaemia/Lymphoma Study Group AML‐05 study. British journal of haematology, 2019. 187(3): p. 372–376. pmid:31247675
47. Tiacci E., et al., PAX5 expression in acute leukemias: higher B-lineage specificity than CD79a and selective association with t (8; 21)-acute myelogenous leukemia. Cancer Research, 2004. 64(20): p. 7399–7404. pmid:15492262
48. Anderson K., et al., Ectopic expression of PAX5 promotes maintenance of biphenotypic myeloid progenitors coexpressing myeloid and B-cell lineage-associated genes. Blood, 2007. 109(9): p. 3697–3705. pmid:17218387
49. Sekine R., et al., Efficient retroviral transduction of human B-lymphoid and myeloid progenitors: marked inhibition of their growth by the Pax5 transgene. International journal of hematology, 2008. 87(4): p. 351–362. pmid:18415655
50. Luan Y., et al., Prognostic significance of monosomal karyotype in acute myeloid leukemia. Zhonghua xue ye xue za zhi = Zhonghua xueyexue zazhi, 2015. 36(4): p. 286–290. pmid:25916287
51. Xu S.-S., et al., Haemoglobin, albumin, lymphocyte and platelet predicts postoperative survival in pancreatic cancer. World journal of gastroenterology, 2020. 26(8): p. 828. pmid:32148380
52. Sweiss K., et al., Pretransplant hemoglobin and creatinine clearance correlate with treatment-free survival after autologous stem cell transplantation for multiple myeloma. Bone Marrow Transplantation, 2019: p. 1–7. pmid:31388085
53. Zhang Y., et al., Comparison of 10 mg/m2 or 8 mg/m2 idarubicin plus cytarabine regimen as induction chemotherapy for adult patients with newly diagnosed acute myeloid leukemia. Zhonghua xue ye xue za zhi = Zhonghua Xueyexue Zazhi, 2015. 36(3): p. 225–229. pmid:25854467
54. Lee S., et al., Duration of second complete remission compared with first complete remission in patients with acute myeloid leukemia. Leukemia, 2000. 14(8): p. 1345–1348. pmid:10942227
55. Ferrara F., Palmieri S., and Mele G., Prognostic factors and therapeutic options for relapsed or refractory acute myeloid leukemia. Haematologica, 2004. 89(8): p. 998–1008. pmid:15339685
56. Ruvolo P.P., et al., Phosphorylation of GSK3α/β correlates with activation of AKT and is prognostic for poor overall survival in acute myeloid leukemia patients. BBA clinical, 2015. 4: p. 59–68. pmid:26674329
57. Mauritzson N., et al., Survival time in a population-based consecutive series of adult acute myeloid leukemia–the prognostic impact of karyotype during the time period 1976–1993. Leukemia, 2000. 14(6): p. 1039–1043. pmid:10865970
58. Acharya U.H., et al., Impact of region of diagnosis, ethnicity, age, and gender on survival in acute myeloid leukemia (AML). Journal of drug assessment, 2018. 7(1): p. 51–53. pmid:30034924
59. Appelbaum F.R., et al., Age and acute myeloid leukemia. Blood, 2006. 107(9): p. 3481–3485. pmid:16455952
60. Leith C.P., et al., Acute myeloid leukemia in the elderly: assessment of multidrug resistance (MDR1) and cytogenetics distinguishes biologic subgroups with remarkably distinct responses to standard chemotherapy. A Southwest Oncology Group study. Blood, The Journal of the American Society of Hematology, 1997. 89(9): p. 3323–3329.
61. Wagner S., et al., A parsimonious 3-gene signature predicts clinical outcomes in an acute myeloid leukemia multicohort study. Blood advances, 2019. 3(8): p. 1330–1346. pmid:31015209
62. Lu, C.-C., et al. A BLSTM with attention network for predicting acute Myeloid Leukemia patient’s prognosis using comprehensive clinical parameters. in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 2019. IEEE.
63. Shreve J., et al., A personalized prediction model to risk stratify patients with acute myeloid leukemia (AML) using artificial intelligence. 2019, American Society of Hematology Washington, DC.
64. Grimwade D., The clinical significance of cytogenetic abnormalities in acute myeloid leukaemia. Best practice & research Clinical haematology, 2001. 14(3): p. 497–529. pmid:11640867
65. Manola K.N., Cytogenetics of pediatric acute myeloid leukemia. European journal of haematology, 2009. 83(5): p. 391–405. pmid:19563518
66. Grimwade D., et al., The importance of diagnostic cytogenetics on outcome in AML: analysis of 1,612 patients entered into the MRC AML 10 trial. Blood, The Journal of the American Society of Hematology, 1998. 92(7): p. 2322–2333. pmid:9746770
67. Fröhling S., et al., Cytogenetics and age are major determinants of outcome in intensively treated acute myeloid leukemia patients older than 60 years: results from AMLSG trial AML HD98-B. Blood, 2006. 108(10): p. 3280–3288. pmid:16840728
68. Wimazal F., et al., Prognostic value of lactate dehydrogenase activity in myelodysplastic syndromes. Leukemia research, 2001. 25(4): p. 287–294. pmid:11248325
69. Carobbio A., et al., A multistate model of survival prediction and event monitoring in prefibrotic myelofibrosis. Blood cancer journal, 2020. 10(10): p. 1–7. pmid:33056979
70. Aul C., et al., Primary myelodysplastic syndromes: analysis of prognostic factors in 235 patients and proposals for an improved scoring system. Leukemia, 1992. 6(1): p. 52–59. pmid:1736014
71. Wimazal F., et al., Prognostic significance of serial determinations of lactate dehydrogenase (LDH) in the follow-up of patients with myelodysplastic syndromes. Annals of oncology, 2008. 19(5): p. 970–976. pmid:18272915
72. Canaani J., et al., Impact of FAB classification on predicting outcome in acute myeloid leukemia, not otherwise specified, patients undergoing allogeneic stem cell transplantation in CR 1: An analysis of 1690 patients from the acute leukemia working party of EBMT. American journal of hematology, 2017. 92(4): p. 344–350. pmid:28052366