Machine learning driven dashboard for chronic myeloid leukemia prediction using protein sequences

  • Waqar Ahmad ,

    Contributed equally to this work with: Waqar Ahmad, Abdul Raheem Shahzad, Muhammad Awais Amin, Waqas Haider Bangyal, Tahani Jaser Alahmadi, Saddam Hussain Khan

    Roles Conceptualization, Data curation, Methodology, Software, Writing – review & editing

    Affiliation Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan

  • Abdul Raheem Shahzad ,

    Contributed equally to this work with: Waqar Ahmad, Abdul Raheem Shahzad, Muhammad Awais Amin, Waqas Haider Bangyal, Tahani Jaser Alahmadi, Saddam Hussain Khan

    Roles Data curation, Formal analysis, Methodology, Software, Visualization

    Affiliation CECOS University of IT and Emerging Sciences, Peshawar, Khyber Pakhtunkhwa (KPK), Pakistan

  • Muhammad Awais Amin ,

    Contributed equally to this work with: Waqar Ahmad, Abdul Raheem Shahzad, Muhammad Awais Amin, Waqas Haider Bangyal, Tahani Jaser Alahmadi, Saddam Hussain Khan

    Roles Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan, Data Science Consultant, Datamatics Technologies, Islamabad, Pakistan

  • Waqas Haider Bangyal ,

    Contributed equally to this work with: Waqar Ahmad, Abdul Raheem Shahzad, Muhammad Awais Amin, Waqas Haider Bangyal, Tahani Jaser Alahmadi, Saddam Hussain Khan

    Roles Formal analysis, Investigation, Project administration, Supervision, Validation

    Affiliation Department of Computer Science, Kohsar University Murree, Punjab, Pakistan

  • Tahani Jaser Alahmadi ,

    Contributed equally to this work with: Waqar Ahmad, Abdul Raheem Shahzad, Muhammad Awais Amin, Waqas Haider Bangyal, Tahani Jaser Alahmadi, Saddam Hussain Khan

    Roles Conceptualization, Formal analysis, Funding acquisition, Project administration, Resources, Validation

    tjalahmadi@pnu.edu.sa

    Affiliation Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia,

  • Saddam Hussain Khan

    Contributed equally to this work with: Waqar Ahmad, Abdul Raheem Shahzad, Muhammad Awais Amin, Waqas Haider Bangyal, Tahani Jaser Alahmadi, Saddam Hussain Khan

    Roles Data curation, Formal analysis, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing

    Affiliation Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan

Abstract

The prevalence of Leukaemia, a malignant blood cancer that originates from hematopoietic progenitor cells, is increasing in Southeast Asia, with a worrisome fatality rate of 54%. Predicting outcomes in the early stages is vital for improving the chances of patient recovery. The aim of this research is to substantially enhance early-stage prediction systems. Using Machine Learning and Data Science, we exploit protein sequence data from commonly altered genes, including BCL2, HSP90, PARP, and RB, to make predictions for Chronic Myeloid Leukaemia (CML). Our methodology relies on established feature extraction methods, namely Di-peptide Composition (DPC), Amino Acid Composition (AAC), and Pseudo Amino Acid Composition (Pse-AAC). We also identify and handle outliers, and validate feature selection using the Pearson Correlation Coefficient (PCC). Data augmentation is applied to enlarge the dataset for analysis. Using several Machine Learning models, namely Support Vector Machine (SVM), XGBoost, Random Forest (RF), K Nearest Neighbour (KNN), Decision Tree (DT), and Logistic Regression (LR), we achieved accuracy rates ranging from 66% to 94%. These classifiers are evaluated using performance criteria such as accuracy, sensitivity, specificity, F1-score, and the confusion matrix. The solution we propose is a user-friendly online application dashboard that can be used for early detection of CML. This tool has significant implications for practitioners and may be used in healthcare institutions and hospitals.

Introduction

Leukemia is a complex medical condition influenced by the genetic regulation of blood cell production. When hematopoietic precursor cells turn malignant [1], alterations in DNA and RNA sequences give rise to abnormal cell growth. Malignant cells then infiltrate the population of healthy cells, causing Leukemia. The illness primarily entails the uncontrolled proliferation of White Blood Cells (WBC), specifically neutrophils, basophils, and eosinophils, while lymphocytes remain unaffected. Acute myeloid Leukemia (AML), chronic myeloid Leukemia (CML), acute lymphoblastic Leukemia (ALL), and chronic lymphocytic Leukemia (CLL) are among the several kinds of Leukemia [2]. Our research focuses solely on Chronic Myeloid Leukemia (CML).

Leukemia presents a substantial health challenge due to the abnormal proliferation of White Blood Cells (WBC) [1]. While research has concentrated on detecting cancer through blood cell images, exploration of protein sequence data is limited. Leukemia diagnosis relies heavily on hematologists, which is a limitation in regions with a scarcity of specialists. Mortality rates are on the rise, particularly in South East Asia [3], creating a demand for an early detection approach. The proposed research is motivated by the observation that a plethora of research has been conducted on predicting other cancers, such as lung, liver, colon, and ovarian cancer, utilizing MRI (magnetic resonance imaging), CT (computed tomography) scans, image processing techniques, and protein sequences [4–6]. However, the realm of gene data in bioinformatics remains relatively uncharted, especially within the context of Chronic Myeloid Leukemia (CML). At present, no AI-based dashboard system predicts Leukemia based on protein sequences. Developing such a system could improve diagnosis, save lives, and ease healthcare burdens. Collaborative efforts between Machine Learning and Data Science can establish a robust model for accessible and timely Leukemia solutions.

As illustrated in Fig 1, the proposed research suggests the utilization of Machine Learning-based techniques to identify genes that cause Leukemia through Protein Sequences, aiming for early detection and a reduction in the mortality rate. This undertaking could emerge as a flagship initiative in health sciences, addressing the shortage of specialized hematologists. Implementation of the system would result in timely interventions and improved recovery prospects. Automating certain diagnostic processes could ease the load on specialists and enhance healthcare services. The potential impact goes beyond Leukemia diagnosis, garnering recognition, and interest from the medical community. Overall, this AI-driven research holds immense promise in reshaping healthcare and propelling the advancement of AI applications. Because of this research, innovative insights, and progress in predicting and comprehending CML could come to fruition. This might lead to more effective diagnostic and treatment methodologies, benefiting patients and healthcare systems. Furthermore, the successful integration of bioinformatics and AI could pave the way for pioneering applications and further interdisciplinary research at the intersection of these two promising domains.

Fig 1. Various stages of chronic Myeloid leukemia classification.

https://doi.org/10.1371/journal.pone.0321761.g001

The main contributions of our proposed research are as follows:

  • The current study focuses on protein sequential data rather than image data.
  • The most frequently mutated genes that were responsible for chronic myeloid leukemia were discovered through a literature review.
  • Datasets were formulated from the most frequently mutated gene data.
  • Features were gathered through analysing the physicochemical features of the amino acid composition, pseudo amino acid composition, and di-peptide composition.
  • The study aims to increase patient recovery prospects by improved early-stage prognosis.
  • The solution we suggest is a user-friendly online application dashboard that serves as a vital tool for early identification of CML. It can be easily implemented in healthcare facilities and hospitals.

The remainder of this paper is structured as follows. The Introduction outlines the problem statement. The Literature review discusses related research and positions our study within the existing body of knowledge. Materials and methods details the dataset creation process and experimental techniques. Development of individual classifiers presents our methodology and analysis. Results and discussion interprets the findings. Lastly, the Conclusion summarizes our contributions and outlines future research directions.

Literature review

This section discusses recent Leukemia research, focusing on protein sequences, RNA, and blood cell imagery. It also describes how datasets are acquired and formed, which is pivotal in creating standardized Leukemia datasets from protein sequences. Importantly, previous researchers have not combined the three feature extraction techniques used in this study (AAC, Pse-AAC, and DPC) while also implementing a user-friendly dashboard. In [7], the Random Forest model was utilized to diagnose the cancerous growth of White Blood Cells with an accuracy of 94.3%. In [8], the classifier was evaluated using 60 images, demonstrating that models like K-nearest neighbors and Naive Bayes Classifier could identify ALL with an accuracy of 92.8%. In [9], the Artificial Bee Colony algorithm – Back Propagation Neural Network (ABC-BPNN) scheme and Principal Component Analysis (PCA) were used to classify Leukemia cells with an average accuracy of 98.72% while also speeding up the calculation.

In [10], Jothi et al. investigated the identification of leukemia sub-types, particularly ALL, using BSA-based clustering and classification algorithms such as decision tree (DT), K-nearest neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM). The SVM model exhibited an accuracy rate of 89.81%. An SVM model was also used in [11] to identify ALL, with an accuracy rate of 89.81%. In [12], ALL was classified using the K-nearest neighbor method, with a 96.25% accuracy rate. The studies in [36,37] explored the use of ML algorithms to analyze gene expression patterns derived from RNA sequencing (RNA-seq) for accurately predicting the likelihood of complete remission (CR) in pediatric AML patients after induction therapy. The study in [38] developed models for predicting and classifying different stages of colon cancer using RNA-seq data of extracellular vesicles (EV) from healthy individuals and colon cancer patients. It employed five canonical ML and Deep Learning (DL) classifiers and reported accuracies of 94.6% for K-nearest neighbor, 97.33% for Random Forest, 93% for LMT, and 92% for Random Tree. In [39], the early diagnosis and distinction between types of lung cancer, i.e., Non-Small Cell Lung Cancer and Small Cell Lung Cancer, were highlighted as crucial for improving patient survival rates. The proposed diagnostic system utilized sequence-derived structural and physicochemical attributes of proteins associated with tumor types, employing feature extraction, selection, and prediction models.

Dhakal et al. [40,41] developed a stacking classifier method that specifically targets CTS selection criteria by utilising feature-encoding approaches. This algorithm generates feature vectors that include k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. The stacking classifier method demonstrated superior performance compared to prior cutting-edge algorithms in identifying functional miRNA targets, with an accuracy rate of 79.77%. In another study, Albitar et al. [50] used Next Generation Sequencing (NGS) and targeted RNA sequencing along with a machine learning approach to investigate the potential of discovering new biomarkers that can predict acute graft-vs.-host disease (aGVHD). Ahmad et al. [51] predicted chronic Lymphocytic Leukemia using protein sequences with Chou’s Pseudo Amino Acid Composition (PseAAC) and statistical moments. Jian et al. [52] utilised deep learning (DL) to develop a prediction model for transcription factor binding sites using only the original DNA base sequences. In that study, a deep learning approach utilising a convolutional neural network (CNN) and long short-term memory (LSTM) was developed to analyse four distinct categories of Leukaemia based on transcription factor binding sites. The analysis was conducted using four extensive non-redundant datasets for acute, chronic, myeloid, and lymphatic Leukaemia. The method achieved an average prediction accuracy of 75%.

Materials and methods

The proposed research centers on the detection of leukemia, specifically Chronic Myeloid Leukemia (CML), characterized by the neoplastic proliferation of White Blood Cells (WBCs) such as neutrophils, basophils, and eosinophils, while excluding lymphocytes. As previously mentioned, CML is linked to a heightened mortality rate because it is typically diagnosed at advanced stages, which makes effective recovery difficult. In response to this concern, we aim to create a dashboard to identify leukemia utilizing protein sequence data. To achieve this goal, we collected data on the most frequently mutated genes related to leukemia, leveraging the physicochemical properties of protein sequences for feature extraction. Subsequently, data augmentation techniques were applied to enlarge the extracted feature set, while outliers were detected and removed to ensure data quality. We employed a diverse set of machine learning algorithms, including Support Vector Machine (SVM) [14,15,53,57], XGBoost, Random Forest [16,17], KNN [18,19], logistic regression [54,58,59], and decision tree, as comprehensively described in review studies [20,21,26,55].

The accuracy of each algorithm was evaluated, and the one exhibiting the highest accuracy was selected for integration into our system. This chosen algorithm determines the presence or absence of cancer in an individual. Finally, we serialized our model using tools such as Pickle or Joblib, facilitating the preservation of the trained model alongside its associated data. These trained models were then incorporated into a Streamlit-based dashboard, enhancing their user-friendly deployment in hospitals and other medical facilities (see Fig 2).
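The serialization step described above can be illustrated with a minimal sketch using Python's standard `pickle` module. The "model" here is a hypothetical stand-in (a plain dictionary of linear weights), not the classifier trained in this study, and the file name `cml_model.pkl` is likewise an assumption made only for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Write a trained model object to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Read a previously pickled model object back into memory."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict(model, features):
    """Linear score followed by a 0/1 decision (toy stand-in for a classifier)."""
    score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return 1 if score > 0 else 0


# Hypothetical stand-in for a fitted classifier; not the study's model.
trained = {"name": "toy-linear", "weights": [0.8, -0.5], "bias": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "cml_model.pkl"  # assumed file name
    save_model(trained, model_path)
    restored = load_model(model_path)
```

Joblib follows the same dump/load pattern through `joblib.dump` and `joblib.load`. In either case, pickled files should only be loaded from trusted sources, since unpickling can execute arbitrary code.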

Block diagram

Dataset collection

The dataset for this study was collected from the UniProt database, which is a comprehensive resource for protein sequence and functional information. A keyword search was conducted on UniProt using terms such as “Chronic Myeloid Leukemia,” “BCL2,” “HSP90,” “PARP,” and “RB.” This search yielded a total of 2248 protein sequences. The most frequently mutated genes, i.e., BCL2, HSP90, PARP, and RB, were utilized for CML [14]. Moreover, homologous samples were eliminated using a cutoff level of 0.6 [16]. HSP90 functions as a chaperone protein that is crucial in protein folding and degradation processes. Its up-regulation has been identified in various cancer types, including chronic myeloid leukemia (CML). Extensive research has demonstrated that inhibiting HSP90 can attenuate the growth of CML cells and enhance their susceptibility to chemotherapy and tyrosine kinase inhibitors (TKIs) [42,43]. PARP (Poly ADP-ribose polymerase) is an essential enzyme involved in DNA repair processes. Inhibiting PARP has demonstrated effectiveness in the treatment of cancers with BRCA mutations, and there is emerging evidence suggesting its potential applicability in managing CML [44,45].

The BCL2 (B-cell lymphoma 2) protein family plays a crucial role in regulating programmed cell death, known as apoptosis. Elevated levels of BCL2 have been linked to resistance to chemotherapy in chronic myeloid leukemia (CML) cells. Studies have demonstrated that inhibiting BCL2 can reinstate apoptosis in CML cells and boost the effectiveness of tyrosine kinase inhibitors (TKIs) [46,47]. RB (Retinoblastoma) is a pivotal tumor suppressor gene involved in regulating cell cycle progression. The deactivation of RB is a prevalent characteristic in CML, and research has established that its reactivation can impede the proliferation of CML cells [48,49]. The CML-related protein sequences were extracted from the Universal Protein Knowledgebase (UniProtKB) in the FASTA file format [15,22], which produced the positive dataset. To create the negative dataset, the same number of samples was gathered using the opposite query phrase. Consequently, the dataset created for CML is balanced.

Fasta format.

In bioinformatics, the FASTA format is a popular text-based format for representing proteins. It is derived from the FASTA software suite and follows a specific structure. A FASTA record starts with a single description line, followed by lines containing the sequence data [22]. The description line is distinguished from the sequence data by the presence of a greater-than symbol (“>”) in the first column. The term following the “>” sign identifies the sequence, while the rest of the line can be used to provide an additional description; both are optional.
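The structure described above can be read with a few lines of code. The following is a minimal sketch of a FASTA reader, not the parser used in this study; the example records and the names `parse_fasta` and `_make_record` are hypothetical.

```python
def _make_record(header, chunks):
    """Split a header into identifier and description, and join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Made-up records for illustration only; not real UniProt entries.
example = """>seq1 hypothetical example protein
MPEETQTQDQ
PMEEEEVETF
>seq2
ACDEFGHIKL
"""
records = parse_fasta(example)
```

Sequence lines belonging to the same record are concatenated, so a protein wrapped over several lines is returned as a single string.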

Sample of protein sequence (HSP90).

Initially, the protein sequences contained redundant data. To address this, we employed a benchmark method known as CD-HIT (see Fig 3). Using a benchmark algorithm for redundancy removal is essential to ensure the validity and reliability of the data. CD-HIT, a sequence clustering tool available online, was selected for this purpose, with a threshold of 0.6 [23]. This threshold removes redundant sequences while preserving the integrity of the dataset.

Feature extraction

This section elaborates on the feature extraction techniques, which use the physicochemical properties of the protein sequences. These techniques enable the effective representation of protein sequences and the extraction of meaningful information crucial for predicting Chronic Myeloid Leukemia. The feature extraction methods utilized in this study fall into three categories:

Amino acid composition.

AAC features capture how frequently each amino acid occurs in a protein sequence [24,25]. The percentage frequency of amino acid i in protein j, $F_{AAC_{i,j}}$, is calculated using the formula below:

$$F_{AAC_{i,j}} = \frac{n_{i,j}}{n_{a,j}} \times 100 \tag{1}$$

In the above equation, $n_{i,j}$ denotes the number of amino acids of type i found in protein j, while $n_{a,j}$ refers to the total number of amino acids contained in protein j. Each protein sequence in the AAC feature dataset is represented as a 20-dimensional (20-D) feature vector as follows:

$$F_{AAC_{j}} = \left[ F_{AAC_{1,j}},\ F_{AAC_{2,j}},\ \ldots,\ F_{AAC_{20,j}} \right] \tag{2}$$

where $F_{AAC_{i,j}}$ describes the composition of amino acid i in protein j.

The technique of amino acid composition involves extracting features from our data, resulting in a 20-dimensional feature set. However, the problem with this approach lies in the limited usefulness of the features extracted. Despite employing various data science feature engineering approaches and conducting hyper-parameter tuning, accuracy remains constrained. Consequently, this approach proves less efficacious in attaining the desired outcomes.
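As an illustration of the percentage-frequency computation behind the 20-dimensional AAC vector, the following minimal sketch counts each of the 20 standard amino acids and divides by the sequence length. The function name and the decision to ignore non-standard residue letters are assumptions made for this example, not details taken from the study.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-D vector of percentage frequencies for one protein."""
    sequence = sequence.upper()
    # Assumption: letters outside the 20 standard residues are ignored.
    counts = Counter(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence: two A, one C, one D.
features = amino_acid_composition("AACD")
```

For the toy sequence, the entries for A, C, and D are 50, 25, and 25 percent, and all other entries are zero.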

Pseudo amino acid composition.

A 25-dimensional feature set is produced using the Pseudo Amino Acid Composition (Pse-AAC) approach to extract features from our data [13]. The features extracted through this method proved highly valuable: after further applying data science methods and feature engineering techniques, accuracy improved significantly, reaching 91% to 93%.

(3)(4)(5)
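For orientation, the widely used formulation of pseudo amino acid composition due to Chou [24] is sketched below. This is the general textbook form and is offered as an assumption; it is not necessarily identical to the equations used in this study. Under this formulation, a 25-dimensional vector would correspond to $\lambda = 5$ sequence-order terms appended to the 20 composition terms.

```latex
% General PseAAC form (Chou, 2001); symbols are those of the standard
% formulation, not taken from this paper.
P = \left[ p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda} \right]^{T}

p_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & 1 \le u \le 20 \\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & 21 \le u \le 20+\lambda
\end{cases}

\theta_k = \frac{1}{N-k} \sum_{i=1}^{N-k} \Theta\left(R_i, R_{i+k}\right), \qquad k < N
```

Here $f_i$ is the normalized occurrence frequency of amino acid i, $w$ is a weight factor, $N$ is the sequence length, $R_i$ is the residue at position i, and $\Theta$ is a correlation function computed from physicochemical properties of the two residues.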

Specifically, we depict the changes in data distribution before and after outlier removal. Additionally, we conducted data augmentation on the processed dataset to further enhance its accuracy.

Di-peptide composition.

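AA, AC, AD, …, YV, YW, and YY denote the dipeptides, i.e., the ordered pairs of adjacent amino acids, that can occur in a protein sequence. With 20 amino acids there are 400 such components. The DC feature of each component is determined as follows:

$$DC(i) = \frac{n_i}{N - 1}, \quad i = 1, 2, \ldots, 400 \tag{6}$$

where $DC(i)$ represents the composition of dipeptide i, $n_i$ is the number of occurrences of dipeptide i, and $N$ is the sequence length, so that $N - 1$ is the total number of dipeptides in the sequence. In vector form, this feature space is represented as:

$$F_{DC} = \left[ DC(1),\ DC(2),\ \ldots,\ DC(400) \right]$$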

The di-peptide composition technique extracts 400 features from our data. However, it became evident that not all of these features were essential. After applying data science methods and feature engineering, we concluded that only 229 of the initial 400 features were necessary. After this selection process, accuracy improved significantly, reaching 91% to 93%. The graphs illustrate the dataset before and after outlier removal.
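The 400-dimensional dipeptide vector can be computed by sliding a window of two residues along the sequence. The sketch below is a minimal illustration under two assumptions that the text does not confirm: frequencies are expressed as fractions rather than percentages, and pairs containing a non-standard residue letter are skipped. It does not cover the subsequent reduction to 229 features.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, from "AA" to "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-D vector of dipeptide frequencies for one protein."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # Adjacent residues form overlapping dipeptides: positions (0,1), (1,2), ...
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in counts:  # assumption: skip pairs with non-standard letters
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


# Toy sequence "AAAC" contains the dipeptides AA, AA, AC.
features = dipeptide_composition("AAAC")
```

For the toy sequence, the entry for AA is 2/3 and the entry for AC is 1/3, with the remaining 398 entries equal to zero.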

Data augmentation.

The data augmentation process begins by segregating our dataset into positive and negative segments, i.e., isolating patients who have tested positive from those with negative results. Subsequently, a series of operations generates numerical replicas of the existing data, thereby augmenting the sample size. The larger amount of available data benefits the training of the machine learning algorithms. During the creation of these numerical duplicates, the data is converted from its initial format into a list structure.

The modified data is then converted from this list format back into a data frame and merged, completing the data augmentation process.
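The text does not specify how the numerical replicas are generated. The sketch below is therefore only one plausible reading, with the replication rule stated as an assumption: each class is augmented separately by appending copies of every feature row perturbed with small Gaussian noise, and the two classes are then merged back into one labeled dataset. The function name, noise scale, and toy values are all hypothetical.

```python
import random


def augment(rows, copies, noise_scale, rng):
    """Return the original rows plus `copies` jittered replicas of each row."""
    augmented = [list(row) for row in rows]
    for row in rows:
        for _ in range(copies):
            # Assumption: replicas are the original values plus Gaussian noise.
            augmented.append([value + rng.gauss(0.0, noise_scale) for value in row])
    return augmented


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy feature rows, already split into positive and negative segments.
positive = [[0.10, 0.20], [0.30, 0.40]]
negative = [[0.90, 0.80]]

aug_pos = augment(positive, copies=2, noise_scale=0.01, rng=rng)
aug_neg = augment(negative, copies=2, noise_scale=0.01, rng=rng)

# Merge the two segments back into one labeled dataset (1 = positive, 0 = negative).
dataset = [(row, 1) for row in aug_pos] + [(row, 0) for row in aug_neg]
```

If augmentation is applied, it is generally safer to augment only the training split, so that replicas of a sample do not also appear in the test split.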

Development of individual classifiers

Support vector machine

An SVM classifies samples by constructing a hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points of each class [27,28,56]. SVM’s decision surface is as follows:

(7)

We selected the following parameters: kernel = “rbf”, degree = 8, C = 10000, gamma = 100000, and probability = True.
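To give a feel for what the reported gamma value implies, the sketch below evaluates the standard RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), for two nearby toy points. The points are made up for illustration. If the parameters above refer to scikit-learn's `SVC` (an assumption; the text does not name the library), note that `degree` is used only by the polynomial kernel and is ignored when the kernel is "rbf".

```python
import math


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two toy feature vectors that differ by 0.01 in one coordinate.
x = [0.20, 0.50]
z = [0.21, 0.50]

k_small = rbf_kernel(x, z, gamma=1.0)        # similarity stays close to 1
k_large = rbf_kernel(x, z, gamma=100000.0)   # similarity collapses towards 0
```

With a very large gamma, the kernel treats even close neighbors as nearly unrelated, so each training point influences only its immediate vicinity. Such settings are worth checking against held-out data for overfitting.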

Random forest

Random Forest generates a large number of decision trees whose outputs are combined to arrive at a final decision. We selected 129,361 samples for training and 86,228 samples for testing, and the best number of estimators was n = 50. In the case of dipeptide composition, we selected 2536 samples for training and 845 for testing, and n = 150 estimators gave optimal results.

(8)

K-Nearest Neighbor (KNN)

The KNN algorithm learns from observed samples [29,30]. Instance-based classifiers assume that an unknown instance can be classified by comparing it to known instances using a distance/similarity function [31–33,56]. The Euclidean distance $d(x, y)$ between two m-dimensional vectors $x$ and $y$ is calculated as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{9}$$
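The distance-based classification described above can be sketched in a few lines. This is a minimal illustration with made-up two-dimensional points and an assumed k = 3; it is not the tuned KNN model from this study.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    neighbours = sorted(
        zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy training data: class 0 clusters near the origin, class 1 near (1, 1).
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]
```

An odd value of k avoids tied votes in the binary case.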

Naïve Bayes

Naïve Bayes applies Bayes' rule under the assumption that the attributes/features are independent of one another [57–59]. With equal prior probabilities, the Gaussian function used to train the model is as follows:

(10)(11)

XGBoost

Gradient boosting is a boosting approach that significantly lowers errors by adding several classifiers to pre-existing models. The term “gradient boosting" refers to using a gradient descent strategy to minimize loss. The steps involved in gradient boosting are as follows:

(12)(13)

Logistic regression

Logistic regression is a statistical machine-learning approach employed for binary classification [34]. The parameters we selected were C = 10, tol = 0.1, and penalty = L2.

(14)

Results and discussion

Results on pseudo amino acid composition (Pse-AAC) data

The results for the metrics employed in the project, namely Accuracy score, F1-score, Recall [35], and Specificity, on the Pse-AAC data are displayed in Table 1 below.

Table 1. Results on pseudo amino acid composition (Pse-AAC) data.

https://doi.org/10.1371/journal.pone.0321761.t001

Table 2 presents the results of each machine learning (ML) model concerning the data utilized, specifically the Pse-AAC data. It also includes the outcomes of additional metrics used in the research, namely Specificity and Confusion Matrix. These metrics provide insights into the True Positive, True Negative, False Positive, and False Negative values, contributing to a comprehensive evaluation of the models’ performance.
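The relationship between the confusion matrix entries and the reported metrics can be made explicit with a short sketch. The counts used below are hypothetical and are not results from this study; the formulas are the standard definitions of accuracy, sensitivity (recall), specificity, and F1-score.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive standard metrics from the four confusion matrix counts.

    Assumes every denominator is non-zero, i.e. both classes are present
    and at least one positive prediction was made.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # also called recall or true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
```

Reporting specificity alongside sensitivity matters in a screening setting, because accuracy alone does not show whether errors fall mainly on missed cases or on false alarms.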

Accuracy results on amino acid composition (AAC) data

The research employs Accuracy score, F1-score, Recall score, and Specificity as metrics on the AAC data. The outcomes of these metrics are presented in Table 3 below.

Table 4 presents the results of each machine learning (ML) model on the AAC data. It also reports the other metrics employed in the project, namely Specificity and the Confusion Matrix. The confusion matrix provides the True Positive, True Negative, False Positive, and False Negative values, contributing to a comprehensive assessment of the models’ performance.

Accuracy results on di-peptide composition (DPC)

Table 5 displays the Accuracy score, F1-score, and Recall score metrics utilized in the research and their respective outcomes when applied to the DPC data.

The performance of each machine learning model is analyzed concerning the DPC data utilized. Additionally, the Specificity and Confusion Matrix results are presented (Table 6). This matrix provides essential values such as True Positive, True Negative, False Positive, and False Negative, contributing to a comprehensive evaluation of the models’ performance.

Machine learning based dashboard

Figs 4 and 5 provide an overview of the dashboard developed using Streamlit, which is accessible through Streamlit Cloud. This interactive dashboard enables users to select their preferred model (Fig 4) for analysis. Within this user-friendly interface, users upload patient records directly through the web application and select a specific prediction model. Subsequently, users can review the results (Fig 5) to ascertain whether an individual is affected by leukemia. Users can effortlessly select

Conclusion

This research is focused on Chronic Myeloid Leukemia (CML), a condition characterized by genetic mutations leading to abnormal proliferation of white blood cells, red blood cells, and platelets. While MRI and CT scans have been extensively used in cancer detection, research on protein sequence data in this domain is limited. By leveraging information from mutated genes like BCL2, HSP90, PARP, and RB, the research aims to revolutionize early CML prediction. Through rigorous data preprocessing and feature extraction techniques, we achieved an impressive accuracy rate of 92–94%. The proposed approach integrates diverse machine learning algorithms such as SVM, Decision Trees, XGBoost, Random Forest, and KNN, each offering unique strengths in pattern recognition and prediction. The resulting dashboard facilitates easy prediction of CML in patients, enhancing clinical workflows and potentially saving lives. This study sheds light on critical scientific challenges in CML research, offering insights into disease mechanisms and biomarker identification. We envision expanding this research to encompass multi-cancer detection, integrating AI and bioinformatics with healthcare systems for enhanced cancer diagnosis and improved patient outcomes.

Acknowledgments

The authors extend their appreciation to the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R513), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia and would like to express their gratitude to anonymous referees for their insightful comments and recommendations, which have significantly enhanced this paper. Furthermore, the authors would like to express their gratitude to Datamatics Technologies for their invaluable contributions.

References

  1. 1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2021. CA Cancer J Clin. 2021;71(1):7–33.
  2. 2. Bibi N, Sikandar M, Ud Din I, Almogren A, Ali S. IoMT-based automated detection and classification of leukemia using deep learning. J Healthc Eng. 2020;2020:6648574. pmid:33343851
  3. 3. IARC IAfRoC. Leukaemia Source: Globocan 2020. 2022. Available from: https://gco.iarc.fr/today/data/factsheets/cancers/36-Leukaemia-fact-sheet.pdf]
  4. 4. Munteanu CR, Magalhães AL, Uriarte E, González-Díaz H. Multi-target QPDR classification model for human breast and colon cancer-related proteins using star graph topological indices. J Theor Biol. 2009;257(2):303–11. pmid:19111559
  5. 5. Ramani RG, Jacob SG. Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models. PLoS One. 2013;8(3):e58772. pmid:23505559
  6. 6. Yang J-Y, Yoshihara K, Tanaka K, Hatae M, Masuzaki H, Itamochi H, et al. Predicting time to ovarian carcinoma recurrence using protein markers. J Clin Invest. 2013;123(9):3740–50. pmid:23945238
  7. 7. Mohamed H, Omar R, Saeed N, Essam A, Ayman N, Mohiy T, et al. Automated detection of white blood cells cancer diseases. In: 2018 First International Workshop on Deep and Representation Learning (IWDRL). IEEE. 2018. p. 48–54. https://doi.org/10.1109/iwdrl.2018.8358214
  8. 8. Kumar S, Mishra S, Asthana P. Automated detection of acute leukemia using k-mean clustering algorithm. In: Advances in Computer and Computational Sciences: Proceedings of ICCCCS 2016, vol. 2; 2018. p. 655–70.
  9. 9. Sharma R, Kumar R. A novel approach for the classification of leukemia using artificial bee colony optimization technique and back-propagation neural networks. In: Proceedings of 2nd International Conference on Communication, Computing and Networking. NITTTR Chandigarh. 2019. p. 685–94.
  10. 10. Jothi G, Inbarani HH, Azar AT, Devi KR. Rough set theory with Jaya optimization for acute lymphoblastic leukemia classification. Neural Comput Appl. 2018;31(9):5175–94.
  11. 11. Moshavash Z, Danyali H, Helfroush MS. An automatic and robust decision support system for accurate acute leukemia diagnosis from blood microscopic images. J Digit Imaging. 2018;31(5):702–17. pmid:29654425
  12. 12. Umamaheswari D, Geetha S. A framework for efficient recognition and classification of acute lymphoblastic leukemia with a novel customized-KNN classifier. CIT. 2018;:131–40.
  13. 13. American Society of Clinical Oncology A. Genes and cancer. 2023.
  14. Rodríguez D, Bretones G, Quesada V, Villamor N, Arango JR, López-Guillermo A, et al. Mutations in CHD2 cause defective association with active chromatin in chronic lymphocytic leukemia. Blood. 2015;126(2):195–202. pmid:26031915
  15. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2004;32(Database issue):D115-9. pmid:14681372
  16. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. pmid:23060610
  17. Feng P, Lin H, Chen W. Identification of antioxidants from sequence information using naive Bayes. Comput Math Methods Med. 2013;2013:1–9.
  18. Feng P-M, Ding H, Chen W, Lin H. Naïve Bayes classifier with feature selection to identify phage virion proteins. Comput Math Methods Med. 2013;2013:530696. pmid:23762187
  19. Jia J, Liu Z, Xiao X, Liu B, Chou K-C. pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J Theor Biol. 2016;394:223–30. pmid:26807806
  20. Lin W-Z, Fang J-A, Xiao X, Chou K-C. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One. 2011;6(9):e24756. pmid:21935457
  21. Qu K, Han K, Wu S, Wang G, Wei L. Identification of DNA-binding proteins using mixed feature representation methods. Molecules. 2017;22(10):1602. pmid:28937647
  22. Cai Y-D, Chou K-C. Predicting subcellular localization of proteins in a hybridization space. Bioinformatics. 2004;20(7):1151–6. pmid:14764553
  23. Chou K-C. Impacts of bioinformatics to medicinal chemistry. Med Chem. 2015;11(3):218–34. pmid:25548930
  24. Chou K-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43(3):246–55. pmid:11288174
  25. Khan YD, Ahmad F, Anwar MW. A neuro-cognitive approach for iris recognition using back propagation. World Appl Sci J. 2012;16(5):678–85.
  26. Khan YD, Ahmed F, Khan SA. Situation recognition using image moments and recurrent neural networks. Neural Comput Appl. 2013;24(7–8):1519–29.
  27. Butt A, Khan S, Jamil H, Rasool N, Khan Y. A prediction model for membrane proteins using moments based features. Biomed Res Int. 2016;2016:1–14.
  28. Butt AH, Rasool N, Khan YD. A treatise to computational approaches towards prediction of membrane protein and its subtypes. J Membr Biol. 2017;250(1):55–76. pmid:27866233
  29. Khan YD, Khan SA, Ahmad F, Islam S. Iris recognition using image moments and k-means algorithm. ScientificWorldJournal. 2014;2014:723595. pmid:24977221
  30. Sugiyama M. Introduction to statistical machine learning. Morgan Kaufmann. 2015.
  31. Theodoridis S. Machine learning: a Bayesian and optimization perspective. Academic Press. 2015.
  32. Vapnik V. The nature of statistical learning theory. Springer. 1999.
  33. Hart P, Stork D, Duda R. Pattern classification. Hoboken: Wiley. 2000.
  34. Montesinos López O, Montesinos López A, Crossa J. Multivariate statistical machine learning methods for genomic prediction. Springer Nature. 2022.
  35. Jiao Y, Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol. 2016;4(4):320–30.
  36. Fawcett T. ROC graphs: notes and practical considerations for researchers. Mach Learn. 2004;31(1):1–38.
  37. Gal O, Auslander N, Fan Y, Meerzaman D. Predicting complete remission of acute myeloid leukemia: machine learning applied to gene expression. Cancer Inform. 2019;18:1176935119835544. pmid:30911218
  38. Bostanci E, Kocak E, Unal M, Guzel MS, Acici K, Asuroglu T. Machine learning analysis of RNA-seq data for diagnostic and prognostic prediction of colon cancer. Sensors (Basel). 2023;23(6):3080. pmid:36991790
  39. Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, Goliaei B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. Springerplus. 2013;2(1):238. pmid:23888262
  40. Dhakal P, Tayara H, Chong KT. An ensemble of stacking classifiers for improved prediction of miRNA-mRNA interactions. Comput Biol Med. 2023;164:107242. pmid:37473564
  41. Armya REA, Abdulazeez AM, Sallow AB, Zeebaree DQ. Leukemia diagnosis using machine learning classifiers based on correlation attribute eval feature selection. AJRCoS. 2021:52–65.
  42. Khajapeer KV, Baskaran R. Hsp90 inhibitors for the treatment of chronic myeloid leukemia. Leuk Res Treatment. 2015;2015:757694. pmid:26770832
  43. Alves R, Santos D, Jorge J, Gonçalves AC, Catarino S, Girão H. Alvespimycin inhibits heat shock protein 90 and overcomes imatinib resistance in chronic myeloid leukemia cell lines. Molecules. 2023;28(3):1210.
  44. Ellisen LW. PARP inhibitors in cancer therapy: promise, progress, and puzzles. Cancer Cell. 2011;19(2):165–7. pmid:21316599
  45. Liu Y, Song H, Song H, Feng X, Zhou C, Huo Z. Targeting autophagy potentiates the anti-tumor effect of PARP inhibitor in pediatric chronic myeloid leukemia. AMB Express. 2019;9(1):108. pmid:31309361
  46. Kaloni D, Diepstraten ST, Strasser A, Kelly GL. BCL-2 protein family: attractive targets for cancer therapy. Apoptosis. 2023;28(1–2):20–38. pmid:36342579
  47. Ko TK, Chuah CTH, Huang JWJ, Ng K-P, Ong ST. The BCL2 inhibitor ABT-199 significantly enhances imatinib-induced cell death in chronic myeloid leukemia progenitors. Oncotarget. 2014;5(19):9033–8. pmid:25333252
  48. Zhou L, Ng DS-C, Yam JC, Chen LJ, Tham CC, Pang CP, et al. Post-translational modifications on the retinoblastoma protein. J Biomed Sci. 2022;29(1):33. pmid:35650644
  49. Yin D-D, Fan F-Y, Hu X-B, Hou L-H, Zhang X-P, Liu L, et al. Notch signaling inhibits the growth of the human chronic myeloid leukemia cell line K562. Leuk Res. 2009;33(1):109–14. pmid:18687467
  50. Albitar M, Zhang H, Pecora AL, Ip A, Goy AH, Antzoulatos S, et al. Bone marrow-based biomarkers for predicting aGVHD using targeted RNA next generation sequencing and machine learning. Blood. 2021;138(Supplement 1):2892–2892.
  51. Ahmad W, Hameed M, Bilal M, Majid A. ML-pred-cll: Machine learning based prediction of chronic lymphocytic leukemia using protein sequential data. In: 2022 International Conference on Recent Advances in Electrical Engineering & Computer Sciences (RAEE & CS). 2022. p. 1–7.
  52. He J, Pu X, Li M, Li C, Guo Y. Deep convolutional neural networks for predicting leukemia-related transcription factor binding sites from DNA sequence data. Chemomet Intel Lab Syst. 2020;199:103976.
  53. Ashraf A, Zhao Q, Bangyal W, Iqbal M. Analysis of brain imaging data for the detection of early age autism spectrum disorder using transfer learning approaches for internet of things. IEEE Trans Consum Electron. 2023. p. 1–10.
  54. Bangyal WH, Ahmad J, Abbas Q. Recognition of off-line isolated handwritten character using counter propagation network. Int J Eng Technol. 2013;5(2):227–34.
  55. Ali AM, Mohammed MA. A comprehensive review of artificial intelligence approaches in omics data processing: evaluating progress and challenges. Int J Math Stat Comput Sci. 2024;2:114–67.
  56. Zahoor MM, Qureshi SA, Bibi S, Khan SH, Khan A, Ghafoor U, et al. A new deep hybrid boosted and ensemble learning-based brain tumor analysis using MRI. Sensors (Basel). 2022;22(7):2726. pmid:35408340
  57. Amin MA, Chughtai JR, Ahmad W, Bangyal WH, Ul Haq I. Trajectory data mining and trip travel time prediction on specific roads. In: 2024 International Conference on Engineering & Computing Technologies (ICECT); 2024. p. 1–8.
  58. Bangyal WH, Qasim R, Rehman NU, Ahmad Z, Dar H, Rukhsar L. Detection of fake news text classification on COVID-19 using deep learning approaches. Comput Math Methods Med. 2021;2021:5514220.
  59. Ali A, Nafees M, Amin M, Rehman I, Tayyab M, Ahmad W. Systematic literature review on swarms of UAVs. Spectrum Eng Sci. 2024;2(4):386–415.