Abstract
Cancer is a serious public health concern worldwide and a leading cause of death. Blood cancer is one of the most dangerous types of cancer. Leukemia is a type of cancer that affects the blood cells and bone marrow. Acute leukemia is fatal if left untreated. A timely, reliable, and accurate diagnosis of leukemia at an early stage is critical to treating patients and preserving their lives. There are four types of leukemia, namely acute lymphocytic leukemia, acute myelogenous leukemia, chronic lymphocytic leukemia, and chronic myelogenous leukemia. Recognizing these cancerous cells is often done via manual analysis of microscopic images, which requires an extraordinarily skilled pathologist. Leukemia symptoms might include lethargy, a lack of energy, a pale complexion, recurrent infections, and easy bleeding or bruising. One of the challenges in this area is identifying subtypes of leukemia for specialized treatment. This study is carried out to increase the precision of diagnosis, to assist in the development of personalized treatment plans, and to improve general leukemia-related healthcare practices. In this research, we used leukemia gene expression data from the Curated Microarray Database (CuMiDa). Microarrays are ideal for studying cancer; however, categorizing the expression patterns in microarray data can be challenging. The proposed study uses feature selection methods and machine learning techniques to predict and classify subtypes of leukemia in the CuMiDa gene expression dataset GSE9476. This work uses linear programming (LP) as the machine-learning technique for classification. The LP model classifies and predicts the leukemia subtypes Bone_Marrow_CD34, Bone_Marrow, AML, PB, and PBSC_CD34. Before applying the LP model, we selected 25 features from the 22283 features of the dataset. These 25 features were the most distinguishing for classification.
The classification accuracy of this work is 98.44%.
Citation: Ilyas M, Aamir KM, Manzoor S, Deriche M (2023) Linear programming based computational technique for leukemia classification using gene expression profile. PLoS ONE 18(10): e0292172. https://doi.org/10.1371/journal.pone.0292172
Editor: Muhammad Attique Khan, HITEC University, PAKISTAN
Received: May 27, 2023; Accepted: September 14, 2023; Published: October 9, 2023
Copyright: © 2023 Ilyas et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data for this study are available from the Kaggle repository (https://www.kaggle.com/datasets/brunogrisci/leukemia-gene-expression-cumida).
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Blood is one of the most important components of the human body. About 55% of it is a liquid termed plasma, which flows freely through the blood vessels. Plasma primarily transports nutrients, proteins, and hormones around the body and removes waste from it. Red blood cells (RBCs), white blood cells (WBCs, or leukocytes), and platelets (thrombocytes) are the three cellular components, which can be distinguished by their colour, shape, size, texture, and content [1]. When certain blood cells die, the blood transports less oxygen, which causes fatigue and weakness. Because blood cells are abundant, an excess or deficiency of any type of blood cell causes a variety of health issues, including leukemia, sickle cell disease, thalassemia, and anaemia. Leukemia develops due to an excess of WBCs, which crowd out the healthy RBCs and platelets in the blood. Leukemia is one of the most frequent diseases that can result in mortality. To reduce the severity of this disease, diagnosing the forms of immature cells at an early stage is vital, as it lowers the patients’ mortality rate. Many researchers have proposed approaches and algorithms for detecting, segmenting, and classifying leukemia. A precise early-stage leukemia diagnosis is essential for treating patients and for their survival. The four types of leukemia [2] are acute lymphocytic leukemia (ALL), acute myelogenous leukemia (AML), chronic lymphocytic leukemia (CLL), and chronic myelogenous leukemia (CML). Recognising these cancerous cells is often done manually via microscopic image analysis and requires an extraordinarily skilled pathologist. A professional pathologist collects a blood sample to detect these cells; the sample is then stained on a blood slide. Staining allows cells to be examined under a high-quality microscope to detect morphological characteristics of the components of blood, such as WBCs, RBCs, platelets, parasites, blasts, and any other anomalies.
This identification and diagnosis are crucial and require a qualified and experienced pathologist. As a result, an automated method for this diagnosis is needed. Microarray technology offers a way to concurrently track the RNA expression levels of hundreds of genes in primary tumours and cell lines. Making a diagnosis is difficult for medical practitioners since they must consider a wide range of clinical data. Because of the difficulties in diagnosing leukemia, medical professionals seek to remove uncertainty by acquiring precise information to treat the patients’ various conditions. Diagnosis, a clinical decision-making process, provides valuable information for improving healthcare quality [3].
Using microarray technology, researchers can now simultaneously study thousands of genes that make up a significant part of the genome. The development of this promising technology has generated interest in its prospective uses in clinical diagnostics and pharmacological research. A vital step in these kinds of applications is finding gene subsets, sometimes referred to as biomarkers, that distinguish between samples with different labels, such as different tumor types, cancer vs. non-cancer, and therapeutic response. The first objective is a supervised learning task: given a gene expression pattern, accurately determine its label, such as whether it comes from tumor or normal tissue. Over the past few years, this strategy has been successfully tested with several algorithms [4] and has several applications in clinical diagnostics. Gene selection, the other objective for linear programming, is a special case of the broader problem of feature selection (FS), which is a dimensionality reduction technique [5]. FS chooses a subset of the input dimensions, whereas feature extraction techniques such as Principal Component Analysis (PCA) reduce dimensionality by combining many input dimensions. This is important for gene selection because FS preserves the physical meaning of the genes, facilitating interpretation [6]; the features are the genes’ expression values. On gene expression data, feature extraction and feature selection procedures are used for dimensionality reduction, for data visualization, as a preparatory step for other algorithms, or to identify a subset of more significant genes. The study of gene expression data is motivated by the challenge of classifying cancer classes or finding multiple subclasses of malignancies. However, a variety of factors can affect the findings of the analysis.
The curse of dimensionality, which is brought on by small sample sizes in comparison to the high number of features, is one of the biggest problems. The generalization performance of a classifier suffers from having too many features, some of which may be irrelevant to the analysis. Selecting a subset of genes is therefore crucial to enhancing the precision and speed of prediction systems [7]. Choosing an appropriate feature set is critical for developing effective and efficient models, improving comprehensibility, minimising overfitting, and reducing complexity.
This research is based on linear programming classification of the leukemia subtypes that accompany cancer diagnosis. Linear programming (LP) approaches may be able to identify expression patterns quickly and precisely. LP has typically been applied to microarray data with two different but complementary goals: sample classification and gene selection. This study uses feature selection and linear programming techniques to classify types of leukemia based on leukemia gene expression data. The dataset contains the expression levels of 22283 genes (columns) for 64 samples (rows). Because this number of features is too large, a feature selection strategy was used to improve the effectiveness of the prediction technique. As a result, reducing the amount of data contributed to improving classification performance.
Our classification study is useful to classify the different subtypes of leukemia represented in chosen dataset GSE9476 on leukemia gene expression from CuMiDa through feature selection methods and machine learning techniques.
The remainder of the paper is organised as follows: The literature review is discussed in section II, the proposed methodology with its steps is discussed in section III, the results are discussed in section IV, and the conclusion and future work is discussed in section V.
Literature review
Many researchers have identified and predicted different cancer subtypes using different types of methods. This section discussed the most notable research that made use of gene datasets via machine learning including Linear programming-based leukemia subtypes research. Y. Tang et al. [4] developed an FCM-SVM-RFE Recursive Feature Elimination (RFE) algorithm for predicting AML/ALL gene expression data, which achieved an average accuracy of 92.94%. The Fuzzy C-Means clustering approach was used to group related genes into clusters, and then a Support Vector Machine (SVM) was modeled in each cluster-induced space. This method was more accurate for predicting unknown samples of cancer [2]. Yoo et al. [8] suggested a gene selection and multivariate fuzzy statistical analysis technique for evaluating microarray data from leukemia patients. It was used to analyze the gene expression pattern and investigate the leukemia subtypes whose expression patterns were found to be linked to the cases of acute leukemia gene expression. They used PCA to evaluate ALL and AML patterns. It also eliminates the drawbacks of threshold-based gene selection, such as the impossibility of an unknown subclass selection. Taskesen et al. [9] worked on bringing gene expression profiles (GEP) and DNA methylation profiles (DMP) together. Gene expression profiles, as well as the gene patterns obtained from GEP, can be utilized to predict AML subtypes. Similarly, DNA-methylation profiles were used to make successful predictions. Both have different patterns that aid in the classification of AML subtypes. They employed a logistic regression model with Lasso regularization to predict AML subtypes. He et al. [10] worked on classification methods for leukemia cancer. To efficiently extract high-level data abstraction and transform this quantitative data into fuzzy discrete transactions, authors combined data clustering approaches with fuzzy interval partitioning on given features. 
These transactions were supplied to the A priori algorithm to mine association rules that supported better classification and decision. Experiments reveal that the FARM-DS mining technique for Fuzzy Association Rules (FARs) has good interpretability since it extracts considerably shorter rules and has great prediction accuracy. Klein et al. [11] presented a novel approach for systematic and rigorous comparison of published gene expression identifiers to a demonstrative given dataset. Identifying related analyses and gene mutations, enhanced the analysis of new microarray data. This technique enables researchers to integrate learnings from multiple microarray experiments into the structured analysis of a new dataset. Stiglic et al. [12] introduced a new method for interpreting tiny ensembles of classifiers using gene expression data called Visual Interpretation of Small Ensembles (VISE). It was proven that interactive interpretation tools, which were created for traditional machine learning challenges, also provide a wide variety of opportunities for researchers in the bioinformatics discipline. They also serve as an interactive tool for experts in the classification process. Feltes et al. presented Curated Microarray Database (CuMiDa) which is a resource that contains 78 cancer Microarray datasets that have been rigorously cross-checked from 30,000 Gene Expression Omnibus (GEO) articles. CuMiDa is a database of datasets dedicated to the testing and benchmarking of machine learning algorithms in cancer research. Feltes et al. observed sample division for this, all data sets were tested using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (tSNE) analyses, as well as various machine learning (ML) approaches including SVM and RF, to provide a base accuracy of 88–85% for the major techniques used for microarray data sets [13]. Bilen et al. 
have developed a new method for rapidly classifying leukemia cancer microarrays and decreasing data size by focusing on the most important genes. Bilen et al. employed two methods, the ensemble, and the hybrid method. Firstly, a gene filtering algorithm is created using the Wilcoxon rank-sum, Fisher correlation score, and information gain approach to create an ensemble gene selection algorithm. Secondly using an upgraded genetic algorithm, the most successful genes among these genes are exposed in the feature selection phase. Cross-validation findings after the classification process were 100% (LOOCV), 98.57% (5-fold), and 97.14% (10-fold) [14]. To categorize microarray data with a small sample size and a large number of features, Xu et al. used two Modified Linear discriminant analysis techniques. Xu et al. mentioned the reason behind the sub-optimal performance of classical LDA on microarray data in terms of uncertainty and uniqueness of the within-group covariance matrix. The MLDA and NLDA have been used in their study and analyzed that modified LDA techniques work better in classifying data that has a large number of features and small samples when compared with the k-nearest neighbor, diagonal linear discriminant analysis, and classical LDA [15]. When working with high-dimensional data that has a little quantity of labeled data and a significant number of unlabeled data, it is never easy to get better classification results. A semi-supervised sparse Fisher’s LDA was proposed by Lu and Qiao. LDA is rebuilt and sparsity is attained using a direct estimation technique. To deal with the no convex loss function related to the unlabeled data, they additionally employ the difference-convex approach. Overall, the suggested strategy improves the LDA method’s capability [16]. 
Feltes et al., manually curated the Gene Expression Omnibus GEO using extensive filtering parameters to select the major homogeneous and high-quality RNA-seq using microarray datasets having several cancer types. TCGA data was used to study frequently unregulated genetic mechanisms behind the tumoral process using machine learning techniques and biological processes. His findings showed that tumor is more closely linked to the overexpression of essential unregulated machinery than to the under-expression of a specific gene [13]. Zhou et al. has been used Neural networks, Bayesian statistics, and a self-organizing map in research. In Microarray datasets, neural networks are best for feature learning and computation. When compared to large feature scales, the sample sizes are found to be insufficient. Because dealing with such high-dimensional, small-sample-size data is tough, a combination of BNN and SOM can perform well, particularly in classification problems involving gene expression-related disorders. The self-organizing map is best for dimension reduction, whereas Bayesian statistics are used to estimate feature ambiguity from the posterior distribution [17]. Grisci et al. worked on a novel strategy that uses Neuroevolution as a machine-learning method to classify microarray data and choose more relevant genes at the same time. The author used the FS-NEAT algorithm. In addition, quality microarray datasets were selected using a strict filtering and preprocessing approach. When evaluated with microarray datasets of three different forms of cancer with variable numbers of samples, characteristics, and classes, the Grisci et al. approach reduced the number of dimensions in all datasets by over 99.9%. The use of the features chosen by his method improved the performance of algorithms [18]. Liu et al. used basic particle swarm optimization (PSO) to identify acute leukemia samples with 96.43 percent accuracy. 
It was compared to K-means clustering, and the findings showed that PSO performs better than K-means, although the ranking is reversed for stability [19]. Karim et al. introduced a deep learning-based gene expression data classification method that used the Grey Wolf Optimizer (GWO) to train Sparse Auto-Encoders via an unsupervised training process. Auto-Encoders (AEs) have the property that they can extract high-level attributes from raw data, and the method achieved 98.99% accuracy. Under the same test conditions and on the same datasets, the GWO results were compared with those of other metaheuristic algorithms such as Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), and Genetic Algorithms (GA). Sparse Auto-Encoders trained with GWO outperformed both conventional approaches and the above-mentioned techniques [20]. Sun et al. [21] proposed a model that uses gene expression patterns to categorize cancer and other genetic diseases. This field of clinical diagnosis is becoming increasingly important for accurate cancer diagnosis and the identification of cancer subtypes. To increase the accuracy of microarray data outputs, the authors suggested a gene selection strategy based on Fisher’s Linear Discriminant (FLD) and the neighborhood rough set (NRS). The FLD technique was useful in reducing the data so that a candidate gene subset with strong classification capability could be obtained. After that, Sun et al. defined neighborhood roughness and precision in a neighborhood decision system. Experiments showed that the strategy proposed by Sun et al. can pick a smaller and well-classified gene subset and improve classification results. W. Tang et al. [22] proposed a novel compressive sensing (CS)-based technique for leukemia subtyping to classify ALL and AML.
The CS method, a new technique for computational and statistical signal analysis, allows signals to be recovered from a small number of incoherent projections. To determine the class, the LOO method was used, which allows for signal reconstruction from a small number of incoherent signal projections. It uses fewer computations and resources and achieves 97% classification accuracy.
Silva et al. compared three distinct machine-learning models and data mining techniques to diagnose acute myeloid leukemia and acute lymphoblastic leukemia on gene expression data. The primary algorithm was the support vector machine, the second was the artificial neural network, and the third was the machine learning ensemble, which is a collection of various intelligence algorithms (Artificial Neural Network, Support Vector Machine, Random Forest, Gradient Boosting, and k-NN). The learning ability and classifying potential of the Ensemble model were consistent, and it performed better than 94% in classifying AML and ALL leukemia types [23]. CuMiDa is a valuable resource in cancer biomedical research for benchmarking machine learning techniques in leukemia gene expression analysis [24, 25]. The existing literature has been summarized in Table 1.
Proposed methodology
The proposed methodology for the classification of leukemia consists of the following steps: data acquisition, data preparation, feature selection, and design of the classifier. A detailed flow diagram of the proposed model is given in Fig 1.
Data acquisition
The Leukemia gene expression dataset from CuMiDa [27] was used in this methodology for the classification of leukemia. The details of the selected dataset are given in Table 2.
This dataset contains a total of 64 leukemia samples in 5 classes: 8 cases of Bone_Marrow_CD34, 10 cases of Bone_Marrow, 26 cases of AML, 10 cases of PB, and 10 cases of PBSC_CD34.
Data preparation
After acquiring the dataset, the next step was to understand its features and their types. The chosen dataset, GSE9476, is numeric. We first displayed the dataset to understand the behavior of the features and to spot anomalies. We used the MATLAB R2021a environment to perform all tasks and operations on the dataset. We displayed the whole data for each class to identify any outliers or abnormalities. Visualization also greatly aids data understanding and interpretation, allowing us to choose proper machine-learning techniques and algorithms to create computationally robust models. The data is uniform and has few outliers. The leukemia data is split into two parts, training and testing data.
Feature selection
Microarrays hold enormous potential for accelerating the discovery of new biological information since they can concurrently measure the expression levels of thousands of genes. One characteristic of microarray data is that the number of variables P (genes) is much larger than the sample size N. In our dataset, there are N = 64 samples and P = 22283 genes in total. We therefore need an appropriate gene selection strategy for our microarray dataset so that the feature set is reduced. In total, 25 features were selected from the leukemia gene expression dataset. These are the most distinguishing features for classification. Because biomedical problems related to genes are complex, it is difficult to build a perfect model; in the ideal case, a model achieves close to 100% classification accuracy. Different feature selection and extraction techniques have been recommended in the literature to improve classification rates and lower processing costs when identifying important genes. We improved our results by selecting the features that give better data separability. Reducing the number of features helped improve prediction performance in terms of speed and accuracy. Feature selection is the process of choosing the most suitable subset of features: the subset should contribute the most to classification, so features should be ranked according to their importance to the classification problem.
Feature selection algorithm.
The following algorithm was used for feature selection in this study.
Let y ∈ R^64 be a feature vector (one column) of the data matrix and z ∈ R^64 be the same feature vector in the transformed domain.
(1)
where α and β are constants and α > β.
We establish maps to segregate z according to the classes. The feature vector z is a normalized form of y.
(2)
where z1 ∈ R^8, z2 ∈ R^10, z3 ∈ R^26, z4 ∈ R^10, and z5 ∈ R^10. The vectors z1, z2, z3, z4, and z5 are the parts of z split according to the class labels. Let z̄i be the mean of the vector zi; we define
(3)
For i = 1, 2, 3, 4, 5
Where N1, N2, N3, N4 and N5 are 8, 10, 26, 10 and 10 respectively.
Now, we concatenate through a non-linear map “g” such that:
(4)
(5)
Let d be a measure used for the selection of features.
(6)
for i = 1, 2,…, 22283
Where xj is the value in the zero vector. We are doing this for every feature.
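The general shape of this step (split each feature by class, summarize each class, score the feature, keep the best ones) can be sketched in Python as follows. This is only an illustration under stated assumptions: the authors worked in MATLAB, the samples are assumed to be ordered by class, and `separation_score` is a hypothetical stand-in for the measure d, not the measure defined in Eq (6).

```python
from statistics import mean, pstdev

# Class sizes taken from the text: N1..N5 = 8, 10, 26, 10, 10 (64 samples).
CLASS_SIZES = [8, 10, 26, 10, 10]


def split_by_class(z, sizes=CLASS_SIZES):
    """Split one feature vector z into per-class pieces z1..z5.

    Assumes the samples are ordered by class label (an assumption
    made for this sketch only).
    """
    pieces, start = [], 0
    for n in sizes:
        pieces.append(z[start:start + n])
        start += n
    return pieces


def separation_score(z, sizes=CLASS_SIZES):
    """Hypothetical stand-in for the measure d (not the paper's formula).

    Ratio of the spread of the class means to the average
    within-class spread; larger means better separated classes.
    """
    pieces = split_by_class(z, sizes)
    class_means = [mean(p) for p in pieces]
    between = pstdev(class_means)
    within = mean([pstdev(p) for p in pieces])
    return between / (within + 1e-12)  # small constant avoids division by zero


def top_k_features(columns, k=25):
    """Return the indices of the k highest-scoring feature columns."""
    scores = [separation_score(col) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Calling `top_k_features(columns, k=25)` on the 22283 feature columns would return 25 column indices; which features are returned depends entirely on the scoring function chosen.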
Training on test data.
Consider classes 1,2,3,4 and 5 that have the number of samples n1, n2, n3, n4, and n5 respectively.
Let Di where i ∈ C be the corresponding datasets. After feature extraction:
Let F be the set of features extracted. F = {f1, f2, f3,…. fk}
We define a mapping h such that:
(7)
Where i ∈ C and
Consider two datasets and
such that i ≠ j and i, j ∈ C.
We have to perform two tasks. Firstly, we have to find whether and
are linearly separable. Secondly, if both are separable then find the separating hyperplanes Pij. For Linear separability: We define
(8)
(9)
Both classes i and j are linearly separable if there exist fx ∈ R^k and b ∈ R such that:
(10)
∀ u ∈ {1,2,.…. ni} and ∀ v ∈ {1,2,.…. nj}
We model this problem as an optimization. The Problem is given as follows:
(11)
(12)
for ∀ u ∈ {1,2,.….ni} and ∀ v ∈ {1,2,.….nj} (11) and (12) is
(13)
and
(14)
We introduce variables as yi ≥ 0 (i = 1,2,.…. ni) and yj ≥ 0 (j = 1,2,.…. nj)
Writing these equations in the matrix form:
(17)
(18)
The vector 1a consists of a ones.
We introduce constraints in the standard form of an LP from the above equations
(21)
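For orientation, one widely used way to pose two-class linear separation as a linear program is sketched below. It is a generic formulation under stated assumptions, not a reconstruction of the numbered equations above: the symbols a_u and c_v (the mapped feature vectors of classes i and j) and the superscripts on the slack variables are notation introduced here for illustration, while f_x, b, and the nonnegative variables y follow the text.

```latex
\begin{aligned}
\min_{f_x,\, b,\, y}\quad
  & \frac{1}{n_i}\sum_{u=1}^{n_i} y^{(i)}_u
    \;+\; \frac{1}{n_j}\sum_{v=1}^{n_j} y^{(j)}_v \\
\text{s.t.}\quad
  & f_x^{\top} a_u + b \;\ge\; 1 - y^{(i)}_u, \qquad u = 1,\dots,n_i, \\
  & -\bigl(f_x^{\top} c_v + b\bigr) \;\ge\; 1 - y^{(j)}_v, \qquad v = 1,\dots,n_j, \\
  & y^{(i)}_u \ge 0, \quad y^{(j)}_v \ge 0 .
\end{aligned}
```

In this sketch, the optimal objective value is zero exactly when the two finite sample sets are linearly separable, and the optimal pair (f_x, b) then defines a separating plane between classes i and j.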
Planes testing.
Let Pij be a plane between classes i and j (i ≠ j and i, j ∈ C). The list of planes is given in Table 3.
For each plane Pij, an optimal threshold is chosen that gives the maximum number of correct binary classification decisions over the entire data from yij. These thresholds are treated as the biases of the planes Pij.
(23)
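A threshold search of this kind can be sketched in Python as below. The function name, the ±1 label encoding, and the choice of midpoints between sorted projections as candidate thresholds are assumptions made for illustration; the paper's own rule is the one given in Eq (23).

```python
def best_threshold(scores, labels):
    """Pick the threshold that maximizes correct binary decisions.

    scores: projections of the samples onto the plane normal (floats).
    labels: +1 for class i, -1 for class j.
    A sample is assigned to class i when its score exceeds the threshold.
    """
    ordered = sorted(set(scores))
    # Candidates: one value below all scores, the midpoints between
    # neighbouring scores, and one value above all scores.
    candidates = [ordered[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    def correct(t):
        # Count samples whose predicted side matches their label.
        return sum((s > t) == (y == 1) for s, y in zip(scores, labels))

    best = max(candidates, key=correct)
    return best, correct(best)
```

For perfectly separated projections, the search returns a threshold lying between the two groups; when the groups overlap, it returns the first candidate with the fewest errors.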
Results
In this section, we present our findings on leukemia subtype classification using microarray data. Our method is based on linear programming. The best of the observed features are those that are most clearly distinct between classes, and these features are the most useful for classifying and diagnosing the various subtypes of leukemia. From the dataset of 22283 features, we chose the features that satisfied the following two goals: 1. they make the data easier to separate; 2. they are the most useful for classifying new data. Our dataset includes 64 leukemia samples overall, comprising 8 cases of Bone_Marrow_CD34, 10 cases of Bone_Marrow, 26 cases of AML, 10 cases of PB, and 10 cases of PBSC_CD34. Each sample has 22283 features, which contain the values of the gene expression profiles (GEP). Each of these features aids in placing a sample into a certain class. The proposed model uses the five classes (Bone_Marrow_CD34, Bone_Marrow, AML, PB, and PBSC_CD34) to identify the class to which each sample belongs. The 22283 features contribute significantly to the curse of dimensionality, which is caused by small sample sizes relative to the large number of features. Computing on such data is costly in time, memory, and effort. The details of the selected features are given in Table 4.
Feature Number is the serial number of the feature in the whole dataset, which contains 22283 features in total. Probe Set ID is the label of each feature in the dataset. It is also a unique identifier allotted to a specific and relevant group of genes in genetic engineering. The results for the 25 selected features are given in Table 5.
The samples in each class are divided into training and testing sets, which comprise 60% and 40% of the samples, respectively. Pairwise classification on the training and testing samples yields 100% and 98.44% accuracy, respectively. The pairwise precision for the testing samples is given in Table 6.
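A per-class 60/40 split can be sketched in Python as follows. The stratification by class, the rounding rule, and the fixed random seed are assumptions of this sketch; the text states only the 60%/40% proportions, and the authors worked in MATLAB.

```python
import random


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Split sample indices into train/test sets class by class.

    labels: list of class labels, one per sample.
    Returns two sorted lists of indices (train, test).
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)

    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_fraction * len(idxs))  # rounding rule is an assumption
        train += idxs[:cut]
        test += idxs[cut:]
    return sorted(train), sorted(test)


# Class sizes from the dataset description.
labels = (["Bone_Marrow_CD34"] * 8 + ["Bone_Marrow"] * 10 + ["AML"] * 26
          + ["PB"] * 10 + ["PBSC_CD34"] * 10)
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps every subtype represented in both sets, which matters here because the smallest class has only 8 samples.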
Table 7 describes pairwise classification. Class 1 is first evaluated against Classes 2, 3, 4, and 5. Then Class 2 is tested against Classes 3, 4, and 5, Class 3 against Classes 4 and 5, and finally Class 4 against Class 5. Precision is calculated at each stage of this classification. In the second scheme, Classes 1 and 2 are merged, and the merged class is first assessed against Classes 3, 4, and 5. Then Class 3 is tested against Classes 4 and 5, and Class 4 is compared to Class 5. Table 7 depicts the gene expression levels of 22283 genes from 64 samples (rows).
Table 8 lists the pairwise classification plane values. We analyzed the performance of pairwise classification, which was initially developed to reduce multi-class issues to two-class problems. Paired classification is also advantageous for computationally expensive learning approaches. Instead of initially attempting to arrange the items, pairwise comparisons between the individual items and later adding the wins for each item make it simpler for a human to discern the order between the n items.
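The reduction of a multi-class problem to pairwise two-class problems can be sketched in Python as below. The majority-vote rule for combining the pairwise decisions and the representation of a plane as a (weights, bias) pair are assumptions of this sketch and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def class_pairs(classes):
    """All unordered class pairs (i, j); one separating plane per pair."""
    return list(combinations(classes, 2))


def ovo_predict(x, planes):
    """Majority vote over pairwise planes (voting rule is an assumption).

    planes: dict mapping (i, j) -> (weights, bias).
    A positive value of w.x + b votes for class i, otherwise for class j.
    """
    votes = Counter()
    for (i, j), (w, b) in planes.items():
        value = sum(wk * xk for wk, xk in zip(w, x)) + b
        votes[i if value > 0 else j] += 1
    return votes.most_common(1)[0][0]


# Five classes give 10 pairs; merging two classes leaves 4 classes and 6 pairs.
five_class_pairs = class_pairs([1, 2, 3, 4, 5])
merged_pairs = class_pairs(["1+2", 3, 4, 5])
```

The two pair counts correspond to the ten planes used for the five-class case and the six planes used after merging Classes 1 and 2.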
Table 9 shows the pairwise classification plane values obtained by combining Class 1 and Class 2. These plane values are employed in the classification of testing samples following class merging. As previously explained, pairwise classification yields binary classification, which reduces a multi-class problem to a two-class problem.
Table 10 shows binary classification with planes and a threshold. It includes Leukemia used as a threshold for classes represented by numbers 1,2,3,4, and 5. Ҩ is used as a threshold. PNo stands for Plane Number. Output classification demonstrates class partition.
Fig 2 depicts the categorization of testing samples using ten planes at the same time. It shows the pairwise categorization of classes and also identifies misclassified samples. Fig 2 is the first half of the whole picture and shows five sub-images.
Fig 3 depicts the categorization of testing samples using ten planes at the same time. It shows the pairwise categorization of classes and also identifies misclassified samples. Fig 3 is the second half of the whole picture and shows five sub-images.
Fig 4 depicts planes that are utilized for binary classification. The objective of binary classification is to divide the items of a set into two groups (each termed class) based on a classification rule. Fig 4 depicts five sub planes in the first half of the Figure.
Fig 5 depicts planes that are used for binary classification. Fig 5 depicts five sub-planes in the second half of the picture.
Fig 6 depicts the categorization of testing samples using six planes at the same time. It shows the pairwise categorization of classes and also identifies misclassified samples. A list of the related samples is given in Table 9.
Fig 7 depicts planes that are utilized for binary classification. This is done by merging two classes and comparing them with all other classes. The objective of binary classification is to divide the items of a set into two groups (each termed class) based on a classification rule.
Fig 8 employs five colours: red, green, blue, cyan, and black. These colours were used in the projection of testing samples. On planes, this colour scheme successfully distinguishes 5 classes and emphasizes their role in classification.
Fig 9 employs four colours: red, green, blue, and black. Here, red is used for the two merged classes (Classes 1 and 2). A total of 4 classes were used in the projection of testing samples. On the planes, this colour scheme successfully distinguishes the 4 classes and emphasizes their role in the classification.
Pairwise classification yields 100% accuracy on the training samples and 98.44% accuracy on the testing samples. We improved our results by extracting the features that best separate the data. Reducing the number of features improved prediction performance in terms of both accuracy and speed. The confusion matrix is given in Table 11.
The accuracy, precision, recall, and F1 score are given in Table 12.
The comparison results for the proposed model are presented in Table 13.
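For readers who want to see how such scores follow from a confusion matrix, the sketch below computes accuracy and per-class precision, recall, and F1 in plain Python. The matrix is hypothetical and is not the one in Table 11: it was chosen only so that 63 of 64 samples are correct, which gives 63/64 ≈ 98.44%. The sample count, the class sizes, and the function name are all assumptions made for illustration.

```python
def per_class_metrics(cm):
    """Compute accuracy and per-class (precision, recall, F1).

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return accuracy, metrics


# Hypothetical 5-class matrix: 64 samples, one error (NOT Table 11).
cm = [
    [8, 0, 0, 0, 0],
    [0, 10, 0, 0, 0],
    [0, 0, 25, 1, 0],
    [0, 0, 0, 10, 0],
    [0, 0, 0, 0, 10],
]
accuracy, metrics = per_class_metrics(cm)
print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9844
for k, (p, r, f1) in enumerate(metrics):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A single misclassified sample lowers the recall of its true class and the precision of the class it was assigned to, while all other per-class scores stay at 1.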
Conclusion and future work
Targeting particular treatments to the various categories of leukemia patients is one of the biggest medical challenges, and improved classification models have become crucial for better cancer treatment. In this work, linear programming (LP) models were used to diagnose the leukemia subtypes Bone Marrow CD34, Bone Marrow, AML, PB, and PBSC CD34, using leukemia gene expression data from CuMiDa. To make the diagnosis computationally fast, we first rescaled the dataset’s 22283 features and then applied a feature selection technique, which retained the 25 most significant features with high discrimination power. With these features, the model achieved a classification accuracy of 98.44% on the testing samples. These results indicate that linear programming models can play an important role in the classification of leukemia subtypes. Accurate classification of leukemia subtypes matters because it can help increase cure rates and decrease unnecessary toxicities: doctors will be able to spot the condition earlier, and patients will be able to take preventative measures.
In the future, the approach can be applied to datasets that have more samples and more subtypes of leukemia, so that the subtypes not addressed in this study can also be covered. The expansion of datasets, which will give access to more samples, also brings new challenges. Larger datasets should make it possible to build more accurate and more complex classifiers. One of the primary goals for the future is therefore to reduce the number of classifiers that are created while still using many of them simultaneously, as is the case with ensembles of classifiers. It remains challenging to classify cancerous cells consistently and precisely while avoiding overfitting, because of the lack of data, digitization problems, and the curse of dimensionality.
References
- 1. Escobar Francesca Isabelle F., Alipo-on Jacqueline Rose T., Novia Jemima Louise U., Tan Myles Joshua T., Karim Hezerul Abdul, and AlDahoul Nouar. "Automated counting of white blood cells in thin blood smear images." Computers and Electrical Engineering 108 (2023): 108710.
- 2. Raina R, Gondhi NK, Singh D, Kaur M, Lee HN. A Systematic Review on Acute Leukemia Detection Using Deep Learning Techniques. 2023; Archives of Computational Methods in Engineering.; 30(1):251–70.
- 3. Falini Brunangelo, and Martelli Maria Paola. "Comparison of the International Consensus and 5th WHO edition classifications of adult myelodysplastic syndromes and acute myeloid leukemia." American Journal of Hematology 98,3no. 3 (2023): 481–492. pmid:36606297
- 4.
Y. Tang, Y.-Q. Zhang, and Z. Huang, FCM-SVM-RFE Gene Feature Selection Algorithm for Leukemia Classification4 from Microarray Gene Expression Data,” in The 14th IEEE International Conference on Fuzzy Systems, 2005. FUZZ ‘05., May 2005, pp. 97–101.
- 5. Shukla Alok Kumar, et al. "A study on metaheuristics approaches for gene selection in microarray data: algorithms, applications and open challenges." Evolutionary Intelligence 13 (2020): 309–329.
- 6. Huang D., Quan Y., He M., and Zhou B., “Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data,” J. Exp. Clin. Cancer Res., vol. 28, no. 1, p. 149, Dec. 2009, pmid:20003274
- 7. Peng H.Y., Jiang C.F., Fang X. and Liu J.S. Variable selection for Fisher linear discriminant analysis using the modified sequential backward selection algorithm for the microarray data. 2014, Applied Mathematics and Computation, 238, pp.132–140.
- 8. Yoo C., Lee I.-B., and Vanrolleghem P. A., “Interpreting patterns and analysis of acute leukemia gene expression data by multivariate fuzzy statistical analysis,” Comput. Chem. Eng., vol. 29, no. 6, pp. 1345–1356, May 2005,
- 9. Taskesen E., Babaei S., Reinders M. M., and de Ridder J., “Integration of gene expression and DNA-methylation profiles improves molecular subtype classification in acute myeloid leukemia,” BMC Bioinformatics, vol. 16, no. 4, p. S5, Feb. 2015, pmid:25734246
- 10.
Y. He, Y. Tang, Y.-Q. Zhang, and R. Sunderraman, “Mining fuzzy association rules from microarray gene expression data for leukemia classification,” in 2006 IEEE International Conference on Granular Computing, May 2006, pp. 461–464.
- 11. Klein H.-U. et al., “Quantitative comparison of microarray experiments with published leukemia related gene expression signatures,” BMC Bioinformatics, vol. 10, no. 1, p. 422, Dec. 2009, pmid:20003504
- 12.
Stiglic G., Khan N., Verlic M., and Kokol P., “Gene Expression Analysis of Leukemia Samples Using Visual Interpretation of Small Ensembles: A Case Study,” in Pattern Recognition in Bioinformatics, Berlin, Heidelberg, 2007, pp. 189–197.
- 13. Feltes B. C., Chandelier E. B., Grisci B. I., and Dorn M., “CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research,” J. Comput. Biol., vol. 26, no. 4, pp. 376–386, Apr. 2019, pmid:30789283
- 14. Bilen M., Işik A. H., and Yiğit T., “A New Hybrid and Ensemble Gene Selection Approach with an Enhanced Genetic Algorithm for Classification of Microarray Gene Expression Values on Leukemia Cancer,” Int. J. Comput. Intell. Syst., vol. 13, no. 1, pp. 1554–1566, Oct. 2020,
- 15. Xu P., Brock G. N., and Parrish R. S., “Modified linear discriminant analysis approaches for classification of high-dimensional microarray data,” Comput. Stat. Data Anal., vol. 53, no. 5, pp. 1674–1687, Mar. 2009,
- 16. Lu Q. and Qiao X., “Sparse Fisher’s linear discriminant analysis for partially labeled data,” Stat. Anal. Data Min. ASA Data Sci. J., vol. 11, no. 1, pp. 17–31, 2018,
- 17.
G. Zhou, “Gene-Based Disease Classification Using Bayesian Self-Organizing Map Neural Networks,” PhD Thesis, Northern Illinois University, 2021.
- 18. Grisci B. I., Feltes B. C., and Dorn M., “Neuroevolution as a tool for microarray gene expression pattern identification in cancer research,” J. Biomed. Inform., vol. 89, pp. 122–133, Jan. 2019, pmid:30521855
- 19.
Y. Liu, X. Shi, and Z. An, “Classification of Leukemia Gene Expression Data Using Particle Swarm Optimization,” in 2012 Sixth International Conference on Genetic and Evolutionary Computing, Aug. 2012, pp. 241–244.
- 20.
A. M. Karim, “A new Sparse Auto-encoder based Framework using Grey Wolf Optimizer for Data Classification Problem,” ArXiv Prepr. ArXiv220112493, 2022.
- 21. Sun L., Liu R., Xu J., and Zhang S., “An Adaptive Density Peaks Clustering Method With Fisher Linear Discriminant,” IEEE Access, vol. 7, pp. 72936–72955, 2019,
- 22. Tang W., Cao H., Duan J., & Wang Y. P. (2011). A compressed sensing based approach for subtyping of leukemia from gene expression data. Journal of bioinformatics and computational biology, 9(05), 631–645. pmid:21976380
- 23. Silva J.M.L., dos Santos Costa J. and da Costa E.M., da Silva Holanda Maria Eliana et al.“. Leukemia Diagnosis with Machine Learning Ensemble from Gene Expression Data”, International Journal of Development Research, 11(09), pp.50641–50646.
- 24.
Patel, S., Patel, H., Vyas, D., & Degadwala, S. (2021, October). Multi-Classifier Analysis of Leukemia Gene Expression From Curated Microarray Database (CuMiDa). In 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC) (pp. 1174–1178). IEEE.
- 25.
Ramisa, A. J., Hossain, A., Islam, S. M. I., Swadesh, P. M., Islam, M. T., Rahman, M. A., & Parvez, M. Z. (2021, December). Gene Expression Data Classification and Pattern Analysis Using Data Driven Approach. In 2021 International Conference on Machine Learning and Cybernetics (ICMLC) (pp. 1–9). IEEE.
- 26. Xie Fanfan, He Mingxiong, He Li, Liu Keqin, Li Menglong, Hu Guoquan, and Wen Zhining. "Bipartite network analysis reveals metabolic gene expression profiles that are highly associated with the clinical outcomes of acute myeloid leukemia." Computational Biology and Chemistry 67 (2017): 150–157. pmid:28110245
- 27.
Grisci, B. Leukemia Gene Expression—CuMiDa—Kaggle.com. 2019. https://www.kaggle.com/datasets/brunogrisci/leukemia-gene-expression-cumida