A framework model using multifilter feature selection to enhance colon cancer classification

Gene expression profiles can be utilized in the diagnosis of critical diseases such as cancer. The selection of biomarker genes from these profiles is significant and crucial for cancer detection. This paper presents a framework proposing a two-stage multifilter hybrid model of feature selection for colon cancer classification. Colon cancer is now among the most common types of cancer, and there is a need for a fast and accurate method to classify tissues and to enhance the diagnostic process and drug discovery. This paper reports on a study whose objective has been to improve the diagnosis of colon cancer through a two-stage, multifilter model of feature selection. The model described performs feature selection using a combination of Information Gain and a Genetic Algorithm. The next stage filters and ranks the genes identified through this method using the minimum Redundancy Maximum Relevance (mRMR) technique. The final phase analyses the data further using correlated machine learning algorithms. This two-stage approach, which selects genes before classification techniques are applied, improves success rates for the identification of cancer cells. The Decision Tree, K-Nearest Neighbor, and Naïve Bayes classifiers showed promising, accurate results using the developed hybrid framework model. It is concluded that the proposed method achieves higher accuracy than the existing methods reported in the literature. This study can inform efforts to enhance treatment and drug discovery for the cure of colon cancer.


Introduction
Generally, cancer is reckoned by the World Health Organisation (WHO) to be the second most common cause of death in the world [1]. Colon cancer, in particular, is ranked as the third-most prevalent cancer in the United States [2]; similarly, it is ranked in third position among cancers in the UK [3] and is responsible for a large number of fatalities across the globe [4,5].
Effective medical treatments are lacking for most common types of cancer [6]. One major traditional approach to detecting cancer is microscopic observation of a biopsy sample, which is time-consuming, not cost-effective, and sometimes yields inaccurate results [7,8]. Other traditional approaches use the morphological appearance of tumors or parameters derived from clinical inspections, but these may lead to imprecise results [9,10]. As cancer is considered to be a disease involving dynamic genome changes [10,11], considerable efforts have been made by researchers and technologists to explore the precise assessment and diagnosis of cancer, including tumor prediction. Gene expression profiles using microarray data, combined with computational analysis methods, are considered the most recent techniques for reliable investigation of cancer features and can yield more accurate predictions [6-10, 12, 13].
A major technological advance in classifying cancers has been the development of DNA microarray techniques, which have enabled the simultaneous measurement of a large number of genes' expression levels [10,[14][15][16]. A big challenge, however, is the high dimensionality of the genes (features) compared to the limited sample size available [6,9,14,[17][18][19][20][21][22][23][24]; microarray data may therefore contain many redundant, noisy, and irrelevant genes (features). To overcome this high dimensionality, there must be a way to choose a reduced subset of genes (features) from the immense number of genes, in order to produce high cancer classification accuracy and reduce the redundant genes. Feature selection therefore becomes an important pre-requisite step for cancer classification and detection, because it reduces redundancy, selects the most relevant genes, and enhances the classification of cells into benign (normal) and malignant (cancerous) [14,[24][25][26].
The current paper describes a two-stage approach to improving the successful identification of colon cancer genes. The proposed model is composed of a pre-selection step that applies a hybrid of an Information Gain ranker and a Genetic Algorithm. Thereafter, the mRMR (minimum Redundancy Maximum Relevance) filter method is applied as the second stage. This mechanism is deployed to produce a reduced subset of genes that contains an optimal subset with less noise and more relevant genes. To assess and compare the results of the proposed two-stage hybrid method, a set of machine learning classification methods is used in this investigation.
The rest of the paper is structured as follows: section two presents the background and the literature review, section three presents the dataset, tools and techniques applied, while section four discusses the methodology implemented and the research approach. Section five renders the experimentation of the proposed method. Section six presents the results of the experiments, while section seven discusses and analyses the performance of the results. Finally, the conclusion and future work are presented in section eight.

Background and literature review
In the context of microarray technology, feature selection methods can be organized into three categories [14,15,18,19,27]: filter, wrapper, and embedded. In the filter method, the genes are evaluated and ranked against the class label, without taking into consideration the correlation and interaction between the genes. It is independent of the predictor and does not use a learning algorithm (classifier) [28][29][30][31][32][33][34]. The wrapper method, by contrast, depends on adding or deleting features and uses the learning algorithm (classification algorithm) to assess the feature subsets [18,31,34,35].
The advantage of filter methods over the wrapper method is that they provide a faster alternative, albeit with reduced accuracy [36,37]; the wrapper approach achieves accurate results, but with the disadvantage of being computationally slow. The embedded method, comparable to the wrapper method, applies searching algorithms for optimal feature subsets but is tied to the construction of a specific classifier [32,34,38]. The model proposed in [34] is an amalgamation of the filter and wrapper approaches and is designed to mitigate the wrapper method's computational complexity. Such hybrid methods of classification have been used extensively in recent years to categorise cancer genes.
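The filter/wrapper distinction can be sketched with a toy example. Everything below is an illustrative stand-in, not a method from the studies surveyed: the data are invented, the filter score is a simple difference of class means, and the wrapper evaluates a subset with a 1-nearest-neighbour classifier.

```python
# Toy expression data: 6 samples x 3 genes; gene 0 tracks the class
# label exactly, genes 1-2 are noise.  (Invented for illustration.)
X = [[1, 7, 0], [1, 3, 1], [1, 2, 0], [0, 3, 1], [0, 7, 0], [0, 2, 1]]
y = [1, 1, 1, 0, 0, 0]

def filter_score(col):
    """Filter method: score a gene against the class label alone (here,
    absolute difference of class means); no classifier is involved."""
    pos = [X[i][col] for i in range(len(y)) if y[i] == 1]
    neg = [X[i][col] for i in range(len(y)) if y[i] == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def wrapper_score(cols):
    """Wrapper method: evaluate a candidate subset with a learning
    algorithm (1-nearest-neighbour, leave-one-out accuracy)."""
    correct = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = min(others, key=lambda j: sum(
            (X[i][c] - X[j][c]) ** 2 for c in cols))
        correct += y[nearest] == y[i]
    return correct / len(X)

ranked = sorted(range(3), key=filter_score, reverse=True)
print(ranked)               # [0, 2, 1]: gene 0 ranks first under the filter
print(wrapper_score([0]))   # 1.0: the wrapper confirms subset {0}
```

The speed/accuracy trade-off described above is visible even here: the filter touches each gene once, whereas the wrapper re-runs a classifier for every candidate subset.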

Algorithms reviewed
The performance of various models in accurately classifying genes associated with colon cancers is summarised in Table 1. Among these studies, which have used algorithms based on multifilter and hybrid approaches to feature selection, four in particular [24,[46][47][48] have been successful insofar as they performed with an accuracy in excess of 90% in the context of colon cancers. Other approaches assessed among this sample of studies were found to have accuracy levels ranging from 66% to 89%. Ten of these algorithms applied an SVM as the machine learning algorithm and were noted to have the highest classification accuracy [18,[46][47][48][49][50][51][52]. An accuracy of 90% or more was attained using GA, PSO and mRMR at a pre-selection stage.
Shutao et al. [48] isolated a subset of the top ten genes in their study in order to perform highly accurate classification, while Abdi et al. [46] used the minimum Redundancy Maximum Relevance (mRMR) technique on a pre-selected sample of 50 genes. In both cases, the size of the sample was predetermined. Mohamed et al. [47] reported a classification accuracy of 90.32%, using a hybrid selection approach and Support Vector Machines (SVM).
Hybrid selection was also employed by Ammu et al. [59] to ascertain information gain figures, after which a biogeography-based optimisation technique was used. The hybrid approach used by Chaung et al. [31] started with a genetic algorithm with a dynamic variable to select a sample of genes, which were then ordered using chi-square analysis; the accuracy of the selection was then evaluated using SVM.
The strategy used by Dash et al. [18] was to use a combination of wrappers and filters. Feature selection was carried out using three wrappers (J48, Random Forest (RF) and Random Trees) and a sample of genes, which were assessed using the Correlation-based Feature Selection (CFS) technique. K-Nearest Neighbour (KNN) analysis and SVM were then used to measure classification accuracy. El Akadi et al. [49] initially used both mRMR and GA to study genes associated with colon cancer, verifying this approach using Naïve Bayes classifiers and SVM.
Wang et al. [60] used a two-stage hybrid method which entailed initially using a ranking procedure to obtain a sub-sample of genes. This was followed by a hierarchical grouping of the genes selected, after which an analysis was carried out using the classification algorithms C4.5, KNN, NB and SVM. The hybrid approach used by Tan et al. [61] involved a feature selection enhancement of a sample of genes using a GA; this was achieved by combining the best results from a group of feature selection methods, after which SVM were used to analyse the data. Kim and Cho [62] classified genes by employing an evolutionary neural network, while Mohamad et al. [19] made their selection of genes from microarray data, using a Cyclic-GASVM hybrid method. In a separate study, Mohamad et al. [63] used a variation of the GASVM, (referred to as "GASVM-II + GASVM"), for the gene selection process.
An alternative means of feature selection was used by Hanaa et al. [57], using a combination of GA and information gain; subsequent analysis was carried out using Genetic Programming (GP). Elyasigomari et al. [64] applied "MRMR-COA-HS", which first used mRMR to make a selection of genes, before using a wrapper involving an algorithm known as COA-HS, with SVM for classification. Alshamlan et al. [17] also used SVM at the final stage, having carried out feature selection using both mRMR and an ABC algorithm. Shukla et al. [22] presented a two-stage selection approach composed of a combination of Spearman's Correlation (SC) and distributed filter FS methods. In [35], Shukla et al. proposed another hybrid wrapper method to obtain the key gene expressions, composed of Correlation-based Feature Selection (CFS) as the first step, followed by the TLBO algorithm as the second step; the reported accuracy ranged from 88.52% to 92.23% [35]. Table 1 lists 32 different approaches applying the hybrid feature selection method, four of which achieved a classification accuracy of 90% or above. Most state-of-the-art studies found that, for the colon cancer dataset, mRMR, GA, IG, and PSO are commonly applied for hybrid feature selection and tend to yield better results.

The limitations of previous studies
In the light of above, the limitations of previous studies are highlighted below.
• Most studies in the literature reported good results when they limited gene selection to a fixed number of genes prior to classification, thus ignoring the remaining genes and potentially discarding important ones.
• Many studies claimed that reducing the number of genes will enhance classification accuracy, but as shown in Table 1 the best accuracy reached only 92%. Thus, there is a need for a better method or framework model to demonstrate the classification enhancement of hybrid methods.
• To the best of the authors' knowledge, no previous study in the literature has addressed hybrid feature selection using a two-stage multifilter hybrid selection approach.

The objectives of investigation
The main objective of this investigation is to develop a new framework for selecting colon cancer genes in two stages, the first comprising a multifilter hybrid stage (GA+IG) to optimize the quality of the data from the dataset, and the second consisting of an mRMR procedure for making the final selection. Both stages work as selection algorithms alongside machine learning classifiers to predict cases of colon cancer. These hybridizations of algorithms are proposed to obtain gene subsets with a minimal number of relevant genes, which thereafter can produce the high classification accuracy needed to better detect colon cancer.
To overcome the limitations mentioned in section 2.2, in this study we use three selection algorithms (GA, IG, and mRMR). This combination is different from previous two-stage approaches, and we tested the accuracy using four classifiers (SVM, NB, DT and KNN) to ensure the investigation is conducted rigorously. The reasons for employing these algorithms are: 1) they have shown better performance than other selection algorithms in the field and have proved very effective in many colorectal cancer research studies [39][40][41]; 2) GA has the ability to manage high-dimensional datasets for colon cancer [65][66][67]; 3) GA can achieve interesting results when combined with other algorithms [68]; 4) GA is easily integrated and works in parallel with other algorithms; 5) IG has advantages in eliminating redundant genes and reducing noise [26,69,70]; 6) combining GA and IG in stage 1 of this framework model achieves the target of generating a subset of features that are top-ranked and of very good quality; 7) utilizing mRMR as a second filter in stage 2 refines the subset generated from stage 1 through a further selection of features, which are more correlated and relevant to the class while having the least correlation among themselves. It follows that all of these algorithms can be expected to yield highly interpretable gene expressions and thus a better identification of colorectal cancer.

Dataset, tools and techniques applied
This section describes the colon cancer datasets used throughout the study, outlines the tools used for the experimentation, and presents the selection techniques applied.

Background of the dataset
In this paper two datasets were used. The first was collected from Alon et al. [71] and has been used in several colon cancer research studies [18, 46-57, 59, 72]. This dataset is publicly available and is still utilized in the most recent studies [22-24, 35, 57, 56, 73-77]. Moreover, to validate the performance of the proposed model, another colorectal dataset was used, collected from Notterman [78], which has also been used in recent studies [8,79,80]. Both datasets are publicly available and acquired as gene expressions. Table 2 presents the details of these two datasets.

Tools utilised
The Weka machine learning environment (https://ai.waikato.ac.nz/weka/) is employed in this research, as it provides a number of techniques that can be used for data validation. Two such techniques are leave-one-out cross-validation (LOOCV) and k-fold cross-validation, both of which randomly assign items of data to either a 'training' or a 'testing' set [81,82]. In LOOCV, a classifier is learned on all but one data point in the sample and tested on that one data point [83]. The k-fold cross-validation technique differs in that the data are divided into k equally sized sub-samples; each sub-sample is used once for testing and otherwise for training, and the process is repeated k times so that every sub-sample is tested [84].
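The two validation schemes can be sketched with plain index bookkeeping. This is a minimal illustration only; the round-robin fold assignment below is an assumption for brevity, whereas Weka's own splitting is randomised and stratified.

```python
def kfold_indices(n, k):
    """Yield k train/test index splits; each fold serves exactly once as
    the test set while the remaining samples form the training set.
    (Round-robin fold assignment is an assumption for illustration.)"""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in folds:
        train = [i for i in range(n) if i not in held_out]
        yield train, held_out

def loocv_indices(n):
    """LOOCV is the k = n special case: train on all but one point."""
    return kfold_indices(n, n)

splits = list(kfold_indices(10, 5))
print(len(splits))                    # 5 splits of 8 train / 2 test
print(len(list(loocv_indices(10))))   # 10 splits of 9 train / 1 test
```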

Feature selection techniques
Improving the accuracy of predictions by identifying certain features on the grounds of correlation statistics is known as 'feature selection'. For a dataset D having d dimensions, the feature set can be expressed as F = {f(1), f(2), ..., f(d)}. The objective is to deduce an optimal group of features F', where (1) F' ⊆ F and (2) |F'| is much smaller than d, since this will represent a very good rate of classification. The classification process then presents the test accuracy of the result, which can be assessed as the ratio of correctly predicted samples to total samples.

Methodology
In this section, a description will be given of the methodology used, including the system design and the creation and use of the appropriate algorithm.

System design
PLOS ONE

The key contribution of this research is to develop an original framework for a two-stage multifilter hybrid method for colon cancer feature selection, to achieve better classification of the top genes in the selected subset. In addition, we applied the same machine learning algorithms without any parameter tuning. Although the use of hybrid models appears in the research literature, the novel aspect of the present study is that it sets out to decrease the number of genes selected and to enhance classification accuracy by means of a multifilter, two-stage feature selection process.
The rationale for striving to improve on current selection and ranking approaches is that they rely on a one-stage process, so the probability that their results contain the 'noise' of redundant and unrelated genes still exists. The current study tries to alleviate this problem using a two-stage, multifilter technique, which proceeds as follows: • As a first stage, a hybrid procedure (GA+IG) is applied to the entire dataset, which both selects genes (GA) and ranks them (IG). The key idea of this hybridization is that IG ranks the genes according to their importance, while GA is a well-known algorithm for finding an optimal solution and is easy to implement. Both algorithms refine and reduce the dataset for stage two and for the subsequent classification.
• We then filter the selected features using a second stage of gene ranking (mRMR), which removes redundant genes, reduces noise, and leaves only correlated genes in the newly selected subset.
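The stage-2 criterion can be sketched as a greedy search that trades relevance against redundancy. In this minimal illustration, absolute Pearson correlation stands in for the mutual-information terms mRMR normally uses, and the toy genes are invented:

```python
def pearson(a, b):
    """Plain Pearson correlation; |r| stands in here for the
    mutual-information terms of the standard mRMR formulation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def mrmr(features, labels, n_select):
    """Greedy mRMR: repeatedly pick the gene with the highest
    relevance-to-class minus mean redundancy with genes already chosen."""
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < n_select:
        def score(f):
            relevance = abs(pearson(features[f], labels))
            redundancy = (sum(abs(pearson(features[f], features[s]))
                              for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

labels = [1, 1, 1, 0, 0, 0]
features = [[3, 3, 2, 1, 0, 0],   # strongly relevant gene
            [3, 3, 2, 1, 0, 0],   # exact duplicate: pure redundancy
            [1, 0, 1, 0, 1, 0]]   # weakly relevant but non-redundant
print(mrmr(features, labels, 2))  # [0, 2]: the duplicate gene is skipped
```

Note how the duplicate gene, despite its high relevance, is penalised by its redundancy with the first selection; this is exactly the "correlated with the class, uncorrelated with each other" behaviour described above.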
As rendered in Fig 2, the procedure is as follows: • The raw dataset comprises actual tissue samples obtained from patients suffering from colon cancer, prepared for analysis in the form of a microarray.
• The comprehensive gene expression information that is contained within the microarray is prepared in a format that enables analysis using the appropriate computer programs.
• The first phase of the analysis is to process the data to reduce the 'noise' in the dataset and to perform some initial categorisation, to improve the accuracy of the subsequent classification. This consists of a two-stage action: 1) the (GA+IG) hybrid procedure; 2) feature selection, consisting of mRMR applied to the Stage 1 output. • Assessment of prediction accuracy, which is performed with a number of classification algorithms, such as SVM, NB, DT and K-NN. This final stage provides an evaluation of the accuracy with which a cell from a patient can be diagnosed as being cancerous or normal.
The following paragraphs outline the proposed framework model, whilst Fig 2 presents the pseudocode. To recap, the overall objective of this work is to improve the accuracy with which cells are classified as being cancerous or non-cancerous, with the approach of this work being to improve feature selection so that a better subset of genes is used for the analysis, including genes that are more closely related.

Definitions and descriptions
It is assumed that the dataset subjected to the initial GA process is m-dimensional and that the format of the data can be defined by the matrix (Data_n×m), where n represents the number of data points (individuals being treated for colon cancer, in the current context) and m is the number of genes involved in the analysis. The process of multifilter feature selection has the objective of deriving the best possible subset of features for the analysis. Let the initial set of features, X, having m dimensions, be defined by the equation X = {x(i) | i = 1,2,3,...,m}, where x(i) are the defined features and m is the number of genes. The feature selection process, IG, is used to derive Y, calculated as Y = {y(i) | i = 1,2,3,...,p}, where y(i) are the selected optimal features and p represents the revised number of genes. The next step in the method is to rank all of the genes (features) in terms of the amount of information that is derived from including each one, with the criterion for inclusion being a positive value (i.e. an information gain threshold value above zero). This ordering is carried out to identify the features that have the greatest influence on the classification of the genes. Y must be an optimal subset of X, so that Y ⊆ X and p ≤ m. The features y(i) ∈ Y are then subjected to GA, to create the vector Z = {Z(i) | i = 1,2,3,...,q}, where Z(i) represents the new subset of features and q is the number of features in the subset, with Z ⊆ Y and q ≤ p. A disadvantage of using the IG procedure is that the features are dealt with separately, so that the correlations between them may be lost. Using mRMR minimises redundancy in the process, due to its emphasis on high relevance and close correlation; in the context of the Z data, mRMR identifies features that are strongly relevant to the task of classification and which carry the least redundancy, thus deriving a final set of vectors A = {A(i) | i = 1,2,3,...,s}, where A(i) is the final subset of features and s is the number of features. In this case, A ⊆ Z and s ≤ q. In the next phase, the vectors A are categorised in terms of whether they refer to a tumour or normal tissue, using the binary labelling system {−1,+1}. This provides a new dataset of genes, defined as D′ ⊂ D, where the selection retains m′ < m features from the n samples, and D represents the microarray of gene expressions.
The effect of the procedure described here is to create a situation where F: A → C, whereas previously F: X → C.
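The zero-threshold information-gain step (X → Y above) can be sketched directly from its definition, IG = H(class) − H(class | feature); the feature values and labels below are invented for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature); genes scoring zero or below
    are discarded, matching the positive-IG criterion described above."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [labels[i] for i in range(n) if feature_values[i] == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ['tumour', 'tumour', 'normal', 'normal']
print(information_gain(['hi', 'hi', 'lo', 'lo'], labels))  # 1.0: fully informative
print(information_gain(['hi', 'lo', 'hi', 'lo'], labels))  # 0.0: uninformative
```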

Experimentation
This section presents the data preparation, the instrumentation tools, the experiments design, and the experiment process.

Data preparation
An issue that needs to be overcome in gene research is that any set of data analysed will be small in relation to the total gene population. Furthermore, the global genetic dataset is characterised by 'noise' and redundant information [85]. Feature filtering techniques are considered one way to address this situation, as they prepare the raw data in a suitable form for analysis.
A popular way to pre-process the data is to discretise it using the entropy-based discretisation method proposed by Fayyad & Irani [86]. The approach used in the present study as a means of global discretisation is one that has already been used elsewhere [10,16,18,49,87,88]. Since the first dataset [71] is unprocessed, we discretised the original data into categorical values to minimise and eliminate noise. This algorithm recursively applies an entropy minimisation heuristic to discretise the continuous-valued attributes, and the recursion stops according to the minimum description length (MDL) principle [10]. The second dataset [78] was processed by first removing any duplicated genes to keep only the unique ones, and then standardising each array to zero mean and unit variance. A total of 860 duplicates were found and removed.
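A minimal sketch of the two preprocessing steps described for the second dataset (duplicate removal, then zero-mean/unit-variance scaling); the probe identifiers are borrowed from Table 4 and the values are invented:

```python
def deduplicate(genes):
    """Keep only the first occurrence of each gene identifier,
    mirroring the duplicate removal applied to the second dataset."""
    seen, unique = set(), []
    for name, values in genes:
        if name not in seen:
            seen.add(name)
            unique.append((name, values))
    return unique

def standardise(values):
    """Rescale one array to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Illustrative probes only (identifiers borrowed from Table 4).
genes = [('M26383', [1.0, 2.0, 3.0]),
         ('M26383', [1.0, 2.0, 3.0]),   # duplicate probe
         ('M63391', [4.0, 4.0, 7.0])]
unique = deduplicate(genes)
print(len(unique))                      # 2: the duplicate is dropped
print(standardise([1.0, 2.0, 3.0]))    # zero-mean, unit-variance values
```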
When using GA, IG, mRMR and the selected classification process, certain default settings were adopted initially: the GA population size was 20, the termination criterion was 20 generations, the crossover probability was 0.6, the mutation probability was 0.033, and the IG threshold was fixed at zero.
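A compact GA sketch wired with the default parameters just listed. The bit-mask encoding and the fitness function (agreement with an invented "informative" mask) are illustrative assumptions only, not the study's fitness measure:

```python
import random

random.seed(1)

# Defaults stated above: population 20, 20 generations, crossover 0.6,
# mutation 0.033.  TARGET is a hypothetical mask of informative genes.
POP_SIZE, GENERATIONS, P_CROSS, P_MUT, N_GENES = 20, 20, 0.6, 0.033, 10
TARGET = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

def fitness(mask):
    """Illustrative fitness: bits agreeing with the target mask."""
    return sum(m == t for m, t in zip(mask, TARGET))

def evolve():
    pop = [[random.randint(0, 1) for _ in range(N_GENES)]
           for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:POP_SIZE // 2]           # truncation selection
        children = []
        while len(children) < POP_SIZE - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_GENES) if random.random() < P_CROSS else 0
            child = a[:cut] + b[cut:]           # single-point crossover
            child = [1 - g if random.random() < P_MUT else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))   # converges to, or very near, the maximum of 10
```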

Instrumentation and resources used
The experiments were conducted using the Weka machine learning environment and the related library packages, with default values for all parameters [11,89]. The computing environment was a PC running the Windows 10 operating system with a 1.8GHz Intel Core i5 processor and 8GB of installed RAM. The experiments were also run on other configurations (including Windows 7 and Windows 8 machines with an Intel Core i7 and 16GB RAM), but this did not affect the output obtained.

Experimental design
Prior to starting the analysis, the data were separated into two sets, training and testing, in order to create an independent test set and improve the validity and accuracy of the classification. The experiments used different testing models (k-fold cross-validation, LOOCV, and splitting into training and testing proportions). As the number of samples in the datasets is small, 10-fold cross-validation was adopted [90]. We also adopted a testing model that divides the samples into approximately 70% training and 30% testing. The creation of the training set enabled a validation of feature selection; the test set fulfilled a similar validation role in relation to the classification process. It is significant to note that, when implementing cross-validation in the proposed method, we discretised the training set separately for each fold so as not to access the testing data, which would otherwise result in optimistic error rates and compromise the reliability of the experiment. Thus, during the training process, the test set remains unseen (hidden) to ensure the validity of the results when applied to fresh data.
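The leakage-free, per-fold discretisation can be sketched as fitting the discretiser on each training fold only and then applying it to the held-out fold. Here a simple median cut stands in for the entropy-based (Fayyad-Irani) method actually used, and the data are invented:

```python
def fit_discretiser(train_values):
    """Learn a cut point from the training fold ONLY; a median split
    stands in here for the entropy-based (MDL) method used in the study."""
    ordered = sorted(train_values)
    return ordered[len(ordered) // 2]

def apply_discretiser(cut, values):
    """Apply the already-fitted cut to unseen (test) values."""
    return ['hi' if v >= cut else 'lo' for v in values]

# Ten hypothetical expression values, split into 5 folds of 2 samples.
data = [0.2, 0.4, 0.9, 1.3, 1.8, 2.2, 2.7, 3.1, 3.5, 4.0]
cuts = []
for fold in range(5):
    test_idx = [2 * fold, 2 * fold + 1]
    train = [v for i, v in enumerate(data) if i not in test_idx]
    cut = fit_discretiser(train)            # test fold never seen here
    labels = apply_discretiser(cut, [data[i] for i in test_idx])
    cuts.append(cut)
print(cuts)   # each fold learns its own cut: [2.7, 2.7, 2.7, 1.8, 1.8]
```

Because the cut point is re-learned per fold, the held-out samples never influence it, which is exactly the optimistic-error-rate pitfall the paragraph above guards against.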

Experimental processes
Data preprocessing techniques were carried out on the datasets prior to the analysis (see Section 5.1). Features were then selected using the following two-stage approach: • Stage 1: Discriminative scores were derived for each gene using IG, and all genes with a score of zero were eliminated from the dataset. Genes providing a large amount of information were selected, using GA, in order to optimise the dataset with informative and correlated genes.
• Stage 2: Redundancy levels were further reduced using mRMR, to maximise the efficiency of the gene selection process. The objective, here, was to reduce the number of features in the analysis to a minimum and to lower the amount of 'noise' in the data. mRMR was used to derive a subset of preferred genes.
Next, the classifications carried out were evaluated using a number of approaches (DT, K-NN, NB and SVM), in order to identify the best and most efficient classification algorithm and to measure the classification rate. Fig 2 shows the pseudocode used in the method described in sub-section 4.2. The first step was to rank the features by the extent of their information gain, using IG, after which subset features were searched using the GA technique. The evaluation, which dealt with each gene in turn, used a fitness function. The next step was to derive a new, improved population by selection, crossover and mutation; this was repeated until pre-defined criteria for halting the process were achieved. mRMR was then used on the subset of genes obtained by this method, retaining only genes that had high relevance and were closely correlated with one another. Finally, an evaluation was carried out, employing a number of algorithms, of the quality and accuracy of the classifications. A summary of the implementation of the research is provided in Fig 3.

Experimental results
As described above, a two-stage hybrid approach, (IG+GA) followed by mRMR, was used for the selection of features, followed by subsequent classification. The results obtained are now presented in terms of the number of features selected and the outcome of the evaluation of the quality of the classification process. Table 3 shows the results of Stage 1 of the analysis (feature selection). From the initial population of 2,000 genes in the first dataset (Dataset 1), a subset of 68 genes was selected using Information Gain and the Genetic Algorithm, while 475 genes were selected at the same stage from the second dataset (Dataset 2).

Number of selected features
In stage 2 of the analysis, mRMR is used to rank the gene population according to each gene's level of redundancy and level of correlation with the other genes. This resulted in the creation of a subset that minimised redundancy and maximised the chosen genes' contribution to the classification process. As Table 3 shows, a total of 22 features from dataset 1 had been included at this stage of the process, the original dataset having been reduced by almost 99%; 35 features from dataset 2 had been included at this stage, reducing the original dataset by almost 99.5%. Table 4 shows the top-ranked, selected genes considered to be the key genes in the occurrence and development of colorectal cancer; the table contains the Expressed Sequence Tag Number (EST) and the gene expression description. For example, some of the key gene expressions identified by the proposed model in dataset 1 are M26383, M63391, M76378, J02854, and T968730, while in dataset 2 they are R36977, M77836, T96548, T64297, and M97496. Table 5 compares the classification accuracy prediction results between stage 1 and stage 2, in order to verify the effectiveness of the proposed framework model under multiple testing models. From this table, it is recognised that the effect of the framework model on dataset 2 is limited by the nature and structure of that dataset: since dataset 2 already showed very high classification accuracy in stage 1 (highest prediction accuracy 97% using k-fold and LOOCV), the effect in stage 2 is only slightly noticeable (highest prediction accuracy 100%). Splitting dataset 2 into training and testing validation sets has no further effect, because of its nature and the smaller number of samples included. The effectiveness of the proposed framework model on dataset 1, however, is clearly noticeable, as the highest accuracy rises from 90% in stage 1 to 94% in stage 2. Moreover, Fig 4 shows the evaluation results of the proposed procedure's classification accuracy, which was assessed using a number of algorithms (DT, K-NN, NB and SVM) under the different testing models.
In addition, Fig 5 presents the results for the testing schemes considered most appropriate (lower prediction error rates and less computational time): the training and testing split for dataset 1, and k-fold cross-validation for dataset 2. The key findings in Fig 5 are: 1) for dataset 1, DT and K-NN performed best, with classification accuracy measured at 94% when used as part of a two-stage process that began with a pre-selection stage; the least accurate algorithm was SVM (81.25%), whilst the level of performance achieved by NB (87.5%) was acceptable; 2) for dataset 2, NB performed best, with a classification accuracy of 100% under the two-stage model; the least accurate algorithm was DT (94.4%), whilst both SVM and K-NN achieved 97.2% (see Table 5 and Fig 5).

Analysis and discussion
There is clear evidence to suggest that the hybrid multifilter method proposed here performs the task of feature selection better than similar approaches presented in the literature (see Table 1). The classification algorithms providing the best performance were K-NN and DT (with an accuracy rate of 94%) for dataset 1, with NB emerging as the best algorithm for dataset 2, with an accuracy level of 100%, using the appropriate testing validation models as discussed in section 6.2. A comparison of the proposed approach with those used in similar studies on the same dataset (dataset 1) indicated that it achieved better classification accuracy than the method used by Zhang et al. [24], who obtained 91.9% accuracy using their proposed method of FSBRR and MI, followed by K-NN. Our proposed model also outperforms Abdi et al. [46], who reported a 90.32% level of accuracy when using mRMR and PSO, followed by SVM. The approach described in the current paper also outperformed that of Shutao et al. [48], who achieved an accuracy of 91.9% using a PSO+GA hybrid method followed by SVM, and that of Al Akadi et al. [52], who reported a classification accuracy of 85.48% using mRMR+GA followed by SVM.
One difference from these previous studies is that they selected fewer genes than the present study: Abdi et al. [46] used 10.3 genes, whilst Shutao et al. [48] and Al Akadi et al. [52] used 18 and 40 genes, respectively. Nevertheless, the classification accuracies they reported (91.90% for [48] and 85.48% for [52]) did not match that measured in the current study.
Another comparison was conducted with studies that used the same dataset 2; it clearly indicated that our proposed method (100% accuracy) performed better than that of Rathore et al. [8], who achieved 97.2%, and is comparable with, or better than, the approaches of Al Snousy et al. [80], who achieved 97%-100%. To confirm the comparative performance of the approach used in the current study, classification accuracy was 94% and 100% for datasets 1 and 2, respectively. The outcome of the research, therefore, is that although some previous work used fewer genes, the approach described in this paper yielded better classification accuracy. This is a result of the strategy of eliminating all but the most informative and relevant genes.

It is noticeable that most methods in the literature achieved high classification accuracy by applying the ML algorithms to a limited, preselected subset of genes prior to classification, whereas in our case no gene was excluded from the outset. Our method nevertheless achieved a higher level of performance.
A confusion matrix records True Positives (TP), which are the number of successfully identified positive samples, True Negatives (TN), which are the number of correctly identified negative samples, False Positives (FP), the samples erroneously diagnosed as being positive, and False Negatives (FN), those positive samples wrongly diagnosed as negative. An overall measure of classification efficiency is derived from this matrix, expressed as the percentage of correct diagnoses from the entire population of observations. According to Bolón-Canedo et al. [29], a sensitivity analysis measures the proportion of True Positives (TP), which, in practical terms, refers to the percentage of patients correctly diagnosed with cancer, whilst a specificity analysis refers to performance in identifying True Negatives (TN).
To achieve total predictive accuracy, an algorithm needs to perform with both 100% sensitivity and 100% specificity. The overall measure of performance combining correct positive and negative diagnoses is known as 'accuracy', with the parameters TP, TN, FP and FN being used to calculate all of these measures.
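As an illustration only (this sketch is my own, not code from the study), the function below derives these three measures from confusion-matrix counts; the counts in the usage example are those implied by the dataset 1 results for K-NN and DT reported in Table 8.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive standard performance measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,  # fraction of correct diagnoses
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Counts consistent with the dataset 1 results for K-NN/DT (Table 8):
m = classification_metrics(tp=4, tn=11, fp=1, fn=0)
print(m)  # accuracy = 15/16 = 0.9375, sensitivity = 1.0, specificity = 11/12
```

These values reproduce the roughly 94% accuracy, 100% sensitivity and 91.7% specificity reported for those classifiers.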
One test for the level of performance of an approach to solving binary problems is Matthews' Correlation Coefficient (MCC), which yields values ranging from -1 to 1, where 1 describes a perfect classification performance and -1 indicates 100% error. An MCC value of zero is used to represent random prediction. The coefficient can be computed using the following equation:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

An alternative evaluation method is the Receiver Operating Characteristic (ROC) Area. This method uses a two-dimensional graph to illustrate classification outcomes in relation to defined thresholds. The False Positive Rate (FPR, equal to 1 − specificity) is represented on the x-axis, whilst the y-axis shows the True Positive Rate (TPR, equal to sensitivity). The optimum plotted position on this graph is the coordinate (0, 1), which is called the 'best classification' or 'perfect classification', since it indicates both perfect sensitivity and perfect specificity.
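The coefficient can be computed directly from the four confusion-matrix counts; a minimal sketch (an illustration, not taken from the study) is:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews' Correlation Coefficient: 1 = perfect, -1 = total error."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 (random prediction) when a marginal count is zero
    return num / den if den else 0.0

print(mcc(10, 10, 0, 0))  # 1.0: perfect classification
print(mcc(0, 0, 10, 10))  # -1.0: every sample misclassified
```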

Analysis of evaluation results
The confusion matrices used in the present study to assess the classification performance of the different approaches are shown in Tables 6 and 7. These data reveal that K-NN and DT performed best in classifying the 62 samples of dataset 1, while NB performed best in classifying the 36 samples of dataset 2. In dataset 1, these approaches accurately identified four positive and eleven negative samples, with the sole error being a negative sample identified as positive (consistent with the sensitivity and specificity reported in Table 8). In dataset 2, NB accurately identified all 18 positive and 18 negative samples.
The TPR and FPR data can be plotted graphically as a ROC curve, as shown in Fig 6 for dataset 1 and Fig 7 for dataset 2, with the FPR and TPR shown on the x-axis and y-axis, respectively. The four approaches are plotted in the two-dimensional space of Fig 6, which clearly indicates that DT and K-NN performed best, with SVM yielding the worst performance for dataset 1. Fig 7, on the other hand, clearly indicates that NB performed best for dataset 2, with DT yielding the lowest, though still reasonable, performance.
The performance of the algorithms in terms of accuracy, sensitivity, specificity and Matthews' Correlation Coefficient is shown in Table 8. Although the sample of genes eventually analysed was relatively small in both datasets, specificity and sensitivity varied considerably, from 75% to 100%. In dataset 1, K-NN and DT performed best, both achieving 100% sensitivity and 91.7% specificity. In contrast, SVM performed poorly, with 75% sensitivity and 83.3% specificity. In dataset 2, NB performed best, with 100% sensitivity and specificity.
In summary, the proposed two-stage multifilter model outperforms previously reported models in prediction accuracy and in the number of genes selected (shown within parentheses in Table 9): for dataset 1 it achieves 94.0% with 22 genes, and for dataset 2 it achieves 100% with 35 genes. F-Score-Majority Voting [8] also performs well, achieving 97% accuracy with 95 genes.
An issue that might be investigated in the future is the impact of different parameter settings on the classification performance of the various algorithms. The same method could also be applied with other machine learning algorithms and extended to include other genetic datasets.

Conclusions and future work
The present study has proposed a two-stage hybrid multifilter data mining approach to feature selection, which has been shown to improve the diagnosis of colon cancer. The key improvement provided by the proposed approach was better classification of genes and accuracy of diagnosis. This was achieved through a decrease in the number of features considered in the analysis.

Achievements of the research
The proposed two-stage model delivered the following improvements: Stage 1: The number of features used in the analysis was reduced by nearly 99% for both datasets included in this study, compared with the feature population at the beginning of the analysis. This was achieved by initially applying the (IG+GA) selection approach.
Stage 2: During this stage, the approach reduced the number of genes used in the analysis to 22 for dataset 1 from an initial sample size of 2,000, and to 35 for dataset 2 from an initial sample size of 6,597. Furthermore, the amount of 'noise' in the data was lessened, and genes with little or no relevance were eliminated. The approach also yielded enhanced levels of accuracy and displayed greater efficiency. The greatest classification accuracy was achieved by the K-NN and DT algorithms (94%) for dataset 1 and by the NB algorithm (100%) for dataset 2.
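To make the two-stage structure concrete, the sketch below is a simplified stand-in (my own illustration, not the authors' implementation): stage 1 ranks genes by mutual information with the class label, serving as a proxy for the IG+GA preselection (the GA search is omitted), and stage 2 applies a greedy mRMR pass that trades relevance against redundancy before classification.

```python
import numpy as np

def mutual_information(x, y):
    """Discrete mutual information (bits) between two integer-coded vectors."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def stage1_ig_rank(X, y, k):
    """Stage 1 stand-in: keep the k genes with the highest MI with the label."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:k])

def stage2_mrmr(X, y, candidates, k):
    """Stage 2: greedy mRMR - maximise relevance to y, minimise redundancy."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(j):
            rel = mutual_information(X[:, j], y)
            red = (np.mean([mutual_information(X[:, j], X[:, s]) for s in selected])
                   if selected else 0.0)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy demo: 40 samples x 20 binary 'genes'; gene 3 is a copy of the label
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.integers(0, 2, size=(40, 20))
X[:, 3] = y
keep = stage1_ig_rank(X, y, 8)       # stage 1 preselection
genes = stage2_mrmr(X, y, keep, 4)   # stage 2 refinement
print(genes[0])  # 3: the perfectly informative gene is selected first
```

The reduced gene subset would then be passed to a classifier (e.g. K-NN, DT or NB), mirroring the preselection-then-classification strategy described above.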
The key outcome of the study is that the implementation of a feature selection procedure prior to the application of a classification algorithm provides more accurate predictions and diagnoses. The use of a hybrid multifilter process substantially reduced the number of features included in the dataset.

Future research
There are a number of challenges for further research to take on. Currently, we plan to focus on investigating: