
A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information

  • Fei Han ,

    hanfei@mail.ujs.edu.cn

    Affiliation School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China

  • Wei Sun,

    Affiliation School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China

  • Qing-Hua Ling

    Affiliations School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China, School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, China


Abstract

To obtain predictive genes with lower redundancy and better interpretability, a hybrid gene selection method encoding prior information is proposed in this paper. To begin with, the prior information, referred to as gene-to-class sensitivity (GCS), of all genes from microarray data is exploited by a single hidden layered feedforward neural network (SLFN). Then, to select more representative and less redundant genes, all genes are grouped into clusters by the K-means method, and some low-sensitivity genes are filtered out according to their GCS values. Finally, a modified binary particle swarm optimization (BPSO) encoding the GCS information is proposed to perform further gene selection from the remaining genes. By incorporating the GCS information, the proposed method selects those genes highly correlated to sample classes. Thus, the low-redundancy gene subsets obtained by the proposed method also help to improve classification accuracy on microarray data. Experimental results on several open microarray datasets verify the effectiveness and efficiency of the proposed approach.

Introduction

One of the major applications of microarray data analysis is to perform sample classification between different disease phenotypes, for diagnostic and prognostic purposes [1]–[5]. The classification involves a wide range of algorithms such as differential gene expression analyses, clustering analyses and supervised learning [6]. Gene selection is one of the critical steps in the classification of microarray data [7], [8]. Selecting a useful gene subset not only decreases the computational complexity, but also increases the classification accuracy.

The methods for gene selection are broadly divided into three categories: filter, wrapper and embedded methods [9]. A filter method relies on general characteristics of the training data to select genes without involving any classifier for evaluation. Most filter methods consider each feature separately, ignoring feature dependencies, which may lead to worse classification performance compared to other types of feature selection methods [10]. Besides considering feature dependencies, wrapper methods take into account the interaction between feature subset search and model selection. However, wrapper methods have a higher risk of overfitting than filter ones and are very computationally intensive [11]. Embedded methods have the advantage that they include the interaction with the classification model, while being far less computationally intensive than wrapper methods [12].

In recent years, many methods combined with population-based stochastic optimization techniques such as the genetic algorithm (GA) [13] and particle swarm optimization (PSO) [14], [15] have been used increasingly as effective techniques for microarray data analyses [16]–[22]. In [19], [20], binary PSO (BPSO) [23] combined with a filter method was applied to search for optimal gene subsets. The method in [19] simplified gene selection and obtained higher classification accuracy than some similar GA-based gene selection methods, while the method in [20] could determine the appropriate number of genes and obtained high classification accuracy with a support vector machine. In [21], an approach combining GA with the K-nearest neighbor (KNN) method was proposed to identify genes that could jointly discriminate between different classes of samples. The GA/KNN approach could capture the correlated structure in the data and is highly repeatable in independent runs [21]. A combination of Integer-Coded GA (ICGA) and particle swarm optimization, coupled with the extreme learning machine (ELM), was used to select an optimal set of genes [22]. These hybrid methods were capable of selecting compact subsets of predictive genes for sample classification. However, they considered only the features' relevance, evaluating their utility for achieving accurate prediction or exploiting data variance and distribution, and the selected genes were usually hard to interpret.

Similar to GA, PSO searches for optima by updating the population over generations. Unlike GA, PSO has no evolution operators such as crossover and mutation, and it is easy to implement with few parameters to adjust. The binary PSO (BPSO) algorithm is a binary version of PSO, suitable for discrete optimization problems [23].

In the gene selection process, various classification algorithms such as backpropagation (BP) [24], KNN [25] and the support vector machine (SVM) [26] have been used to evaluate the prediction ability of gene subsets, which may result in high computational cost in wrapper and embedded methods. The convergence performance, especially the convergence rate, of the classification algorithm used to evaluate candidate gene subsets is a significant factor in the gene selection process. ELM, developed recently for single hidden layered feedforward neural networks (SLFNs) [27], has good generalization performance with a fast training procedure. In ELM, the input weights are chosen randomly and the output weights are calculated analytically. Normally, ELM's performance is superior to that of other classifiers such as gradient-based learning algorithms and SVM for problems with larger sample sets and a smaller number of features [22]. Therefore, ELM is a reasonable choice for evaluating candidate gene subsets.

In the Kmeans-PSO-ELM method [28], we used the K-means method to group the initial gene pool into several clusters, and standard PSO combined with ELM was used to perform gene selection, which could obtain a compact set of informative genes. Based on the work in [28], prior information referred to as gene-to-class sensitivity (GCS) is incorporated into the gene selection process in this paper to select the smallest possible set of predictive genes with high interpretability. The GCS information is exploited by a SLFN from microarray data. The GCS information, indicating whether a gene is sensitive to the sample classes, helps to select those genes significantly correlated to the sample classes. To identify relevant genes effectively for subsequent analyses such as sample classification, we partition all genes into different clusters with the K-means method, and filter out some clusters and genes with low GCS values. However, some of the remaining genes, especially those in the same cluster, may have high similarity, so there is probably substantial redundancy among them. To identify small sets of genes that could be used for diagnostic purposes, a modified binary PSO encoding GCS information, combined with ELM, is used to select the smallest possible gene subsets. The modified BPSO can select representative genes from the remaining clusters to form the optimal gene subset. Moreover, the hybrid method reduces computational cost by using ELM to evaluate the candidate gene subsets.

Methods

K-means Clustering Method

Clustering is a search for hidden patterns that may exist in datasets. It is a process of grouping data objects into disjoint clusters so that the data in each cluster are similar to each other, yet dissimilar to the data in other clusters [29]. There are two kinds of clustering algorithms: hierarchical clustering and partitioned clustering. Different from hierarchical approaches, partitioned clustering divides the input data into a pre-specified number of clusters by minimizing a certain cost-function [29].

K-means, a typical partitioned clustering method, is simple and generally very fast [29]. It initializes a pre-specified number of centers with initial values called seed-points. Then, K-means computes the squared distances between the input data points and the centers, and assigns each data point to its nearest center. Each cluster is represented by an adaptively-changing center. The above process is repeated until the center positions no longer change. The standard K-means algorithm minimizes the square-error cost-function

J = sum_{i=1}^{Nc} sum_{p in Ci} ||p - mi||^2,    (1)

where p is an input data point in the cluster Ci, i = 1,2,…,Nc, and mi is the center of cluster Ci (the mean of all data points in the cluster Ci).
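The assign-then-recompute loop described above can be sketched as follows; this is a minimal 1-D illustration in plain Python (the `kmeans` helper and its random seeding are illustrative, not the paper's implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means on 1-D data: assign each point to its nearest
    centre, then recompute each centre as the mean of its cluster,
    repeating until the centres no longer move."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)              # seed-points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centres[i]) ** 2)
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centres[i]
               for i, c in enumerate(clusters)]
        if new == centres:                       # converged
            break
        centres = new
    return centres, clusters
```

On well-separated data the loop converges in a handful of iterations, recovering one centre per group.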

Particle Swarm Optimization

PSO is an evolutionary computing technique that searches for the best solution by simulating the movement of flocking birds [14]. The population is called the swarm, and its members are called particles. Each particle represents a possible solution to the problem. During optimization, each particle flies independently in its own direction, guided by its individual historical best position (pbest) as well as the global best position of all particles (gbest). Suppose the dimension of the search space is D and the swarm is S = (X1,X2,X3,…,Xns); each particle stands for a position in D-dimensional space; the position of the i-th particle is denoted as Xi = (xi1, xi2,…, xiD), i = 1, 2, …, ns, where ns is the swarm size. The individual historical best position of the i-th particle is Pi = (pi1,pi2,…,piD), the global best position of all particles is Pg = (pg1,pg2,…,pgD), and the velocity of the i-th particle is Vi = (vi1,vi2,…,viD). According to [15], the basic PSO is described as

vij(t+1) = w*vij(t) + c1*rand()*(pij - xij(t)) + c2*rand()*(pgj - xij(t)),    (2)
xij(t+1) = xij(t) + vij(t+1),    (3)

where c1, c2 are the acceleration constants with positive values; rand() is a random number ranging from 0 to 1; w is the inertia weight.
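Eqs. (2)–(3) translate directly into code; the sketch below applies one update to a single particle (the parameter defaults are illustrative):

```python
import random

def pso_step(x, v, pbest, gbest, w=0.9, c1=2.0, c2=2.0, rng=random):
    """One basic PSO update for a single particle (Eqs. (2)-(3)):
    v <- w*v + c1*rand()*(pbest - x) + c2*rand()*(gbest - x); x <- x + v."""
    new_v = [w * vj
             + c1 * rng.random() * (pj - xj)
             + c2 * rng.random() * (gj - xj)
             for xj, vj, pj, gj in zip(x, v, pbest, gbest)]
    new_x = [xj + vj for xj, vj in zip(x, new_v)]
    return new_x, new_v
```

Note that a particle sitting at both its pbest and gbest with zero velocity stays put, which is what the update equations imply.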

The basic PSO algorithm is usually applied to problems whose solution elements are continuous real numbers, whereas BPSO is more suitable for discrete optimization [23]. In BPSO, the velocity update formula remains unchanged as shown in Eq.(2), while the new position of the particle is calculated according to

xij(t+1) = 1 if rand() < S(vij(t+1)), and 0 otherwise,    (4)

where S(v) = 1/(1 + e^(-v)) is the sigmoid function.
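The BPSO position rule of Eq. (4) can be sketched as follows (the velocity vector is assumed to have already been updated by Eq. (2)):

```python
import math
import random

def sigmoid(v):
    """S(v) = 1 / (1 + e^(-v)), mapping a velocity to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def bpso_position(v, rng=random):
    """BPSO position rule (Eq. (4)): each bit is set to 1 with
    probability S(v_ij), i.e. when rand() < sigmoid(velocity)."""
    return [1 if rng.random() < sigmoid(vj) else 0 for vj in v]
```

A strongly positive velocity component makes its bit very likely to be 1, and a strongly negative one very likely to be 0.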

Extreme Learning Machine

In [30], a learning algorithm for SLFNs called the extreme learning machine (ELM) was proposed to overcome the problems caused by gradient-based learning algorithms. ELM randomly chooses the input weights and hidden biases, and analytically determines the output weights of the SLFN. ELM achieves much better generalization performance with much faster learning speed than gradient-based algorithms [31], [32].

For N arbitrary distinct samples (XXi, Ti) (i = 1,2,…,N), where XXi = [xxi1, xxi2, …,xxin] ∈ R^n and Ti = [ti1, ti2, …,tim] ∈ R^m, a SLFN with NH hidden neurons and activation function g(·) can approximate these N samples with zero error. This means that

sum_{i=1}^{NH} woi * g(whi · XXj + bi) = Tj,  j = 1,…,N,    (5)

where whi = [whi1, whi2, …,whin]^T is the input weight vector connecting the i-th hidden neuron and the input neurons, woi = [woi1, woi2, …,woim] is the output weight vector connecting the i-th hidden neuron and the output neurons, and bi is the bias of the i-th hidden neuron.

In the course of learning, first, the input weights and the hidden biases are arbitrarily chosen and need not be adjusted at all. Second, the smallest norm least-squares solution of Eq. (5) is obtained as

Wo = H† T,    (6)

where H is the hidden layer output matrix of the SLFN and H† is its Moore-Penrose generalized inverse. It was concluded that ELM attains the minimum training error and the smallest norm of weights [31], [32], and the smallest norm of weights tends to give the best generalization performance [31], [32]. Since the solution is obtained by an analytical method and none of the parameters of the SLFN need to be adjusted iteratively, ELM converges much faster than gradient-based algorithms.
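The whole ELM training procedure (random hidden layer, then the analytical solution of Eq. (6)) can be sketched with NumPy; the helper names `elm_train`/`elm_predict` are illustrative, not the authors' code:

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """ELM: random input weights and biases, logistic hidden layer,
    output weights solved analytically via the Moore-Penrose
    pseudo-inverse (Eq. (6)): Wo = pinv(H) @ T."""
    rng = np.random.default_rng(seed)
    Wh = rng.uniform(-1, 1, (X.shape[1], n_hidden))  # input weights, fixed
    b = rng.uniform(-1, 1, n_hidden)                 # hidden biases, fixed
    H = 1.0 / (1.0 + np.exp(-(X @ Wh + b)))          # hidden layer output matrix
    Wo = np.linalg.pinv(H) @ T                       # smallest-norm LS solution
    return Wh, b, Wo

def elm_predict(X, Wh, b, Wo):
    """Forward pass: logistic hidden layer, linear output layer."""
    H = 1.0 / (1.0 + np.exp(-(X @ Wh + b)))
    return H @ Wo
```

With at least as many hidden neurons as training samples, the pseudo-inverse solution fits the training targets essentially exactly, matching the zero-error claim of Eq. (5).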

The Proposed Hybrid Method

Gene-to-class sensitivity information.

To get a better understanding of gene-to-class sensitivity (GCS) information, the input-to-output sensitivity of a SLFN should be given first. Let whki be the connection weight from the i-th input node to the k-th hidden neuron, and wokj the connection weight from the k-th hidden neuron to the j-th output neuron. xxi is the i-th input component and Oj is the j-th output component of the SLFN. The i-th input to j-th output sensitivity [33], [34] is defined as

Sij = (1/Nsam) * sum_{r=1}^{Nsam} |∂Oj/∂xxi|_{XXr},    (7)

where Nsam is the number of samples. For simplicity, the hidden activation function of the SLFN is the logsig function, and the output neurons are linear. The derivative obtained from the r-th sample, XXr, can then be calculated as

∂Oj/∂xxi |_{XXr} = sum_{k=1}^{NH} wokj * hk(XXr) * (1 - hk(XXr)) * whki,    (8)

where hk(XXr) is the output of the k-th hidden neuron for sample XXr. When Oj is the value of the j-th class and xxi represents the expression level of the i-th gene, Sij in Eq.(7) can be considered as the sensitivity of the i-th gene to the j-th class. The larger the Sij value, the more sensitive the i-th gene is to the j-th class. The GCS value for the k-th gene is normalized as

GCS(k) = sum_{j=1}^{NO} Skj / sum_{l=1}^{Ngn1} sum_{j=1}^{NO} Slj,    (9)

where NO and Ngn1 are the number of output neurons in the SLFN and the number of genes in the first-level initial gene pool, respectively.
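Eqs. (7)–(8) can be sketched in plain Python for a logsig-hidden, linear-output SLFN; the final max-normalisation below is a stand-in assumption for the exact normalisation of Eq. (9):

```python
import math

def logsig(z):
    return 1.0 / (1.0 + math.exp(-z))

def gene_class_sensitivity(samples, Wh, b, Wo):
    """Mean absolute input-to-output derivative over all samples (Eq. (7)).
    For a logsig hidden layer with linear outputs,
    dO_j/dx_i = sum_k Wo[k][j] * h_k * (1 - h_k) * Wh[k][i]   (Eq. (8)).
    Wh: per-hidden-neuron input weights; Wo: per-hidden-neuron output weights."""
    n_in, n_out = len(Wh[0]), len(Wo[0])
    S = [[0.0] * n_out for _ in range(n_in)]
    for x in samples:
        h = [logsig(sum(w * xi for w, xi in zip(row, x)) + bk)
             for row, bk in zip(Wh, b)]
        for i in range(n_in):
            for j in range(n_out):
                d = sum(Wo[k][j] * h[k] * (1 - h[k]) * Wh[k][i]
                        for k in range(len(h)))
                S[i][j] += abs(d) / len(samples)
    # Assumption: sum over classes, then scale so the largest GCS is 1
    raw = [sum(row) for row in S]
    m = max(raw)
    return [r / m for r in raw]
```

In a toy network where only the first input is wired to the hidden layer, only the first gene receives a non-zero sensitivity.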

In this study, a SLFN is trained on the training dataset by ELM, and the input and output weights of the SLFN are thus determined. According to Eqs. (7)–(9), the GCS values of all genes are then easily obtained. To obtain a more accurate GCS value for each gene, the above operation is repeated 50 times and the average GCS value of each gene is calculated.

The Proposed Gene Selection Method

In this paper, gene selection consists of two aspects: identifying relevant genes from each cluster, and selecting the smallest possible subsets from the relevant genes. To obtain compact and interpretable gene subsets, GCS information (GCSI) exploited from microarray data is considered throughout the gene selection process. The overall framework of the proposed method is shown in Figure 1.

To identify relevant genes for subsequent sample classification, firstly, the GCS values of all genes are calculated according to Eqs.(7)–(9), and all genes are clustered by the K-means algorithm. Since genes in the same cluster may have similar functions, several genes may serve as equivalents for the subsequent sample classification. To reduce redundancy, only the representative genes among these equivalent genes are selected for further processing. Then, if the maximum GCS value of all genes in a cluster is far lower than those of the other clusters, the cluster is filtered out. Moreover, if the GCS value of a gene in a remaining cluster is lower than the mean GCS value of all genes in that cluster, the gene is also filtered out. Thus, the remaining genes in the remaining clusters have comparatively high GCS values, and these genes, relevant to sample classification, are kept for further gene selection.

Although the remaining genes are relevant to the data classes, there is still considerable redundancy among them. In this study, a modified BPSO encoding GCS information is proposed to select the most compact gene subsets from the remaining genes. The detailed steps are described as follows.

Step 1: Form a first-level initial gene pool. The dataset is divided into training and testing datasets. Select 200∼400 genes from all original genes by using the information index to classification (IIC) method [35] on the training data. Since the original IIC method is designed for two-class microarray data, we extend it to multi-class microarray data as follows:

d(g) = sum_{j=1}^{c-1} sum_{k=j+1}^{c} |mu_j(g) - mu_k(g)| / (sigma_j(g) + sigma_k(g)),    (10)

where mu_j(g) and mu_k(g) are the means of the expression value of the gene g in the j-th and k-th classes, respectively, sigma_j(g) and sigma_k(g) are the corresponding standard deviations, and c is the total number of classes. From [35], the higher the value of d(g), the more classification information the gene g contains, so the gene g is more relevant to sample categories. A classifier is likely to obtain high classification accuracy if the microarray data are projected onto a gene g whose IIC value, d(g), is high. The genes are ranked by their IIC values on the training dataset, and those genes with higher IIC values are chosen to establish the first-level gene pool.
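Under the reading that the multi-class IIC of Eq. (10) sums the two-class index |mu_j - mu_k| / (sigma_j + sigma_k) over all class pairs (consistent with the definitions in the text), a sketch for one gene is:

```python
from itertools import combinations
from statistics import mean, pstdev

def iic(expr_by_class):
    """Multi-class IIC of one gene: sum over class pairs (j, k) of
    |mu_j - mu_k| / (sigma_j + sigma_k).  Larger values mean the gene
    carries more class-discriminative information.
    expr_by_class: one list of expression values per class."""
    stats = [(mean(v), pstdev(v)) for v in expr_by_class]
    return sum(abs(m1 - m2) / (s1 + s2)
               for (m1, s1), (m2, s2) in combinations(stats, 2))
```

A gene whose per-class means are far apart relative to their spreads scores high; a gene distributed identically in every class scores zero.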

Step 2: Establish a second-level initial gene pool based on the first-level initial gene pool. Cluster the genes in the first-level initial gene pool into a predetermined number of groups with the K-means method; the number of clusters is determined by trial and error. Delete those clusters whose maximum GCS values are much smaller than those of the other clusters. Moreover, in each remaining cluster, remove the genes whose GCS values are smaller than the mean GCS value of all genes in that cluster. All the remaining genes in the remaining clusters form the second-level initial gene pool for further gene selection.
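Step 2's two-stage filtering can be sketched as follows; the `drop_ratio` threshold deciding when a cluster's best GCS is "much smaller" than the others is an assumption, since no numeric cutoff is stated in this passage:

```python
def filter_clusters(clusters, gcs, drop_ratio=0.5):
    """Two-stage filtering: drop a whole cluster when its maximum GCS
    falls below drop_ratio of the best GCS overall (assumed cutoff);
    inside each kept cluster, drop genes below the cluster-mean GCS.
    clusters: lists of gene indices; gcs: gene index -> GCS value."""
    best = max(max(gcs[g] for g in c) for c in clusters)
    kept = []
    for c in clusters:
        if max(gcs[g] for g in c) < drop_ratio * best:
            continue                      # low-sensitivity cluster removed
        avg = sum(gcs[g] for g in c) / len(c)
        kept.append([g for g in c if gcs[g] >= avg])
    return kept
```

The survivors form the second-level initial gene pool, so the downstream BPSO search works over far fewer genes.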

Step 3: Use BPSO to select the optimal gene subsets from the second-level initial gene pool. To improve the search ability of the swarm, GCS information is encoded into BPSO for further gene selection.

Firstly, initialize a population of particles with random positions and velocities in the search space. The i-th particle Xi = (xi1, xi2, …,xiD) represents a candidate gene subset, and the element xij ∈ {0,1} (1≤j≤D) indicates whether the j-th gene is selected. The dimension of a particle is equal to the number of genes in the second-level initial gene pool. Since the number of remaining clusters is comparatively small and each cluster has its own representative genes, at least one gene should be selected from each cluster when initializing each particle.
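The initialization constraint (at least one gene per cluster) can be sketched as follows; `p_select` is an illustrative initial selection probability, not a value from the paper:

```python
import random

def init_particle(clusters, p_select=0.1, rng=random):
    """Initialise one BPSO particle over the second-level gene pool:
    each bit is switched on with probability p_select, then at least
    one gene from every cluster is forced on, as the method requires.
    clusters: lists of bit indices into the gene pool."""
    n = sum(len(c) for c in clusters)
    x = [1 if rng.random() < p_select else 0 for _ in range(n)]
    for c in clusters:
        if not any(x[g] for g in c):      # cluster not represented yet
            x[rng.choice(c)] = 1
    return x
```

Even with `p_select=0`, every cluster still contributes exactly one gene to the initial subset.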

Secondly, update the position of each particle. To encode GCS information, two modified update equations for the particles, Eqs. (11) and (12), based on the discrete PSO in [20] are proposed, where GCS(j) is the GCS value of the j-th gene, and Ngn2 is the number of genes in the second-level initial gene pool. vij(t+1) on the right-hand side of Eq.(11) is calculated by Eq.(2). The parameter a is a fixed number, set to 0.5 according to [20]. The subscripts i and j denote the i-th particle and its j-th component. pij is the j-th component of the historical best position of the i-th particle, and pgj is the j-th component of the global best position of all particles. These equations make genes with high GCS values more likely to be selected.

Thirdly, evaluate the fitness value of each particle, and update the historical best position of each particle and the global best position of all particles. To select the smallest possible predictive gene subsets, the fitness function is defined by Eq. (13). According to [36], to correct for the selection bias in the gene selection process, we perform 5-fold cross validation (CV) to evaluate the selected genes. The accuracy(i) in Eq. (13) is the 5-fold cross validation accuracy on the training data obtained by ELM with the candidate genes denoted by the i-th particle, and GenesNumber(i) is the number of selected genes denoted by the i-th particle.
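A sketch of a fitness of the kind described, trading CV accuracy against subset size, is below; the linear weighting `w` is an assumption, not the exact form of Eq. (13):

```python
def fitness(cv_accuracy, n_selected, n_total, w=0.9):
    """Hypothetical fitness in the spirit of Eq. (13): reward 5-fold
    CV accuracy and penalise subset size, so that of two equally
    accurate particles the one selecting fewer genes scores higher.
    The linear trade-off weight w is an assumption."""
    return w * cv_accuracy + (1 - w) * (1 - n_selected / n_total)
```

Under this form, accuracy dominates, while the size term breaks ties in favour of smaller subsets.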

Finally, if the fitness value reaches the threshold value, or the maximal number of iterations is reached, the particle with the best fitness value is output, giving the optimal gene subset.

For a wrapper or embedded gene selection method, the main computational cost is the time spent evaluating candidate gene subsets with the classifier. In the proposed method, the classifier used to evaluate candidate gene subsets is ELM, which is trained by an analytical method without iterations. In gene selection methods that use SVM or the BP algorithm to evaluate candidate gene subsets, the classifier is trained over thousands of iterations and the training process is very time consuming. Therefore, the computational time of the proposed method is much less than that of methods using SVM or BP to evaluate candidate gene subsets. Moreover, owing to the good generalization performance of ELM, the final gene subset evaluated by ELM has high predictive ability. The proposed method combines the K-means method with the modified BPSO, GCS information and ELM, so it is referred to as KMeans-GCSI-MBPSO-ELM.

Results and Discussion

Datasets

To verify the effectiveness and efficiency of the proposed gene selection method, we conduct experiments on six open microarray datasets including Leukemia, Colon, SRBCT, LUNG, Brain cancer and Lymphoma data. The detailed description of the datasets is listed in Table 1.

The Leukemia data [37] contains 72 samples in total, in two classes: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), with 47 and 25 samples, respectively. Every sample contains 7129 gene expression values. The Leukemia data are available at http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi.

The Colon data consists of the expression levels of 62 samples, of which 40 are colon cancer samples and the remaining are normal samples. Although expression levels were originally measured for 6,000 genes, 4,000 of them were removed owing to the low reliability of their measured values. The measured expression values of the remaining 2,000 genes are publicly available at http://microarray.princeton.edu/oncology/.

The entire SRBCT data [33] includes the expression data of 2308 genes. There are 63 training samples and 25 testing samples, five of the testing samples being non-SRBCT. The 63 training samples contain 23 Ewing family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 neuroblastoma (NB), and 8 Burkitt lymphomas (BL). The remaining 20 testing samples contain 6 EWS, 5 RMS, 6 NB, and 3 BL. The data are available at http://www.biomedcentral.com/content/supplementary/1471-2105-7-228-S4.tgz.

The LUNG data [38], [39] contains 203 samples in total, in five classes: adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6 and 17 samples, respectively. Each sample has 12600 genes. The genes with standard deviations smaller than 50 expression units were removed, yielding a dataset with 203 samples and 3312 genes [38], [39]. The data are also available at http://www.biomedcentral.com/content/supplementary/1471-2105-7-228-S4.tgz.

The Brain cancer data contains 60 samples in two classes: 46 patients with classic and 14 patients with desmoplastic brain cancer. The Lymphoma data includes 58 samples, of which 32 patients were cured and 26 were not. Each sample in the Brain cancer and Lymphoma data has 7129 genes. These two datasets are available at http://linus.nci.nih.gov/~brb/DataArchive_New.html.

Since there is no guidance on how to select the population size and maximum iteration number in PSO, we determine these parameters within cross-validation runs on the validation dataset. When using the modified BPSO to select optimal genes, the population size is 100 on the SRBCT data and 50 on the other five datasets; the maximum iteration number is 20 on the SRBCT, LUNG and Brain cancer data, and 40 on the Leukemia, Colon and Lymphoma data. According to [15], a recommended choice for the acceleration constants c1 and c2 is 2, and a good schedule decreases the inertia weight, w, from 1.4 to 0.5. In this study, based on the conclusions in [15] and the cross-validation runs on the validation dataset, the initial and final inertia weights are set as 1.2 and 0.4, respectively, on all datasets, and the acceleration constants c1 and c2 are both set to 1.6 on all datasets.

Reduce Redundant Genes

In the first-level initial gene pool, some clusters with low maximum GCS values are deleted. In the experiments, Cluster 3 is removed on the Leukemia data, Clusters 2 and 6 are deleted on the Colon data, Clusters 1, 2 and 3 are deleted on the SRBCT data, Clusters 3 and 8 are filtered out on the LUNG data, Cluster 6 is deleted on the Brain cancer data, and Cluster 3 is filtered out on the Lymphoma data. Moreover, the genes in a remaining cluster whose GCS values are lower than the mean GCS value of that cluster are removed. Figure 2 shows the GCS values of the genes in the remaining clusters on all datasets. From Figure 2, over half the genes in the remaining clusters are filtered out. The fewer the reserved genes, the lower the computational cost of further gene selection.

Figure 2. The GCS values of genes in the remainder clusters (The character ‘C’ along the X axis is the abbreviation for ‘Cluster’) (A) Leukemia (B) Colon (C) SRBCT (D) LUNG (E) Brain cancer (F) Lymphoma.

https://doi.org/10.1371/journal.pone.0097530.g002

The Classification Ability of the Selected Gene Subsets

To verify the classification ability of the selected gene subsets, ELM is used to perform sample classification with these selected gene subsets on the six datasets. Each experiment is conducted 100 times, and the mean results are listed in Table 2.

Table 2. The classification accuracies with different gene subsets by ELM on the six data.

https://doi.org/10.1371/journal.pone.0097530.t002

From Table 2, with the three genes selected by the proposed approach, ELM obtains 100% testing accuracy and 5-fold CV testing accuracy on the Leukemia data. ELM obtains 100% testing accuracy and 5-fold CV testing accuracy with all selected gene subsets on the SRBCT data. With about ten genes selected by the KMeans-GCSI-MBPSO-ELM method, ELM obtains high prediction accuracies on the other four datasets. These results indicate that the KMeans-GCSI-MBPSO-ELM method is able to select highly predictive genes closely related to sample classes.

The GCS values of all reserved genes listed in Figure 2 are shown in Figure 3. It can be seen that the KMeans-GCSI-MBPSO-ELM method does not always select the genes with the highest GCS values; it also selects critical genes with comparatively low GCS values. This is mainly because the modified BPSO considers not only the GCS values of the selected genes but also the classification ability of each gene subset.

Figure 3. The GCS values for all reserved genes (A) Leukemia (B) Colon (C) SRBCT (D) LUNG (E) Brain cancer (F) Lymphoma.

https://doi.org/10.1371/journal.pone.0097530.g003

Biological and Functional Analysis of the Selected Gene Subsets

The experiment on each microarray dataset is conducted 1000 times, and the top ten most frequently selected genes are listed in Tables 3–8 for the six datasets.

Table 3. The top ten frequently selected genes with the proposed method on the Leukemia data.

https://doi.org/10.1371/journal.pone.0097530.t003

Table 4. The top ten frequently selected genes with the proposed method on the Colon data.

https://doi.org/10.1371/journal.pone.0097530.t004

Table 5. The top ten frequently selected genes with the proposed method on the SRBCT data.

https://doi.org/10.1371/journal.pone.0097530.t005

Table 6. The top ten frequently selected genes with the proposed method on the LUNG data.

https://doi.org/10.1371/journal.pone.0097530.t006

Table 7. The top ten frequently selected genes with the proposed method on the Brain cancer data.

https://doi.org/10.1371/journal.pone.0097530.t007

Table 8. The top ten frequently selected genes with the proposed method on the Lymphoma data.

https://doi.org/10.1371/journal.pone.0097530.t008

From Table 3, all genes except X03934, Z69881 and J03473 were also selected by the methods proposed in [40] and [37]. Gene Z69881 was also selected by the method proposed in [37]. Gene U05259, a B lymphocyte antigen receptor, encodes cell surface proteins for which monoclonal antibodies have been demonstrated to be useful in distinguishing lymphoid from myeloid lineage cells [40]. Gene M63138 is a member of the peptidase C1 family involved in the pathogenesis of breast cancer and possibly Alzheimer's disease [40]. Gene X95735, an AML-related gene, is an adhesive plaque protein which plays a central role in the regulation of cell differentiation [40]. From Tables 2 and 3, it can be concluded that gene X03934 is also a critical gene for sample classification.

From Table 4, genes H20709 and R22197 were also selected in [41], and genes H20709, T94579 and R88740 were also selected in [42]. A muscle index can be calculated based on the average intensity of 17 ESTs in the array that are homologous to smooth muscle genes, including gene H20709 [41]. Gene R22197 belongs to the ribosomal protein cluster, ESTs homologous to genes that appear to be related to cellular metabolism, such as an ATP-synthase component and an elongation factor [41].

From Table 5, genes 796258, 629896, 1409509, 1435862, 784224, 21652 and 770394 were selected by both the methods proposed in [33] and [43]. Genes 812105 and 42558 were also selected by the method proposed in [33]. Some genes are over-expressed in a certain type of tumor but lack specificity. For instance, gene 784224 (fibroblast growth factor receptor 4) was noted to be highly expressed only in RMS and not in normal muscle, but it is also expressed in some other cancers and normal tissues [33]. This tyrosine kinase receptor is expressed during myogenesis but not in adult muscle, and is of interest because of its potential role in tumor growth and in the prevention of terminal differentiation in muscle [33].

Although many gene selection methods have been applied to the LUNG, Brain cancer and Lymphoma data, they did not give detailed information on the selected genes [38], [39], [44], [45], so we do not know which of the genes listed in Tables 6–8 were also selected by those methods. The genes listed in Tables 6–8 can serve as a useful reference for future studies of these three datasets.

To further verify that the proposed method is capable of selecting predictive genes, heatmaps of the top ten most frequently selected genes for the six datasets are shown in Figure 4. From Figure 4 (A), the expression levels of all ten genes clearly differentiate between AML and ALL. From Figure 4 (B), the expression levels of genes 237, 286 and 141 are distinct between the two classes. From Figure 4 (C), some genes are over-expressed in a certain type of tumor but lack specificity. For instance, the expression levels of genes 1601 and 742 distinguish NB from the other classes, while those of genes 1003, 603, 1055 and 1955 distinguish RMS from the other classes. Gene 545 has a different expression level in EWS, and the expression level of gene 1 clearly differentiates BL and NB from EWS and RMS. From Figure 4 (D), the expression levels of genes 498, 1243 and 614 clearly differentiate NOR from the other four classes, while those of genes 3178, 1765, 1014, 2750 and 567 differentiate SQ from the other classes. From Figure 4 (E–F), there is no single gene in the Brain cancer or Lymphoma data (especially the Lymphoma data) whose expression levels are distinct between the two classes. This is mainly because all genes in these two datasets have low deviation over all samples.

Figure 4. The Heatmap of expression levels based on the top ten frequently selected genes on the six data (A) Leukemia (B) Colon (C) SRBCT (D) LUNG (E) Brain cancer (F) Lymphoma.

https://doi.org/10.1371/journal.pone.0097530.g004

Principal component (PC) analysis is a linear orthogonal transform such that the coordinates of the data in the new space are uncorrelated and most of the variance of the original data is preserved by only a few coordinates [46]. Hence, the first few PCs explain most of the variance in the data, and a plot of the first three PCs often reveals patterns in the data [47]. Here it is used to project high-dimensional samples onto a three-dimensional plot for visual display. PC analysis is applied to all samples in the six datasets using the top 30 most frequently selected genes, and the result is shown in Figure 5. Noticeably, two and four distinct clusters emerge for the Leukemia and SRBCT data, respectively (Figure 5(A) and (C)). As for the LUNG data (Figure 5(D)), although several samples of different classes overlap, five clusters are apparent. As for the Colon data (Figure 5(B)), four normal samples fall in the cluster of tumor samples; these were considered contaminated samples in [47]. From Figure 5 (E–F), a few samples of different classes overlap. This further indicates that the samples from different classes are not very distinct in either the Brain cancer or Lymphoma data, which was verified in [44], [45].

Figure 5. Plot of the first three principal components using the top 30 frequently selected genes (A) Leukemia (B) Colon (C) SRBCT (D) LUNG (E) Brain cancer (F) Lymphoma.

https://doi.org/10.1371/journal.pone.0097530.g005
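As a minimal sketch of this projection (not the authors' exact pipeline), the first three PCs can be obtained from a centered singular value decomposition; here `X` is a hypothetical sample-by-gene matrix restricted to the top 30 frequently selected genes:

```python
import numpy as np

def project_to_3pcs(X):
    """Project samples (rows of X) onto the first three principal components.

    X: (n_samples, n_genes) expression matrix, e.g. restricted to the
    top 30 frequently selected genes (hypothetical input).
    """
    Xc = X - X.mean(axis=0)                  # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:3].T                   # 3-D coordinates per sample
    var = S ** 2                             # variance along each PC (up to a constant)
    explained = var[:3] / var.sum()          # fraction of total variance per PC
    return scores, explained

# Toy illustration: 20 samples, 30 genes
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 30))
scores, explained = project_to_3pcs(X)
```

Plotting the three columns of `scores`, colored by class label, yields a display of the kind shown in Figure 5.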

Reproducibility of the Proposed Method

To examine the reproducibility of gene selection [21], we repeat the proposed approach on the six data with different random seeds. Another 1000 subsets of optimal particles are thus obtained through an independent run; that is, the proposed method runs 1000 times in one independent run to obtain 1000 optimal gene subsets. All genes in the second-level initial gene pool are ranked according to the frequency with which they are selected in the 1000 solutions obtained by each independent run. Figure 6 depicts the correlation between the ranks from the two independent runs. The points in Figure 6 are distributed along the line y = x, which indicates that the proposed method is highly repeatable across the two independent runs; hence, the reproducibility of KMeans-GCSI-MBPSO-ELM is high.

Figure 6. The correlation between the ranks of genes from two independent runs on the six data to assess reproducibility of the proposed approach (A) Leukemia (B) Colon (C) SRBCT (D) LUNG (E) Brain cancer (F) Lymphoma.

https://doi.org/10.1371/journal.pone.0097530.g006
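The rank comparison above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: genes are ranked by selection frequency within each run, and the agreement between runs is measured by the Spearman correlation (computed here as the Pearson correlation of the rank vectors). The toy subsets are hypothetical.

```python
import numpy as np

def selection_frequency_ranks(solutions, n_genes):
    """Rank genes by how often each appears in a run's optimal subsets.

    solutions: list of gene-index sets from one independent run
    (in the paper, 1000 optimal subsets per run).
    Returns an array in which rank 0 marks the most frequently selected gene.
    """
    freq = np.zeros(n_genes)
    for subset in solutions:
        freq[list(subset)] += 1
    order = np.argsort(-freq, kind="stable")   # descending frequency
    ranks = np.empty(n_genes, dtype=int)
    ranks[order] = np.arange(n_genes)
    return ranks

# Two hypothetical runs with identical selections: ranks agree perfectly
run1 = [{0, 1}, {0, 2}, {0, 1, 2}]
run2 = [{0, 2}, {0, 1}, {0, 1, 2}]
r1 = selection_frequency_ranks(run1, 6)
r2 = selection_frequency_ranks(run2, 6)
# Spearman correlation = Pearson correlation of the rank vectors
rho = np.corrcoef(r1, r2)[0, 1]
```

When the two runs select the same genes with the same frequencies, the ranks coincide and the points fall exactly on the line y = x, as in Figure 6.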

Convergence Rate of the Proposed Method

Figure 7 shows the 5-fold CV accuracy on the training data versus the iteration number of the modified BPSO. From Figure 7, the MBPSO finds the optimal gene subset within only five epochs on the SRBCT and LUNG data, and within 13, 23, 13 and 25 epochs on the Leukemia, Colon, Brain cancer and Lymphoma data, respectively. Owing to the low redundancy of the second-level initial gene pool and the encoding of GCS information into BPSO, the modified BPSO can find the optimal gene subset at a fast convergence rate. Figure 7 can serve as a guide for selecting the maximum iteration number of the modified BPSO for different types of microarray data.

Figure 7. 5-fold CV accuracy on the training data versus the iteration number of the MBPSO.

https://doi.org/10.1371/journal.pone.0097530.g007
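The MBPSO additionally encodes GCS information into the position update and uses ELM 5-fold CV accuracy as its fitness; neither detail is reproduced here. The following is only a minimal sketch of the underlying discrete binary PSO of Kennedy and Eberhart [23], with a hypothetical stand-in fitness function:

```python
import numpy as np

rng = np.random.default_rng(1)

def bpso(fitness, n_bits, n_particles=20, n_iter=30, w=0.9, c1=2.0, c2=2.0):
    """Minimal binary PSO: each bit marks whether a gene is selected,
    and a sigmoid of the velocity gives the probability of a bit being 1."""
    X = rng.integers(0, 2, size=(n_particles, n_bits))
    V = rng.uniform(-1, 1, size=(n_particles, n_bits))
    pbest = X.copy()
    pbest_fit = np.array([fitness(x) for x in X], dtype=float)
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
        # Sample each bit with probability sigmoid(velocity)
        X = (rng.random(X.shape) < 1.0 / (1.0 + np.exp(-V))).astype(int)
        fit = np.array([fitness(x) for x in X], dtype=float)
        improved = fit > pbest_fit
        pbest[improved] = X[improved]
        pbest_fit[improved] = fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()
    return gbest, pbest_fit.max()

# Hypothetical fitness: number of bits matching a known "informative" mask
target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
best, best_fit = bpso(lambda x: float(np.sum(x == target)), n_bits=8)
```

In the paper's setting, `fitness` would instead be the 5-fold CV accuracy of an ELM trained on the genes flagged by the particle's bits.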

Comparison with Other Gene Selection Methods

First, to compare the KMeans-GCSI-MBPSO-ELM method with other PSO-based gene selection methods, such as BPSO-ELM, KMeans-BPSO-ELM and our previous work in [28], ELM is used to perform sample classification with the genes selected by the four gene selection methods. In the BPSO-ELM method, BPSO searches for the optimal gene subsets and ELM is used to evaluate them. In the KMeans-BPSO-ELM method, all genes are first grouped by the K-means clustering method, and then BPSO combined with ELM performs gene selection. Compared with the KMeans-GCSI-MBPSO-ELM method, the KMeans-BPSO-ELM method neither considers GCS information nor reduces redundant genes in each cluster before using BPSO-ELM to select genes. Each experiment is conducted 100 times, and the corresponding 5-fold CV results are listed in Table 9. With the genes selected by the KMeans-GCSI-MBPSO-ELM method, ELM obtains the highest 5-fold CV classification accuracy on all data, whereas it obtains the lowest 5-fold CV classification accuracy on all data with the genes selected by BPSO-ELM. Although all four methods combine PSO with ELM to select informative gene subsets, the KMeans-GCSI-MBPSO-ELM method performs best because it uses the GCS information to reduce the redundancy of genes and to select gene subsets highly correlated with the sample classes.

Table 9. The 5-fold CV classification accuracies of the ELM classifier based on four PSO-based gene selection methods on the six data.

https://doi.org/10.1371/journal.pone.0097530.t009

Then, to compare the proposed method with GS1 [38], GS2 [38], Cho’s [48], F-test [49] and the method in [28], we adopt two classifiers in addition to ELM, built on the top frequently selected genes: support vector machines (SVMs) [50] and the K-nearest-neighbor (KNN) classifier [49]. As in [38], we choose a linear kernel for the SVM and the Euclidean distance for the KNN classifier with K = 5. To be consistent with [38], each experiment is conducted 100 times on the Leukemia, SRBCT, LUNG and Colon data, and the mean 5-fold CV classification accuracies for three numbers of top-ranked genes (that is, 30, 60 and 100) on these data are listed in Tables 10–13. The results of GS1, GS2, Cho’s and F-test are quoted directly from [38]; thus the results of these four methods on the Colon data are not provided in Table 13. Since the number of genes reserved for further selection by BPSO in the proposed method is less than 100 on the Leukemia data, the results with 100 genes for the six methods are not provided in Table 10. Similarly, the results with 60 genes on the Leukemia data for the method in [28] are not provided. In all cases except the KNN classifier with the top 60 frequently selected genes on the Leukemia data, the KNN and SVM classifiers achieve higher classification accuracies with the genes selected by the KMeans-GCSI-MBPSO-ELM method than with those selected by the other methods, excluding the method in [28]. Moreover, the standard deviations of the KMeans-GCSI-MBPSO-ELM method are less than those of the other gene selection methods in all cases except for the method in [28] on the LUNG data, which shows that the proposed method is the most robust in most cases. Compared with the proposed method, the KNN and SVM classifiers achieve comparably high classification accuracies on the SRBCT and LUNG data with the 60 and 100 genes selected by the method in [28], which indicates that the method in [28] requires a comparatively large gene set to cover the predictive genes selected by the proposed method.
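The evaluation protocol above can be sketched with scikit-learn; the synthetic data below is a stand-in for the microarray samples restricted to the top-ranked genes, and the classifier settings follow those stated in the text (K = 5 with Euclidean distance, linear SVM kernel):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for samples restricted to the top-ranked genes
X, y = make_classification(n_samples=60, n_features=30, n_informative=10,
                           random_state=0)

# KNN with Euclidean distance and K = 5, as in [38]
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_acc = cross_val_score(knn, X, y, cv=5).mean()

# Linear-kernel SVM, as in [38]
svm = SVC(kernel="linear")
svm_acc = cross_val_score(svm, X, y, cv=5).mean()
```

Repeating this 5-fold CV over 100 runs and averaging yields figures of the kind reported in Tables 10–13.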

Table 10. The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on six gene selection methods on the Leukemia data.

https://doi.org/10.1371/journal.pone.0097530.t010

Table 11. The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on six gene selection methods on the SRBCT data.

https://doi.org/10.1371/journal.pone.0097530.t011

Table 12. The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on six gene selection methods on the LUNG data.

https://doi.org/10.1371/journal.pone.0097530.t012

Table 13. The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on two gene selection methods on the Colon data.

https://doi.org/10.1371/journal.pone.0097530.t013

Finally, to compare the KMeans-GCSI-MBPSO-ELM method with MIDClass [45], SGC (based on the t-test (SGC-t) and on WMW (SGC-W)) [44] and the method in [28], we conduct experiments on the Brain cancer and Lymphoma data. To be consistent with [44], [45], the experiments use standard leave-one-out cross-validation (LOOCV). The results of KMeans-GCSI-MBPSO-ELM and the method in [28] are the average of 100 trials, and the results of the MIDClass and SGC methods are retrieved from [44], [45]. From the results shown in Table 14, it can be seen that the KMeans-GCSI-MBPSO-ELM method outperforms the method in [28] as well as the MIDClass and SGC methods.

Table 14. The LOOCV classification accuracies of five methods on the Brain cancer and Lymphoma data.

https://doi.org/10.1371/journal.pone.0097530.t014

Discussion on the Selection of the Number of the Clusters

In the KMeans-GCSI-MBPSO-ELM method, the number of clusters of the genes in the first-level initial gene pool (Nc) is determined by trial and error, where the number of clusters is chosen within the cross-validation runs on the validation dataset. However, there are methods available, such as the Elbow method [51] or Silhouette scores [52], that could be used to determine the number of clusters for the K-means method. To show the influence of the selection of Nc, we use the KMeans+Elbow-GCSI-MBPSO-ELM method to perform gene selection on the six data. The only difference between the KMeans-GCSI-MBPSO-ELM and KMeans+Elbow-GCSI-MBPSO-ELM methods is the selection of Nc. In the KMeans+Elbow-GCSI-MBPSO-ELM method, the Elbow method [51] is used to determine the number of clusters by evaluating the ratio of the between-group variance to the total variance (also known as an F-test). The comparison results between the KMeans-GCSI-MBPSO-ELM and KMeans+Elbow-GCSI-MBPSO-ELM methods are provided in Table 15. The results in Table 15 are the average of 100 trials.
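A minimal sketch of this Elbow criterion follows, assuming the variance-ratio formulation described above; note that scikit-learn's `inertia_` attribute is the within-group sum of squares, so the between-group share is obtained by subtraction from one:

```python
import numpy as np
from sklearn.cluster import KMeans

def explained_variance_ratio(X, k_values):
    """Between-group variance / total variance for each candidate k;
    the 'elbow' is the k beyond which the ratio stops improving markedly."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    ratios = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        ratios.append(1.0 - km.inertia_ / total_ss)  # within-SS -> between-SS
    return ratios

# Three well-separated blobs: the ratio rises sharply up to k = 3
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
ratios = explained_variance_ratio(X, range(1, 6))
```

The ratio climbs steeply while additional clusters split genuinely distinct groups and then flattens, and the k at the bend is taken as Nc.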

Table 15. The LOOCV and 5-fold CV classification accuracies of ELM based on two gene selection methods on the six data.

https://doi.org/10.1371/journal.pone.0097530.t015

From Table 15, the KMeans-GCSI-MBPSO-ELM and KMeans+Elbow-GCSI-MBPSO-ELM methods select the same number of clusters on some data, namely the Leukemia, Colon, Brain cancer and Lymphoma data, so ELM obtains the same classification accuracies with the genes selected by these two methods on these four data. The value of Nc in the KMeans-GCSI-MBPSO-ELM method differs from that of the KMeans+Elbow-GCSI-MBPSO-ELM method on the SRBCT and LUNG data, and ELM obtains higher classification accuracies with the genes selected by the KMeans-GCSI-MBPSO-ELM method than with those selected by the KMeans+Elbow-GCSI-MBPSO-ELM method on both of these data. In summary, since the selection of Nc is determined by the cross-validation accuracies obtained by ELM on the validation dataset in the KMeans-GCSI-MBPSO-ELM method, ELM tends to obtain higher classification accuracies on the full dataset with the genes selected by the KMeans-GCSI-MBPSO-ELM method than with those selected by the KMeans+Elbow-GCSI-MBPSO-ELM method.

Figure 8 shows the relationship between the number of clusters (Nc) and the classification accuracies on the full dataset obtained by ELM with the genes selected by the KMeans-GCSI-MBPSO-ELM method on the six data. The results in Figure 8 are the average of 100 trials. From Figure 8, the KMeans-GCSI-MBPSO-ELM method is less sensitive to the choice of Nc on the Leukemia and SRBCT data than on the other four data. The best Nc for the KMeans-GCSI-MBPSO-ELM method is 5, 8, 8 and 10 on the Leukemia, Colon, SRBCT and LUNG data, respectively, which is the same as that determined by trial and error in the proposed method. The best Nc is 4 and 7 on the Brain cancer data and 6 on the Lymphoma data, while Nc is selected as 6 and 7 by trial and error in the KMeans-GCSI-MBPSO-ELM method on the Brain cancer and Lymphoma data, respectively. This difference arises because Nc in the KMeans-GCSI-MBPSO-ELM method is selected by trial and error within the cross-validation runs on the validation dataset, while the best Nc obtained from Figure 8 is determined within the cross-validation runs on the full dataset.

Figure 8. The number of the clusters (Nc) versus the classification accuracy obtained by ELM with the genes selected by the KMeans-GCSI-MBPSO-ELM method on the six data.

https://doi.org/10.1371/journal.pone.0097530.g008

Conclusions

In this study, the GCS information combined with the K-means clustering method was used to reduce redundant genes, and then a modified BPSO encoding the GCS information was used to perform further gene selection. The new method can select predictive gene subsets with comparatively high GCS values. Experimental results also show that the ELM, SVM and KNN classifiers obtain high prediction accuracy with the genes selected by the proposed method. Moreover, the proposed gene selection method has high reproducibility. Since the proposed method reduces the redundancy of genes only by removing genes with low GCS values, it might filter out a few critical genes highly related to sample classification in some cases and thus lead to worse classification accuracy. Future work will address this problem in the proposed gene selection method and apply the new method to more complex microarray data.

Author Contributions

Conceived and designed the experiments: FH WS. Performed the experiments: FH WS. Analyzed the data: QHL. Wrote the paper: FH.

References

  1. Zhang JG, Deng HW (2007) Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics 8: 370–378.
  2. Huang HY, Liu CC, Zhou XH (2010) Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. Proc. Natl. Acad. Sci. 107: 6823–6828.
  3. Wang DF, Arapostathis A, Wilke CO, Markey MK (2012) Principal-Oscillation-Pattern analysis of gene expression. PLoS ONE 7: e28805.
  4. Wong L (2013) An introduction to some new results in bioinformatics and computational biology. Journal of Bioinformatics and Computational Biology 11: 1301001.
  5. Gutkin M, Shamir R, Dror G (2009) SlimPLS: A method for feature selection in gene expression-based disease classification. PLoS ONE 4: e6416.
  6. Varma S, Simon R (2004) Iterative class discovery and feature selection using Minimal Spanning Trees. BMC Bioinformatics 5: 126–134.
  7. Cao HB, Lei SF, Deng HW, Wang YP (2012) Identification of genes for complex diseases using integrated analysis of multiple types of genomic data. PLoS ONE 7: e42755.
  8. Liu QZ, Sung AH, Chen ZX, Liu JZ, Huang XD, et al. (2009) Feature selection and classification of MAQC-II breast cancer and multiple myeloma microarray gene expression data. PLoS ONE 4: e8250.
  9. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artificial Intelligence 97: 273–324.
  10. Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23: 2507–2517.
  11. Maldonado S, Weber R (2009) A wrapper method for feature selection using support vector machines. Information Sciences 179: 2208–2217.
  12. Bermejo P, Gamez JA, Puerta JM (2011) A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets. Pattern Recognition 32: 701–711.
  13. Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Boston: Addison-Wesley Longman Publishing Co., Inc.
  14. Kennedy J, Eberhart RC (1995) Particle swarm optimization. In Proc. IEEE International Conference on Neural Networks, 1942–1948.
  15. Shi Y, Eberhart RC (1998) A modified particle swarm optimizer. In Proc. IEEE World Congress on Computational Intelligence, 69–73.
  16. O’Hagan S, Knowles J, Kell DB (2012) Exploiting genomic knowledge in optimising molecular breeding programmes: Algorithms from evolutionary computing. PLoS ONE 7: e48862.
  17. Chuang LY, Lin YD, Chang HW, Yang CH (2012) An improved PSO algorithm for generating protective SNP barcodes in breast cancer. PLoS ONE 7: e37018.
  18. Chuang LY, Huang HC, Lin MC, Yang CH (2011) Particle swarm optimization with reinforcement learning for the prediction of CpG islands in the human genome. PLoS ONE 6: e21036.
  19. Yang CS, Chuang LY, Ke CH, Yang CH (2008) A hybrid feature selection method for microarray classification. International Journal of Computer Science 35: 285–290.
  20. Shen Q, Shi WM, Kong W, Ye BX (2007) A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification. Talanta 71: 1679–1683.
  21. Li LP, Weinberg CR, Darden TA, Pedersen LG (2001) Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17: 1131–1142.
  22. Saraswathi S, Sundaram S, Sundararajan N, Zimmermann M, Nilsen-Hamilton M (2011) ICGA-PSO-ELM approach for accurate multiclass cancer classification resulting in reduced gene sets in which genes encoding secreted proteins are highly represented. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 452–463.
  23. Kennedy J, Eberhart RC (1997) A discrete binary version of the particle swarm algorithm. In Proc. 1997 IEEE International Conference on Systems, Man, and Cybernetics, 4104–4108.
  24. Werbos PJ (1974) Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.
  25. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13: 21–27.
  26. Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20: 273–297.
  27. Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70: 489–501.
  28. Yang SX, Han F, Guan J (2013) A hybrid gene selection and classification approach for microarray data based on clustering and PSO. Communications in Computer and Information Science 375: 88–93.
  29. Zalik KR (2008) An efficient k-means clustering algorithm. Pattern Recognition Letters 29: 1385–1391.
  30. Huang GB, Zhu QY, Siew CK (2004) Extreme learning machine: a new learning scheme of FNNs. In Proc. 2004 International Joint Conference on Neural Networks, 985–990.
  31. Soria-Olivas E, Gómez-Sanchis J, Martín JD, Vila-Francés J, Martínez M, et al. (2011) BELM: Bayesian extreme learning machine. IEEE Transactions on Neural Networks 22: 505–509.
  32. Han F, Huang DS (2006) Improved extreme learning machine for function approximation by encoding a priori information. Neurocomputing 69: 2369–2373.
  33. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7: 637–679.
  34. Han F, Ling QH, Huang DS (2008) Modified constrained learning algorithms incorporating additional functional constraints into neural networks. Information Sciences 178: 907–919.
  35. Li YX, Ruan XG (2005) Feature selection for cancer classification based on support vector machine. Journal of Computer Research and Development 42: 1796–1801.
  36. Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. 99: 6562–6566.
  37. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531–536.
  38. Yang K, Cai Z, Li J, Lin G (2006) A stable gene selection in microarray data analysis. BMC Bioinformatics 7: 228–243.
  39. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. 98: 13790–13795.
  40. Tong DL (2009) Hybridising genetic algorithm-neural network (GANN) in marker genes detection. In Proc. The Eighth International Conference on Machine Learning and Cybernetics, 12–15.
  41. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96: 6745–6750.
  42. Huang TM, Kecman V (2005) Gene extraction for cancer diagnosis by support vector machines - An improvement. Artificial Intelligence in Medicine 35: 185–194.
  43. Chu F, Wang L (2005) Applications of support vector machines to cancer classification with microarray data. International Journal of Neural Systems 15: 475–484.
  44. Wang XS, Simon R (2011) Microarray-based cancer prediction using single genes. BMC Bioinformatics 12: 391.
  45. Giugno R, Pulvirenti A, Cascione L, Pigola G, Ferro A (2013) MIDClass: Microarray data classification by association rules and gene expression intervals. PLoS ONE 8(8): e69873.
  46. Oliveira PR, Romero RF (1996) A comparison between PCA and neural networks and the JPEG standard for performing image compression. In Proc. IEEE Workshop on Cybernetic Vision, 112–116.
  47. Li LP, Darden TA, Weinberg CR, Levine AJ, Pedersen LG (2001) Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb. Chem. High Throughput Screen 4: 727–739.
  48. Cho JH, Lee D, Park JH, Lee IB (2003) New gene selection for classification of cancer subtype considering within-class variation. FEBS Letters 551: 3–7.
  49. Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97: 77–87.
  50. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46: 389–422.
  51. Ketchen DJ, Shook CL (1996) The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal 17(6): 441–458.
  52. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20: 53–65.