Automated Detection of Cancer Associated Genes Using a Combined Fuzzy-Rough-Set-Based F-Information and Water Swirl Algorithm of Human Gene Expression Data

Pugalendhi Ganesh Kumar; Muthu Subash Kavitha; Byeong-Cheol Ahn

doi:10.1371/journal.pone.0167504

Abstract

This study describes a novel approach to reducing the challenges posed by highly nonlinear multiclass gene expression values in cancer diagnosis. To build a useful system for cancer diagnosis, we introduced two levels of gene selection: filtering, to select potential genes, and embedding, to select the genes most relevant to cancer. The filter procedure was implemented by developing a fuzzy rough set (FR)-based method that redefines the criterion function of f-information (FI) to identify the potential genes without discretizing the continuous gene expression values. The embedded procedure was implemented by means of a water swirl algorithm (WSA), which attempts to optimize the rule set and membership function required to classify samples using a fuzzy-rule-based multiclassification system (FRBMS). Two novel update equations are proposed in WSA, which have better exploration and exploitation abilities while designing a self-learning FRBMS. The efficiency of our new approach was evaluated on 13 multicategory and 9 binary datasets of cancer gene expression. Additionally, the performance of the proposed FRFI-WSA method in designing an FRBMS was compared with existing methods for gene selection and optimization such as genetic algorithm (GA), particle swarm optimization (PSO), and artificial bee colony algorithm (ABC) on all the datasets. On the global cancer map with repeated measurements (GCM_RM) dataset, FRFI-WSA selected the smallest set of most relevant cancer-associated genes (16), used the fewest compact rules (26), and achieved the highest classification accuracy (96.45%). In addition, the statistical validation used in this study revealed that the biological relevance of the most relevant genes associated with cancer and their linguistics detected by the proposed FRFI-WSA approach are better than those in the other methods.
The simple interpretable rules with most relevant genes and effectively classified samples suggest that the proposed FRFI-WSA approach is reliable for classification of an individual’s cancer gene expression data with high precision and therefore it could be helpful for clinicians as a clinical decision support system.

Citation: Ganesh Kumar P, Kavitha MS, Ahn B-C (2016) Automated Detection of Cancer Associated Genes Using a Combined Fuzzy-Rough-Set-Based F-Information and Water Swirl Algorithm of Human Gene Expression Data. PLoS ONE 11(12): e0167504. https://doi.org/10.1371/journal.pone.0167504

Editor: Yong Deng, Southwest University, CHINA

Received: July 22, 2016; Accepted: November 15, 2016; Published: December 9, 2016

Copyright: © 2016 Ganesh Kumar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI15C0001) and a grant of the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (no. NRF-2015M2A2A7A01045177).

Competing interests: The authors have declared that no competing interests exist.

Introduction

Multiclass classification of gene expression data with a reduced number of genes remains a challenging problem in cancer diagnosis. Microarrays and next-generation sequencing [1, 2] are the chief tools of cancer research for quantification of gene expression, DNA copy number, and microRNA activity of each individual. Hence, analyzing such data could give researchers useful information not only about the mechanism and cause of cancer but also about ways to predict and prevent cancer and to find possible novel treatments. However, classification of multiclass data is more complex than that of binary data, and the classification accuracy may decline as the number of classes increases [3]. Artificial intelligence in the form of data-mining tasks such as classification and clustering has been applied to analyze gene expression values for cancer diagnosis [4–7]. However, these techniques suffer from high computational cost and long training time.

Rule-based approaches have produced knowledge from gene expression data with acceptable classification accuracy for diagnosing cancer [8–11]. In addition, other approaches such as a decision tree [12] and ensemble classification trees [13, 14] have been used for identification of cancer-causing genes in gene expression data. Nonetheless, these approaches failed to consider the overlapping behavior of gene expression levels in uncertain situations. Data-driven approaches [15, 16] have been applied for extracting knowledge from the gene expression data without a human expert, but they were found to be weak in terms of the self-learning process. In general, these approaches are problematic for subtyping of cancer with identical expression levels in multiclass cancer data [17, 18].

In several studies, the concept of fuzzy logic has been used to develop a rule-based system with the help of a learning algorithm to address multiclass issues among cancer genes as well as for suitable generation of if-then rules and a membership function (MF) for classification of a disease [19–25]. The genetic algorithm [20] and particle swarm optimization (PSO) techniques [21] can generate rules through simultaneous tuning of the MF, but the resulting rule set becomes too lengthy as the number of linguistic terms grows and was found to be incomprehensible for making diagnostic decisions. The artificial bee colony (ABC) algorithm [22] was recommended to produce compact if-then rules with better readability, but it consumes more computation time because of its more complicated operations and larger number of tunable control parameters. Fuzzy ontology [23] can extract the knowledge quickly, but its performance degrades with the scarce data distribution found in the multiclass gene expression data. The framework described in reference [24] transforms crisp rules into fuzzy rules using a stochastic global optimization procedure; however, the generation of the crisp rules by experts for multiple categories of cancer is again a difficult task. Majority voting and fuzzy aggregation [25] are used in a multi-classification system, and it was reported that combining the results from the individual classifiers for the final decision yields poor performance on the more skewed multiclass data on cancer gene expression.

Recently, fuzzy-rule-based multiclassification systems (FRBMS) [26] using combinations of methods were proposed to take advantage of the crucial benefit of interpretability offered by the fuzzy system. Nevertheless, the presence of numerous genomic variables versus a relatively small number of patients poses challenges in understanding the data. Attempts have been made to use a genetic algorithm (GA) [27] in an FRBMS to perform classifier fusion and selection; however, this approach does not account for the skewness of the gene expression data. Furthermore, underfitting should be avoided during multiclassification because it results in a non-optimally robust system due to inadequate experimentation. To build a beneficial system for cancer diagnosis that overcomes shortcomings [28, 29] such as scarceness and highly nonlinear multicategory values, it is necessary to design an ideal method with precise principles of data analysis.

The abundance of genes expressed in microarray experiments requires a long computation time and results in complex output for an FRBMS. To implement an FRBMS for a gene expression-based cancer diagnosis problem, identification of most relevant genes associated with cancer from the large set of genes is mandatory [4, 15]. The purpose of this newly proposed combined fuzzy-rough-set-based f-information & water swirl algorithm (FRFI-WSA) approach was to design an FRBMS for analyzing gene expression data for cancer diagnosis. For an effective cancer diagnostic system, two levels of gene selection (by filtering and embedding procedures using 22 cancer gene expression datasets collected from various sources) were introduced. Next, we conducted a comparison of the performance of the proposed FRFI-WSA with GA, PSO, and artificial bee colony algorithm (ABC) for cancer gene expression datasets.

Materials and Methods

Cancer gene expression datasets

This study uses 22 gene expression datasets; Table 1 lists the name, number of genes (#Genes), number of samples (#Sam), and number of categories (#Cat) of each dataset, along with its source and type. The performance of the proposed algorithm in classifying datasets irrespective of the number of output classes was evaluated with 13 multiclass and 9 binary datasets. All the datasets were generated using oligonucleotide-based technology where RNA was hybridized using Affymetrix arrays HG-U95/Hu6800/HuGeneFL/Hu35K. The gene expression values of all the datasets were computed using the Affymetrix GENECHIP MAS 4.0 analysis software. The data on small round blue cell tumors (SRBCTs), NCI60 (National Cancer Institute), and Lymphoma were acquired using a two-color cDNA platform with successive image analysis by means of the DeArray Software. To summarize, the 22 datasets included in our experiments each have 2–11 distinct diagnostic categories, 24–253 samples (patients), and 182–54614 genes collected from different tissues under different experimental conditions. The number of samples per class is highly sparse and imbalanced (varies from 6 to 579).

Table 1. Characteristics of gene expression datasets used for analysis.

https://doi.org/10.1371/journal.pone.0167504.t001

Proposed architecture for analyzing cancer gene expression data

Clinicians analyzing gene expression data for prediction of cancer face several problems: a limited number of patients (scarcity), samples skewed in favor of one group (disparity), a huge number of genes (dimensionality), and many categories of cancer (multiclass) [30–33]. To overcome these drawbacks, problem-specific computational techniques for multiclass cancer diagnosis were developed here. As shown in Fig 1, the implementation procedure of the proposed combinatorial approach can be viewed in seven phases. The first phase reads the input data into the FRFI method, which finds the candidate genes among the huge number of genes using the steps presented in Fig 1. The candidate genes are then fed into the FRBMS in the second phase to find the initial points for the membership function (MF) and rule set (RS). In the third phase, these initial MF points and RS are read into the WSA to generate a population of points as water particle positions. The generated points are submitted to the inference procedure of the FRBMS in the fourth phase to compute the number of correctly classified samples (Cs), the selected number of rules (Rs), and the selected number of informative genes (Gs). The parameters Cs, Rs, and Gs calculated in the FRBMS are then input to the WSA in the fifth phase for evaluating the objective function, which determines the optimality of the generated water particle position as a knowledge base. In the sixth phase, if the optimality criteria are not met, the water particle's strength and position are updated to generate a more useful knowledge base, which results in improved classification of samples. The fifth and sixth phases are repeatedly executed until the desired convergence criterion is achieved. In the final phase, acceptable classification accuracy with interpretable knowledge is generated in the form of if-then conditional statements that help to identify the cancer-causing genes.
The details of the subcomponents of the proposed architecture are given below.

Fig 1. Architecture of the proposed FRFI-WSA approach for cancer gene expression data.

https://doi.org/10.1371/journal.pone.0167504.g001

FRFI

Regardless of the dimensionality issue, the fuzzy rough set (FR) [49] effectively calculates the redundancy (severance) as well as relevance (significance) using f-information (FI) without discretizing the continuous gene expression values. The detailed concepts of the fuzzy set, rough set, fuzzy rough set, and f-information are presented in S1 Appendix. Even though the FR offers a regimented means for FI-based gene selection, it becomes inadequate for the noisiness and poor dispersal of multiclass samples. Hence, it was upgraded with a fuzzy lower approximation [50] to compute FI extrinsically to filter a gene subset. Given an n × m matrix of gene expression data with “m” gene vectors, the goal of gene filtering is to produce an n × f gene expression data matrix with “f” filtered gene vectors, where f < m. The steps involved in computing FI using the FR are as follows.

(i) Read the gene expression dataset G_i×j, where i = 1, 2, …, m, c and j = 1, 2, …, n; m is the number of genes, c is the class label, and n is the number of samples.
(ii) Calculate the mean value μ = {μ₁, μ₂, …, μ_m, μ_c} for each gene and for the class labels over all the samples.
(iii) Generate two gene groups (High H, Low L) by comparing each gene value with its respective mean value, so that H = {genes with a value greater than the mean} and L = {genes with a value lower than the mean}.
(iv) Calculate the mean values μ_L and μ_H of the two gene groups for each gene.
(v) The mean value calculated at step (ii) is considered the medium mean value, μ_M.
(vi) Calculate the standard deviation for each mean value {μ_L, μ_M, μ_H}.
(vii) Calculate the membership value in lower fuzzy approximation spaces for each gene G_i×j.
(viii) Calculate the positional values for each gene.
(ix) Construct the fuzzy equivalence partition matrix (FEPM) FP_i for each gene.
(x) Suppose G_i×j represents a gene and G_c represents a class label. Calculate the Gene-Group significance value F_sig between them.
(xi) Calculate the Gene-Gene severance F_sev between F_sig and the remaining genes G_rem.
(xii) Calculate the FI value for each gene G_i×j using the formula FI = min|F_sig−F_sev| and sort the genes in descending order of FI values for filtering.

It is expected that the proposed method of fuzzifying the criterion function of FI with a rough set can filter genes extrinsically in a way similar to human intervention into gene identification.
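
The mean, grouping, and standard-deviation steps, together with the final FI ranking, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `group_statistics` and `rank_by_fi` are hypothetical, a population standard deviation within each group is assumed, FI is taken as the per-gene absolute difference |F_sig − F_sev|, and the significance and severance scores are treated as given inputs because their formulas are not available in this text.

```python
from statistics import mean, pstdev


def group_statistics(values):
    """For one gene, split expression values around the mean and return
    (mean, standard deviation) for the low, medium, and high groups."""
    mu = mean(values)                        # overall mean of the gene
    high = [v for v in values if v > mu]     # H group: values above the mean
    low = [v for v in values if v < mu]      # L group: values below the mean
    stats = {"M": (mu, pstdev(values))}      # medium mean is the overall mean
    for label, group in (("L", low), ("H", high)):
        # Assumption: fall back to the overall mean if a group is empty.
        stats[label] = (mean(group), pstdev(group)) if group else (mu, 0.0)
    return stats


def rank_by_fi(f_sig, f_sev, top_k):
    """Rank genes by FI = |F_sig - F_sev| in descending order, keep top_k.

    f_sig and f_sev are dicts mapping a gene id to a precomputed score.
    """
    fi = {gene: abs(f_sig[gene] - f_sev[gene]) for gene in f_sig}
    return sorted(fi, key=fi.get, reverse=True)[:top_k]


if __name__ == "__main__":
    print(group_statistics([1.0, 2.0, 3.0, 6.0]))
    print(rank_by_fi({"g1": 0.6, "g2": 0.5, "g3": 0.9},
                     {"g1": 0.1, "g2": 0.4, "g3": 0.2}, top_k=2))
```

With `top_k=50`, `rank_by_fi` would mirror the top-50 filtering used later for the GCM_RM dataset.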

FRBMS

The filtered candidate genes from the FRFI method are partitioned into linguistics to generate the MF and RS points. As shown in Fig 2, this study includes three partitions such as low (“L”), medium (“M”), and high (“H”), and thus nine membership points (P₁, P₂, P₃, P₄, P₅, P₆, P₇, P₈, and P₉) are required to encode each candidate gene. P₁ and P₉ are permanent to designate the limits of the gene expression value. The optimal values for other points are selected between the limits [P₁, P₉] for P₂, [P₂, P₉] for P₃, [P₂, P₃] for P₄, [P₄, P₉] for P₅, [P₅, P₉] for P₆, [P₅, P₆] for P₇, and [P₇, P₉] for P₈. These points take floating-point numbers in which triplets P₁, P₂, P₃ and P₇, P₈, P₉ draw a trapezoidal MF and the triplet P₄, P₅, P₆ draws a triangular MF.
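
The three-partition encoding can be illustrated with a short Python sketch. The exact shapes used by the authors are only shown in Fig 2, so the code assumes a left-shoulder trapezoid for "L" (P₁ to P₃), a triangle for "M" (P₄ to P₆), and a right-shoulder trapezoid for "H" (P₇ to P₉); the function names and the example points are hypothetical.

```python
def low_mf(x, p1, p2, p3):
    """Left-shoulder trapezoid: 1 up to p2, falling linearly to 0 at p3."""
    if x <= p2:
        return 1.0
    if x >= p3:
        return 0.0
    return (p3 - x) / (p3 - p2)


def medium_mf(x, p4, p5, p6):
    """Triangle: 0 at p4, peak of 1 at p5, back to 0 at p6."""
    if x <= p4 or x >= p6:
        return 0.0
    if x <= p5:
        return (x - p4) / (p5 - p4)
    return (p6 - x) / (p6 - p5)


def high_mf(x, p7, p8, p9):
    """Right-shoulder trapezoid: 0 up to p7, rising to 1 at p8, flat to p9."""
    if x <= p7:
        return 0.0
    if x >= p8:
        return 1.0
    return (x - p7) / (p8 - p7)


def fuzzify(x, points):
    """Map one expression value to membership degrees for L, M, and H."""
    p1, p2, p3, p4, p5, p6, p7, p8, p9 = points
    return {
        "L": low_mf(x, p1, p2, p3),
        "M": medium_mf(x, p4, p5, p6),
        "H": high_mf(x, p7, p8, p9),
    }


if __name__ == "__main__":
    # Example points chosen to respect the limits stated in the text,
    # e.g. P4 lies in [P2, P3] and P7 lies in [P5, P6].
    points = (0.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0, 10.0)
    print(fuzzify(3.5, points))
```

Because P₄ is drawn from [P₂, P₃] and P₇ from [P₅, P₆], neighboring partitions overlap, which is what lets a single expression value belong partly to two linguistic terms.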

Fig 2. Partitioning of input genes in fuzzy space.

https://doi.org/10.1371/journal.pone.0167504.g002

The representation of typical MF points and RS for FRBMS is shown in Fig 3. A rule chooses integer numbers in three sections, viz., Rule selection, Antecedent, and Consequent. “R” denotes a rule selection that can be either 0 or 1 to select or deselect the rule. G₁, G₂, G₃ … G_f in the antecedent part represent the filtered genes, each taking a random integer value among 0, 1, 2, and 3 to perform linguistic as well as gene selection. The consequent C_l takes any value among 0, 1, 2 … n to assign the category of cancer. These single MF and RS points are fed to WSA to initialize more MF and RS points randomly as a position for the initial water particle. Based on the procedural evaluation of WSA, a knowledge base is constructed that contains the optimal data base (MF points) and rule base (RS points). This knowledge base extracted by WSA is used in a Mamdani inference procedure to perform classification of samples.
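
To make the integer encoding concrete, the following Python sketch decodes one rule into a readable if-then statement. The mapping of codes to meanings is an assumption made for illustration: 0 is read as "gene not used", 1 to 3 as low, medium, and high, R = 1 as "rule selected", and the consequent as an index into a list of class names. The function name `decode_rule` and the example gene and class names are hypothetical.

```python
# Assumed meaning of antecedent codes; 0 is treated as "gene not used".
LABELS = {1: "low", 2: "medium", 3: "high"}


def decode_rule(rule, gene_names, class_names):
    """Turn [R, G_1..G_f, C_l] into an if-then string.

    Returns None when the rule is deselected or uses no gene at all.
    """
    selected, antecedent, consequent = rule[0], rule[1:-1], rule[-1]
    if selected == 0:
        return None
    terms = [
        f"{name} is {LABELS[code]}"
        for name, code in zip(gene_names, antecedent)
        if code != 0
    ]
    if not terms:
        return None
    return "IF " + " AND ".join(terms) + " THEN class is " + class_names[consequent]


if __name__ == "__main__":
    genes = ["g1", "g2", "g3"]
    classes = ["A", "B", "C"]
    print(decode_rule([1, 0, 3, 1, 2], genes, classes))
```

Under this reading, setting a gene's code to 0 removes it from the rule, which is how a single integer per gene can perform gene selection and linguistic selection at the same time.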

Fig 3. Representation of typical membership function (MF) points and rule set (RS) for FRBMS.

https://doi.org/10.1371/journal.pone.0167504.g003

WSA

This is a new optimization algorithm [51, 52] inspired by the way water finds a drain in a sink. The learning principle of WSA is used to make the FRBMS a self-learning system by providing the knowledge base in the form of optimal MF and RS points. The WSA starts by initializing the control parameters, such as the number of water particles, boundary conditions, and number of iterations, followed by random initialization of the positions of the water particles using the initial MF and RS points received from the FRBMS. Then, for each water particle position, WSA randomly generates the water particle’s strength and a reference position. After that, each water particle’s position (i.e., its MF and RS points) is evaluated using the objective function given in eq (1):

f = (T_s − C_s) + (k₁ × R_s) + (k₂ × G_s) (1)

where T_s is the total number of samples, C_s is the number of correctly classified samples, R_s is the number of rules selected from the maximum number of rules R_m, and G_s is the number of genes selected from the filtered genes. k₁ and k₂ are constants used to amplify R_s and G_s. The component (T_s−C_s) calculates the error. The WSA approach used in this study attempts to minimize the error component and thereby improve the accuracy of the system. Similarly, the component (k₁ × R_s) drives WSA toward a compact, interpretable RS. The component (k₂ × G_s) attempts to find the minimal number of potential genes on the basis of the linguistic selection.
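
As a minimal sketch, the objective can be written as a small Python function, assuming that the three components named above are simply summed and that lower values are better. The function name `objective` and the default weights are illustrative assumptions; the paper varies k₁ and k₂ between two and five.

```python
def objective(total_samples, correct, selected_rules, selected_genes,
              k1=2.0, k2=2.0):
    """Score one water particle position; lower is better.

    Assumption: error, rule count, and gene count are combined additively.
    """
    error = total_samples - correct           # (T_s - C_s)
    rule_penalty = k1 * selected_rules        # (k1 x R_s)
    gene_penalty = k2 * selected_genes        # (k2 x G_s)
    return error + rule_penalty + gene_penalty


if __name__ == "__main__":
    # Figures reported for GCM_RM: 123 samples, 119 correct, 26 rules, 16 genes.
    print(objective(123, 119, 26, 16))
```

A position that classifies the same number of samples with fewer rules or fewer genes receives a lower score, so the penalties push the search toward compact knowledge bases.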

The optimality of the generated MF and RS points is checked during every iteration to yield the result. If the optimal points are not obtained, then the MF and RS points are updated iteratively using the strength and position update eqs (2) and (3), where α, x_p, and x_q,ref are all randomly generated using the range given for the solution variable; α_q,ref is a random number generated between 0 and 1; α_old and α_new are the strength vectors of water particles during the i-th and (i + 1)-th iterations. Similarly, x_p,old and x_p,new are the positions of water particles during the i-th and (i + 1)-th iterations; x_q,ref, x_prevBest, and x_gBest denote the reference position, previous best position, and global best position of the water particle, respectively.

Results

FRFI-WSA for the global cancer map with repeated measurements (GCM_RM) dataset

The steps of the proposed FRFI-WSA are demonstrated for tumor data categories of the GCM_RM dataset, which contains 123 samples. Out of 123 samples, 96 and 27 are used for training and testing, respectively. Furthermore, this dataset has 11 categories of tumors with 7129 genes. The 96 training samples include all categories of tumors. Nonetheless, the set of 27 test samples does not include samples of breast, melanoma, and pancreatic tumors. Hence, in this simulation, both the training and testing samples are mixed to have a reasonable sample for each category. Similar consideration is given to other kinds of datasets. The distributions of classes among the training (#Tr) and the testing (#Te) samples of GCM_RM are given in Table 2.

Table 2. Distribution of the training and testing tumor data categories in the GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.t002

At the first level of gene filtering, all the 123 samples are considered for the GCM_RM dataset, and the same is done for the other datasets. Initially, a fuzzy equivalence class (FEC) was calculated for an individual gene via steps (i) through (viii) of FRFI. The FEC calculated for the individual gene is then used to produce an FEPM using step (ix) of FRFI. The FEC and FEPM calculated for the gene of GCM_RM whose accession ID is AB002380_at are given in Table 3. Then Gene-Group significance is analyzed using step (x). Based on the Gene-Group significance value, genes are rated, and the gene with the highest significance value is designated as the first gene. Gene AB002380_at of GCM_RM has the highest significance value of 0.6489 and it is nominated as the top-rated significant gene. After the significance calculation, Gene-Gene severance (redundancy) is analyzed between gene “AB002380_at” and the residual genes of GCM_RM using step (xi) of FRFI, as specified in Table 4.

Table 3. FEC and FEPM values for gene AB002380_at of the GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.t003

Table 4. Gene group significance and gene-gene severance values of the GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.t004

From the significance and severance values, an FI value for each gene is calculated using step (xii) of FRFI so that it maximizes the significance and minimizes the severance. The FI values of the first 100 genes are shown in Fig 4. There are variations among the FI values computed for each gene. The genes are arranged in descending order of FI values, and the top 50 of the 7129 genes are retained for further evaluation to achieve a good trade-off between significance and severance. Identification of the most significant genes among the initially filtered 50 genes is carried out using WSA, which aims to generate a minimal number of rules with fewer informative genes while classifying more samples correctly by means of the FRBMS.

Fig 4. The F-information (FI) values of first hundred genes for GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.g004

Each rule is found to take 52 varying integer numbers (1 for R, 50 for “G₁, G₂, G₃ … G_f,” 1 for C_l) as per the representation strategy given in Fig 3. The maximal number R_m of initial rules in the RS is determined heuristically by multiplying the number of classes (#Cat) in the dataset by 3 with the goal of obtaining at least a single rule for each category of cancer. For the GCM_RM dataset, 33 rules (11 × 3) are randomly initialized in the RS. Hence, the RS of GCM_RM contains 1716 integer numbers (33 × 52). Seven points are required to figure out the linguistic variables of every gene, and hence 350 floating-point numbers (7 × 50) are needed. The count of an integer variable differs from dataset to dataset depending on the number of cancer categories, whereas the count of a floating-point number is common for all the datasets.
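
The sizes quoted above follow directly from the encoding, as the short Python check below shows. The function name `encoding_size` and its parameter names are hypothetical; the defaults (three rules per class, seven free MF points per gene) are the values stated in the text.

```python
def encoding_size(n_classes, n_filtered_genes, rules_per_class=3,
                  free_mf_points=7):
    """Count the integers (rule set) and floats (MF points) in one solution."""
    rule_length = 1 + n_filtered_genes + 1        # R, G_1..G_f, C_l
    max_rules = n_classes * rules_per_class       # R_m
    return {
        "max_rules": max_rules,
        "rule_length": rule_length,
        "rule_integers": max_rules * rule_length,
        "mf_floats": free_mf_points * n_filtered_genes,
    }


if __name__ == "__main__":
    # GCM_RM: 11 tumor categories, 50 filtered genes.
    print(encoding_size(n_classes=11, n_filtered_genes=50))
```

Only `max_rules` and `rule_integers` depend on the number of classes, which is why the integer count changes between datasets while the floating-point count stays at 350 for 50 filtered genes.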

The size of the initial solution space is varied between 20 and 50. Each position of the water particle is evaluated using the objective function (1) while the number of iterations is varied from 10 to 100. The values of constants k₁ and k₂ in eq (1) are varied from two to five depending on the R_s and G_s obtained during a particular iteration. A maximum of 40 independent trials have been conducted by varying the water space as well as the number of iterations. The resulting performance of every particle inside the water is examined. The best results for the GCM_RM dataset were observed with a water space of 30 and between 80 and 100 iterations. A similar experiment was conducted for all other datasets used in this study. The 16 most significant genes selected in the RS from among the 50 filtered genes, along with their descriptions for identification of tumor categories, are presented in S1 Table. The rule set obtained for the GCM_RM dataset is presented in Table 5. Twenty-six rules were generated to achieve a classification accuracy of 96.45%.

Table 5. The rule set generated for the GCM_RM dataset by the FRFI-WSA method.

https://doi.org/10.1371/journal.pone.0167504.t005

In Table 6, the accession IDs of the most significant genes are presented along with the selected linguistic label and tumor category, which help to identify the genes causing the tumor. Furthermore, the GCM_RM dataset was examined with different numbers of initial rules per category (four, five, and six), which resulted in 44, 55, and 66 rules in the RS. The optimum genes selected in the different rule sets do not share a substantial common subset. Hence, it is understandable that various subsets of genes are selected for categorizing the classes of patients. Nevertheless, the genes selected beyond 20 to 100 in the RS yielded only a minor improvement (roughly 0.6%) in the classification. Hence, it could be said that the proposed approach shows robust performance with 26 generated rules because it utilizes 16 selected genes to classify 119 out of 123 samples in the GCM_RM dataset.

Table 6. Identification of the most significant genes and their linguistic label in the rule set for the classification of tumor categories for the GCM_RM dataset by FRFI-WSA.

https://doi.org/10.1371/journal.pone.0167504.t006

Empirical results

Performance comparison and evaluation metrics.

The performance of the proposed WSA approach was compared with the competing methods such as GA [20], PSO [21], and ABC [22] on all the datasets. A comparison in convergence between the proposed WSA for the GCM_RM dataset and other approaches is shown in Fig 5. It is noteworthy that the convergence of other approaches is worse than that of the proposed WSA approach. Although the other approaches based on GA, PSO, and ABC are relatively good at tuning the MF, they consume more generations to converge. It is clear in the figure that both ABC and WSA show an abrupt rise in the fitness value whereas the GA and PSO approaches showed only a steady increase in the fitness value. The reason could be the more tunable parameters.

Fig 5. Convergence comparison of WSA with other methods for GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.g005

In Table 7, a comparison is presented between the proposed WSA and the other methods for all datasets. For each dataset (DS), the table shows the classification accuracy (CA%), number of genes (#Gs), and central processing unit (CPU) time (CT). All methods perform reasonably well, but PSO appears to be a little faster than the others except WSA because of PSO’s simplified operations. Nonetheless, PSO did not produce a better solution than ABC did. Even though ABC is relatively good at producing interpretable rules, it consumes more CPU time due to the different phases of bee operations in generating simple rules. In contrast, the proposed WSA quickly reached the desired fitness value with a minimum number of most significant genes for all the binary and multiclass datasets used in this study. These results indicate that, with parameters properly tuned by WSA, the proposed approach can be extended to classify both binary and multiclass samples in cancer gene expression datasets.

Table 7. Comparison of the performance of the water swirl algorithm with existing methods on all datasets.

https://doi.org/10.1371/journal.pone.0167504.t007

The Monte-Carlo cross-validation (MCCV) method.

The generalization performance of the proposed approach was assessed using the MCCV method [53, 54]. The mean value of the error calculated for the GCM_RM dataset using MCCV is presented in Fig 6. One can see that the error rate diminishes as the number of genes rises at every trial. Nevertheless, beyond 16 genes, the error rate increases to some extent. Hence, it is clear that a reasonably limited set of genes is sufficient to categorize the diverse cancer classes competently. Thus, the proposed FRFI-WSA approach can effectively identify meaningful cancer-associated genes with great precision for the classification of the 11 tumor categories in the GCM_RM dataset. Similar generalization performance was observed in all other datasets used in this study.
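
For readers unfamiliar with MCCV, the procedure amounts to averaging the test error over many random train/test splits. The Python sketch below illustrates the idea only; the function name `monte_carlo_cv`, the number of trials, the test fraction, and the `fit_predict` callback (a stand-in for training and applying any classifier, such as the FRBMS) are assumptions, not settings reported in this study.

```python
import random


def monte_carlo_cv(samples, labels, fit_predict, n_trials=50,
                   test_fraction=0.25, seed=0):
    """Return the mean test error over repeated random train/test splits.

    fit_predict(train_x, train_y, test_x) must return one predicted label
    per test sample.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, int(round(test_fraction * len(samples))))
    errors = []
    for _ in range(n_trials):
        rng.shuffle(indices)                      # new random split each trial
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        wrong = sum(p != labels[i] for p, i in zip(predictions, test_idx))
        errors.append(wrong / n_test)
    return sum(errors) / len(errors)


if __name__ == "__main__":
    xs = list(range(20))
    ys = [int(x >= 10) for x in xs]
    # Toy classifier that ignores the training data and thresholds at 10.
    print(monte_carlo_cv(xs, ys, lambda tx, ty, qx: [int(x >= 10) for x in qx]))
```

Repeating the call while varying the number of genes passed to the classifier would give an error-versus-genes curve of the kind summarized in Fig 6.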

Fig 6. Generalization ability of WSA for GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.g006

Wilcoxon’s signed-rank test.

To evaluate significant differences in outcomes between the competing methods and the proposed approach, Wilcoxon’s signed-rank test [21] was used. Table 8 compares the results of the proposed approach with those of the other methods for gene selection and knowledge acquisition. In this table, “r+” denotes the number of times the first method is superior to the second, and “r-” denotes the cases in which the result is reversed. The null hypothesis “h” of the Wilcoxon test is rejected (rej) because ρ < α = 0.01 in all comparisons, which favor WSA owing to the difference between the r+ and r- values. The results indicate that the fuzzy lower approximation space for computing significance and severance values of genes delivers improvements in all metrics over the existing methods.
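
As an illustration of the test mechanics, the Python sketch below computes the positive and negative rank sums of the standard Wilcoxon signed-rank procedure for two paired lists of results. It follows the textbook definition (zero differences dropped, tied absolute differences given average ranks), which may differ in detail from how the r+ and r- columns of Table 8 were tabulated; the function name `wilcoxon_rank_sums` and the example accuracies are hypothetical.

```python
def wilcoxon_rank_sums(first, second):
    """Return (R+, R-) for paired observations of two methods.

    R+ sums the ranks of pairs where `first` is larger, R- the rest.
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the absolute differences are tied.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        average_rank = (pos + end) / 2 + 1        # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


if __name__ == "__main__":
    # Made-up accuracies of two methods on five datasets.
    print(wilcoxon_rank_sums([96, 90, 88, 95, 91], [92, 91, 85, 90, 91]))
```

A large gap between the two sums is what leads to a small p-value; in practice a library routine such as `scipy.stats.wilcoxon` can be used to obtain the p-value itself.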

Table 8. Comparison of the performance of the water swirl algorithm with existing methods by Wilcoxon’s signed rank test on all datasets.

https://doi.org/10.1371/journal.pone.0167504.t008

The receiver operating characteristics (ROC) curve.

The ROC curve was drawn to understand the strength of the proposed FRFI-WSA by plotting the true positive rate (TPR) against the false positive rate (FPR) at diverse cut points (Fig 7) [21, 55]. The ROC curve of the proposed approach lies near the upper left corner for all the datasets (for clear visualization, ROC curves are shown only for selected datasets). Our proposed approach has shown the highest sensitivity and specificity for all the datasets except for SRB and Car. Even though the proposed approach yields a lower value of the area under the curve (AUC) for the SRB and Car datasets, this shortcoming does not disqualify the proposed approach as a screening test for cancer diagnosis because its effect on performance is negligible.
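
The construction of an ROC curve and its AUC can be sketched in a few lines of Python for the binary case. This is a generic illustration, not the evaluation code used in this study: the function names are hypothetical, labels are assumed to be 0 or 1, and a multiclass dataset would need a one-versus-rest treatment that is not shown.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the cut point stepwise.

    A sample is called positive when its score is >= the cut point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
    print(pts, auc(pts))
```

A curve that hugs the upper left corner corresponds to an AUC close to 1, whereas a classifier no better than chance yields an AUC near 0.5.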

Fig 7. Receiver operating characteristics curve analysis for selected datasets by FRFI-WSA.

https://doi.org/10.1371/journal.pone.0167504.g007

Interpretability and gene ontology analysis.

Readability and comprehensibility [56] are the two key valuation metrics to assess the interpretability of rules. The former deals with the model description and is quantified using indices such as coverage of the rules (R_cov), accuracy of the rules (R_acc), goodness of the rules (R_gud), average rule length (A_rl), average fired rules (A_fr), and average confidence firing degree of the rules (A_cfd). Values of those indices for every generated rule/RS can be obtained using eqs (4) to (9), where N_con is the count of samples covered by rule R in the total number of samples #S, and N_pro is the count of samples properly classified by R in N_con. PCS_fd, NCS_fd, and TCS_fd are the firing degrees of positive, negative, and total covered samples, respectively. T_rl is the total rule length, i.e., the count of linguistic variables, T_fr is the total number of fired rules, A_fd is the average firing degree of a rule, and #R is the total number of rules. The values of the indices for all the datasets are reported in Table 9. Throughout the execution, the proposed WSA tunes the MF points of each gene so that there is a reasonable overlap among the curves of linguistics. WSA also tries to ignore the MF points that attempt to go out of range. Likewise, the semantic label gained for each gene results in a reasonable length for each rule so that it can be used compactly. The linguistic values (low, medium, and high) associated with each gene can help a physician to identify the patient’s distinct genomic contour to produce a verdict. The confidence about the average firing degree shows that the rules produced by WSA are fired more recurrently and have a tendency to be cofired with other rules. To avoid redundancy and to improve the compactness and interpretability without losing the classification accuracy, the rules with the lowest firing degrees are not included in this study.
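
Two of these indices can be illustrated with a short Python sketch. The ratios used here are assumptions read off the definitions of N_con, N_pro, and #S (coverage as N_con / #S and accuracy as N_pro / N_con) and are not taken from the paper's equations; the function name `rule_coverage_accuracy` and its inputs are hypothetical.

```python
def rule_coverage_accuracy(fired, predicted_class, actual):
    """Return (coverage, accuracy) of a single rule.

    fired[i] is True when the rule covers sample i; predicted_class is the
    rule's consequent; actual[i] is the true class of sample i.
    """
    n_total = len(actual)                          # #S
    n_con = sum(1 for f in fired if f)             # samples covered by the rule
    n_pro = sum(1 for f, y in zip(fired, actual)   # covered and correct
                if f and y == predicted_class)
    coverage = n_con / n_total
    accuracy = n_pro / n_con if n_con else 0.0     # guard: rule never fires
    return coverage, accuracy


if __name__ == "__main__":
    print(rule_coverage_accuracy([True, True, False, True], "A",
                                 ["A", "B", "A", "A"]))
```

Under these assumed ratios, a rule with high coverage but low accuracy fires often and misclassifies often, which is the kind of rule that readability analysis is meant to expose.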

Table 9. Reliability analysis of the rule set generated by FRFI-WSA in all datasets.

https://doi.org/10.1371/journal.pone.0167504.t009

Comprehensibility of the rules (which deals with the explanation of the system in terms of the inference complexity of the rules) is analyzed using information on the cofiring of rules. For each rule R, the numbers of instances fired individually (IF) and simultaneously (SF) with every neighboring rule are recorded to compute a cofiring measure, CF, using the following equation: (10)

Then the number of premises P in each rule is counted to compute the comprehensibility index (CI) using this equation: (11) where r is the total number of rules. Based on a heuristic threshold (T) between 0 and 1, the cofiring comprehensibility index (CFCI) is computed using eq (12) to capture both the implicit and the explicit semantics embedded in the fuzzy partitions and in the reasoning.

(12)
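The IF and SF counts that feed the cofiring measure can be tallied from a matrix of firing degrees. The minimal sketch below assumes that a rule fires on a sample when its firing degree exceeds a cutoff (the hypothetical parameter fire_cutoff, which is not the threshold T of eq (12)), and that "fired individually" means that no other rule fired on the same sample. It only collects the counts and does not implement eqs (10) to (12); the function name cofiring_counts and the toy matrix are illustrative.

```python
def cofiring_counts(firing, fire_cutoff=0.0):
    """Count individual and simultaneous firings.

    firing[s][r] is the firing degree of rule r on sample s.
    Returns (fired_alone, fired_together), where fired_alone[r] is the number
    of samples on which only rule r fired, and fired_together[i][j] is the
    number of samples on which rules i and j both fired.
    """
    n_rules = len(firing[0])
    fired_alone = [0] * n_rules
    fired_together = [[0] * n_rules for _ in range(n_rules)]
    for row in firing:
        # Rules whose firing degree exceeds the assumed cutoff for this sample
        active = [r for r, degree in enumerate(row) if degree > fire_cutoff]
        if len(active) == 1:
            fired_alone[active[0]] += 1
        for i in active:
            for j in active:
                if i != j:
                    fired_together[i][j] += 1
    return fired_alone, fired_together


# Toy firing-degree matrix: 4 samples (rows) by 3 rules (columns).
firing = [
    [0.8, 0.0, 0.0],
    [0.6, 0.3, 0.0],
    [0.0, 0.0, 0.9],
    [0.5, 0.4, 0.2],
]
alone, together = cofiring_counts(firing)
```

A symmetric pairwise matrix is used here so that the counts for any pair of neighboring rules can be read off directly, whichever form the cofiring measure takes.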

The details of this analysis are illustrated in Fig 8 for the rules of the GCM_RM dataset. All the rules generated by WSA, without any rule selection, were used to examine comprehensibility. Rule R₁₆ has the largest CFCI value, while rule R₁₃ has the smallest. We found that the majority of the samples fire rules in the region between R₁ and R₉. Because R₁₆ and R₃₂ cover many problem instances, they overlap with the rules between R₁ and R₉. Linguistic simplification is carried out by combining rules R₂₆ and R₂₇, which show similar CFCI values. As anticipated, the evidence related to the new fused rules differs from that of the FRBMS with the complete RS. Likewise, certain rules are eliminated to fine-tune the system performance. We found that the accuracy of the system improved after elimination of rules R₁₃, R₂₀, R₂₂, R₂₃, and R₂₅. The interpretability analysis confirmed that the rules produced by the proposed WSA for all the datasets are transparent and comprehensible and meet the requirements for understanding cancer gene expression data.

Fig 8. Comprehensibility of the generated rules by WSA for GCM_RM dataset.

https://doi.org/10.1371/journal.pone.0167504.g008

In gene expression-based cancer diagnosis, in addition to finding the subset of potential genes causing the cancer, the researcher is expected to trace the characteristics of the causative genes in terms of their role in multiple cancer classes [57]. The GOSim package in the R platform [22] was used to compute the similarity value for the genes identified in the GCM_RM dataset using the GO terms. It is noteworthy that the genes are related to DNA metabolism and are enriched only in the categories of repair, positive regulation, reduction, cell size, development, and assembly. The nitrogen compound metabolic process of gene AFFX-CreX-3_st has an "is a" relation with GO:0006328 and is involved in a DNA metabolic process.

The primary metabolic process of AB000464_at has an "is a" relation with GO:004891, and the cellular process of AFFX-PheX-3_at has an "is a" relation with GO:006813. The process of cellular nitrogen compound metabolism relevant to Z49107_s_at has a "part of" relation with GO:000524 and GO:004271. The process of nucleobase-containing compound metabolism relevant to M33336_at has a "part of" relation with GO:013608 and GO:0044167. It was confirmed that the selected genes are involved in a DNA metabolic process and encode proteins associated with critical substances implicated in cancer. Such substances promote angiogenesis, help to evade apoptosis, increase differences from normal tissues, and enhance independent progression signals, all of which aid the prediction of cancer. Furthermore, the biological processes are consistent with the molecular activities that occur in active and proliferating cancer cells. The differential regulation of the genes selected by the proposed procedure characterizes the extracellular environments that are important for understanding the communication between cells. Because most of the cancer genes identified by the latest technology do not have entries in the GO database, it is not feasible to construct similarity relations between cancer genes for all the datasets used in this study. Overall, the discriminative power of the selected genes and their linguistics in the proposed model is sufficient to detect samples of a certain type of cancer and then to quickly rule out healthy samples.

Discussion

In this study, we propose a new combined FRFI-WSA approach for designing an FRBMS to analyze gene expression data for cancer diagnosis. The WSA method showed the highest classification accuracy for detection of cancer genes in comparison with the GA, PSO, and ABC algorithms (Fig 5). Furthermore, the proposed approach showed the highest diagnostic sensitivity and specificity in the ROC analysis of classification performance. The superior performance of FRFI-WSA can be attributed to the gene filtering step, which maximizes the gene-class relevance, minimizes the gene-gene redundancy, and arranges genes in increasing order of the FI values without depending on the classifier model. In addition, the most relevant genes associated with cancer were identified by the WSA, which attempts to optimize the RS and MF required for classification of samples using an FRBMS.

The combined FRFI-WSA approach quickly attained the desired fitness value with a shorter computing time and a minimal number of rules for identification of the most significant cancer genes in comparison with the GA [20] and PSO [21] techniques (Table 7). This is probably because WSA is based on the novel strength and position update eqs (2) and (3), respectively, and simplifies operations with few or no parameters, thus rapidly extracting the RS and MFs. The fuzzy model integrated into GA reported in reference [19] deals only with binary data and uses a wide range of genes for classification of cancer genes. Moreover, it was also demonstrated that finding an optimal number of genes for multiclass problems is more beneficial for diagnosis of cancer. In another study [14], an ensemble combinatorial search was integrated into a single-objective GA to optimize the ensemble technique for classifying class-imbalanced datasets. Nonetheless, a single-objective GA tends to settle on solutions close to a local optimum, and hence its average error is much greater than that of the proposed approach, which searches for globally optimal solutions for the classification. Hence, the proposed FRFI-WSA approach can effectively identify the most relevant genes associated with cancer (16 genes) with high classification accuracy (96.45%) and generate understandable compact rules with fewer parameters for the classification of multiclass cancer categories. The cross-validated classification performance of the FRFI-WSA also indicated that the two levels of gene selection implemented in this approach can exclude some of the noisy genes that worsen classification performance.

The optimization using WSA in the present study effectively extracts a comprehensible RS (26 rules) and understandable linguistics for an MF for classifying the multiclass cancer samples. These findings are also supported by another study [14], in which repeated tuning of an MF and RS by the optimization method helped to address the challenges of dimensionality and multiple-class imbalanced data for optimal classification. One limitation of this study is the lack of previous studies applying WSA to gene selection and RS generation on multiclass gene expression datasets, which makes it difficult to compare our results directly. Although the proposed model is better at identifying genes that are strongly responsible for distinguishing different types of cancer, it is time-consuming, particularly in generating fuzzy equivalence classes. In the future, the complexity of generating a fuzzy equivalence class by the FRFI method could be reduced by evaluating the Cartesian product using a fuzzy lower approximation for more rapid selection of a smaller subset of genes without any skewness toward multicategory data. Nevertheless, the proposed classifier model based on gene expression datasets extracted the most relevant genes associated with cancer by the WSA method. Furthermore, other global optimization techniques such as genetic swarm and ant bee algorithms could be combined with the WSA method to generalize the interpretable rules with the most relevant genes for cancer. In addition, further study is needed to verify the performance of the proposed approach on gene expression data generated from other platforms such as Illumina and Agilent.

Our study revealed that the FR implemented here computes the FI without losing the biological meaning of the gene expression and should be helpful for identification of potential genes. Next, the WSA method produces highly interpretable rules and classifies the maximal number of samples using an FRBMS better than the existing methods reported in the literature [14, 19–21]. Thus, the two levels of gene selection implemented in this study result in an efficient diagnostic system with lower complexity. Furthermore, the proposed approach reduces the computational cost and improves the classification accuracy of an FRBMS. In addition, the highest sensitivity and specificity in the selected multiclass datasets strongly indicate that the new FRFI-WSA approach is practically useful for construction of an effective system for making diagnostic decisions about cancer.

Supporting Information

S1 Appendix. The detailed concepts of the fuzzy set, rough set, fuzzy rough set and f-information.

https://doi.org/10.1371/journal.pone.0167504.s001

(PDF)

S1 Table. Identification of the most significant genes along with their descriptions for the GCM_RM dataset by FRFI-WSA.

https://doi.org/10.1371/journal.pone.0167504.s002

(DOC)

Acknowledgments

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI15C0001), and by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2015M2A2A7A01045177).

Author Contributions

Conceptualization: PGK MSK.
Data curation: PGK MSK.
Formal analysis: PGK MSK.
Funding acquisition: BCA.
Investigation: PGK MSK.
Methodology: PGK MSK.
Project administration: PGK MSK.
Resources: PGK MSK.
Software: PGK.
Supervision: MSK BCA.
Validation: PGK MSK BCA.
Visualization: PGK MSK.
Writing – original draft: PGK MSK.
Writing – review & editing: PGK MSK BCA.

References

1. Willenbrock H, Salomon J, Søkilde R, Barken KB, Hansen TN, Nielsen FC, Møller S, Litman T. Quantitative miRNA expression analysis: comparing microarrays with next-generation sequencing. RNA. 2009; 15(11):2028–2034. pmid:19745027
2. Liu L, So ASL, Fan J-B. Analysis of cancer genomes through microarrays and next generation sequencing. Translational Cancer Research. 2015; 4(3):212–218.
3. Zhang R, Huang G-B, Sundararajan N, Saratchandran P. Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2007; 4(3):485–495. pmid:17666768
4. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001; 7:673–679. pmid:11385503
5. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002; 97(457):77–87.
6. Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, et al. Molecular classification of multiple tumor types. Bioinformatics. 2001; 17 (Suppl 1).
7. Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, USA. 1999; 96:6745–6750.
8. Komori O, Pritchard M, Eguchi S. Multiple suboptimal solutions for prediction rules in gene expression data. Computational and Mathematical Methods in Medicine. 2013: 798189: 14pp. pmid:23662163
9. Schaefer G, Nakashima T. Data mining of gene expression data by fuzzy and hybrid fuzzy methods. IEEE Transactions on Information Technology in Biomedicine 2010; 14(1): 23–29. pmid:19846381
10. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, USA. 2001; 98:15149–15154.
11. Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research. 2003; 63:1602–1607. pmid:12670911
12. Czajkowski M, Grzes M, Kretowski M. Multi-test decision tree and its application to microarray data classification. Artificial Intelligence in Medicine. 2014; 61(1):35–44. pmid:24630712
13. Moon H, Ahn H, Kodell RL, Baek S, Lin CJ, Chen JJ. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine. 2007; 41:197–207. pmid:17719213
14. Haque MN, Noman N, Berretta R, Moscato P. Heterogeneous ensemble combination search using genetic algorithm for class imbalanced data classification. PLoS One. 2016; 11(1):1–28.
15. Tan A, Naiman D, Xu L, Winslow R, Geman D. Simple decision rules for classifying human cancer from gene expression profiles. Bioinformatics. 2005; 21:3896–3904. pmid:16105897
16. Yoon Y, Bien S, Park S. Microarray data classifier consisting of k-top-scoring rank-comparison decision rules with a variable number of genes. IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews. 2010; 40(2):216–226.
17. Alizadeh A, Eisen M, Davis E, Ma C, Loossos I, Rosenwald A, et al. Different types of diffuse large B-cell lymphoma identified by gene expression profiles. Nature. 2000; 403:503–511. pmid:10676951
18. Kraan VPTC, Wijbrands CA, Van Baarsen LG, Voskuyl AE, Rustenburg F, Baggen JM, et al. Rheumatoid arthritis subtypes identified by genomic profiling of peripheral blood cells: Assignment of a type I interferon signature in a subpopulation of patients. Annals of the Rheumatic Diseases. 2007; 66:1008–1014. pmid:17223656
19. Nguyen T, Khosravi A, Creighton D, Nahavandi S. Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification. PLoS One. 2015; 10(3):1–23.
20. Ho S-Y, Hsieh C-H, Chen H-M, Huang H-L. Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis. BioSystems. 2006; 85: 165–176. pmid:16490299
21. Xu R, Anagnostopoulos GC, Wunsch DC 2nd. Multiclass cancer classification using semisupervised ellipsoid artmap and particle swarm optimization with gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2007; 4(1):65–77.
22. Ganesh Kumar P, Rani C, Devaraj D, Albert Victoire AT. Hybrid ant bee algorithm for fuzzy expert system based sample classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014; 11(2):347–360. pmid:26355782
23. Lee CS, Wang MH. A Fuzzy expert system for diabetes decision support application. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. 2011; 41(1):139–153.
24. Tsipouras MG, Voglis C, Fotiadis DI. A framework for fuzzy expert system creation–application to cardiovascular diseases. IEEE Transactions on Biomedical Engineering. 2007; 54(11):2089–2105. pmid:18018705
25. Lee Y, Lee CK. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 2003; 19:1132–1139. pmid:12801874
26. Trawinski K, Cordon O, Quirin A. On designing fuzzy rule based multiclassification systems by combining furia with bagging and feature selection. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2011; 19(4):589–633.
27. Trawinski K, Cordon O, Sanchez L, Quirin A. A genetic fuzzy linguistic combination method for fuzzy rule based multiclassifiers. IEEE Transactions on Fuzzy Systems. 2013; 21(5):950–965.
28. Sundaresh S, Hung S, Hatfield WG, Baldi P. How noisy and replicable are DNA microarray data. International Journal of Bioinformatics Research and Applications. 2005; 1(1):31–50.
29. Pamukcu E, Bozdogan H, Calik S. A novel hybrid dimension reduction technique for undersized high dimensional gene expression data sets using information complexity criterion for cancer classification. Computational and Mathematical Methods in Medicine. 2015; 370640: 14 pp. pmid:25838836
30. Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002; 1:133–143. pmid:12086872
31. Hippo Y, Taniguchi H, Tsutsumi S, Machida N, Chong JM, Fukayama M, et al. Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Research 2002; 62(1):233–40. pmid:11782383
32. Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, et al. Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences, USA. 2002; 99(7):4447–4465.
33. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, et al. Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature. 2002; 415:436–442. pmid:11807556
34. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics. 2002; 30:41–47. pmid:11731795
35. Risinger JI, Maxwell GL, Chandramouli GVR, Jazaeri A, Aprelikova O, Patterson T, et al. Microarray analysis reveals distinct gene expression profiles among different histologic types of endometrial cancer. Cancer Research. 2003; 63:6–11. pmid:12517768
36. Li J, Liu H. (2002). Kent ridge biomedical data set repository. [http://research.i2r.a-star.edu.sg/rp].
37. Dyrskjot L, Thykjaer T, Kruhoffer M, Jensen JL, Marcussen N, et al. Identifying distinct classes of bladder carcinoma using microarrays. Nature Genetics. 2003; 33(1): 90–96. pmid:12469123
38. Yeung K, Bumgarner R. Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biology. 2003; 4(12):R83. pmid:14659020
39. Dehan E, Ben-Dor A, Liao W, Lipson D, Frimer H, Rienstein S, et al. Chromosomal aberrations and gene expression profiles in non-small cell lung cancer. Lung Cancer. 2007; 56(2):175–84. http://dx.doi.org/10.1016/j.lungcan.2006.12.010 pmid:17258348
40. Gordon GJ, Jensen RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research. 2002; 62: 4963–4967. pmid:12208747
41. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002; 203–209. pmid:12086878
42. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 2002; 359:572–577. pmid:11867112
43. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine. 2002; 8:68–74. pmid:11786909
44. Cromer A, Carles A, Millon R, Ganguli G, Chalmel F, Lemaire F, et al. Identification of genes associated with tumorigenesis and metastatic potential of hypopharyngeal cancer by microarray analysis. Oncogene. 2004; 23(14):2484–2498. pmid:14676830
45. Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, et al. Patterns of resistance and incomplete response to docetaxel by gene expression profiling in breast cancer patients. Journal of Clinical Oncology 2005; 23(6):1169–77. pmid:15718313
46. Chowdary D, Lathrop J, Skelton J, Curtin K, Briggs T, Zhang Y, et al. Prognostic gene expression signatures can be measured in tissues collected in RNA later preservative. Journal of Molecular Diagnostics. 2006; 8(1): 31–39. pmid:16436632
47. Laiho P, Kokko A, Vanharanta S, Salovaara R, Sammalkorpi H, Jarvinen H, et al. Serrated carcinomas form a subclass of colorectal cancer with distinct molecular basis. Oncogene. 2007; 26(2):312–320. pmid:16819509
48. National Centre for Biotechnology Information (NCBI) (2009), U.S. National Library of Medicine, http://www.ncbi.nlm.nih.gov.
49. Ganesh Kumar P, Rani C, Mahibha D, Albert Victoire AT. Fuzzy-rough-neural-based f-information for gene selection and sample classification. International Journal of Data Mining and Bioinformatics. 2015; 11(1):31–52. pmid:26255375
50. Hu Q, Zhang L, An S, Zhang D, Yu D. On robust fuzzy rough set models. IEEE Transactions on Fuzzy Systems. 2012; 20(4): 636–651.
51. Cengel Y, Cimbala J. Fluid Mechanics Fundamentals and Applications. Mcgraw-Hill, New York. 2006.
52. Menser S, Hereford J. A new optimization technique. Proceedings of IEEE Southeast Conference. 2006; 250–255.
53. Saito PT, Nakamura RY, Amorim WP, Papa JP, de Rezende PJ, Falcão AX. Choosing the most effective pattern classification model under learning-time constraint. PLoS One. 2015; 10(6):1–23.
54. Picard RR, Cook RD. Cross-validation of regression models. Journal of the American Statistical Association. 1984; 79(387):575–583.
55. Peterson LE, Coleman MA. Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research. International Journal of Approximate Reasoning. 2008; 47(1):17–36. pmid:19079753
56. Pancho DP, Alonso JM, Cordon O, Quirin A, Magdalena L. FINGRAMS: Visual representations of fuzzy rule-based inference for expert analysis of comprehensibility. IEEE Transactions on Fuzzy Systems. 2013; 21(6):1133–1149.
57. Whitworth J, Hoffman J, Chapman C, Ong KR, Lalloo F, Evans DG, Maher ER. A clinical and genetic analysis of multiple primary cancer referrals to genetics services. European Journal of Human Genetics. 2015; 23(5):581–587. pmid:25248401

[ref1] 1. Willenbrock H, Salomon J, Søkilde R, Barken KB, Hansen TN, Nielsen FC, Møller S, Litman T. Quantitative miRNA expression analysis: comparing microarrays with next-generation sequencing. RNA. 2009; 15(11):2028–2034. pmid:19745027
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Liu L, So ASL, Fan J-B. Analysis of cancer genomes through microarrays and next generation sequencing. Translational Cancer Research. 2015; 4(3):212–218.
View Article
Google Scholar

[6] View Article

[7] Google Scholar

[ref3] 3. Zhang R, Huang G-B, Sundararajan N, and Saratchandran P. Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2007; 4(3):485–495. pmid:17666768
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001; 7:673–679. pmid:11385503
View Article
PubMed/NCBI
Google Scholar

[13] View Article

[14] PubMed/NCBI

[15] Google Scholar

[ref5] 5. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002; 97(457):77–87.
View Article
Google Scholar

[17] View Article

[18] Google Scholar

[ref6] 6. Yeang CH, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin RM, Angelo M, et al. Molecular classification of multiple tumor types. Bioinformatics. 2001; 17 (Suppl 1).
View Article
Google Scholar

[20] View Article

[21] Google Scholar

[ref7] 7. Alon U, Barkai N, Notternman D, Gish K, Ybarra S, Mack D, Levine A. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academic of Sciences, USA. 1999; 96: 6745–6750.
View Article
Google Scholar

[23] View Article

[24] Google Scholar

[ref8] 8. Komori O, Pritchard M, Eguchi S. Multiple suboptimal solutions for prediction rules in gene expression data. Computational and Mathematical Methods in Medicine. 2013: 798189: 14pp. pmid:23662163
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref9] 9. Schaefer G, Nakashima T. Data mining of gene expression data by fuzzy and hybrid fuzzy methods. IEEE Transactions on Information Technology in Biomedicine 2010; 14(1): 23–29. pmid:19846381
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref10] 10. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, USA. 2001; 98:15149–15154.
View Article
Google Scholar

[34] View Article

[35] Google Scholar

[ref11] 11. Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, et. al., Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research. 2003; 63:1602–1607. pmid:12670911
View Article
PubMed/NCBI
Google Scholar

[37] View Article

[38] PubMed/NCBI

[39] Google Scholar

[ref12] 12. Czajkowski M, Grzes M, Kretowski M. Multi-test decision tree and its application to microarray data classification. Artificial Intelligence in Medicine. 2014; 61(1):35–44. pmid:24630712
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref13] 13. Moon H, Ahn H, Kodell RL, Baek S, Lin CJ, Chen JJ. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine. 2007; 41:197–207. pmid:17719213
View Article
PubMed/NCBI
Google Scholar

[45] View Article

[46] PubMed/NCBI

[47] Google Scholar

[ref14] 14. Haque MN, Noman N, Berretta R, Moscato P. Heterogeneous ensemble combination search using genetic algorithm for class imbalanced data classification. PLoS One. 2016; 11(1):1–28.
View Article
Google Scholar

[49] View Article

[50] Google Scholar

[ref15] 15. Tan A, Naiman D, Xu L, Winslow R, Geman D. Simple decision rules for classifying human cancer from gene expression profiles. Bioinformatics. 2005; 21:3896–3904. pmid:16105897
View Article
PubMed/NCBI
Google Scholar

[52] View Article

[53] PubMed/NCBI

[54] Google Scholar

[ref16] 16. Yoon Y, Bien S, Park S. Microarray data classifier consisting of k-top-scoring rank-comparison decision rules with a variable number of genes. IEEE Transactions on Systems, Man, and Cybernetics–Part C: Applications and Reviews. 2010; 40(2):216–226.
View Article
Google Scholar

[56] View Article

[57] Google Scholar

[ref17] 17. Alizadeh A, Eisen M, Davis E, Ma C, Loossos I, Rosenwald A, et al. Different types of diffuse large B-cell lymphoma identified by gene expression profiles. Nature. 2000; 403:503–511. pmid:10676951
View Article
PubMed/NCBI
Google Scholar

[59] View Article

[60] PubMed/NCBI

[61] Google Scholar

[ref18] 18. Kraan VPTC, Wijbrands CA, Van Baarsen LG, Voskuyl AE, Rustenburg F, Baggen JM, et al. Rheumatoid arthritis subtypes identified by genomic profiling of peripheral blood cells: Assignment of a type I interferon signature in a subpopulation of patients. Annals of the Rheumatic Diseases. 2007; 66:1008–1014. pmid:17223656
View Article
PubMed/NCBI
Google Scholar

[63] View Article

[64] PubMed/NCBI

[65] Google Scholar

[ref19] 19. Nguyen T, Khosravi A, Creighton D, Nahavandi S. Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification. PLoS One. 2015; 10(3):1–23.
View Article
Google Scholar

[67] View Article

[68] Google Scholar

[ref20] 20. Ho S-Y, Hsieh C-H, Chen H-M, Huang H-L. Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis. BioSystems. 2006; 85: 165–176. pmid:16490299
View Article
PubMed/NCBI
Google Scholar

[70] View Article

[71] PubMed/NCBI

[72] Google Scholar

[ref21] 21. Xu R, Anagnostopoulos GC, Wunsch DC 2nd. Multiclass cancer classification using semisupervised ellipsoid artmap and particle swarm optimization with gene expression data. IEEE/ACM Transactions on computational biology and informatics. 2007; 4(1):65–77.
View Article
Google Scholar

[74] View Article

[75] Google Scholar

[ref22] 22. Ganesh Kumar P, Rani C, Devaraj D, Albert Victoire AT. Hybrid ant bee algorithm for fuzzy expert system based sample classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014; 11(2):347–360. pmid:26355782
View Article
PubMed/NCBI
Google Scholar

[77] View Article

[78] PubMed/NCBI

[79] Google Scholar

[ref23] 23. Lee CS, Wang MH. A Fuzzy expert system for diabetes decision support application. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. 2011; 41(1):139–153.

[ref24] 24. Tsipouras MG, Voglis C, Fotiadis DI. A framework for fuzzy expert system creation–application to cardiovascular diseases. IEEE Transactions on Biomedical Engineering. 2007; 54(11):2089–2105. pmid:18018705

[ref25] 25. Lee Y, Lee CK. Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics. 2003; 19:1132–1139. pmid:12801874

[ref26] 26. Trawinski K, Cordon O, Quirin A. On designing fuzzy rule-based multiclassification systems by combining FURIA with bagging and feature selection. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 2011; 19(4):589–633.

[ref27] 27. Trawinski K, Cordon O, Sanchez L, Quirin A. A genetic fuzzy linguistic combination method for fuzzy rule based multiclassifiers. IEEE Transactions on Fuzzy Systems. 2013; 21(5):950–965.

[ref28] 28. Sundaresh S, Hung S, Hatfield WG, Baldi P. How noisy and replicable are DNA microarray data? International Journal of Bioinformatics Research and Applications. 2005; 1(1):31–50.

[ref29] 29. Pamukcu E, Bozdogan H, Calik S. A novel hybrid dimension reduction technique for undersized high dimensional gene expression data sets using information complexity criterion for cancer classification. Computational and Mathematical Methods in Medicine. 2015; 370640: 14 pp. pmid:25838836

[ref30] 30. Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002; 1:133–143. pmid:12086872

[ref31] 31. Hippo Y, Taniguchi H, Tsutsumi S, Machida N, Chong JM, Fukayama M, et al. Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Research. 2002; 62(1):233–240. pmid:11782383

[ref32] 32. Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, et al. Large-scale analysis of the human and mouse transcriptomes. Proceedings of the National Academy of Sciences, USA. 2002; 99(7):4447–4465.

[ref33] 33. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, et al. Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature. 2002; 415:436–442. pmid:11807556

[ref34] 34. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics. 2002; 30:41–47. pmid:11731795

[ref35] 35. Risinger JI, Maxwell GL, Chandramouli GVR, Jazaeri A, Aprelikova O, Patterson T, et al. Microarray analysis reveals distinct gene expression profiles among different histologic types of endometrial cancer. Cancer Research. 2003; 63:6–11. pmid:12517768

[ref36] 36. Li J, Liu H. Kent Ridge biomedical data set repository. 2002. http://research.i2r.a-star.edu.sg/rp

[ref37] 37. Dyrskjot L, Thykjaer T, Kruhoffer M, Jensen JL, Marcussen N, et al. Identifying distinct classes of bladder carcinoma using microarrays. Nature Genetics. 2003; 33(1):90–96. pmid:12469123

[ref38] 38. Yeung K, Bumgarner R. Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biology. 2003; 4(12):R83. pmid:14659020

[ref39] 39. Dehan E, Ben-Dor A, Liao W, Lipson D, Frimer H, Rienstein S, et al. Chromosomal aberrations and gene expression profiles in non-small cell lung cancer. Lung Cancer. 2007; 56(2):175–84. http://dx.doi.org/10.1016/j.lungcan.2006.12.010 pmid:17258348

[ref40] 40. Gordon GJ, Jensen RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research. 2002; 62:4963–4967. pmid:12208747

[ref41] 41. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002; 1:203–209. pmid:12086878

[ref42] 42. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 2002; 359:572–577. pmid:11867112

[ref43] 43. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine. 2002; 8:68–74. pmid:11786909

[ref44] 44. Cromer A, Carles A, Millon R, Ganguli G, Chalmel F, Lemaire F, et al. Identification of genes associated with tumorigenesis and metastatic potential of hypopharyngeal cancer by microarray analysis. Oncogene. 2004; 23(14):2484–2498. pmid:14676830

[ref45] 45. Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, et al. Patterns of resistance and incomplete response to docetaxel by gene expression profiling in breast cancer patients. Journal of Clinical Oncology. 2005; 23(6):1169–1177. pmid:15718313

[ref46] 46. Chowdary D, Lathrop J, Skelton J, Curtin K, Briggs T, Zhang Y, et al. Prognostic gene expression signatures can be measured in tissues collected in RNAlater preservative. Journal of Molecular Diagnostics. 2006; 8(1):31–39. pmid:16436632

[ref47] 47. Laiho P, Kokko A, Vanharanta S, Salovaara R, Sammalkorpi H, Jarvinen H, et al. Serrated carcinomas form a subclass of colorectal cancer with distinct molecular basis. Oncogene. 2007; 26(2):312–320. pmid:16819509

[ref48] 48. National Center for Biotechnology Information (NCBI) (2009), U.S. National Library of Medicine, http://www.ncbi.nlm.nih.gov.

[ref49] 49. Ganesh Kumar P, Rani C, Mahibha D, Albert Victoire AT. Fuzzy-rough-neural-based f-information for gene selection and sample classification. International Journal of Data Mining and Bioinformatics. 2015; 11(1):31–52. pmid:26255375

[ref50] 50. Hu Q, Zhang L, An S, Zhang D, Yu D. On robust fuzzy rough set models. IEEE Transactions on Fuzzy Systems. 2012; 20(4):636–651.

[ref51] 51. Cengel Y, Cimbala J. Fluid Mechanics Fundamentals and Applications. McGraw-Hill, New York. 2006.

[ref52] 52. Menser S, Hereford J. A new optimization technique. Proceedings of IEEE Southeast Conference. 2006; 250–255.

[ref53] 53. Saito PT, Nakamura RY, Amorim WP, Papa JP, de Rezende PJ, Falcão AX. Choosing the most effective pattern classification model under learning-time constraint. PLoS One. 2015; 10(6):1–23.

[ref54] 54. Picard RR, Cook RD. Cross-validation of regression models. Journal of the American Statistical Association. 1984; 79(387):575–583.

[ref55] 55. Peterson LE, Coleman MA. Machine learning-based receiver operating characteristic (ROC) curves for crisp and fuzzy classification of DNA microarrays in cancer research. International Journal of Approximate Reasoning. 2008; 47(1):17–36. pmid:19079753

[ref56] 56. Pancho DP, Alonso JM, Cordon O, Quirin A, Magdalena L. FINGRAMS: Visual representations of fuzzy rule-based inference for expert analysis of comprehensibility. IEEE Transactions on Fuzzy Systems. 2013; 21(6):1133–1149.

[ref57] 57. Whitworth J, Hoffman J, Chapman C, Ong KR, Lalloo F, Evans DG, Maher ER. A clinical and genetic analysis of multiple primary cancer referrals to genetics services. European Journal of Human Genetics. 2015; 23(5):581–587. pmid:25248401
