Automated Detection of Cancer Associated Genes Using a Combined Fuzzy-Rough-Set-Based F-Information and Water Swirl Algorithm of Human Gene Expression Data

This study describes a novel approach to reducing the challenges of highly nonlinear multiclass gene expression values for cancer diagnosis. To build a fruitful system for cancer diagnosis, in this study, we introduced two levels of gene selection such as filtering and embedding for selection of potential genes and the most relevant genes associated with cancer, respectively. The filter procedure was implemented by developing a fuzzy rough set (FR)-based method for redefining the criterion function of f-information (FI) to identify the potential genes without discretizing the continuous gene expression values. The embedded procedure is implemented by means of a water swirl algorithm (WSA), which attempts to optimize the rule set and membership function required to classify samples using a fuzzy-rule-based multiclassification system (FRBMS). Two novel update equations are proposed in WSA, which have better exploration and exploitation abilities while designing a self-learning FRBMS. The efficiency of our new approach was evaluated on 13 multicategory and 9 binary datasets of cancer gene expression. Additionally, the performance of the proposed FRFI-WSA method in designing an FRBMS was compared with existing methods for gene selection and optimization such as genetic algorithm (GA), particle swarm optimization (PSO), and artificial bee colony algorithm (ABC) on all the datasets. In the global cancer map with repeated measurements (GCM_RM) dataset, the FRFI-WSA showed the smallest number of 16 most relevant genes associated with cancer using a minimal number of 26 compact rules with the highest classification accuracy (96.45%). In addition, the statistical validation used in this study revealed that the biological relevance of the most relevant genes associated with cancer and their linguistics detected by the proposed FRFI-WSA approach are better than those in the other methods. 
The simple interpretable rules with most relevant genes and effectively classified samples suggest that the proposed FRFI-WSA approach is reliable for classification of an individual’s cancer gene expression data with high precision and therefore it could be helpful for clinicians as a clinical decision support system.


Introduction
The abundance of genes expressed in microarray experiments requires a long computation time and results in complex output for an FRBMS. To implement an FRBMS for a gene expression-based cancer diagnosis problem, identification of most relevant genes associated with cancer from the large set of genes is mandatory [4,15]. The purpose of this newly proposed combined fuzzy-rough-set-based f-information & water swirl algorithm (FRFI-WSA) approach was to design an FRBMS for analyzing gene expression data for cancer diagnosis. For an effective cancer diagnostic system, two levels of gene selection (by filtering and embedding procedures using 22 cancer gene expression datasets collected from various sources) were introduced. Next, we conducted a comparison of the performance of the proposed FRFI-WSA with GA, PSO, and artificial bee colony algorithm (ABC) for cancer gene expression datasets.

Cancer gene expression datasets
Table 1 lists the 22 gene expression datasets used in this study, with the name, number of genes (#Genes), samples (#Sam), and categories (#Cat) of each dataset along with its source and type. To evaluate the performance of the proposed algorithm irrespective of the number of output classes, 13 multiclass and 9 binary datasets were used. All the datasets were generated using oligonucleotide-based technology where RNA was hybridized using Affymetrix arrays HG-U95/Hu6800/HuGeneFL/Hu35K. The gene expression values of all the datasets were computed using the Affymetrix GENECHIP MAS 4.0 analysis software. The data on small round blue cell tumors (SRBCTs), NCI60 (National Cancer Institute), and Lymphoma were acquired using a two-color cDNA platform with successive image analysis by means of the DeArray Software. To summarize, the 22 datasets included in our experiments each have 2-11 distinct diagnostic categories, 24-253 samples (patients), and 182-54614 genes collected from different tissues under different experimental conditions. The number of samples per class is highly sparse and imbalanced (varies from 6 to 579).

Proposed architecture for analyzing cancer gene expression data
Clinicians analyzing gene expression data for prediction of cancer face several problems: a limited number of patients (scarcity), samples skewed in favor of one group (disparity), a huge number of genes (dimensionality), and many categories of cancer (multiclass) [30-33]. To overcome these drawbacks, problem-specific computational techniques for multiclass cancer diagnosis were developed here. As shown in Fig 1, the implementation procedure of the proposed combinatorial approach can be viewed in seven phases. The first phase reads the input data into the FRFI method, which finds the candidate genes among the huge number of genes using the steps presented in Fig 1. The candidate genes are then fed into the FRBMS in the second phase to find the initial points for the membership function (MF) and rule set (RS). In the third phase, these initial MF points and RS are read into the WSA to generate a population of points as water particle positions. The generated points are submitted to the inference procedure of the FRBMS in the fourth phase to compute the number of correctly classified samples (Cs), the number of selected rules (Rs), and the number of selected informative genes (Gs). The parameters Cs, Rs, and Gs calculated in the FRBMS are then input to the WSA in the fifth phase for evaluating the objective function, which determines the optimality of the generated water particle's position as a knowledge base. If the optimality criteria are not met, then in the sixth phase the water particle's strength and position are updated accordingly to generate a useful knowledge base, which results in improved classification of samples. The fifth and sixth phases are repeatedly executed until the desired convergence criterion is achieved. In the final phase, acceptable classification accuracy with interpretable knowledge is generated in the form of if-then conditional statements that help to identify the cancer-causing genes.
The details of the subcomponents of the proposed architecture are given below.

FRFI
Regardless of the dimensionality issue, the fuzzy rough set (FR) [49] effectively calculates the redundancy (severance) as well as relevance (significance) using f-information (FI) without discretizing the continuous gene expression values. The detailed concepts of the fuzzy set, rough set, fuzzy rough set, and f-information are presented in S1 Appendix. Even though the FR offers a regimented means for FI-based gene selection, it becomes inadequate for the noisiness and poor dispersal of multiclass samples. Hence, it was upgraded with a fuzzy lower approximation [50] to compute FI extrinsically to filter a gene subset. Given an n × m matrix of gene expression data with "m" gene vectors, the goal of gene filtering is to produce an n × f gene expression data matrix with "f" filtered gene vectors, where f < m. The steps involved in computing FI using the FR are as follows.

7. Calculate the membership value in the lower fuzzy approximation spaces for each gene G_i×j.
11. Now, the Gene-Gene severance between F_sig and the remaining genes G_rem is calculated.

It is expected that the proposed method of fuzzifying the criterion function of FI with a rough set can filter genes extrinsically in a way similar to human intervention into gene identification.

FRBMS
The filtered candidate genes from the FRFI method are partitioned into linguistics to generate the MF and RS points. As shown in Fig 2, this study uses three partitions, namely low ("L"), medium ("M"), and high ("H"), and thus nine membership points (P1, P2, P3, P4, P5, P6, P7, P8, and P9) are required to encode each candidate gene. P1 and P9 are fixed to designate the limits of the gene expression value. The optimal values for the other points are selected between the limits [P1, P9] for P2, [P2, P9] for P3, [P2, P3] for P4, [P4, P9] for P5, [P5, P9] for P6, [P5, P6] for P7, and [P7, P9] for P8. These points take floating-point values; the triplets P1, P2, P3 and P7, P8, P9 each draw a trapezoidal MF, and the triplet P4, P5, P6 draws a triangular MF.
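As a minimal sketch of the partitioning described above, the following Python code evaluates the three linguistic memberships of one expression value from the nine points. It assumes that the "low" trapezoid is a left shoulder (full membership from P1 to P2, falling to zero at P3), that the "medium" triangle peaks at P5, and that the "high" trapezoid is a right shoulder (rising from P7 to P8, full membership up to P9). These shapes, the function names, and the example points are illustrative assumptions and are not taken from the study.

```python
def low_mf(x, p1, p2, p3):
    """Assumed left-shoulder trapezoid: 1 on [p1, p2], falling linearly to 0 at p3."""
    if x <= p2:
        return 1.0
    if x >= p3:
        return 0.0
    return (p3 - x) / (p3 - p2)


def medium_mf(x, p4, p5, p6):
    """Triangle: 0 at p4, peak of 1 at p5, 0 at p6."""
    if x <= p4 or x >= p6:
        return 0.0
    if x <= p5:
        return (x - p4) / (p5 - p4)
    return (p6 - x) / (p6 - p5)


def high_mf(x, p7, p8, p9):
    """Assumed right-shoulder trapezoid: 0 up to p7, rising to 1 at p8, 1 up to p9."""
    if x <= p7:
        return 0.0
    if x >= p8:
        return 1.0
    return (x - p7) / (p8 - p7)


def fuzzify(x, points):
    """Return the memberships of x in the low, medium, and high partitions."""
    p1, p2, p3, p4, p5, p6, p7, p8, p9 = points
    return {
        "L": low_mf(x, p1, p2, p3),
        "M": medium_mf(x, p4, p5, p6),
        "H": high_mf(x, p7, p8, p9),
    }


# Hypothetical points on a normalized [0, 1] expression scale; they respect
# the interval constraints listed in the text (e.g. P4 in [P2, P3]).
example_points = (0.0, 0.2, 0.4, 0.3, 0.5, 0.7, 0.6, 0.8, 1.0)
memberships = fuzzify(0.35, example_points)
```

With these hypothetical points, a value of 0.35 belongs partly to "L" and partly to "M", which is the kind of overlap between neighboring partitions that the tuning of the free points is meant to control.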
The representation of typical MF points and RS for FRBMS is shown in Fig 3. A rule chooses integer numbers in three sections, viz., rule selection, antecedent, and consequent. "R" denotes a rule selection that can be either 0 or 1 to deselect or select the rule. G1, G2, G3, ..., Gf in the antecedent part represent the filtered genes, each taking a random integer value among 0, 1, 2, and 3 to perform linguistic as well as gene selection. The consequent Cl takes any value among 0, 1, 2, ..., n to assign the category of cancer. These single MF and RS points are fed to WSA to initialize more MF and RS points randomly as positions for the initial water particles. Based on the procedural evaluation of WSA, a knowledge base is constructed that contains the optimal data base (MF points) and rule base (RS points). This knowledge base extracted by WSA is used in a Mamdani inference procedure to perform classification of samples.
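The rule encoding described above can be illustrated with a short Python sketch that splits an integer rule vector into its three sections and renders it as an if-then statement. The mapping of antecedent codes (0 meaning that the gene is not used in the rule, and 1, 2, 3 meaning "L", "M", "H") is an assumption made for illustration, as are the function names and the example rule.

```python
# Assumed code-to-label mapping; 0 is taken to mean "gene not used in this rule".
LABELS = {1: "L", 2: "M", 3: "H"}


def decode_rule(rule, gene_names):
    """Split an integer rule vector [R, G1..Gf, C] into its three sections."""
    selected = rule[0] == 1
    antecedent_codes = rule[1:-1]
    consequent = rule[-1]
    if len(antecedent_codes) != len(gene_names):
        raise ValueError("rule length does not match the number of filtered genes")
    antecedent = {
        gene: LABELS[code]
        for gene, code in zip(gene_names, antecedent_codes)
        if code != 0
    }
    return selected, antecedent, consequent


def rule_to_text(rule, gene_names):
    """Render a selected rule as an if-then statement, or None if it is unused."""
    selected, antecedent, consequent = decode_rule(rule, gene_names)
    if not selected or not antecedent:
        return None
    conditions = " and ".join(f"{gene} is {label}" for gene, label in antecedent.items())
    return f"if {conditions} then class {consequent}"


genes = ["G1", "G2", "G3", "G4"]      # hypothetical filtered genes
example_rule = [1, 3, 0, 1, 0, 2]     # R=1, G1=H, G3=L, consequent class 2
text = rule_to_text(example_rule, genes)
```

Under this reading, a single integer per gene performs both gene selection (zero versus nonzero) and linguistic selection (which nonzero code), which is why a rule needs one integer per filtered gene plus one for R and one for the consequent.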

WSA
WSA is a new optimization algorithm [51,52] inspired by the way water finds a drain in a sink. The learning principle of WSA is used to make the FRBMS a self-learning system by providing the knowledge base in the form of optimal MF and RS points. The WSA starts by initializing the control parameters, such as the number of water particles, the boundary conditions, and the number of iterations, followed by random initialization of the positions of the water particles using the initial MF and RS points received from the FRBMS. Then, for each water particle position, WSA randomly generates the water particle's strength and a reference position. After that, each water particle's position (i.e., its MF and RS points) is evaluated using the objective function given in eq (1), where Ts is the total number of samples, Cs is the number of correctly classified samples, Rs is the number of rules selected from the maximum number of rules Rm, and Gs is the number of genes selected from the filtered genes. k1 and k2 are constants used to amplify Rs and Gs. The component (Ts − Cs) calculates the error. The WSA approach used in this study attempts to minimize the error component and thereby to improve the accuracy of the system. Similarly, the component (k1 × Rs) drives WSA toward a compact RS and thus addresses its interpretability. The component (k2 × Gs) attempts to find the minimal number of potential genes on the basis of the linguistic selection.
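The role of the three components can be sketched in Python as follows. The exact form of eq (1) is not reproduced in this text, so the sketch assumes that the three components named above are simply added and that the resulting value is minimized; the function name, the default constants, and the example numbers are hypothetical.

```python
def objective(total_samples, correct_samples, selected_rules, selected_genes,
              k1=2.0, k2=2.0):
    """Assumed additive objective: error + k1 * rules + k2 * genes (lower is better)."""
    if correct_samples > total_samples:
        raise ValueError("correct_samples cannot exceed total_samples")
    error = total_samples - correct_samples       # (Ts - Cs)
    rule_penalty = k1 * selected_rules            # (k1 x Rs)
    gene_penalty = k2 * selected_genes            # (k2 x Gs)
    return error + rule_penalty + gene_penalty


# Hypothetical comparison of two candidate knowledge bases on 100 samples.
compact = objective(100, 95, selected_rules=10, selected_genes=8)
bloated = objective(100, 96, selected_rules=30, selected_genes=25)
```

Under this assumption, the compact candidate scores lower (better) than the bloated one even though it classifies one sample fewer, which illustrates how k1 and k2 trade accuracy against the size of the rule set and gene subset.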
The optimality of the generated MF and RS points is checked during every iteration to yield the result. If the optimal points are not obtained, then the MF and RS points are updated iteratively using the strength and position update eqs (2) and (3), where α, x_p, and x_q,ref are all randomly generated using the range given for the solution variable; α_q,ref is a random number generated between 0 and 1; and α_old and α_new are the strength vectors of the water particles during the i-th and (i + 1)-th iterations. Similarly, x_p,old and x_p,new are the positions of the water particles during the i-th and (i + 1)-th iterations; x_q,ref, x_prevBest, and x_gBest denote the reference position, previous best position, and global best position of the water particle, respectively.

FRFI-WSA for the global cancer map with repeated measurements (GCM_RM) dataset
The steps of the proposed FRFI-WSA are demonstrated for tumor data categories of the GCM_RM dataset, which contains 123 samples. Out of 123 samples, 96 and 27 are used for training and testing, respectively. Furthermore, this dataset has 11 categories of tumors with 7129 genes. The 96 training samples include all categories of tumors. Nonetheless, the set of 27 test samples does not include samples of breast, melanoma, and pancreatic tumors. Hence, in this simulation, both the training and testing samples are mixed to have a reasonable sample for each category. Similar consideration is given to other kinds of datasets. The distributions of classes among the training (#Tr) and the testing (#Te) samples of GCM_RM are given in Table 2.
At the first level of gene filtering, all the 123 samples are considered for the GCM_RM dataset, and the same is done for the other datasets. Initially, a fuzzy equivalence class (FEC) was calculated for an individual gene via steps (i) through (viii) of FRFI. The FEC calculated for the individual gene is then used to produce an FEPM using step (ix) of FRFI. The FEC and FEPM calculated for the gene of GCM_RM whose accession ID is AB002380_at are given in Table 3. Then the Gene-Group significance is analyzed using step (x). Based on the Gene-Group significance value, genes are rated, and the gene with the highest significance value is designated as the first gene. Gene AB002380_at of GCM_RM has the highest significance value of 0.6489 and is nominated as the top-rated significant gene. After the significance calculation, the Gene-Gene severance (redundancy) is analyzed between gene AB002380_at and the residual genes of GCM_RM using step (xi) of FRFI as specified in Table 4.
From the significance and severance values, an FI value for each gene is calculated using step (xii) of FRFI so that it maximizes the significance and minimizes the severance. The FI values of the first 100 genes are shown in Fig 4. There are variations among the FI values computed for each gene. The genes are arranged in descending order of FI values, and the top 50 of the 7129 genes are retained for further evaluation to achieve a good trade-off between significance and severance. Identification of the most significant genes among the initially filtered 50 genes is carried out using WSA, which aims to generate a minimal number of rules with fewer informative genes that classify more samples by means of the FRBMS.
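The ranking and filtering step described above can be sketched in Python as follows. The sketch takes FI values that have already been computed (the computation of FI itself is not shown) and keeps the genes with the largest values; the function name and the example FI values are hypothetical.

```python
def filter_top_genes(fi_values, k=50):
    """Return the k gene identifiers with the largest FI values, best first."""
    ranked = sorted(fi_values.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


# Hypothetical FI values for five genes; the real dataset has 7129 genes.
example_fi = {"gene_a": 0.41, "gene_b": 0.63, "gene_c": 0.12,
              "gene_d": 0.58, "gene_e": 0.30}
top_three = filter_top_genes(example_fi, k=3)
```

Because this filter depends only on the FI values and not on the classifier, it can be run once per dataset before the WSA-based optimization starts.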
Each rule is found to take 52 varying integer numbers (1 for R, 50 for "G 1 , G 2 , G 3 . . . G f ," 1 for C l ) as per the representation strategy given in Fig 3. The maximal number R m of initial rules in the RS is determined heuristically by multiplying the number of classes (#Cat) in the dataset by 3 with the goal of obtaining at least a single rule for each category of cancer. For the GCM_RM dataset, 33 rules (11 × 3) are randomly initialized in the RS. Hence, the RS of GCM_RM contains 1716 integer numbers (33 × 52). Seven points are required to figure out the linguistic variables of every gene, and hence 350 floating-point numbers (7 × 50) are needed. The count of an integer variable differs from dataset to dataset depending on the number of cancer categories, whereas the count of a floating-point number is common for all the datasets.
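The arithmetic in the preceding paragraph can be reproduced with a few lines of Python. This is only a bookkeeping sketch; the function and key names are hypothetical, while the factors (three rules per category, 1 + f + 1 integers per rule, and seven free MF points per gene) are the ones stated in the text.

```python
def encoding_size(n_categories, n_filtered_genes, rules_per_category=3,
                  free_mf_points=7):
    """Count the integers (rule set) and floats (MF points) in one solution."""
    max_rules = n_categories * rules_per_category
    integers_per_rule = 1 + n_filtered_genes + 1   # R, G1..Gf, Cl
    return {
        "max_rules": max_rules,
        "integers_per_rule": integers_per_rule,
        "rule_set_integers": max_rules * integers_per_rule,
        "mf_floats": free_mf_points * n_filtered_genes,
    }


# GCM_RM: 11 tumor categories and 50 filtered genes.
gcm_rm = encoding_size(n_categories=11, n_filtered_genes=50)
```

For GCM_RM this gives 33 rules of 52 integers each (1716 integers) and 350 floating-point numbers. Changing `n_categories` changes only the integer count, which matches the remark that the floating-point count is common to all datasets.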
The size of the initial solution space is varied from 20 to 50. Each position of the water particle is evaluated using the objective function (1), with the number of iterations varied from 10 to 100. The value of the constants k1 and k2 in eq (1) is varied from two to five depending on the Rs and Gs obtained during a particular iteration. A maximum of 40 independent trials were conducted by varying the water space as well as the number of iterations, and the resulting performance of every water particle was examined. The best results for the GCM_RM dataset were observed with 30 water spaces and between 80 and 100 iterations. A similar experiment was conducted for all other datasets used in this study. The 16 most significant genes selected in the RS from among the 50 filtered genes are presented in S1 Table along with their descriptions for identification of tumor categories. The rule set gleaned for the GCM_RM dataset is presented in Table 5. Twenty-six rules were generated to achieve a classification accuracy of 96.45%. In Table 6, the accession IDs of the most significant genes are presented along with the selected linguistic label and tumor category, which help to identify the genes causing the tumor. Furthermore, the GCM_RM dataset was examined with different numbers of initial rules per category, namely four, five, and six, which resulted in 44, 55, and 66 rules in the RS, respectively. The optimum genes selected in the different RSs are not distributed reasonably among common genes. Hence, it is understandable that various subsets of genes are selected for categorizing the classes of patients. Nevertheless, the genes selected beyond 20 to 100 in the RS

Empirical results
Performance comparison and evaluation metrics. The performance of the proposed WSA approach was compared with competing methods, namely GA [20], PSO [21], and ABC [22], on all the datasets. A comparison of convergence between the proposed WSA and the other approaches for the GCM_RM dataset is shown in Fig 5. It is noteworthy that the convergence of the other approaches is worse than that of the proposed WSA approach. Although the other approaches based on GA, PSO, and ABC are relatively good at tuning the MF, they consume more generations to converge. It is clear in the figure that both ABC and WSA show an abrupt rise in the fitness value, whereas the GA and PSO approaches show only a steady increase in the fitness value. A possible reason is the larger number of tunable parameters.
In Table 7, a comparison is presented between the proposed WSA and the other methods for all datasets. For each dataset (DS), the table shows the classification accuracy (CA%), number of genes (#Gs), and central processing unit (CPU) time (CT). All methods perform reasonably well, but PSO appears to be a little faster than the others except WSA because of its simplified operations. Nonetheless, PSO did not produce a better solution than ABC did. Even though ABC is relatively good at producing interpretable rules, it consumes more CPU time due to the different phases of bee operations in generating simple rules. In contrast, the proposed WSA quickly reached the desired fitness value with a minimum number of most significant genes for all the binary and multiclass datasets used in this study. This indicates that, with parameters properly tuned by WSA optimization, the proposed approach can be extended to classify both binary and multiclass samples of cancer gene expression datasets.
The Monte-Carlo cross-validation (MCCV) method. The performance of the proposed approach in terms of generalization was assessed using the MCCV [53,54] method. The mean value of the error calculated for the GCM_RM dataset using MCCV is presented in Fig 6. One can see that the error rate diminishes as the number of genes rises at every trial. Nevertheless, beyond 16 genes, the error rate rises to some extent. Hence, it is clear that a reasonably limited set of genes is sufficient to categorize the diverse cancer classes competently. Thus, the proposed FRFI-WSA approach can effectively and precisely identify meaningful genes associated with cancer for the classification of the 11 tumor categories in the GCM_RM dataset. Similar generalization performance was observed in all other datasets used in this study.
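The general idea of MCCV, averaging the test error over repeated random train/test splits, can be sketched in Python as follows. The split fraction, the number of trials, and the placeholder majority-class classifier are assumptions for illustration; the study's own settings and its FRBMS classifier are not reproduced here.

```python
import random


def monte_carlo_cv(samples, labels, fit_predict, n_trials=10,
                   test_fraction=0.25, seed=0):
    """Average the test error over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, int(round(test_fraction * len(samples))))
    errors = []
    for _ in range(n_trials):
        rng.shuffle(indices)                      # new random split each trial
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        wrong = sum(p != labels[i] for p, i in zip(predictions, test_idx))
        errors.append(wrong / n_test)
    return sum(errors) / len(errors)


def majority_class(train_x, train_y, test_x):
    """Placeholder classifier: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common for _ in test_x]


# Hypothetical data: 20 samples with two balanced classes.
toy_samples = list(range(20))
toy_labels = ["a"] * 10 + ["b"] * 10
mean_error = monte_carlo_cv(toy_samples, toy_labels, majority_class, n_trials=5)
```

Any classifier that follows the `fit_predict(train_x, train_y, test_x)` convention can be passed in place of the placeholder, so the same loop could be repeated for gene subsets of different sizes to obtain an error curve of the kind plotted in Fig 6.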
Wilcoxon's signed-rank test. To evaluate significant differences in outcomes between the competing methods and the proposed approach, Wilcoxon's signed-rank test [21] was used. Table 8 compares the results of the proposed approach with those of the other methods for gene selection and knowledge acquisition. In this table, "r+" denotes the number of times the first method is superior to the second, and "r-" denotes the grades for disagreeing with the result. The null hypothesis "h" of the Wilcoxon's test is rejected (rej) because ρ < α = 0.01 in all comparisons, which favor WSA owing to the difference between the r+ and r- values. The results indicate that the fuzzy lower approximation space for computing significance and severance values of genes can deliver improvements in all metrics over the existing methods.
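As a minimal illustration of the statistic behind this test, the following Python sketch computes the positive and negative rank sums of paired differences in the standard formulation of Wilcoxon's signed-rank test (zero differences dropped, tied magnitudes given average ranks). This standard formulation is an assumption about how the r+ and r- columns could be obtained and may differ from the exact tabulation in Table 8; the function name and the example scores are hypothetical.

```python
def signed_rank_sums(first, second):
    """Return (R+, R-): rank sums of the paired differences first - second."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of differences with the same magnitude.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1            # tied magnitudes share a rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Hypothetical paired scores of two methods on four datasets.
method_one = [5, 7, 9, 4]
method_two = [3, 8, 5, 4]
r_plus, r_minus = signed_rank_sums(method_one, method_two)
```

The smaller of the two sums is the test statistic; turning it into a p-value requires the null distribution of the statistic, for which a statistics library routine such as `scipy.stats.wilcoxon` can be used.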
The receiver operating characteristics (ROC) curve. The ROC curve was drawn to understand the strength of the proposed FRFI-WSA by plotting the true positive rate (TPR) against the false positive rate (FPR) at various cut points (Fig 7) [21,55]. The ROC curve of the proposed approach lies near the upper left corner for all the datasets (for clear visualization, ROC curves are shown only for selected datasets). Our proposed approach has shown the highest sensitivity and specificity for all the datasets except for SRB and Car. Even though the proposed approach yields a lower value of the area under the curve (AUC) for the SRB and Car datasets, this shortcoming does not disqualify the proposed approach as a screening test for cancer diagnosis because its effect on performance is negligible.
Interpretability and gene ontology analysis. Readability and comprehensibility [56] are the two key valuation metrics to assess the interpretability of rules. The former deals with the model description, which is quantified using indices such as the coverage of the rules (R_cov), accuracy of the rules (R_acc), goodness of the rules (R_gud), average rule length (A_rl), average fired rules (A_fr), and average confidence firing degree of the rules (A_cfd). Values of those indices for every generated rule/RS can be obtained using eqs (4) to (9),
where N_con is the count of samples covered by rule R in the total number of samples #S, and N_pro is the count of samples properly classified by R in N_con. PCS_fd, NCS_fd, and TCS_fd are the firing degrees of positive, negative, and total covered samples, respectively. T_rl is the total rule length, i.e., the count of linguistic variables, T_fr is the total number of fired rules, A_fd is the average firing degree of a rule, and #R is the total number of rules. The values of the indices for all the datasets are reported in Table 9. Throughout the execution, the proposed WSA tunes the MF points of each gene so that there is a reasonable overlap among the curves of linguistics. WSA also tries to ignore the MF points that attempt to go out of range. Likewise, the semantic label gained for each gene results in a reasonable length for each rule so that it can be used compactly. The linguistic values (low, medium, and high) associated with each gene can help a physician to identify the patient's distinct genomic contour to produce a verdict. The confidence about the average firing degree shows that the rules produced by WSA are fired more frequently and have a tendency to be cofired with other rules. To avoid redundancy and to improve the compactness and interpretability without losing classification accuracy, the rules with the lowest firing degrees are not included in this study.
Comprehensibility of the rules (which deals with explanation of the system concerning the inference complexity of rules) is analyzed using the information on cofiring of rules. For each rule R, the numbers of instances fired individually (IF) and simultaneously (SF) with every neighboring rule are recorded to compute a cofiring measure, CF, using eq (10). Then the number of premises P in each rule is counted for computing the comprehensibility index (CI) using eq (11), where r is the total number of rules.
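Three of the readability indices defined earlier in this section can be sketched in Python as simple ratios. Eqs (4) to (9) are not reproduced in this text, so the sketch assumes the conventional definitions suggested by the symbol descriptions: coverage as N_con / #S, accuracy as N_pro / N_con, and average rule length as T_rl / #R. The function names and the example counts are hypothetical.

```python
def rule_coverage(n_covered, n_samples):
    """Assumed R_cov: fraction of all samples covered by the rule (N_con / #S)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return n_covered / n_samples


def rule_accuracy(n_correct, n_covered):
    """Assumed R_acc: fraction of covered samples classified correctly (N_pro / N_con)."""
    if n_covered == 0:
        return 0.0                      # a rule that covers nothing has no accuracy
    return n_correct / n_covered


def average_rule_length(total_rule_length, n_rules):
    """Assumed A_rl: mean number of linguistic variables per rule (T_rl / #R)."""
    if n_rules <= 0:
        raise ValueError("n_rules must be positive")
    return total_rule_length / n_rules


# Hypothetical counts for one rule and one rule set.
coverage = rule_coverage(n_covered=30, n_samples=120)
accuracy = rule_accuracy(n_correct=27, n_covered=30)
mean_length = average_rule_length(total_rule_length=78, n_rules=26)
```

The firing-degree-based indices (goodness and average confidence firing degree) depend on PCS_fd, NCS_fd, and TCS_fd in ways that cannot be inferred from the symbol list alone, so they are left out of this sketch.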
Based on a heuristic threshold (T) between 0 and 1, the cofiring comprehensibility index (CFCI) is computed using eq (12) to understand the implied and clear semantics set in the fuzzy partitions and reasoning as well.
The details of this analysis are illustrated in Fig 8 for the rules of the GCM_RM dataset. All the rules generated by WSA, without any rule selection, were used for examining comprehensibility. Rule R16 has the largest CFCI value, while rule R13 has the smallest value. We found that the majority of the samples are fired in the region between R1 and R9. Because R16 and R32 cover many problem instances, they overlap with the rules between R1 and R9. Linguistic simplification is carried out by combining rules R26 and R27, which show a similar CFCI value. As anticipated, the evidence related to the new fused rules differs from that of the FRBMS with the complete RS. Likewise, certain rules are eliminated to fine-tune the system performance. We found that the accuracy of the system is improved after elimination of rules R13, R20, R22, R23, and R25. The interpretability analysis confirmed that the rules produced by the proposed WSA for all the datasets are transparent and comprehensible and meet the requirements for understanding cancer gene expression data.
During gene expression-based cancer diagnosis, in addition to finding the subset of potential genes causing the cancer, the researcher is expected to trace out the characteristics of the causative genes in terms of their role in multiple cancer classes [57]. The GOSim package in the R platform [22] was used to compute the similarity value for the genes identified in the GCM_RM dataset using the GO terms. It is noteworthy that the genes are related to DNA metabolism and are enriched only in the categories repair, positive regulation, reduction, cell size, development, and assembly. The nitrogen compound metabolic process of gene AFFX-CreX-3_st has an "is a" relation with GO:0006328 and is involved in a DNA metabolic process.
The primary metabolic process of AB000464_at has an "is a" relation with GO:004891, and the cellular process of AFFX-PheX-3_at has an "is a" relation with GO:006813. The process of cellular nitrogen compound metabolism relevant to Z49107_s_at has a "part of" relation with GO:000524 and GO:004271. The process of nucleobase-containing compound metabolism relevant to M33336_at has a "part of" relation with GO:013608 and GO:0044167. It was confirmed that the selected genes are involved in a DNA metabolic process and encode proteins associated with critical substances implicated in cancer. Such substances promote angiogenesis, help to elude apoptosis, increase differences from normal tissues, and enhance independent progression signs, which supports accurate prediction of cancer. Furthermore, the biological processes are consistent with the molecular activities that occur in active and proliferating cells of a cancer. The inequitable control of genes produced by the proposed procedure defines the extracellular environments that are important for understanding the communications between the cells. Because most of the cancer genes restrained by the latest technology do not have entries in the GO database, it is not feasible to construct similarity relations between cancer genes for all the datasets used in this study. Overall, the discriminative power of the nominated genes and their linguistics in the proposed model is sufficient to detect samples of a certain type of cancer and then to quickly rule out healthy samples.

Discussion
In this study, we propose a new combined FRFI-WSA approach for designing an FRBMS to analyze gene expression data for cancer diagnosis. The WSA method showed the highest classification accuracy for detection of cancer genes in comparison with the GA, PSO, and ABC algorithms (Fig 5). Furthermore, the proposed approach showed the highest diagnostic sensitivity and specificity in the ROC analysis for estimation of classification performance. The superior performance of FRFI-WSA can be attributed to the gene filtering implemented in this study, which maximizes the gene-class relevance, minimizes the gene-gene redundancy, and ranks genes by their FI values without dependence on the classifier model. In addition, the most relevant genes associated with cancer were identified by the WSA, which attempts to optimize the RS and MF required for classification of samples using an FRBMS.
The combined FRFI-WSA approach quickly attained the desired fitness value using shorter computing time and a minimal number of rules for identification of the most significant cancer genes in comparison with the GA [20] and PSO [21] techniques (Table 7). This is probably because WSA is based on the novel strength and position update eqs (2) and (3), respectively, and simplifies operations with few or no parameters, thus rapidly extracting the RS and MFs. The fuzzy model integrated into GA reported in reference [19] deals only with binary data and uses a wide range of genes for classification of cancer genes. Moreover, it was also demonstrated that finding an optimal number of genes for multiclass problems is more beneficial for diagnosis of cancer. The ensemble combinatorial search is integrated into GA [14] as a single objective GA for optimization of the ensemble technique to classify class-imbalanced datasets. Nonetheless, a single objective GA tends to locate solutions close to a local optimum, and hence its average error is much greater than that of the proposed approach, which finds global optimal solutions for the classification. Hence, the proposed FRFI-WSA approach can effectively identify the most relevant genes associated with cancer (16 genes) with great precision (96.45%) and generate understandable compact rules with fewer parameters for the classification of multiclass cancer categories. The classification performance of the FRFI-WSA according to cross-validation also showed that the two levels of gene selection implemented in this approach can exclude some of the noisy genes that worsen classification performance.
The optimization using WSA in the present study effectively extracts a comprehensible RS (26 rules) and understandable linguistics for an MF for classifying the multiclass cancer samples. These findings are also supported by another study [14], in which repeated tuning of the MF and RS by an optimization method could address the challenges of dimensionality and multiclass imbalanced data to achieve optimal classification. The lack of previous studies applying WSA to gene selection and RS generation on multiclass gene expression datasets makes it difficult to compare our results directly, which is one of the limitations of this study. Although the proposed model is better at identifying the genes that are most important for classifying different types of cancer, it is time-consuming, particularly in generating the fuzzy equivalence class. In the future, the complexity of generating a fuzzy equivalence class by the FRFI method can be reduced by evaluating the Cartesian product using a fuzzy lower approximation for more rapid selection of a smaller subset of genes without any skewedness to multicategory data. However, the proposed classifier model based on gene expression datasets extracted the most relevant genes associated with cancer by means of the WSA method. Furthermore, other global optimization techniques such as genetic swarm and ant bee algorithms could be combined with the WSA method to generalize the interpretable rules with the most relevant genes for cancer. In addition, further study is also needed to verify the performance of the proposed approach and to investigate the similarities of gene expression data generated from other platforms such as Illumina and Agilent. Our study revealed that the FR implemented here computes the FI without losing the biological meaning of the gene expression and should be helpful for identification of potential genes. Next, the WSA method produces highly interpretable rules and classifies the maximal number of samples using an FRBMS better than the existing methods reported in the literature [14,19-21].
Thus, the two levels of gene selection implemented in this study result in an efficient diagnostic system with lower complexity. Furthermore, the proposed approach reduces the computational cost and thus improves the classification accuracy of an FRBMS. In addition, the highest sensitivity and specificity in the selected multiclass datasets strongly indicate that the new FRFI-WSA approach is practically useful for construction of an effective system for making diagnostic decisions about cancer.

Supporting Information
S1 Appendix. The detailed concepts of the fuzzy set, rough set, fuzzy rough set and f-information. (PDF)
S1 Table. Identification