A novel feature ranking method for prediction of cancer stages using proteomics data

Proteomic analysis of cancer stages has provided new opportunities for the development of novel, highly sensitive diagnostic tools that help early detection of cancer. This paper introduces a new feature ranking approach called FRMT. FRMT is based on the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and selects the most discriminative proteins from proteomics data for cancer staging. In this approach, the outcomes of 10 feature selection techniques were combined by the TOPSIS method to select the final discriminative proteins from seven different proteomic databases of protein expression profiles. In the proposed workflow, feature selection methods and protein expressions are considered as criteria and alternatives in TOPSIS, respectively. The proposed method is tested with seven classifier models in a 10-fold cross-validation procedure repeated 30 times on the seven cancer datasets. The obtained results showed higher stability and superior classification performance of the method in comparison with other methods, and lower sensitivity to the applied classifier. Moreover, the final selected proteins are informative and have the potential for application in real medical practice.


Introduction
Cancer has always been one of the most fundamental health problems of human society. Every year, between 100 and 350 out of every 100,000 people worldwide die of cancer [1][2][3][4]. Understanding the nature of cancer, which is caused by the malfunction of the mechanisms that regulate cell growth and division, has always been a topic of interest to researchers. The development of molecular biology in recent decades has enhanced the understanding of the complex interactions of genetic variants, transcription and translation [5]. Proteomic studies can play a critical role in the prevention, early detection and treatment of cancer. Because proteomic studies can help identify cancer biomarkers, they may enable earlier detection and treatment of cancer [6,7].
The robustness of microarray-derived cancer biomarkers that have been identified by using gene expression profiles is very poor [8,9]. Thus, the evaluation of tumor cells at protein expression levels, which are more robust than gene expression level, is necessary to explain were taken by RPPA. We used RPPA, an antibody-based high-throughput technique, for analyzing concurrent expression levels of hundreds of proteins in a single experiment. The related pathological information for each patient in the TCPA dataset was downloaded from Broad Institute TCGA (https://confluence.broadinstitute.org/display/GDAC/Dashboard-Stddata and http://tcpaportal.org/tcpa/download.html). Then, we divided the samples into two groups of early stage (stage I and II) and advanced stage (stage III and IV).
In TCPA, the proteins are divided into three groups: "validated", "under evaluation" and "used with caution". In this work, we used only the 115 proteins labeled as validated for each patient to obtain reliable results. See Table 1 for details.
We used the R software and the FEAST Toolbox [35] in MATLAB to implement the different classification models and feature selection algorithms. All the cleaned data and the algorithm scripts used in this manuscript can be downloaded from www.github.com/E-Saghapour/FRMT.

Hybrid models
Hybrid model approaches have been used by many previous investigators to study various biological or biomedical problems [36][37][38][39][40]. The stratification of cancer can be considered a traditional pattern recognition problem. The data analysis procedure, including the feature selection and classification steps, is shown in Fig 1. The different blocks in Fig 1 are explained in the next subsections.
Feature selection. In a filter feature selection (FFS) method, a criterion function is used to rank features independently. The top-ranked features, called informative features, are then used in the classification model. Various criterion functions have been introduced and applied to gene expression profiles, leading to different subsets of genes with different classification performance. Although FFS methods produce unstable results across datasets, they are robust against overfitting. FFS methods can also be applied to protein expression profiles for protein ranking; however, they do not take protein-protein interactions into account. In this study, a novel ensemble method is proposed to improve the stability of the results by integrating common FFS methods (Table 2). We utilized the TOPSIS method to score the proteins and choose the most informative ones for classification (Fig 2). The TOPSIS method is described in detail in the next section.
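As an illustration of how a filter method ranks features independently, the following minimal sketch scores each feature with an absolute Welch t-statistic and keeps the top-ranked ones. The criterion, the function names and the toy data are assumptions chosen for illustration; they are not necessarily among the methods listed in Table 2.

```python
import math
from statistics import mean, variance

def welch_t_score(values, labels):
    """Absolute Welch t-statistic of one feature for a binary label (0/1)."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    # Standard error of the difference in means (unequal variances).
    se = math.sqrt(variance(g0) / len(g0) + variance(g1) / len(g1))
    return abs(mean(g0) - mean(g1)) / se if se > 0 else 0.0

def rank_features(X, labels, top_k):
    """Score every column of X independently; return top_k column indices and all scores."""
    n_features = len(X[0])
    scores = [welch_t_score([row[j] for row in X], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k], scores

# Toy data (hypothetical): 6 samples, 3 features; only feature 1 separates the classes.
X = [[0.1, 1.0, 5.0], [0.2, 1.2, 4.0], [0.3, 0.9, 6.0],
     [0.2, 3.0, 5.5], [0.1, 3.1, 4.5], [0.3, 2.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
top, scores = rank_features(X, y, top_k=1)
```

Because each feature is scored on its own, a different criterion function can produce a different ranking on the same data, which is the instability that the ensemble approach aims to reduce.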
TOPSIS method. TOPSIS was first presented by Hwang and Yoon in 1981 [41]. It is a multi-criteria decision analysis method that selects the option whose geometric distance from the positive ideal solution (PIS) is the shortest and whose distance from the negative ideal solution (NIS) is the longest.
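The workflow of the TOPSIS method contains the following seven steps:
1. Generating an m-by-n evaluation matrix containing m alternatives $A_1, A_2, \ldots, A_m$, each assessed by n local criteria $C_1, C_2, \ldots, C_n$.
2. Normalizing the decision matrix:
\[ r_{ij} = \frac{x_{ij}}{\sqrt{\sum_{k=1}^{m} x_{kj}^{2}}}, \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n, \]
where $x_{ij}$ is the score of alternative $A_i$ with respect to criterion $C_j$.
3. Calculating the weighted normalized decision matrix, whose values $V_{ij}$ are computed as:
\[ V_{ij} = w_j \, r_{ij}, \quad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n, \]
where $W = [w_1, w_2, \ldots, w_n]$ is the vector of local criteria weights satisfying $\sum_{j=1}^{n} w_j = 1$.
4. Determining the positive ideal ($A^{+}$) and negative ideal ($A^{-}$) solutions as follows:
\[ A^{+} = \{ (\max_{i} V_{ij} \mid j \in J), \; (\min_{i} V_{ij} \mid j \in J') \} \tag{2} \]
\[ A^{-} = \{ (\min_{i} V_{ij} \mid j \in J), \; (\max_{i} V_{ij} \mid j \in J') \} \tag{3} \]
where $J$ is the set of benefit criteria and $J'$ is the set of cost criteria. In the proposed method, all criteria are considered as benefit criteria; therefore $J'$ is empty and (2), (3) reduce to (6), (7):
\[ A^{+} = \{ V_1^{+}, V_2^{+}, \ldots, V_n^{+} \}, \quad V_j^{+} = \max_{i} V_{ij} \tag{6} \]
\[ A^{-} = \{ V_1^{-}, V_2^{-}, \ldots, V_n^{-} \}, \quad V_j^{-} = \min_{i} V_{ij} \tag{7} \]
5. Measuring the Euclidean distances between each alternative and both the positive and negative ideal solutions, which are calculated as follows:
\[ S_i^{+} = \sqrt{\sum_{j=1}^{n} (V_{ij} - V_j^{+})^{2}}, \qquad S_i^{-} = \sqrt{\sum_{j=1}^{n} (V_{ij} - V_j^{-})^{2}}, \quad i = 1, 2, \ldots, m. \]
6. Computing the relative closeness to the ideal solution by Eq (10):
\[ C_i = \frac{S_i^{-}}{S_i^{+} + S_i^{-}}, \quad 0 \le C_i \le 1. \tag{10} \]
7. Ranking the alternatives in descending order of $C_i$.
Classification. In this study, we utilized seven models for classification: support vector machine (SVM), random forest (RF), decision tree (DT), linear discriminant analysis (LDA), naive Bayes (NB), fuzzy logic (FL), and k-nearest neighbors (kNN).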
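The TOPSIS steps listed above can be sketched in a few lines. This is a minimal illustration, assuming that all criteria are benefit criteria (as in the proposed method) and that equal weights are used; the function name `topsis`, the equal weights and the toy score matrix are assumptions for illustration and are not taken from the published scripts.

```python
import math

def topsis(matrix, weights):
    """Relative closeness of each alternative (row) to the ideal solution.

    Columns are criteria, all treated as benefit criteria; higher closeness is better.
    """
    m, n = len(matrix), len(matrix[0])
    # Step 2: vector-normalize each column.
    norms = [math.sqrt(sum(matrix[i][j] ** 2 for i in range(m))) for j in range(n)]
    # Step 3: weighted normalized decision matrix.
    v = [[weights[j] * matrix[i][j] / norms[j] if norms[j] > 0 else 0.0
          for j in range(n)] for i in range(m)]
    # Step 4: positive and negative ideal solutions (benefit criteria only).
    best = [max(v[i][j] for i in range(m)) for j in range(n)]
    worst = [min(v[i][j] for i in range(m)) for j in range(n)]
    closeness = []
    for i in range(m):
        # Step 5: Euclidean distances to both ideal solutions.
        d_pos = math.sqrt(sum((v[i][j] - best[j]) ** 2 for j in range(n)))
        d_neg = math.sqrt(sum((v[i][j] - worst[j]) ** 2 for j in range(n)))
        # Step 6: relative closeness to the ideal solution.
        total = d_pos + d_neg
        closeness.append(d_neg / total if total > 0 else 0.0)
    return closeness

# Hypothetical scores: 4 proteins (rows) scored by 3 filter methods (columns).
scores = [[0.9, 0.8, 0.7],
          [0.1, 0.2, 0.1],
          [0.5, 0.9, 0.4],
          [0.3, 0.1, 0.2]]
weights = [1 / 3, 1 / 3, 1 / 3]
c = topsis(scores, weights)
# Step 7: rank alternatives by descending closeness.
ranking = sorted(range(len(c)), key=lambda i: c[i], reverse=True)
```

In this toy case the first protein scores highest on two of the three criteria and is ranked first, while the protein with the lowest scores overall is ranked last.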
SVM relies on the concept of decision planes that define decision boundaries. Classification is performed by building hyperplanes in a multidimensional space that separate samples with different class labels. For classes with nonlinear boundaries in the input space, a kernel function is employed to map the input space into a higher-dimensional feature space in which linear separation may be feasible. The kernel trick evaluates inner products between samples in the feature space without using or knowing the mapping explicitly; thus, the high dimensionality of the feature space does not increase the computational cost of training and classification.
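The kernel trick can be made concrete with a small example. The sketch below, which assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs (an illustrative choice, not necessarily the kernel used in this study), shows that the kernel value equals the inner product of the explicitly mapped vectors, although the kernel never builds the mapping.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs whose inner product equals poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
k_implicit = poly2_kernel(x, z)                          # never builds phi
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # builds phi explicitly
```

Both values agree up to floating-point rounding, while the implicit form works in the original two-dimensional space only.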
RF is an ensemble classifier composed of many decision trees. The class output by RF is the mode of the classes output by the individual trees [42]. The Random Decision Forests learning algorithm was developed by Leo Breiman [21] based on decision trees, which are non-parametric supervised learning approaches used for regression and classification. The use of a set of tree classifiers and of randomness in the RF design leads to good accuracy and stability of the resulting classifier.
Formally, RF is a classifier consisting of a set of tree-structured classifiers {g(x, b_k), k = 1, 2, ...}, where the {b_k} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. The RF method (along with other ensemble learning methods) has been very popular in biomedical research, and it builds random trees using both bagging and random variable selection [43].
A Fuzzy Inference System (FIS) is a method of mapping an input space to an output space using FL. FIS attempts to formalize the reasoning procedure of human language by means of FL and by building fuzzy IF-THEN rules. The fuzzy inference procedure involves membership functions, logical operations, and IF-THEN rules. Such systems have become strong methods for handling problems such as uncertainty, imprecision, and non-linearity, and are generally used for identification, classification, and regression tasks. Instead of employing crisp sets as in classical rules, fuzzy rules exploit fuzzy sets. Rules were initially obtained from human experts through knowledge engineering procedures. However, this approach may not be possible when facing complicated tasks or when human experts are not accessible [44].
The kNN algorithm, one of the popular machine learning algorithms, is a non-parametric method used for classification and regression problems. In both cases, the input consists of the k closest training samples in the feature space. The output depends on whether kNN is used for classification or regression. In kNN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. The best choice of k depends on the data; a good k can be chosen by different heuristic methods. Larger values of k decrease the effect of noise on the classification, but make the boundaries between classes less distinct. A drawback of the kNN algorithm is its sensitivity to the local structure of the data. In kNN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors, rather than a vote among them [45].
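The majority-vote rule described above can be sketched as follows. This is a minimal illustration assuming Euclidean distance; the function name `knn_predict` and the toy samples are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Sort training samples by Euclidean distance to the query.
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups in a 2-D feature space.
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["early", "early", "early", "advanced", "advanced", "advanced"]
pred = knn_predict(train_X, train_y, [0.15, 0.15], k=3)
```

With an odd k and two classes, the vote cannot end in a tie, which is one common heuristic for choosing k in binary problems.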
LDA is a method used in pattern recognition, statistics, and machine learning to find a linear combination of features that separates two or more classes of objects, and it is an extension of Fisher's linear discriminant. Such a combination might be used as a linear classifier or, more generally, for dimensionality reduction before later classification. LDA attempts to represent one dependent variable as a linear combination of other features and is closely related to analysis of variance and regression analysis [46]. LDA is also closely related to factor analysis and principal component analysis (PCA) in that they all look for linear combinations of variables that best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, on the other hand, does not take into account any difference in class, and factor analysis creates the feature combinations according to differences rather than similarities [46].
Bayesian classification is a statistical method for classification that follows a supervised learning strategy. It provides practical learning algorithms in which prior knowledge and the observed data can be combined, and it offers a useful perspective for evaluating and understanding many learning algorithms. It is robust to noise in the input data and calculates explicit probabilities for hypotheses. The NB classifier assumes that features are independent of each other within each class, but it works well in practice even when that independence assumption is not valid. The NB classifier requires only a small amount of training data to estimate the parameters necessary for classification, such as the mean and variance of the variables [47].
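To illustrate how the mean and variance of each variable are used under the independence assumption, the following minimal sketch implements a Gaussian NB classifier. The Gaussian likelihood, the small variance constant, the function names and the toy data are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means, and feature variances of each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "mean": [mean(c) for c in cols],
            # Small constant keeps the variance strictly positive.
            "var": [pvariance(c) + 1e-9 for c in cols],
        }
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the largest log-posterior under feature independence."""
    def log_posterior(p):
        ll = math.log(p["prior"])
        for v, mu, var in zip(x, p["mean"], p["var"]):
            ll += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return ll
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data (hypothetical): class 0 and class 1 differ in both features.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = [0, 0, 0, 1, 1, 1]
nb = fit_gaussian_nb(X, y)
label = predict_gaussian_nb(nb, [1.1, 2.0])
```

Only one mean and one variance per feature and class are estimated, which is why a small training set can be sufficient.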
Performance measures
K-fold cross-validation, the independent dataset test, the sub-sampling test and the jackknife cross-validation test are four widely used schemes in statistical classification for examining the performance of a prediction model [48][49][50][51][52][53]. The jackknife test has been widely used in bioinformatics [54-68] because it always yields a unique outcome [33,69]. However, it is time-consuming. To save computational time, ten-fold cross-validation was used in this study to investigate the performance of the prediction model. In k-fold cross-validation, the data are divided into k subsets; each time, one subset is used as test data and the remaining k-1 subsets are used as training data. The mean error across all k experiments is then calculated. Since the utilized datasets are unbalanced in terms of the number of samples in the early and advanced stage groups, the Area under Curve (AUC) and the Matthews Correlation Coefficient (MCC) were used.
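The splitting scheme of repeated k-fold cross-validation can be sketched as below. This is a simplified illustration: the folds are formed by plain shuffling without stratification by stage, and the function names and the seed handling are assumptions rather than the procedure implemented in the published scripts.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_k_fold(n_samples, k=10, repeats=30):
    """Yield (train, test) index lists for `repeats` independent k-fold splits."""
    for r in range(repeats):
        folds = k_fold_indices(n_samples, k, seed=r)
        for i, test in enumerate(folds):
            # All folds except the i-th one form the training set.
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test

# 30 repetitions of 10-fold cross-validation on a hypothetical set of 25 samples.
splits = list(repeated_k_fold(n_samples=25, k=10, repeats=30))
```

Each repetition uses a different shuffle, so averaging a performance measure over all splits reduces its dependence on one particular partition of the data.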
The MCC, introduced by Brian W. Matthews [70], is used for measuring the quality of binary classification. The MCC is a number between -1 and +1. Values of +1 and 0 indicate a perfect and a random prediction, respectively, and -1 represents total disagreement between the predicted and actual values. It can be computed from the confusion matrix as:
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11} \]
where TP is the number of true positives (early stage), TN is the number of true negatives (advanced stage), FP is the number of false positives, and FN is the number of false negatives. Eq 11 can also be written in another form, like Eq 11 in [40], which was derived by Xu et al. [40] based on the symbols introduced by Chou in studying signal peptides and used in many recent studies [26][27][28][29][30][31][32]. This set of metrics is valid only for single-label systems. For multi-label systems, which have become more frequent in systems biology [71] and systems medicine [37,72-74], a completely different set of metrics, as defined in [75], is needed.
Moreover, the AUC is defined as the area under the ROC curve, which illustrates the performance of a binary classifier system as its discrimination threshold is varied. An AUC of 1, of 0.5, and below 0.5 indicates a perfect, a random, and a bad classifier, respectively.
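Both measures can be computed with a few lines of code. In this minimal sketch, the MCC follows the confusion-matrix definition, and the AUC is computed through its rank interpretation, namely the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. Returning 0 when the MCC denominator is zero is a common convention adopted here as an assumption; the function names are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): return 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = mcc(tp=5, tn=5, fp=0, fn=0)
chance = mcc(tp=5, tn=5, fp=5, fn=5)
separable = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for datasets with a few hundred patients; a sort-based computation would be preferable for much larger sets.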

Results
As can be seen from the data in Table 1, KIRC, with 453 samples, is the largest dataset. UCEC is the second largest, with 404 samples. With respect to pathologic stage, the data are unbalanced; the READ and OV datasets are the least and the most unbalanced, respectively.
To present the performance of the proposed FRMT method, we have provided seven tables, one for each cancer dataset (S1 Table), which demonstrate the effect of applying different feature selection techniques to various classifier architectures. MCC and AUC are used as evaluation measures in 30 repetitions of the 10-fold cross-validation procedure. For a fair performance evaluation, we should consider the different factors that affect classification performance, such as the training dataset, the classifier model, and the number of selected features. We therefore evaluate the possible combinations, which comprise 49 states because seven classifiers are applied to seven datasets. Then, we select subsets of features of different sizes (5, 10, 15 and 20) obtained by each feature selection method, and determine which method reaches the highest accuracy in each of the 49 states. Fig 4 shows the percentage of states in which each feature selection method reached the best performance (winning frequency) for the various numbers of features. As shown in Fig 4, the proposed method reached the best result for all sizes of feature subsets in comparison with the other methods, and its peak was obtained by using 10 features.
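The winning frequency can be computed as sketched below. The function name, the tie-handling rule and the toy scores are assumptions made for illustration only; the scores are not taken from the tables of this paper.

```python
def winning_frequency(results):
    """Fraction of states in which each method has the top score.

    `results` maps a state (dataset, classifier) to a dict {method: score}.
    Ties are credited to every tied method (an assumption of this sketch).
    """
    wins = {}
    for scores in results.values():
        best = max(scores.values())
        for method, s in scores.items():
            wins.setdefault(method, 0)
            if s == best:
                wins[method] += 1
    return {method: count / len(results) for method, count in wins.items()}

# Hypothetical scores for illustration only; not taken from the paper's tables.
toy = {
    ("A", "NB"):  {"FRMT": 0.66, "disr": 0.60, "mim": 0.58},
    ("A", "SVM"): {"FRMT": 0.61, "disr": 0.63, "mim": 0.59},
    ("B", "NB"):  {"FRMT": 0.70, "disr": 0.64, "mim": 0.65},
    ("B", "SVM"): {"FRMT": 0.62, "disr": 0.57, "mim": 0.60},
}
freq = winning_frequency(toy)
```

In the study itself, the states would be the 49 dataset-classifier combinations and the methods would be the eleven feature selection approaches.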
From this point on, the same number of features (top 10 proteins) was selected as the input of all classifiers in all experiments. The best results in the tables are highlighted by shading. The frequency with which each feature selection method was selected as the best, or winning frequency regarding the classification performance, is depicted in Fig 5. For each cancer, each classifier model obtained its best result with only one of the eleven feature selection methods. The results presented in S1 Table are summarized in Table 3. The left part of Table 3 presents a comparative analysis of the performance of the FRMT method when different classification models are applied to the seven datasets. The right part of Table 3 shows the best result of every feature selection method, in combination with the classifier that led to its best performance for the prediction of cancer stage.
After applying the FRMT method to the different datasets, the top-ranked proteins were extracted, and the names of the first 10 informative ones are reported in S1 Table.

Discussion
In this study, a new approach called the FRMT method was proposed to select protein biomarkers. Ten FFS methods were integrated to extract the best biomarkers for cancer stage prediction. By finding the best proteins via a multi-criteria decision analysis method, the FRMT method provides a proficient way of ranking proteins from protein expression profile data without the need to choose a suitable FFS method for a specific problem. The performance of seven well-known classifiers was evaluated and reported using the 10 top-ranked proteins selected by the FRMT and other FFS approaches. The results indicate that the FRMT method is more advantageous than other FFS methods in terms of robustness of classification performance. By measuring the number of times that a method obtained the best results, we observed that the best frequency was achieved by the FRMT method, with 6 out of 7 times in the UCEC and KIRC cancers (Fig 5). Furthermore, in the READ and LUSC cancer datasets, the maximum frequency of 3 out of 7 was reached by the FRMT method. In the HSNE cancer, the frequencies for the disr and FRMT methods are both equal to 3. It should be noted that some feature selection methods were never chosen as the best model.
As shown in the pie chart in Fig 6, the FRMT method achieved the best classification performance in 62 percent of the experiments over the whole proteomic dataset. The disr and mim methods followed with success rates of 8 and 6 percent, respectively.
As reported in Table 3, the best performance in the prediction of cancer stage, evaluated by using AUC and MCC, occurred in the HSNE dataset. The AUC of 69.55 with SE (Standard Error) of 0.07 and the MCC of 0.37 with SE of 0.0002 are the best results among all datasets, and were achieved with the FRMT method as feature selection and the NB method as classifier. It should be noted that in the HSNE dataset, the disr method took second place with an AUC of 60.59 with SE of 0.31 and an MCC of 0.27 with SE of 0.0018.
Comparison of the FRMT method with the other methods in Table 3 suggests that the wrs method achieved the best results among the other FFS methods; an AUC of 63.51 with SE of 0.13 and an MCC of 0.28 with SE of 0.0005 were achieved by using RF as classifier in the KIRC dataset. However, in the KIRC dataset, the FRMT method obtained the best result with the NB classifier, namely an AUC of 66.54 with SE of 0.03 and an MCC of 0.34 with SE of 0.0001.
As can be seen from the data in Table 3, the NB classifier achieved the best results in the majority of the experiments that evaluated the various feature selection methods. NB achieved the best performance in 4 out of 7 datasets using the proposed method, and in two datasets by applying the cife and disr methods. Notably, the SVM classifier took second place.
The top-ranked proteins selected by the FRMT method from each dataset showed significant overlap with recently discovered biomarkers that are associated with cancer development. According to Fig 7, MAPK_pT202_Y204 is the most frequently selected protein, appearing among the top 10 proteins selected by FRMT in 4 datasets. The striking point about MAPK_pT202_Y204 is its significant role in the MAPK (mitogen-activated protein kinase) pathway and in the regulation of cell growth and differentiation [34].
In addition, S6_pS235_S236, which is involved in protein translation induced by growth factors and mitogens [34], is the second most frequently selected protein, appearing among the top 10 proteins selected by FRMT in 3 datasets.
Gab2, which is selected by FRMT as the most informative protein in the READ dataset, has recently been introduced as an overexpressed protein in several cancer types [76][77][78]. Moreover, several researchers have reported that overexpression of Gab2 stimulates cell proliferation, cell transformation, and tumor progression; Ding et al. [79] showed Gab2 overexpression in clinical colorectal cancer (CRC) specimens. Gab2 is also selected by FRMT as the second most discriminative protein in the OV dataset, in concordance with recent studies that reported Gab2 amplification and overexpression in a subset of primary high-grade serous ovarian cancers and cell lines [78]. Furthermore, the expression level of IRS1, which is selected by FRMT as the second most discriminative protein in the READ dataset, was utilized by Hanyuda et al. as a predictive marker for the classification of patients according to the survival benefit they gained from exercise [80].
Regarding S6 phosphorylation (S6_pS240_S244), which is selected by FRMT as the most discriminative protein in the HSNE dataset, previous studies have revealed its high occurrence in HNSCC specimens and demonstrated its correlation with clinical outcomes [81].
The Bcl-2 protein is chosen by FRMT as an important marker in the COAD dataset. This finding broadly supports the work of Poincloux et al., who linked loss of Bcl-2 protein expression with an increase in relapse of stage II colon cancer and suggested that it could be a potential histo-prognostic marker in therapy decision making [82].
Many studies [53,60-62,83-85] have demonstrated that high-dimensional data bring about information redundancy or noise that results in poor prediction accuracy, over-fitting that results in low generalization ability of the prediction model, and the curse of dimensionality, which in turn is a handicap for the computation. Thus, a novel two-step feature selection technique was applied to optimize features.
As demonstrated in a series of recent publications (see, e.g., [26-32, 86, 87]) on evaluating new prediction/classification methods, user-friendly and publicly accessible web-servers significantly enhance their impact [88,89]. We shall therefore try to provide a web-server in our future work for online application of the method presented in this paper. Moreover, to extend our experiments, we shall consider combining different feature selectors as in [90].

Conclusion
Various FFS methods may lead to diverse biomarkers with different discriminative power in different datasets. The proposed FRMT method can help researchers select more stable biomarkers from protein expression profiles by integrating various FFS methods. The proposed method has the advantage of stability and classification performance compared with other approaches. However, it has a higher computational complexity than the individual FFS methods. On the other hand, in comparison with wrapper feature selection approaches, the FRMT method has lower computational complexity and produces more general results without overfitting.
Supporting information
S1 Table