Improved Classification of Lung Cancer Tumors Based on Structural and Physicochemical Properties of Proteins Using Data Mining Models

Detecting divergence between oncogenic tumors plays a pivotal role in cancer diagnosis and therapy. This research work was focused on designing a computational strategy to predict the class of lung cancer tumors from the structural and physicochemical properties (1497 attributes) of protein sequences obtained from genes defined by microarray analysis. The proposed methodology involved the use of hybrid feature selection techniques (gain ratio and correlation based subset evaluators with Incremental Feature Selection) followed by Bayesian Network prediction to discriminate lung cancer tumors as Small Cell Lung Cancer (SCLC), Non-Small Cell Lung Cancer (NSCLC) and the COMMON classes. Moreover, this methodology eliminated the need for extensive data cleansing strategies on the protein properties and revealed the optimal and minimal set of features that contributed to lung cancer tumor classification with an improved accuracy compared to previous work. We also attempted to predict via supervised clustering the possible clusters in the lung tumor data. Our results revealed that supervised clustering algorithms exhibited poor performance in differentiating the lung tumor classes. Hybrid feature selection identified the distribution of solvent accessibility, polarizability and hydrophobicity as the highest ranked features with Incremental feature selection and Bayesian Network prediction generating the optimal Jack-knife cross validation accuracy of 87.6%. Precise categorization of oncogenic genes causing SCLC and NSCLC based on the structural and physicochemical properties of their protein sequences is expected to unravel the functionality of proteins that are essential in maintaining the genomic integrity of a cell and also act as an informative source for drug design, targeting essential protein properties and their composition that are found to exist in lung cancer tumors.


Introduction
Oncogenic tumors are the leading cause of death around the world with Lung Cancer bearing the major toll of malignant fatalities [1][2][3]. Smoking and use of tobacco along with diverse environmental carcinogens increased human susceptibility to this deadly ailment [4][5]. Gene Polymorphisms concerned with detoxification of carcinogens have been associated with formation of lung tumors. Lung tumors have been broadly categorized as Non-Small Cell Lung Cancer (NSCLC) affecting nearly two-thirds of patients with a low-survival rate and Small Cell Lung Cancer (SCLC), both of which respond to different forms of therapy [6][7][8][9][10]. This drives the need to precisely identify pathological differences between these two types of tumors.
Gene expression patterns from microarray analysis enabled the sub-categorization of lung cancer types that related to the degree of tumor demarcation, nature of therapy and patient survival rate [11][12][13][14]. It is well established that lung carcinogenesis is a process involving gradual phenotypic changes that occur as a result of oncogene activation and deactivation of tumor suppressor genes [8]. Reports thus far in the literature have failed to identify any reliable biomarkers for this condition, since wet-lab experiments often consume considerable time, expertise and capital with uncertain returns [1][4][5][6]. Microarray technology has been utilized in the recent past to detect appropriate biomarkers, but present methodologies are prone to overlooking potentially useful information contained in patient tissue samples [14]. Hence determination of potential and informative markers (diagnostic and prognostic) from both the biological and molecular perspective is essential to study and evaluate the genetic and molecular distinctiveness that characterizes tumors and Tumor Node Metastasis (TNM) staging in lung carcinogenesis, to enable effective diagnosis and to corroborate therapeutic strategies.
In recent research undertakings, several classifiers and data mining models have been used that targeted the appropriate categorization of lung cancer tumors. Forty-one samples characterized by 26 attributes computed from the mass-to-charge ratio (m/z) and peak heights of proteins identified by mass spectroscopy of blood serum samples from lung cancer affected and non-affected patients were utilized to train a classification and regression tree (CART) model [13]. Molecular classification of NSCLC based on a percentage train-test approach was used to evaluate the reliability of cDNA microarray-based classifications of resected human non-small cell lung cancers (NSCLCs) [14]. In further research, Linear Discriminant Analysis and Artificial Neural Network classification of individual lung cancer cell lines (SCLC and NSCLC) was performed based on DNA methylation markers [13]. The results reported that Artificial Neural Network analysis of DNA methylation data was a potential technique to develop automated methods for lung cancer classification. In another study, a Support Vector Machine [14] was used in lung cancer gene expression database analysis, and the results proposed that incorporating prior knowledge into cancer classification based on gene expression data was essential to improve classification accuracy. Automatic classification of lung TNM cancer stages from free-text pathology reports using symbolic rule-based classification was also attempted [15]. The methodology was assessed based on accuracy parameters and confusion matrices against a database of multidisciplinary team staging decisions and a machine learning-based text classification system using support vector machines.
The current investigation focused on a very recent article by Hosseinzadeh et al. [1] that aimed to classify lung cancer tumors based on structural and physicochemical properties of proteins using bioinformatics models. We chose this paper for three main reasons. (i) The work is the most recent and the data is publicly available. (ii) The research involved extensive data cleaning and pre-processing strategies which could be avoided. (iii) Their work involved a few assumptions on the obtained data which are not adopted in this work. Moreover, the method proposed in this paper was able to generate higher classification accuracy in differentiating between lung cancer tumors based on protein properties while retaining the original data and eliminating assumptions. Precisely, this paper makes the following contributions: (a) Design of a new methodology with hybrid feature selection techniques to identify the optimal protein features that distinguished between lung cancer tumors with higher accuracy. (b) Elimination of the need for data cleaning and assumptions on attribute significance. (c) Identification of contributing features that are believed to influence drug design targeting the protein properties leading to lung cancer tumors.

Dataset
The Gene Set Enrichment Analysis database (GSEA db) [16] was utilized to obtain the gene sets that contributed to the development of NSCLC and SCLC. These were obtained from the Kyoto Encyclopaedia of Genes and Genomes (KEGG) [17] gene sets. A total of 84 genes [17] were present in the SCLC gene set while 54 genes [17] were found contributing to NSCLC. In order to precisely discriminate between the two classes of tumors, the genes commonly occurring in both tumors were placed in a different class called COMMON. After this separation, the SCLC class contained 59 genes, the NSCLC class 29 and the COMMON class 25. Proteins for each group of genes were obtained from the Gene Card database [18] and the corresponding protein sequences were extracted from the UniProt Knowledgebase database [19]. These sequences were saved as a text file and loaded onto the PROFEAT web server [20][21] to compute the structural and physicochemical properties associated with each protein. A total of 1497 attributes were computed and represented as Fi.j.k.l, where 'l' represented the descriptor value, 'k' denoted the descriptor, 'j' indicated the feature and 'i' signified the feature group [20][21]. The features and their annotations have been provided as File S1. The complete data set comprising 1497 features and 113 tumor samples [17] was loaded into the WEKA 3.7.7 machine learning software [22] and the tumor type was set to be the target class. The complete pre-processed dataset is provided as File S2. The variation in sample size as compared to previous work is attributed to possible updates in the database. The methodology proposed in this research work is described in the following section.
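The class sizes reported above follow from simple set arithmetic on the two KEGG gene sets (84 - 25 = 59 SCLC-only genes, 54 - 25 = 29 NSCLC-only genes, 59 + 29 + 25 = 113 samples). A minimal Python sketch of that partition is shown below; the function name and the gene identifiers are hypothetical placeholders, sized only to match the reported counts, and are not the actual KEGG gene symbols.

```python
def partition_gene_sets(sclc_genes, nsclc_genes):
    """Split two gene sets into SCLC-only, NSCLC-only and COMMON classes."""
    sclc, nsclc = set(sclc_genes), set(nsclc_genes)
    common = sclc & nsclc  # genes occurring in both tumor gene sets
    return {"SCLC": sclc - common, "NSCLC": nsclc - common, "COMMON": common}


# Placeholder identifiers (not real gene symbols), sized to the reported counts.
common_ids = [f"C{i}" for i in range(25)]
sclc_ids = common_ids + [f"S{i}" for i in range(59)]    # 84 genes in total
nsclc_ids = common_ids + [f"N{i}" for i in range(29)]   # 54 genes in total

classes = partition_gene_sets(sclc_ids, nsclc_ids)
class_sizes = {name: len(genes) for name, genes in classes.items()}
```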

Proposed Computational Methodology
The proposed methodology comprised two phases: the training phase and the prediction phase. The training phase incorporated the data preparation, feature selection and classification process, while the prediction phase involved evaluation of the classifier model using the Jack-knife cross-validation test based on two performance parameters [23][24]: Matthews Correlation Coefficient (MCC) and Accuracy. The diagrammatic representation of the proposed methodology is given in Figure 1. The data preparation phase incorporated categorization of the input gene sets as the SCLC, NSCLC and COMMON classes. This was followed by Hybrid Feature Selection with Incremental Feature Selection. The classification models were then built and compared to identify the best performing computational prediction technique for lung tumor classification using protein structural and physicochemical properties.
Hybrid Feature Selection. Feature ranking presented significant features in the order of their contribution to categorizing the samples under the different target classes [25][26][27][28]. Since most feature selection algorithms focused on ranking the attributes according to their significance value, the liability of choosing the limiting constraint rested with the user [29][30][31]. Hence in order to automate the process of finding the minimal yet optimal set of features, the ranking feature selection algorithms were followed by Correlation Subset Evaluators [32] that included features highly correlated to the class and least correlated to each other. Since both the ranking and subset evaluators were utilized to obtain the optimal feature set, this was termed the Hybrid Feature Selection strategy. The description of the methods used in this research is detailed below.
Gain Ratio Criterion. The gain ratio criterion [33][34] revealed the association between an attribute and the class value, and was computed from the Information Gain using the Information Entropy (InfoE) values [35]. Let $F$ be the set of all features and $S_R$ the set of all records with entropy $H(S_R)$, and let $Value(r,f)$ be the value of a specific instance $r \in S_R$ for the feature $f \in F$. Information Gain for the attribute was computed using Equation (1) as follows [35]:

$$IG(S_R, f) = H(S_R) - \sum_{v \in values(f)} \frac{|\{r \in S_R \mid Value(r,f) = v\}|}{|S_R|} \, H(\{r \in S_R \mid Value(r,f) = v\}) \quad (1)$$

In order to compute the Intrinsic Value for a test, the following formula was adopted:

$$IV(S_R, f) = -\sum_{v \in values(f)} \frac{|\{r \in S_R \mid Value(r,f) = v\}|}{|S_R|} \, \log_2 \frac{|\{r \in S_R \mid Value(r,f) = v\}|}{|S_R|} \quad (2)$$

The Information Gain Ratio [33][34][35] was calculated as the ratio between the Information Gain and the Intrinsic Value, according to Equation (3):

$$IGRatio(S_R, f) = \frac{IG(S_R, f)}{IV(S_R, f)} \quad (3)$$

The attributes were thus ranked in descending order of their Gain Ratio score and were used for the CFS Subset Evaluator method described below.
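The gain ratio ranking described above can be illustrated with a short Python sketch. It assumes the attribute has already been discretized into nominal values (an assumption made here for illustration; continuous descriptors would need binning first), and the function names and toy values are hypothetical rather than taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of a nominal attribute divided by its intrinsic value."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)  # class labels per attribute value
    info_gain = entropy(labels) - sum(
        len(g) / n * entropy(g) for g in groups.values()
    )
    intrinsic = -sum(
        len(g) / n * math.log2(len(g) / n) for g in groups.values()
    )
    # An attribute with a single value carries no split information.
    return info_gain / intrinsic if intrinsic > 0 else 0.0


toy_labels = ["SCLC", "SCLC", "NSCLC", "NSCLC"]
informative = gain_ratio(["a", "a", "b", "b"], toy_labels)    # separates classes
uninformative = gain_ratio(["a", "b", "a", "b"], toy_labels)  # mixes classes
```

Ranking then amounts to sorting the attributes by this score in descending order; attributes scoring zero carry no information about the class under this measure.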

Correlation Feature Selection (CFS) Subset Evaluator. The CFS hypothesis [36] suggested that the most predictive features needed to be highly correlated to the target class and least relevant to other predictor attributes. The following equation [36][37] recorded the merit of a feature subset $S_k$ that consisted of $k$ features:

$$Merit_{S_k} = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}$$

where $\overline{r_{cf}}$ was the average value of all feature-classification correlations, and $\overline{r_{ff}}$ was the average value of all feature-feature correlations. The CFS criterion [36] was defined as follows:

$$CFS = \max_{S_k} \left[ \frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_k}}{\sqrt{k + 2 \, (r_{f_1 f_2} + \cdots + r_{f_i f_j} + \cdots + r_{f_k f_1})}} \right]$$

where the $r_{cf_i}$ and $r_{f_i f_j}$ variables were referred to as correlations. The attributes that portrayed a high correlation to the target class and least relevance to each other were chosen as the best subset of attributes. The attributes filtered by the CFS Subset Evaluator method were added in an incremental manner to identify the optimal set of features that contributed to lung tumor categorization. This methodology is reported below.
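As a hedged illustration of the CFS merit described above, the following Python sketch scores a candidate subset from its feature-class correlations and pairwise feature-feature correlations. The correlations are supplied directly as inputs; how they are estimated is not shown, and the function name and numbers are hypothetical.

```python
import math


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """CFS merit of a subset: k * mean(r_cf) / sqrt(k + k(k-1) * mean(r_ff)).

    feature_class_corrs: one correlation per feature in the subset.
    feature_feature_corrs: one correlation per unordered feature pair.
    """
    k = len(feature_class_corrs)
    r_cf = sum(feature_class_corrs) / k
    r_ff = (
        sum(feature_feature_corrs) / len(feature_feature_corrs)
        if feature_feature_corrs
        else 0.0  # a single feature has no feature-feature pairs
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Two features equally correlated with the class (toy numbers).
merit_independent = cfs_merit([0.6, 0.6], [0.0])  # features uncorrelated
merit_redundant = cfs_merit([0.6, 0.6], [1.0])    # features fully redundant
```

In this toy case the redundant pair scores no better than one of its features alone, which is the behaviour the CFS hypothesis relies on when preferring features that are least correlated with each other.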
Incremental Feature Selection. The predictor attributes generated by the Gain Ratio and CFS Subset Attribute Evaluator (Hybrid Feature Selection) method were later utilized for Incremental Feature Selection (IFS) [38][39] to determine the minimal and optimal set of features. On adding each feature, a new feature set was obtained, and the $k$th feature set could be stated as

$$AT_k = \{at_1, at_2, \ldots, at_k\}, \quad 1 \le k \le M$$

where $M$ denoted the total number of predictor subsets. For each feature set, the predictor model was constructed and tested through the Jack-knife cross-validation method. The MCC and Accuracy of cross-validation were measured, leading to the formation of the IFS table listing the number of features and the classification accuracy they were able to generate. $AT_o$ was the minimal and optimal feature set that achieved the highest MCC and accuracy.
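The IFS procedure can be sketched as a simple loop over growing prefixes of the ranked feature list. In this Python sketch the `evaluate` callback stands in for building a classifier and running Jack-knife cross-validation on the given subset; the callback, feature names and scores below are made-up placeholders, not results from this study.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set one ranked feature at a time and keep the best prefix.

    evaluate(subset) must return a score (e.g. cross-validated MCC or accuracy).
    Ties are resolved in favour of the smaller feature set.
    """
    best_k, best_score = 0, float("-inf")
    ifs_table = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)
        ifs_table.append((k, score))
        if score > best_score:  # strict '>' keeps the minimal optimal set
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, ifs_table


# Placeholder scores keyed by subset size (illustrative only).
toy_scores = {1: 0.70, 2: 0.82, 3: 0.88, 4: 0.85}
ranked = ["f1", "f2", "f3", "f4"]
optimal_set, optimal_score, table = incremental_feature_selection(
    ranked, lambda subset: toy_scores[len(subset)]
)
```

Plotting the second column of `table` against the first gives an IFS curve of the kind used to read off the optimal number of features.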
In order to determine the best classification model for lung tumor classification [40], a total of five benchmark prediction techniques viz, Support Vector Machine [29], Random Forest [1], Nearest Neighbor algorithm [39], Bayesian Network Learning [22] and Random Committee (Ensemble classifier) [22] were analyzed and compared. Our results affirmed that Bayesian Network approach generated higher accuracy in tumor classification with the optimal feature set.
Bayesian Network Learning. The learning phase in this approach incorporated the process of finding an appropriate Bayesian network [41] given a data set $D$ over $R$, where $R = \{r_1, \ldots, r_n\}$, $n \ge 1$, was the set of input variables. The classification task consisted of classifying a variable $V = v_0$ called the class variable (NSCLC/SCLC/COMMON) given a set of variables $R = r_1 \ldots r_n$. A classifier $C: r \rightarrow v$ was a function that mapped an instance of $r$ to a value of $v$. The classifier was learned from a dataset $D$ that consisted of samples over $(r, v)$ [42]. A Bayesian network over a set of variables $R$ was a network structure $B_S$, a directed acyclic graph (DAG) over the set of variables $R$, together with a set of probability tables [43] given by

$$B_P = \{\, p(r \mid pa(r)) \mid r \in R \,\}$$

where $pa(r)$ was the set of parents of $r$ in $B_S$, and the network represented a probability distribution given by Eq. (8):

$$P(R) = \prod_{r \in R} p(r \mid pa(r)) \quad (8)$$

The inference made from the Bayesian Network [41][42][43] was to allocate the category with the maximum probability [44]. The Simple Estimator with the K2 local search method using the Bayes score was utilized (default parameters) for the execution of the algorithm in WEKA 3.7.7 [22]. The clustering methods are described in the following section.
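To make the factorized distribution and the maximum-probability decision rule concrete, here is a minimal Python sketch over a toy network. The structure (the class variable V as the only parent of a single discretized feature r1), the probability tables and all names are assumptions made for illustration; they are not the network learned in this study, and the sketch assumes every non-class variable is observed.

```python
def joint_probability(assignment, parents, cpts):
    """Product over all variables of p(variable | its parents) for a full assignment."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpts[var][parent_values][value]
    return p


def classify(evidence, class_var, class_values, parents, cpts):
    """Return the class value with the highest posterior, plus all posteriors."""
    scores = {
        c: joint_probability({**evidence, class_var: c}, parents, cpts)
        for c in class_values
    }
    total = sum(scores.values())
    posteriors = {c: s / total for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Toy network: V -> r1 (hypothetical structure and probabilities).
parents = {"V": (), "r1": ("V",)}
cpts = {
    "V": {(): {"SCLC": 0.5, "NSCLC": 0.3, "COMMON": 0.2}},
    "r1": {
        ("SCLC",): {"high": 0.8, "low": 0.2},
        ("NSCLC",): {"high": 0.3, "low": 0.7},
        ("COMMON",): {"high": 0.5, "low": 0.5},
    },
}
predicted, posteriors = classify(
    {"r1": "high"}, "V", ["SCLC", "NSCLC", "COMMON"], parents, cpts
)
```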
Supervised Clustering. Supervised clustering [45][46][47] deviated from unsupervised clustering in that it was applied on already categorized examples with the prime aim of detecting clusters that had high probability density with respect to a single class. Supervised clustering required the number of clusters to be kept to a minimum, and objects were assigned to clusters using the notion of closeness with respect to a given distance function [48][49]. Supervised clustering evaluated a clustering technique based on the following two criteria [47][48][49]: (i) Class impurity, Impurity(X): it was measured by the percentage of marginal examples in the different clusters of a clustering X. A marginal example was an example that belonged to a class different from the most frequent class in its cluster. (ii) Number of clusters, k.
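The two evaluation criteria can be computed directly from cluster assignments and known class labels. The Python sketch below is illustrative only: the function name and toy assignments are hypothetical, and impurity is returned as a fraction rather than a percentage.

```python
from collections import Counter


def class_impurity(cluster_ids, class_labels):
    """Return (impurity, k): fraction of marginal examples and number of clusters.

    A marginal example belongs to a class other than the most frequent
    class of the cluster it was assigned to.
    """
    clusters = {}
    for cid, label in zip(cluster_ids, class_labels):
        clusters.setdefault(cid, []).append(label)
    marginal = sum(
        len(members) - Counter(members).most_common(1)[0][1]
        for members in clusters.values()
    )
    return marginal / len(class_labels), len(clusters)


toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["SCLC", "SCLC", "NSCLC", "NSCLC", "NSCLC", "COMMON"]
impurity, k = class_impurity(toy_clusters, toy_classes)
```

Here each of the two clusters contains one marginal example, so the impurity is 2/6.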
In this research we compared the classes-to-clusters evaluation accuracy of seven clustering algorithms [22], namely the Expectation-Maximization (EM) algorithm, COBWEB [22], Hierarchical clustering, K-Means clustering, Farthest First clustering, Density-Based clustering and Filtered clustering. The number of clusters was automatically assigned in the COBWEB algorithm, whereas the remaining algorithms allowed the user to select the desired number of clusters [22]. Some algorithms exhibited better performance when all the attributes were included for clustering, while their performance deteriorated on the hybrid feature selection datasets. The performance evaluation methods and parameters are described in the subsequent sections.
Jack-knife Cross-Validation Test. Statistical prediction methods [50] were utilized for measuring the predictor performance in order to assess their efficiency in practical applications. In this study, the jack-knife cross-validation method [50][51] was used for verification and validation of classifier accuracy, since previous reports have stated it to be the least arbitrary in nature and it is widely used by researchers and practitioners to estimate the performance of predictors. In jack-knife cross-validation [38][39][52], each one of the statistical records in the training dataset was in turn singled out as a test sample and the predictor was trained on the remaining samples. During the jack-knifing process [23][24][39], both the training dataset and testing dataset were actually open, and a statistical sample moved from one group to the other. In this research, the Matthews Correlation Coefficient (MCC) and Accuracy [50][51][52] were adopted as the indexes to test the proposed methodology.
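The jack-knife procedure and the two performance indexes can be sketched in Python as follows. Several choices here are assumptions for illustration: the multiclass generalization of MCC computed from class counts is used (the study may have computed MCC differently, for example per class), and a 1-nearest-neighbour rule on toy one-dimensional data stands in for the actual classifiers and protein features.

```python
import math
from collections import Counter


def jackknife_predictions(samples, labels, fit_predict):
    """Hold out each sample in turn, train on the rest, predict the held-out one."""
    preds = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        preds.append(fit_predict(train_x, train_y, samples[i]))
    return preds


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC from total, correct, per-class true and predicted counts."""
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k, p_k = Counter(y_true), Counter(y_pred)
    classes = set(t_k) | set(p_k)
    num = c * s - sum(t_k[k] * p_k[k] for k in classes)
    den = math.sqrt(
        (s * s - sum(p_k[k] ** 2 for k in classes))
        * (s * s - sum(t_k[k] ** 2 for k in classes))
    )
    return num / den if den else 0.0


def nearest_neighbor(train_x, train_y, x):
    """Stand-in classifier: label of the closest training sample."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]


# Toy, well-separated data (placeholder values, not protein descriptors).
toy_x = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]]
toy_y = ["SCLC", "SCLC", "NSCLC", "NSCLC", "COMMON", "COMMON"]
toy_pred = jackknife_predictions(toy_x, toy_y, nearest_neighbor)
toy_acc = accuracy(toy_y, toy_pred)
toy_mcc = multiclass_mcc(toy_y, toy_pred)
```

For two classes this MCC expression reduces to the familiar form based on true and false positives and negatives.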

Experimental Results and Discussion
The experimental results are discussed in three sections. The foremost describes the ranking of the structural and physicochemical properties according to their gain ratio. The entire list of attributes was ranked and the file is provided as Table S1. The second section deals with the results of Incremental Feature Selection while the final section portrays the comparative performance of the benchmark classification models on the protein sequence properties in categorizing lung tumors.

Hybrid Feature Selection
A total of 1497 attributes were initially loaded as the training data with 113 instances [17][18]. No records were duplicated and there were no missing values. On ranking the attributes by the Gain Ratio criterion, a total of 134 attributes were assigned a gain ratio greater than zero. The CFS subset evaluator returned 39 features as the most optimal subset that was highly correlated to the target class but least correlated to each other. These features were then utilized for the Incremental feature Selection process. The results of the Hybrid Feature Selection techniques are given as Table S1.

Incremental Feature Selection
The ranked attributes from the CFS subset evaluator were then input to the classifier in descending order of their rank. At each attribute entry, the MCC and accuracy of the classifier on the Jack-knife test were calculated. Bayesian Network Learning was found to give the highest prediction MCC of 0.812 and accuracy of 87.6% with 36 features. The IFS curves generated on classifier accuracy and the corresponding MCC are represented in Figure 2. The optimal prediction accuracy with the proposed methodology for each feature subset is given in Table 1. The complete results of the Incremental Feature Selection process on all three Hybrid Feature Selection datasets are given in Table S2.

Classifier Models
Benchmark classification models that have been reported [14] [38][39] [53][54] to generate high accuracy in classification of biological data were compared to determine the optimal prediction technique that generated highest accuracy in prediction. The comparative performance of the classification models with the feature set generated by the Hybrid Feature Selection technique is depicted in Table 2. The performance is compared based on the MCC and prediction accuracy.

Clustering Models
This study utilized seven clustering algorithms [22] in order to compare their performance in categorizing the classes of lung tumors based on the attribute values. The results of running the clustering algorithms on the dataset before and after performing hybrid feature selection are given in Table 3. It is evident from the tabulated results that the clustering algorithms were not useful in providing any new insight into the attribute significance in detecting clusters, since their accuracy was substantially low. The discussions on the data and the results are presented in the ensuing section.

Influence of Structural and Physicochemical Properties
There have been several studies on lung cancer classification [55][56][57][58][59][60][61][62][63][64][65], but the only previous computational study on the influence of protein sequence based structural and physicochemical properties in categorization of lung tumors was done by Hosseinzadeh et al. [1], who utilized the decision tree generated by the Random Forest classifier to identify the contributing attributes. In this study, we utilized the smallest tree among the 10 decision tree models generated by the Random Forest classifier [66] on the training dataset in order to identify the attributes contributing most to lung tumor classification. Although the Random Committee algorithm also showed 100% accuracy and an MCC of 1 in the training phase, its results on Jack-knife cross-validation were not as high as those of the Random Forest model. The decision tree model with the smallest number of nodes generated by the Random Forest on the training dataset is portrayed in Figure 3. The visualization of this tree made it easier to identify the composition of each protein property in the different types of lung cancer tumors, thus providing a source for drug design targeting the protein composition.
Novel insights on the protein properties were gained from the Random Forest model, with a new set of discriminative features being reported for the first time for discriminating the lung tumor classes. It was evident that a strict demarcation among the tumor categories was a complicated task, since many properties were found to exhibit similar composition in both the tumor classes. However, the proposed methodology was found to differentiate between the tumor classes with a high MCC of 0.812 and classification accuracy of 87.6%, the highest reported thus far in protein-property based lung tumor categorization.

Comparison to Previous Work
As stated earlier, the only previous computational study on lung tumor categorization based on the protein sequence-based structural and physicochemical properties was reported by Hosseinzadeh et al. [1], who compared ten different feature selection techniques and reported that the feature set generated by the Gain Ratio criterion gave the optimal 10-fold cross-validation accuracy of 86% with the Random Forest classifier. Their methodology incorporated 114 sequences with 30 genes in the NSCLC class, 59 in the SCLC class and 25 in the COMMON class of tumors. Moreover, their methodology also involved extensive data cleaning and pre-processing. Here we made use of the 113 sequences [16][17][18] from the KEGG gene sets corresponding to the NSCLC and SCLC tumor classes and segregated the genes under the three classes, viz., NSCLC, SCLC and COMMON. The number of records summed up to 113 with 29 genes [16][17] in the NSCLC class. This study was aimed at identifying the minimal and optimal set of features to categorize the lung tumor classes for use in diagnostic practice and drug design. Hence we used the Gain Ratio criterion, Information Gain criterion and Symmetric Uncertainty to rank the features and then applied the Correlation Feature Subset evaluator [22] with a search termination threshold of 5 and the Best First Search approach to identify the smallest subset of features with a high correlation to the target class and least correlation to each other. This resulted in a feature subset with 39 features. On comparing the jack-knife cross-validation accuracy of five benchmark classification models, the Bayesian Network Learning algorithm was found to generate the highest MCC of 0.77 with an accuracy of 85% with all three hybrid feature selection subsets. On applying Incremental Feature Selection we obtained the optimal feature set of 36 features (feature subset of Gain Ratio + CFS), generating an accuracy of 87.6%.
The previous work by Hosseinzadeh et al. reported a high accuracy of 86% only on the cleaned data, after removal of duplicate records and correlated records and after filtering based on the standard deviation values. When considering the same data, our proposed work achieved a higher accuracy with the original, unmodified data, thus saving computational time by eliminating the data cleaning process. In order to bring out the comparison more clearly, we measured the accuracy of Random Forest with Gain Ratio (the previously proposed classifier model) on the original data; it generated an optimal accuracy of only 79.6% with 26 features from the Gain Ratio-CFS feature set, compared to our proposed method, which produced 87.6% accuracy with 36 features from the same feature subset. We believe our proposed methodology can easily be extended to classify and discriminate between other oncogenic tumors since the original data was retained for computational analysis. However, the previous method appears to have generated a high accuracy (86%) only on the cleaned data, which is a limitation when extending the methodology to other cancer datasets. Moreover, the previously proposed model would entail additional data pre-processing time when applied to new cancer datasets.

Comparison with Other Methods
We compared three feature selection methods [22], namely Information Gain, Symmetric Uncertainty and Gain Ratio. We applied the CFS Subset evaluator on all the feature sets ranked by the three algorithms. All five benchmark classification algorithms [67][68] were applied on the reduced feature datasets. The results are tabulated in Table 2. All three feature selection methods displayed consistently high accuracy with the Bayesian Network prediction technique. The optimal accuracy was obtained only during the process of Incremental Feature Selection with the Gain Ratio and CFS subset evaluator combination, which attained an improved accuracy of 87.6% with 36 features. Although the Bayesian Network learning algorithm showed consistent accuracy with the reduced feature sets of the Information Gain and Symmetric Uncertainty ranked features, a substantial decline in accuracy was apparent with these subsets during the process of Incremental Feature Selection, as detailed in Table S2. Hence the Gain Ratio based ranking of features was considered to give the optimal feature set for lung tumor categorization. The features selected by all three hybrid feature selection techniques and the commonality among the selected features are displayed as a graph using the NodeXL graph visualization software [69] in Figure 4. On careful analysis of the graphical representation of the feature subsets, it could be concluded that many features were commonly filtered by all three hybrid feature selection techniques, and hence reasonably similar performance accuracy was evident across the filtered subsets. However, the process of Incremental Feature Selection disclosed the optimal and minimal feature set required for optimum prediction accuracy.

Benefits of the Bayesian Network Learning Algorithm
Bayesian Networks have been used in several [70][71][72][73] clinical prediction problems. Previous research has stated that a Bayesian network is a mathematically rigorous way to model a domain problem, being flexible and adaptable to available knowledge, and computationally efficient [72] [74][75]. Some notable features of Bayesian Networks [44] for use in clinical prediction are narrated below.
(i) A Bayes net only relates nodes that are probabilistically related by some sort of causal dependency. This eliminates the need to store all possible configurations of states. The algorithm stores and works with all possible combinations of states between sets of related parent and child nodes, which greatly reduces computational complexity. (ii) A Bayes net utilizes expert knowledge and data to build models dynamically. It allows both backward and forward reasoning.
The medical domain is one research area where expert knowledge always has room for improvement and backward reasoning is a definite requirement. Hence application of computational techniques like Bayesian Networks in discriminating and classifying tumor classes based on protein sequence based physicochemical properties is expected to advance the current state of molecular and biological analysis of oncogenic tumor classes for drug design.

Conclusion
Research on the utilization of computational techniques and predictions on clinical and biological data has intensified in the recent past owing to the fact that most wet-lab experiments consume more human expertise, time and capital with uncertain rewards. This research was aimed at identifying the minimal and optimal set of protein sequence based structural and physicochemical properties in lung tumor categorization into the NSCLC, SCLC and COMMON tumor classes. The findings of this study are believed to be both a computational and a biological advancement, the former revealing a new combination of feature selection and prediction techniques for categorizing tumor classes with enhanced accuracy and the latter providing information on protein properties prevalent in lung tumors that could aid in diagnostic practice and drug design. Possible extensions to this work would involve application of this computational framework to the categorization of other oncogenic tumors and the detection of properties that could be targeted for cancer therapy. Moreover, computational advancement would require improving the prediction accuracy of the proposed methodology by possible updates to the existing algorithms.

Supporting Information
File S1 Attribute description file.

(DOC)
File S2 Pre-processed protein based structural and physicochemical data.