A Prostate Cancer Model Built by a Novel SVM-ID3 Hybrid Feature Selection Method Using Both Genotyping and Phenotype Data from dbGaP

Genome Wide Association Studies (GWAS) make it possible to investigate many relations between Single Nucleotide Polymorphisms (SNPs) and complex diseases. GWAS output is typically large and high dimensional, and the relations between SNPs, phenotypes and diseases are most likely nonlinear. To handle high-volume, high-dimensional data and to find these nonlinear relations, we used data mining approaches and designed a hybrid feature selection model that combines a support vector machine and a decision tree. The model was tested on prostate cancer data, and for the first time combined genotype and phenotype information was used to increase diagnostic performance. We were able to select phenotypic features such as ethnicity and body mass index, and SNPs that map to specific genes such as CRR9, TERT. The performance of the proposed hybrid model on the prostate cancer dataset, with a sensitivity of 90.92% and an area under the ROC curve of 0.91, shows the potential of the approach for prediction and early detection of prostate cancer.


Introduction
Genome Wide Association Studies (GWAS) search for associations between Single Nucleotide Polymorphisms (SNPs) and complex diseases such as age-related macular degeneration [1], heart diseases [2], diabetes [3], rheumatoid arthritis [4], Crohn's disease [5], hypertension [6], multiple sclerosis [7], cancer types [8-9-10], neurodegenerative diseases [11] and psychiatric diseases such as bipolar disorder [12]. Current GWAS of SNP profiles in such chronic and complex diseases are leading to the discovery of genetic loci and individual SNPs related to these conditions, but associations based on SNP genotyping profiles alone are not strong enough to predict disease. This study was therefore designed to test whether, and to what degree, integrating genotype profiles with phenotypic features (demographic information, environmental factors, lifestyle habits and the clinical findings of a patient) strengthens the predictive performance of disease models. So far, no publication combines multiple genotypic and multiple phenotypic features; doing so requires new data mining approaches that can handle data with such different characteristics and even higher dimensionality.
Methods used in GWAS fall into two main categories: parametric and non-parametric [13]. Non-parametric methods do not require a genetic model to be specified beforehand; instead, they build their own models from the given data using data mining and machine learning [13]. They are preferred because of the high dimensionality of genetic data, for which traditional statistical methods are not sufficient [14]. Almost all known machine learning algorithms have been used in GWAS; some of the foremost are Decision Trees [15][16], Artificial Neural Networks [16], Bayesian Belief Networks [17], Support Vector Machines [18-19-20] and Genetic Algorithms [21]. Across the various applications of data mining to genotyping data, there is no clear evidence that any one method performs better than the others [13]. Each method has its own advantages and disadvantages, and the choice of method depends mostly on the problem, the data type, the study design and the aim of the work. There are also a few examples of hybrid data mining approaches applied to GWAS data to increase predictive performance, in which one main method is selected and genetic-based algorithms are used in a second step to optimize the main method [22].
Here, for the first time, we introduce a hybrid feature selection model that combines two non-parametric data mining methods, SVM and ID3, to determine the most predictive phenotypic and genotypic features related to a complex disease. In contrast to many works in the literature, we used both methods as individual steps rather than using the second only to optimize a main method. Prostate cancer data is used as a case study, and we demonstrate that combining genotype information with phenotypes gives better predictive performance in disease diagnosis than using only genotypes or only phenotypes, while exceeding the performance of the prostate specific antigen (PSA) screening test [23].

Prostate Cancer Data Set
The dataset used in this work, "Multi Ethnic Genome Wide Scan of Prostate Cancer", was downloaded from NCBI's dbGaP database under accession number phs000306, version 2. It consists of 4650 cases and 4795 controls from three ethnicities: African American, Latino and Japanese. Each individual in the study has 600,000 SNPs and 20 phenotypes, and 9130 subjects have both phenotypic and genotypic attributes.

Data Preprocessing
Data preprocessing consisted of three steps. In the first step, a PLINK analysis was conducted to measure the statistical strength of the associations between the genotypes and the disease. After the GWAS, the threshold for association of a SNP with prostate cancer was set at p < 0.005, and the 22,848 SNPs satisfying this condition formed the first representative subset. In the second step, the AHP (Analytical Hierarchical Process) feature of METU-SNP was used to prioritize SNPs by biological and statistical significance, which filtered the associated SNPs down to 2710.
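The first filtering step can be illustrated with a minimal sketch. It assumes a whitespace-delimited association table with `SNP` and `P` columns, as PLINK association output commonly provides; the function name `select_snps` and the example rows are hypothetical and are not taken from the study.

```python
P_THRESHOLD = 0.005  # association threshold stated in the text


def select_snps(assoc_lines, threshold=P_THRESHOLD):
    """Return SNP IDs whose p-value is strictly below the threshold."""
    header = assoc_lines[0].split()
    snp_col = header.index("SNP")
    p_col = header.index("P")
    selected = []
    for line in assoc_lines[1:]:
        fields = line.split()
        if len(fields) <= max(snp_col, p_col):
            continue  # skip blank or truncated rows
        try:
            p_value = float(fields[p_col])
        except ValueError:
            continue  # skip non-numeric entries such as "NA"
        if p_value < threshold:
            selected.append(fields[snp_col])
    return selected


# Hypothetical rows for illustration only; not values from the study.
example = [
    "CHR SNP BP P",
    "1 rs_demo_a 1000 0.0004",
    "1 rs_demo_b 2000 0.2",
    "2 rs_demo_c 3000 NA",
    "2 rs_demo_d 4000 0.005",
]
print(select_snps(example))
```

Only the first hypothetical SNP passes, because the comparison is strict and a p-value equal to the threshold is excluded.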
Data matching, cleaning and transformation were done in the final step of preprocessing. In the matching step, the genotypic and phenotypic attributes of the subjects were combined using the subject IDs and the subject ID conversions given in the manifest data. In the cleaning step, missing values of phenotypic attributes were replaced by the class mean, and an attribute was deleted where the class mean could not be calculated. Data transformation was needed to code the alleles, because SVMs use numerical rather than categorical values. In the literature, allele combinations are coded by three numeric values based on the heterozygous and homozygous major alleles [18]. A disadvantage of these schemes is that "the alleles are not treated symmetrically" [18]. As the parent of origin was not indicated in our data, we used an alternative coding scheme in which symmetric alleles are treated in the same manner. This coding scheme is presented in Table 1.
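The cleaning rule described above can be sketched as follows. This is one possible reading rather than the authors' implementation: it assumes that a class mean "cannot be calculated" when a class has no observed value for the attribute, in which case the attribute is dropped for all subjects. The names `impute_class_mean`, `status` and `bmi`, and the example records, are hypothetical.

```python
from collections import defaultdict


def impute_class_mean(rows, attribute, label_key="status"):
    """Fill missing values (None) of one attribute with the mean of the subject's class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        value = row.get(attribute)
        if value is not None:
            sums[row[label_key]] += value
            counts[row[label_key]] += 1

    labels = {row[label_key] for row in rows}
    if any(counts[label] == 0 for label in labels):
        # Assumption: with no observed value in some class, the attribute is deleted.
        return [{k: v for k, v in row.items() if k != attribute} for row in rows]

    means = {label: sums[label] / counts[label] for label in labels}
    imputed = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(attribute) is None:
            new_row[attribute] = means[new_row[label_key]]
        imputed.append(new_row)
    return imputed


# Hypothetical subjects for illustration only.
subjects = [
    {"status": "case", "bmi": 20.0},
    {"status": "case", "bmi": None},
    {"status": "case", "bmi": 30.0},
    {"status": "control", "bmi": 22.0},
]
cleaned = impute_class_mean(subjects, "bmi")

sparse = [
    {"status": "case", "bmi": 24.0},
    {"status": "control", "bmi": None},
]
dropped = impute_class_mean(sparse, "bmi")
```

In the first example the missing case value becomes the mean of the observed case values; in the second the attribute is removed because no control value is available.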

Analysis
According to the literature, the most widely used algorithms for detecting relations between genotype information and disease are ANN, SVM and Decision Trees. There are also examples of data mining approaches applied in a hybrid manner to increase predictive performance, where one main method is selected and genetic-based algorithms are used in a second step to optimize the main method [15-22].
In our model we combined two different methods, SVM and ID3, and applied an appropriate optimization to each, rather than combining a main method with an advanced optimization as described above. In this way, instead of benefiting from one strong method, we combined the strengths of two methodologies: ID3's robustness to noise and outliers [24] and its power to handle non-linear problems, and SVM's prediction performance on non-linear binary classification problems. Both methods are also more interpretable than alternatives such as ANNs.
Our SVM-ID3 hybrid model was constructed in RapidMiner 5.0, a free, open source software tool for data mining that has been used in various applications in the literature, such as [25]. For the SVM phase the RBF kernel was chosen. This kernel is widely used in GWAS [19] and was preferred in our study for its faster learning speed and because it can behave like both the linear kernel and the sigmoid kernel under some special conditions [26]. Besides the kernel function, the SVM has two important parameters (C, γ) which, if not adjusted well, can cause overfitting or underfitting. The constant C adjusts the margin of the hyperplane that separates the classes, and the gamma parameter (γ) shapes the decision boundary. Optimization of these parameters has been reported previously [27], and we applied the grid search approach, which has been described previously [28]. The value ranges for C and gamma used during the grid search were decided based on the literature [27] along with our own experience with the data. For gamma the range was [0.0001, 100], searched in powers of ten, and for C the range was [0, 10], searched in five linear steps. The grid search for SVM optimization took around ten hours on a system with 16 GB of memory and a 3.4 GHz Intel Core i7 processor, and covered 42 combinations.
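The size of the parameter grid can be checked with a short sketch. It assumes that "powers of ten" means every integer exponent from -4 to 2 and that "five linear steps" means six equally spaced values including both endpoints; this reading is consistent with the 42 reported combinations. The variable names are illustrative, and the lower endpoint C = 0 is listed only because it belongs to the stated range (some SVM implementations require C > 0).

```python
from itertools import product

# Gamma: powers of ten from 1e-4 to 1e2 (seven values).
gammas = [10.0 ** exponent for exponent in range(-4, 3)]

# C: five linear steps over [0, 10], i.e. six values including both endpoints.
C_MIN, C_MAX, C_STEPS = 0.0, 10.0, 5
c_values = [C_MIN + i * (C_MAX - C_MIN) / C_STEPS for i in range(C_STEPS + 1)]

# Every (C, gamma) pair is one candidate configuration for the RBF-kernel SVM.
grid = list(product(c_values, gammas))

print(c_values)   # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(len(grid))  # 42
```

Each pair in `grid` would be evaluated by training and validating one SVM, which is why the search time grows with the product of the two ranges.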
In the literature there are various studies that combine SVMs and decision trees. Although previously published hybrid models of SVM and decision trees (SVM-DT) are generally used for multi-class classification and multi-clustering problems, there are also examples of SVM-DT combinations used for binary classification problems [29]. In all of these SVM-DT models, SVM is applied first in order to optimize the parameters and the datasets to be used next in the decision tree. In our study we also applied SVM in the first step; however, instead of ranking the attributes and selecting the top-listed ones according to their SVM weights, which risks loss of information, we used the entire set of SVM weights as the weight feature in ID3. These weights for the ID3 attributes are calculated according to the formula given below.
The ID3 tree was implemented in RapidMiner with the weighting strategy explained above. A second grid search was run to find the optimum value for the weighted information gain ratio. The search range for this value was [10^-3, 10], covered in 50 logarithmic steps, which resulted in 51 combinations and completed in 11 hours. The overall workflow, comprising the data preprocessing (which also includes the GWAS and the integration of phenotype and genotyping data) and the hybrid SVM-ID3 model described here, is summarized in Figure 1.
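The relation between 50 logarithmic steps and 51 candidate values can be sketched as below. The sketch assumes the lower bound of the range is 10^-3 and that the steps are equally spaced in log10; the helper `log_grid` is a hypothetical name, not a RapidMiner function.

```python
import math


def log_grid(low, high, steps):
    """Return steps + 1 values from low to high, equally spaced on a log10 scale."""
    log_low, log_high = math.log10(low), math.log10(high)
    return [
        10 ** (log_low + i * (log_high - log_low) / steps)
        for i in range(steps + 1)
    ]


thresholds = log_grid(1e-3, 10.0, 50)
print(len(thresholds))  # 51
```

Fifty steps produce fifty intervals and therefore fifty-one grid points, which matches the number of combinations reported for the second search.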

Results
In the first phase, only the SVM model was run to present the classification performance of the stand-alone method on three different datasets. The first two datasets contained either only genotyping data or only phenotype data, and the third contained both genotyping and phenotype data. The results of the stand-alone SVM model are given in Table 2.
The results in Table 2 clearly show that combining phenotypic information with genotype data slightly increased the decision performance in terms of accuracy, precision, recall and AUC. The hybrid SVM-ID3 model was then applied to the same three datasets, and the performance comparison is presented in Table 3.
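For readers less familiar with the reported measures, the following sketch shows how accuracy, precision, recall (sensitivity), specificity and AUC are commonly computed. The counts and scores are hypothetical and do not reproduce Table 2 or Table 3; the AUC function uses the pairwise (Mann-Whitney) definition, which is an assumption about the metric and not a description of how RapidMiner computes it.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard measures derived from a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }


def pairwise_auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control (ties count half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical counts and scores for illustration only.
metrics = confusion_metrics(tp=90, fp=20, tn=80, fn=10)
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3])
print(metrics["recall"], metrics["specificity"])
```

Sensitivity depends only on how the cases are classified, so a model can have high sensitivity and still have low specificity, which is why both are reported alongside AUC.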
According to the structure of the SVM-ID3 hybrid model, given in Tree S1, the most important attribute is ethnicity. Our model made a strict distinction on the ethnicity attribute, which leads to different decision paths for African American, Latino and Japanese subjects. For all ethnicities, body mass index (BMI) is the second descriptive feature on the decision path. For the African American population, the descriptive phenotypes at different levels of the tree are the attributes that indicate smoking and alcohol consumption habits. Surprisingly, the only phenotypic attribute found for the Japanese population is BMI. Attributes indicating family history, physical activity, lycopene intake and smoking behavior are observed for the Latino population. The overall tree structure of the hybrid model is presented in Figure 2. Some of the prominent decision paths extracted from the tree are mainly based on ethnicity. For example, if the subject's ethnicity is African American and the BMI is in the first category (BMI < 22.5), the hybrid system can decide whether the subject is a case or a control by looking at rs11729739: if the allelic profile for this SNP is TT, the subject is called a case, but if the subject is heterozygous, carrying CT, the subject is called a control. For the Japanese population, BMI was also at the first level of the decision path. Subjects in the fourth branch of BMI (BMI ≥ 30) are directly classified as cases. For subjects in the first branch of BMI, the decision is based on the SNP rs2442602: subjects homozygous for the major allele (AA genotype) are called cases, whereas decisions for subjects carrying other alleles require investigation of additional SNPs.
The tree structure shows that the decision path for the Latino population is more complex than those for the Japanese and African American populations. Subjects in the first category of BMI who are heterozygous for SNP rs17799219, carrying AG, are called healthy. For subjects in the third category of BMI (BMI < 29.9), a second phenotypic attribute, family history, must be examined. If these subjects have first-degree relatives with prostate cancer, SNP rs6475584 is then examined to call whether the subject is a case or not. Many rules like those given above can be extracted from the tree structure in Tree S1. Overall, our hybrid model identified 28 SNPs for the African American, 22 SNPs for the Japanese and 65 SNPs for the Latino populations. We investigated the SNPs mapping to genes with the SNPnexus database [30] and the non-coding SNPs with RegulomeDB [31], to see whether they had previously been associated with prostate cancer or any other condition.
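The example decision paths quoted above can be written as explicit rules. The sketch below encodes only those paths whose outcome is stated in the text; everything else returns "undetermined" because the remaining branches are in Tree S1 and are not reproduced here. The BMI category is passed in as an integer (1 to 4) because not all category boundaries are stated, and the function and table names are hypothetical.

```python
# (ethnicity, BMI category, SNP) -> genotype -> call, taken from the paths quoted in the text.
SNP_RULES = {
    ("African American", 1, "rs11729739"): {"TT": "case", "CT": "control"},
    ("Japanese", 1, "rs2442602"): {"AA": "case"},
    ("Latino", 1, "rs17799219"): {"AG": "control"},
}


def normalize(genotype):
    """Treat allele order symmetrically, e.g. 'TC' and 'CT' are the same genotype."""
    return "".join(sorted(genotype.upper()))


def classify(ethnicity, bmi_category, genotypes):
    """Return 'case', 'control' or 'undetermined' for the partial rule set above."""
    if ethnicity == "Japanese" and bmi_category == 4:
        return "case"  # BMI >= 30 is classified directly
    for (rule_ethnicity, rule_bmi, snp), calls in SNP_RULES.items():
        if rule_ethnicity == ethnicity and rule_bmi == bmi_category:
            genotype = genotypes.get(snp)
            if genotype is None:
                return "undetermined"
            return calls.get(normalize(genotype), "undetermined")
    return "undetermined"


print(classify("African American", 1, {"rs11729739": "TC"}))  # control
```

Writing the paths this way makes the interpretability argument concrete: each call can be traced to one ethnicity, one BMI category and one genotype.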
When the SNPs found by the hybrid model were searched in SNPnexus, 107 unique rsIDs matched 62 unique Entrez Gene IDs, and 42 of them had previously been found to be associated with a condition listed in the Genetic Association of Complex Diseases and Disorders (GAD) database. A representative set of genes, phenotypes and disease classes is given in Table 4, and the whole list can be found in Table S1.
The non-coding SNPs in our final disease model were investigated with RegulomeDB, which showed that SNPs found by our hybrid model have regulatory effects. Table 5 shows the SNPs with a RegulomeDB score lower than 4. The whole list is given in Table S2.

Discussion
Here we have presented a diagnostic disease model for prostate cancer that uses data mining methods and is based on phenotype and genotyping data. Overall, our results showed that the hybrid model developed by integrating the SVM and ID3 methods is capable of using both genotype and phenotype information as input, and performed best among the configurations tested for predicting cases vs. controls.
SVM was selected as the first step in our hybrid model because it is known for its high performance in GWAS [26] and its ability to classify non-separable problems. The decision logic behind ANNs, which can also be used for GWAS, is not very clear because of their black-box structure. ANNs also have many parameters to adjust, such as the number of layers, the number of nodes per layer, the number of epochs and the learning rate, and, most importantly, they have the disadvantage of getting stuck in local minima. SVMs, on the other hand, have clear decision logic [20] and fewer parameters, and because of the quadratic structure of the optimization problem they offer a single solution, located at the global minimum. As the second step in our hybrid model, the ID3 decision tree was selected for its strong performance in classifying discrete-valued datasets, as in GWAS. ID3 is easy to construct, performs well on noisy data with missing values, and is easy to interpret thanks to its visual features [24]. ID3 is also advantageous over C4.5 and CART trees because those methods construct trees with pruning, which would hide some decision paths for the disease, and ID3 is more suitable for categorical data.
To the best of our knowledge, there is no similar hybrid or stand-alone data mining method established as a gold standard for Overall, our hybrid model was capable of efficiently using the high-volume, high-dimensional integrated genotyping and phenotype data as input. Many published studies focus on the analysis of genotyping data, but no example of combining phenotypes with genotyping profiles has been presented yet. Filling this gap, genotyping and phenotype data are integrated here for the first time to build a diagnostic disease model for prostate cancer. As presented in Table 3, integrating the phenotype and genotype data increased the decision performance in terms of sensitivity and AUC. The sensitivity of the proposed hybrid model is 68.69% on the dataset with only genotypes and 83.78% with only phenotypes, and it increases to 90.92% when genotyping is integrated with phenotype data. The AUC increases in parallel: it is 0.674 for only genotyping data and 0.857 for only phenotype data, but 0.910 when both are used.
In addition to its better classification performance, our results showed that the proposed SVM-ID3 hybrid model was also able to identify functional and regulatory SNPs related to prostate cancer. The selected SNPs and their gene-disease relations were checked using databases such as SNPnexus and RegulomeDB, which integrate third-party information from different databases and studies in a SNP-centric format. This means that the SNPs selected by the proposed hybrid method to build the diagnostic disease model are also candidates for further biological investigation of the molecular etiology of prostate cancer.
The proposed hybrid method identified 107 unique SNPs for the diagnostic model out of the 2710 highly associated SNPs selected after GWAS. When these 107 SNPs were searched in SNPnexus and RegulomeDB, some were found to be related to specific genes and others to affect regulation and binding. For example, rs2853668 is known to be associated with CRR9, TERT, which plays an important role in the regulation of telomerase activity. rs11790106 affects the regulation of the ATP2B2 gene, which is important for energy production and calcium transport in cells. rs12644498 affects regulation of the ARL9 gene and rs6887293 affects the regulation of AGBL4, which are also important for the ATP/GTP cycle in cells. These genes are closely related to the IGF1 gene, which plays an important role in insulin metabolism. Many of the genes to which the 107 SNPs in the disease model map are related to growth and energy processes. These molecular functions are in fact related to BMI, which is the most important phenotypic attribute for all ethnicities found by our hybrid model.
The resulting feature set of our hybrid model was examined, and the phenotypic attribute ethnicity was found to be the attribute most related to prostate cancer. This result was not surprising, because several works in the literature have already shown a relation between ethnicity and prostate cancer. Kleinmann's work shows that the ethnic background of patients plays an important role in prostate cancer related quality of life [32]. According to Hoffman, the etiology of prostate cancer is highly dependent on ethnicity, and African Americans have the highest risk of prostate cancer [33]. In support of this, our hybrid model strictly divides the prostate dataset according to ethnicity, and different paths were observed for each ethnicity.
Although the decision paths for the ethnicities all differ, at the second level every path involves the BMI attribute. BMI is already known for its relation to different types of cancer, such as breast cancer [34] and esophageal cancer [35], and is also a strong phenotypic attribute for prostate cancer [36]. In the literature, along with BMI, age and family history, which are also among the attributes selected by our hybrid model, have been shown to be important features for the diagnosis of prostate cancer [36]. The preventive effect of high BMI values beyond 30 kg/m² has been stated previously [36], and interestingly, for the Japanese population we also observed the same preventive effect of BMI for morbidly obese cases at the lower levels of the decision path. Additionally, the other most common phenotypic attributes in the decision paths, such as family history, smoking habit, physical activity and lycopene intake, have also been associated with prostate cancer previously [37]. Overall, our results show that the proposed hybrid model included the previously established phenotypic attributes for prostate cancer.
Currently, the blood Prostate Specific Antigen (PSA) level is the gold standard for early detection of prostate cancer before biopsy, with a maximum reported sensitivity of 86%, a specificity of 33% and an AUC of 0.67 [23-42]. PSA levels under 4 ng/ml are considered normal, levels between 4 ng/ml and 10 ng/ml are considered suspicious, and levels higher than 10 ng/ml are known to be associated with high risk [38]. The problem with the PSA test is the determination of the thresholds. The range between 4 ng/ml and 10 ng/ml is a grey area for decision making: some subjects below 4 ng/ml can have prostate cancer, while some above 10 ng/ml can still be healthy [39]. In addition, the cut-off values change with the subject's age [40]. This introduces a serious problem, and as various publications state, PSA should not be used as an early diagnosis tool for prostate cancer [41] until its sensitivity and specificity are improved [42]. Considering the diagnostic performance of the proposed hybrid model, with 90.92% sensitivity and 0.91 AUC, it is a potentially good tool for the early detection of prostate cancer. After validation with pilot studies, the proposed model, which requires only a buccal swab, could stand as a good alternative to the blood PSA test.
Here, for the first time, we have proposed a predictive disease model that integrates genotyping and phenotype data through a hybrid feature selection method combining two non-parametric data mining methods, SVM and ID3. In contrast to many works in the literature, we used both methods as individual steps rather than using the second only to optimize a main method. Prostate cancer data is used as a case study, and we have demonstrated that the model combining genotype information with phenotypes performs better in disease diagnosis than using only genotype or only phenotype data, while also exceeding the performance of the prostate specific antigen (PSA) screening test [23].
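The PSA bands quoted above amount to a fixed-threshold rule, which the following sketch makes explicit. The function name `psa_band` is hypothetical, and the handling of values exactly equal to 4 or 10 ng/ml is an assumption, since the text does not say which band the boundaries belong to.

```python
def psa_band(psa_ng_per_ml):
    """Map a PSA level (ng/ml) to the three bands described in the text."""
    if psa_ng_per_ml < 4.0:
        return "normal"
    if psa_ng_per_ml <= 10.0:
        # Assumption: both boundary values fall in the grey zone.
        return "suspicious"
    return "high risk"


for level in (2.5, 4.0, 10.0, 12.3):
    print(level, psa_band(level))
```

Because the rule depends on a single measured value, it cannot account for the age dependence and the overlap between healthy and affected subjects discussed in the text.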

Conclusions
In this study, genotyping and phenotype data were integrated for the first time and a hybrid SVM-ID3 model for prostate cancer was built. An important contribution of this work is the integration of genotyping with phenotype data. The effect of this integration was tested in both the stand-alone SVM and the SVM-ID3 hybrid model. In terms of performance measures such as sensitivity and AUC, the integrated dataset outperformed the genotype-only and phenotype-only datasets in both models. The sensitivity and AUC of the integrated dataset for the stand-alone SVM were 71.34% and 0.829, respectively. When the same integrated dataset was used in the hybrid model, sensitivity increased to 90.92% and AUC to 0.91, also outperforming the blood PSA test. The model was able to identify prostate cancer associated SNPs that map to cancer-specific genes such as CRR9, TERT, ATP2B2, ARL9 and AGBL4 and/or have regulatory effects. Experimental and clinical validation of the described associations can lead to a better understanding of the progression of the disease at the molecular level. Additionally, the descriptive phenotypes selected by the hybrid model had been identified in previous studies for their relations with prostate cancer. Ethnicity was at the root of the decision tree structure, whereas BMI, family history and smoking were the other phenotypes at the top levels of the decision model. Overall, our study showed that the predictive disease model built with the hybrid SVM-ID3 approach on genotyping and phenotype data provides a promising tool for early detection of prostate cancer. After validation of the proposed model with pilot studies, it can be implemented as a clinical decision support module to evaluate a patient's risk of developing prostate cancer, and the lifestyle-related phenotypes (BMI, exercise, smoking, etc.) that have a high impact on a patient's risk can be identified for each individual and monitored in upcoming visits.
Further studies on the proposed hybrid SVM-ID3 method and other data mining approaches for the integrative analysis of GWAS results and phenotypic information would aid the development of other successful disease models, which would accelerate the translation of variant-disease association findings into the clinical setting for the development of new decision support tools and personalized medicine approaches.

Tree S1 Text representation of tree structure. The tree structure of the SVM-ID3 hybrid model. (DOCX)