Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

Developing an efficient method for identifying DNA-binding proteins is highly desirable because of their vital roles in gene regulation, and it would advance our understanding of protein functions. In this study, we propose a new method for the prediction of DNA-binding proteins that ranks features using random forest and then performs wrapper-based feature selection using a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 in five-fold cross validation with ten runs on the training benchmark dataset PDB594. Blind tests on the independent dataset PDB186 were then performed with the proposed model trained on the entire PDB594 dataset and with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests of DBPPred on a large dataset consisting entirely of non-DNA binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between the DNA-binding and the non DNA-binding proteins.
All of the experimental results indicate that the proposed DBPPred can be a prospective alternative predictor for large-scale determination of DNA-binding proteins.


Introduction
DNA-binding proteins play key roles in a wide variety of molecular functions, including recognizing specific nucleotide sequences, maintenance of cellular DNA, transcriptional and translational regulation, DNA replication, and DNA damage repair [1][2][3]. Currently, both computational and experimental techniques have been developed to identify protein-DNA interactions. Experimental techniques such as filter binding assays [4], ChIP-chip [5], genetic analysis [6] and X-ray crystallography [7] can provide a detailed picture of the binding; however, they are both time-consuming and expensive [3]. Thus, it is highly desirable to develop automated computational methods for identifying DNA-binding proteins among the rapidly increasing number of newly discovered proteins [8].
So far, a number of predictors of DNA-binding proteins have been proposed. These methods can be divided into two categories: structure based modeling [9][10][11][12][13][14][15][16][17][18] and sequence based prediction [8][19][20][21][22][23][24][25][26][27][28][29][30]. Since protein structure can directly reveal functional mechanisms, the availability of structural information about a given protein is believed to contribute towards predicting its function and to provide higher performance than sequence based methods. However, the pitfall of structure-based methods for predicting DNA-binding function is that they are limited to the relatively small number of proteins for which high-resolution three-dimensional structures are available. In contrast, sequence based methods have the main advantage that they need no known structures and thus can be applied to large-scale datasets and genomic targets. For instance, Szilágyi and Skolnick [24] used logistic regression to predict DNA-binding proteins from the amino acid composition. Kumar et al. [23] utilized support vector machine and coded the features from evolutionary profiles for the prediction of DNA-binding proteins. Another group, Kumar et al. [22], proposed the DNA-Prot method for the classification of DNA-binding proteins using random forest. Gao and Skolnick [19] proposed a threading-based method that requires only the target protein sequence to identify DNA-binding domains based on a template library composed of DNA-protein complex structures. Lin et al. [8] developed a DNA-binding protein predictor using random forest by integrating the features into the general form of pseudo amino acid composition with a grey model. The latest work by Zou et al. [20] provided a comprehensive feature analysis using support vector machine for the prediction of DNA-binding proteins.
In summary, sequence based prediction methods for DNA-binding proteins have been investigated with several classifiers such as logistic regression [24], random forest [8,22], support vector machine [20,21,23][25][26][27][28][29][30], and a threading based method [19], using various features including (pseudo) amino acid composition [8,20][22][23][24][25][26][27][28][29][30], physicochemical properties [20][21][22][28], predicted secondary structure [20,22,28], predicted solvent accessibility [28], evolutionary profile [20,23], and their various transformations.
The aim of this work is to propose a new predictor for the determination of DNA-binding proteins based on features composed of sequence, predicted solvent accessibility, predicted secondary structure, and evolutionary profiles. The size of the feature set was reduced by ranking the features using random forest and then by a wrapper based feature selection using a best-first forward search strategy based on Gaussian naïve Bayes. The differences between this work and the previous studies are reflected mainly in four aspects: (1) we designed new features concerning the hybrid forms of the amino acid composition, predicted solvent accessibility and predicted secondary structure, the auto-correlation coefficients of the position specific scoring matrix (PSSM), and the percentile values of PSSM scores; (2) we applied random forest to rank the feature importance and subsequently performed wrapper based feature selection based on a best-first forward search strategy; (3) we compared the prediction performance of several classifiers including Gaussian naïve Bayes, logistic regression, decision tree, random forest, k-nearest neighbor and support vector machines under the proposed framework and found that Gaussian naïve Bayes outperformed the other considered machine learning methods; (4) we conducted a more complete and fair comparison of the proposed model, tested on several different independent datasets, with the existing sequence-based methods that, to the best of our knowledge, have a web server or standalone software version, namely iDNA-Prot [8], DNA-Prot [22], DNAbinder [23], DNABIND [24] and DBD-Threader [19]. The results show that the proposed method, called DBPPred, is an improved and alternative method for identifying DNA-binding proteins.

Datasets
DNA-binding protein sequences comprising the training dataset and the independent dataset were extracted from the Protein Data Bank (PDB) [31] by searching the mmCIF keyword 'DNA binding protein' through the advanced search interface. After removing the chains with length of less than 60 or containing the character 'X', the entire set was clustered with NCBI's BLASTCLUST [32] at 25% sequence identity. A dataset, called DBP390, was created by selecting one chain from each cluster and is composed of 390 protein chains with local pairwise sequence identity of at most 25%. DBP390 was then divided into two datasets: the training dataset, in which the sequences were deposited in PDB before Jan, 2011, and the remaining independent dataset. As a result, the training dataset, named DBP297, is composed of 297 protein chains, and the independent set, called DBP93, comprises 93 protein chains deposited in PDB after Jan, 2011. This division based on the deposition date is intended to avoid, as much as possible, sequence overlap and similarity with the training sets used in the existing methods including iDNA-Prot [8], DNA-Prot [22], DNAbinder [23], DNABIND [24] and DBD-Threader [19], since these methods were published in or before 2011. Thus the blind test can be performed on the independent dataset for a relatively fair comparison with the existing methods. Similarly, 390 non DNA-binding proteins were randomly selected from a set that was deposited in PDB between Jan, 2011 and Dec, 2012 and was clustered with BLASTCLUST [32] at 25% sequence identity. This set was further divided into two datasets based on the deposition dates of the sequences: NDBP297, composed of 297 chains for training, and NDBP93, consisting of 93 chains for the independent blind test, where the deposition dates of the sequences in NDBP93 are newer than those in NDBP297.
Accordingly, the benchmark dataset, called PDB594, consists of 594 chains by combining DBP297 and NDBP297, and the independent set, named as PDB186, comprises 186 chains by merging DBP93 and NDBP93. The PDB IDs of PDB594 and PDB186 together with the information concerning primary sequence and deposition date in PDB are listed in Dataset S1 and Dataset S2, respectively.
The small number of DNA-binding proteins as compared to the enormous number of proteins deposited in PDB reflects that DNA-binding proteins are only a fraction of all proteins. We collected a large independent set composed entirely of non-DNA binding proteins in order to investigate the false positive rates of the proposed work and the relevant existing methods. This set includes sequences that were deposited in PDB between Jan. 2011 and Nov. 2013 and that contain no DNA binding proteins and no 'X' characters. Next, BLASTCLUST with the local identity threshold at 25% was applied to the union of this set, PDB594 and PDB186. The independent non-DNA binding protein set was constructed by selecting one chain with length > 60 from each cluster that contains no sequences from PDB594 and PDB186. Consequently, this dataset, called NDBP4025, includes 4025 non-DNA binding proteins that have local identity of at most 25% with each other and also with the protein chains from PDB594 and PDB186.
Moreover, a similar problem, the prediction of RNA-binding proteins, has been the focus of several recent studies [33][34][35][36]. We examined the ability of the proposed method and several other existing predictors to distinguish RNA-binding from DNA-binding proteins. Two datasets that include only RNA-binding proteins, RB-C174 and RB-IC257 used in [34], were used for this purpose. One sequence in RB-IC257 was removed since it contains 'X' characters. These two datasets are renamed RB174 and RB256, which include 174 and 256 RNA-binding proteins, respectively, and their union is denoted by RB430. The RB430 dataset includes 430 sequences that have local identity of at most 25% with each other, as described in [34]. As with NDBP4025, the sequences in RB430 are regarded as non-DNA binding proteins and are used to compute the false positive rates of the considered methods.

Features
One of the steps in designing a predictor is to convert the input protein sequence into a set of numerical features that are fed into the classifier to predict DNA-binding proteins. The features in this study are coded from the primary sequence, predicted secondary structure (PredSS), predicted relative solvent accessibility (PredRSA), and the position specific scoring matrix (PSSM) generated by PSI-BLAST [32]. They are divided into four categories: secondary structure based, average RSA based, amino acid (AA) composition based, and PSSM score based (see Table 1). The raw features concerning PredRSA and PredSS are derived with the SPINE-X program [37], which has been shown to provide high-quality predictions of secondary structures and RSA values.
The motivation for using PredSS comes from several studies that have shown its benefit for protein function predictions, including protein folding rate [38] and kinetic type [39], binding residues [40] and catalytic sites [41]. SPINE-X predicts three types of secondary structures, i.e. helix (H), strand (E), and coil (C). The SS based features are coded by the secondary structure content, giving three features in total.
The relative solvent accessibility (RSA) is defined as the solvent accessible surface area (ASA) of a given residue normalized by the ASA of this residue in an extended tripeptide, Ala-X-Ala, conformation [42]. The RSA values are often used to distinguish between the interior and the surface of proteins by setting a cutoff. For a given cutoff h, residues with RSA ≥ h are considered to be solvent exposed; otherwise, they are assumed to be buried. We followed our previous work [39] on the determination of protein folding kinetic types and computed average RSA (AveRSA) values over the residues with a certain AA type, with a given predicted secondary structure conformation, and with a certain AA type and predicted secondary structure conformation.
The AA composition based features include the composition of the 20 AA types in the input sequence, the composition of the residues of certain AA type in a given predicted secondary structure conformation, the composition of the residues of certain AA type which are either buried or exposed based on different RSA cutoffs, and the composition of the 400 dipeptide types (see Table 2).
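As an illustration of the composition features described above, the following minimal sketch computes the 20 AA compositions, the 400 dipeptide compositions, and the 60 compositions of residues with a given AA type and predicted secondary structure type. The function names and the choice of normalising by the sequence length (or by L - 1 for dipeptides) are assumptions made for this example and are not taken from the original implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types


def aa_composition(seq):
    """Fraction of each of the 20 amino acid types in the sequence."""
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered dipeptide types (assumed: divide by L - 1)."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = len(seq) - 1
    return {pair: c / total for pair, c in counts.items()}


def aa_composition_by_ss(seq, ss):
    """Composition of residues with AA type x and predicted SS type y (H, E or C)."""
    return {
        (aa, y): sum(1 for r, s in zip(seq, ss) if r == aa and s == y) / len(seq)
        for aa in AMINO_ACIDS
        for y in "HEC"
    }
```

Here `ss` is assumed to be a string of the same length as `seq` holding one predicted secondary structure letter per residue, for example as parsed from a secondary structure predictor's output.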
The PSSM generated by PSI-BLAST has been widely used to represent the evolutionary information of a protein sequence, which has proved to be highly effective in a variety of prediction areas in protein structure bioinformatics, including the prediction of DNA-binding proteins [20,23] and sites [43], function sites [41,44], contact map [45,46], disordered region [47], domain boundary [48,49], and solvent accessibility [37], to name just a few. The PSSM is an L × 20 matrix, where L is the length of the protein sequence and 20 is the number of amino acid types. The score values are first normalized by using the following standard logistic function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
where x is a raw PSSM score. Next, we computed the average score of the residues with respect to the column of a certain AA type, the average score of the residues of a certain AA type with respect to the column of some AA type, the percentile value of the PSSM scores along the column of a certain AA type according to percent thresholds, and the auto-correlation coefficient (AutoCC) of scores along the column of a certain AA type according to various lag values. The percent thresholds for the percentile statistics are set to be {0, 25, 50, 75, 100}. For a threshold t, the percentile statistic is the top (100 − t)% value of scores in one column. Thus, threshold value 0 corresponds to the minimum score in one column of a certain AA type, and threshold value 100 corresponds to the maximum score in the column.
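The normalisation and percentile statistics can be sketched with NumPy as follows. This is only an illustration: the function names are hypothetical, and the use of `numpy.percentile` with its default linear interpolation between scores is an assumption, since the text does not say how intermediate percentiles are interpolated.

```python
import numpy as np


def normalize_pssm(pssm):
    """Map raw PSSM scores into (0, 1) with the standard logistic function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(pssm, dtype=float)))


def pssm_percentiles(norm_pssm, thresholds=(0, 25, 50, 75, 100)):
    """Per-column percentile values, shape (len(thresholds), n_columns).

    Threshold 0 gives the column minimum and threshold 100 the column maximum.
    """
    return np.percentile(norm_pssm, thresholds, axis=0)


def pssm_column_means(norm_pssm):
    """Average normalised score over all residues, one value per column."""
    return np.asarray(norm_pssm, dtype=float).mean(axis=0)
```

For a real L × 20 PSSM, `pssm_percentiles` would return a 5 × 20 array, i.e. 100 percentile features per protein.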
The auto-correlation coefficient with a certain lag is calculated for each PSSM column, where n is equal to L − lag, L is the length of the protein sequence, and S_{i,j} is the PSSM score corresponding to the element in the i-th row and the j-th column of the matrix. The usage of AutoCC features was motivated by the widespread application of auto covariance to various fields of bioinformatics [50][51][52]. Here, AutoCC is a variant of auto covariance in which the former is standardized between −1 and 1 while the latter is not.

Table 1 (partial). Feature description | Feature name | Number of features
Average RSA of the residues with secondary structure type y | AveRSA_SS y | 3
Average RSA of the residues with AA type x and secondary structure type y | AveRSA_Res x _SS y | 60
Amino acid composition based
Composition of the residues with AA type x | AAC_Res x | 20
Composition of the residues with AA type x and secondary structure type y | AAC_Res x _SS y | 60
Composition of the residues with AA type x and RSA value ≥ h (i.e., the residue is assumed exposed)
Composition of the residues with AA type x and RSA value < h (i.e., the residue is assumed buried)
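One plausible reading of an auto-correlation coefficient that is standardized between −1 and 1 is the Pearson correlation between the n = L − lag score pairs (S_{i,j}, S_{i+lag,j}) of a column. The sketch below implements that reading; it is an assumption made for illustration and may differ in detail from the formula used for the AutoCC features.

```python
import numpy as np


def auto_cc(column, lag):
    """Pearson correlation between a score column and itself shifted by `lag`.

    Assumed form of the coefficient; the value always lies in [-1, 1].
    """
    s = np.asarray(column, dtype=float)
    n = len(s) - lag  # number of (S_i, S_{i+lag}) pairs, n = L - lag
    if n < 2:
        raise ValueError("sequence too short for this lag")
    a, b = s[:n], s[lag:]
    a_c, b_c = a - a.mean(), b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    # a constant column has no defined correlation; return 0 by convention
    return 0.0 if denom == 0 else float((a_c * b_c).sum() / denom)
```

Applying `auto_cc` to each of the 20 columns for several lag values yields one feature per (column, lag) pair.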

Random Forest and Gaussian Naïve Bayes
Random forest (RF) has been widely used for pattern recognition in bioinformatics [53]. It can provide not only the high prediction performance [8,22] but also information on variable importance [53][54][55] for classification task. The algorithm of random forest is based on the ensemble of a large number of decision trees [56], where each tree gives a classification and the forest chooses the final classification having the most votes (over all the trees in the forest). In the most commonly used type of random forests, split selection is performed based on the so-called decrease of Gini impurity. In this study, the random forest is used to rank the features using Gini importance that is implemented with the machine learning platform scikit-learn [57].
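A minimal sketch of ranking features by Gini importance with scikit-learn is given below. The number of trees and the random seed are assumptions for this example; the text does not state the forest settings that were used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_features_by_gini(X, y, n_top=300, n_trees=500, seed=0):
    """Return the indices of the n_top features, sorted by decreasing Gini importance."""
    forest = RandomForestClassifier(
        n_estimators=n_trees, criterion="gini", random_state=seed
    )
    forest.fit(X, y)
    # feature_importances_ holds the mean decrease of Gini impurity per feature
    order = np.argsort(forest.feature_importances_)[::-1]
    return order[:n_top]
```

With a feature matrix of shape (number of proteins, number of features), the returned indices can be used to restrict the matrix to the top-ranked columns before any further selection.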
Naïve Bayes (NB) is a set of supervised learning algorithms that apply Bayes' theorem with the "naive" assumption of independence between every pair of features [58]. An NB classifier calculates the probability that a given instance (example) belongs to a certain class. Given an instance X, described by its feature vector (x_1, …, x_n), and a class target y, Bayes' theorem allows us to express the conditional probability P(y|X) as a product of simpler probabilities using the naïve independence assumption:
$$P(y \mid X) = \frac{P(y)\,P(X \mid y)}{P(X)} = \frac{P(y)\prod_{i=1}^{n} P(x_i \mid y)}{P(X)}$$
Since P(X) is constant for a given instance, the following rule is used to classify the sample:
$$\hat{y} = \arg\max_{y}\; P(y)\prod_{i=1}^{n} P(x_i \mid y)$$
Maximum a posteriori (MAP) estimation is commonly used to estimate the parameters in the naïve Bayes model, including P(y) and P(x_i|y); the former is the frequency of samples with class y in the training set. Moreover, Gaussian naïve Bayes (GNB) implements the classification by assuming the likelihood of the features to be Gaussian:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}}\exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right)$$
where the parameters σ_y and μ_y are estimated by maximum likelihood. Due to its simplicity and its speed compared to more sophisticated methods, GNB has also been widely applied to prediction problems in bioinformatics [59][60][61]. Here, GNB was used to train the prediction model of DNA-binding proteins and to perform the wrapper-based feature selection. In addition, our computational experiments in this work showed that GNB exhibited better performance than other classifiers, including logistic regression (LogR), decision tree (DT), k-nearest neighbor (KNN), and support vector machine (SVM). All of the machine learning methods were implemented in scikit-learn [57].
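To make the Gaussian naïve Bayes decision rule concrete, the following is a small from-scratch sketch in NumPy. It is not the scikit-learn implementation used in this work; the class name and the small variance constant are assumptions introduced for the example.

```python
import numpy as np


class TinyGaussianNB:
    """Minimal Gaussian naive Bayes: class priors plus per-class feature means and variances."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # P(y): frequency of each class in the training set
        self.prior_ = np.array([(y == c).mean() for c in self.classes_])
        # maximum likelihood estimates of the Gaussian parameters
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # small constant avoids division by zero for constant features (an assumption)
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.mu_[None, :, :]  # (samples, classes, features)
        # sum_i log N(x_i | mu_y, var_y), shape (samples, classes)
        log_lik = -0.5 * (
            np.log(2 * np.pi * self.var_)[None] + diff ** 2 / self.var_[None]
        ).sum(axis=2)
        # arg max over classes of log P(y) + log-likelihood
        return self.classes_[np.argmax(np.log(self.prior_)[None] + log_lik, axis=1)]
```

Working with log probabilities, as done here, avoids numerical underflow when the product runs over many features.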

Performance Evaluation
Prediction performance is assessed using four quality indices: sensitivity (the ratio between the number of correct predictions for DNA-binding proteins and the total number of the actual DNA-binding proteins), specificity (the ratio between the number of correct predictions for non DNA-binding proteins and the total number of the actual non DNA-binding proteins), accuracy, and Matthews correlation coefficient (MCC).
The performance is tested using n-fold cross validation (nCV) with multiple runs (to improve validity of the results) on the PDB594 dataset. In the nCV, chains are randomly divided into n subsets with the same numbers of sequences, and the test is repeated n times, each time using one subset to test the prediction model and the remaining n − 1 subsets to establish the model. Execution of one nCV is called a run, and the n subsets for the run are named a seed. In the wrapper-based feature selection, we performed five-fold cross validation (5 CV) and executed ten runs using ten different randomly created seeds. The sensitivity, specificity, accuracy and MCC are computed for each run and then averaged over the ten runs. The jackknife test (JKT), also called the leave-one-out test, is an nCV in which n is the total number of sequences in the dataset. We also performed the jackknife test but executed only one run, since each run would give the same result.

Table 2. Comparison of the prediction performance of the Gaussian naïve Bayes (GNB)-based wrapper, logistic regression (LogR)-based wrapper, decision tree (DT)-based wrapper, k-nearest neighbor (KNN)-based wrapper, and two support vector machine (SVM)-based wrappers with the RBF and polynomial kernels (denoted as SVM-RBF and SVM-Poly respectively).
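The four quality indices and the random division into folds can be sketched in plain Python as follows. The standard confusion-matrix definitions are used; the function names and the round-robin way of forming nearly equal folds are assumptions for illustration.

```python
import math
import random


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and MCC for labels in {0, 1} (1 = DNA-binding).

    Assumes both classes occur in y_true.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def cv_folds(n_samples, n_folds=5, seed=0):
    """Randomly split sample indices into n_folds nearly equal subsets (one 'seed')."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]
```

Calling `cv_folds` with ten different `seed` values and averaging the metrics over the resulting runs mirrors the "5 CV with 10 runs" protocol described above.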

Feature Selection
The designed feature set is composed of 1486 descriptors. We performed feature selection since some of these features could be irrelevant to the prediction/characterization of DNA-binding proteins. Two stages were utilized in the wrapper based feature selection: (1) feature ranking performed using random forest; (2) feature selection by forward best-first search combined with the GNB classifier. In the first stage, the top 300 features according to the Gini importance of random forest are selected. In the second stage, feature selection is restricted to this subset of 300 important features. The feature sets that lead to a higher average MCC are selected by performing the forward best-first search scheme. The computation of the MCC involves out-of-sample tests on the training set PDB594. More specifically, we execute ten random seeds of five-fold cross validation (5 CV) and use the average MCC to rank features. We start with the one feature that gives the largest MCC and then add the second feature (among the remaining 299 features) that results in the best average MCC. This is repeated incrementally until adding a further feature brings no obvious improvement in average MCC. Figure 1 shows the flowchart of the proposed method.
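The incremental search described above can be sketched as a greedy forward selection over the ranked candidates. In this sketch `score` stands for any callable that returns the average MCC of a feature subset (for example from ten runs of 5 CV with GNB), and `patience` is a hypothetical stopping parameter that tolerates a few non-improving additions; neither name comes from the original implementation, and the exact stopping criterion used there is not specified.

```python
def forward_selection(candidates, score, patience=5):
    """Greedy forward search: repeatedly add the candidate that maximises score(subset).

    Returns the best subset seen and its score. The search stops after
    `patience` consecutive additions that do not improve the best score.
    """
    selected, remaining = [], list(candidates)
    best_subset, best_score, stale = [], float("-inf"), 0
    while remaining and stale < patience:
        # evaluate every one-feature extension of the current subset
        trials = [(score(selected + [f]), f) for f in remaining]
        step_score, step_feature = max(trials, key=lambda t: t[0])
        selected.append(step_feature)
        remaining.remove(step_feature)
        if step_score > best_score:
            best_subset, best_score, stale = list(selected), step_score, 0
        else:
            stale += 1
    return best_subset, best_score
```

Tolerating a few non-improving steps lets the search move past small dips in MCC before a later peak, at the cost of extra evaluations; with 300 candidates each step requires up to 300 cross-validated scores.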
In addition, other machine learning methods including LogR, DT, KNN and SVM are also applied to the above feature selection for comparison. However, KNN requires the number of neighbors to be set, and the SVM classifiers require parametrization of the complexity constant C and the kernel function. The number of neighbors for KNN was limited to the set {5, 7, 9, 11, 13}. For each step of the above feature selection in which one feature was added to the previously selected feature set, KNN was run over all allowable numbers of neighbors and the one with the highest prediction performance was kept. For SVM, we consider two kernel types: the radial basis function (RBF) kernel K(x_i, x_j) = exp(−γ||x_i − x_j||^2), where γ is the width of the RBF function, and the polynomial kernel K(x_i, x_j) = (x_i · x_j)^d, where d is the degree. When d = 1, the polynomial kernel K(x_i, x_j) = (x_i · x_j) is the linear kernel. The SVM classifiers with these two types of kernels are denoted as SVM-RBF and SVM-Poly, respectively. We performed a grid search to optimize the parameters of the SVM classifiers.
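The two kernels and a grid search over SVM parameters can be sketched as follows. The grid values and the function names are assumptions chosen only for illustration; the text does not list the parameter ranges that were searched.

```python
import numpy as np
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-gamma * diff.dot(diff)))


def poly_kernel(x_i, x_j, degree):
    """K(x_i, x_j) = (x_i . x_j)^degree; degree = 1 gives the linear kernel."""
    return float(np.dot(x_i, x_j) ** degree)


def tune_svm_rbf(X, y, c_grid=(0.1, 1, 10), gamma_grid=(0.01, 0.1, 1)):
    """Grid search over C and gamma (hypothetical grids), scored by MCC with 5-fold CV."""
    search = GridSearchCV(
        SVC(kernel="rbf"),
        {"C": list(c_grid), "gamma": list(gamma_grid)},
        scoring=make_scorer(matthews_corrcoef),
        cv=5,
    )
    search.fit(X, y)
    return search.best_params_
```

Note that scikit-learn's own polynomial kernel has the form (gamma * x_i · x_j + coef0)^degree, so reproducing the plain form above with `SVC(kernel="poly")` would require setting `gamma=1.0` and `coef0=0.0`.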

Performance of the Proposed Method
The proposed method, called DBPPred, was implemented by ranking features using the random forest algorithm and selecting features using a forward best-first search strategy based on Gaussian naïve Bayes. A total of 300 features according to the feature rank were input to the subsequent feature selection, and each step in the forward best-first search, which adds one remaining feature, was performed based on 5 CV with 10 runs. Therefore, the results including sensitivity, specificity, accuracy and MCC were averaged over the ten runs, and their standard deviations were also reported. Figure 2 shows the improvement of MCC values with the increasing number of selected features in the forward best-first search that was executed using 5 CV with 10 runs. The results from jackknife tests using the ranked features derived by the feature selection based on 5 CV with 10 runs are also shown in the figure. It can be observed that the average MCC value based on 5 CV with 10 runs reaches its maximum when the number of features is 56. The MCC value derived by the jackknife test is also the highest at this point.
Thus, the final feature set determined by the proposed method is composed of 56 features. The corresponding average sensitivity, specificity, accuracy and MCC values are 0.815, 0.767, 0.791 and 0.583, respectively, for 5 CV with 10 runs, and are 0.828, 0.781, 0.805 and 0.610, respectively, for the jackknife test. Before the overall MCC peak achieved with 56 features, the feature selection procedure in general improved MCC with the increasing number of selected features; however, the MCC value decreases slightly when certain features are added, such as the 16th feature and the 21st feature. We emphasize that the combination of all selected features contributes to the final improvement in MCC value.

Comparison with Several Machine Learning Methods
Apart from GNB, several classifiers including DT, LogR, KNN, SVM-Poly and SVM-RBF were also applied to the feature selection procedure of the proposed method for a comparison. Table 2 lists the prediction performance of considered methods according to their MCC peaks achieved that are similar to the case of GNB in Figure 2. The results for 5 CV with 10 runs and Jackknife test are both reported. As shown in Table 2

Comparison of Independent Tests with Existing Methods
The independent dataset PDB186 was used to validate the quality of predictions for sequences that share low identity (<25%) with the training set. We performed a blind test on PDB186 using the GNB model that was trained on the entire PDB594 dataset. We also compared the predictions of the proposed DBPPred on PDB186 with those of several relevant existing methods for sequence based prediction of DNA-binding proteins that, to the best of our knowledge, have a web server or standalone version. These methods include iDNA-Prot [8], DNA-Prot [22], DNAbinder [23], DNABIND [24], and DBD-Threader [19]. Table 3 shows the performance comparison of the proposed DBPPred with the five existing methods based on the PDB186 dataset. As shown in the table, the proposed DBPPred has the highest sensitivity of 0.796, accuracy of 0.769, and MCC of 0.538, and the second highest specificity of 0.742. The independent predictions of DBPPred are improved by 9.2% in accuracy and 0.183 in MCC when compared with the best of the remaining methods, i.e. DNABIND. The next method is iDNA-Prot, whose performance is very close to that of DNABIND. DNA-Prot and DNAbinder perform similarly to each other and have lower prediction quality than DNABIND and iDNA-Prot. DBD-Threader has the lowest accuracy of all considered methods. More specifically, DBD-Threader achieved the lowest sensitivity of 0.237 and the highest specificity of 0.957, which implies that this method strongly tends to predict a query protein as non DNA-binding regardless of whether it is actually DNA-binding. As a result, DBD-Threader yields the lowest accuracy of 0.597. The reason may be that DBD-Threader is a threading based method that requires a template library of DNA-binding proteins [19], and the size of the template library may not be large enough.
Moreover, two methods, DNAbinder and DNABIND, provide real-value outputs, which can be used to plot the Receiver Operating Characteristic (ROC) [63] curve. We performed ROC analysis to further compare the prediction performance of the proposed method DBPPred, DNAbinder and DNABIND. The ROC curve shows the relation between the true positive rate (sensitivity) and the false positive rate (1 − specificity) for each threshold of the real-value outputs. Figure 3 shows the ROC curves of the proposed DBPPred based on GNB, DNAbinder based on SVM, and DNABIND based on LogR. The areas under the ROC curves (AUCs), which quantify the overall performance independently of the threshold values, equal 0.791 for DBPPred, 0.607 for DNAbinder, and 0.694 for DNABIND. This indicates that the proposed DBPPred outperforms DNAbinder and DNABIND. The prediction results of all methods in Table 3 as well as the real-value outputs of the proposed DBPPred, DNAbinder and DNABIND are listed in Information S1. Table 4 lists the false positive rates of the proposed DBPPred, iDNA-Prot, DNA-Prot, DNAbinder and DNABIND on several non-DNA binding protein datasets: NDBP4025, RB174, RB256 and RB430. We did not include the result of DBD-Threader in the table, since its prediction output tends to be negative (i.e. non-DNA binding protein) and the server is not suited to a large number of sequences. As shown in the table, on the NDBP4025 dataset DBPPred yields the smallest false positive rate of 0.254 (i.e. the specificity is 0.746) when compared with iDNA-Prot, DNA-Prot, DNAbinder and DNABIND, which achieve false positive rates of 0.310, 0.354, 0.325, and 0.299, respectively. The results of all methods in Table 4 based on the dataset NDBP4025 are close to the specificity values derived from the independent tests on PDB186. In summary, DBPPred provides improved predictions of DNA-binding proteins with a balance of sensitivity and specificity.
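For readers who want to reproduce this kind of ROC analysis from real-value outputs, a minimal plain-Python sketch is shown below. It uses the rank-based definition of AUC (the probability that a positive example scores higher than a negative one); the function names are hypothetical and this is not the code used to produce Figure 3.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(y_true, scores):
    """(false positive rate, true positive rate) at every distinct score threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # a protein is predicted positive when its score is at least the threshold
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

The pairwise AUC computation is quadratic in the number of examples, which is adequate for a dataset of the size of PDB186 but would be slow for very large sets.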
The prediction results of all methods in Table 4 performed on NDBP4025 dataset are listed in Information S2.
In the case of RNA-binding proteins, as shown in Table 4, all methods show limited ability to distinguish between DNA-binding and RNA-binding proteins. Across the three datasets RB174, RB256 and RB430, the smallest false positive rate, achieved by iDNA-Prot on the RB174 dataset, is 0.483, which is well above the largest false positive rate of 0.354 achieved by DNA-Prot on the NDBP4025 dataset. However, the false positive rate of the proposed DBPPred (0.530) is comparable with that of iDNA-Prot (0.528) and is smaller than those of DNA-Prot (0.707), DNAbinder (0.660) and DNABIND (0.733) on the RB430 dataset. Specifically, DBPPred has a larger false positive rate on RB174 and a smaller false positive rate on RB256 when compared with iDNA-Prot, resulting in comparable results for DBPPred and iDNA-Prot on the union of RB174 and RB256, i.e. RB430. The prediction results of all methods in Table 4 on the two datasets RB174 and RB256 are listed in Information S3.
We conclude that the proposed DBPPred provides favorable results, which should allow for building a well-performing DNA-binding protein predictor. Additionally, standalone software of the proposed model that predicts DNA-binding proteins is provided as Software S1. Furthermore, we investigated the statistical significance of the differences in the values of the selected features between the DNA-binding and non DNA-binding proteins on the PDB594 dataset. Table 5 gives the P values of two-sided t tests. At the 0.05 significance level, 43 out of 56 (43/56 = 76.8%) features have P values less than 0.05, and thus their feature values differ statistically significantly between DNA-binding and non DNA-binding proteins. As expected, the results confirm that the majority of the features selected by the proposed method have statistically significant differences between the DNA-binding and non DNA-binding proteins.
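A per-feature two-sided t test of this kind can be sketched with SciPy as follows. The function name is hypothetical, and the use of the default equal-variance (Student's) form of `scipy.stats.ttest_ind` is an assumption, since the text does not state which variant of the t test was applied.

```python
import numpy as np
from scipy.stats import ttest_ind


def significant_features(X, y, alpha=0.05):
    """Two-sided t test per feature between class 1 and class 0.

    Returns the P values and the indices of features with P < alpha.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # ttest_ind is two-sided by default; equal variances are assumed here
    _, p_values = ttest_ind(X[y == 1], X[y == 0], axis=0)
    return p_values, np.flatnonzero(p_values < alpha)
```

The fraction of significant features is then `len(indices) / X.shape[1]`, which corresponds to the 43/56 ratio reported above when applied to the selected features.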

Analysis of Selected Features
It can be observed that the secondary structure based features are not selected in the final model. However, we stress that the high quality of the proposed method is attributed to the combination of the selected features. A possible reason is that the secondary structures were predicted from evolutionary information in the SPINE-X program. Once a number of features associated with PSSM scores had been selected, the predicted SS based features contributed no further improvement to the prediction of DNA-binding proteins and were therefore not selected.

Conclusion
In this work, we proposed a new method, called DBPPred, for the prediction of DNA-binding proteins, which ranks features using random forest and performs wrapper-based feature selection using a forward best-first search strategy with Gaussian naïve Bayes as the underlying classifier. The features comprise information from the primary sequence, the predicted secondary structure, the predicted relative solvent accessibility, and the position specific scoring matrix. The proposed method using GNB as the underlying classifier was compared with five other classifiers under the same cross validation procedures, including decision tree, logistic regression, k-nearest neighbor, SVM with polynomial kernel, and SVM with RBF kernel. The proposed DBPPred performs the best according to the five-fold cross validation with ten runs on the PDB594 dataset. Moreover, independent tests of the proposed DBPPred, which was trained on the entire dataset PDB594, and of five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader) were performed on the PDB186 dataset, and DBPPred yielded the highest prediction quality. All of the experimental results, including additional tests on the purely non-DNA binding protein dataset NDBP4025 and the RNA-binding protein dataset RB430, indicate that the proposed DBPPred may be a prospective alternative predictor for large-scale determination of DNA-binding proteins.

Supporting Information
Dataset S1 The PDB594 dataset. This dataset was used for training. The file includes the PDB IDs, the deposition dates, the target values denoting whether the proteins are DNA-binding or not, and the corresponding primary sequences. (TXT)
Dataset S2 The PDB186 dataset. This dataset was used for the independent blind test. The file includes the PDB IDs, the deposition dates, the target values denoting whether the proteins are DNA-binding or not, and the corresponding primary sequences. (TXT)
Information S1 The prediction results of the proposed DBPPred, iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader on the PDB186 dataset. The file lists the actual target value and the predicted values of the existing methods for each sequence in the PDB186 dataset. The real-value outputs of three methods, DBPPred, DNAbinder and DNABIND, are also provided.
Software S1 The program of the proposed DBPPred model that predicts DNA-binding proteins. This ZIP file contains Python scripts together with instructions on how to run them. (ZIP)