Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

doi:10.1371/journal.pone.0086703

Table 1.

Summary of the considered features, where x, x′ = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V} denotes the 20 AA types, y = {C, H, E} denotes the three secondary structure states, h = {0.1, 0.2, 0.3, 0.4, 0.5} denotes the cutoff used to categorize the buried/exposed residues based on their relative solvent accessibility, t = {0, 25, 50, 75, 100} denotes the ratio for computing the percentile values, and m = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} denotes the lag for calculating the auto-correlation coefficients.
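As an illustration of the two parameterized feature families named in this caption, the sketch below computes percentile values for t and lagged auto-correlation coefficients for m over a per-residue numeric profile. The exact formulas used by the authors are not given here, so the auto-correlation is written in one common mean-centered form and the percentile uses linear interpolation; both choices, the function names, and the example profile are assumptions made for illustration.

```python
from statistics import mean

def autocorrelation(values, lag):
    """Mean-centered auto-correlation coefficient at the given lag.

    Assumed form: sum((v[i] - mu) * (v[i + lag] - mu)) / (n - lag).
    """
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mu = mean(values)
    total = sum((values[i] - mu) * (values[i + lag] - mu) for i in range(n - lag))
    return total / (n - lag)

def percentile(values, t):
    """t-th percentile (0 <= t <= 100) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * t / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical per-residue profile (e.g. relative solvent accessibility).
profile = [0.1, 0.4, 0.3, 0.8, 0.5, 0.2, 0.6, 0.7, 0.3, 0.9, 0.4, 0.5]

# One feature per lag m = 1..10 and per ratio t = 0, 25, 50, 75, 100.
lag_features = {m: autocorrelation(profile, m) for m in range(1, 11)}
percentile_features = {t: percentile(profile, t) for t in (0, 25, 50, 75, 100)}
```

With t = 0 and t = 100 the percentile reduces to the minimum and maximum of the profile, which is why those two ratios appear alongside the quartiles.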

Table 2.

Comparison of the prediction performance of the Gaussian naïve Bayes (GNB)-based wrapper, logistic regression (LogR)-based wrapper, decision tree (DT)-based wrapper, k-nearest neighbor (KNN)-based wrapper, and two support vector machine (SVM)-based wrappers with the RBF and polynomial kernels (denoted as SVM-RBF and SVM-Poly, respectively).

Figure 1.

The flowchart of the proposed method.

Figure 2.

Improvement in MCC values (y-axis) as the number of selected features (x-axis) increases during wrapper-based feature selection.

A forward, best-first search was executed using both 10 runs of 5-fold cross-validation (CV) and jackknife tests on the PDB594 dataset. The standard deviations of the MCC values over the 10 runs of 5-fold CV are shown as error bars.
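To make the quantity on the y-axis concrete, the sketch below computes the Matthews correlation coefficient (MCC) from confusion-matrix counts and runs a simplified forward selection loop. This is a plain greedy search, not the best-first search used in the paper, and the `score` callback is a hypothetical stand-in for the cross-validated MCC of a wrapper classifier; the feature names and weights are made up.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def forward_select(candidates, score, min_gain=0.0):
    """Greedy forward selection: add the best feature until no gain remains.

    `score` maps a list of feature names to a number (higher is better).
    Returns the selected features and the score after each addition.
    """
    selected, history = [], []
    best = float("-inf")
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # no candidate improves the current score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
        history.append(top_score)
    return selected, history

# Hypothetical scorer: in practice this would train and cross-validate
# a classifier on the chosen features and return its MCC.
weights = {"a": 0.3, "b": 0.1, "c": -0.2}
toy_score = lambda feats: sum(weights[f] for f in feats)

selected, history = forward_select(weights, toy_score)
```

The `history` list corresponds to the kind of curve the figure plots: one score per added feature, ending when no remaining feature improves it.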

Table 3.

Comparison of DBPPred with the existing methods based on independent blind tests on the same dataset PDB186.

Figure 3.

ROC curves for the predictions of DNA-binding proteins on the PDB186 dataset.

We compare the predictions of DBPPred with those of DNABIND and DNAbinder, which provide real-valued outputs.

Table 4.

List of false positive rates of the proposed DBPPred and the existing iDNA-Prot, DNA-Prot, DNAbinder and DNABIND on datasets NDBP4025, RB174, RB256 and RB430.

Table 5.

The mean values of the 56 selected features and the P values that quantify the significance of the differences between DNA-binding and non-DNA-binding proteins in the PDB594 dataset.
