DNA-binding proteins are fundamentally important in cellular processes. Several computational-based methods have been developed to improve the prediction of DNA-binding proteins in previous years. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using the random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect information about the conservation of physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues and non-binding propensities of non-binding residues. The comparisons with each feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during the model construction. The results showed that the DNABP model could achieve 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. High prediction accuracy and performance comparisons with previous research suggested that DNABP could be a useful approach to identify DNA-binding proteins from sequence information. The DNABP web server system is freely available at http://www.cbi.seu.edu.cn/DNABP/.
Citation: Ma X, Guo J, Sun X (2016) DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues. PLoS ONE 11(12): e0167345. https://doi.org/10.1371/journal.pone.0167345
Editor: Bin Liu, Harbin Institute of Technology Shenzhen Graduate School, CHINA
Received: June 28, 2016; Accepted: November 12, 2016; Published: December 1, 2016
Copyright: © 2016 Ma et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This work was supported by the National Natural Science Foundation of China (No. 61305072) http://www.nsfc.gov.cn/ and had the data collection and analysis, decision to publish. This work was also supported by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 14KJB520020) and had study design and also by the Qing Lan Project and had preparation of the manuscript. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
DNA-protein interactions play significant roles in various biological processes, such as gene regulation, DNA replication and repair, transcription and other biological activities associated with DNA [1–3]. Identification of DNA-binding proteins is fundamentally important to understand how proteins interact with DNA. DNA-binding proteins can be identified by many experimental techniques such as chromatin immunoprecipitation on microarrays, X-ray crystallography and nuclear magnetic resonance (NMR). However, the experimental techniques to recognize DNA-binding proteins are labor-intensive and time-consuming. Considering the weakness of determination of DNA-binding proteins using wet experiments, computational methods to identify putative DNA-binding proteins have become increasingly important in recent years. In recent years, rapid advances in genomic and proteomic techniques have generated numerous DNA-binding protein sequences. In 2014, the number of DNA-binding proteins in the UniProt database was more than 10 times greater than that in 2000. These large amounts of data provide the foundation for research on the identification of DNA-binding proteins using computational approaches.
Currently, there are two major tasks for the computational prediction of DNA-binding proteins. One is to identify DNA-binding proteins using structure information and the other is to predict them using sequence information. Obtaining structural information is difficult; therefore, it is necessary to develop prediction methods for DNA-binding proteins based on amino acid sequences.
During the past few decades, a series of studies on the identification of DNA-binding proteins using sequence information have been published [4–14]. Machine learning algorithms were employed to construct models to predict DNA-binding proteins and produced effective performances [4–9,11–19]. Interestingly, the support vector machine (SVM) algorithm has been used frequently to predict DNA-binding proteins [4–6,8,12–16]. Cai and Lin first applied the SVM algorithm for DNA-binding protein prediction using a protein’s amino acid composition and a limited range of correlations of hydrophobicity and solvent-accessible surface areas as input features . More recently, Zou et al. developed an entirely sequence-based protocol that transforms and integrates informative features from different scales used by SVM to predict DNA-binding proteins . Zhang et al. proposed newDNA-Prot, a DNA-binding protein predictor that employs an SVM classifier and a comprehensive feature that categorized features into six groups: primary sequence-based, evolutionary profile-based, predicted secondary structure-based, predicted relative solvent accessibility-based, physicochemical property-based and biological function-based features . DNA-Pro based on SVM algorithm to distinguish DNA-binding proteins from non-binding proteins . They incorporated features of overall amino acid composition, pseudo amino acid composition (PseAAC) proposed by Chou and physicochemical distance transformation. Liu et al. proposed a predictor called iDNAPro-PseAAC  which used PseAAC feature Combined with SVM algorithm. The most recent prediction method for DNA-binding proteins was aslo proposed Liu et al.which called iDNA-KACC. The iDNA-KACC was developed by combing SVM classifier as well as by incorporating the auto-cross covariance transformation. The protein sequences are first converted into profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. Random forest (RF) alorgithm, which is a useful machine learning classifier, was aslo used to prdict DNA-binding proteins. Lou et al. applied the RF algorithm to predict DNA-binding proteins using predicted secondary structure, predicted relative solvent accessibility and position-specific scoring matrix as the primary sequence features.
In this study, a systematic attempt was made to develop models to predict DNA-binding proteins with high accuracy using only sequence information. DNA-binding proteins have DNA-binding residues and non-binding proteins should not have DNA-binding residues. Therefore, the presence of DNA-binding residues could be used to predict DNA-binding proteins. We established an effective model, DNABR , to predict DNA-binding residues. The information of DNA-binding residues and non-binding residues predicted by DNABR was constructed as a feature vector to classify DNA-binding proteins and non-binding proteins. In addition, we proposed a novel feature, PSSM-PP, based on a position-specific scoring matrix (PSSM). The PSSM-PP feature not only represents the evolutionary information obtained by PSSM, but also contains information about physicochemical properties. Thus the novel method DNABP uses a random forest (RF) algorithm  in conjunction with a hybrid feature. The hybrid feature comprises 64 features selected from the PSSM-PP, DNA-binding propensity measures obtained from the information of DNA-binding residues, non-binding propensity measures obtained from the information of non-binding residues and physicochemical property features using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS). Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors as pointed out in [22–24] and emphasized in [25,26], we have established a web-server presented in this paper.
Materials and Methods
All DNA-binding protein sequences and non-binding protein sequences were collected from the UniProt database (http://www.uniprot.org/)  and only manually annotated and reviewed proteins were selected for this study.
To obtain the DNA-binding proteins as the positive dataset, “DNA binding” was used as keyword to search the UniProt database. More than 30000 DNA-binding proteins were obtained. As in previous works [4,6,9,12,28], we removed proteins with lengths less than 50 amino acids because they might be fragments and proteins of more than 6000 amino acids because they might be protein complexes. Protein sequences including irregular amino acid characters such as “x” and “z” were also removed. To avoid any effects on our experimental data from the similarity of the dataset, we removed any redundant data using the BLAST package  available from NCBI, with a threshold of 40%. Finally, our positive dataset had 7131 DNA-binding protein sequences.
To obtain the non-binding proteins as the negative dataset, we first selected all of the proteins from the UniProt database that did not have an implied RNA/DNA-binding functionality using a similar procedure to that proposed by Cai and Lin . In total, 528,086 non-binding proteins were processed according to the similarity criteria as the negative dataset. Consequently, we selected 67029 non-binding protein sequences as the negative dataset. An equal number of positive data and negative data is important to develop the prediction system for DNA-binding proteins. However, the number of DNA-binding proteins in the positive dataset was much less than the number of non-binding proteins in the negative dataset. The imbalance between the positive and negative data would affect the prediction performance; therefore, we randomly selected 7131 non-binding proteins from the negative dataset to balance with the positive dataset. The main dataset (Mainset) then comprised the 7131 DNA-binding proteins in the positive dataset and the selected 7131 non-binding proteins in the negative dataset (See Additional file S1 Table).
We further divided the 14262 proteins in the main dataset into two datasets: 1) the training dataset (Trainset), which comprised 6928 DNA-binding proteins and 6928 non-binding proteins (total 13856); 2) an independent test dataset (Testset), which consisted of 203 DNA-binding proteins and 203 non-binding proteins (total 406). The independent test dataset was used to evaluate the performance of our method against previous works [7,9,11]. Therefore, the proteins in Testset did not include any proteins that were used in previous works [7,9,11].
Position-specific scoring matrix combined with physicochemical properties (PSSM-PP).
The PSSM, which represents evolutionary information of amino acid sequences, has been used widely in research on the prediction of DNA-binding residues [30–35] and DNA-binding proteins [6,14,28] based on sequence information. Compared with other features, PSSM contributes most to improving the prediction performance of DNA-binding residues and DNA-binding proteins.
The PSSM scores used in this work were generated by PSI-BLAST . PSI-BLAST searches for each amino acid sequences were carried out against the non-redundant dataset of proteins in NCBI with an E value of 0.001. The 20 values of PSSM, obtained for each sequence position, were then scaled to the range of 0–1 using the following formula: (1) where x is the element value of the PSSM profile.
The PSSM feature for different proteins has a different vector dimension. Taking a query protein with N amino acids as an example, the vector dimension of the PSSM feature is 20*N. Considering the fact that the machine learning model construction requires a fixed vector dimension, the variable vector dimension of PSSM feature should be converted into a fixed dimension.
Furthermore, to improve the PSSM feature, we considered a physicochemical property feature combined with the PSSM feature. In our previous work, we combined the PSSM with the physicochemical property feature to predict DNA-binding residues  and achieved excellent prediction performance. Therefore, the novel PSSM-PP feature considered six physicochemical properties for each amino acid: the pKa values of the amino group, the pKa values of the carboxyl group , the electron-ion interaction potential (EIIP) , the number of lone electron pairs, the Wiener index  and the molecular mass . Those six physicochemical properties are relevant to DNA-protein interactions and contributed most to improving the prediction performance of DNA-binding residues in proteins compared with other physicochemical properties in the AAindex database  when combined with the PSSM feature . Those six physicochemical properties were normalized to the range of 0–1 using the following formula (2): (2) where NPa(i) represents the normalized quantitative property values that range from 0 to 1, i indicates the i-th amino acid and a is the index of the physicochemical property. Then NPa(i) is the value of the physicochemical property a of the i-th amino acid.
The PSSM-PP feature was constructed by combining PSSM with six physicochemical properties and took into account the fact that different proteins should have the same vector dimension. The PSSM-PP feature was constructed using the following procedure. 1) Similar to several previous studies [6,14,28,41], all rows in the PSSM were selected that belong to the same amino acid and form a new matrix. Then, 20 new matrices were obtained with the size Ak*20, where Ak is the number of amino acids of type k. 2) All values in each column were added into each new matrix. Each new matrix was converted to a vector. Therefore, we produced a 20-dimensional vector for each new matrix; a 20×20 = 400 dimension vector was obtained by the PSSM feature. 3) PSSM-PP was generated by merging the 20 amino acid columns of the PSSM into a single column containing the information of a certain physicochemical property. The value in row a and column k in PSSM-PP matrix, named Sak, was calculated with Eq (3): (3) where a is the index of a certain physicochemical property, k is the index of the type of amino acids in the query protein sequence, i is the index of the type of naïve amino acids, fk(i) is the scaled value of the i-th type of naïve amino acid for the k-th type of amino acid in the protein sequence of the PSSM calculated by formula (1), and NPa(i) is the normalized physicochemical property values of a for the i-th type amino acids calculated by formula (2). Sak represents the index of the type of amino acids k in the query protein sequence for a certain physicochemical property a and it not only contains the evolutionary information captured by PSSM, but also the conservation information about the amino acid k at the level of its physicochemical property a. Finally, the dimension size of the PSSM-PP feature was 6×20 (120).
Binding propensity measures (BP) and non-binding propensity measures (NBP).
DNA-binding proteins contain DNA-binding residues and the binding residues tend to gather together on the surface of the protein. Therefore, DNA-binding residues could play an important role in identifying DNA-binding proteins. Previously, we constructed a useful classifier named DNABR  (http://www.cbi.seu.edu.cn/DNABR/) to predict DNA-binding residues based on sequence information. DNABR outperformed other prediction methods for identifying DNA-binding residues. Therefore, DNABR was used to predict DNA-binding residues to construct binding and non-binding propensity measures in this study. Considering the characters of DNA-binding residues, we constructed two binding propensities measures named BP(1) and BP(2).
The DNA-binding residues, which we used in the binding propensities, were also obtained by DNABR. Therefore the reliability of the prediction needs to be considered. The two binding propensity measures (BP(1),BP(2)) were defined as follows: (4) where N and n are the number of amino acids and the number of DNA-binding residues in this protein, respectively; RI(i), a positive integer in the range 0 to 10, is the predicting reliability index of DNA-binding residue i generated from DNABR. More reliable predictions will have higher RI(i) values.(5)
Where N, n, and n(i) are the number of amino acids, the number of DNA-binding residues and the number of two DNA-binding residues with the distance i in the query protein, respectively.
RI(k) is the predicting reliability index of DNA-binding residue k generated from DNABR.
For a query protein, BP(1) describes the information of the appearance of DNA-binding residues in the amino acid sequence and BP(2) describes the correlation of DNA-binding residues in the amino acid sequence and represents the relevance of two DNA-binding residues with different gaps from 1 to N-1 amino acids. Furthermore, when equals zero in Eq (5), the problem 0log2 0 appeared in the Eq (5). To solve the problem, Eq (5) was transformed to Eq (6) using a Taylor series.(6)
Physicochemical property feature (PHY).
The physicochemical property feature was constructed based on the formula used in research on prediction of DNA-binding proteins , prediction of RNA-binding proteins  and functional classification in proteins . Eight physicochemical properties, including hydrophobicity, polarity, polarizability, charge, surface tension, secondary structure, solvent accessibility and normalized Van der Waals volume, were used. Each physicochemical property divided the 20 types of amino acids into three groups. Then, the three descriptors, composition index (C), transition index (T) and distribution index (D), were introduced by the work of Dubchak et al. to represent each physicochemical property. The composition index was calculated by the number of a certain property divided by the length of the query protein. The transition index was obtained by dividing the number of amino acids with a certain property followed by amino acids of a different property by the length of the query protein minus one. The distribution index measures the percent of the length of a query protein within which the first 25%, 50%, 75% and 100% of the amino acid of a particular property are located, respectively. Each physicochemical property generated a feature vector with a dimension of 21, thus the physicochemical property feature has a vector with dimension 168.
Cross-validation is a reliable method to test the performance of a new prediction model. We used five-fold cross-validation to evaluate our model. In five-fold cross-validation, the dataset was randomly divided into five parts. The evaluations were conducted five times using four parts as the training dataset to construct a classifier and the remaining part as the test dataset to evaluate the performance. The performance of each model was computed as the average of the five runs.
In this work, four performance measures, namely accuracy (ACC), sensitivity (SE), specificity (SP), and Matthew correlation coefficient (MCC) , were calculated to evaluate the prediction performance.
The accuracy is defined as , which evaluates the overall percentage of DNA-binding proteins and non-binding proteins that were correctly predicted. The sensitivity is defined as , which evaluates the percentage of DNA-binding proteins that were correctly predicted as DNA-binding ones.
The specificity is defined as , which evaluates the percentage of non-binding proteins that were correctly predicted as non-binding ones.The MCC is a statistical parameter that assesses the quality of the binary classification and is defined as . where TP, TN, FP, and FN represent the number of true positive, true negative, false positive and false negative results, respectively. An MCC equal to 1 indicates that the model has a perfect prediction performance and MCC close to 0 indicates that the model has a random prediction performance.
Random forest classifier
A random forest (RF) is an ensemble of a large number of classification trees. Each tree in the ensemble is trained on a subset of training instances that are randomly selected from the given training set. At each node, the best split is chosen from a set of variables selected at random from the set of input features. The prediction results of the RF classifier are based on the ensemble of those decision trees and each tree gives a classification result. Finally, the RF classifier selects the prediction result that has the largest number of votes from the classification results. The RF R package  was used to implement the RF algorithm.
The main purpose of feature selection is to remove the least used features from the original feature to improve the prediction performance. In this work, we used the mRMR method combined with IFS to select the prominent features that identify the positive instances from negative ones. The mRMR-IFS method has been used successfully to select important features in several classification studies [47–54].
The mRMR algorithm is a sequential forward selection algorithm first proposed by Peng et al to process microarray data . Each feature selected by the mRMR algorithm has the maximal relevance with target class and the minimal redundancy with other features. A detailed description of the mRMR algorithm can be found in the literature , and the mRMR program can be obtained from the website http://penglab.janelia.org/proj/mRMR/.
After the mRMR procedure, the mRMR feature set contained all features. The more prominent features obtained by mRMR algorithm have smaller orders. The IFS step was then used to determine the optimal set of features. Each feature in the mRMR feature set was added one by one from the first to the last. Therefore, N feature subsets were obtained if the mRMR feature set had N features. For each feature subset, an RF was constructed and evaluated by five-fold cross-validation. The IFS scatter plot was drawn with the number of feature subsets as its x-axis and corresponding MCC values as the y-axis. We chose the optimal feature subset when the IFS scatter plot reached a peak.
The steps of the DNABP method
The following steps were performed and are described as follows:
- The protein sequence data were collected form the UniProt database.
- The collected protein sequence data were preprocessed and assigned class labels.
- The protein sequences were converted to feature vectors.
- The optimal feature subset was obtained using mRMR-IFS.
- The RF prediction model was constructed based on the optimal features.
- The RF prediction model was evaluated.
A detailed flowchart of our work is shown in Fig 1.
Results and Discussion
The performance of DNA-binding protein prediction
Based on the Mainset, the different DNA-binding protein prediction models were constructed by RF and various features. The prediction performance of each model was evaluated using five-fold cross-validation (see Table 1).
The classifier using RF with the PHY feature just received a 77.65% accuracy and a 0.555 MCC. When the RF classifier was combined with the PSSM-PP feature only, it obtained a 81.69% accuracy and a 0.635 MCC, which outperformed the prediction performance obtained from the PHY feature. The classifier appending either PHY or BP and NBP features achieved total accuracies of 82.67% and 83.68%. When we constructed classifier using RF with all of the combination of all features, we achieved the best performance, with a 84.64% accuracy and a 0.706 MCC. The results represented that the combination of all features captured more information to discriminate DNA-binding proteins from non-binding ones compared with a single feature. Therefore, we implemented the mRMR-IFS algorithm to select an optimal feature subset from all features, including PSSM-PP, PHY, BP and NBP.
In Table 1, it is worth noting the comparison results between prediction performances obtained by the PSSM-PP feature with that of PSSM. Although the PSSM-PP used a significantly lower size of 120 dimensions in the input vectors than the 400 for PSSM, the PSSM-PP feature improved the prediction performance. This result indicated that PSSM-PP, which provides evolutionary information of the protein at the level of physicochemical properties, could effectively distinguish DNA-binding proteins from non-binding ones. Therefore, PSSM-PP was used as a significant feature rather than PSSM in this work.
The feature selection results obtained by the mRMR-IFS method
To identify the most prominent features and improve the prediction performance, the mRMR-IFS method was used in this research. First, we used the mRMR method to rank a list of 292 features for the Mainset. A small index value for a feature in this mRMR list represents a more effective power to distinguish DNA-binding proteins from non-binding ones. Second, we used IFS to select the optimal feature subset based on the mRMR list. The 292 different predictors were constructed by increasing the number recursively from rank one to rank 292, and the performance of each predictor was evaluated on the Mainset. The IFS scatter plot was constructed by feature indices and MCC values obtained from the corresponding predictor (Fig 2). A maximum MCC value of 0.727 was obtained using the top 64 features. As seen from Table 1, it is clear that the performance of the prediction model using those 64 features is better than that of the prediction model using all 292 features. The 64 optimal features are shown in Table 2. Finally, the DNABP model for predicting DNA-binding proteins was constructed by the RF algorithm using the 64 optimal features.
The maximum MCC value was 0.727 when the top 64 features were selected.
Comparison with other research on DNA-binding proteins
There are several studies on the prediction of DNA-binding proteins using sequence information [4–14]. To the best of our knowledge three methods, namely enDNA-Port , iDNA-Prot|dis  and nDNA-Prot , were proposed recently and provide web servers to predict DNA-binding proteins. These three methods all showed better performances when compared with previous methods such as DNA-Prot , DNAbinder or iDNA-Port . The predictor enDNA-Prot (http://bioinformatics.hitsz.edu.cn/Ensemble-DNA-Prot/) identifies DNA-binding proteins using physicochemical properties as input features and employing the ensemble learning technique. Liu et al. constructed a predictor, named iDNA-Prot|dis (http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/), by incorporating the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) vector. Song et al. described the predictor nDNA-Prot (http://ndnaprot.aliapp.com/Prediction.jsp), which is an ensemble classifier named for classifying DNA-binding and non-binding proteins using the frequencies of the appearance of every kind of amino acid and physicochemical properties as input features. We used the Testset to evaluate our DNABP in comparison with the other three methods mentioned above. enDNA-Port, iDNA-Prot|dis and nDNA-Prot could predict DNA-binding proteins on the web server; therefore, the Testset was submitted to those three web servers for prediction. As shown in Table 3, the enDNA-Port achieved an MCC of 0.183 with 59.11% ACC, 54.19% SE and 64.04% SP. The iDNA-Prot|dis method achieved an MCC of 0.324 with 66.01% ACC, 73.4% SE and 58.62% SP. The nDNA-Prot predicted all of the proteins as non-binding proteins, therefore the nDNA-Prot achieved an MCC of 0 with 50% ACC, 0% SE and 100% SP. To obtain the performance of our DNABP, the process of constructing the prediction model was repeated based on the Trainset, and then predicted the DNA-binding proteins in the Testset. The ACC, SE and SP of DNABP prediction were 0.7315, 0.6847 and 0.7241, respectively, which resulted in an MCC value of 0.409. The results indicated clearly that our DNABP model achieved the best performance and demonstrates the superiority of our DNABP method, both in feature extraction and selection, compared with the other three methods.
In this research, we constructed DNABP model based on the Mainset dataset which is different from the benchmark dataset Xu et al. used to establish enDNA-Prot model . Then the question is that whether a DNABP model constructed based on the benchmark dataset would achieve better performance than the enDNA-Prot model. Therefore, a new DNABP model was trained based on benchmark dataset using 64 optimal features with RF algorithm and test on two independent datasets used in the research of Xu et al. When test on independent dataset1, DNABP model reached accuracy, sensitivity, specificity and Matthew correlation coefficient equal to 89.56%, 89.02%, 90%, and 0.789, respectively. While the enDNA-Prot model achieved accuracy, sensitivity, specificity and Matthew correlation coefficient equal to 84.62%, 73.18%, 94% and 0.7, respectively . When test on independent dataset2, the prediction performance of our DNABP model is also outperforms that of enDNA-Prot model (See Table 4). Those results show that our DNABP method superior to the enDNA-Prot method.
The feature selection results
Based on the mRMR-IFS method, we selected 64 features as the optimal feature subset from 292 original features. The 64 features outperformed all 292 features for distinguishing DNA-binding proteins from non-binding ones. The 292 features are divided into three types: PSSM-PP, BP/NBP and PHY and the number of each type of feature in the optimal feature subset is shown in In recent years, rapid advances in genomic and proteomic techniques have generated numerous DNA-binding protein sequences. In 2014, the number of DNA-binding proteins in the UniProt database was more than 10 times greater than that in 2000. These large amounts of data provide the foundation for research on the identification of DNA-binding proteins using computational approaches. Fig 3A. There are 38 PSSM-PP features, three BP/NBP features and 23 PHY features in the optimal features subset. Therefore, in the optimal subset, the number of PSSM-PP features is the highest and the number of BP/NBP features is the lowest. Considering that the number of each type of feature is different, we calculated the proportion of each type of selection feature for the corresponding type of feature. As shown in Fig 3B, we found that although the number of BP/NBP features in the optimal feature set was the lowest (3), the selection proportion of BP/NBP features was the highest (75%). This result indicated that BP/NBP features play an important role in the prediction of DNA-binding proteins. The number of PSSM-PP features is lower than the number of PHY features in the original feature set, while the number of PSSM-PP features is higher than the number of PHY features in the optimal feature subset. Thus, PSSM-PP features have the second largest selection proportion and PHY features have the smallest selection proportion. This result indicated that PSSM-PP features are more effective than PHY features in distinguishing DNA-binding proteins from non-binding ones. Taken together, these results proved that the results obtained in Table 1 are reliable. We also investigated the statistical significance of the differences for these features between DNA-binding proteins and non-binding proteins on the Mainset. The p-values of a two-sample t-test were calculated and are shown in Table 2. A small p-value indicated greater separation and large p-values indicated less separation. As seen from Table 2, 53 out of 64 (53/64 = 0.828) features have a p-value less than 0.001. This result, that 64 optimal features selected by the mRMR-IFS method have statistically significant differences between DNA-binding proteins and non-binding proteins, indicated that those features are useful for separating the DNA-binding proteins from non-binding proteins and could greatly improve the prediction performance for DNA-binding proteins.
Analysis of 64 features obtained by the mRMR-IFS method
Analysis of PSSM-PP features in 64 optimal features.
Thirty-eight PSSM-PP features were selected in the optimal features subset. Among the 38 selected PSSM-PP features, there are 13 features constructed by the pKa values of amino groups, nine features constructed by the pKa values of carboxyl groups, five features constructed by the electron-ion interaction potential (EIIP), one feature constructed by the number of lone electron pairs, three features constructed by the Wiener index and seven features constructed by the molecular mass. The contributions of each type of physicochemical property that constitutes the PSSM-PP features are shown in Fig 4. This result showed that among the six physicochemical properties, pKa values of the amino group and pKa values of the carboxyl group were selected the most and the number of lone electron pairs was selected the least. This shows that the pKa values of the amino group and pKa values of the carboxyl group play important roles in DNA-binding protein prediction and that the number of lone electron pairs contributes least to the prediction of DNA-binding proteins, which is consistent with the result obtained in Table 2.
(a) Physicochemical property distribution of the 38 PSSM-PP features that were selected in the optimal feature set. (b) The type of amino acid distribution used to construct the 38 PSSM-PP features that were selected in the optimal feature set.
Analysis of BP and NBP features in the optimal features.
The mRMR-IFS method selected two BP features and one NBP feature among the 64 optimal features, which means that only one NBP feature was not selected in the optimal feature subset. The high selection proportion suggested that BP and NBP features contribute most to distinguish DNA-binding proteins from non-binding ones. As shown in Table 2, the p-values of BP and NBP features between the binding proteins and the non-binding ones were much less than 0.001. This result also indicated that BP and NBP play a vital role in discriminating between DNA-binding proteins and non-binding proteins.
The BP/NBP features selected in the optimal feature subset were BP(1), BP(2) and NBP(2). The BP(1) feature represented the information of the appearance of DNA-binding residues in the query protein. The selection of the BP(1) feature reveals the reliability of the definition of the BP(1) feature that DNA-binding residues should appear in the DNA-binding proteins. BP(2) and NBP(2) represent the correlation of DNA-binding residues with DNA-binding residues and non-binding residues with non-binding residues in the amino acid sequence, respectively. The selection of BP(2) and NBP(2) indicated that the BP(2) and NBP(2) formulas, which represented the spatial information in DNA-binding proteins and non-binding proteins, respectively, were reliable. NBP(1) was not selected as an optimal feature, possibly because the number of non-binding residues is greater than the number of DNA-binding residues in the majority proteins, which would result in no statistically significant difference between DNA-binding proteins and non-binding proteins.
Analysis of PHY features in the optimal features.
Twenty-three PHY features are in the optimal feature subset, and their distribution is shown in Fig 5. The 23 PHY features were divided into eight types by physicochemical properties, including hydrophobicity, polarity, polarizability, charge, surface tension, secondary structure, solvent accessibility and normalized Van der Waals volume. As seen from Fig 5A, there are seven PHY features obtained from hydrophobicity property, which was the most among the eight physicochemical properties. The charge property and the solvent accessibility property both have four PHY features, which were the second most among the eight physicochemical properties. These results indicated that the three physicochemical properties were more useful for revealing the mechanisms of DNA and protein interactions than the other five physicochemical properties. A possible explanation could be: 1) DNA-binding residues in binding proteins should cluster on the surface of the proteins to bind to DNA; therefore, binding residues would tend to be hydrophobic residues, and the solvent accessibility property of DNA-binding residues should be stronger than that of non-binding residues; 2) DNA-binding residues tend to be positively charged so that they can easily interact with DNA, which is negatively charged. The polarizability property only has one PHY feature and the normalized Van der Waals volume did not have any PHY feature in the optimal feature subset. Thus the polarizability and the normalized Van der Waals volume contributed least to distinguishing DNA-binding proteins from non-binding ones.
(a) Physicochemical property distribution used to construct the 23 PHY features that were selected in the optimal feature set. (b) Distribution of the three descriptors used to construct the 23 PHY features that were selected in the optimal feature set.
The 23 PHY features were divided into three groups by the descriptors, which are composition index (C), transition index (T) and distribution index (D). As shown in Fig 5B, the C index has four PHY features, the T index has seven PHY features and the D index has 12 features among the 23 PHY features in the optimal feature subset. Each physicochemical property generated 21 PHY features and the C index generates three PHY features, the T index generates three PHY features and the D index generates 15 PHY features. Although the D index has the most features in the optimal feature subset, the selection proportion of the D index is the least (10% (12/(15*8)). The selection proportion of the T index is the most among the three descriptors (29.2% (7/(3*8)), which suggested that the T index contributed most to predicting DNA-binding proteins.
The reliablility of negative samples in the Mainset
As mentioned in “Dataset” section, the mainset was comprised by 7131 non-binding proteins randomly selected from the negative dataset and all of the the 7131 DNA-binding proteins in the positive dataset. The question arises then, whether the random selection of different dataset of 7131 non-binding proteins would change the prediction performance. Therefore other four randomly selected datasets of non-binding proteins was used to construct the DNABP model. Four dataset of 7131 non-binding proteins randomly selected from the negative dataset were respectively combined with 7131 DNA-binding proteins in the positive dataset and form four main dataset named Mainset_1, Mainset_2, Mainset_3 and Mainset_4. The predicton performances of DNABP models which built respectively from four main datasets using the RF algorithm with all of the 292 features were list in Table 5. The performance of four DNABP models which built from four different main datasets were very similar to the performance which obtained from Mainset. The result shows that the 7131 negative samples in Mainset is reliability to constructed DNABP model.
Based on the 64 optimal features selected by the mRMR-IFS method, a web server DNABP was developed to identify DNA-binding proteins from amino acid sequences. DNABP is freely available at http://www.cbi.seu.edu.cn/DNABP/. On the DNABP web page, users can submit an amino acid sequence in FASTA format. The DNABP model was established using the RF algorithm on the Mainset. The RF algorithm is implemented using the R package . After submitting the query sequence, the DNABP web server returns a quick prediction result that is sent to the user by e-mail. The DNABP server also returns the binding information of each residue, which is predicted by DNABR when the query protein is predicted as the DNA-binding protein.
To predict the DNA-binding proteins using sequence information, we proposed a new and useful method, DNABR, which combines an RF algorithm and an mRMR-IFS feature selection method. The method has novel features, including evolutionary information that combines conservation information with the physicochemical properties of amino acids (PSSM-PP), binding propensity measures (BP) and non-binding propensity measures (NBP). The results proved that these features markedly improved the predictions. The mRMR-IFS feature selection method was implemented to obtain the optimal feature subset. The RF model with the novel optimal feature subset selected from the hybrid feature set, including PSSM-PP, PHY, BP and NBP, achieved excellent performance with 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and an MCC of 0.727. A comparison between DNABP and other prediction methods indicated that our DNABP method is currently the most effective method to predict DNA-binding proteins using only sequence information. A web server named DNABP (http://www.cbi.seu.edu.cn/DNABP/) has been developed to aid the use of the DNABP model to predict DNA-binding proteins.
The authors wish to thank the editor for taking time to edit this paper. The authors would also like to thank the two anonymous reviewers for their constructive comments, which were very helpful for strengthening the presentation of this study.
- Conceptualization: XM XS.
- Data curation: XM.
- Formal analysis: XM.
- Funding acquisition: XM.
- Investigation: XM.
- Methodology: XM.
- Project administration: XM JG.
- Resources: XM.
- Software: JG.
- Supervision: XS.
- Validation: XM.
- Visualization: XM.
- Writing – original draft: XM.
- Writing – review & editing: XM.
- 1. Imamova LR, Chernov BK, Itkes AV. The role of phosphorylation of DNA-binding proteins in regulation of transcription of the human c-myc gene. Biochemistry (Mosc). 1997; 62 (10): 1152–1157.
- 2. Krajewska WM. Regulation of transcription in eukaryotes by DNA-binding proteins. Int J Biochem. 1992; 24 (12): 1885–1898. pmid:1473601
- 3. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol. 2000; 1 (1): REVIEWS001. pmid:11104519
- 4. Cai YD, Lin SL. Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta. 2003; 1648 (1–2): 127–133. pmid:12758155
- 5. Fang Y, Guo Y, Feng Y, Li M. Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features. Amino Acids. 2008; 34 (1): 103–109. pmid:17624492
- 6. Kumar M, Gromiha MM, Raghava GP. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics. 2007; 8: 463. pmid:18042272
- 7. Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, et al. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS One. 2014; 9 (9): e106691. pmid:25184541
- 8. Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS One. 2014; 9 (1): e86703. pmid:24475169
- 9. Song L, Li D, Zeng X, Wu Y, Guo L, Zou Q. nDNA-Prot: identification of DNA-binding proteins based on unbalanced classification. BMC Bioinformatics. 2014; 15: 298. pmid:25196432
- 10. Szaboova A, Kuzelka O, Zelezny F, Tolar J. Prediction of DNA-binding proteins from relational features. Proteome Sci. 2012; 10 (1): 66. pmid:23146001
- 11. Xu R, Zhou J, Liu B, Yao L, He Y, Zou Q, et al. enDNA-Prot: identification of DNA-binding proteins by applying ensemble learning. Biomed Res Int. 2014; 2014: 294279. pmid:24977146
- 12. Yu X, Cao J, Cai Y, Shi T, Li Y. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol. 2006; 240 (2): 175–184. pmid:16274699
- 13. Zhang Y, Xu J, Zheng W, Zhang C, Qiu X, Chen K, et al. newDNA-Prot: Prediction of DNA-binding proteins by employing support vector machine and a comprehensive sequence representation. Comput Biol Chem. 2014; 52: 51–59. pmid:25240115
- 14. Zou C, Gong J, Li H. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics. 2013; 14: 90. pmid:23497329
- 15. Nimrod G, Szilagyi A, Leslie C, Ben-Tal N. Identification of DNA-binding proteins using structural, electrostatic and evolutionary features. J Mol Biol. 2009; 387 (4): 1040–1053. pmid:19233205
- 16. Ma X, Wu J, Xue X. Identification of DNA-binding proteins using support vector machine with sequence information. Comput Math Methods Med. 2013; 2013: 524502. pmid:24151525
- 17. Liu B, Xu J, Fan S, Xu R, Zhou J, Wang X. PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou's PseAAC and Physicochemical Distance Transformation. Mol Inform. 2015; 34 (1): 8–17. pmid:27490858
- 18. Liu B, Wang S, Wang X. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci Rep. 2015; 5: 15479. pmid:26482832
- 19. Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans Nanobioscience. 2016; 15 (4): 328–334.
- 20. Ma X, Guo J, Liu HD, Xie JM, Sun X. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM Trans Comput Biol Bioinform. 2012; 9 (6): 1766–1775. pmid:22868682
- 21. Breiman L. Random Forests. Machine Learning. 2001; 45: 5–32.
- 22. Liu B, Fang L, Liu F, Wang X, Chen J, Chou KC. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS One. 2015; 10 (3): e0121501. pmid:25821974
- 23. Chen J, Wang X, Liu B. iMiRNA-SSF: Improving the Identification of MicroRNA Precursors by Combining Negative Sets with Different Distributions. Sci Rep. 2016; 6: 19062. pmid:26753561
- 24. Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015; 43 (W1): W65–71. pmid:25958395
- 25. Liu B, Chen J, Wang X. Application of learning to rank to protein remote homology detection. Bioinformatics. 2015; 31 (21): 3492–3498. pmid:26163693
- 26. Wang R, Xu Y, Liu B. Recombination spot identification Based on gapped k-mers. Sci Rep. 2016; 6: 23934. pmid:27030570
- 27. Consortium TU. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012; 40 (Database issue): D71–D75. pmid:22102590
- 28. Kumar KK, Pugalenthi G, Suganthan PN. DNA-Prot: identification of DNA binding proteins from protein sequence information using random forest. J Biomol Struct Dyn. 2009; 26 (6): 679–686. pmid:19385697
- 29. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25 (17): 3389–3402. pmid:9254694
- 30. Ahmad S, Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005; 6: 33. pmid:15720719
- 31. Wang L, Yang MQ, Yang JY. Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics. 2009; 10 Suppl 1: S1.
- 32. Hwang S, Gou Z, Kuznetsov IB. DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics. 2007; 23 (5): 634–636. pmid:17237068
- 33. Ho SY, Yu FC, Chang CY, Huang HL. Design of accurate predictors for DNA-binding sites in proteins using hybrid SVM-PSSM method. Biosystems. 2007; 90 (1): 234–241. pmid:17275170
- 34. Wang L, Huang C, Yang MQ, Yang JY. BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst Biol. 2010; 4 Suppl 1: S3.
- 35. Wu J, Liu H, Duan X, Ding Y, Wu H, Bai Y, et al. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics. 2009; 25 (1): 30–35. pmid:19008251
- 36. Wang J. Biochemistry Higher Education (in chinese). 2002.
- 37. Veljkovic V, Veljkovic N, Este JA, Huther A, Dietrich U. Application of the EIIP/ISM bioinformatics concept in development of new drugs. Curr Med Chem. 2007; 14 (4): 441–453. pmid:17305545
- 38. Bonchev D. The overall Wiener index—a new tool for characterization of molecular topology. J Chem Inf Comput Sci. 2001; 41 (3): 582–592. pmid:11410033
- 39. Vapnik VN. Statisical learning theory. Wiley, New York. 1998.
- 40. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000; 28 (1): 374. pmid:10592278
- 41. Kumar M, Gromiha MM, Raghava GP. SVM based prediction of RNA-binding proteins using binding residues and evolutionary information. J Mol Recognit. 2011; 24 (2): 303–313. pmid:20677174
- 42. Han LY, Cai CZ, Lo SL, Chung MC, Chen YZ. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA. 2004; 10 (3): 355–368. pmid:14970381
- 43. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003; 31 (13): 3692–3697. pmid:12824396
- 44. Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A. 1995; 92 (19): 8700–8704. pmid:7568000
- 45. Zhao H, Yang Y, Zhou Y. Structure-based prediction of DNA-binding proteins by structural alignment and a volume-fraction corrected DFIRE-based energy function. Bioinformatics. 2010; 26 (15): 1857–1863. pmid:20525822
- 46. Liaw AW M. Classification and regression by random forest. R News. 2002: 18–22.
- 47. Gao YF, Li BQ, Cai YD, Feng KY, Li ZD, Jiang Y. Prediction of active sites of enzymes by maximum relevance minimum redundancy (mRMR) feature selection. Mol Biosyst. 2013; 9 (1): 61–69. pmid:23117653
- 48. Gui T, Dong X, Li R, Li Y, Wang Z. Identification of hepatocellular carcinoma-related genes with a machine learning and network analysis. J Comput Biol. 2015; 22 (1): 63–71. pmid:25247452
- 49. Li BQ, Cai YD, Feng KY, Zhao GJ. Prediction of protein cleavage site with feature selection by random forest. PLoS One. 2012; 7 (9): e45854. pmid:23029276
- 50. Li BQ, Feng KY, Chen L, Huang T, Cai YD. Prediction of protein-protein interaction sites by random forest algorithm with mRMR and IFS. PLoS One. 2012; 7 (8): e43927. pmid:22937126
- 51. Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC. Prediction of protein domain with mRMR feature selection and analysis. PLoS One. 2012; 7 (6): e39308. pmid:22720092
- 52. Ma X, Sun X. Sequence-based predictor of ATP-binding residues using random forest and mRMR-IFS feature selection. J Theor Biol. 2014; 360: 59–66. pmid:25014477
- 53. Wang J, Zhang D, Li J. PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection. BMC Syst Biol. 2013; 7 Suppl 5: S9.
- 54. Zhang N, Zhou Y, Huang T, Zhang YC, Li BQ, Chen L, et al. Discriminating between lysine sumoylation and lysine acetylation using mRMR feature selection and analysis. PLoS One. 2014; 9 (9): e107464. pmid:25222670
- 55. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27 (8): 1226–1238. pmid:16119262
- 56. Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One. 2011; 6 (9): e24756. pmid:21935457