Predicting DNA-Binding Proteins and Binding Residues by Complex Structure Prediction and Application to Human Proteome

As more and more protein sequences are uncovered from increasingly inexpensive sequencing techniques, an urgent task is to find their functions. This work presents a highly reliable computational technique for predicting DNA-binding function at the level of protein-DNA complex structures, rather than low-resolution two-state prediction of DNA-binding as most existing techniques do. The method first predicts protein-DNA complex structure by utilizing the template-based structure prediction technique HHblits, followed by binding affinity prediction based on a knowledge-based energy function (Distance-scaled finite ideal-gas reference state for protein-DNA interactions). A leave-one-out cross validation of the method based on 179 DNA-binding and 3797 non-binding protein domains achieves a Matthews correlation coefficient (MCC) of 0.77 with high precision (94%) and high sensitivity (65%). We further found 51% sensitivity for 82 newly determined structures of DNA-binding proteins and 56% sensitivity for the human proteome. In addition, the method provides a reasonably accurate prediction of DNA-binding residues in proteins based on predicted DNA-binding complex structures. Its application to human proteome leads to more than 300 novel DNA-binding proteins; some of these predicted structures were validated by known structures of homologous proteins in APO forms. The method [SPOT-Seq (DNA)] is available as an on-line server at http://sparks-lab.org.


Introduction
The completion of thousands of proteome projects has led to an explosive increase in number of proteins with unknown functions. The comprehensive Uniprot database [1] contains 10 7 protein sequences and, yet, less than 5% of these sequences have annotated functions from Gene Ontology Annotation database [2]. This gap between the number of sequences and the number of sequences with annotations is widening rapidly as inexpensive and more efficient next generation sequencing techniques become available. Experimentally identifying function of millions of proteins is obviously impractical. Thus, it is necessary to develop effective bioinformatics tools for initial functional annotations.
One important function of proteins is DNA-binding that plays an essential role in transcription regulation, replication, packaging, repair and rearrangement. Function prediction of DNA-binding can be classified into three levels of resolution (low, medium and high). A low-resolution function prediction is a simple two-state prediction whether or not a protein binds to DNA. A mediumresolution function prediction is to predict the region in a protein that binds with DNA (DNA-binding residues or DNA-binding interface regions). A high-resolution function prediction is to predict the complex structure between DNA and a target protein of unknown function.
An alternative approach to above machine-learning techniques is to take advantage of known protein-DNA complex structures. This can be accomplished by structural comparison between a DNA-binding template and a target protein structure [5,11,29,30]. For example, we demonstrated that a size-independent, structural alignment method SPalign makes a significant improvement over several other commonly used tools for locating functionally similar structures [11]. If the structure of a target protein is unknown, homology modeling [40,46] has been employed. Gao and Skolnick further illustrated the importance of combining structure prediction (through structural alignment [47] or threading [48]) with binding prediction for detecting DNA-binding proteins. One important aspect of this approach is its ability to predict the complex structure between a target protein and template DNA. This high-resolution function prediction at atomic details allows an improved understanding of binding mechanism and an integration with prediction of DNA-binding proteins and DNA-binding residues.
This work focuses on improving the high-resolution function prediction. The DBD-Threader method developed by Gao and Skolnick [48] first employed the threading technique called PROSPECTOR [49] to predict structures based on known DNA-binding domains. Confidently predicted complex structures are then confirmed for DNA-binding by utilizing a pairwise knowledge-based, contact energy function [47]. The method has achieved the Matthews correlation coefficient (MCC) of 0.68 for the two-state prediction of DNA-binding proteins by using a database of 179 DNA-binding domains (DB179) and 3797 non-DNA-binding domains (DB3797).
In this work, we approach this function prediction problem with different methods for protein-structure prediction and binding affinity prediction. Instead of a contact-based energy function employed in DBD-Threader [48], we employed a statistical energy function based on a distance-scaled ideal-gas reference state (DFIRE) [50] extended for protein-DNA interactions [51][52][53]. This DDNA energy function was found useful in developing a highly accurate structure-based technique called SPOT-Struc (DNA) that achieved the MCC value of 0.76 for the same database of DB179 and NB3797, employed by DBD-Threader. In addition to energy functions, we examined two fold-recognition techniques to enable a sequence-based prediction as DBD-Threader. One is a method based on hidden Markov model (HHM) called HHblits [54]. The other is our in-house built technique called SPARKS-X [55]. Both methods are among the top performers in critical assessment of protein structure prediction techniques (CASP 9) [55,56]. This development of SPOT-Seq for DNA-binding proteins was inspired by the success of prediction of RNA-binding proteins [57] by integrating SPARKS for structure prediction and DFIRE for protein-RNA binding prediction [58] and its successful application to human proteome [59].
SPOT-Seq for DBPs was applied to DB179 and NB3797 and achieved a MCC value of 0.77 for DBP prediction by combining HHblits with the DDNA3 energy function (leave-one-out). The method was further tested on newly determined DBPs (positive set), RNA-binding proteins (negative set), and the human proteome as well as SCOP folds that host both DNA and non-DNA binding proteins. All results confirmed that the method is highly sensitive (.50%) and its performance is consistent in various tests. More than 300 novel DBPs were found in human

Gao-Skolnick domain datasets (DB179 and NB3797)
Gao and Skolnick complied two datasets that contain 179 DNAbinding protein domains and 3797 non-DNA binding protein domains [47]. They were obtained by collecting the proteins with a resolution of 3 Å or better, a minimum length of 40 amino acid residues per protein and at least 6 base pairs of DNA and five residues interacting with DNA. The redundant data between two sets were excluded by using 35% sequence identity cutoff. DB179 is used as a template library in this work.

Test set of RNA-binding proteins (RB174)
RB174 is a dataset made of 174 high-resolution RNA-binding proteins (whole chains), collected by us in developing SPOT-Seq (RNA) based on a 25% cutoff. We employed RB174 to examine if the proposed method can separate DNA-binding proteins from RNA-binding proteins.

Independent test dataset (DB82)
An independent test set was built by obtaining the DNA-binding proteins released after December 2009. The protein chains were divided into SCOP domains, and the redundant data was removed by using sequence identity cutoff of 30%. We further excluded the proteins that have sequence identity higher than 30% with any proteins in DB179. Finally, we generated an independent test dataset with 82 protein domains (chains if SCOP domains were not available).

Function prediction protocol
The prediction protocol proposed here is the same as SPOT-seq (RNA) developed by us [57], except that 1) the template library is made of known protein-DNA complex structures and 2) HHblits  [54], in addition to SPARKS-X [55], is used in structure prediction. Briefly, HHblits (or SPARKS-X) is employed to match a target sequence to template structures in the template library. If a significant match is found based on a matching probability (HHblits) [or Z-score (SPARKS)], the top matched template(s) are then utilized to model protein-DNA complex structure(s) by copying the query sequence to the template complex structure(s) according to the alignment result while keeping the template DNA intact. The complex-structure models are then employed to estimate the binding affinity between the target protein (mainchain only) and the template DNA by utilizing DDNA3 [53]. The target protein is classified as DNA-binding if the binding affinity is higher than a threshold. Thus, there are only two parameters to be optimized: sequence-structure matching score (or Z-score for SPARKS) and the binding energy value.

Performance evaluation
The Here, TP, TN, FP, and FN refer to true positives, true negatives, false positives and false negatives, respectively. A MCC value provides an overall assessment of the method performance with 1 for perfect agreement and 0 for random prediction. One should note that sensitivity can also be called as coverage of true positive prediction while precision is fraction of corrected predictions in all positive predictions.

HHblits
HHblits [54] is a fold-recognition technique that extracts homologous sequences of targets from template library by Hidden-Markov models (HMM). The HHM matrices of targets and templates are built by searching against the Uniprot database. The probability of a match is calculated by comparing the HMM matrix of a target to the HMM matrix of a template. We define a target sequence as a DBP if the probability of a match is higher than a threshold. The threshold is optimized by maximizing the MCC value. HHblits was downloaded from http://toolkit. tuebingen.mpg.de/HHblits. Default parameters were utilized in structure prediction.

Results
Low-resolution function prediction (binding or not binding) 1. Leave-one-out cross validation (Gao-Skolnick datasets). A leave-one-out cross validation is conducted by removing all templates with .30% sequence identity to the target. The results were obtained by taking one chain sequence from DB179 or NB3797 and predicting whether it binds or does not bind to DNA. Figure 1 and Table 1 compared the methods based on known protein structures (DBD-Hunter [48], DDNA3 [53], and DDNA3O [53]), purely sequence-based (PSI-BLAST (NCBI) [48], PSI-BLAST(uniprot) [48]), sequence and template-structurebased (PROSPECTOR [48], HHblits, SPARKS-X), and incorporation of an energy function (DBD-Threader [48], SPARKS-X+Energy, HHblits+Energy). For sequence-based fold/homologyrecognition techniques, SPARKS-X yields the highest MCC value (0.647), followed by HHblits (0.639), PROSPECTOR (0.609), and  In particular, the best performing HHblits + Energy leads to a sensitivity of 65% and precision of 94%. Such performance is even better than the best structure-based technique (DDNA3O) with a MCC value of 0.76 (0.73 without TM-Score dependent optimization). Because combining HHblits with our energy function leads to a significantly improved method than combining SPARKS and the energy function, we mainly focus on the former here and below, unless indicated otherwise.
2. Separating DNA-binding from non-DNA-binding in the same SCOP fold. One crucial test of a method for predicting DNA-binding function is to examine whether or not it can classify DBPs from non-DBPs within the same structural fold. We analyzed 18 SCOP folds shared by DNA-binding and non-DNA-binding proteins. As shown in Table 2, after incorporating the DDNA energy function for DBP prediction, the number of true positives increases from 57 to 61 and false positives decreases from 24 to 4. Thus, removal of false positives is the key factor for large improvement when an energy function is employed.  Medium-resolution function prediction (DNA-binding residues) The complex structures predicted from our method allow us to infer amino-acid residues involved in DNA-binding. We define an amino-acid residue as a DNA-binding residue if any heavy atoms of the residue are less than 4.5Å away from any heavy atoms of a DNA base as in [47]. The accuracy of binding-residue prediction is examined on 116 true positive predictions from DB179. The final values of MCC, sensitivity, and precision of the prediction averaged over 116 proteins are 0.55, 57%, 66%, respectively. A similar, average MCC value (0.54) was obtained if SPARKS-X was used to perform structure prediction.
The quality of predicted binding residues is directly related to the quality of predicted structures as expected. Figure 2 shows the MCC for binding residue prediction as a function of predicted structural accuracy according to structural similarity between predicted and actual structures by SPscore. There is a trend that the higher accuracy for predicted structures, the higher the MCC value is. The correlation coefficient is 0.38. We noticed that there are a few cases of highly accurate structures but with poorly predicted binding regions (low MCC values). In those cases, accurate structures were limited to non-binding regions.

High-resolution function prediction (complex structure prediction)
The quality of predicted DNA-binding complex structures was examined by the structural alignment SPalign [11] that compares native structures and predicted structures based on a sizeindependent structural similarity score called SPscore. Two structures are considered as in the same fold if SPscore.0.5 [11]. For 116 correctly predicted targets, the average SPscore is 0.65. The structure similarity can also be evaluated by the fraction of aligned residues with a root mean-squared distance (RMSD) between two compared structures less than 4Å . We found that the medium value is 67%.
As an example, Figure 3 compared predicted binding sites with native binding sites, and the predicted structure with the native structure for the target (bacteriophage T4 DNA-adenine methyltransferase, T4-dam, PDB# 1yfjd,). The sequence identity between the target and the template (2g1pa, dam) is 24%. Predicted (light grey) and actual (orange) DNAs overlap with each other very well when protein structures are aligned. Predicted binding sites (cyan) are also consistent with the native binding region (yellow) with an MCC value of 0.60.

Independent tests
1. Negative set -Separating RNA-binding proteins from DNA-binding proteins. As RNA-protein interactions share some similar characteristics with DNA-binding proteins (both are positively charged, for example), it is important to examine if the proposed method can separate DBPs from RBPs. We tested the HHblits+energy method with the thresholds optimized by DB179+NB3797 datasets on the RBP dataset (RB174). It predicts 5 proteins as DBPs. Two of the five (1zbib and 1hysa) are highly homologous (sequence identity .70%) to the templates (1zblb and 1r0aa, respectively). Proteins in 1zbib (Bacillus halodurans RNase H catalytic domain) and 1r0aa (HIV-1 reverse transcriptase) are related to dual RNA-and DNA-binding functions. 1zbib is a complex structure between Bacillus halodurans RNase H catalytic domain and 12mer RNA/DNA hybrid. HIV-1 reverse transcriptase in 1r0aa is a RNA-dependent DNA polymerase. Two of the three remaining proteins (2qk9a and 1ooaa) are also annotated as DNA-binding. 2qk9a is Human RNase H catalytic domain binding with both RNA and DNA [60] and 1ooaa contains Rel homology domain (RHD) and DNA binding site [61]. The only remaining protein (PDB ID 2jlua) is dengue virus 4Ns3 helicase in complex with ssrna [62]. This helicase was found to function on both RNA and DNA templates [63]. Thus, there is zero false positive in DNA-binding prediction.

Positive set -Newly determined complex structures
(DB82). We tested the performance of SPOT-Seq (DNA) by utilizing 82 newly determined protein-DNA complex structures. SPOT-Seq correctly predicted 42 (51%) as DBPs based on the same thresholds obtained from the leave-one-out (matching probability of 84% and energy threshold of 28.6). The average MCC value for predicted binding residues of these 42 proteins is 0.64. The average structural similarity between predicted and actual structures is 0.73 based on SPalign [11]. As shown in Table 3, the sensitivity of two-state RBP prediction decreases from 65% for DB179 to 51% for this smaller DB82 test set while the average MCC value of binding residue prediction increases from 0.55 to 0.64 and the average structural similarity between predicted and actual structures increases from 0.65 to 0.73 according to SPscore.

Application to Human Proteome
Our approach was utilized to detect DBPs from human proteome. The human proteome with 20270 proteins was downloaded in 2010 from Uniprot [1]. We obtained annotations of human proteins from Gene Ontology (GO) [64].  Table 4.
Our sequence-based technique (HHblits+Energy) predicted 1975 out of 20270 proteins as DBPs. Among 1975 proteins, the majority (1612, 82%) predicted DBPs were annotated as DBPs according our definition above. The recovery (or sensitivity) of our method is 56% (1612/2883) annotated DBPs. Remaining 363 predicted DBPs were not annotated as DBPs according to Gene Ontology, in which 259 proteins were annotated with other functions and 104 proteins with no annotation in Gene Ontology. The recovery rate (sensitivity) of our prediction for each keyword is shown in Table 4. They are 61% for transcription factors and 62% for DNA binding but low for other keywords.
We examined 363 newly discovered DBPs in more details. We found that 23 of these predicted new DBPs (Table 5) have homologs (.60% sequence identity) with known experimental structures in predicted structural regions. The majority of these structures (21/23) are either in a monomeric form or in complex with itself or other proteins. Interestingly, two of 23 structures contain DNA, a direct confirmation of their DNA binding capability. They are Uracil-DNA glycosylase that involves with DNA repair [65] and steroid hormone receptor ERR2 that has ligand-activated sequence-specific DNA binding RNA polymerase II transcription factor activity [66]. These two proteins were not annotated in GO as DNA binding. Predicted structures for these 23 proteins are highly similar to the structures of their corresponding homologs in the majority of cases (16/23 or 70% with SPscore $0.5, an indication of same structural fold). For those predicted structures with ,0.5 SPscore with corresponding structures of their homologs, the majority (5/7) has a matching region of ,60 residues. Such small matching region is more likely to have binding induced conformational change. Seven in 23 proteins are Cyclin proteins that are involved in regulation of cell cycles.

Discussion
In this paper, we have developed a sequence-based method that predicts DNA-binding proteins and their complex structures with DNA based on existing protein-DNA complex structures. The method achieved a MCC value of 0.77 that is higher than the best structure-based technique (DDNA3O). The method achieved . 50% sensitivity in the independent test set of newly solved protein-DNA complex structures and 56% for human proteome. The method also has a 94% precision in leave-one-out cross validation. Such high precision is confirmed by the fact that 82% predicted DBPs in human proteome were annotated as DBPs. Because template-based methods depend on appropriate templates in the database, a limited number of templates made the methods with high precision but relatively low sensitivity (coverage). An improved fold recognition method is critical for further increasing the sensitivity.
One interesting observation is that combining HHblits with DDNA energy function, rather than combining SPARKS-X with DDNA energy function, yields the best performance, despite the fact that SPARKS-X alone produces a higher MCC value (0.647) than HHblits alone (0.639). This suggests a cancellation of errors where over-prediction made by HHblits is corrected by the energy function.
This work also reveals that DBPs are easier to identify than RBPs. The sensitivity for DBP prediction is 56% in human proteome, compared to only 43% for RBP prediction [59].
Moreover, ,400 new DBPs are discovered, compared to .2000 new RBPs in human proteome. This is mainly because RNA structures are much more complex and diverse than that of DNA. Moreover, RNA-binding proteins are not as well studied as DNAbinding proteins.
Here, the analysis of DNA-binding function is mostly done with GO annotations. We found that the GO annotation is not complete for some proteins. The known DNA-binding protein such as Uracil-DNA glycosylase [65] in Table 5 was not included in GO annotations but were annotated in Uniprot. We further found that 49 (13%) out of 363 predicted novel DBPs are annotated as DBPs in the DAVID database [67]. This further reduces the number of novel predicted DBPs to 314. Some of these novel DNA-binding proteins in Table 5 are nucleases and helicases that could operate on both DNA and RNA (e.g. CCR4 and DDX19B in Table 5). Others are less obvious for their putative DNA-interacting capability and warrants further investigations.
Finally, it is worthy to mention that the template-based approach presented here for DBP prediction is reasonably fast. It takes about a month on a single processor PC (or 2 days with a 16-core server) to scan all proteins in human proteome. The method [SPOT-Seq (DNA)] is available on line as a server at http://sparks-lab.org.

Author Contributions
Conceived and designed the experiments: YZ YY. Performed the experiments: HZ JW. Analyzed the data: HZ JW. Contributed reagents/ materials/analysis tools: HZ JW YY. Wrote the paper: HZ JW YZ YY.