Current address: Max Planck Institute for Molecular Genetics, Berlin, Germany
Conceived and designed the experiments: RS MSV SV. Performed the experiments: RS MSV. Analyzed the data: RS MSV. Contributed reagents/materials/analysis tools: RS MSV SV. Wrote the paper: RS MSV SV.
The authors have declared that no competing interests exist.
Protein–DNA interactions are crucial for many cellular processes. With the increased availability of structures of protein–DNA complexes, gaining deeper insights into the nature of protein–DNA interactions has become possible. Earlier investigations have characterized the interface properties by considering pairwise interactions. However, the information communicated along the interfaces is rarely a pairwise phenomenon, and we feel that a global picture can be obtained by considering a protein–DNA complex as a network of noncovalently interacting systems. Furthermore, most of the earlier investigations have been carried out from the protein point of view (protein-centric), and the present network approach aims to combine both the protein-centric and the DNA-centric points of view. The first part of the study involves the development of a methodology to investigate protein–DNA graphs/networks and to define key parameters. A network representation provides a holistic view of the interacting surface and has been reported here for the first time. The second part of the study involves the analyses of these graphs in terms of clusters of interacting residues and the identification of highly connected residues (hubs) along the protein–DNA interface. A predominance of deoxyribose–amino acid clusters in β-sheet proteins and the distinction of the interface clusters in helix–turn–helix and zipper-type proteins would not have been identified by conventional pairwise interaction analysis. Additionally, we propose a potential classification scheme for a set of protein–DNA complexes on the basis of the protein–DNA interface clusters. This provides a general idea of how the proteins interact with the different components of DNA in different complexes. Thus, we believe that the present graph-based method provides a deeper insight into protein–DNA recognition mechanisms by shedding more light on the nature and the specificity of these interactions.
The interaction of proteins with DNA is crucial for several cellular processes. Some insights into the mode of interaction can be obtained from the analysis of the complexed structures. Conventional analyses are based on the identification of pairwise interactions. However, a collective representation of the network of interactions and the analyses of such networks provide valuable information, which is not easy to obtain from pairwise analyses. Although the protein structure networks have been described in the literature, this is the first time that a network representation of protein–DNA is described. Construction and analysis of such networks have given valuable information on protein–DNA interactions in terms of network parameters, such as clusters of interacting residues and hubs, which are highly connected residues. Furthermore, the results also represent both the protein- and the DNA-centric viewpoints, because the analysis is carried out on combined networks. The methodology developed here can lead to predictions, such as important residues responsible for stabilizing protein–DNA interactions, and will be of interest to experimentalists.
A network of interactions among the macromolecules drives the cell. The protein–DNA interactions orchestrate the high fidelity processes like DNA recombination, DNA replication, and transcription. With the increasing number of high-resolution structures of macromolecular complexes, it is now possible to obtain insights into the atomic details of interactions governing their structural and functional integrity. In the present study, we focus on protein–DNA interactions, which can either be specific or non-specific depending on the functional requirement. Insights into the mechanism of protein–DNA binding and recognition have come from extensive analysis of protein–DNA interfaces
As a conceptual turning point, it has been pointed out that most of the information on protein–DNA complexes has been obtained from a “protein-centric” view and that new insights are likely to emerge if protein–DNA complexes are investigated from a “protein–DNA-centric” viewpoint
The protein–DNA graph is of special interest, since we are dealing with two different types of biopolymers with unique structural and chemical properties. In the case of proteins, adjacent amino acids are linked by a rigid peptide bond and each amino acid can be unambiguously considered as a node in the protein graph. However, in the case of DNA, the linkage between two nucleotides is through a flexible phosphodiester bond and the nodes can be defined at various levels. For example, a nucleotide as a whole or its individual chemical components such as the phosphate group, deoxyribose and the bases (A, T, G, and C) can be represented as nodes. Such different representations of nodes have distinct advantages of their own in interpreting the nature of interaction between the protein and DNA
The present graph based analysis of protein–DNA complexes focuses on the following points. Primarily, the interface interactions of protein–DNA complexes have been investigated at a network level. This is achieved by constructing protein–DNA graphs (PDGs) on the basis of the strength of interaction between the nodes and also by performing extensive calibrations to choose the optimal strength of interactions to gain structural insights. Secondly, the clusters and hubs of such interacting amino acids and the nucleotides at the interfaces have been analyzed in a set of protein–DNA complexes.
Significant results that are inaccessible by conventional pairwise analysis of the structure or by sequence analysis have been obtained from the present work. These include the identification of spatial networks of interacting residues that are sequentially far apart, the evaluation of a scale of interaction strength along which we can compare and analyze the interaction networks of protein–DNA complexes and the identification of groups of optimally interacting residues which stabilize the structural architecture. Furthermore, we have been able to revisit the classification scheme of DNA binding proteins. Our classification schema, which is based on the concepts of graphs/network, interaction strength, and the type of interaction, is distinct from the classification schemes proposed earlier with a protein-centric point of view. We have compared our results with the other classification schemes
A protein–DNA graph (PDG) is a bipartite graph constructed to represent the interaction between the amino acids of the protein and the nucleotides of the DNA in a protein–DNA complex. A bipartite graph deals with two different node sets and edges are defined across the two node sets. A contact in the bipartite PDG is defined when a side chain of an amino acid interacts with the nucleotide. The interactions of the amino acid with the nucleotide can be considered at different levels: with the phosphate (p), deoxyribose sugar (S) or base (B) components individually, or with the nucleotide as a complete entity. The edges are defined upon quantification of the interaction between the amino acids and the nucleotides with the “Interaction Strength,”
In the bipartite graph representation of the protein–DNA complex, the amino acids and the nucleotides of the DNA form the two node sets. Shown in yellow are amino acid nodes, and blue, the nucleotides. The edges (between the nodes from one set to the other) are shown in red. These edges are defined based on a specific MEC. The MEC quantifies the minimum number of atomic contacts expected between an amino acid and a nucleotide to define an edge (Equation 1). The contacts are specifically evaluated between side chains of the amino acids and the phosphate, or the deoxyribose sugar or the base of nucleotides to form the P-p, P-S, and P-B clusters, respectively.
The nodes in a PDG are connected if the
Graph | Weak (WMEC) | Optimal (OMEC) | Strong (SMEC) |
Protein-Phosphate (P-p) | Up to 3% | 3% to 5% | Above 5% |
Protein-Deoxyribose (P-S) | Up to 4% | 4% to 8% | Above 8% |
Protein-Base (P-B) | Up to 3% | 3% to 5% | Above 5% |
The rationale for selecting these ranges is mentioned in Supporting Information.
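The ranges in the table above can be expressed as a small lookup, sketched here in Python. The function name `mec_regime` is hypothetical, and the treatment of the boundary values (the lower bound of the optimal range is taken as inclusive, so that 3% counts as OMEC for P-p and P-B graphs) is an assumption, since the table does not state on which side the boundaries fall.

```python
# MEC boundaries in percent, transcribed from the table above:
# (lower bound of OMEC, upper bound of OMEC) for each component graph.
MEC_RANGES = {
    "P-p": (3.0, 5.0),
    "P-S": (4.0, 8.0),
    "P-B": (3.0, 5.0),
}


def mec_regime(graph_type: str, mec_percent: float) -> str:
    """Return 'WMEC', 'OMEC', or 'SMEC' for a component graph and MEC value.

    Assumption: both OMEC bounds are inclusive; the table itself does not
    specify how the boundary values are assigned.
    """
    omec_low, omec_high = MEC_RANGES[graph_type]
    if mec_percent < omec_low:
        return "WMEC"
    if mec_percent <= omec_high:
        return "OMEC"
    return "SMEC"


if __name__ == "__main__":
    print(mec_regime("P-S", 7.0))  # OMEC
    print(mec_regime("P-p", 7.0))  # SMEC
```

Under this reading, an MEC of 7% falls in the optimal range for a P-S graph but in the strong range for a P-p or P-B graph.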
The propensities of amino acids to form P-p, P-S, and P-B graphs were calculated from DS1 (see
Amino acid propensities in the different component graphs are calculated in the OMEC range. The propensity for a particular amino acid “
The protein–DNA complexes have been classified into different groups based on the structural similarity of the proteins bound to the DNA. Luscombe et al have provided a comprehensive classification of the protein–DNA complexes based on the secondary structural motifs of proteins interacting with the DNA
β-Sheet group and the β-hairpin group are the two well defined groups that use β-strands as the major secondary structure to interact with the DNA. Although the presence of the β-strand is a common feature in these groups, the modes of interaction are distinctly different from one another, as will be discussed below.
(a) Different P-S clusters in the TATA binding protein (1tgh) (at MEC 7%) are shown in different colors with their amino acid composition. (b) A P-S cluster is shown in detail to reveal the tight interactions between the amino acids (blue) and the deoxyribose (orange) of the nucleotides. (c) Only the Cα of the amino acids of the P-S clusters is highlighted in the sheets named from A to J.
PDBS | Cluster number and composition | ||||||||
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
TGN | KVFP | RVI | |||||||
TGKN | KVFP-SQ | RVIL | ILTTGKN | KSF | FL | ||||
TGKNA | KVFP | RVILNGQS | VFP | ||||||
TGN | KVFP | RVL | RLTTGN | KVSF | |||||
TGNILV | KVFP | ILTTGKN | KVF | FL | FR | ||||
TGN | KVFP | RVILF | TTGKN | KF | RLI | ||||
TGN | KVFP | RVI | ILTTGKN | KF |
Note that the nucleotides involved in the clusters are not reported. Complete information including the nucleotides in these clusters, residue number, and chain is given in
Hubs have been defined as amino acid residues which are connected to four or more nucleotides or vice versa. The members of this group contain a few hubs (
An amino acid interacts with four different bases to form a P-B hub. In the TATA box binding protein (1tgh) the position of a P-B hub Phe193 (at MEC 3%) is shown. This hub (blue) interacts mainly with bases from four different nucleotides (red) T115, A116, T109, and A110.
PDBS | P-p | P-S | P-B |
Integration host factor | C-42C C-41A C-31G C-29A C20T D24A E36C E46C | B46ARG | |
Human TATA box binding protein | B7T B8A C107T C108T | A284PHE | |
GCN4 | A6A A7T B28A | C235ASN D235ASN |
The nodes in the clusters are given as chain id-residue id residue name. (MEC(P-p) = 3%, MEC(P-S) = 4%, MEC(P-B) = 3%). Only a representative from each group is given. Other members of the group are given in
The analysis of the patterns of clusters from PDG of the β-sheet group has thus revealed striking features like the presence of consistent P-S clusters across all the members of this group. Further, the presence of a Phe hub in the graph is also linked to the deformation of the DNA. These consistent patterns that emerge through our method of bipartite graph representation are not only qualitatively similar in terms of the residue composition, but also possess a similar connectivity between the nodes of the graph as evaluated with the MEC.
In the Arc protein (1bdt), the P-p (yellow), P-S (green), and P-B (blue) clusters are shown. The β-hairpin that is interacting with the DNA is highlighted in red. A cluster of charged and polar residues (Arg, Asn, and Gln) from the β-hairpin makes contact with successive bases of the DNA. These P-B clusters are flanked by the P-S and P-p clusters arising from other secondary structures (helices and loops) around the β-ribbon. A P-p hub (orange) in which the phosphate group of Ade4 interacts with Ser32, Val33 and Phe10 is also highlighted.
Discrimination of the β-sheet group and the β-hairpin group is also brought out by hub analysis. We observe that P-p hubs are dominant in the β-hairpin group, in contrast to the P-S hubs observed in the β-sheet group (
Groups such as helix–turn–helix (HTH), zipper-type (ZT), zinc-coordinating (ZC), and other α-helix (OAH) use helices as the major secondary structure to interact with the DNA. The HTH and ZT groups have been classified based on their structural motifs, while the ZC group is classified according to the presence of coordinated zinc in the complex. The OAH group comprises all the remaining proteins that employ α-helices to interact with DNA. We center our discussion mainly on the well-characterized HTH and ZT groups.
The P-p (yellow), P-S (green), and P-B (blue) clusters are shown in the bacteriophage 434 repressor protein (1rpe) which belongs to the HTH group. The P-B clusters are present in the recognition helix that interacts with the major groove of the DNA and the P-p and the P-S clusters are present in the other interacting region that interacts with the minor groove.
The Max transcription factor, which belongs to the Helix-Loop-Helix Zipper family, has a dimerization domain and a DNA binding domain (center). The DNA binding domain shows the presence of symmetric P-p (yellow) and asymmetric P-B clusters (blue) (upper and lower). The cluster compositions are given (single letter code for nodes: upper case for amino acids and lower case for nucleotides). The cluster compositions for other proteins are given in
There have been studies of asymmetric base pair recognition by the zipper type proteins and the dimers binding DNA in a specific orientation
A detailed investigation of the nucleosome core particle (1aoi) is presented below.
We have generated P-p, P-S, and P-B graphs for these two histone-DNA complexes (1aoi and 1s32). There is a clear domination of P-p and P-S clusters and a complete absence of P-B clusters. The lack of specific P-B clusters and the predominance of backbone mediated non-specific clusters agree with the fact that the histones interact non-specifically with the DNA through the electrostatic
The cluster analysis of these two structures has shown that there are four additional P-p clusters in the clamped structure out of which one is around the linker region. There are five additional P-S clusters and two are present near the linker region (
The superposition of the two structures is done using Align. (Only 1aoi is represented in the above figure for clarity.) The clusters that remain unperturbed in both the structures are given in red. The new clusters that are formed due to the conformational changes mediated by the presence of the linker are given in blue. Significant P-B clusters were not seen in either structure. The composition of the clusters in the linker region is given in
The importance of classifying DNA-binding proteins from the protein–DNA point of view, rather than the protein-centric view, is being recognized for developing a better protein–DNA recognition code
The interface P-p, P-S, and P-B clusters are examined in the protein–DNA complexes. The analysis of these clusters shows that the complexes can either have exclusive P-p, P-S, or P-B clusters or they can contain a mixture of these types of clusters. In cases where more than one type of cluster is observed, we define overlapping (amino acids sharing the same nucleotide) or non-overlapping (amino acids making contact with different nucleotides) clusters. Based on the types of clusters observed, the complexes are classified into seven groups. The complexes containing exclusive P-p, P-S, and P-B clusters are denoted as class 1, 2, and 3, respectively. Mixtures of (P-p+P-S), (P-S+P-B), and (P-p+P-B) clusters are denoted as class 4, 5, and 6, respectively. The complexes containing all the three component clusters (P-p, P-S, and P-B) are considered as class 7. A sub-classification based on the presence of overlapping or non-overlapping (P-p+P-S) or (P-S+P-B) or (P-p+P-B) components is made for classes 4 to 7. The details of the classification of the protein–DNA complexes are presented in
Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | |||||||
P-p clusters only | P-S clusters only | P-B clusters only | P-p and P-S clusters (no P-B clusters) | P-S and P-B clusters (no P-p clusters) | P-p and P-B clusters (no P-S clusters) | P-p, P-S, and P-B clusters are present | |||||||
Overlapping clusters | Non-overlapping clusters | Overlapping clusters | Non-overlapping clusters | Overlapping clusters | Non-overlapping clusters | Overlapping P-p, P-B, and P-S clusters | Non-overlapping P-p, P-B, and P-S clusters | ||||||
P-p and P-S clusters overlap but not P-B clusters | P-S and P-B clusters overlap but not P-p clusters | P-p and P-B clusters overlap but not P-S clusters | P-p, P-B and P-S clusters occur separately | ||||||||||
– | |||||||||||||
1cma- |
1azp- | 1zaa- | 1a31- | 1ecr- | 1bnz- | 1ckt- | 1ramA | 1apl- | 1bdt- | 1d3u- | 1bss- | 1ihf- |
|
1bf4- | 1a35- | 1xbr- |
1vkx- | 1lli- | 1tgh- | 1ipp- | |||||||
7ice- | 1bhm- |
1c9bB | 1cyq- | 1vol- | |||||||||
2dnj- | 1dnk- | 1bnk- | 1cdw- | 1an4- | 1a3qA | 1dctA | 2bdp- | 1tc3- | |||||
3orc- | 2rve- | 1t7pA | 1bpx- | 1hlo- |
1bf5-* | 1rv5- | 3ktq- | 1a74- |
|||||
3bam- | 1qss- | 10mh- | 1nfkA | 4skn- | 1ssp- | ||||||||
1skn- | 1qsy- | 1clq- | 5mht- | 1fjl- |
1vas- | ||||||||
6pax- | 2bpf- | 1pvi- |
1a1g- | 3pvi- | |||||||||
1ysa- |
2ktq- | 1tau- | 1aay-* | 1gdt- |
1cit- | ||||||||
1b3t- |
2ssp- | 2pvi- | 1d66-* | 1ignA |
1fok- | ||||||||
4ktq- | 1ubd-* | 1rpe- | 1hcr- |
||||||||||
1lat- | 1qrv- | 1zme- | 6cro- | 1mnm- |
|||||||||
1akh- | 1yrn- |
||||||||||||
1hddC |
1an2- | 2gli- |
3cro- |
||||||||||
1pdn- | |||||||||||||
3hddA | 1a02- | 1a6y- | |||||||||||
1a0a- | |||||||||||||
1aoi- | |||||||||||||
1glu- | |||||||||||||
1tsr- |
|||||||||||||
2nll- |
These protein–DNA complexes are also present in DS3 (see
The classification was done in the SMEC region (for DS2), which implies that we have considered only highly significant interactions of the amino acid side chains with the chemical components of the DNA.
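The seven-class scheme described above amounts to a lookup on the set of component cluster types found in a complex. A minimal Python sketch follows; the function names `interface_class` and `clusters_overlap` are hypothetical, and the representation of a cluster by the set of its nucleotide labels is an assumption made for illustration.

```python
# Class numbers follow the scheme in the text: exclusive P-p, P-S, P-B
# clusters are classes 1-3, the pairwise mixtures are classes 4-6, and
# the presence of all three component clusters is class 7.
CLASS_TABLE = {
    frozenset({"P-p"}): 1,
    frozenset({"P-S"}): 2,
    frozenset({"P-B"}): 3,
    frozenset({"P-p", "P-S"}): 4,
    frozenset({"P-S", "P-B"}): 5,
    frozenset({"P-p", "P-B"}): 6,
    frozenset({"P-p", "P-S", "P-B"}): 7,
}


def interface_class(cluster_types):
    """Map the cluster types present in a complex to a class number (1-7)."""
    present = frozenset(cluster_types)
    if present not in CLASS_TABLE:
        raise ValueError(f"no class defined for cluster types {sorted(present)}")
    return CLASS_TABLE[present]


def clusters_overlap(nucleotides_a, nucleotides_b):
    """Two clusters overlap if they share at least one nucleotide.

    Each argument is an iterable of nucleotide labels (hypothetical format).
    """
    return not set(nucleotides_a).isdisjoint(nucleotides_b)


if __name__ == "__main__":
    print(interface_class({"P-p", "P-S"}))            # 4
    print(clusters_overlap({"A6", "T7"}, {"T7"}))     # True
```

A complex with no clusters of any type at the chosen MEC has no class in this sketch and raises an error, which is a design choice of the illustration rather than a statement about the original analysis.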
It should be noted that our classification scheme based on the interaction patterns of amino acids with nucleotide components in PDG does not directly deal with the type of interaction involved (such as electrostatics, van der Waals, H-bonding, etc.). However, indirectly, the P-p clusters are dominated by electrostatic interactions and the P-S clusters are composed of van der Waals interactions along with stacking of aromatic residues with the deoxyribose ring. The P-B graphs are dominated by stacking of amino acids (mostly the planar side chain of Arg) with the bases, H-bonding and also charge-mediated interactions.
A comparison of the present classification with the structural motif based classification by Luscombe et al
Here we have shown that the classification of DNA-binding proteins (DBPs) based on the protein–DNA interface interactions at the molecular level differs significantly from the protein motif based classification, although a small degree of consensus is observed. DBPs have also been classified based on other criteria. For instance, Prabakaran et al
The fact that there is only a marginal overlap between different classification schemas underscores the versatility of protein–DNA recognition mechanisms. It may be valuable to use different approaches to obtain complementary information to understand the protein–DNA recognition mechanisms in detail.
The present study aims to represent a protein–DNA interface as an undirected bipartite graph based on non-covalent interactions. A quantitative method has been developed to represent the interactions between DNA and protein as a single, combined graph. Such a representation has facilitated the study of the spatial relationships between the amino acids and the nucleotides at the protein–DNA interface in a holistic way. Thus, protein–DNA interfaces across the spectrum of complexes could be compared at a uniform level, irrespective of the structural and functional differences.
In general, we have provided a method of quantifying the interactions of proteins with the components of nucleotides (phosphate, deoxyribose and base). It is now clear that the combined representation of protein and DNA as PDGs could highlight the intricacies involved in protein–DNA recognition of some families of proteins. For instance, the predominance of protein-deoxyribose (P-S) clusters and hubs has brought out the specificity of the interaction in β-sheet proteins. Such analysis and the group specific features of protein–DNA recognition could be used as a starting point in predicting the DNA binding sites on these proteins. We have also proposed a scheme for classifying the structures based on the nature of the network connectivity present at the protein–DNA interface. Based on comparative analysis, we conclude that different classification schemes could provide complementary information on the nature of protein–DNA interactions.
Thus, the analyses performed on a dataset of protein–DNA complexes have highlighted the nature of the clusters and hubs present at the recognition site. These clusters and hubs may not only prove valuable in understanding the residues contributing to the stability of the protein–DNA interfaces, but could also be identified as features characteristic of a given group of proteins. The knowledge gained from the study could also provide a platform for further docking and prediction experiments.
The protein–DNA complexes with resolution better than 2.5 Å and with protein identity less than 25% were taken from PDB (Version 3.1)
The interaction between the amino acids and the nucleotides at the protein–DNA interface is represented as undirected bipartite Protein–DNA Graphs (PDGs). Here, the amino acids comprise one node set and the nucleotides constitute the other node set of the bipartite-PDGs as shown in
Here the evaluation of the strength of interaction is restricted only to atom-atom contact and does not explicitly take into account the details such as hydrogen bond, salt bridge interactions etc. Indirectly, this amounts to a measure of packing density at the selected region.
The normalization values are the estimates of the maximum non-covalent contacts an amino acid or a nucleotide can have in protein–DNA complexes. The method of obtaining normalization values for amino acids in proteins was previously given
Here it should be noted that, while calculating the amino acid - amino acid contacts, the contacts made by an amino acid side chain with its sequence neighbors (
Protein-nucleic acid recognition mechanisms are often mediated by amino acids through a specific or nonspecific recognition of a nucleotide backbone or base at the protein–DNA interface. Quite often the electrostatic interactions of proteins with the phosphate groups are considered as non-specific and the stacking interactions and hydrogen bonding with bases are considered as specific. Furthermore, there is substantial conformational flexibility in the phosphodiester bond and in the conformation of the deoxyribose ring. Therefore, the nucleotides have been dissected into their chemical components such as the phosphate backbone (p), deoxyribose sugar (S), and the base (B). The interactions of an amino acid with all these individual components are characterized by constructing separate interaction graphs of amino acids with all the p, S, and B components of the DNA.
The normalization values (
These individual phosphate, sugar backbone, and base-specific normalization values are useful to obtain finer details regarding the molecular connectivity existing at the protein–DNA interface. The normalization values of the individual chemical components of the nucleotide thus obtained are given in
Nucleotides/nucleotide components | Normalization values |
Phosphate (PO4) | 25 |
Deoxyribose sugar | 28 |
Adenine | 144 |
Guanine | 156 |
Cytosine | 114 |
Thymine | 120 |
The normalization values were calculated on a dataset provided by Jones et al
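To illustrate how the normalization values enter the edge definition, the following Python sketch computes a percentage interaction strength from a raw atom-contact count. The exact form of Equation 1 is not reproduced in this text. The geometric-mean normalization used here, the 4.5 Å contact cutoff, and the amino acid normalization value of 100 are assumptions made for illustration, and all function names are hypothetical. Only the nucleotide-component values are taken from the table above.

```python
import math

# Normalization values for nucleotide components, from the table above.
NUCLEOTIDE_NORM = {
    "phosphate": 25,
    "deoxyribose": 28,
    "A": 144,
    "G": 156,
    "C": 114,
    "T": 120,
}


def count_contacts(atoms_a, atoms_b, cutoff=4.5):
    """Count atom pairs within `cutoff` angstroms of each other.

    atoms_a, atoms_b: iterables of (x, y, z) coordinates.
    The 4.5 angstrom default is an assumed value.
    """
    return sum(1 for a in atoms_a for b in atoms_b if math.dist(a, b) <= cutoff)


def interaction_strength(n_contacts, norm_amino_acid, norm_component):
    """Percentage interaction strength (assumed geometric-mean normalization)."""
    return 100.0 * n_contacts / math.sqrt(norm_amino_acid * norm_component)


def is_edge(strength, mec):
    """An edge is drawn when the strength reaches the chosen MEC (in percent)."""
    return strength >= mec


if __name__ == "__main__":
    # Hypothetical amino acid normalization value of 100, five atomic contacts
    # between the side chain and a phosphate group.
    strength = interaction_strength(5, 100.0, NUCLEOTIDE_NORM["phosphate"])
    print(strength, is_edge(strength, 5.0))  # 10.0 True
```

With these assumed inputs the strength is 10%, which would place the pair above the 5% upper bound listed for the optimal range of P-p graphs.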
The PDGs are constructed as specified above and represented as binary adjacency matrices at a given MEC. Clusters of interacting nodes are identified from the adjacency matrix using Depth First Search (DFS) algorithm
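The cluster identification step can be sketched as a depth-first traversal of the binary adjacency matrix. The Python example below is a generic connected-component search, not the authors' implementation; the function name `find_clusters` is hypothetical, and dropping isolated nodes (components of size one) is an assumption.

```python
def find_clusters(adjacency):
    """Return connected components (size >= 2) of an undirected graph.

    adjacency: square, symmetric binary matrix given as a list of lists,
    with amino acid and nucleotide nodes sharing one index space.
    """
    n = len(adjacency)
    visited = [False] * n
    clusters = []
    for start in range(n):
        if visited[start]:
            continue
        # Iterative depth-first search from an unvisited node.
        stack = [start]
        visited[start] = True
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in range(n):
                if adjacency[node][neighbour] and not visited[neighbour]:
                    visited[neighbour] = True
                    stack.append(neighbour)
        if len(component) > 1:  # isolated nodes are not reported as clusters
            clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    # Six nodes: edges 0-2, 1-2 and 3-4; node 5 is isolated.
    adj = [[0] * 6 for _ in range(6)]
    for i, j in [(0, 2), (1, 2), (3, 4)]:
        adj[i][j] = adj[j][i] = 1
    print(find_clusters(adj))  # [[0, 1, 2], [3, 4]]
```

Because the adjacency matrix is rebuilt for each MEC, running the same search over a range of MEC values gives the cluster sizes as a function of interaction strength.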
Hubs are highly interacting nodes in a graph. In protein structure graphs, a node was declared as a hub if it was connected to a minimum of four nodes
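Given the edge list of a PDG at a chosen MEC, hubs can be found by counting the distinct partners of each node. The following Python sketch assumes that node labels are unique across the amino acid and nucleotide sets; the function name `find_hubs` and the label format are hypothetical, while the threshold of four partners follows the hub definition used in this work.

```python
from collections import defaultdict


def find_hubs(edges, min_degree=4):
    """Return nodes connected to at least `min_degree` distinct partners.

    edges: iterable of (amino_acid_label, nucleotide_label) pairs.
    """
    partners = defaultdict(set)
    for amino_acid, nucleotide in edges:
        partners[amino_acid].add(nucleotide)
        partners[nucleotide].add(amino_acid)
    return sorted(node for node, linked in partners.items()
                  if len(linked) >= min_degree)


if __name__ == "__main__":
    # The P-B hub reported for the TATA box binding protein (1tgh):
    # Phe193 contacts the bases of four different nucleotides.
    edges = [
        ("Phe193", "T115"),
        ("Phe193", "A116"),
        ("Phe193", "T109"),
        ("Phe193", "A110"),
    ]
    print(find_hubs(edges))  # ['Phe193']
```

The same function reports nucleotide hubs as well, since each edge is counted from both ends.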
Average of the Largest Clusters (as a function of MEC) of all protein-DNA complexes in the dataset. (P-p graph in blue, P-S graph in green and P-B graph in brown). From the above plot we can see that the sizes of the largest clusters are large at lower MEC (1%–3%) and this region is classified as WMEC. There is a transition in the sizes between MEC 4%–5% (corresponding to OMEC). Beyond this transition zone, the cluster sizes decrease consistently with MEC (SMEC region). Hence we have chosen these values of MEC as cut-offs for the weak, optimal and strong MEC according to the behavior as described above, to analyze different P-p and P-B graphs. Further fine tuning was carried out based on the analysis of specific cases. In this process, we slightly modified the criteria for P-S clusters in which the OMEC was shifted to 4% to 8%. Therefore this plot gives an idea on the basis of binning the MEC (
(0.15 MB DOC)
A bipartite cluster based classification of DNA-binding proteins. P-p clusters are highlighted with yellow color for phosphate and brown for interacting amino acids. P-S clusters are highlighted with green color for deoxyribose sugar and purple color for the interacting amino acids and P-B clusters are highlighted with red color for bases and blue color for the interacting amino acid residues.
(4.11 MB EPS)
A comprehensive list of the protein-DNA complexes studied and the general clusters and hubs information.
(0.10 MB DOC)
Component clusters in different DNA-binding proteins.
(0.18 MB DOC)
Component hubs in DNA-binding proteins.
(0.85 MB DOC)
Equivalent clusters in 1aoi (without linker) and 1s32 (with linker).
(0.04 MB DOC)
Normalization values evaluated from the updated dataset (DS1).
(0.03 MB DOC)