Identification of Position-Specific Correlations between DNA-Binding Domains and Their Binding Sites. Application to the MerR Family of Transcription Factors

The large and increasing volume of genomic data analyzed by comparative methods provides information about transcription factors and their binding sites that, in turn, enables statistical analysis of correlations between factors and sites, uncovering mechanisms and evolution of specific protein-DNA recognition. Here we present an online tool, Prot-DNA-Korr, designed to identify and analyze crucial protein-DNA pairs of positions in a family of transcription factors. Correlations are identified by analysis of mutual information between columns of protein and DNA alignments. The algorithm reduces the effects of common phylogenetic history and of abundance of closely related proteins and binding sites. We apply it to five closely related subfamilies of the MerR family of bacterial transcription factors that regulate heavy metal resistance systems. We validate the approach using known 3D structures of MerR-family proteins in complexes with their cognate DNA binding sites and demonstrate that a significant fraction of correlated positions indeed form specific side-chain-to-base contacts. The joint distribution of amino acids and nucleotides hence may be used to predict changes of specificity for point mutations in transcription factors.


Introduction
Specific binding of transcription factors to DNA is a major mechanism of regulation of gene expression, hence boosting interest to the problem of the protein-DNA recognition code. Initial hopes stemmed from the observations that single amino acid substitutions can drastically change the protein affinity to its DNA sites. On the other hand, the structure of the DNA double helix is relatively rigid. An early (mid-70s) paper suggested that specific recognition depends on hydrogen bonds between side chains of amino acid residues and nucleotides bases, factors [30]. The correlation between the level of conservation of specific residues in DNAbinding proteins and that of DNA sites has been demonstrated for 21 protein families [31]. The residues contacting the sugar-phosphate backbone are conserved, whereas the residues contacting nucleotide bases are conserved if binding motifs are similar for all proteins from a family, and variable otherwise. Within a genome, there is a correlation between the degree of conservation of a consensus nucleotide and the number of contacts it forms with DNA [32]. In the TAL-effector family of Xanthomonas TFs, injected into plant cells during infection, there exists a recognition code linking pairs of amino acid residues, so-called repeat-variable diresidues, and base pairs in the recognized site [33,34], and this code may be used to predict TALeffector targets [35,36]. A similar code was suggested for the CRO family of phage TFs [37].
These and similar observations formed a base for the identification of specificity-determining positions in aligned, homologous protein sequences divided into groups by specificity towards ligands, cofactors or DNA motifs [38]. For each alignment column, the mutual information is calculated as a measure of correlation between the positional amino acid distribution and the division into specificity groups. This method was applied to identification of specificity-determining positions in prokaryotic [38,39] and eukaryotic [40] transcription factors, and the predictions were in good agreement with the structural and mutagenesis data. The main drawback of the method, the need to define specificity groups in advance, may be partially offset by automated clustering of protein sequences [40,41].
Similar methods based on measuring the mutual information are widely used for the identification of protein-protein interactions (e.g. [42,43]) or even prediction of the protein threedimensional structure [44]. They do not require structural or phylogenetic information. Such methods were applied to identify a fraction of functionally important contacts in several families of eukaryotic TFs [45,46] and the LACI family of bacterial TFs [28]. A caveat is that this method requires large training samples and an estimate of expected mutual information. It also, by construction, underestimates the importance of conserved positions. One more problem is that it is sensitive to shared evolutionary history of the analyzed factors (phylogenetic trace), and special techniques need to be developed to get rid of the latter [38,43]. A related approach, applied to the EGR subfamily of eukaryotic zinc finger TFs [47] and to bacterial LACI and TETR families [48], is assigning interaction energies to contacting pairs of residues and bases, and it may suffer from similar drawbacks. Direct analysis of available structures supplemented with calculation of a physical energy function was used to redefine binding motifs for 67 yeast TFs [49,50]. Binding specificity predictions derived from 3D structures are systemized in the 3D-footprint database [51].
Predicted specific interactions were used to construct mutant TFs with new specificities for a variety of families, both eukaryotic, e.g. zinc fingers [52,53] and BHLH [54], and prokaryotic, such as TAL effectors [55], LACI [28], and CRP/FNR [56]. On the other hand, extensive experimental screens sometimes produced discouraging results: randomization of DNA-interacting residues of a zinc-finger protein Zif268 [57] and LACI-family TFs [58] did not yield consistent, family-specific protein-DNA interaction codes. Most residues, including non-contacting ones, were shown to influence binding of LACI-family TFs [59,60]. Contacting residues are not sufficient to explain binding specificity of eukaryotic FOX (forkhead box) TFs [61].
Previously we adapted a number of techniques used to identify specificity determining positions [39] to the identification of correlated protein and nucleotide positions, likely important for protein-DNA recognition. In addition to simple computation of mutual information, our algorithm assesses statistical significance correcting for (possible) overrepresentation of closely related TFs and common ancestry (phylogenetic trace) of some subgroups in a dataset. An objective threshold is set based on probabilistic calculations (the so-called Bernoulli threshold). The algorithm was implemented as a web server Prot-DNA-Korr (http://bioinf.fbb.msu.ru/Prot-DNA-Korr) and applied to study co-evolution of TFs and binding motifs in the NRTR [62] and REX [63] families of TFs. Here we describe it in detail and apply to the MERR family of bacterial TFs.
Experimentally studied proteins, MerR, HmrR, CueR, ZntR, CadR, PbrR, GolS use monoand divalent metal ions as ligands [64,77,78]. In addition, several heavy-metal resistance regulons (sets of operons regulated by particular TFs) were subject for a comparative-genomics study [79]. The binding sites of these TFs are located between the promoter −35 and −10 boxes of the regulated operons, an arrangement being typical for MERR-family transcriptional activators. Moreover, the distance between the promoter boxes in such promoters equals 19-20 bp instead of usual 16-17 bp [64, 69, 70, 79, 80]. The mechanism of transcriptional activation is known from structural and mutational studies [81]. DNA untwisting and base pair distortion decrease the distance between the promoter boxes and set them in a conformation capable of binding by the RNA polymerase. This distance change approximately equals 2 bp. Deletion of 2 bp from the promoter spacer has the same effect on transcription.
The crystal structures in complexes with DNA are known for six MERR-family proteins: BmrR [81][82][83][84], MtaN [82] and GlnR [85] from Bacillus subtilis, TnrA from Bacillus megaterium [85], SoxR from Escherichia coli [86,87], and TipAL from Streptomyces lividans (PDB ID 2VZ4). None of them are involved in heavy metal resistance. DNA-free structures are available for BmrR [88] and Mta [89] from B. subtilis, CueR and ZntR from E. coli [90], NmlR from Bacillus thuringiensis (PDB ID 3GPV), BC_0953 from Bacillus cereus (PDB ID 3HH0), LMOf2365_2715 (PDB ID 3GP4) and lmo0526 (PDB ID 3QAO) from Listeria monocytogenes, and SCO5550 from Streptomyces coelicolor [91]. These structures show that TFs from the MERR family have very similar spatial conformations, GlnR, TnrA and SCO5550 being exceptions. The DNA-binding winged helix-turn-helix (WHTH) domain is located in the N-terminus followed by the antiparallel coiled coil providing dimerization. The ligand-binding domains located in the C-terminus may differ in length, sequence and structure. SCO5550 has a different dimerization domain resulting in a different overall structure. GlnR and TnrA have a dimerization domain located in the N-terminus that results in a different mode of interaction between monomers also yielding a different overall dimer architecture. Similar crystal structures and promoter organization suggest that the mechanism of transcriptional activation is the same for all MERR-family activators sharing this structural organization.

Methods
Here we describe an outline of the algorithm for the identification of correlated pairs of positions. The details for each step are presented in the Results section. The program takes TFs and TFBSs alignments as an input. For each pair of alignment positions we calculate the frequencies of 'nucleotide-amino acid' (NT-AA) pairs. From the observed and expected (under hypothesis of independence) frequences we derive a measure of correlation between pair of columns, mutual information. Applying the above steps for randomly generated pairs of columns, we obtain the expected mutual information values, which are then corrected by linear transformation to take into account shared ancestry of sequences as described in [38]. From the observed and expected mutual information values, a measure of statistical significance, Z-score, is then derived. Pairs with top Z-scores are designated as statistically significantly correlated. The actual number of pairs is determined by the Bernoulli cutoff procedure [39].

Study of MERR-family regulators of heavy-metal resistance
Genomic and protein sequences were taken from GenBank RefSeq database (release 55) [92]. Three-dimensional structures of proteins were taken from the PDB database [93]. The Gen-Bank CDD database [94] was used for classification of transcription factor (TF) into subfamilies. Protein-DNA molecular contacts were taken from the NPIDB database [95]. Van der Waals contacts were taken from the articles in which the structures were published. In the NRTR, REX, MERR cross-family study, Van der Waals contacts were obtained using the HBPLUS utility [96]. Structure-based multiple protein sequence alignments were built using the PRO-MALS3D program [97]. Phylogenetic trees were constructed using the MEGA5 package [98]. The GenomeExplorer package [99] was used to build positional weighted matrices (PWMs) and to search genomic sequences for transcription factor binding sites (TFBSs) and promoters. TFBS and operon data were submitted to the RegPrecise database [27]. Ancestral protein and DNA sequences were reconstructed using the PAML package [100]. Sequence logos were generated using the WebLogo program [101].

Algorithm for the identification of correlated pairs of positions
The correlation between the residues A in an amino acid alignment column i and the bases N in a nucleotide alignment column j is measured using the mutual information: where f i,j (a, n) is the observed weighted frequency of a pair (amino acid a in the TF alignment column i, nucleotide n in the site alignment column j) and f exp i;j ða; nÞ ¼ f i ðaÞ Â f j ðnÞ is the expected weighted frequency of this pair computed as a product of f i (a), the weighted frequency of the amino acid a at the column i, and f j (n), the weighted frequency of the nucleotide n at the column j.
To estimate the statistical significance of the observed mutual information values, one needs the distribution of mutual information for a random pair of columns I $ i;j . In order to obtain it, TF-site pairs are randomly reconnected 10,000 times. Further, a linear transformation is applied to take into account shared ancestries (the phylogenetic trace), as described in [38]. Finally the Z-score, a measure of statistical significance, is calculated as: where EðI $ i;j Þ and sðI $ i;j Þ are the mean and the standard deviation, respectively. The pairs are ranked by calculated Z-scores, and the top k pairs are selected, where k is determined by the Bernoulli cutoff procedure [39]. In a nutshell it minimizes the probability (reported as p-value) to observe k given Z-scores from the Gaussian distribution.
Weighting. To avoid overrepresentation of similar, and closely related sequences, we introduce weights of pairs TF-site as products of weights of individual TF and site sequences: Here, the number of pairs residue a-nucleotide n (further denoted by [a − n]) in column [i, j] is calculated as the sum of weights of TF-site pairs: where RS a;n i;j is the set of TF-site pairs with the pair [a − n] in the columns [i, j]. Similarly, residues a in the column i are counted as: where RS a i is the set of TF-site pairs with the residue a at position i in TF. Weights of TFs are determined using the Gerstein-Sonnhammer-Chothia algorithm [102]. To do that, the phylogenetic tree of TFs was constructed using the neighbor-joining method implemented in Clustal [103] and rooted in the middle of the longest path between leaves.
Pseudocounts. To account for non-observed data and to avoid null frequencies, we introduced pseudocounts supplementing the set of N observed sequences by k ffiffiffiffi N p random sequences with amino acid and nucleotide frequencies drawn from the respective alignment columns. At that, the amino acid pseudocounts reflected the amino acid substitution matrix, as in the SDPPred algorithm [39], and the normalized frequency of the amino acid a in the alignment column i was defined as: where N i (a) is the weighted count of amino acid a in column i, N is the total number of residues in the alignment column, P(b ! a) is the probability of substitution b ! a computed by the BLOSUM [104] matrix at identity level 30-40%, κ = 0.5 is a parameter regulating the contribution of pseudocounts. The nucleotide pseudocounts are introduced in the same way with substitution probabilities P(m ! n) = 1/4 for each pair m, n.
Finally, the frequency of a pair [a − n] in columns [i, j] is computed as: By our null hypothesis nucleotide and residue substitutions are independent, thus P(b, m ! a, n) = P(b ! a) × 1/4 and: Implementation. The algorithm is implemented in the Java language and thus can be executed on any computer provided a Java virtual machine is installed. The program and the source code may be accessed from the web at http://bioinf.fbb.msu.ru/Prot-DNA-Korr.
Calculated Z-scores are graphically represented via an interactive heatmap plot (TFs vs TFBSs). A detailed NT-AA contingency table for a requested pair of positions can be drawn for in-depth analysis. Under-and overrepresented NT-AA pairs in the table are emphasized by coloring based on an arbitrary χ 2 -score summand cutoff (50 by default). The contingency tables, along with the tables of χ 2 and mutual information summands, as well as list of Z-scores may be exported as a plain text.

Analysis of heavy-metal resistance regulators from the MERR family
Identification of transcription factors and construction of motifs. Proteins containing HTH_CueR, HTH_MerR1, HTH_CadR-PbrR, HTH_CadR-PbrR-like and HTH_HMRTR conserved domains (Specific Protein option in GenBank CDD) were downloaded from the GenBank RefSeq database. Further in this study they are referred to as TFs from the CUER, MERR, CADR-PBRR, CADR-PBRR-like and HMRTR subfamilies, respectively. Only proteins encoded in completely assembled genomes were retained for the analysis. In total they contained 1516 TFs (see Table 1 for details). TFs with sequences longer than 190 bp and shorter than 110 bp were excluded from the study, given that typical proteins of these subfamilies have the length of 130-140 bp [79,90]. Structure-based multiple sequence alignments were constructed using structural information from CueR (PDB ID 1Q05, 1Q06, 1Q07) and ZntR (PDB ID 1Q08, 1Q09, 1Q0A) from Escherichia coli [90]. Phylogenetic trees for each subfamily were built by the neighbor-joining method with pairwise gap deletion option that keeps the information from gap-containing columns. BmrR from Bacillus subtilis (GI 50812267) was used as an outgroup. Only one of each group of nearly identical proteins (distance between the leaves on the tree less than 0.02) encoded in genomes of different strains of the same species was retained for further study. Following the application of these procedures, 906 TFs remained in the studied set ( Table 1). Most of them (783 TFs) are encoded in genomes of Proteobacteria. Other phyla represented in this set include Actinobacteria (52 TFs), Cyanobacteria (22 TFs), and Firmicutes (18 TFs).
We built selective PWMs for searching the genomes for putative TFBSs using sites from [79] as a starting point. One PWM per subfamily was built with the exception for MERR and HMRTR where a single PWM did not provide desired sensitivity. Hence, two PWMs were constructed for the MERR subfamily and three for the HMRTR subfamily, each corresponding to a separate smaller branch on the phylogenetic tree of studied TFs. The PWMs are presented in S1 File. The length of CUER, CADR-PBRR and CADR-PBRR-like motifs was 21 bp, whereas the length of MERR and HMRTR motifs was 22 bp. The selected genomes were searched for TFBSs in regions from −400 to +50 bp relative to the gene translation start sites annotated in Gen-Bank. The threshold for TFBS search was set to 3.5. Exclusion of false positive TFBS. Numerous experimental and computational studies of promoters regulated by transcriptional activators from the MERR family show that these TFs bind specific sites located between the promoter boxes of the regulated operons [64,69,70,79,80]. Moreover, the distance between the promoter boxes in such promoters equals 19-20 bp instead of usual 16-17 bp. Previous studies [105] demonstrated that the distance between the center of the TFBS and the 3'-end of the −35 promoter box is fixed within a subfamily. Putative promoters were found using the E. coli σ70 promoter consensus TTGACA-()-TATAAT. The distance between the TFBS centers and the −35 promoter boxes equals 7 bp for CUER, CADR-PBRR and CADR-PBRR-like sites (21-bp long) and 8 bp for MERR and HMRTR sites (22-bp long). TFBSs scoring above the threshold were considered false positives if they did not overlap with candidate promoters having 19-20 bp spacers or the distance between the center of the site and the 3'-end of the −35 promoter box is other than 7 bp for 21-bp sites and 8 bp for 22-bp sites. Further, only sites co-localized with the TF gene and/or located upstream of genes with relevant function (heavy metal resistance) were retained.
Using this procedure, 884 TFBSs were identified for 763 TFs (Table 1, S2 File). We tested how the usage of site and promoter overlap affects the number of found sites. We did this for weak sites (with scores from 3.5 to 5.0) and strong sites (with scores above 5.0). For strong sites, the number of candidates grows only slightly when the promoter information is omitted. In contrast, for weak sites this number grows tremendously. On average, 97% of candidate sites in a genome are weak sites without promoter support (S3 File). Therefore we used the information about putative promoters.
Sequence Logos for the sites of each studied subfamily are presented in Fig 1. Each Logo includes the binding motif as well as the −35 and −10 promoter boxes and three flanking positions. Then we built the phylogenetic tree for TFs with identified sites (Fig 1). TFs from different subfamilies form distinct branches on the tree with only several exceptions, in agreement with manually curated conserved-domain classification of HMR TFs from the MERR family provided in GenBank CDD (GenBank CDD accession number cl02600). The identified TFBSs were then aligned: one central position was deleted from the 21-bp long sites, and two, from the 22-bp long sites. For the computation of correlations, the alignment block containing 74 columns was taken from the alignment of TFs of all studied subfamilies. This block completely covers the N-terminal DNA-binding winged helix-turn-helix (WHTH) domain of these proteins. A set of corresponding pairs of protein and DNA sequences was formed by this block and the alignment of TFBSs. After deleting duplicate pairs, we obtained a set containing 776 unique pairs of corresponding protein and DNA sequences. This heatmap shows imperfect symmetry due to imperfect symmetry of TF binding sites and respective binding motifs. We searched the literature and the NPIDB database [95] for the contacts of TFs from the MERR family with DNA (Fig 2). All protein-DNA contacts (side chains to bases, side chains to DNA backbone and protein backbone to DNA backbone) are presented in S2 Fig overlaid with the same heatmap. A pair of positions was marked as interacting if the interaction was reported at least once. Since CueR and ZntR structures were resolved in the DNA-free form [90], the experimental contacts come from crystal structures of proteins from subfamilies not included in the present study [81][82][83][84][85]87]. However, these experimental data are relevant, as the structures of WHTH DNA-binding domains of the TFs from the MERR family are conserved [82,85,87,91]. These crystal structures of dimeric TFs (except MtaN and GlnR from B. subtilis) consist of one monomer and one DNA strand. The GlnR structure includes one monomer and one double-stranded half-site and MtaN structure includes both monomers and complete double-stranded site. Therefore we performed a mirror reflection of the contacts to cover both half-sites. This results in a strictly symmetrical map of contacts (Fig 2, S2 Fig).
Overall, 36 experimentally identified interacting pairs (side chain to base) were found. Nine pairs appear as both correlated and forming side-chain-to-base contacts (Fisher's exact test pvalue of 1.96 × 10 −8 ). This proves the relevance of the applied procedure and cutoff selection. Of 32 correlated pairs, 23 are located in the recognition α-helix of the HTH domain of MERR- family proteins (positions 13-21 in the protein alignment in Fig 2). Other correlated pairs correspond to the α1-helix of the WHTH domain and the β-hairpin between the β1 and β2 strands that constitutes the first wing of the WHTH domain. The most significantly correlated pairs of positions are symmetrical (6,14) and (13,14). Other correlations are much less significant.
Hereinafter pairs of positions are referred to as (j, i), where the TFBS position comes before the comma and the TF position, after. Symmetrical TFBS positions give 19 when summed. The 'nucleotide-amino acid' pairs for the respective pairs of positions are denoted as NT-AA.
Over-and underrepresented NT-AA pairs along with subfamilies where they preferably occur are listed in Table 2.
We mapped correlated pairs on the phylogenetic tree of the studied TFs (Fig 1), using only pairs where several overrepresented pairs of residues had large (over 50) counts: (3, 13)-S3 Fig, (5,14)-S4 Fig, (6,14)- S5 Fig and (6,21)-S6 Fig. These data show that the same overrepresented pairs NT-AA appeared several times independently in course of evolution. We tested whether mutations in the TF DNA-binding domains lead to subsequent changes in binding motifs. At that, we reconstructed ancestral sequences of studied TFs and their binding sites in internal nodes of the phylogenetic tree of the TFs (data not shown). We used the Jones-Taylor-Thornton (JTT) substitution model for amino acids and general time-reversible (REV/GTR) model for nucleotides. However, we could not observe a prevalence of either protein-DNA or DNA-protein order of mutations leading to the formation of overrepresented pairs.

Algorithm performance analysis
Input data bootstraping. We studied to what extent our method tolerates inadequate data in the input. For that, we progressively shuffled residues in 10%, 20%, etc. of aligned protein sequences, simulating misalignment and wrong input data. Each progressive step was    Table 3 shows that half of 32 significantly correlated pairs remain in the list even if 50% of the data is scrambled. Moreover, top two correlated pairs remain in the list with only 30% of the valid input data. On the other hand, the weaker 1/3 of the list fall below the threshold with only 10% of scrambled data. While the ranks of the said pairs usually drop only slightly below the 32 rank threshold (S5 File), this happens in a consistent manner. For instance, the (18,35) pair originally having rank 31 never gets to the top 32 pairs with 10% of the data scrambled.
The bootstrap table suggests that the bottom 1/3 of the correlated list are sensitive to the input data quality and, together with some pairs falling just below the significance threshold may be considered as the "grey area".
We also performed negative control of our method by providing shuffled regulator-site pairs from the initial input data. The shuffling was performed similarly to shuffling for expected mutual information required for Z-scores calculation. Conservation and correlation. Protein positional information content used in the logo generation as a measure of conservation was compared with Z-scores for corresponding pairs of columns. The correlated protein positions appear to be moderately conserved. Fig 3 suggests that overall highly conserved residues tend to have lower z-scores.
Conserved correlated residues that form contacts are very rare. In three PDB (3ikt, 3gz6, 1r8e) structures from the REX, NRTR and MERR families, respectively, we found 43 contacts with either of the partners being conserved (S6 File). Only two such contacts in the NrtR structure appeared to be correlated.

Discussion
We developed and implemented an algorithm for the identification of pairs of positions that are important for the protein-DNA recognition. Our method requires multiple alignments of DNA-binding proteins and of their respective sites. The method does not rely on known 3D structures of protein-DNA complexes, here we rather use them to validate our results. It should be noted that of necessity the contacts and correlations were identified on different sets of TFs belonging to different subfamilies.
The comparison with structural data shows good agreement both in quantitative and qualitative terms. The sets of correlated and contacting pairs strongly overlap (Fisher's test p-value 1.96 × 10 −8 ). The recognition helix of the HTH domain contains a large cluster of correlated pairs. According to classic Suzuki studies of spatial structures [2], residues 1, 2, 6 of the recognition helix that face the DNA major groove are most important for the protein-DNA recognition as they form hydrogen bonds with DNA bases hence allowing the protein to read the DNA sequence. Here these residues participated in correlated pairs, with residue 2 being the most correlated. The MERR and previously studied REX [63] and NRTR [62] families provide correlation data on three families HTH binding domains. Residue 6 participates in correlations in all three families and residues 1, 2 in two familes each.
Among hydrogen bonds, Van der Waals interactions, and hydrophobic contacts in the three families we do not see any preference for either type to correspond to correlations (S5 File). However, the data are not sufficient to form a solid conclusion.  (6,14)  100  100  100  100  100  100  99  90  46  12  3   (13,14)  100  100  100  100  100  100  98  85  40  5  2   (3,13)  100  100  100  100  100  99  87  67  35  14  Although significantly correlated pairs are likely to be contacting ones, our algorithm is not merely a substitute for a 3D structures analysis. Conserved interactions will not demonstrate correlations due to the lack of sequence variation [45]. On the other hand, some residues may affect specificity indirectly, and it would be difficult to identify them in 3D structures. The correlation analysis identifies all coevolving pairs of positions and along with overrepresented NT-AA pairs thus providing hints for future experimental investigations [56].
In most correlated pairs of positions, overrepresented NT-AA pairs appear independently multiple times in course of evolution of the studied TFs. It has been shown that binding sites for existing TFs can emerge rather rapidly from sequences that resemble weak sites [106,107]. This model implies that changes in a TF sequence, decreasing its affinity to pre-existing sites, yield changes in the sites, hence restoring the effective binding. The binding motif (in the simplest form, the consensus of the sites) changes accordingly. We reconstructed ancestral sequences of TFs and the respective DNA motifs, but failed to confirm the hypothesis about the leading role of substitutions in TFs yielding subsequent substitutions in recognized sites and hence motifs. We used several crystal structures of related TFs in the DNA-bound form to demonstrate high level of coincidence between correlated pairs and contacting positions. At that, it is plausible that the conserved positions provide for the initial DNA binding, whereas correlated positions fine-tune interactions with specific sites. A proof of concept was provided by an experimental study of CRP, that demonstrated lack of specific binding after individual mutations in either the TF or the site, but partially reconstituted binding after dual TF-site mutations substituting one preferred NT-AA pair to another pair preferred at the given positions [56]. While existing computational methods may not predict DNA motif given only TF sequence and 3D structure, some progress has been already made. For example, it is possible to match each TF from a given family, present in a genome, to the respective motif from a given set of motifs recognized by these TFs in the same genome [108]. The latter situation arises in comparative-genomic prediction of transcriptional networks.
TFBS prediction and regulon reconstruction in multiple related genomes using comparative genomic approaches has become a major source of information about regulatory networks. Combined with identification of correlations between the sequences of TFs and their binding sites, they may become powerful tools for studying the evolution of TF families and coevolution of interacting protein and DNA sequences using sequence data alone.