Figure 1.
The Canonical Cys2His2 Zinc Finger DNA Binding Model
Residues at positions 6, 3, 2, and −1 (relative to the beginning of the α-helix) of each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows). (Figure adapted from a figure by Prof. Aaron Klug, with permission.)
Figure 2.
Estimating DNA-Recognition Preferences
The DNA-recognition preferences are estimated from unaligned pairs of transcription factors and their DNA targets [2] (top). The EM algorithm [13] is used to simultaneously infer the exact binding position within each protein–DNA pair (bottom right) and to estimate four sets of position-specific DNA-recognition preferences (bottom left).
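The alternation between locating binding positions and re-estimating preferences can be illustrated with a minimal EM sketch. This is a simplified illustration, not the procedure used in the paper: it assumes a single DNA strand, a uniform background outside the site, a uniform starting point, and a hypothetical input format in which each site column is labeled with the (helix position, amino acid) pair assumed to face it. The function name `em_preferences` and the pseudocount value are likewise assumptions.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"

def em_preferences(pairs, iterations=25, pseudocount=0.1):
    """Toy EM over unaligned protein-DNA pairs.

    pairs: list of (contacts, dna). contacts[i] is the (helix_position,
    amino_acid) key assumed to face column i of the binding site; dna is the
    unaligned target sequence (must be at least as long as the site).
    Returns the estimated preferences and the posteriors over start positions
    computed in the final E-step.
    """
    keys = {key for contacts, _ in pairs for key in contacts}
    prefs = {key: {nt: 0.25 for nt in NUCLEOTIDES} for key in keys}
    posteriors = []
    for _ in range(iterations):
        counts = {key: defaultdict(float) for key in keys}
        posteriors = []
        for contacts, dna in pairs:
            starts = range(len(dna) - len(contacts) + 1)
            # E-step: likelihood of every candidate start position. The
            # uniform background outside the site is the same constant for
            # every start, so it cancels in the posterior.
            likelihoods = []
            for start in starts:
                value = 1.0
                for i, key in enumerate(contacts):
                    value *= prefs[key][dna[start + i]]
                likelihoods.append(value)
            total = sum(likelihoods)
            posterior = [value / total for value in likelihoods]
            posteriors.append(posterior)
            # Expected nucleotide counts, weighted by the posterior.
            for start, weight in zip(starts, posterior):
                for i, key in enumerate(contacts):
                    counts[key][dna[start + i]] += weight
        # M-step: renormalize the expected counts (plus a pseudocount).
        prefs = {}
        for key in keys:
            norm = sum(counts[key][nt] + pseudocount for nt in NUCLEOTIDES)
            prefs[key] = {nt: (counts[key][nt] + pseudocount) / norm
                          for nt in NUCLEOTIDES}
    return prefs, posteriors
```

Because the keys include the helix position, the returned table naturally splits into one set of preferences per position (6, 3, 2, and −1), mirroring the four sets described in the legend.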
Figure 3.
Predicting the DNA Binding Site Motifs of Novel Transcription Factors
The protein's DNA-binding domains are identified using the Cys2His2 conserved pattern (top left). The residues at the key positions (6, 3, 2, and −1) of each finger (marked in red in the bottom left panel) are then mapped onto the canonical binding model (bottom right), and the sets of position-specific DNA-recognition preferences (top right panel) are used to construct a probabilistic model of the DNA binding site. For example, the lysine at the sixth position of the third finger faces the first position of the binding site (dotted blue arrow). We predict the nucleotide probabilities at this position using the appropriate recognition preferences (dotted black arrow).
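The mapping from key residues to a probabilistic binding-site model can be sketched as follows. This is a simplified illustration under stated assumptions: the preference values in `PREFS` are made-up placeholders (not the estimates shown in Figure 4), only the contacts from positions 6, 3, and −1 are used (the position 2 contact is omitted), residues without an entry fall back to a uniform distribution, and the names `PREFS` and `predict_motif` are hypothetical.

```python
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical P(nucleotide | helix position, amino acid); placeholders only.
PREFS = {
    (6, "K"): {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    (3, "H"): {"A": 0.15, "C": 0.05, "G": 0.70, "T": 0.10},
    (-1, "R"): {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
}

def predict_motif(fingers, prefs=PREFS):
    """Build a list of nucleotide distributions, one per binding-site column.

    fingers: list of {helix_position: amino_acid} dicts, ordered from the
    first to the last finger. Following the example in the legend, position 6
    of the last finger faces the first column of the site, so fingers are
    visited in reverse order.
    """
    motif = []
    for finger in reversed(fingers):
        for position in (6, 3, -1):  # first to last column of the triplet
            key = (position, finger[position])
            motif.append(dict(prefs.get(key, UNIFORM)))
    return motif

# Usage: a three-finger protein whose third finger carries lysine at position 6.
fingers = [
    {6: "R", 3: "E", 2: "D", -1: "R"},
    {6: "R", 3: "H", 2: "D", -1: "R"},
    {6: "K", 3: "H", 2: "S", -1: "R"},
]
motif = predict_motif(fingers)
```

Here `motif[0]` is the distribution assigned to lysine at position 6, matching the dotted arrows in the figure; the remaining eight columns follow the same lookup.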
Figure 4.
Four Sets of Position-Specific DNA-Recognition Preferences in Zinc Fingers
The estimated sets of DNA-recognition preferences for the DNA-binding residues at positions 6, 3, 2, and −1 of the zinc finger domain are displayed as sequence logos. At each position, the associated distribution of nucleotides is displayed for each amino acid. The total height of the letters represents the information content (in bits) of the position, and the relative height of each letter represents its probability. Color intensity indicates the level of confidence for a given amino acid at a given position: paler colors indicate lower confidence because the amino acid occurs rarely at that position in the training data (see Tables S1 and S2 for full data). Some of the DNA-binding preferences are general, regardless of the residue's position within the zinc finger (e.g., lysine's tendency to bind guanine), while others are position-dependent (e.g., the tendency of phenylalanine to bind cytosine only when in position 2).
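The letter heights in such a logo follow from the nucleotide distribution alone. A minimal sketch, assuming the standard definition of information content for a four-letter alphabet (2 bits minus the Shannon entropy) and no small-sample correction; the function names are illustrative:

```python
import math

def information_content(column):
    """Information content in bits: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

def letter_heights(column):
    """Height of each letter: its probability times the column's information content."""
    total_height = information_content(column)
    return {nt: p * total_height for nt, p in column.items()}
```

A uniform column therefore has zero height, while a column that always shows the same nucleotide reaches the maximum of 2 bits.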
Figure 5.
Validation of DNA-Recognition Preferences
(A) The predicted binding site model of human Sp1 protein is compared to its known site (matrix V$SP1_Q6 from TRANSFAC [2], based on 108 aligned binding sites). To prevent bias by known Sp1 sites in our training data, the set of DNA-recognition preferences was estimated from the TRANSFAC data after removing all Sp1 sites.
(B) Scan of the 300-bp promoter of human dihydrofolate reductase (DHFR) with the predicted Sp1 binding model. The p-value of each potential binding site is shown (y-axis). Four positions achieved a significant p-value (above the horizontal red line), three of which are known Sp1 binding sites [41] (arrows).
(C) A summary of in silico binding experiments for 21 pairs of zinc finger transcription factors and their target promoters. Shown is the tradeoff between false positive rate (x-axis) and true positive rate (y-axis) as the significance threshold for putative binding sites is varied. At every threshold, our set of recognition preferences (EM) achieves better accuracy than the preferences of Mandel-Gutfreund et al. [5] (M-G) and Benos et al. [15] (SAMIE). Interestingly, when the DNA-recognition preferences were estimated from training data expanded to include TRANSFAC's artificial sequences, inferior results were obtained (dotted red line).
(D) Cumulative distribution of Sp1 scores among target and non-target sequences from unbiased chromatin immunoprecipitation scans of human Chromosomes 21 and 22 [16]. The predicted Sp1 motif appears in 45% of the experimentally bound sequences but in only 5% of the control sequences.
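The scan in panel B and the tradeoff curve in panel C can be sketched as follows. This is an illustration under assumptions not stated in the legends: sites are scored by log-odds against a uniform background on one strand only, scores are rounded to integers so that an exact p-value can be obtained by dynamic programming, and every motif probability is strictly positive. All function names are hypothetical.

```python
import math
from collections import defaultdict

BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def integer_log_odds(motif, background=BACKGROUND, scale=100):
    """Convert a probabilistic motif into integer log-odds scores."""
    return [{nt: round(scale * math.log2(col[nt] / background[nt]))
             for nt in background} for col in motif]

def scan(sequence, matrix):
    """Score every window of the sequence (one strand only)."""
    width = len(matrix)
    return [sum(matrix[i][sequence[start + i]] for i in range(width))
            for start in range(len(sequence) - width + 1)]

def score_distribution(matrix, background=BACKGROUND):
    """Exact distribution of window scores under the background model."""
    dist = {0: 1.0}
    for col in matrix:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for nt, q in background.items():
                nxt[score + col[nt]] += prob * q
        dist = nxt
    return dist

def p_value(score, dist):
    """Probability that a background window scores at least this high."""
    return sum(prob for s, prob in dist.items() if s >= score)

def roc_points(target_scores, non_target_scores):
    """(false positive rate, true positive rate) pairs, strictest threshold first."""
    thresholds = sorted(set(target_scores) | set(non_target_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in target_scores) / len(target_scores)
        fpr = sum(s >= t for s in non_target_scores) / len(non_target_scores)
        points.append((fpr, tpr))
    return points
```

Sweeping the threshold from strict to lenient traces the curve from (0, 0) to (1, 1); a set of preferences that ranks true sites above non-sites yields a curve closer to the top left corner.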
Figure 6.
Inferring the Function and Activity of Zinc Finger Transcription Factors in D. melanogaster
(A) Similar gene annotation enrichment among the putative target sets of 29 transcription factors in D. melanogaster. Blue cells correspond to significant overabundance of a GO term (row) among the predicted targets of a protein (column), using a hypergeometric test. The binding sites of most factors show enrichment in at least one GO term. For some of the regulators, the enriched GO terms match prior biological knowledge. For example, the putative targets of Glass (gl) were found to be enriched for terms related to photoreceptor cell development (red circle 1). Similarly, the putative targets of Buttonhead (btd) and Sp1 were enriched for developmental terms such as neurogenesis, development, and organogenesis (red circle 2). Closely related GO annotations are not shown; see Figure S4 for full results.
(B) Deducing the activity of the 29 transcription factors using gene expression patterns. Expression data from early (0–12 h) embryogenesis [20] and data from the entire Drosophila life cycle [21] are used to test whether the putative direct targets of a regulator are expressed differently from the rest of the genes in a given experiment. Red cells correspond to significant enrichment of overexpressed targets using a Kolmogorov-Smirnov test, while green cells correspond to enrichment of underexpressed targets. For most of the regulators, the analysis yielded at least one significant embryogenesis experiment, suggesting an active role in early developmental stages (above). Similar results were obtained using the full life cycle gene expression data (below).
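The two tests named in panels A and B can be sketched with the standard library alone. This is an illustration, not the analysis code behind the figure: the function names are hypothetical, the hypergeometric p-value is the upper tail (at least the observed overlap), and the Kolmogorov-Smirnov p-value uses the large-sample one-sided approximation, which is only rough for small gene sets.

```python
import math
from bisect import bisect_right

def hypergeom_enrichment_p(population, annotated, targets, overlap):
    """P(at least `overlap` annotated genes among `targets` drawn from `population`)."""
    total = math.comb(population, targets)
    upper = min(annotated, targets)
    tail = sum(math.comb(annotated, k) * math.comb(population - annotated, targets - k)
               for k in range(overlap, upper + 1))
    return tail / total

def one_sided_ks(target_values, other_values):
    """Return (d_over, d_under), the two one-sided KS statistics.

    d_over is large when the targets are shifted toward higher expression
    (their empirical CDF lies below that of the other genes); d_under is
    large when they are shifted toward lower expression.
    """
    t_sorted, o_sorted = sorted(target_values), sorted(other_values)
    d_over = d_under = 0.0
    for v in sorted(set(t_sorted) | set(o_sorted)):
        f_t = bisect_right(t_sorted, v) / len(t_sorted)
        f_o = bisect_right(o_sorted, v) / len(o_sorted)
        d_over = max(d_over, f_o - f_t)
        d_under = max(d_under, f_t - f_o)
    return d_over, d_under

def ks_approx_p(d, n, m):
    """Large-sample approximation of the one-sided KS p-value."""
    return math.exp(-2.0 * d * d * n * m / (n + m))
```

Under this sketch, a cell would be colored red when the p-value for `d_over` passes the chosen significance threshold and green when the p-value for `d_under` does.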
Table 1.
Analysis of Differential Expression in D. melanogaster Imaginal Wing Disc