Retrocopied Genes May Enhance Male Fitness

To determine the sequence specificity of dimeric Ss-LrpB, a high resolution contact map was constructed and a saturation mutagenesis conducted on one half of the palindromic consensus box. Premodification binding interference indicates that Ss-LrpB establishes most of its tightest contacts with a single strand of two major groove segments and interacts with the minor groove at the center of the box. The requirement for bending is reflected in the preference for an A+T rich center and confirmed with C·G and C·I substitutions. The saturation mutagenesis indicates that major groove contacts with C·G at position 5 and its symmetrical counterpart are most critical for the specificity and strength of the interaction. Conservation at the remaining positions improved the binding. Hydrogen bonding to the O and N acceptor atoms of the G5′ residue plays a major role in complex formation. Unlike many other DNA-binding proteins, Ss-LrpB does not establish hydrophobic interactions with the methyls of thymine residues. The binding energies determined from the saturation mutagenesis were used to construct a sequence logo, which pinpoints the overwhelming importance of C·G at position 5. Knowledge of the DNA-binding specificity will constitute a precious tool in the search for new physiologically relevant binding sites for Ss-LrpB.


INTRODUCTION
Understanding the sequence-specific binding of a transcription regulator is of central importance in order to unravel the functioning of this protein in the establishment of the regulatory process. In vivo any sequence-specific DNA-binding protein will also bind to pseudo-sites with a reduced affinity. These pseudo-sites nevertheless play an important role both thermodynamically, by setting the concentration of freely available regulatory protein, and kinetically, by regulating the rate of location of the specific targets. Furthermore, low affinity sites may have a considerable impact on the outcome of the regulatory response if they occur in proper juxtaposition with a high affinity site and are bound in a cooperative fashion. A detailed knowledge of the sequence specificity is therefore required to search for new binding sites and to distinguish between physiologically relevant regulatory sites and pseudo-sites.
At the present time, archaeal regulation is still poorly documented. The lack of efficient tools in genetics and molecular biology, especially for hyperthermophilic archaea, constitutes a severe limitation for the analysis of regulator function and the identification of regulons, modulons and stimulons. Nevertheless, the (potential) binding sites of a handful of archaeal regulators have been identified in upstream promoter/operator regions of their respective target genes through in silico or experimental approaches (in vitro or more rarely in vivo) [for a review on archaeal transcription regulation, see Ref. (1)]. Most of these sites are semi-palindromic.
Several characterized archaeal regulators belong to the archaeal/bacterial Lrp/AsnC family (2). Lrp-like regulators are widely distributed among archaea, but with the exception of LysM from Sulfolobus solfataricus, the physiological role of these potential regulators in archaea remains mainly elusive (3). Lrp-like proteins show a variable degree of amino acid sequence identity, but share the same fundamental architecture. Crystal structures have been determined for the archaeal members LrpA (4) and FL11 (5), both from Pyrococcus species, and for the bacterial members LrpC from Bacillus subtilis and AsnC from Escherichia coli (6). The N-terminal helix-turn-helix DNA-binding domain is connected by a flexible linker to the C-terminal domain, which shows a typical αβ sandwich fold, also called RAM domain (regulation of amino acid metabolism) (7). The latter is involved in effector binding and oligomerization. In solution, Lrp proteins exist as dimers or oligomers of dimers. They usually bind cooperatively to operators carrying an array of degenerate semi-palindromic targets (2).
Ss-LrpB is an Lrp-like protein from S.solfataricus. This regulator binds its own operator region at three similar, regularly spaced 15-bp binding sites (8). The deduced palindromic consensus sequence, 5′-TTGCAAAATTTGCAA-3′, has four highly conserved base pairs (in bold) in each half-site and a 5-bp long central region exclusively composed of weak base pairs. A recent AFM (atomic force microscopy) study of Ss-LrpB:operator complexes indicates that each binding site is contacted by an Ss-LrpB dimer (9). Furthermore, occupation of the three binding sites results in the formation of a globular complex in which ~100 bp of the operator DNA are wrapped around the interacting regulator molecules.
The DNA-binding sequence specificity of a regulator can be determined by using the SELEX strategy (systematic evolution of ligands by exponential enrichment) (10). This technique selects a set of high affinity DNA sites that define a consensus-binding sequence for the protein. It has been applied to the Lrp-like regulators E.coli Lrp and Ptr1 and Ptr2 from Methanocaldococcus jannaschii (11,12). In the present study we have followed another strategy. We determined the DNA-binding sequence specificity of Ss-LrpB by measuring the in vitro binding to a set of mutated variants of the idealized symmetrical consensus box. This was done systematically for all possible single base pair substitutions of one half of the binding site (saturation mutagenesis) and also for targets containing an abasic position or a non-canonical base: inosine, uracil, 5-methyl cytosine or 2-aminopurine. The importance of the spacing between the two consensus half-sites was assessed by including single and double base pair deletions and insertions. Binding profiles were constructed based on electrophoretic mobility shift assay (EMSA) experiments, allowing an estimation of the apparent binding equilibrium dissociation constants (KD). The quantitative binding data from the saturation mutants were represented in an energy normalized sequence logo (13,14). Combined with the results of a high resolution contact probing analysis of the Ss-LrpB:consensus box interaction, this provides a detailed view of how each base pair energetically contributes to the specific binding. Such an extended and detailed experimental analysis had not yet been performed for an archaeal regulator or a bacterial/archaeal Lrp-like regulator.

Protein purification
Recombinant Ss-LrpB protein was produced in E.coli and purified by a combination of heat treatment and ion exchange chromatography as described previously (9). The protein concentration was determined by a MicroBCA assay (Pierce). The purified protein was divided into small aliquots, which were frozen and thawed individually before each EMSA experiment.

Footprinting and binding interference analysis
Footprinting and binding interference analysis were performed with a 150-bp DNA fragment with the consensus box near the center. This fragment was generated by PCR using the plasmid pBendCon as template and the oligonucleotides EP9 and EP10 as primers (8). PCRs were performed by using ReadyMix Taq PCR Mix (Sigma-Aldrich). One of the oligonucleotides was 5′-end labeled with [γ-32P]ATP (GE Healthcare Bio-Sciences) and T4 polynucleotide kinase (Roche). The resulting DNA fragments (either top strand or bottom strand labeled) were purified by PAGE prior to analysis.

EMSA
EMSA experiments were performed either with the labeled 150-bp fragment, as described above, or with 45-bp oligonucleotide duplexes (annealing of complementary oligonucleotides) of which one oligonucleotide was 5′-end labeled with [γ-32P]ATP and T4 polynucleotide kinase (Roche). The latter applies to all mutated variants of the consensus box. These 45-bp oligonucleotides contain the wild-type (TAAAAAGGCATTATCTTGCAAAATTTGCAATAATCCTTTTATGTT) or mutated consensus sequence starting at position 16, preceded and followed respectively by 15 bp of the sequences upstream and downstream of the naturally occurring strong Box1 in the Ss-lrpB promoter/operator region. All oligonucleotides were purchased from Sigma-Aldrich, except the abasic variants (Eurogentec) and the 2-aminopurine variant (VBC Biotech Services GmbH).
EMSAs were performed as described previously (15). All binding reactions proceeded at 37°C in LrpB binding buffer (see above) and in the presence of a large excess of non-specific competitor DNA (25 mg/ml sonicated herring sperm DNA).
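The layout of the 45-bp duplex described above can be checked with a short sketch. Only the two sequences are taken from the text; the function and variable names are illustrative and are not part of the original work.

```python
# Sequences quoted in the text; every name below is illustrative.
CONSENSUS = "TTGCAAAATTTGCAA"
OLIGO_WT = "TAAAAAGGCATTATCTTGCAAAATTTGCAATAATCCTTTTATGTT"


def split_oligo(oligo, start=16, box_len=15):
    """Split the top strand into upstream flank, box and downstream flank.

    `start` is the 1-based position of the first nucleotide of the box.
    """
    i = start - 1
    return oligo[:i], oligo[i:i + box_len], oligo[i + box_len:]


upstream, box, downstream = split_oligo(OLIGO_WT)
print(len(OLIGO_WT), len(upstream), len(box), len(downstream))
print(box == CONSENSUS)
```

The same helper could be applied to any mutated variant to confirm that only the box, and not the flanks, differs from the wild type.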

Data analysis
All EMSA autoradiographs were scanned and the integrated band intensities were quantified using the Intelligent Quantifier software (Bio Image). For each lane, the background intensity was also measured and subtracted from the band intensities. Only the integrated intensities (I.I.) from the unbound DNA bands were considered for further analysis because of the 'smearing' effect. All I.I. values were divided by the average of two I.I. measurements of free DNA (without addition of Ss-LrpB). This corresponds to the fraction of unbound DNA. The fraction of bound DNA in each lane was then calculated to be: fraction bound DNA = 1 − fraction unbound DNA. These quantitated data were plotted versus Ss-LrpB concentration (binding profile) and fitted using the Hill equation (Origin, non-linear least squares method):
fraction bound DNA = vmax × [Ss-LrpB]^n / (k^n + [Ss-LrpB]^n)
In this equation, n corresponds to the Hill coefficient, which is a measure of binding cooperativity. In several cases, vmax was <1 and therefore k did not equal the apparent KD. The apparent KD was instead determined to be the protein concentration at which the fraction of bound DNA equals 0.5. EMSAs were repeated several times (at least twice, mostly four times) for each oligonucleotide duplex variant. Visible outliers, due to experimental errors such as pipetting errors, were omitted prior to plotting. Typically, KD values determined from EMSA experiments can vary by as much as 2-fold (18). The largest source of error in determining apparent KD values from EMSAs is the size of the increments between the protein concentrations used. We have been particularly careful to keep these increments low, generally 1.5-fold, with a maximum of 2-fold. This led to variations ranging between 20 and 50% for good binding sites and to higher values for very low affinity binding sites.
Sequence logos were constructed with the web-based tool enoLOGOS, using relative entropy [(14); available at http://biodev.hgen.pitt.edu/enologos]. This tool was used to create a logo from the aligned forward and reverse sequences from the naturally occurring Ss-LrpB binding sites. A logo was also created based on the binding affinity data. A position weight matrix was constructed with binding energy data from the saturation mutagenesis analysis. Binding energies (expressed in kT units) were calculated to be:
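The quantification steps above can be sketched as follows. This is a minimal illustration, assuming the standard Hill form v = vmax·x^n/(k^n + x^n); the function names and the example parameter values are hypothetical, and the actual fitting in the study was done in Origin, not with this code.

```python
def fraction_bound(free_intensity, free_reference_intensities):
    """1 - (free-DNA I.I. in the lane / mean free-DNA I.I. without protein).

    All intensities are assumed to be background-corrected already.
    """
    reference = sum(free_reference_intensities) / len(free_reference_intensities)
    return 1.0 - free_intensity / reference


def hill(conc, vmax, k, n):
    """Hill equation: fraction of bound DNA at protein concentration `conc`."""
    return vmax * conc ** n / (k ** n + conc ** n)


def apparent_kd(vmax, k, n):
    """Protein concentration at which the fitted curve reaches 0.5.

    Solving vmax * x**n / (k**n + x**n) = 0.5 for x gives
    x = k * (0.5 / (vmax - 0.5)) ** (1 / n); this equals k only when vmax = 1.
    """
    if vmax <= 0.5:
        raise ValueError("fitted curve never reaches a bound fraction of 0.5")
    return k * (0.5 / (vmax - 0.5)) ** (1.0 / n)


# Hypothetical fitted parameters, for illustration only.
kd = apparent_kd(vmax=0.8, k=100.0, n=2.4)
print(kd, hill(kd, 0.8, 100.0, 2.4))
```

The closed-form solution makes explicit why k underestimates the apparent KD whenever vmax is below 1.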
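The formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes the standard thermodynamic relation in which the binding energy of a mutant relative to the wild type, in kT units, is the natural logarithm of the ratio of the apparent KD values. The function name is hypothetical; the fold increases are those reported for position 5 in the Results.

```python
import math


def relative_binding_energy(kd_mutant, kd_wt):
    """Energy difference in kT units, ASSUMING E = ln(KD_mutant / KD_WT)."""
    return math.log(kd_mutant / kd_wt)


# Fold increases in KD at position 5 quoted in the Results
# (the A.T mutant is only known as a lower bound, >170-fold, and is omitted).
fold_increase_pos5 = {"T": 8.9, "G": 86.0}

for base, fold in fold_increase_pos5.items():
    print(base, round(relative_binding_energy(fold, 1.0), 2))
```

Under this assumption a 2-fold uncertainty in KD corresponds to about 0.7 kT, which gives a sense of the resolution of such an energy matrix.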

High resolution contact probing
Previously it has been shown that Ss-LrpB binds a 150-bp fragment bearing the consensus-binding site with an apparent KD of ~20 nM (8). Here, the study of this interaction is extended by enzymatic and chemical footprinting and by premodification binding interference assays (Figure 1). Ss-LrpB protected on both strands a 22 nt long stretch against DNase I cleavage. This stretch covers the 15-bp consensus box and extends 5 nt towards the 3′ end and 2 nt towards the 5′ end (Figure 1a and f). Protection on both strands is therefore offset by 3 nt towards the 3′ end; this reflects how DNase I is positioned in the minor groove and cleaves phosphodiester bonds on opposing strands across the minor groove. The presence of hyperreactive sites at the 3′-boundaries of the consensus box is indicative of Ss-LrpB-induced DNA deformations resulting in local minor groove widening. The limits of protection areas and the sites of reactivity and hyperreactivity are fully symmetrical (Figure 1f). This indicates that specific binding results in the alignment of the 2-fold symmetry of the Ss-LrpB dimer to the dyad axis of symmetry of the palindromic binding site.
In order to obtain a higher resolution footprint, the nuclease activity of the smaller 1,10-phenanthroline-copper ion [(OP)2-Cu+] (19) was used for in-gel footprinting of Ss-LrpB:consensus box DNA complexes separated from bare DNA by gel electrophoresis (Figure 1b). This resulted in a protection area that is on both strands limited to the 15-bp consensus sequence (Figure 1b   little cleavage activity in the consensus box whereas the flanking sequences were cleaved more efficiently. This heterogeneity in the sensitivity to both the enzymatic and the chemical nuclease reflects local conformational variations of the DNA and suggests a narrowing of the minor groove in the A+T rich (73.3%) consensus-binding site (20,21).
Critical base-specific contacts were identified by premodification binding interference experiments. These techniques are based on the creation of a pool of DNA molecules with on average one base-specific modification per molecule. Low and high affinity molecules are subsequently separated in an EMSA designed to result in ~50% binding. Free and bound DNA are recovered from the gel, cleaved at the site of modification and analyzed by gel electrophoresis in denaturing conditions to distinguish positions that are crucial for binding from sites that are irrelevant (see Materials and Methods). Experimental results are shown for the top strand only (Figure 1c-e); a summary of all the effects is given in Figure 1f and in a helical presentation in Figure 1g. Methylation of the N7 atom (major groove) of any of four guanine residues (at the adjacent positions 4, 5′ and the symmetrical counterparts −4′, −5) strongly inhibited complex formation (Figure 1c and f), indicating that Ss-LrpB makes strong major groove contacts with these parts of the binding site (Figure 1g). The methylation of the N3 atom (minor groove) of 12 adenine residues (positions 0, 1′, 2′, 3′, 6, 7 and the symmetrical counterparts) resulted in a weaker but significant interference (Figure 1c and f). A similar effect was observed at two symmetrically related adenine residues juxtaposing the 15-bp target. These results suggest local minor groove contacts, mainly in the central part of the binding site and near its extremities (Figure 1g), though it cannot be excluded that some effects might be indirect and result from subtle local alterations of the helix induced by the methyl group.
Base removal binding interference effects (missing contact) are usually considered to represent direct effects, which implies that they reflect interactions established between these bases and the interacting protein molecule (22). Alternatively, indirect effects might occur owing to structural alterations or changes in the DNA conformability generated by the absence of a base (8,23). Based on the observed effects (Figure 1d-g) it can be concluded that the vast majority of the purine and pyrimidine residues of the consensus box contribute to Ss-LrpB binding. On the top strand the strongest effects were observed upon removal of G−5 and any adenine of the stretch A−3 to A0 and of T−7, T−6, T1, T2 and T3. Removal of the adenine 5′ to the consensus box also interfered with Ss-LrpB binding. Weaker effects were observed upon removal of G4, C5 and A6. In contrast, the removal of C−4 and A7 hardly affected complex formation. Therefore the inhibitory effect of premethylation of residue A7 is likely indirect since it occurs in the minor groove, on the backside of the DNA molecule. The effects observed upon removal of residues of the lower strand provided essentially the symmetrical image of the results on the top strand (Figure 1f and g). It appears that Ss-LrpB establishes the vast majority of its tightest major groove contacts with one single strand segment in each half-site (Figure 1f and g). Generally, the results of the footprinting and missing contact probing assays performed with the consensus box are in good agreement with the results obtained previously with the array of three degenerate binding sites in the Ss-lrpB control region (8).
Finally, random deoxyuridine substitution binding interference experiments (24) were performed. Substitution of thymine by uracil, which corresponds to the removal of the C5-methyl group in the major groove, did not impair complex formation (data not shown). This indicates that the methyl groups of thymine residues potentially contacted through the major groove (positions 3, 6′, 7′ and the symmetrical counterparts at −3′, −6 and −7) do not significantly contribute to Ss-LrpB binding. This is unlike what has been observed in many other protein-DNA complexes and is even more surprising in view of the elevated A+T content of the consensus box.

Binding affinity of Ss-LrpB to the consensus box depends on the length of the flanking sequences
The sequence preference of Ss-LrpB at all positions of one half-site of the binding site was studied with derivatives of a 45-bp duplex DNA (see Materials and Methods). The annealing of complementary oligonucleotides made it possible to study all the substitution mutants, the abasic molecules and the molecules carrying a non-canonical base in identical conditions. The apparent equilibrium dissociation constant (KD) of the interaction was determined by applying the EMSA technique (Figure 2). This was done by quantitating the free DNA population by densitometry and plotting the fraction of bound DNA versus the concentration of Ss-LrpB. As demonstrated previously, the EMSA can provide estimates of the macroscopic binding constants of protein-DNA interactions (25). The construction of a binding profile by fitting the Hill equation allowed us to determine the apparent KD as the protein concentration required to shift a DNA fraction of 0.5. This also allowed us to determine the Hill coefficient, which is a measure of binding cooperativity. The affinity of binding to the 45-bp consensus duplex (KD of 91 nM) was ~3-fold lower than binding to a 150-bp DNA fragment bearing the same consensus-binding site (KD of 33 nM) (Figure 2 and Table 1). Similar differences in binding affinity depending on the length of the flanking sequence have been observed in other studies (26). Footprinting experiments (DNase I and in-gel Cu-OP) clearly indicate that the 150-bp fragment does not assemble more than one Ss-LrpB dimer. Therefore, the most plausible explanation of the size effect is that the flanking sequences provide 105 non-specific sites that contribute the equivalent of two specific sites.
Furthermore, accelerated targeting by 1D diffusion along the DNA molecule and the possibility of having different dissociation rates from internal sites and the ends of the fragments may add to the observed differences in binding affinity to the 150- and 45-bp fragments (27).
The Hill coefficient was similar for binding to the 150-bp fragment and the 45-bp duplex, on average 2.4. This indicates positive cooperativity. Since Ss-LrpB binds to a single palindromic site, this cooperativity might indicate the existence of an equilibrium between monomeric and dimeric forms of Ss-LrpB at low protein concentration. Indeed, although gel filtration and crosslinking experiments indicate that the predominant oligomeric form of Ss-LrpB in solution is a dimer (9), this does not exclude that dissociation might occur in the nM range.

Analysis of the binding specificity of Ss-LrpB
Complex formation was studied with a complete set of 24 single base pair substitution mutants of one half of the symmetrical consensus box (saturation mutagenesis of positions 0-7) as described above. This allowed a detailed analysis of the sequence-specific contribution of each base pair to the interaction. An example of such analysis is presented for position 5 (Figure 3a-e). Average apparent KD values for all mutants are summarized in Table 2. In the presence of excess non-specific competitor DNA, binding to these annealed oligonucleotides rarely resulted in a complete depletion of the free DNA population, even at the highest protein concentrations used, and showed a more intense smearing, indicative of the formation of less stable complexes. As a consequence, the maximal fraction of bound DNA (corresponding to vmax) in the Hill fittings is slightly <1 in most cases and significantly <1 in a few instances (very low affinity binders).
None of the 24 single base pair substitution mutations resulted in a better binding of Ss-LrpB. Therefore, the consensus box appears to be optimized. In contrast, all possible changes except the A·T→T·A substitution at position 0 resulted in a nearly 2- to >100-fold reduction of the binding affinity (Table 2). A survey of the relative binding affinities (Figure 4) indicates that all base pairs of the consensus contribute in a sequence-specific manner and to a variable degree to complex formation, but that position 5 is crucial.
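The composition of the saturation mutagenesis set can be enumerated with a short sketch: 8 positions (0 to 7) times 3 alternative base pairs gives the 24 mutants. The function name and the (position, top-strand base) keys are illustrative conventions, not taken from the original work; position 0 is taken as the central base pair of the 15-bp box.

```python
CONSENSUS = "TTGCAAAATTTGCAA"  # top strand, positions -7 to +7


def saturation_mutants(box, positions=range(0, 8)):
    """Return every single substitution of the top strand at the given positions.

    Keys are (position, new top-strand base); position 0 is the central base.
    """
    center = len(box) // 2
    mutants = {}
    for pos in positions:
        idx = center + pos
        for base in "ACGT":
            if base != box[idx]:
                mutants[(pos, base)] = box[:idx] + base + box[idx + 1:]
    return mutants


mutants = saturation_mutants(CONSENSUS)
print(len(mutants))
print(mutants[(5, "A")])  # the C.G to A.T substitution at position 5
```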

Major groove contacts: hydrogen bonding at C5·G5′ is crucial for Ss-LrpB binding
The saturation mutagenesis study (Table 2, Figure 4) clearly indicates that the C·G at position 5 is the most discriminating base pair of the Ss-LrpB binding site. Representative autoradiograms of binding to mutants of position 5 are shown in Figure 3a-d. Remarkably, substituting A·T for C·G almost completely destroyed binding. Even at the highest Ss-LrpB concentration used (16 220 nM), only some unstable binding was observed. The C·G→G·C mutant showed detectable specific binding, although with a strongly reduced affinity (86-fold increase in KD). The smallest effect was observed with the T·A substitution (8.9-fold increase in KD).
Hydrogen bonds are by far the most important sequence-specific contributors to complex formation. Given the more extensive reorganization of potential hydrogen bonding groups when substituting T·A as compared with A·T for C·G (Figure 3a-d) one might have expected the latter to be the less detrimental change. The opposite was observed: this appears to be due to a position-dependent inhibitory effect of the hydrophobic methyl group of thymine, as demonstrated with the use of uracil and 5-methyl cytosine substitutions (Figure 5a and b; Table 2). Uracil is equivalent to thymine but lacks the C5 methyl group on the major groove side of the base. Binding to the A·U mutant [7.1-fold increase in KD as compared with wild type (WT)] was much better than to the A·T mutant (>170-fold increase in KD) and comparable with the T·A mutant (8.9-fold increase). The substitution of C5 by 5-methyl cytosine (CMe), keeping the complementary G5′ residue intact, resulted only in a 2-fold increase in the KD. Combined, these results clearly indicate the strong negative position-dependent effect of a methyl group at position 5′ of the bottom strand.
The importance of the C·G base pair is also reflected in the results of the high resolution contact probing analysis. Both depyrimidination of C5 and depurination of G5′ strongly interfered with complex formation (see above). The respective contribution of each base to complex formation was further quantified with target molecules abasic for either one of these complementary bases or both (Figure 5e-g; Table 2). The removal of G5′ resulted in a 3- to 4-fold higher KD than the removal of its partner C5. Binding to the double abasic consensus site was hardly detectable. This result emphasizes the overwhelming importance of G5′ for the strength and specificity of complex formation.
The strong inhibitory effects of depurination and of premethylation of the N7 atom of G5′ suggest direct hydrogen bonding of Ss-LrpB to the N7 and/or O6 hydrogen bond acceptors on the major groove side of the guanine ring. To evaluate the importance of hydrogen bonding with the C6 carbonyl of G5′ we compared complex formation with targets carrying either 2-aminopurine (2-AP) or inosine instead of guanine (Figure 5c and d; Table 2). 2-AP lacks the carbonyl group on the major groove side of the base, whereas inosine lacks the exocyclic amino group on the minor groove side. 2-AP was highly detrimental for complex formation (>120-fold increase in KD as compared with the WT) whereas, as expected, inosine had nearly no effect (1.3-fold increase in KD). Combined, our results pinpoint the capital importance of hydrogen bonding with the C6 carbonyl of guanine at position 5 (and its symmetrical counterpart G−5′) in the site selectivity of Ss-LrpB.
Arginine:guanine is the most common specific contact found in crystal structures of protein-DNA complexes, followed by arginine:cytosine (28). Furthermore, in 87% of the arginine:guanine pairs, hydrogen bonds are formed with the amide group of arginine and the N7 and O6 acceptor atoms of the guanine. Therefore, the C·G base pair at position 5 and its symmetrical counterpart might very well be contacted by the side chain of one or more arginine residues of the recognition helix of Ss-LrpB. Ss-LrpB bears three arginine residues in its recognition helix, at positions 42, 44 and 47. Of these, R44 is highly conserved among Lrp-like regulators. Alanine substitution of these residues might provide further molecular details on the interaction of Ss-LrpB with position 5 of the binding site. It is worth noting that HTH motifs generally do not interact in a one-to-one mode with their DNA target but rather establish complex patterns of interactions in which one base is contacted by different amino acids and one amino acid contacts several bases.

Major groove contacts: positions 3, 4, 6 and 7
At position 3, contacted through the major groove, the T·A→G·C mutant showed the smallest reduction in binding affinity (5.4-fold increase in KD), followed by the C·G (7.3-fold) and A·T (8.8-fold increase) mutants. This particular order in the base pair preference might in part be explained by the fact that the hydrogen bond acceptor groups of T and G are only slightly shifted in a G·C base pair as compared with a T·A pair, whereas they are completely rearranged in an A·T pair (see Figure 3a-d).
Here, unlike what we observed at position 5, the hierarchy of relative binding affinities correlates with the degree of reorganization of major groove constituents in the different substitution mutants. Otherwise, steric hindrance on neighboring contacts and differences in the local groove geometry might also contribute to the observed differences in binding affinity.
Replacing the G·C base pair at position 4 of the consensus box by any of the three other possible combinations resulted in a similar modest effect (4- to 5-fold increases in KD; Table 2). This observation is compatible with the higher variability observed at position 4 in the three Ss-LrpB binding sites in the control region of its own gene.
Together with position 4, position 6 appears to contribute the least to the specific binding of all major groove-contacted positions. The A·T→T·A mutant showed the highest binding affinity (3.1-fold increase in KD). Both the G·C and C·G substitutions resulted in a similar, ~5-fold increase in KD. The A·T base pair at position 7 is also contacted through the major groove, as indicated by the similar relative binding affinities of G·C and I·C substitution mutants (KD of 172 and 155 nM, respectively; Table 2). Nevertheless, the nature and the specificity of these interactions appear to be completely different from position 6 as indicated by the hierarchy in the binding specificities. At position 7, the G·C transition mutant had the smallest effect (1.9-fold increase in KD), followed by the T·A mutant (~4-fold increase) and the C·G mutant (~7-fold increase). The surprisingly small effect of the G·C substitution suggests that alternative contacts may be established upon the profound reorganization of the bottom of the major groove accompanying this substitution.
The mild effect (2.4-fold) of a double substitution mutant with uracil instead of T6′ and T7′ indicates that the methyl group of these residues does not contribute much to the binding energy and sequence specificity of the interaction (Table 3). This result is in full agreement with the random deoxyuridine substitution binding interference experiments (see above). The comparable mild (2- to 3-fold) reduction in relative binding affinity to the single and double (4.6-fold) abasic mutants (Table 2) indicates that the A and T residues of position 7 contribute significantly less to the binding energy and specificity than the G residue at position 5′.

Minor groove contacts: positions 0, 1 and 2
The positions that are supposed to be contacted through the minor groove generally had a higher tolerance to base pair substitution than the half-site positions contacted through the major groove (Table 2, Figure 4). At these positions, the highest tolerance we observed is to mutation into the opposite weak base pair (T·A to A·T or the inverse). This reflects the fact that a protein can hardly distinguish a T·A base pair from an A·T base pair on the minor groove side since their hydrogen bond acceptor groups (C2 carbonyl of T and N3 of A) roughly switch positions. The effect of the A·T to T·A transversion was smallest at the central position 0 (1.2-fold increase in KD). In fact, this base pair substitution leaves the 15-bp consensus box unchanged; it simply corresponds to reading the opposite strand. Therefore, the observed 1.2-fold difference in binding affinities reflects the experimental error and is within the error range observed in the EMSAs (see Materials and Methods). Both bases of this complementary pair contribute significantly to complex formation as indicated by the missing contact probing assays (see above). This was confirmed and better quantified for A0; binding to the abasic 45-bp duplex occurred with an ~5.2-fold reduced affinity (Table 2).
None of the three Ss-LrpB binding sites in its own control region has a strong base pair in the central part of the box. The apparent prohibition of strong base pairs might be related to the exocyclic C2-amino group of guanine, which plays a dual role in DNA structure and recognition. The introduction of an NH2 group in the minor groove results in groove widening and consequently in a decrease of the electronegative potential as compared with weak base pairs. It also constitutes a steric hindrance for the compression of the minor groove upon bending of the operator DNA towards the interacting Ss-LrpB molecule (8,9). The need for bending of an operator site is generally reflected in the fact that there is a preference for A+T rich sequences at the midpoint of the target, where the minor groove has to be narrow. To evaluate the influence of the C2-amino group of guanine upon substitution of strong base pairs for A·T and T·A in the minor groove we measured complex formation with a pair of double base pair substitutions carrying G·C and C·G (positions −1 and 1, respectively) and the equivalent construct carrying inosine instead of guanine (Table 3). Inosine lacks the C2-amino group of guanine on the minor groove side of the base and is also equivalent to C6-deaminated adenine. Therefore an I·C pair resembles an A·T pair in the minor groove and a G·C pair in the major groove. The double guanine-bearing mutant exhibited an ~7-fold reduction in the relative binding affinity compared with the WT; a similar effect was observed in a double G·C for A·T substitution at positions 0 and 1 (Table 3). In contrast, the inosine-bearing double mutant showed only a 1.6-fold increase in KD. Therefore, inosine interferes significantly less with complex formation than guanine. This result emphasizes the importance of minor groove geometry in the central part of the Ss-LrpB binding site.
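The statement that the A·T to T·A substitution at position 0 merely corresponds to reading the opposite strand can be verified directly on the consensus sequence. The sketch below is illustrative; the helper names are not from the original work.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


CONSENSUS = "TTGCAAAATTTGCAA"
center = len(CONSENSUS) // 2  # index of position 0

# Top strand of the position-0 mutant: central A replaced by T.
center_mutant = CONSENSUS[:center] + "T" + CONSENSUS[center + 1:]

print(reverse_complement(CONSENSUS))
print(reverse_complement(CONSENSUS) == center_mutant)

# Positions at which the box differs from its own reverse complement.
mismatches = [i - center
              for i, (a, b) in enumerate(zip(CONSENSUS, reverse_complement(CONSENSUS)))
              if a != b]
print(mismatches)
```

Because the box has an odd length, the central base pair is the only position that cannot be self-complementary, which is why this single substitution yields the same duplex.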

Binding to insertion and deletion mutants
To examine the importance of the alignment of the two half-sites of the consensus box contacted through the major groove we studied the binding to insertion and deletion mutants (Table 4). These mutants had an insertion or deletion of either 1 or 2 bp in the A+T rich central segment contacted through the minor groove. The insertion of one extra T·A base pair (ins1: in the stretch T1 to T3) still allowed the formation of the specific complex, though with a 3.3-fold increase of the KD as compared with the WT. In contrast, deleting a single A·T base pair (del1: in the stretch A−3 to A0) completely abolished Ss-LrpB binding (KD > 11 000 nM; Table 4). A similar result was observed with a double base pair deletion mutant (A·T and T·A). A double base pair insertion mutant (consecutive A·T and T·A base pairs in the center of the box) resulted in a specific binding with an 11.8-fold increased KD. These results indicate a limited conformational flexibility of the Ss-LrpB dimer. Increasing the separation between the two half-sites contacted through the major groove, thereby disturbing the helical alignment, is tolerated to a certain extent. In contrast, reducing their separation (thereby inducing a similar rotation of ~34° per base pair but in the opposite direction) is highly detrimental. These results suggest that steric hindrance between the two subunits of the Ss-LrpB dimer might exclude the simultaneous binding of their HTH motifs to the improperly aligned major groove segments of the deletion mutants.
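The rotation of roughly 34° per base pair quoted above follows from the helical repeat of B-DNA. The sketch below assumes a repeat of 10.5 bp per turn, a textbook value that is not stated in the text, and uses a hypothetical function name.

```python
BP_PER_TURN = 10.5  # assumed helical repeat of B-DNA; not given in the text


def half_site_rotation(delta_bp, bp_per_turn=BP_PER_TURN):
    """Relative rotation (degrees) of one half-site with respect to the other.

    Positive `delta_bp` is an insertion, negative a deletion, in the central
    segment separating the two half-sites.
    """
    return delta_bp * 360.0 / bp_per_turn


for name, delta in [("ins1", 1), ("ins2", 2), ("del1", -1), ("del2", -2)]:
    print(name, round(half_site_rotation(delta), 1))
```

Insertions and deletions of the same size give rotations of equal magnitude and opposite sign, so the asymmetry in their effect on binding cannot be explained by the rotation angle alone, consistent with the steric argument in the text.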

Energy normalized sequence logo-modeling
DNA-binding sequence motifs can be graphically represented by sequence logos, based on information theory (13). The height of the stack of letters corresponds to the sequence conservation, expressed in bits of information. The relative heights of the bases correspond to their relative frequencies. Therefore, a sequence logo provides more information than a consensus sequence. Here, we generated two kinds of logos with a web interface called enoLOGOS: (i) based on sequence comparisons, (ii) based on binding energies (Figure 6a and b; 14). A sequence logo was constructed based on the three binding sites for Ss-LrpB in the control region of its own gene (Figure 6a; 8). An alignment of both forward and reverse sequences was used since Ss-LrpB binds as a dimer. Two corrections were applied in the creation of this logo: (i) a small-sample correction, since the information content tends to be overestimated in the case of a small dataset; (ii) a correction for background base frequencies in the case of genomes with a biased GC content such as S. solfataricus (GC content of 37%). The Ss-LrpB sequence logo confirms our previously deduced consensus sequence. Conservation is higher in the major groove contacted half-sites than in the A+T rich center. However, since only three binding sites have been used to create the logo, its significance is rather low. In contrast, the saturation mutagenesis dataset used to create an energy normalized sequence logo is much larger and therefore more significant (Figure 6b). Full symmetry was assumed to construct the logo for the complete 15-bp binding site. The correction for the biased GC content was not applied since it is irrelevant in this case. Compared with the logo based on the natural targets, this 'enologo' has a higher resolution. The relative importance of a C at position 5 (and G at position −5) is highly emphasized. A cosine wave represents the twist of the DNA helix.
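As an illustration of how the height of a logo stack is obtained from a small alignment, the sketch below computes the per-position information content with the commonly used approximate small-sample correction, e(n) = 3 / (2 ln2 · n) for four bases. It is not the enoLOGOS implementation: the background-frequency correction is omitted, negative values are clipped to zero as a display convention, counting both orientations as independent sequences is a simplifying choice, and the three 15-mers are hypothetical placeholders rather than the natural Ss-LrpB boxes.

```python
import math
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def column_information(column, small_sample=True):
    """Information content (bits) of one alignment column.

    Uses 2 - H - e(n) with the approximate correction e(n) = 3 / (2 ln2 n);
    the result is clipped at zero.
    """
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    correction = 3 / (2 * math.log(2) * n) if small_sample else 0.0
    return max(0.0, 2.0 - entropy - correction)


def logo_information(sites):
    """Per-position information for sites aligned in both orientations."""
    aligned = list(sites) + [reverse_complement(s) for s in sites]
    return [column_information(col) for col in zip(*aligned)]


# Hypothetical 15-bp sites used only to exercise the functions.
toy_sites = ["TTGCAAAATTTGCAA", "TTGTATAATTTACAA", "ATGCAATTTATGCAT"]
for position, bits in zip(range(-7, 8), logo_information(toy_sites)):
    print(f"{position:+d}: {bits:.2f} bits")
```

With only three sites (six aligned sequences), the correction term alone removes about 0.36 bits per position, which illustrates why a logo built from so few targets carries limited weight.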
Positions recognized in the minor groove have a maximal information content of 1 bit as opposed to 2 bits in the major groove (29). This is explained by the fact that in the minor groove a protein cannot discriminate weak base pairs from each other or strong base pairs from each other. Binding of the C5·G5′ base pair in the major groove is confirmed since the information content exceeds 1 bit. In contrast, based on our binding affinity data, the sequence in the minor groove is relatively less important. This is certainly the case for positions −1, 0 and 1, where little sequence preference is indicated in the logo. The total information content of this enologo is 7.2 bits as compared with 10.4 bits for the sequence logo based on the natural binding sites.
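The 1-bit ceiling for minor groove positions can be illustrated by converting binding energies into base probabilities and then into information. The sketch below assumes a simple Boltzmann weighting, p(b) proportional to exp(−ΔΔG(b)/RT), which is one plausible way to normalize energies and is not claimed to reproduce the enoLOGOS procedure; RT corresponds to an assumed 25 °C, and the energy values are hypothetical.

```python
import math

RT_KCAL = 0.593  # kcal/mol at about 25 degrees C (assumed temperature)


def boltzmann_probs(ddg_by_base, rt=RT_KCAL):
    """Convert per-base energy penalties (kcal/mol) into probabilities."""
    weights = {b: math.exp(-e / rt) for b, e in ddg_by_base.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def information_bits(probs):
    """Information content relative to the 2-bit maximum for four bases."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy


# Hypothetical penalties: a major groove position that strongly prefers C,
# and a minor groove position where A/T (and likewise C/G) are equivalent.
major_groove = {"A": 3.0, "C": 0.0, "G": 3.0, "T": 3.0}
minor_groove = {"A": 0.0, "C": 3.0, "G": 3.0, "T": 0.0}

print(f"major: {information_bits(boltzmann_probs(major_groove)):.2f} bits")
print(f"minor: {information_bits(boltzmann_probs(minor_groove)):.2f} bits")
```

Whenever the two weak base pairs are indistinguishable, and the two strong ones likewise, the entropy cannot fall below 1 bit, so the information content cannot exceed 1 bit however large the penalty for the disfavored class. A value above 1 bit therefore implies discrimination within a class, consistent with major groove recognition.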
The logo constructed with the binding energy data from the saturation mutagenesis analysis can be used to search for new high affinity binding sites in the S. solfataricus P2 genome. Assuming additivity, which is usually a good approximation for finding new sites (30), the binding affinity of each sequence can be predicted. Nevertheless, setting the threshold such that relevant high affinity sites are retrieved and the number of false positives is minimized is a delicate process. The algorithm will be designed to allow an extra 1 or 2 bp in the center of the box. The joint occurrence of high and low affinity sites should also be considered. When correctly aligned, the latter will be bound in a cooperative manner and consequently acquire a physiological role, as already demonstrated by the concentration-dependent formation of structurally very different complexes of Ss-LrpB with the control region of its own gene (9).
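A search of this kind can be sketched as a sliding-window scan in which the predicted energy of a window is the sum of per-position penalties, with windows that carry 1 or 2 extra central base pairs scored after skipping those bases. Everything specific in this sketch is hypothetical: the toy energy matrix (a flat mismatch penalty around a placeholder 15-mer), the per-base insertion penalty, the threshold, and the choice to skip the extra bases immediately after the central position. Only one strand is scanned, and the input is assumed to contain only A, C, G and T.

```python
def make_toy_matrix(consensus, mismatch_penalty=1.0):
    """Hypothetical energy matrix: 0 for the reference base, a flat penalty otherwise."""
    return [{b: (0.0 if b == ref else mismatch_penalty) for b in "ACGT"}
            for ref in consensus]


def site_energy(site, matrix):
    """Additive energy penalty of a site with the same length as the matrix."""
    return sum(column[base] for column, base in zip(matrix, site))


def scan(sequence, matrix, threshold, max_insert=2, insert_penalty=0.5):
    """Return (start, n_inserted, energy) for windows scoring <= threshold.

    For n_inserted = k, the k bases following the central position are
    skipped and k * insert_penalty is added (both choices are assumptions).
    """
    width = len(matrix)
    half = width // 2 + 1  # left half-site plus the central position
    hits = []
    for k in range(max_insert + 1):
        for start in range(len(sequence) - width - k + 1):
            window = sequence[start:start + width + k]
            core = window[:half] + window[half + k:]
            energy = site_energy(core, matrix) + k * insert_penalty
            if energy <= threshold:
                hits.append((start, k, energy))
    return sorted(hits, key=lambda hit: hit[2])


# Placeholder 15-mer and toy sequence, for illustration only.
reference = "TTGCAAAATTTGCAA"
matrix = make_toy_matrix(reference)
toy_genome = "GGG" + reference + "CCC" + "TTGCAAAA" + "C" + "TTTGCAA" + "GGG"
for start, k, energy in scan(toy_genome, matrix, threshold=0.5):
    print(f"start={start} inserted={k} energy={energy:.1f}")
```

In practice the matrix would be filled with the measured binding energies, and the threshold would have to be tuned against known sites, which is the delicate step noted above.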
Although our binding assays indicate that the consensus box bears the optimal base at each position, this does not necessarily imply that the consensus box is the best possible binding site. Some positions might be functionally interdependent and therefore give rise to context-dependent effects. The tightest binding site could have been found by SELEX, but this technique does not provide the energy landscape that was determined here. It is also worth noting that the consensus box does not occur in the S. solfataricus genome. This is frequently observed with regulatory sites. Tight binding and long half-lives of such complexes might be incompatible with the flexibility required to adapt to rapidly changing environmental conditions and with microbial generation times.