Comparative Genomics of CytR, an Unusual Member of the LacI Family of Transcription Factors

CytR is a transcription regulator from the LacI family, present in some gamma-proteobacteria including Escherichia coli and known not only for its cellular role, the control of transport and utilization of nucleosides, but also for a number of unusual structural properties. The present study addressed three related problems: the structure of CytR-binding sites and motifs, their evolutionary conservation, and the identification of new members of the CytR regulon. While the majority of CytR-binding sites are imperfect inverted repeats situated between binding sites for another transcription factor, CRP, other architectures were observed, in particular, direct repeats. While the similarity between sites for different genes in one genome is rather low, and hence the consensus motif is weak, there is high conservation of orthologous sites in different genomes (mainly in the Enterobacteriales), arguing for the presence of specific CytR-DNA contacts. At larger evolutionary distances candidate CytR sites may migrate, but the approximate distance between flanking CRP sites tends to be conserved, which demonstrates that the overall structure of the CRP-CytR-DNA complex is gene-specific. The analysis yielded candidate CytR-binding sites for orthologs of known regulon members in less studied genomes of the Enterobacteriales and Vibrionales and identified a new candidate member of the CytR regulon, encoding a transporter named NupT (YcdZ).


Introduction
CytR, the regulator of transport and utilization of nucleosides, was first mentioned in 1975 [1] and identified in 1985 [2]. The corresponding gene, cytR, was sequenced in 1986 [3].
The structural and functional features of CytR were reviewed in [17-20]. CytR is an atypical representative of the LacI family [21]. Its affinity for its operators is rather weak [22], and because of that, in contrast to most prokaryotic repressors, CytR alone is not capable of repressing transcription. CytR functions in a complex with a multifunctional transcription factor (TF), CRP [23,24].
The CRP protein is a dimer [25]. The subunit dimerization depends on the N-terminal domain, while the DNA recognition is performed by the C-terminal domain [26]. A possible regulatory mechanism was suggested, based on the crystal structure of CRP in complex with a DNA-fragment [27].
CytR protein is also dimeric [28]. The number of CRP-binding sites (O CRP) per CytR-binding site (O CYTR) varies from one to three [17]: one, as in the cytR promoter [6]; two, as in the majority of cases; or three, as in the cdd promoter [8,17,29]. This might indicate different structures of the CRP-CytR complex or repositioning of the CRP dimers upon interaction. In most promoters, CRP has a stronger affinity for the distal operator O CRP D [30], an exception being cddP, where CRP binds more strongly to its proximal operator O CRP P [29]. An important requirement is that at least one CRP operator has to be situated at a distance not exceeding 5 nucleotides from the corresponding CytR operator [20], with the position of the O CYTR operator being asymmetric relative to the flanking O CRP operators [17]. Fig. 1 shows a typical organization of the O CRP D-O CYTR-O CRP P complex for five experimentally studied E. coli genes.
The mechanism of CytR action is anti-activation rather than direct repression [17,31,32]. In particular, at the deoP2 promoter, RNA polymerase and CytR compete for CRP, which in this case acts as an activator [31]. CRP alone activates transcription, whereas CRP and CytR, cooperatively bound to O CRP and O CYTR, respectively, repress transcription. Cytidine binding to CytR releases the latter from the complex, hence the activation by CRP resumes and the gene is derepressed [33,34]. Notably, the intrinsic CytR binding to DNA is not affected by cytidine binding [35]. The repression and activation of some other CytR-regulon genes were considered, in particular, in [17]. In the cytR, deoP and udp regulatory regions only one CRP-binding site participates in the activation; in the nupG, tsx and cdd promoters two CRP-binding sites are involved in the activation; and in all regulated genes except cytR, two CRP-binding sites participate in the repression. Hence upstream regions of all CytR-regulated genes contain at least one CRP-binding site, either distal or proximal, that participates in both activation and repression (see Fig. 3, p. 463 in [17]).
The CytR-binding motif consists of two half-sites, denoted here as O CYTR D (distal) and O CYTR P (proximal). Unlike the situation with many TFs, including repressors from the LacI-family, the length of spacers between parts of the O CYTR motif may vary in a wide interval from about zero to three DNA helical turns, with large spacers tending to comprise an integer number of turns, at most three [18].
Figure 1. Organization of upstream regions of five experimentally proven E. coli members of the CytR regulon [13,19,37]. CytR-binding sites (O CYTR) are highlighted in magenta, cores of CRP-binding sites (O CRP) in green. Numbers denote spacer lengths. Dots denote gaps in the alignment. doi:10.1371/journal.pone.0044194.g001
Figure 4. The consensus CytR and CRP motifs are shown at the bottom in magenta and green, respectively. Blue in the left column marks the genomes whose CytR-binding sites were used to construct the PWM. Shades of grey denote the level of conservation, as set by GeneDoc: black, 100% conservation; dark grey, consensus nucleotide frequency between 75% and 100%; light grey, consensus nucleotide frequency between 50% and 75%; white, no conservation. doi:10.1371/journal.pone.0044194.g004
In most studies, O CYTR D and O CYTR P were assumed to form degenerate inverted repeats [11,13,36], and the major role in specific binding was assigned to protein-protein (CytR-CRP) rather than protein-DNA (CytR-O CYTR) interactions. Still, at the physiological concentration of CytR, the CytR-DNA interactions are absolutely necessary for the repressor complex to be formed [37]. The O CYTR position was mapped precisely in only a few cases, e.g. in the deoP2 promoter by point mutagenesis [36] or by exchange of the udpP and deoP2 O CYTR operators [19]. In the majority of other cases, binding sites were located approximately by using the protein shift assay, footprinting with DNase I or hydroxyl radicals, DMS treatment, or cloning into a plasmid and measuring the level of CytR repression [11,22]. The exact position of O CYTR was then predicted by comparison with the consensus [11]. The latter was described as an inverted pentameric repeat TGCAA-N2-3-TTGCA [36] (where the subscript of N denotes the number of nucleotides), a palindrome TTGCAA [38], or a pair of inverted octameric repeats (either 5'-AATGYCAAC-GC-GTTGCATT-3' or 5'-AYGTGCAAC-Nx-GTTRCATT-3', where Y = T or C, R = A or G, and x = 10, 11, 12 or 13), which are the optimal CytR-binding sites in the absence of CRP, or direct repeats of octamers in either orientation with a 1 bp spacer [37]. The most recent description implies only octameric repeats with a spacer allowing both of them to be situated on the same side of the DNA helix, with the spacer being less than 4-5 nucleotides or roughly a helical turn, that is 10-11 nucleotides [18,20]. The current experimental data agree with this description [13,29,39]. Distances of about two or three helical turns were experimentally proven to be possible [18] but so far have not been observed in nature.
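The consensus descriptions above use IUPAC codes and a variable spacer, which can be made concrete as a pattern search. The following is a minimal sketch, assuming exact matching to the octameric-repeat consensus with a 10-13 nt spacer; the helper names and the test sequence are hypothetical, and real sites are degenerate, which is why PWM scoring rather than exact matching is used in this study.

```python
import re

# IUPAC nucleotide codes used in the consensus descriptions (Y = T or C, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}

def half_site_regex(consensus):
    """Translate an IUPAC consensus string into a regular-expression fragment."""
    return "".join(IUPAC[ch] for ch in consensus)

def repeat_pattern(left, right, min_spacer, max_spacer):
    """Compile 'left, spacer of min..max arbitrary nucleotides, right'."""
    spacer = "[ACGT]{%d,%d}" % (min_spacer, max_spacer)
    return re.compile(half_site_regex(left) + spacer + half_site_regex(right))

# Consensus 5'-AYGTGCAAC-Nx-GTTRCATT-3' with x = 10..13, as quoted in the text.
pattern = repeat_pattern("AYGTGCAAC", "GTTRCATT", 10, 13)

# Hypothetical sequence (not from any genome): left half-site, 11 nt spacer, right half-site.
toy = "GG" + "ATGTGCAAC" + "ACGTACGTACG" + "GTTGCATT" + "CC"
match = pattern.search(toy)
```

A spacer outside the stated range, or a single mismatch in either half-site, makes the search return no match, which illustrates why a weak motif cannot be recovered by consensus matching alone.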
Here, we study the evolution of CytR-binding sites, characterize their common features, and identify new candidate members of the CytR regulon in the Enterobacteriales and Vibrionales.

Recognition rules
We compiled a list of gamma-proteobacterial genomes encoding orthologs of CytR. Orthologs were initially defined by the bidirectional best hit criterion and confirmed by construction of phylogenetic trees (Fig. 2). All these genomes belong to the Enterobacteriales and Vibrionales; the list is given in Table 1. We also identified orthologs of genes known to be regulated by CytR in E. coli (Table S1) (see Data and Methods).
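The bidirectional best hit criterion mentioned above can be sketched as follows. This is an illustration only, assuming that pairwise similarity scores between the proteins of two genomes are already available as dictionaries; the gene names and scores are hypothetical, and the sketch does not reproduce the tree-based confirmation step.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Hypothetical scores between genome A and genome B proteins.
a_vs_b = {("cytR_A", "b1"): 410.0, ("cytR_A", "b2"): 120.0,
          ("purR_A", "b2"): 380.0, ("galR_A", "b2"): 200.0}
b_vs_a = {("b1", "cytR_A"): 405.0, ("b1", "purR_A"): 130.0,
          ("b2", "purR_A"): 375.0, ("b2", "galR_A"): 190.0}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

In this toy input, galR_A has b2 as its best hit, but b2 prefers purR_A, so galR_A is not reported as an ortholog. This is the situation in which paralogs from the same family would otherwise be confused.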
We used 69 published CRP-binding sites [40] to construct the O CRP positional weight matrix (PWM) using SignalX (see Data and Methods). Sequence LOGOs of the constructed motifs are shown in Fig. 3A. To construct the O CYTR PWMs, we considered five E. coli genes with clearly distinguishable O CYTR (Fig. 1). We performed multiple alignment of the upstream regions of these genes and their orthologs. In doing so, we gradually increased the number of aligned sequences, starting with the closest E. coli relatives and then including more distant ones, in the order given by the phylogenetic tree of the CytR proteins (Fig. 2), for as long as the O CRP sites could be reliably aligned and the distance between them remained approximately constant. Then we selected only the sequences that were conserved in the regions corresponding to the E. coli sites: both O CYTR D and O CYTR P were taken for deoC from ECO, SFL, ENT, CKO, SEH, STY, KPE; for udp from SEH, STY, ECO, SFL, CKO; for ppiA from SFL, ECO, CKO; for rpoH from STY, KPE, CKO, ECO, SFL; and for nupG from ECO, SFL, CKO, STY (see Table 1 for genome abbreviations; the selected genes are highlighted in blue in Fig. 4). Sites in other species were accepted for the matrix construction if they satisfied the following conservation conditions: (1) the same distance between O CRP sites for each E. coli gene listed above and its orthologs; (2) the same distance between O CYTR half-operators; (3) at most two mismatches in each O CRP site, and at most three total mismatches in the O CRP sites, compared to the E. coli O CRP sites; and (4) at most four mismatches in the O CYTR operator, and at most three mismatches in each O CYTR half-operator, compared to the E. coli O CYTR boxes. The selected boxes were used to construct PWMs for the upstream (distal) and downstream (proximal) half-operators (O CYTR D, Fig. 3B1, and O CYTR P, Fig. 3B2, respectively).
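The four conservation conditions listed above amount to a simple filter on each candidate cassette relative to the E. coli reference. The sketch below is one possible encoding under stated assumptions: the `Cassette` record, its field names and all sequences are hypothetical, and sites are assumed to be already aligned without gaps so that mismatches can be counted position by position.

```python
from dataclasses import dataclass, replace

@dataclass
class Cassette:
    crp_distal: str      # O CRP D site
    cytr_distal: str     # O CYTR D half-operator
    cytr_proximal: str   # O CYTR P half-operator
    crp_proximal: str    # O CRP P site
    crp_distance: int    # distance between the two O CRP sites
    cytr_distance: int   # distance between the O CYTR half-operators

def mismatches(a, b):
    """Count positions at which two equal-length sites differ."""
    if len(a) != len(b):
        raise ValueError("sites must have equal length")
    return sum(x != y for x, y in zip(a, b))

def passes_conservation(ref, cand):
    """Apply conditions (1)-(4) to a candidate relative to the reference cassette."""
    if cand.crp_distance != ref.crp_distance:      # condition (1)
        return False
    if cand.cytr_distance != ref.cytr_distance:    # condition (2)
        return False
    crp = [mismatches(ref.crp_distal, cand.crp_distal),
           mismatches(ref.crp_proximal, cand.crp_proximal)]
    if max(crp) > 2 or sum(crp) > 3:               # condition (3)
        return False
    cytr = [mismatches(ref.cytr_distal, cand.cytr_distal),
            mismatches(ref.cytr_proximal, cand.cytr_proximal)]
    return max(cytr) <= 3 and sum(cytr) <= 4       # condition (4)

# Hypothetical reference and candidates (sequences are illustrative only).
ref = Cassette("AAATGTGATCTAGATCACATTT", "AATGCAAC", "GTTGCATT",
               "AAATGTGATCTAGATCACATTT", 30, 2)
close = replace(ref, crp_distal="AAATGTGATCTAGATCACATTA", cytr_distal="AATGCAAT")
shifted = replace(close, crp_distance=31)
diverged = replace(ref, crp_distal="TAATGTGATCTAGATCACATTA",
                   crp_proximal="TAATGTGATCTAGATCACATTA")
```

Here `close` passes all four conditions, `shifted` fails condition (1), and `diverged` fails condition (3) because the two O CRP sites together carry four mismatches.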
To identify new candidate CytR regulon members, we used three recognition rules to select regions for construction and manual analysis of alignments, requiring either (1) two candidate O CRP sites at a distance of 10-40 bp, or (2) [...] hand, if O CYTR D and/or O CYTR P shifted in a fraction of genomes, each new position would be represented by a new smaller peak. We accepted a peak if the average score within a window exceeded 3, or a single prominent peak with the score slightly below the average (e.g. about 2.7 for O CYTR of cdd or cytR, see below). The positional conservation was also assessed using the plots of the information content (see Data and Methods). A SWAS peak was assumed to be more reliable if it was observed in a region of more or less constant positional conservation.
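The peak acceptance rule described above can be illustrated on a precomputed SWAS profile. This is a minimal sketch, assuming that a peak is a local maximum of the profile and that the relaxed case (a single prominent peak slightly below 3) is handled by passing a lower threshold by hand; the function name and the profile values are hypothetical.

```python
def accepted_peaks(swas, threshold=3.0):
    """Return (position, score) for local maxima of a SWAS profile above a threshold."""
    peaks = []
    for i, value in enumerate(swas):
        left = swas[i - 1] if i > 0 else float("-inf")
        right = swas[i + 1] if i + 1 < len(swas) else float("-inf")
        # A local maximum: not lower than the left neighbour, higher than the right one.
        if value > threshold and value >= left and value > right:
            peaks.append((i, value))
    return peaks

# Hypothetical profile with two clear peaks and one weaker peak of about 2.7.
profile = [0.5, 1.2, 3.6, 2.0, 0.8, 2.7, 1.0, 3.2, 1.1]

strict = accepted_peaks(profile)          # default threshold of 3
relaxed = accepted_peaks(profile, 2.5)    # manually relaxed threshold
```

With the default threshold only the two strong peaks are kept; the weaker one appears only when the threshold is lowered, which mirrors the manual judgement applied to the cdd and cytR cassettes.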

Evolution of CytR-binding sites
To characterize the conservation of CytR-binding sites, we constructed three groups of alignments of gene upstream regions: for the closest relatives of E. coli, for other Enterobacteriales (in some cases, for all available Enterobacteriales including E. coli), and for the Vibrionales, and analyzed these alignments using the SWAS plots. It should be noted that the representation of gene orthologs in genomes varied and, further, in some genomes the intergenic regions diverged beyond recognition. The criteria for the inclusion of upstream regions in alignments were based on the scores of the O CRP sites and the conservation of the distance between them.
The operator cassettes may be classified into four main types by the pattern of conservation observed in the SWAS plots. The first type has two clear peaks that correspond to O CYTR D and O CYTR P, reflecting conservation of both sites and the distance between them. The second, rare type has one clear peak and a diffuse group of scattered minor peaks, reflecting conservation of one O CYTR site and absence or shift of the other one. The third type is characterized by the absence of clear peaks. Finally, the fourth type has two peaks of the same kind, reflecting direct rather than inverted repeats. There were few such cassettes, but they also could be conserved to some extent. Note that the above definitions may depend on the number and similarity of sequences in an alignment: the closer they are, the more likely the respective gene would belong to type 1 rather than to type 3.
The udp gene encodes uridine phosphorylase in many Enterobacteriales and Vibrionales. The detailed structure of the udp cassette in E. coli was studied in [29]. The distance between the O CRP sites in the udp promoter is conserved (30 or 31 bp) in almost all Enterobacteriales and Vibrionales; the only exceptions with non-conserved intersite distances and an overall low score of the cassette are Photobacterium profundum, Photorhabdus luminescens and Vibrio fischeri. The distances between the candidate O CYTR sites are not constant, and the alignment may be divided into three subalignments. In the SWAS plot of close relatives of E. coli, two pronounced peaks corresponding to O CYTR D and O CYTR P are visible (Fig. 5). In more distant Enterobacteriales and in the Vibrionales, no clear peaks are seen, and there are many genome-specific non-conserved candidate O CYTR sites, some overlapping with O CRP, that cannot be confidently predicted based on the sequence analysis alone (Fig. 6 and Fig. 7, respectively). Hence, the udp cassette is of type 1 at close distances and of type 3 at more distant ones.
The deoC gene encodes NAD(P)-linked 2-deoxyribose-5-phosphate aldolase. Prominent SWAS-plot peaks are observed in close relatives of E. coli where, unusually, there is no spacer between the O CYTR sites: O CYTR D is immediately adjacent to O CYTR P (Fig. S1 and [29]). In more distant Enterobacteriales (Edwardsiella ictaluri, Dickeya dadantii, Erwinia tasmaniensis and Klebsiella pneumoniae) no clear peaks are seen (Fig. S2). In all Vibrionales including P. profundum, two Yersinia species (Y. pestis and Y. pseudotuberculosis) and Pectobacterium atrosepticum, the 4-box cassettes had very low total weights (about 10 or even less) and variable distances between the O CRP sites, and hence the respective regions were not included in the alignments. Thus, again, we observe type 1 behavior in close relatives and type 3 in distant Enterobacteriales.
Very low scores of 4-site cassettes were observed for most ppiA (peptidyl-prolyl cis-trans isomerase A) gene orthologs. Nevertheless, for the six closest E. coli relatives, two peaks in the SWAS plot are clearly seen (Fig. S3); therefore the ppiA cassette is of type 1 in close Enterobacteriales, similar to the previously characterized ppiA cassette in E. coli (see Fig. 1B, p. 990 in [9]).
The rpoH gene encodes the heat-shock sigma factor (σ32 or σH). The E. coli cassette was described in [13]. The 4-site cassette scores for this gene are rather low, mainly because of the low scores of the O CRP sites. Moreover, the scores in most Enterobacteriales are lower than those in E. coli. However, the SWAS plot features two clear peaks (Fig. S4), thus the rpoH cassette belongs to type 1 in close E. coli relatives.
The nupG gene encodes one of two high-affinity nucleoside transporters in E. coli. It is present in seven genomes, the fewest among all considered genes. Further, the gene annotated as nupG in Salmonella enterica Heidelberg is in fact xapB, encoding a xanthosine MFS transporter [41], as demonstrated by the analysis of phylogenetic trees (not shown) and co-localisation with xapA, the latter encoding a subunit of xanthosine phosphorylase. In K. pneumoniae, the total score of the best O CRP pair is too low (about 7.1), and although the distance between the sites (30-31 nucleotides) is not sharply different from that in other genomes (27-28 bp), it is likely that the regulation of nupG in K. pneumoniae has been lost. The SWAS plot has two pronounced peaks corresponding to the O CYTR D and O CYTR P sites with the conserved distance of 9 bp between them and an overlap between the latter and the proximal O CRP site (Fig. 8 and [39]). Hence the nupG cassette belongs to type 1.
The tsx gene, encoding the nucleoside channel, is present in many bacteria up to the Vibrionales. While the distance between the O CRP sites flanking O CYTR is mostly conserved for the tsx orthologs in the Enterobacteriales (about 14 bp), the score of the 4-box cassette even in close relatives of E. coli is rather low, due to low scores of the O CRP sites (about 3). However, the SWAS plot features two pronounced peaks, and hence the cassette is of type 1, although O CYTR P overlaps O CRP P (Fig. 9). The predicted sites in E. coli differ slightly from those suggested earlier (compare Fig. 9 here with Fig. 10, p. 33253 in [38]). Out of four other Enterobacteriales with tsx orthologs (E. ictaluri, Enterobacter 638, K. pneumoniae, Serratia proteamaculans), only three yield a relatively satisfactory alignment, with the corresponding SWAS plot of type 2 (Fig. S5). Only four of the Vibrionales have tsx orthologs (Vibrio harveyi, Vibrio parahaemolyticus, Vibrio splendidus, Vibrio vulnificus) (Fig. S6). Since a high-scoring O CYTR P peak is situated relatively close to O CRP D and all other peaks have very low scores (about 2), this case is assigned to type 3, as neither O CYTR D nor O CRP P can be reliably identified.
The cytR gene itself has only the distal O CRP site. On the other hand, the bound complex has been observed in E. coli K-12 [6] and S. typhimurium [14]. The O CYTR site is often assumed to be an imperfect inverted repeat [36], but the alignment of operator cassettes from sixteen Enterobacteriales and, separately, six Vibrionales shows that the O CYTR site is an imperfect direct repeat (Fig. 10). Moreover, the Enterobacteriales and Vibrionales seem to have a conserved organization of the O CRP D-O CYTR D-O CYTR P recognition site, but slightly different sequences of O CYTR D and O CYTR P. The unusual properties of this cassette may explain the fact that the scores of the O CYTR sites are low, less than 3. However, the conservation of these sites in the alignment provides evidence for their functional relevance (Fig. 11 and Fig. 12).
Another atypical cassette is that of cdd, encoding cytidine/deoxycytidine deaminase. It contains O CRP sites flanking direct repeats of O CYTR [8,29]. Two O CRP D sites, denoted in the literature and here, in this particular case, O CRP 2 and O CRP 3, overlap by 20 bp, that is, they are shifted relative to each other by 2 bp. The arrangement of the sites is conserved in 14 genomes (Fig. S7 and Fig. S8).
To analyze direct repeats in the two latter cases, belonging to type 4, we applied the standard matrices for O CYTR D and O CYTR P (Fig. 3B1 and Fig. 3B2, respectively) and selected the matrix providing the two highest SWAS-plot peaks. Both site cassettes have 1 bp spacers.
In the two cases of direct repeats, for the cytR cassette, pronounced SWAS-plot peaks were observed for the O CYTR D PWM both for the Enterobacteriales and Vibrionales (Fig. 11 and Fig. 12, respectively), whereas for the cdd cassette visible peaks are produced by the O CYTR P PWM for the Enterobacteriales, and neither of the two matrices provides anything definite for the Vibrionales (Fig. S7 and Fig. S8, respectively).
NupC is a nucleoside transporter. It is unrelated to NupG and shows somewhat different specificity: unlike the universal NupG, it does not transport guanosine and deoxyguanosine [42]. The nupC gene was proposed to be regulated by CytR based on its function in nucleoside transport, similar to some other genes from the CytR regulon [43], and the location of candidate pentameric binding sites [10]. While the alignment of the nupC upstream regions of the five closest relatives of E. coli contains conserved regions, they have very low O CRP scores. Candidate O CYTR sites are seen in the alignment as inverted repeats at a zero distance (Fig. 13). The corresponding peaks in the SWAS plot are weak (~2.6) but clearly visible. The score of O CRP D is about 4, which is consistent with the usual model of regulation of promoters with two CRP-binding sites. An alternative is O CYTR being a direct repeat with a 3 bp spacer, close to the one observed in a SELEX experiment for direct repeats [37]. In this case the score of one of the peaks is larger than 3 and the score of the second peak is about 2.5, both assessed by the O CYTR P PWM (Fig. S9). Finally, there is a possibility that weaker O CRP sites, in particular the one overlapping the transcription start site, also participate in the formation of the regulatory complex. An experiment is needed to validate the predicted site and to select between the alternative descriptions of the cassette structure.

New candidate member of the CytR regulon
As the initial criterion for identification of new possible operator cassettes, we relied on conservation of the distance between candidate O CRP sites and the presence of peaks in the SWAS plots, demonstrating conservation of the O CYTR positions. We started with identification of E. coli genes preceded by high-scoring O CRP D-O CYTR D-O CYTR P-O CRP P cassettes. We required that the score of each cassette exceeded the minimal observed score for the known genes (cut-off 12.6) and that the distance between O CRP sites was in the interval [...]. As expected, the initial four genes used to construct the PWMs (deoC, nupG, ppiA, udp) had high total scores and were among the leaders in the list ordered by decreasing total O CRP-O CYTR D-O CYTR P-O CRP score. We selected 37 E. coli genes satisfying these criteria, listed in Table S2.
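The genome-wide selection step described above can be sketched as a filter on candidate cassettes followed by ranking. The sketch makes two labelled assumptions: the total cassette score is taken to be the sum of the four site scores, and the allowed O CRP distance interval, which is not given in the text, is passed in as a parameter (the values 10 and 40 below are borrowed from recognition rule (1) purely for illustration). Gene names and scores are hypothetical.

```python
def total_score(site_scores):
    """Total cassette score, assumed here to be the sum of the four site scores."""
    return sum(site_scores)

def select_candidates(cassettes, cutoff, min_distance, max_distance):
    """Keep cassettes above the score cut-off with an allowed O CRP distance.

    Each cassette is (gene, (crp_d, cytr_d, cytr_p, crp_p), crp_distance).
    The result is ordered by decreasing total score.
    """
    selected = []
    for gene, site_scores, crp_distance in cassettes:
        score = total_score(site_scores)
        if score > cutoff and min_distance <= crp_distance <= max_distance:
            selected.append((gene, score))
    return sorted(selected, key=lambda item: item[1], reverse=True)

# Hypothetical candidates; the interval bounds are illustrative assumptions.
toy = [
    ("geneA", (5.0, 3.5, 3.0, 4.5), 30),   # high score, allowed distance
    ("geneB", (3.0, 2.0, 2.5, 3.0), 28),   # total score below the cut-off
    ("geneC", (5.0, 4.0, 4.0, 5.0), 55),   # distance outside the interval
    ("geneD", (4.0, 3.0, 3.0, 4.0), 20),   # passes both criteria
]
selected = select_candidates(toy, cutoff=12.6, min_distance=10, max_distance=40)
```

Only the score cut-off of 12.6 comes from the text; everything else in the call is a placeholder to be replaced with the actual criteria.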
Then we identified orthologs of these genes and checked the presence of a pair of O CRP sites at approximately the same distance in at least five genomes. After that we aligned the promoter regions, anchored at O CRP D and O CRP P, and applied the O CYTR D and O CYTR P PWMs, constructing SWAS plots for the spacer between O CRP D and O CRP P.
One strong candidate emerged from this analysis. The ycdZ gene of E. coli is preceded by a cassette formed by two O CRP sites at a conserved distance (29-31 bp) and O CYTR sites in the correct arrangement, and this cassette is conserved in 17 related genomes, that is, in almost all Enterobacteriales and Vibrionales. The exceptions were D. dadantii, E. tasmaniensis, P. atrosepticum, Y. pestis and Y. pseudotuberculosis, where this gene is simply absent, and V. vulnificus, which has an atypical O CRP-O CRP distance.
The alignment of the ycdZ upstream regions may be divided into three subalignments. In close relatives of E. coli, two pronounced peaks in the SWAS plots, corresponding to O CYTR D and O CYTR P, are visible (Fig. 14). In more distant Enterobacteriales, one clear peak is seen (Fig. 15). In the Vibrionales, one peak is visible, but its average score is less than 3 (Fig. 16). Thus, the ycdZ cassette belongs to type 1 at close distances and to type 2 in more distant Enterobacteriales and in the Vibrionales. Hence we predict that ycdZ is a member of the CytR regulon.
The encoded protein YcdZ is an inner-membrane protein from the DUF1097 family. According to TMHMM (see Data and Methods) it has five transmembrane domains (Fig. 17). Hence YcdZ is likely to be a transporter. We suggest naming it NupT.

Discussion
We have observed that the distances between candidate O CRP sites are conserved in upstream regions of orthologous genes regulated by the CRP-CytR complex. On the other hand, positions of the O CYTR sites seem to be conserved only at close evolutionary distances, as the highest-scoring candidate sites may occupy different positions in distant Enterobacteriales and in the Vibrionales (e.g. Fig. 18). One possible explanation for that, discussed in the literature, is that the binding of CytR to DNA has very low specificity, and the regulation is based on the formation of multimeric CRP-CytR complexes stabilized by the CytR-DNA interaction [44]. However, the existence of the CytR-binding motif, albeit weak, as well as the intergenome conservation (higher than background) of O CYTR sites, represented by peaks in the SWAS plots, argues against this explanation.
On the other hand, the intergenome conservation of the distances between O CRP sites in promoters of specific genes together with intragenome differences between genes and relatively low conser- [...]
The problem of identification of the CytR-binding motif is not trivial either. Indeed, the experimental data do not define the binding sites to single-nucleotide precision: the most commonly used method, DNA footprinting, leaves some uncertainty about the site extent and location [45]. When the motif is strong, it is simple to align the footprinted regions and identify the common core. However, for weak motifs this is far from straightforward, and we believe that evolutionary considerations yielding the phylogenetic footprinting techniques also deserve attention.
Finally, the overall structure of the CytR-binding site may vary. In most cases, it is an inverted repeat with a variable spacer. However, as shown in a SELEX experiment with the deo operator, both inverted repeats of O CYTR boxes with a large spacer (10 to 13 bp) and direct repeats in either direction (O CYTR D or O CYTR P) with a short spacer (1 bp) may be bound by CytR [37]. Direct repeats in the O CYTR D orientation were observed to be conserved in the operators of cytR and cdd, again, with short spacers.
The comparative analysis also enables the identification of new regulon members even for regulators with weak motifs. Of course, the predicted CytR regulation of the nupT (ycdZ) gene requires experimental verification.

Data and Methods
Complete genome sequences of the Enterobacteriales and Vibrionales in the gbk format were downloaded from GenBank (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) [46].
The Positional Weight Matrix (PWM) was defined via: [...] where W(b,k) is the positional weight of nucleotide b at position k of the PWM, and N(b,k) is the count of nucleotide b at position k in the training sample. The sum of the positional weights for a site yields the site score: $S(b_1 \ldots b_K) = \sum_{k=1}^{K} W(b_k, k)$, where $b_k$ is the nucleotide at position k of a site of length K.
Sliding window average score (SWAS) plots were constructed as follows. The upstream regions of each gene from the CytR regulon were aligned in the bacterial groups with approximately constant CRP-CRP distance. Within a sliding window of size 8 nt, the average score of sites within the window was calculated using the respective PWMs. Scores of sequences containing gaps in a given window position were set to 0, but these sequences were counted for averaging; hence, positions with gaps were penalized.
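The scoring and SWAS procedure described above can be sketched in a few lines. This is not the SignalX implementation: the weight formula used here, W(b,k) = log[N(b,k) + 0.5] - 0.25 Σ log[N(i,k) + 0.5], is a commonly used pseudocount form and is an assumption, as is the reading of the SWAS rule as "score the 8 nt window starting at each alignment column in every sequence and average over all sequences". Training sites and the alignment are hypothetical.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5):
    """Build a PWM from equal-length training sites (assumed weight formula)."""
    pwm = []
    for k in range(len(sites[0])):
        logs = {b: math.log(sum(site[k] == b for site in sites) + pseudocount)
                for b in BASES}
        mean_log = sum(logs.values()) / len(BASES)
        pwm.append({b: logs[b] - mean_log for b in BASES})
    return pwm

def site_score(pwm, site):
    """Site score: the sum of positional weights over the site."""
    return sum(column[base] for column, base in zip(pwm, site))

def swas(alignment, pwm):
    """Sliding window average score over an alignment of equal-length rows."""
    width = len(pwm)
    profile = []
    for start in range(len(alignment[0]) - width + 1):
        total = 0.0
        for row in alignment:
            window = row[start:start + width]
            # Windows with gaps (or any non-ACGT symbol) score 0
            # but are still counted in the average, which penalizes them.
            if all(ch in BASES for ch in window):
                total += site_score(pwm, window)
        profile.append(total / len(alignment))
    return profile

# Hypothetical training sites and a toy alignment containing a conserved site.
training = ["TTGCAACG", "TTGCAATG", "ATGCAACG", "TTGCATCG"]
pwm = build_pwm(training)
alignment = ["CCA" + "TTGCAACG" + "GAT",
             "CCA" + "TTGCAATG" + "GAT",
             "C-A" + "TTGCAACG" + "GAT"]
profile = swas(alignment, pwm)
```

In this toy alignment the profile peaks at the column where the conserved site starts, and the two leftmost windows are lowered by the gap in the third row, which shows the penalizing effect of gaps.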
The positional information content was calculated as [...] where e(a, i) is the frequency of nucleotide a in the alignment position i.
P is the proximal CRP-operator with respect to the transcription start. 5 - spr is spacer length. 6 - Sscore is the total score of the cassette. 7 - site score is in parentheses. $ - start pos is the start position of the cassette in the respective upstream region. @ - no direct repeats for O CYTR D-O CRP P, only inverted ones; * - known regulon member with experimentally determined cassette; # - predicted regulon member, predicted cassette; #* - known regulon member, predicted cassette.
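The positional information content mentioned in Data and Methods can be illustrated per alignment column. The sketch assumes the standard Shannon form relative to a uniform background, I(i) = Σ e(a,i) log2[e(a,i)/0.25], which may differ from the exact expression used in the study; ignoring gap characters when computing frequencies is also an assumption.

```python
import math

def information_content(column, bases="ACGT"):
    """Information content (bits) of one alignment column, uniform background."""
    symbols = [ch for ch in column if ch in bases]   # gaps are ignored (assumption)
    if not symbols:
        return 0.0
    total = 0.0
    for a in bases:
        e = symbols.count(a) / len(symbols)          # frequency e(a, i)
        if e > 0:
            total += e * math.log2(e / 0.25)
    return total

# Columns of a hypothetical alignment, from fully conserved to uniform.
conserved = information_content("AAAA")
two_state = information_content("AACC")
uniform = information_content("ACGT")
```

Under this definition a fully conserved column carries 2 bits and a column with uniform nucleotide frequencies carries none, so a region of roughly constant positional conservation shows up as a flat stretch of the plot.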