Conserved and Variable Functions of the σE Stress Response in Related Genomes

Bacteria often cope with environmental stress by inducing alternative sigma (σ) factors, which direct RNA polymerase to specific promoters, thereby inducing a set of genes called a regulon to combat the stress. To understand the conserved and organism-specific functions of each σ, it is necessary to be able to predict their promoters, so that their regulons can be followed across species. However, the variability of promoter sequences and motif spacing makes their prediction difficult. We developed and validated an accurate promoter prediction model for Escherichia coli σE, which enabled us to predict a total of 89 unique σE-controlled transcription units in E. coli K-12 and eight related genomes. σE controls the envelope stress response in E. coli K-12. The portion of the regulon conserved across genomes is functionally coherent, ensuring the synthesis, assembly, and homeostasis of lipopolysaccharide and outer membrane porins, the key constituents of the outer membrane of Gram-negative bacteria. The larger variable portion is predicted to perform pathogenesis-associated functions, suggesting that σE provides organism-specific functions necessary for optimal host interaction. The success of our promoter prediction model for σE suggests that it will be applicable for the prediction of promoter elements for many alternative σ factors.


Introduction
Induction of alternative sigma (σ) factors is an important strategy for coping with environmental stress in bacteria. Indeed, there is a rough correlation between the apparent complexity of the environment and the number of alternative σ factors: Mycoplasma sp., which are obligate intracellular pathogens, contain only the housekeeping σ and no alternative σ's; Escherichia coli, which inhabits the relatively constant environment of its host organisms but can also survive in vitro, has six alternative σ's; and Streptomyces coelicolor, which inhabits a hostile and changing soil environment, has 62 alternative σ's. Therefore, the ability to predict promoters recognized by alternative σ's would significantly improve our capacity for understanding how bacteria adapt to stress.
It is challenging to predict bacterial promoters, which are composed of two conserved sequences centered at about −10 and −35 from the start point of transcription. Some promoters also have an "upstream element" (UP) upstream of the −35 sequence and/or an "extended −10" element immediately upstream of the −10. The fact that these promoters are composed of multiple, weakly conserved elements separated by less conserved, variable-length spacer sequences makes their prediction a difficult bioinformatics problem. Prediction attempts have a long history, mostly directed at promoters recognized by σ70 (b3067), the housekeeping σ in E. coli, using hidden Markov models, neural networks [1–4], and position weight matrices (PWMs) [5–8]. While these methods detect promoters with a moderate degree of success, they suffer from high false-positive rates (FPRs) in genomic sequences. In addition, promoter consensus and mismatch searches have been employed to identify promoters for the Group IV factor, σW (Bsu0173), in Bacillus subtilis [9]. However, these approaches are not as effective as PWMs, which better describe the natural variability of target sites. Here, we consider only PWMs because their success is comparable to that of more complex models [3]. Staden [5] used three matrices (describing the −35, −10, and +1 promoter motifs) and one spacer penalty (for the −35 to −10) to predict σ70 promoters; variations of this approach were later explored by Hertz and Stormo [7]. Huerta and Collado-Vides describe the most accurate prediction method to date for σ70 promoters using multiple matrices for the −35 and −10 motifs, with one spacer penalty for the intervening spacer [6]. Although this method successfully identifies known promoters with high sensitivity (86%; true positives/total promoters), it suffers from many false predictions resulting in low precision (20%; true positives/total predictions), reducing its utility as a prediction tool to identify new promoters.
Alternative σ factors usually turn on a group of genes synchronously in response to a particular stress, and hence use very few activators. As a consequence, promoters recognized by alternative σ factors are somewhat less variable and might have higher information content than those recognized by the housekeeping σ factor, making them more amenable to bioinformatic analysis. We chose to test this proposition by determining the feasibility of predicting promoters of E. coli σE (b2573), both in E. coli K-12 and in related bacteria. σE, a Group IV (extracytoplasmic, ECF) σ factor [10,11], mediates the envelope stress response [12,13], is essential in E. coli K-12 [14], and is important for virulence in related bacteria [15–22]. We first identified σE regulon members and their promoters using genome-wide expression analysis and transcript start site mapping in the E. coli K-12 genome. We derived a model for these σE promoters by building upon approaches pioneered for σ70 promoters, and used this model to make predictions in related genomes. By comparing promoter predictions from the actual genome with those from "randomized" genomes, we were able to identify those promoters that are unlikely to occur by chance alone. In addition, we adapted cross-genome approaches utilized for transcription factors [23–25] as an additional way of predicting promoters in E. coli and related pathogenic genomes. We tested all predictions in E. coli K-12 and Salmonella typhimurium and unique predictions in E. coli CFT073. These tests demonstrated that the model works with high precision.
Our studies indicate that the extended regulon of 89 predicted transcription units (TUs) consists of a core set of genes conserved in most organisms and another group of more poorly conserved genes. Remarkably, each of these gene sets has a coherent function. The core genes coordinate the assembly and maintenance of lipopolysaccharide (LPS) and outer membrane porins (OMPs), the two key structures of the outer membrane of Gram-negative bacteria, in response to environmental change. A majority of the variable σE regulon members perform functions known to be important for a pathogenic lifestyle. We suggest that induction of such determinants at the first sign of stress facilitates bacterial adaptation to the host environment.

Results
Identifying σE-Dependent Genes by Transcription Profiling
σE-dependent genes were initially identified using genome-wide transcription profiling, comparing a wild-type E. coli K-12 strain that has a low level of σE with a strain overexpressing σE (following induction of its gene, rpoE, from an inducible promoter by IPTG). This strategy is preferable to comparison with an rpoE− strain because: (1) many σE-transcribed genes have multiple promoters, so that the change in transcriptional signal upon loss of σE is often small; and (2) rpoE− strains (which require an uncharacterized suppressor for viability [14]) grow slowly, invalidating the direct comparison between rpoE+/− strains. We monitored changes in gene expression in four separate time-courses after induction and used statistical analysis of microarrays (SAM) [26] to identify 75 significantly induced and eight significantly repressed genes (Figure 1; see Materials and Methods). Some of these genes are part of operons in which other gene members were clearly induced but were not marked as significant by our strict selection criteria. Therefore, to fully describe the σE regulon, we expanded this set by using the statistics from SAM to analyze the reproducibility and significance of the expression ratios of all the genes adjacent to and in the same orientation as the highly significant genes. This gave 96 genes organized in 50 σE-dependent TUs, of which 42 were induced and eight were repressed (Figure 1).

Identification of σE Promoter Motifs Upstream of Induced TUs
To determine which of our induced genes might have σE promoters, we used rapid amplification of cDNA ends (5′ RACE; see Materials and Methods) to identify start points of each TU, comparing mRNAs from rpoE-overexpressing versus rpoE− cells. This analysis indicated that 28 of the 42 induced TUs contained σE-dependent transcription start sites (unpublished data). The remaining promoterless TUs identified in transcriptional profiling may be indirectly regulated by σE, especially since most were only weakly induced.
Bacterial promoters are located immediately upstream of their start sites. We therefore searched small blocks of sequences directly upstream of the 5′ RACE-determined transcription starts for conserved σE motifs using the algorithm WCONSENSUS (see Materials and Methods). By testing several different search-window positions and widths, we found that a 16-nt search window (−1 to −16) was optimal for identifying the conserved −10 motif (T/CGGTCAAAA), and that a 16-nt search window starting 9 nt upstream of the −10 element was optimal for locating the −35 motif (GGAACTTTT). Although there were no other highly significant motifs, we found a 30-nt window of generally A/T-rich sequences directly upstream of the −35 motif with two conserved A/T-rich elements at positions −48/−49 and −57/−58. These correspond closely to the two information peaks in the SELEX-derived consensus sequences for the UP element of the rrnB P1 promoter [27]. In addition, the initiation nucleotide of the 28 promoters exhibited a strong preference for a purine (A/G) and weak conservation of sequences directly upstream.
The sequence logos of the conserved sequence motifs upstream of the 28 σE-dependent transcription start sites, together with their information content, are displayed in Figure 2A. The fact that all of the sequences contained good −35 and −10 promoter motifs indicated that we had successfully mapped σE-dependent transcription initiation sites. Note that most of the total information content of the promoter motifs (22.8 bits) was contributed by the well-conserved −10 and −35 motifs. Figure 2B–2D displays histograms of the distance distributions of the promoter elements from each other: most promoters preferred a 5/6-nt discriminator region between the −10 and +1 (Figure 2D), while the spacing between the −10 and −35 varied from 15–19 nt, with 16 nt strongly preferred (Figure 2C). Interestingly, individual promoters displayed an inverse correlation between the length of these two spacers: promoters with a long −10/−35 spacer tended to have a short discriminator, and vice versa. Consequently, the range of distances between the −35 and +1 for all the promoters is quite small: 25–28 nt, with most promoters preferring a 26/27-nt spacer (Figure 2B). The identified promoter sequences are listed in section A of Table 1. The sequence alignments for the UP, −35, −10, and +1 sequences were used to build four PWMs (see Materials and Methods); each PWM spans the complete sequence illustrated in each logo in Figure 2A. Each promoter was then scored by summing the individual PWM scores and incorporating penalties for suboptimal spacing between the motifs to generate a distribution of known promoter scores with mean (μk) and standard deviation (σk). High-scoring promoters were composed of more highly conserved promoter elements at optimal spacings, and low-scoring promoters contained less well-conserved elements at suboptimal spacings.
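The scoring scheme described above (one PWM per motif, summed, with penalties for suboptimal spacing) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy alignment, the uniform 0.25 background, the pseudocount, and the spacer penalty table are all hypothetical, and the information content is computed without any small-sample correction.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds PWM from equal-length aligned sites.

    Assumptions (not from the paper): uniform background, fixed pseudocount.
    """
    denom = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / denom) / background)
            for base in BASES
        })
    return pwm


def score_site(pwm, seq):
    """Sum the per-position log-odds weights for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, seq))


def information_content(sites):
    """Total information (bits) of an alignment with a uniform background,
    without small-sample correction."""
    n = len(sites)
    bits = 0.0
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits += 2.0 - entropy
    return bits


def promoter_score(motif_scores, spacer_lengths, spacer_penalties):
    """Total score = sum of motif scores minus one penalty per spacer.

    spacer_penalties maps spacer name -> {length: penalty}; the values
    used below are hypothetical.
    """
    penalty = sum(spacer_penalties[name][length]
                  for name, length in spacer_lengths.items())
    return sum(motif_scores) - penalty


# Toy -10 alignment (hypothetical sequences, not taken from Table 1).
toy_sites = ["TCAAA", "TCAAA", "TCTGA", "TCAAA"]
toy_pwm = build_pwm(toy_sites)
toy_penalties = {"-35/-10": {15: 2.0, 16: 0.0, 17: 1.0}}
toy_total = promoter_score(
    [score_site(toy_pwm, "TCAAA")], {"-35/-10": 17}, toy_penalties
)
```

In a full model the list passed to `promoter_score` would hold the UP, −35, −10, and +1 scores, and the penalty table would cover both the −35/−10 spacer and the discriminator.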
We searched the E. coli K-12 MG1655 sequence for σE promoters in which each individual PWM scored ≥ μ − 2σ, and where the distance between motifs was within the range observed for the 28 RACE-identified promoters. These constraints allow potential promoters to have a combination of weak and strong motifs and the variable spacings characteristic of known E. coli K-12 σE promoters. Genome-wide predictions with the −35 PWM identified 98,113 sites (Table 2). Sequences flanking these sites were then searched for UP, −10, and +1 motifs within the spacing range of our validated promoters to create a library of candidate promoters (note that the order of the searches does not affect the final library). The total promoter score of each candidate was calculated using the same procedure described above for the known promoters and then converted to a z-score (the number of standard deviations [σk] of the candidate score from the mean score of the known promoters [μk]). In cases where promoters overlapped such that the +1 motifs were within 4 nt of each other, only the highest-scoring promoter was selected. This generated a library of 553 candidate promoters that includes 27 of the 28 RACE-identified promoters (Table 2), missing only the ybfG promoter, which fails due to a poor start motif (< μ − 2σ) despite having a relatively high total promoter z-score (−0.03).
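The z-score conversion and the overlap rule can be sketched as follows. This is an illustrative outline only: the dictionary keys (`tss`, `strand`, `z`), the use of the sample standard deviation, and the decision to compare only candidates on the same strand are assumptions, not details given in the text.

```python
from statistics import mean, stdev


def to_z_scores(candidate_scores, known_scores):
    """Express candidate scores in standard deviations from the mean
    score of the known promoters."""
    mu_k = mean(known_scores)
    sigma_k = stdev(known_scores)  # sample SD; an assumption
    return [(score - mu_k) / sigma_k for score in candidate_scores]


def resolve_overlaps(candidates, max_gap=4):
    """Keep only the highest-scoring candidate among those whose +1
    positions lie within max_gap nt of each other on the same strand.

    Greedy approach: visit candidates from best to worst z-score and keep
    one only if it does not clash with a candidate already kept.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c["z"], reverse=True):
        clashes = any(
            cand["strand"] == k["strand"]
            and abs(cand["tss"] - k["tss"]) <= max_gap
            for k in kept
        )
        if not clashes:
            kept.append(cand)
    return sorted(kept, key=lambda c: c["tss"])


# Hypothetical candidates: the first two overlap, so only the better one survives.
example = [
    {"tss": 100, "strand": "+", "z": -0.5},
    {"tss": 103, "strand": "+", "z": 0.2},
    {"tss": 200, "strand": "+", "z": -1.0},
]
survivors = resolve_overlaps(example)
```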

Identifying Significant σE Promoters From the Promoter Prediction Library
The vast majority of the 553 predicted promoters were low scoring and randomly distributed, in contrast to the 5′ RACE-validated promoters, which were high scoring (> −1) and located near target genes (Figure 3A). To identify significant (i.e., functional) promoters from our library, we compared predictions from the actual genomic sequence (Figure 3B) with those from 100 randomized genomes generated in silico (Figure 3C). The randomized genomes maintain the location of all open reading frames (ORFs) and the average codon and nucleotide content, but now contain only nonspecific sequences. Hence, predictions from these genomes indicate the number of predictions occurring by chance alone. This allows us to determine both an FPR and a probability that the prediction arose by chance (p-value) for every prediction in the actual K-12 genome. Using a cutoff of FPR < 0.5 and p < 0.05 for each bin (a bin describes a group of promoters with similar scores and positions relative to the gene) and an additional distance and z-score constraint to remove spurious predictions (see Materials and Methods), we generated 39 highly significant predictions. Their combined FPR is 0.22, which means that 8.6 of 39 predictions would be expected by chance alone. Of the 39 significant predictions, 24 were of previously validated promoters located upstream of genes that were induced in transcriptional profiling. The remaining 15 predicted promoters were not upstream of genes that were induced in transcriptional profiling. Interestingly, one promoter is upstream of ompX (b0814), which is repressed in the transcription profiling, but is oriented away from the gene. Thirteen of 15 promoters (including ompX) were confirmed either by in vitro transcription or in vivo promoter assays (sections A and C in Table 1), giving a total of 37 of 39 verified significant predictions.
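The comparison against randomized genomes can be illustrated with a short sketch. The exact definitions are given in Materials and Methods, which is not reproduced here, so the formulas below are assumptions: the FPR of a bin is taken as the mean number of predictions per randomized genome divided by the number in the actual genome, and the p-value as the fraction of randomized genomes with at least as many predictions as the actual genome. The function names are hypothetical.

```python
def bin_statistics(actual_count, random_counts):
    """Estimate FPR and an empirical p-value for one bin.

    actual_count  -- predictions in this bin in the real genome
    random_counts -- predictions in this bin in each randomized genome
    Both formulas are assumed, illustrative definitions.
    """
    if actual_count <= 0:
        raise ValueError("bin has no predictions in the actual genome")
    expected_by_chance = sum(random_counts) / len(random_counts)
    fpr = expected_by_chance / actual_count
    p_value = sum(c >= actual_count for c in random_counts) / len(random_counts)
    return fpr, p_value


def expected_false_predictions(combined_fpr, n_predictions):
    """Number of predictions expected by chance alone."""
    return combined_fpr * n_predictions


# The figure quoted in the text: a combined FPR of 0.22 over 39 predictions.
by_chance = round(expected_false_predictions(0.22, 39), 1)
```

With the values from the text, `by_chance` reproduces the stated figure of 8.6 predictions expected by chance.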
Table 1 legend. Sections: (A) genes significantly regulated upon overexpression of rpoE (as determined by transcription profiling) with an identified σE promoter upstream, (B) genes significantly regulated upon overexpression of rpoE (as determined by transcription profiling) but with no identifiable σE promoter, and (C) genes not significantly regulated after overexpression of rpoE (as determined by transcription profiling) but with confirmed σE promoters derived either from promoter predictions or from the literature. Transcription Unit: TUs are listed in chromosomal order; genes within a unit are listed in order of transcription; genes in parentheses are induced but are not predicted to be translated since the σE promoter is internal. In these instances no upstream promoter was detected either by 5′ RACE or by promoter predictions. Ratio: averaged expression ratio of (rpoE induced)/rpoE wt (time points 10 min to 60 min) of first gene in TU. σE promoter: validated σE promoter sequences. Genes marked "no data" were not present on our microarrays. Distance: number of nucleotides of +1 position upstream of translation start point of the first gene in the TU; positive values denote sites internal to the first gene. Score: total promoter z-score (see Materials and Methods). Sequence: σE promoter sequence; conserved −35 and −10 elements are in bold and the start of transcription is in bold lower case. Evidence: evidence for σE promoter; K, previously known; P, predicted from promoter model; R, confirmed by 5′ RACE PCR; T, confirmed by in vitro transcriptions; V, confirmed by in vivo promoter assays. In several instances the identified σE promoter is far upstream and internal to the adjacent gene, often resulting in this gene appearing as induced in our transcription profiling experiments. In the cases of xerD, yhbN, narY, rnt, and wza, the internal σE promoters are very close to the 5′ end of the annotated coding sequence, suggesting that these genes may have an alternative translation start point downstream of these promoters to result in a functionally transcribed gene product. The imp gene was not present on our microarrays but is reported to be a member of the σE regulon [28]. The expression ratio for the imp operon is for surA.
To determine the performance of our model in identifying significant promoters, we need to know the total number of validated σE promoters in E. coli K-12. We used several approaches to identify the 49 promoters that comprise the σE regulon in this organism (all promoters are listed in Table 1). (1) We identified 28 promoters by transcriptional profiling coupled with 5′ RACE and 13 additional promoters from our significant promoter model to give 41 promoters. (2) We searched our library of 553 promoters for any new predictions upstream of genes that were induced in our transcriptional profiling experiments. We found two low-scoring promoters located upstream of genes (malQ [b3416] and lpp [b1677]); these were validated in vitro to give 43 promoters. Note that, similar to ompX, lpp is repressed in the transcription profiling and the σE promoter is upstream but oriented away from the gene. (3) We noticed that several validated predictions are located far upstream of the nearest gene (dsbC [b2893], yhbG [b3201], lhr [b1653], and wzb [b2061]; Table 1) and are in fact internal and very close to the 5′ end of the adjacent ORF, suggesting that these ORFs may be misannotated. Searching our promoter library, we found a high-scoring promoter located upstream of narW (b1466) just beyond our distance cut-off that was very close to the beginning of narY (b1467). We confirmed this promoter in vitro to give 44 promoters. (4) Two genetic screens [28,29] identified additional putative σE-dependent promoters; we validated the five additional promoters identified by Rezuchova et al. to give 49 validated promoters, but were unable to validate any of the eight new promoters proposed by Dartigalongue et al. We note that most of the promoters proposed by Dartigalongue et al. contain poorly conserved sequence elements separated by a wide range of spacer lengths, suggesting they might not be functional. Table 3 shows all validated E. coli K-12 σE regulon members divided into functional categories.
Of the 39 highly significant predictions, 37 were validated, giving our promoter model a precision of 95% (validated predictions/number of predictions; Table 2 and Figure 4). This promoter model also successfully identified 37 of 49 known σE promoters, giving a sensitivity of 76% (validated predictions/known promoters). Averaging the sensitivity and precision scores gives an estimate of the total performance, or accuracy, of the σE prediction model (85%; Table 2). True promoters remained undetected by the highly significant prediction model for a variety of reasons: five promoters failed because their UP, −35, −10, or +1 motifs scored less than μ − 2σ; five promoters failed because of low total promoter scores, making them difficult to distinguish from the many other low-scoring nonfunctional promoters; and two failed because they were located far upstream of the nearest gene. The variety of reasons for these failures suggests that the missed promoters are outliers rather than the product of a fault in a particular predictive step of the model.
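The three performance figures follow directly from the counts given above, as this short worked example shows. The function name is illustrative; the definitions of precision, sensitivity, and accuracy (the average of the two) are those stated in the text.

```python
def model_performance(validated, predicted, known):
    """Return (precision, sensitivity, accuracy) as fractions.

    precision   = validated predictions / number of predictions
    sensitivity = validated predictions / known promoters
    accuracy    = average of precision and sensitivity
    """
    precision = validated / predicted
    sensitivity = validated / known
    accuracy = (precision + sensitivity) / 2
    return precision, sensitivity, accuracy


# Counts for E. coli K-12 from the text: 37 validated, 39 predicted, 49 known.
precision, sensitivity, accuracy = model_performance(37, 39, 49)
percentages = [round(100 * value) for value in (precision, sensitivity, accuracy)]
```

Rounded to whole percentages, these counts give 95%, 76%, and 85%, matching the values reported in Table 2.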

Predictions of σE Promoters in Closely Related Genomes
Given the success of our promoter model in E. coli K-12, we extended it to eight genomes of closely related organisms in which the DNA-binding determinants of the σE orthologs are identical or very similar to those in E. coli K-12 σE (Figure S1). This determination is based on the demonstration that the structure of Domain 2 (which recognizes the −10 conserved promoter sequence) and of Domain 4 (which recognizes the −35 conserved promoter sequence) of E. coli σE can be overlaid with that of σ70, the housekeeping σ, indicating that the structure of these two domains is conserved across σ's [30]. The −10 and −35 promoter recognition determinants in σ70 have been thoroughly mapped [31]. We assumed that comparable residues in σE carried out −10 and −35 recognition and identified eight organisms in which these residues were highly conserved.
We applied the promoter prediction model developed in E. coli K-12 to these eight genomes to generate a library of promoter predictions for each organism. We then identified all putative regulon members in TUs by assuming that the downstream genes formed an operon if they were in the same orientation and the intervening intergenic region (IG) was less than 50 nt [32]. Significant promoters were identified as described above for E. coli K-12 by comparison to predictions from random genomes (constructed specifically for each real genome to account for their structure and average codon and nucleotide contents). To prevent spurious results in some genomes, significant promoters (FPR < 0.5; p < 0.05) were also filtered for z-score > −2 and distance < 1,100 nt upstream of genes. As a second method, a significant prediction in any one genome was used to search the relevant promoter library for promoters upstream of conserved orthologs in the other species (see Materials and Methods). The matching promoter did not have to satisfy a minimum p-value or FPR, enabling the detection of less well-conserved orthologous promoters. However, to prevent spurious results, predicted σE promoters were required to have a z-score > −2 and to be within 1,100 nt upstream of the orthologous gene or TU. For each significant prediction upstream of a conserved ortholog, the probability of identifying a matching promoter in each genome by random chance from the promoter libraries is approximately 0.03, suggesting that the matches we identified were highly significant. In addition, we found that the vast majority of matching promoters were at similar distances upstream of the orthologs as the original search promoter, further increasing the significance of the matches. The results of these procedures are summarized in Table 4 and are presented in a database of conserved predicted σE promoters and regulon members across all nine genomes (Table S1).
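The operon rule and the filter applied to predictions can be sketched as follows. This is a simplified illustration: the gene tuples, the inclusive coordinate convention, and the function names are assumptions, the grouping walks the whole gene list rather than only the genes downstream of a predicted promoter, and genes are reported in chromosomal order rather than in order of transcription.

```python
def group_into_tus(genes, max_gap=50):
    """Group adjacent genes into putative transcription units.

    genes: iterable of (name, start, end, strand) with inclusive coordinates.
    Neighbors join the same TU when they share an orientation and the
    intergenic region between them is shorter than max_gap nt.
    """
    tus = []
    for gene in sorted(genes, key=lambda g: g[1]):
        _, start, _, strand = gene
        if tus:
            prev = tus[-1][-1]
            intergenic = start - prev[2] - 1
            if prev[3] == strand and intergenic < max_gap:
                tus[-1].append(gene)
                continue
        tus.append([gene])
    return [[g[0] for g in tu] for tu in tus]


def passes_filter(z_score, upstream_distance, min_z=-2.0, max_distance=1100):
    """Apply the z-score > -2 and distance < 1,100 nt upstream constraints.

    upstream_distance is taken as a non-negative number of nucleotides
    upstream of the gene; promoters internal to a gene are not modeled.
    """
    return z_score > min_z and 0 <= upstream_distance < max_distance


# Hypothetical gene layout used only to exercise the grouping rule.
toy_genes = [
    ("geneA", 1, 100, "+"),
    ("geneB", 120, 300, "+"),   # 19-nt gap, same strand: joins geneA
    ("geneC", 400, 500, "+"),   # 99-nt gap: starts a new TU
    ("geneD", 510, 600, "-"),   # opposite strand: starts a new TU
]
toy_tus = group_into_tus(toy_genes)
```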
These two computational approaches, together with experimentally identified promoters in E. coli K-12, generated an "extended σE regulon" across nine genomes, which consisted of 89 unique TUs (Table 4). Interestingly, there are no TUs predicted to be regulated by σE in all nine genomes; however, a core of 19 TUs is present in at least six genomes. The conserved members of the regulon predominantly carry out related functions (Table 5) involving the outer membrane and the regulatory strategy to maintain the σE response. The majority of the remaining σE-controlled TUs are not highly conserved, but most control cell envelope functions (Table 5; see Table S2 for a list of all the extended regulon members in each functional category).
Among the nine organisms, E. coli O157:H7 has the most predictions (49) and Yersinia pestis the fewest (nine) (Table 4). Genomes may have fewer significant σE predictions because they have a reduced σE regulon. Alternatively, the promoter model may not perform well in that organism. We believe that Yersinia is an example of an organism with a reduced σE regulon, making it difficult to detect its promoters with the random genome approach, which relies on identifying overrepresented sequences. In support of this idea, the σE DNA-binding determinants in both organisms are essentially conserved (see Figure S1), and eight of nine Yersinia promoters with reasonable promoter scores were identified using the conserved ortholog approach (see Table 4). This may also be true for Erwinia and Photorhabdus, which also have only a few significant promoter predictions (one and eight, respectively). However, they also contain four and six amino acid changes, respectively, near the DNA-binding determinants of regions 2.4 and 4 (see Figure S1), so the optimal promoter sequence may deviate slightly in a way that is not captured by the E. coli promoter prediction model. We note, though, that these genomes still share many highly conserved σE regulon members, indicating that many of our predictions in these genomes should be functional. In more divergent genomes, where σE orthologs had amino acid changes at critical DNA-binding positions (Shewanella oneidensis, Vibrio cholerae, and Pseudomonas aeruginosa; unpublished data), our model was unsuccessful. Interestingly, loss of P. aeruginosa σE is complemented by E. coli σE [21], and likewise, both σE consensus sequences are similar ([33] and references therein). However, few promoters match consensus, and the σE orthologs may tolerate different variations in their target promoter sequences.

Validation of the σE Promoter Model in S. typhimurium and E. coli CFT073
To determine the validity of our predictions, we experimentally tested all predictions made in S. typhimurium. In addition, we tested all unique predictions made in E. coli CFT073 (conserved predictions were not tested because their promoters were virtually identical to those found in E. coli K-12). Promoter function was tested both by in vivo promoter assays and by in vitro transcription (Tables S1 and S2). Although both of these assays used E. coli K-12 RNA polymerase and σE, we do not think there are any functional differences from the E. coli CFT073 and S. typhimurium σE holoenzymes since their subunits are virtually identical and differ only in a few nonessential positions, with at least 99.72% and 98.58% sequence identity, respectively, with the E. coli K-12 subunits. These assays revealed a high success rate. For S. typhimurium, we made a total of 29 predictions, composed of 22 significant predictions based on the random genome model and seven predictions based on the conserved ortholog approach. Sixteen of 22 (73%) of the significant predictions and four of seven (57%) of the conserved ortholog predictions were validated, for an overall success rate of 69%. For CFT073, of the 40 predictions, we have validated 29 of 38 (76%) significant predictions and two of two conserved ortholog predictions, for an overall success rate of 78%. We note that unconfirmed predictions may still be functional in vivo, as they might require a coregulator not present in our assay conditions or in E. coli K-12. These results suggest that our promoter prediction strategies provide a reasonably accurate picture of the σE regulon in organisms closely related to E. coli K-12.
Table 3 footnotes. Proteins with no identified σE promoter are in parentheses. (a) Proteins whose function is predicted from amino acid sequence BLAST analysis for related proteins of known/predicted function. Proteins that have no significant sequence homology to any protein of known/predicted function are labeled unknown function. Note that some proteins are in more than one functional group. (b) Imp was not present on our microarrays but is reported to be a member of the σE regulon [28]. OMP, outer membrane protein; OM, outer membrane. DOI: 10.1371/journal.pbio.0040002.t003

Discussion
The goal of this work was to follow the responses mediated by alternative σ's across organisms to determine whether these responses have changed. This required us to develop methods that accurately predict promoters recognized by alternative σ's. We have developed a successful strategy to predict the σE regulon in E. coli K-12 and related organisms and have validated predictions in three organisms. We report the first comprehensive analysis of the conservation and variation of a σ factor regulon across genomes, identifying an "extended" σE regulon in nine genomes comprising 89 unique TUs. Of these, only 19 are highly conserved. The highly conserved TUs maintain appropriate cellular levels of LPS and OMPs, two unique constituents of the outer membrane of Gram-negative bacteria, thereby identifying the core function of the regulon. The less-conserved regulon members perform multiple pathogenesis-associated functions, suggesting that the σE regulon has been co-opted to provide organism-specific functions necessary for optimal interaction with the host.

Promoter Predictions
We chose to employ de novo promoter prediction as our primary method for cross-genome analysis because it can identify promoters unique to a particular genome. This is an important attribute, given the variability of bacterial genomes. For example, the three sequenced E. coli genomes share only 40% of their coding sequence. As a secondary approach, we searched for weakly conserved predictions upstream of orthologous genes, thereby identifying additional promoters too weak to pass the first filter (e.g., the latter method identified seven new S. typhimurium promoters, four of which were validated in vitro, and eight new Yersinia promoters). Our σE promoter model performed considerably better (precision = 95%; accuracy = 85%; see Table 2) than the housekeeping σ70 promoter model (precision = 20%) upon which it is based [5–7], primarily because the combined information content for σE is much higher than that for σ70 (Iseq = 22.8 bits versus 12.56 bits). In addition, performance was improved by comparison to a random genome to reduce false positives and by our secondary approach of searching for conserved orthologs. Interestingly, σ70 promoters, but not σE promoters, were often embedded in predicted clusters of overlapping sites [6]. This distinction may result from the differences in specificity of the two models or reflect a fundamental distinction in promoter recognition mechanisms of housekeeping and alternative σ's. We note that a simple prediction model having a single weight matrix and a fixed-length spacer suffices to predict promoters of another family of σ factors (σ54; RpoN) unrelated to the σ70 family [34–36]. In contrast, our promoter prediction model should be applicable for the prediction of promoter elements for the many alternative σ70 family members that bind to promoter elements separated by variable spacers, and especially Group IV σ's that tend to bind to more highly conserved promoter sequences [11].
Table 4 notes. For a particular genome, the total number of predictions is derived either from the significant predictions model or from the conserved approach. A conserved ortholog prediction meets the following conditions: (1) the downstream gene has an ortholog in a related genome; (2) the ortholog has a predicted upstream promoter within 1,100 nt upstream of the gene and a total z-score > −2; (3) the promoter has a significance score of FPR < 0.
Many σE promoter predictions were limited to particular subgroups. In some cases, the orthologs themselves had limited distribution. This particularly interesting case suggests that the ortholog has an organism- or species-specific role. For example, the highly related E. coli and Shigella genomes contained three predictions upstream of orthologs exclusive to at least three of four of these genomes, and the two Salmonella species contained two predictions upstream of orthologs unique to Salmonella (see Table S1). In other cases, the orthologs themselves were widely distributed, but σE promoters were identified for only some orthologs. For example, ten predicted σE promoters are found only upstream of genes in E. coli and Shigella, and five σE promoters are found only upstream of genes in Salmonella (see Table S1). These cases may identify examples of regulon evolution, where σE promoters are created or lost in response to the requirements of the organism. Alternatively, we may have failed to detect σE promoters because one or more of their motifs failed our cutoff criteria. Finally, when σE promoters regulate long polycistronic TUs, some downstream TUs may no longer be classified as σE regulated in related genomes, either because of gene shuffling or because their intergenic distance was > 50 nt (our cut-off for genes in an operon). In this latter case, σE might still regulate the downstream genes.

The Core σE Regulon
The core σE regulon consists of 19 TUs and 23 proteins, of which 20 have known functions (Table 5; Figure 5). Amazingly, at least 60% of the core regulon members (~75% of proteins with known functions) ensure the synthesis and assembly of LPS and OMPs, or encode the transcriptional circuitry to maintain the homeostasis of these two key constituents of the outer membrane of Gram-negative bacteria. The proper ratio of OMPs and lipid A contributes to the impermeability of the outer membrane [37].
Five members of the core regulon are involved in the synthesis or assembly of LPS. Four members (LpxA, LpxB, LpxD, and PlsB) promote the synthesis of lipid A, the hydrophobic anchor of the LPS, and a fifth (BacA) contributes to LPS assembly [38,39]. Lipid A comprises the outer leaflet of the outer membrane. The high resistance of Gram-negative bacteria to hydrophobic compounds is in large part due to the high density of saturated fatty-acid chains and the potential for many lateral interactions in lipid A, which together dramatically slow diffusion of hydrophobic compounds through the outer membrane [40].
OMPs are trimeric β-barrel proteins that form channels in the outer membrane to permit access of small solutes. These abundant proteins comprise about 25% of the surface area of the bacteria [37] and have a complex assembly pathway. Six members of the core regulon promote OMP assembly: two lipoproteins (YfiO and YraP) [41,42], three chaperones (Skp, FkpA, and DegP) [41,43], and YaeT (Omp85), which is generally implicated in insertion of β-barrel proteins into the outer membrane of many species [44][45][46] and may also do so in E. coli [45,46]. YaeT functions in a complex with three lipoproteins (YfiO, YfgL, and NlpB) [42], of which only YfiO is in the core regulon. However, the other two lipoproteins may also turn out to be part of the conserved regulon, as YfgL is predicted to be driven by a σE promoter in five organisms and, at least in K-12, NlpB (b2477) is induced by overexpression of σE through an unknown mechanism. The complex assembly pathways of LPS and porins are not completely known, but it is clear that the two are mutually dependent [47][48][49][50][51][52]. Thus, some conserved regulon members may actually function in both assembly pathways.
Intriguingly, FtsZ, a member of the core regulon, is involved in initiating cell division (reviewed in [53]). This raises the possibility that the σE regulon may be needed to synthesize the excess outer-membrane components required at the time of septation. Thus, its primordial function may have been to facilitate passage through the cell cycle. However, as these core components are essential for the integrity of the outer membrane, this response could easily be used as a primary defense mechanism to protect the barrier function of the cell in the face of environmental stress.
The core regulon also encodes the transcriptional circuitry that allows the cell to detect and respond to imbalances in LPS and OMPs to maintain envelope homeostasis. Unassembled OMPs activate the proteolytic cascade that degrades RseA (b2572) [54], the membrane-spanning antisigma factor that inhibits σE function (reviewed in [12]). As LPS intermediates participate in OMP assembly [47][48][49][50][51][52], the unassembled OMP signal reports on the status of both LPS and OMP maturation [55][56][57][58][59][60]. Two notable features of the transcriptional circuit encoded by the core regulon ensure a rapid and sensitive response to imbalances in OMP assembly. First, the rpoE rseABC operon has two highly conserved σE promoters, one upstream of the entire operon and the second upstream of rseA (see Table S1). As a consequence of this arrangement, σE positively autoregulates itself, thereby ensuring a rapid increase in proteins required for OMP/LPS homeostasis, and up-regulates RseA to set up a negative feedback loop (Table 5; Figure 5). The fact that RseA synthesis is driven from two promoters is likely to dampen the response, reduce oscillation, and provide a sufficient excess of RseA to ensure rapid down-regulation following a decrease in unassembled OMPs. A second important feature of the response is a homeostatic loop that prevents further buildup of unassembled OMPs (Figure 5). At least in E. coli K-12, OmpA (b0957), OmpC (b2215), OmpF (b0929), and OmpX are down-regulated upon induction of σE, thereby decreasing the flow of OMPs to the envelope. Down-regulation may be accomplished by production of σE-regulated antisense small RNAs transcribed divergently from their negatively regulated OMPs (V. Rhodius, unpublished data). Intriguingly, the σE promoter divergent from ompX is a member of the core regulon (see Table S1 and Table 5), raising the possibility that OMP down-regulation is a conserved feature of the response.

The Extended σE Regulon
More than 60 of the unique σE-controlled TUs we have predicted are present in fewer than six of the nine genomes we have scanned; many are present in only a small subset of these genomes (see Table S1 and Table 4). However, the majority of those with known functions share a coherent theme: adaptation of the organism to the conditions encountered when the bacterium interacts with its eukaryotic host (Table S2; Table 6). This idea is presaged by two functions in the core regulon: an iron acquisition system (YecI) to facilitate growth in the iron-deficient host environment and a component of alkyl hydroperoxide reductase (AhpF) to detoxify lipid hydroperoxides that may be generated during exposure to macrophages. The predicted extended regulon encodes multiple functions related to pathogenesis. Among these, several have already been validated in at least one organism. These include synthesis of capsule, a viscous polysaccharide layer that facilitates adhesion and protects against macrophage ingestion; recombination functions to resolve DNA lesions that could be generated by the respiratory burst (RecJ/O/R); and metabolic components for nitrate/nitrite respiration (NarW/V) that facilitate adaptation to the anaerobic/microaerophilic host environment. In addition, the regulon is predicted to encode components that produce colanic acid and chorismate and that modify the core and O-antigen portion of LPS, although no predictions in these classes have yet been validated. That the extended σE regulon encodes many pathogenesis-related functions explains why cells lacking σE are defective in pathogenesis [15][16][17][18][19][20][21][22], and suggests that the extended σE regulon may serve as an early adaptation system to facilitate survival in vivo. In addition, although the bacteria discussed here occupy diverse hosts, many pathogenic determinants apply broadly, even across the plant-animal divide [61][62][63].

Note that functions conserved across different genomes may be encoded by different genes.
Why is a response devoted to monitoring the status of OMPs and LPS also used for pathogenesis-related functions? Possibly, interaction with host cells alters the status of these σE regulators, thereby triggering the σE response. Using the core regulon as a base, organisms might then add additional members to the σE regulon that improve their viability in their hosts. This would explain why many of the pathogenesis functions are unrelated either to the core function of the regulon or even to the envelope itself. The variability of the σE regulon suggests that it may be easier to adapt the function of an existing regulator by changing the location of its binding sites than to evolve new regulators. Because environmental change is likely to generate envelope stress, it may be generally true that regulators sensing the envelope will contain organism-specific regulon members that facilitate the response for the particular ecological niche of the bacterium. Interestingly, σE is a member of the Group IV σ family, many of which also respond to stress in the envelope. It will be interesting to determine whether organism-specific variation in regulon function is characteristic of other Group IV σ's.
Materials and Methods
Bacterial strains and plasmids used in this study are listed in Table 7. Strain CAG25195 was constructed by using a lambda lysate from CAG16037 (MC1061 [ΦλrpoH P3::lacZ] ΔlacX74) to lysogenize MG1655 as described by [65]. P1 vir-mediated transductions were carried out as described by [66].
Plasmid pLC245 was used to overexpress rpoE from the strong IPTG-inducible trc promoter and was constructed as follows: the rpoE gene was amplified by PCR from genomic MG1655 DNA using the primers RPOE1 (5′-CATATGAGCGAGCAGTTAACGGAC-3′) and RPOE2 (5′-GCAAGGATCCTCAACGCCTGATAAGCGGTT-3′), the latter of which encodes a BamHI site (GGATCC). The PCR product was digested with BamHI to create one overlapping end, and then ligated into vector DNA prepared from pTrc99A by digesting with EcoRI, treating with Klenow enzyme to produce a blunt end, and then digesting the vector with BamHI. The final construct was confirmed by sequencing.
Strain growth and probe preparation for microarray analysis. To identify genes that alter their expression upon overexpressing σE, time-course microarray experiments were performed with the strain CAG25196 (MG1655 ΔlacX74 [ΦλrpoH P3::lacZ]) carrying the control vector, pTrc99A, versus CAG25197, which carries the IPTG-inducible rpoE overexpression vector, pLC245 (Table 7). Samples containing the control vector were labeled with Cy3 (green), and rpoE overexpression samples were labeled with Cy5 (red). Cells were grown in M9 complete minimal medium with appropriate antibiotics in order to maximize the number of genes expressed, rather than in a rich medium such as LB (Luria broth) [67]. 500-ml conical flasks containing 100 ml of medium were inoculated from fresh overnight cultures to a final OD450 = 0.03, or 0.035 for strains carrying the plasmid pTrc99A due to the fractionally slower growth rate. Cultures were grown aerobically at 30 °C in a gyratory water bath (model G76 from New Brunswick Scientific, Edison, New Jersey, United States) shaking at 240 rpm until OD450 = 0.3. Cultures were then induced with a final concentration of 1 mM IPTG and incubation resumed as before. Immediately prior to induction, and at 2.5, 5, 10, 15, 20, 30, and 60 min after induction, 1-ml and 8-ml samples were removed for microarray analysis.
Culture samples for microarray analysis were added to ice-cold 5% water-saturated phenol in ethanol solution, centrifuged at 6,600 × g, and the cell pellets flash-frozen in liquid N2 before storing at −80 °C until required. Labeled probe for microarray analysis was prepared as described in [68]. Briefly, total RNA was isolated from the stored cell pellets using the hot phenol method, and labeled Cy3 and Cy5 cDNA was prepared from 16 μg of total RNA with 10 μg of random hexamer (Integrated DNA Technologies, Coralville, Iowa, United States) using the indirect labeling method.

Vector, SC101 ori, KanR. GFP reporter plasmid carrying GFPmut2, used to measure the activity of σE promoter fragments cloned in the upstream XhoI-BamHI sites [82]. DOI: 10.1371/journal.pbio.0040002.t007

DNA microarray procedures. Relative mRNA levels were determined by parallel two-color hybridization to glass slide cDNA microarrays [69]. PCR products of 4,110 ORFs representing 95.8% of E. coli ORFs were prepared according to [70] using primers from SigmaGenosys (The Woodlands, Texas, United States). The products were spotted onto glass slides to make DNA arrays as described in protocols on http://derisilab.ucsf.edu/core/resources/index.html. Samples were hybridized to the arrays and scanned as described in [68]. The resulting TIFF images were analyzed using GenePix 3.0 software (Axon Instruments, Union City, California, United States) and the data stored on an AMAD database (software available from http://derisilab.ucsf.edu/core/resources/index.html).
Expression data analysis. Expression data were normalized using the assumption that the quantity of initial mRNA was the same for both samples [71]. To correct for intensity (dye)-dependent biases, we used intensity-dependent normalization [72,73]. For each gene spot on an array, the green (Cy3) fluorescent intensity was defined as G = (F532 Median − B532) and the red (Cy5) fluorescent intensity was defined as R = (F635 Median − B635), where the local background intensity (B532, B635) is subtracted from the median foreground intensity (F532 Median, F635 Median). The data were filtered to exclude all R and G values less than 3 × local background. For each microarray experiment, an "MA-plot" was used to represent the (R,G) data, where M = log2(R/G) and A = log2 √(R × G). A local A-dependent normalization was performed by fitting a normalization curve using the robust scatter plot smoother "lowess" implemented in the statistical software package R, such that log2(R/G) → log2(R/G) − c(A), where c(A) is the lowess fit to the MA-plot. The fraction of data used for smoothing each point was 50%. Statistically significant differentially expressed genes were identified from replicate microarray experiments using the SAM software ([26]; http://www-stat.stanford.edu/~tibs/SAM/index.html). SAM employs gene-specific t tests and, by analyzing permutations of the t scores from the dataset, derives a false discovery rate (percentage of genes identified by chance) for a user-selected cutoff threshold (the lowest false discovery rate at the median percentile). The rpoE time-course expression data revealed that genes that altered their expression in response to rpoE did so within 10 min after induction. Therefore, in each of the four time-courses, time points from 10 min onwards were considered replicates and averaged to create four independent datasets. These data were then filtered for presence in at least 75% of datasets, and significant genes were identified using a stringent cutoff of the lowest false discovery rate (0.95%) at the median percentile.
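The MA transformation described under "Expression data analysis" can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, the filter is one assumed reading of "less than 3 × local background", and the lowess curve c(A) is passed in as an arbitrary callable instead of being fitted here.

```python
import math

def ma_values(f635, b635, f532, b532, min_fold=3.0):
    """Return (M, A) for one spot, or None if the spot is filtered out."""
    r = f635 - b635  # red (Cy5) intensity, background-subtracted
    g = f532 - b532  # green (Cy3) intensity, background-subtracted
    # Assumed reading of the filter: drop spots whose R or G is below
    # min_fold x the local background of that channel.
    if r <= 0 or g <= 0 or r < min_fold * b635 or g < min_fold * b532:
        return None
    m = math.log2(r / g)             # M = log2(R/G)
    a = math.log2(math.sqrt(r * g))  # A = log2(sqrt(R x G))
    return m, a

def normalize_m(m, a, fit):
    """Subtract the intensity-dependent trend c(A) from M.

    `fit` stands in for the lowess curve; any callable A -> c(A) works.
    """
    return m - fit(a)
```

In practice `fit` would be an interpolation of a lowess smooth over all (A, M) pairs on one array, with a smoothing fraction of 0.5 as stated in the text.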
5′ RACE PCR. The 5′ ends of σE-dependent transcripts were mapped using a new 5′ RACE method adapted from [74]. We chose this method because (1) it is highly sensitive, facilitating the detection of weakly expressed transcripts; and (2) sequencing the RACE products enables the precise identification of mRNA 5′ ends. Total RNA was extracted as described for microarray analysis from strains CAG25197 (rpoE+; Table 7) and CAG22216 (rpoE−; Table 7). Both strains were grown under the same conditions as for the microarray experiments, in M9 complete minimal medium with appropriate antibiotics to OD450 = 0.3; samples from CAG22216 were harvested at that point, while CAG25197 was induced with 1 mM IPTG for 1 h before harvesting. Fourteen micrograms of total RNA was treated with 5 U tobacco acid pyrophosphatase (TAP; Epicentre Technologies, Madison, Wisconsin, United States) to remove the 5′ γ and β phosphates from the RNA, and the samples were cleaned by organic extraction and ethanol precipitation. One hundred picomoles of RNA oligo (5′-GAGGACUCGAGCUCAAGC-3′; MWG Biotech, Ebersberg, Germany) was then ligated onto the 5′ ends of the TAP-treated RNA using 5 U T4 RNA Ligase (Epicentre Technologies), and the samples were again cleaned by organic extraction and ethanol precipitation. The oligo-ligated RNA was then used as template for reverse transcription reactions using 200 U SuperScript II RT (Invitrogen, Carlsbad, California, United States). In each series of experiments, 20 ng each of up to 40 gene-specific primers (GSP1; sequences available on request) were used in the same reaction to generate a library of cDNAs corresponding to the mRNAs of up to 40 putative σE-regulated genes. The production of full-length cDNAs was increased by reducing RNA secondary structure through incubation of the reaction at increasingly higher temperatures: 37 °C for 1 h, 42 °C for 30 min, and 50 °C for 10 min. A dilution of the reverse-transcription reaction was then used as template for PCR amplification in the presence of a DNA primer containing a sequence complementary to the ligated RNA oligo sequence, and a second gene-specific primer (GSP2) for each gene that is closer to the promoter. A separate PCR reaction was performed with each GSP2 primer and the products visualized by 7.5% PAGE. Most of the tested genes yielded multiple PCR products, suggesting multiple promoters. Thus, to identify σE-dependent transcripts for each gene, PCR products were compared from cDNA generated from CAG25197 (rpoE+) and CAG22216 (rpoE−) cells; products present from only the rpoE+ reactions were considered σE-dependent transcripts. These products were gel-purified from 7.5% PAGE gels, electroeluted, and sequenced using the appropriate GSP2 primer. The transcription start site was defined as the nucleotide immediately preceding the sequence corresponding to the ligated RNA oligo sequence. In some cases, two adjacent start sites could be discerned by the appearance of a second RNA oligo sequence 1 nt out of frame from the first after reading the genome sequence.
Identifying σE promoter elements upstream of transcription starts mapped by 5′ RACE. WCONSENSUS [75] was used to identify the different conserved σE promoter elements using a method similar to [6]. We note that BioOptimizer is also a suitable alternative since it can identify two-block motifs separated by a variable spacer [76]. WCONSENSUS generates optimal matrices of aligned sequence motifs based on maximizing information content and minimizing the expected frequency of finding the matrix by chance given the known sequences. Matrices were selected using the second cycle in which every sequence contributes to the final alignment. A range of sequence windows of different widths were searched to identify optimal matrices describing −10 and −35, start site, and upstream elements. Optimal matrices for the −10 motif were identified by searching sequence windows −1 to −16, and for the −35 by searching a 16-nt window 9 nt upstream of the identified −10 motif.
σE promoter predictions using PWMs. The information content (I_seq) of aligned σE promoter motifs was calculated using:

$$I_{seq} = \sum_{i} \sum_{b} f_{b,i} \log_2 \frac{f_{b,i}}{p_b}$$

where i is the position within the site, b refers to each of the possible bases, f_{b,i} is the observed frequency of each base at that position, and p_b is the frequency of base b in the entire genome (in E. coli taken to be 0.25 for A/G/C/T). The aligned σE promoter sequences were visualized using sequence logo ([78]; http://weblogo.berkeley.edu/). PWMs (W_{b,i}) for each of the σE promoter elements (PWM_UP, PWM_−35, PWM_−10, and PWM_+1) were built using the method of [79] from n_{b,i}, the number of bases b at position i in the aligned sequences, and N, the total number of aligned sequences. A pseudo count of 0.1 was added for each base b for the Bayesian estimate. The relative binding affinity of σE to a DNA sequence of length L (equal to the length of the PWM) is given by the score:

$$E = \sum_{i=1}^{L} W_{b,i}$$

(where b corresponds to the nucleotide at position i within the sequence fragment of length L), such that a high score corresponds to a high-affinity site with a close match to the consensus sequence, while a low score corresponds to a low-affinity site with a poor match to the consensus. The PWM was calibrated by scoring all the sequences used to build the matrix (E_w), and the distribution of the scores is described by their mean (μ_w) and standard deviation (σ_w). Potential σE target sites in the E. coli genome were identified by calculating the score E_g of every possible sequence window of length L in both strands of the genomic sequence and computing the mean (μ_g) and standard deviation (σ_g) of the distribution. Predicted sites were made by selecting all genomic scores E_g greater than a cutoff, S_0, of two standard deviations below the mean of the PWM scores (μ_w − 2σ_w).
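The PWM workflow described above (information content, weight matrix, window scoring, and the μ_w − 2σ_w cutoff) can be sketched as below. Several details are assumptions and not taken from the paper: the weight is written as the natural log of the pseudocount-adjusted frequency (n_{b,i} + 0.1)/(N + 0.4) divided by p_b, which is one plausible reading of "a pseudo count of 0.1 was added for each base"; the population standard deviation is used; only one strand is scanned; and the toy alignment is invented for illustration.

```python
import math
import statistics

BASES = "ACGT"

def column_counts(seqs):
    """Count each base at each position of equal-length aligned sequences."""
    return [{b: sum(s[i] == b for s in seqs) for b in BASES}
            for i in range(len(seqs[0]))]

def information_content(seqs, p=0.25):
    """I_seq in bits: sum over positions and bases of f * log2(f / p)."""
    n = len(seqs)
    bits = 0.0
    for col in column_counts(seqs):
        for b in BASES:
            f = col[b] / n
            if f > 0:
                bits += f * math.log2(f / p)
    return bits

def build_pwm(seqs, p=0.25, pseudo=0.1):
    """Log-odds weights with a per-base pseudocount (assumed functional form)."""
    n = len(seqs)
    return [{b: math.log(((col[b] + pseudo) / (n + 4 * pseudo)) / p)
             for b in BASES}
            for col in column_counts(seqs)]

def score(pwm, site):
    """E: sum of the weights of the bases observed at each position."""
    return sum(col[base] for col, base in zip(pwm, site))

def cutoff(pwm, seqs):
    """S_0: two standard deviations below the mean score of the training sites."""
    scores = [score(pwm, s) for s in seqs]
    return statistics.mean(scores) - 2 * statistics.pstdev(scores)

def scan(pwm, sequence, threshold):
    """Return (start, score) for every window scoring above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(pwm, sequence[start:start + width])
        if s > threshold:
            hits.append((start, s))
    return hits

motifs = ["TCAAA", "TCAAA", "TCTAA", "TCAAA"]  # toy alignment, not real data
pwm = build_pwm(motifs)
hits = scan(pwm, "GGGTCAAAGGG", cutoff(pwm, motifs))
```

A genome-scale version would also scan the reverse complement, as the text specifies both strands.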
A penalty score adapted from the methods of [5] and [7] was applied to predicted promoters for suboptimal spacing between the +1, −10, and −35 motifs based on the observed spacing frequency for the known σE promoters. The spacer penalty was determined by taking the natural logarithm of an approximated spacer frequency normalized by the approximated frequency of the most frequently occurring spacer class. For each promoter, this was calculated for three spacers and summed to give a total spacer penalty: +1 to −10 (discriminator); −10 to −35 (spacer); and +1 to −35 (total).
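A minimal sketch of the spacer penalty follows. It assumes raw observed counts stand in for the "approximated" spacer frequencies, whose smoothing is not specified in the text; the function names and the example counts are hypothetical.

```python
import math

def penalty_table(counts):
    """Map spacer length -> ln(frequency / frequency of the commonest length).

    `counts` maps each spacer length to the number of known promoters with
    that spacing. The most frequent length gets a penalty of 0; all others
    are negative. Lengths never observed are left out of the table.
    """
    top = max(counts.values())
    return {length: math.log(n / top) for length, n in counts.items() if n > 0}

def total_spacer_penalty(tables, spacings):
    """Sum the discriminator, spacer, and total penalties for one promoter."""
    return sum(tables[name][spacings[name]] for name in tables)

# Hypothetical spacing histograms, for illustration only.
tables = {
    "discriminator": penalty_table({6: 10, 5: 5}),
    "spacer": penalty_table({16: 20, 15: 5}),
    "total": penalty_table({28: 8, 27: 4}),
}
penalty = total_spacer_penalty(
    tables, {"discriminator": 5, "spacer": 16, "total": 27})
```

Because unobserved spacings are absent from the table, a lookup for one raises `KeyError`; a real implementation would need a rule for such cases, which the text does not give.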
A total score (S_p) was calculated for each predicted promoter. The predicted promoter scores, S_p, were calibrated by scoring the known promoter sequences used to build the matrices (S_k) to derive a distribution with mean (μ_k) and standard deviation (σ_k). The S_p scores were then converted to a promoter z-score: Z_p = (S_p − μ_k)/σ_k.
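The z-score calibration can be illustrated with a few lines of Python. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def promoter_z_scores(known_scores, predicted_scores):
    """Convert total promoter scores S_p to z-scores against known promoters."""
    mu_k = statistics.mean(known_scores)
    sigma_k = statistics.stdev(known_scores)  # sample SD (assumed)
    return [(s - mu_k) / sigma_k for s in predicted_scores]
```

Under this scaling, a prediction with Z_p = 0 scores like an average known promoter, and the z-score > −2 filter used elsewhere in the text keeps predictions scoring no worse than two standard deviations below that average.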
In vitro transcription assays. Single-round in vitro transcription assays were employed to test predicted σE promoters. DNA templates were prepared by PCR from genomic DNA (primer sequences available on request) to create fragments with the promoter of interest contained within flanking sequences 100 nt downstream and 200 nt upstream of the predicted transcription start point. RNA polymerase core enzyme was purified as described in [80], and His6-tagged σE was purified using a Qiagen Ni2+ affinity column per manufacturer's instructions (Valencia, California, United States). The transcription assays were performed as described in [81] with the following modifications: Binding reactions (12 μl) contained 50 nM template DNA, 250 nM core RNA polymerase, 500 nM σE, 5% glycerol, 20 mM Tris (pH 8.0), 300 mM KAc, 5 mM MgAc, 0.1 mM EDTA, 1 mM DTT, 50 μg/ml BSA, and 0.05% Tween. Single-round transcriptions were initiated with 4 μl of "NTP + heparin mix" (to give a final concentration of 200 μM each NTP and 100 μg/ml heparin in 1× binding buffer), incubated for 5 min at 37 °C, and then terminated with 8 μl of 25 mM EDTA. The reactions were extracted with phenol and chloroform, precipitated with ethanol, and resuspended in 8 μl of H2O. The RNA transcripts were then used as templates in labeled reverse-transcription reactions using a primer ~100 nt downstream of the predicted transcription start point (same as the downstream PCR primer used to create the template DNA).

In vivo promoter assays. Promoters to be validated were cloned on XhoI-BamHI fragments into the green fluorescent protein (GFP) reporter plasmid, pUA66 (Table 7; [82]), upstream of the gene GFPmut2 [83]. The promoter fragments were generated by PCR from genomic DNA in which the upstream and downstream primers contained an XhoI and BamHI site, respectively, and amplified genomic promoter sequence from −65 to +20 with respect to the predicted transcription start point. Cloned promoter constructs were confirmed by sequencing. Reporter strains were generated by transforming the plasmid constructs into strains CAG25196 and CAG25197 carrying the pTrc99A vector and the rpoE expression plasmid, pLC245, respectively (Table 7). Promoter assays were performed by direct inoculation of Luria broth supplemented with appropriate antibiotics from frozen glycerol stocks. One hundred fifty-microliter cultures were grown in covered 96-well U-bottom tissue culture plates overnight at 30 °C with shaking at 400 rpm. The cultures were then diluted 1:50 into fresh 96-well plates containing Luria broth supplemented with appropriate antibiotics and 1 mM IPTG. Cultures were grown as before for up to 23 h, and fluorescence was measured in a Spectra Max Gemini XS 96-well fluorometer and OD600 in a Spectra Max 340 96-well spectrophotometer (Molecular Devices, Sunnyvale, California, United States). σE-dependent promoter activity was determined by first subtracting the background fluorescence/OD600 readings of CAG25196 and CAG25197 cells bearing a promoterless GFP vector from the readings of CAG25196 and CAG25197 cells carrying the same promoter construct, and then subtracting the CAG25196 from the CAG25197 readings for each promoter. Four independent assays were performed for each promoter construct. A promoter was judged to be σE dependent if the standard deviation of the four assays did not overlap with those of the promoterless GFP vector; this translated to a σE-dependent signal at least three times greater than background. This approach was validated by confirming σE-dependent activity of 42 of 49 verified E. coli K-12 σE promoters.
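The two-step background subtraction used in the in vivo promoter assays can be written out as a short sketch. The function names are hypothetical, and the decision rule is one assumed interpretation of "the standard deviation of the four assays did not overlap": it requires the promoter mean minus one SD to exceed the promoterless-vector mean plus one SD.

```python
import statistics

def sigma_e_dependent_signal(promoter_plus, promoter_minus,
                             blank_plus, blank_minus):
    """Compute the sigma-E-dependent signal for one assay.

    All inputs are fluorescence/OD600 readings. "plus" is the rpoE
    overexpression strain (CAG25197), "minus" the vector-only strain
    (CAG25196); "blank" is the promoterless GFP vector in the same strain.
    """
    corrected_plus = promoter_plus - blank_plus     # step 1: remove background
    corrected_minus = promoter_minus - blank_minus
    return corrected_plus - corrected_minus         # step 2: plus minus minus

def is_sigma_e_dependent(promoter_signals, blank_signals):
    """Assumed criterion: mean +/- SD intervals of the replicates do not overlap."""
    low = statistics.mean(promoter_signals) - statistics.stdev(promoter_signals)
    high = statistics.mean(blank_signals) + statistics.stdev(blank_signals)
    return low > high
```

The replicate lists passed to `is_sigma_e_dependent` would each hold four values, one per independent assay.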
σE promoter predictions in related genomes. Promoter predictions were made in genomes as described for E. coli K-12 using genome sequence files (*.fna) and annotation files (*.ptt) downloaded from the NCBI FTP database (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) on 6 August 2004. For each genome, promoter predictions were plotted as a function of promoter z-score versus distance upstream of the nearest ORF in the same direction (see Figure 3A). A topographic plot of promoter z-score versus distance upstream was then constructed in which the x and y axes were divided into 200-nt and 1 unit bins, respectively, and the number of predictions falling within each bin (P_A) determined (see Figure 3B). Significant predictions were identified by comparing against predictions made in genomes containing randomized sequences.

Randomized genomes were constructed to mimic the structures of real genomes but in which the nucleotide sequence of each structure was randomized. For each genome, the percentage nucleotide content was determined for all divergent IGs (dIGs), convergent IGs (cIGs), IGs less than 50 nt in the same direction as adjacent ORFs (short IGs), and IGs greater than 50 nt in the same direction as adjacent ORFs (long IGs). Finally, for each genome the average codon usage was determined for all ORFs. Randomized genomes of identical sizes were then constructed in which the size, orientation, and location of all the genomic structures were maintained but in which the nucleotide sequences were randomized while maintaining the average codon usage for all ORFs and the average nucleotide content for all dIGs, cIGs, long IGs, and short IGs. For each genome, promoter predictions were made from 100 randomized genomes, and, using the same bins as for the actual genomes, an averaged topographic plot was constructed that recorded the average number of predictions within each bin (P̄_R; see Figure 3C). For each bin of the actual genome topographic plot, a FPR was calculated that compared the average number of predictions in the 100 randomized genomes (P̄_R) with the number of predictions in the actual genome (P_A). In addition, for each bin, the significance of obtaining the observed number of predictions from the actual genome (P_A) given the average number of predictions from the randomized genomes (P̄_R) was calculated based on the Poisson distribution to derive a p-value. All promoter predictions in actual genomes were assigned a FPR and p-value based on the bin where they were located. Promoter predictions for an actual genome were determined significant if, in general, FPR < 0.5 and p < 0.05, with the FPR cutoff being the stricter filter. Additional filters of promoter z-score > −2 and distance upstream < 1,100 nt were also applied to prevent spurious results in some genomes.
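The per-bin significance test can be sketched as follows. The display equation for the FPR did not survive in the text, so the ratio P̄_R / P_A used here is an assumption based on the verbal description; likewise the p-value is taken to be the upper-tail Poisson probability of seeing at least P_A predictions when the expected number is P̄_R. Function names are hypothetical.

```python
import math

def false_positive_rate(p_actual, p_random_mean):
    """Assumed form: average random-genome count divided by the actual count."""
    if p_actual == 0:
        return None  # no predictions in this bin, so no FPR to assign
    return p_random_mean / p_actual

def poisson_p_value(p_actual, p_random_mean):
    """P(X >= p_actual) for X ~ Poisson(p_random_mean)."""
    if p_actual == 0:
        return 1.0
    below = sum(math.exp(-p_random_mean) * p_random_mean ** k / math.factorial(k)
                for k in range(p_actual))
    return 1.0 - below

def is_significant(p_actual, p_random_mean, z_score, distance):
    """Apply the general filters stated in the text to one prediction's bin."""
    fpr = false_positive_rate(p_actual, p_random_mean)
    if fpr is None:
        return False
    return (fpr < 0.5
            and poisson_p_value(p_actual, p_random_mean) < 0.05
            and z_score > -2
            and distance < 1100)
```

Each prediction inherits the FPR and p-value of the 200-nt by 1-unit bin it falls into, so `p_actual` and `p_random_mean` are bin-level quantities while `z_score` and `distance` belong to the individual prediction.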
Conserved σE promoter predictions. A database of protein orthologs across the genomes was constructed using the program BLAST and the NCBI protein sequence files (*.faa) for each genome. Orthologs were defined as the highest scoring hit in a target genome, which, when the matching sequence was used to search the original genome, identified the same search sequence as the highest scoring match. All coding sequences in the genomes were organized into putative TUs defined as all adjacent ORFs in the same orientation separated by less than 50 nt [84]. Using the protein ortholog database, conserved TUs across genomes were identified as those containing at least one protein ortholog. In some instances, a TU in one genome may match more than one TU in other genomes due to the location of constituent ORFs becoming separated. Conserved promoter predictions were defined as predictions from the promoter prediction libraries less than 1,100 nt upstream of all orthologous TUs and scoring, in general, promoter z-score > −2, distance upstream < 1,100 nt, FPR < 0.5, and p < 0.05 in at least one genome. Given that each promoter library contains approximately 150 predictions with z-score > −2 at distances < 1,100 nt upstream, and each genome contains on average 4,500 genes, the probability of a matching promoter occurring by random chance for a particular search promoter is 150 of 4,500, or 0.033.

Table S1. Highly Significant and Conserved σE Promoter Predictions across Nine Closely Related Genomes
Orthologous TUs are displayed on the same row; note that only one gene in each TU needs to be an ortholog. Genes within a TU are separated by "=" in the following fields: Unique ID (unique identification number from NCBI ptt file); Gene (Gene name); Function (Gene description from NCBI ptt file). Promoter predictions are given in the fields Distance (number of nucleotides of +1 position upstream of translation start point of the first gene in the TU) and Score (total promoter z-score; see Materials and Methods). If there is no promoter prediction for that TU, these two fields just contain "-." Promoter predictions for E. coli K-12, E. coli CFT073, and S. typhimurium highlighted in gray in the distance and score fields have been validated by in vitro transcriptions and/or in vivo promoter assays. Promoter predictions in E. coli CFT073 that are conserved with E. coli K-12 are presumed functional based on their high level of conservation and were not tested. See Figure S1 for abbreviations. Found at DOI: 10.1371/journal.pbio.0040002.st001 (100 KB XLS).
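The ortholog and TU definitions given under "Conserved σE promoter predictions" amount to a reciprocal-best-hit test and a simple gap rule, sketched below. The inputs are assumed to be precomputed (best BLAST hits as dictionaries, genes as coordinate tuples), the function names and gene identifiers are hypothetical, and the gap is computed for inclusive coordinates, which is an assumption about the annotation format.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs where each protein is the other's highest-scoring hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

def group_into_tus(genes, max_gap=50):
    """Group adjacent same-strand ORFs separated by fewer than max_gap nt.

    `genes` is a list of (name, start, end, strand) tuples sorted by start,
    with inclusive coordinates.
    """
    tus = []
    for gene in genes:
        _, start, _, strand = gene
        if tus:
            _, _, prev_end, prev_strand = tus[-1][-1]
            gap = start - prev_end - 1
            if strand == prev_strand and gap < max_gap:
                tus[-1].append(gene)
                continue
        tus.append([gene])
    return tus

# Hypothetical identifiers, for illustration only.
orthologs = reciprocal_best_hits({"g1": "h1", "g2": "h2"},
                                 {"h1": "g1", "h2": "g3"})
tus = group_into_tus([("a", 1, 100, "+"), ("b", 120, 300, "+"),
                      ("c", 400, 500, "+"), ("d", 510, 600, "-")])
chance_match = 150 / 4500  # random-match estimate quoted in the text
```

A TU is then treated as conserved between two genomes when at least one of its genes appears in the reciprocal-best-hit table.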

Supporting Information
Table S2. σE Regulon Members in Nine Closely Related Genomes Organized into the Functional Categories Displayed in Table 5
Orthologous proteins are displayed on the same row. Proteins in parentheses are part of TUs observed to be regulated in E. coli K-12 and, based on TU conservation, are assumed to be part of the regulon in the related genomes. Validated predictions for E. coli K-12, E. coli CFT073, and S. typhimurium are highlighted in gray. Predictions in E. coli CFT073 that are conserved with E. coli K-12 are presumed functional based on their high level of conservation and were not tested. See Figure S1 for abbreviations.

Figure 1. Expression Profiles of σE Regulon Members
Significantly regulated genes identified from genome-wide transcription profiling following comparison of rpoE overexpressed (CAG25197) versus wild-type (CAG25196) E. coli K-12 MG1655 cells. The color chart illustrates the expression level for each gene from an average of four time-course experiments (see Materials and Methods). Red denotes induced, and green denotes repressed genes in CAG25197 following rpoE induction. Fold change of mRNA levels (rpoE overexpressed/wild-type) is indicated by the scale at the bottom of the figure; time in minutes after induction of rpoE in the time-course experiments is indicated at the top of the figure. Genes are identified by their unique ID and name (Gene ID) and are listed in chromosomal order to illustrate the TUs; the direction of transcription is indicated. DOI: 10.1371/journal.pbio.0040002.g001

Figure 2. Sequence Logos and Spacer Histograms of σE Promoter Motifs
Motifs were identified upstream of the 28 mapped transcription starts in E. coli K-12. (A) Sequence logos (http://weblogo.berkeley.edu/; [78]) of the −35, −10, and +1 start site motifs and the A/T rich UP sequences. The information content (I_seq) of each motif is indicated (see Materials and Methods). (B-D) Histograms of the number of promoters versus distances between the motifs identified in (A): (B) +1 start and −35 motifs; (C) −10 and −35 motifs; and (D) +1 start and −10 motifs. Distances between the −35, −10, and +1 start motifs are from the conserved GGAACTT, TCAAA, and A/G sequences, respectively, as marked in (A). Note that the weakly conserved spacer sequence appeared to associate with the −10 motif and was therefore incorporated into PWM_−10. DOI: 10.1371/journal.pbio.0040002.g002

a Distance from translation start point of yaeT.
b Distance from translation start point of b2512.
c Distance from translation start point of dsbC.
d Distance from translation start point of yraP.
e Distance from translation start point of yhbG.
f
g Distance from translation start point of narW.
h Distance from translation start point of lhr.
i Distance from translation start point of wzb.
DOI: 10.1371/journal.pbio.0040002.t001

How Well Does Our σE Promoter Model Perform in E. coli K-12?

Figure 3. σE Promoter z-Scores versus Distance Upstream of the Nearest Gene in Actual and Randomized E. coli K-12 Genomes
Only promoters less than 2,000 nt upstream of target genes are shown. (A) Scatter plot of predicted (diamonds) and known (circles) σE promoters in E. coli K-12 MG1655. (B) Topographic plot of predicted σE promoters in E. coli K-12 MG1655. The x and y axes are divided into 200-nt and 1-unit bins, respectively, and the number of predictions falling within each bin is indicated colorimetrically as shown in the scale. Note that the data in this plot are the same as the predictions in (A). Bins containing significant predictions are indicated by yellow ovals. (C) Topographic plot indicating the average number of predicted σE promoters made from 100 randomized E. coli K-12 MG1655 genomes in silico (see Materials and Methods). Each bin shows the average number of predictions from 100 separate randomized genomes that fall within the parameters of that bin. DOI: 10.1371/journal.pbio.0040002.g003
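The binning behind the topographic plots in Figure 3B and 3C can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: predictions are assumed to be (distance upstream in nt, z-score) pairs, the function names are hypothetical, and the example values are invented.

```python
import math
from collections import Counter


def bin_predictions(predictions, x_width=200, y_width=1.0, max_distance=2000):
    """Count predictions per (distance bin, z-score bin).

    predictions: iterable of (distance_upstream_nt, z_score) pairs.
    Predictions at or beyond max_distance are dropped, mirroring the
    2,000-nt limit stated in the caption.
    """
    counts = Counter()
    for distance, z_score in predictions:
        if not 0 <= distance < max_distance:
            continue
        x_bin = int(distance // x_width)
        y_bin = math.floor(z_score / y_width)
        counts[(x_bin, y_bin)] += 1
    return counts


def average_bins(per_genome_counts):
    """Average bin counts over several genomes (e.g., randomized genomes)."""
    if not per_genome_counts:
        raise ValueError("at least one genome is required")
    total = Counter()
    for counts in per_genome_counts:
        total.update(counts)  # Counter.update adds counts together
    return {key: value / len(per_genome_counts) for key, value in total.items()}


# Invented example: three predictions within 2,000 nt and one beyond it.
example = [(50, 3.2), (150, 3.9), (250, 3.1), (2500, 5.0)]
example_bins = bin_predictions(example)
```

Comparing a bin's count in the actual genome with its average over randomized genomes is one way to judge which bins hold more predictions than expected by chance.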

Figure 4. Venn Diagram of Predicted and Known σE Promoters in E. coli K-12
A total of 39 predictions from the promoter library were identified as highly significant, of which 37 were confirmed. A total of 49 known σE promoters were confirmed from the literature and additional experiments, of which 37 were successfully identified by the promoter prediction model (see text; Table 2). DOI: 10.1371/journal.pbio.0040002.g004

Figure 5. Functions of the Highly Conserved σE Core Regulon Members
Stresses such as heat lead to the accumulation of unassembled OMPs; this activates the sequential proteolysis of the membrane-spanning antisigma RseA [12,54]. The inner membrane proteases DegS [b3235] and RseP [b0176] release the cytoplasmic portion of RseA, which is then degraded by the cytoplasmic proteases ClpX [b0438] and Lon [b0439] ([85]; R. Chaba unpublished data) to release free σE, which then binds to RNA polymerase core to regulate the expression of target regulon members. σE up-regulates functions required for synthesis, assembly, and/or insertion of both OMPs and LPS, the most abundant components of the outer membrane, as well as envelope-folding catalysts and chaperones. σE also up-regulates expression of itself and its negative regulator RseA and enhances expression of GreA [b3181] and σ32 [b3461]. Importantly, σE down-regulates OMP expression, thereby reducing the accumulation of unassembled OMPs, which presumably limits the duration of the response. DOI: 10.1371/journal.pbio.0040002.g005

Primers were annealed by incubating with the template for 10 min at 70 °C before chilling on ice. The reverse transcription reactions (15 μl) contained 8 μl of template RNA, 10 μM primer, 1× StrataScript RT Buffer, 50 U StrataScript RNase H− RT (Stratagene, La Jolla, California, United States), 200 μM dCTP/dGTP/dTTP, 10 μM dATP, 6 μCi [α-32P] dATP (3,000 Ci/mmol; 110 TBq/mmol), and 8 U RNase Inhibitor (Boehringer Mannheim, Mannheim, Germany). Reactions were incubated at room temperature for 10 min and then at 42 °C for 1 h 50 min, before terminating with 9 μl of stop solution (95% deionized formamide, 25 mM EDTA, 0.05% [w/v] bromophenol blue, and 0.05% [w/v] xylene cyanol FF). The cDNA transcripts were resolved by electrophoresis after heating at 90 °C for 2 min and loading 8 μl on a 6% denaturing polyacrylamide sequencing gel together with DNA sequencing reactions that functioned as size markers. Transcripts were visualized using a Molecular Dynamics Storm 560 Phosphorimager scanning system (Sunnyvale, California, United States).

Figure S1. Amino Acid Sequence Alignments of Conserved DNA-Binding Regions of σE across Eight Genomes
The RpoE (σE) sequences are aligned against RpoD (σ70) based on the structural alignment in [30]. Residues inferred to be involved in DNA interactions are based on σ70 [31] and are highlighted in yellow. (A) Alignments of conserved regions 2.2-3.0 involved in −10 promoter recognition.

Table 1. σE Regulon Members in E. coli K-12

Table 2. Genome-Wide σE Promoter Predictions in E. coli K-12
Distance between +1 and −35 of 25-28 nt; (6) Overlapping promoters (4-nt overlap); (7) Significant predictions (FPR < 0.5; p < 0.05; z-score ≥ μ − 2σ; distance upstream < 1,100 nt). Number of predictions (all predictions using the PWMs with a cutoff of ≥ μ − 2σ), 5′ RACE-identified sites, Rezuchova sites (promoters identified by [29]), and Total sites (total number of known promoters) indicate the number of promoters remaining or detected by the model after each filter was applied. The starting number of promoters is indicated in parentheses with each title. Sensitivity describes the ability of the model to detect known promoters; Sensitivity = (Validated Predictions/Total sites (49)), where Validated Predictions is the number of Total sites predicted at that filter step. Precision gives the proportion of successful predictions of the model; Precision = (Validated Predictions/Number of Predictions), where Number of Predictions is the number remaining at that filter step. Accuracy describes the overall performance of the model; Accuracy = (Sensitivity + Precision)/2. DOI: 10.1371/journal.pbio.0040002.t002
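The three performance measures defined in the Table 2 legend can be worked through in a short sketch. The function names are hypothetical; the counts are those given in the Figure 4 legend (39 significant predictions, 37 of them confirmed, 49 known promoters), used here on the assumption that they correspond to the final filter step.

```python
def sensitivity(validated, total_known):
    """Fraction of known promoters that the model detects."""
    if total_known <= 0:
        raise ValueError("total_known must be positive")
    return validated / total_known


def precision(validated, n_predictions):
    """Fraction of the model's predictions that are validated."""
    if n_predictions <= 0:
        raise ValueError("n_predictions must be positive")
    return validated / n_predictions


def accuracy(validated, total_known, n_predictions):
    """Mean of sensitivity and precision, as defined in the Table 2 legend."""
    return (sensitivity(validated, total_known)
            + precision(validated, n_predictions)) / 2


# Counts taken from the Figure 4 legend.
sens = sensitivity(37, 49)   # 37/49, about 0.755
prec = precision(37, 39)     # 37/39, about 0.949
acc = accuracy(37, 49, 39)   # about 0.852
```

Note that Accuracy here is the unweighted mean of Sensitivity and Precision as the legend defines it, which differs from the usual classification accuracy based on true negatives.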

Table 4. Genome-Wide σE Promoter Predictions in Nine Related Genomes
5 and p < 0.05 in at least one genome. Number of conserved predictions relates to promoters not already identified by the significant prediction model. Nonconserved predictions are promoters present only in that genome. Predictions with no orthologs are promoters upstream of genes that have no orthologs in the other genomes. Total unique predictions is the total number of nonorthologous promoters. DOI: 10.1371/journal.pbio.0040002.t004

Table 5. Predicted Core σE Regulon Members
Orthologous genes predicted to be regulated by σE in six or more genomes. a Orthologous genes predicted in eight genomes. b Orthologous genes predicted in seven genomes. DOI: 10.1371/journal.pbio.0040002.t005

Table 6. Predicted Properties of σE Regulon Members across Nine Genomes

Table 7. Bacterial Strains and Plasmids Used in This Study