Structural Properties of Prokaryotic Promoter Regions Correlate with Functional Features

The structural properties of the DNA molecule are known to play a critical role in transcription. In this paper, the structural profiles of promoter regions were studied within the context of their diversity and their function for eleven prokaryotic species: Escherichia coli, Klebsiella pneumoniae, Salmonella Typhimurium, Pseudomonas aeruginosa, Geobacter sulfurreducens, Helicobacter pylori, Chlamydophila pneumoniae, Synechocystis sp., Synechococcus elongatus, Bacillus anthracis, and the archaeon Sulfolobus solfataricus. The main anchor points for these promoter regions were transcription start sites identified through high-throughput experiments or collected within large curated databases. Prokaryotic promoter regions were found to be less stable and less flexible than the genomic mean across all studied species. However, direct comparison between species revealed differences in their structural profiles that cannot solely be explained by the difference in genomic GC content. In addition, comparison with functional data revealed that there are patterns in the promoter structural profiles that can be linked to specific functional loci, such as sigma factor regulation or transcription factor binding. Interestingly, a novel structural element clearly visible near the transcription start site was found in genes associated with essential cellular functions and growth in several species. Our analyses reveal the great diversity in promoter structural profiles both between and within prokaryotic species. We observed relationships between structural diversity and functional features that are interesting prospects for further research into yet uncharacterized functional loci defined by DNA structural properties.


Introduction
The DNA molecule is not a uniform linear macromolecule as it is often represented but displays local structural variations that depend on its base composition and sequence [1,2]. This intrinsic variability in the DNA structure plays a functional role in a variety of biological processes [3]. The structure of a DNA molecule is primarily determined by its nucleotide sequence, thus similar DNA sequences generally have similar DNA structures. The reverse is however not necessarily true: DNA molecules with similar structural properties can arise from different sequences. This redundancy is the reason that the DNA structure is often considered as a separate information level from the DNA sequence, despite the inherent relationship between the two. Various properties of the molecular structure can be modeled using structural scales derived from theoretical simulations and/or experimental measurements [4-6]. The DNA molecule is highly variable at many different levels and thus many possible DNA structural properties can be characterized, from the local stability of the helical duplex to the global conformation of the molecule. These structural characteristics of genomic regions can be represented as structural profiles, where each position in the region is assigned a value that denotes a specific structural property of the DNA at this location [3].
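As an illustration of how a structural profile can be derived from a sequence, the following minimal Python sketch assigns a scale value to every dinucleotide step and smooths the result with a moving average. It is only a sketch under stated assumptions: the function name, the window size and the smoothing scheme are hypothetical, and the numbers in TOY_SCALE are placeholders, not the published values of any structural scale used in this study.

```python
from typing import Dict, List

# Hypothetical dinucleotide scale. These numbers are placeholders chosen for
# illustration only; they are NOT published structural-scale values.
TOY_SCALE: Dict[str, float] = {
    "AA": -5.0, "AT": -6.0, "TA": -4.0, "GC": -14.0, "CG": -10.0,
}
DEFAULT_VALUE = -8.0  # placeholder used for dinucleotides absent from the toy table


def structural_profile(seq: str, scale: Dict[str, float], window: int = 3) -> List[float]:
    """Assign a scale value to each dinucleotide step, then smooth the raw
    values with a centered moving average of the given window size."""
    seq = seq.upper()
    raw = [scale.get(seq[i:i + 2], DEFAULT_VALUE) for i in range(len(seq) - 1)]
    half = window // 2
    smoothed = []
    for i in range(len(raw)):
        # The window is truncated at the edges of the region.
        chunk = raw[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A sequence of length n yields n - 1 values, one per dinucleotide step; averaging such profiles over many aligned promoters gives the kind of mean profile discussed in the remainder of the paper.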
Several previous studies have analyzed structural profiles of prokaryotic promoter regions [7,8]. On average, most prokaryotic promoters appeared to be less stable, more rigid, and more extremely curved than other genomic regions [7,9-13]. However, these studies were based on the limited number of transcription start sites (TSS) that were available at the time. Recent years have seen a growing interest in the DNA structure, as its influence on a variety of genomic elements has been described, e.g. transcription factor binding sites (TFBS), nucleosome positioning and transposon insertion sites [3,14-16]. Also in the case of promoter regions, structural properties are widely used as the main, if not the only, feature in promoter classification [17-22]. For example, one of the most common approaches in this regard is the discovery of prokaryotic promoter regions based on the difference in the DNA duplex stability upstream and downstream from the TSS [9,23,24].
Recent technical advances have inspired us to re-evaluate the structural properties of the prokaryotic promoter region. Next generation sequencing allows the characterization of TSSs on a genome-wide scale with single-nucleotide precision with much greater ease than before. Additionally, new techniques now allow isolation of primary transcripts from the RNA pool to detect bona fide TSSs. Primary transcripts have a 5' triphosphate group, but they usually get quickly processed. This results in a new 5' end, carrying a monophosphate instead of a triphosphate, that might not represent the actual TSS. Enrichment for 5' triphosphate transcripts, e.g. by selective digestion of RNA molecules carrying a 5' monophosphate group or by ligating a biotinylated oligonucleotide, helps to enrich for primary mRNA and sRNA transcripts, whereas the processed transcripts and most of the cellular rRNA (which due to its polycistronic nature is mostly mature 5' monophosphate) are removed [25,26]. The use of these technologies has facilitated the study of TSSs such that there is now a wealth of detailed TSS data available for a variety of prokaryotic organisms. Furthermore, large-scale homogenized expression compendia were recently made available for several prokaryotic species. These compendia represent a rich resource to explore the coordination of genome-wide gene expression responses across a great variety of conditions [27]. For instance, studies have revealed insightful new patterns in prokaryotic expression regulation, such as the existence of large expression classes in prokaryotic genes which underlie massive lifestyle switches of single-cell organisms [28].
In this paper the structural properties of prokaryotic promoter regions were characterized with three goals in mind. The first goal was to describe the structural profiles of TSSs in model prokaryotic organisms, and to compare them to past observations made with more limited data sets in order to verify whether the common assumptions regarding their characteristics remain valid. The second goal was to gain novel biological insights regarding the role that the structural properties of the promoter DNA play in transcription. For example, we investigated whether certain patterns in the structural profiles can be linked to gene expression characteristics associated with these promoters. The third goal was to expand the analysis of promoter regions to an evolutionary context by comparing the structural profiles of related and more distant species.

The Promoter Regions Reported by Different Experimental Methods Display Similar Structural Profiles
As mentioned in the introduction, there are many experimental methods to determine the location of TSSs in a genome. An exact determination of the bona fide TSSs is crucial to all subsequent analyses, since the TSS delimits the downstream boundary of the promoter. The wealth of experimental data, particularly for E. coli, generated with different methods also allowed us to investigate whether the experimental approaches used to determine the TSS have any influence on the structural properties of the promoter region.
We can divide the E. coli TSS data into three broad categories based on their experimental origin: low-throughput experiments (LT) from classic single-object experiments, collected in curated databases; high-throughput experiments (HT), e.g. RNAseq without preprocessing; and high-throughput experiments enriched for primary transcripts (HTE), e.g. RNAseq preceded by terminator exonuclease treatment or adapter ligation. The overlap between the TSSs of these data sources, allowing for a deviation of at most three nucleotides in either direction, is presented in table 1. The low overlap between the three experimental categories can be partially explained by the fact that no data set contains a TSS for every transcription unit present in the E. coli genome. The regions from 200 nt upstream to 50 nt downstream from the TSSs were considered as the 'promoter region'. The downstream 50 nt are cautiously included, as some regulatory elements, such as repressor binding sites, are known to occur after the TSS [29]. The selection of structural properties for this study was based on their association with prokaryotic promoter activity reported in the literature. The structural scales used are base stacking energy, denaturation temperature, curvature, B-DNA twist, Z-DNAphilicity and major groove bendability [30-35].
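The overlap criterion and the promoter window described above can be sketched as follows. This is a minimal sketch and not the procedure actually used for table 1: the function names are hypothetical, positions are assumed to be integer genome coordinates on a single strand, and the strand-aware handling of the window is an assumption, since the text does not spell out how strands were treated.

```python
from bisect import bisect_left
from typing import List, Tuple


def count_overlap(tss_a: List[int], tss_b: List[int], tol: int = 3) -> int:
    """Count the positions in tss_a that have at least one partner in tss_b
    within +/- tol nucleotides (both lists assumed to be on the same strand)."""
    sorted_b = sorted(tss_b)
    hits = 0
    for pos in tss_a:
        # First position in tss_b that is >= pos - tol.
        i = bisect_left(sorted_b, pos - tol)
        if i < len(sorted_b) and sorted_b[i] <= pos + tol:
            hits += 1
    return hits


def promoter_window(tss: int, strand: str, up: int = 200, down: int = 50) -> Tuple[int, int]:
    """Return inclusive (start, end) genome coordinates of the promoter region,
    200 nt upstream to 50 nt downstream of the TSS by default.
    Assumption: on the '-' strand, upstream lies at higher coordinates."""
    if strand == "+":
        return tss - up, tss + down
    return tss - down, tss + up
```

For example, count_overlap([100, 200, 300], [103, 204, 297]) counts two matches, because 204 lies four nucleotides from 200 and therefore falls outside the tolerance.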
Comparison of the structural properties of the promoter regions between the three methodological categories reveals that most structural profiles are unaffected by the experimental method. The majority of the variation between the different average profiles of the promoter region was not significant and fell well within the intrinsic variation present between different promoter sequences from the same method. The only striking exception is the region about 10 nucleotides downstream of the TSS, where the average base stacking energy from HT experiments is higher than in both the LT and HTE profiles, as can be seen in figure 1. This difference in base stacking energy within a region of 10 nt was significant both between HT and LT (KS-test p-value 7.7×10^-9) and between HT and HTE (KS-test p-value 8.6×10^-28). Given the relation between DNA sequence and structure, it seems likely that there is also a difference in the nucleotide sequence at this position. Indeed, comparison of the nucleotide sequences reveals that the HTE-derived TSSs have a much higher frequency of either guanine or cytosine at the TSS position or downstream, compared to either the HT- or LT-generated TSSs (see Figure S1). As the remainder of the structural profiles derived from the other scales (and the consensus sequence) do display high similarity, the experimental method will likely have little impact on the overall findings. These small differences between the methods are thus not the major topic of the subsequent sections, and we combine the TSS data from different experimental methods in the following analyses.
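To make the statistical comparison concrete, the sketch below implements a two-sided, two-sample Kolmogorov-Smirnov test in plain Python. It is a minimal sketch under stated assumptions: the function name is hypothetical, the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction rather than an exact computation, and nothing here is claimed to be the implementation used in the study. A library routine such as scipy.stats.ks_2samp could be used instead.

```python
import math
from typing import Sequence, Tuple


def ks_two_sample(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (D, p) for the two-sided two-sample KS test.
    D is the maximum distance between the two empirical CDFs;
    p is an asymptotic approximation, not an exact p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Advance both pointers past every value <= v, so ties are handled jointly.
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series below converges slowly here and the true value is ~1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))
```

In the setting of this section, x and y would hold the structural values of two groups of promoters (for instance HT and LT) at one position or within one small window, and the test would be repeated for each position of interest.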

The Structural Profiles of the E. coli Promoter Regions Differ Greatly from the Genomic Mean
The E. coli TSSs constitute the best characterized and most comprehensive collection of data available. Furthermore, since E. coli is the most intensively studied prokaryote, there is also a rich amount of knowledge on functional elements around these promoter regions as well as on the downstream genes. For these reasons, E. coli constitutes a good starting point in this analysis. Therefore, TSSs from the seven data sets introduced previously, produced by many different types of experimental procedures, were combined into a single set of high quality TSSs. The E. coli promoter sequences could then be compared to their genomic background, based on randomly selected genomic regions of equal length. It is already apparent from the sequences themselves that the GC content of the promoter regions (45.31%) is lower than the genomic mean (50.79%) in E. coli. To discover whether the structural properties of promoter regions differ from those of randomly selected genomic sequences, the distribution of the mean value for the different structural features was compared between them, as shown in figure 2.
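The comparison against a genomic background can be sketched as below: GC content is computed per region, and background regions of equal length are drawn at random from the genome. This is a minimal sketch assuming the genome is available as a single string; the function names, the fixed random seed and the uniform sampling of start positions are assumptions, not details given in the text.

```python
import random
from typing import List


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def random_regions(genome: str, length: int, n: int, seed: int = 0) -> List[str]:
    """Draw n regions of the given length at uniformly random start positions.
    Assumption: regions may overlap each other and annotated promoters."""
    if length > len(genome):
        raise ValueError("region length exceeds genome length")
    rng = random.Random(seed)
    starts = [rng.randrange(0, len(genome) - length + 1) for _ in range(n)]
    return [genome[s:s + length] for s in starts]
```

The same background regions can then be passed through any structural scale so that the distribution of mean values for promoters and background can be compared property by property.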
For every tested structural property, the TSS sequences were found to be significantly different from the genomic background. In general, the TSS sequences were found to be less stable, as evidenced by the lower denaturation temperature and the higher stacking energy (which has a negative scale, so values closer to zero indicate lower stability). This lower DNA helical stability has already been noted in previous studies and is postulated to be linked to the DNA denaturation step during RNA polymerase open complex formation [24]. Further, we also found the TSS sequences to be more rigid, as indicated by the lower bendability and the higher B-DNA twisting (which has been used as a measure for the rigidity of the DNA molecule in the past). Again, this corresponds well with previous reports [7]. Additionally, the TSS regions displayed a larger fraction of regions with extreme curvature and a lower tendency to be in the Z-DNA conformation than the remainder of the genome.
It has been reported in prior studies that there are local differences in the structural property profiles across the promoter region [8,36]. The average structural variation of the DNA molecule at each position of the selected promoter regions of E. coli is shown on the left side of figure 3. For all structural properties except curvature, the most extreme values occur around the -10 position. This is very relevant, as this is the main recognition site for the RNA polymerase holoenzyme and features a relatively strong TATAAT consensus sequence that can be found in many of the promoter regions (see Figure S1). The -10 region is less stable than the surrounding region, since this is the limit of the melting of the DNA in the process of open complex formation, and it is more flexible, as seen in the B-DNA twist profile and the bendability profile. These characteristics correspond to the expected properties of the TATAAT consensus sequence. The TSS itself is also unstable but has a high rigidity. The area upstream from the -10 position, where the -35 promoter element is located, is more stable but is a region of disagreement between the two rigidity scales. From the bendability results, this region seems to be rigid, but the B-DNA twisting suggests a more flexible DNA structure. This can be the result of the directionality constraint on the bendability scale, as it only describes bendability towards the major groove. The position may therefore be flexible but not in the direction of the major groove. The curvature seems to be most pronounced between 50 nt and 100 nt upstream from the TSS.
However, the average profiles are not representative for all promoters; for example, 50% of the promoters do not show any curvature (see Figure S2). This diversity in structural profiles causes a large standard deviation on all average structural values presented and underlines the need for stringent statistics in all of our analyses. Furthermore, many of the structural profiles seem to display extremes at similar positions, such as the -10 position, which suggests a dependency between the different profiles. The correlation between the average structural profiles found for the promoter regions is also presented in figure 3. While the actual values within the scales are known to be independent, certain patterns of correlation emerge when the scales are applied to biological sequences and smoothed. This increase in correlation between structural scales when applied to genomic sequences has previously been noted by Baldi et al. [37]. The two scales used to measure the stability of the DNA helix, namely the denaturation temperature and the base stacking energy, are strongly anti-correlated. In addition, there is a strong correlation between the denaturation temperature and the GC content of the sequence. This relation could be expected, as the GC content is known to be a primary determinant of the stability of the DNA helix due to the difference in hydrogen bonds between AT and GC pairs. These different stability scales (e.g. denaturation temperature and base stacking energy) will therefore likely give very similar results due to the high correlation (and anti-correlation) between their structural profiles. Thus we only considered the most commonly used of these scales as a measure for stability, namely base stacking energy, and will proceed with only the curvature, bendability and base stacking profiles in the remainder of the paper.
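The dependency between two average profiles can be quantified with a correlation coefficient. The sketch below computes the Pearson correlation between two equal-length profiles in plain Python. The choice of Pearson correlation and the function name are assumptions made for illustration; the text does not state which correlation measure underlies figure 3.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)
```

Two stability scales that move in opposite directions along the promoter, as described for denaturation temperature and base stacking energy, would give a value close to -1, which is the rationale for keeping only one of them.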

Patterns in the Structural Profiles of the Promoter Region are Related to Functional Loci
It has been established that not all promoter regions follow the same structural profile but that they can be divided or grouped into several categories. In eukaryotic species, different promoter categories have been linked to distinct RNA polymerases [8]. Prokaryotic species only have a single RNA polymerase to transcribe their genes, but have multiple sigma factors (e.g. E. coli has 7), which are recruited as part of the RNA polymerase complex and are responsible for directing the binding of the complex to the promoter. Different sigma factors have a preference to bind distinct sets of promoters. The constitution of the cellular sigma factor pool thus regulates the affinity of the RNA polymerase to different promoters on a genome-wide scale [38].
Following the analyses of the structural profiles, we can evaluate whether certain sigma factors have a tendency to co-occur with specific structural patterns in the promoter. Such patterns may thus be defining characteristics involved in the specific recognition by different sigma factors. For the analysis of the effect of the sigma factors on the promoter region, we opted to narrow down our search window to the region where sigma factors will most likely interact: from 50 bp upstream to 10 bp downstream of the TSS. Indeed, as can be seen in Figure S3, there are many local variations in the bendability and base stacking within this region for the different sigma factors. To confirm whether these differences are due to actual variation between sigma factors or simply the result of the large variation of the promoter structural profiles, we performed a two-sided Kolmogorov-Smirnov (KS) test for each sigma factor profile at each position. Based on this analysis, only one of these variations was found to be significant, namely the stable region in the base stacking profile of sigma 28 at position -14, with a p-value of 1.69×10^-5.
The curvature profiles of the promoters assigned to different sigma factors showed similar results. Again we found differences between the profiles of the various sigma factors, but they were not significant. However, as can be seen in Figure S3, the resolution of the curvature profiles is much lower and does not allow a clear evaluation within the scope of functional loci at 60 bp as done for the bendability and base stacking profiles. Figure 4 shows the curvature across the entire promoter region and reveals that the major region of high curvature is not located at the TSS but further upstream. This suggests that there might be limited direct interaction between the functional elements of the core promoter and the DNA curvature. The region starting from about 50 bp upstream from the TSS to about 250 bp upstream, sometimes termed the proximal promoter, seems to feature more DNA curvature. This is interesting, as this region also features many of the binding sites (BS) of transcription factors (TFs). Indeed, comparison of the average curvature profile and the TFBS density in figure 4 reveals highly similar patterns. Thus TFBSs seem to occur more in the curved regions of the promoter. When comparing the curvature values of the promoter regions annotated with TFBSs against the promoter regions that have no TFBS annotated, we observed that on average the TFBSs have a tendency to occur in regions with a higher curvature (KS-test p-value: 1.87×10^-164). The curvature of the DNA molecule thus co-occurs with the presence of TF binding sites, implying that DNA curvature may play a role in TFBS function. Given this relationship and the diversity in TFs, we evaluated whether there might be a certain subset of TFs that are functionally related to the curvature. If we compare the curvature at the position of all the binding sites of a TF to those of the other TFs, we do indeed identify a set of TF binding sites that occur at significantly high or low curvature, as shown in table 2.
We found four TFs whose binding sites have a tendency to occur in curved regions significantly more than other TFs, namely CytR, GlpR, NanR and NhaR, and one which has a lower tendency to occur in curved regions, namely GntR.See the discussion below.

The Expression Behavior of Genes can be Linked to Structural Characteristics in Their Promoter Regions
Given the role that the promoter plays in transcription regulation, there might be a direct or indirect link between the groupings of promoters based on their structural profiles and the expression behavior of their downstream genes. To investigate whether such a relationship exists, the promoter categories can be compared against the gene expression domains, i.e. the conditions under which a gene is up- or downregulated. In a recent study we described three large expression classes in the Escherichia coli transcriptome, where each class is a set of genes that share global expression behavior across a large number of experimental conditions [28]. Detailed analysis of these expression classes showed that each had a clear functional association (see Data S1). One large class (the 'growth' class) contained most of the genes that can be associated with housekeeping, essential cellular functions and cell division. A second class of genes (the 'stress' class), which displays anti-correlated expression to the growth class, features several genes involved in periods of specific stress or associated alternative metabolism. The last class (the 'general' class) displays overlapping expression domains with both the growth and stress classes and includes many of the genes involved in general metabolism and in the mobility of the organism.
Each promoter in our data set can be annotated as being part of one of the three classes based on the functional expression class of the first downstream gene. In this analysis, the complete TSS data set was again analyzed across the larger span of 200 nt upstream to 50 nt downstream. The average profiles are largely similar over the three expression classes, with most differences falling well within the expected variation resulting from the original noisy promoter profiles. One exception is a sharp difference between the growth class and the other two classes within the base stacking energy profiles at the region around the TSS, as can be seen in figure 5. The growth class promoters are on average more stable at this position than those from the other classes (KS-test p-value: 6.4×10^-11). Given this large difference in stability and the known correlation between stability and GC content, it is not surprising that the G/C frequency of the promoters of genes in the growth class is higher than that of either the stress or general expression class. Detailed analysis of the base sequence shows that the stability difference at the TSS is caused by a higher preference for guanine and cytosine from -5 bp to -1 bp relative to the TSS, without any clear pattern for a specific sequence in the growth class promoters (see Figure S4). The GC content of the growth class promoters at positions -5 to -1 is 51.2%, while that of the general metabolism class and stress class is 41.4% and 36.2%, respectively. This region of higher stability in the growth class promoters is heavily contrasted by the low stability and low GC content at the -10 position and the region downstream from the TSS.
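The per-class GC content of the window just upstream of the TSS can be computed as in the following minimal sketch. The function name and the input layout are hypothetical: each promoter is assumed to be given as a (sequence, class label) pair, and tss_index is assumed to be the 0-based index of the TSS within every sequence, so that positions -5 to -1 are the five bases immediately before that index.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def window_gc_by_class(
    promoters: Iterable[Tuple[str, str]],
    tss_index: int,
    start: int = -5,
    end: int = -1,
) -> Dict[str, float]:
    """Pooled GC percentage per class for positions start..end (inclusive,
    negative values are upstream) relative to the TSS at tss_index."""
    if tss_index + start < 0:
        raise ValueError("window extends beyond the start of the sequences")
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # label -> [GC, total]
    for seq, label in promoters:
        window = seq[tss_index + start: tss_index + end + 1].upper()
        counts[label][0] += window.count("G") + window.count("C")
        counts[label][1] += len(window)
    return {label: 100.0 * gc / total
            for label, (gc, total) in counts.items() if total}
```

Pooling the bases of all promoters in a class, as done here, is one reasonable reading of a class-level GC content; averaging per-promoter percentages would be an equally plausible alternative when all windows have the same length.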

The Structural Profiles of Proteobacteria are Mostly Dependent on GC Content
The conservation of the promoter structural profile among related species can be studied by expanding the analysis to other prokaryotes. The Proteobacteria clade, of which E. coli is a member, has several species for which large TSS data sets are available: Klebsiella pneumoniae, Helicobacter pylori, Salmonella enterica serovar Typhimurium, Pseudomonas aeruginosa and Geobacter sulfurreducens. The Proteobacteria are the largest and most diverse group of Gram-negative bacteria. Given the size and number of data sets available for this clade, we will first compare the structural profiles of the Proteobacteria species. Within this phylum, E. coli, S. enterica and K. pneumoniae are Enterobacteria of the Gammaproteobacteria class, while P. aeruginosa belongs to the Pseudomonadales of the same class. H. pylori belongs to the Epsilonproteobacteria class and G. sulfurreducens to the Deltaproteobacteria. As in E. coli, the promoter regions were found to be less stable, more rigid and more curved than the genomic background in all species, with the exception of curvature in H. pylori. The absolute values of the base stacking energy profiles of the promoter regions, as shown in figure 6, show great variation between the six Proteobacteria. These profiles suggest that the H. pylori promoter regions are the least stable, while those of P. aeruginosa have the highest stability. This finding, coupled with our earlier observation on the correlation between GC content and base stacking energy, suggests that this difference might be caused by the difference in genomic GC content between these species. Indeed, the genomic GC content of H. pylori is the lowest at 39%, while that of P. aeruginosa is the highest at 66%. The GC content of the other species is between 50% and 61%. Despite the difference in overall stability, it is interesting to note that all of the Proteobacteria show similar patterns in their base stacking profiles: a very clearly defined unstable -10 region followed by a more stable region around the TSS. The bendability profile showed similar results, with most species following almost the same profile, the only exception being H. pylori, which features high rigidity. It is however clear from the bendability and stability profiles that the DNA structure surrounding the -10 position has remained mostly conserved in the promoters of all the studied species. This conservation is less clear in the promoter consensus sequences (see Figure S7). While the three Enterobacteria and H. pylori all share a conserved TA(T/A)AAT consensus sequence at the -10 position, it seems mostly absent in P. aeruginosa and G. sulfurreducens. Finally, most species also displayed significantly higher curvature values in their promoter regions than in the remainder of their genome. The only exception in this regard is H. pylori, which also features extreme curvature regions in the remainder of its genome, and thus the high curvature of the promoter region was not found to be significant.
The lack of large expression compendia or sigma factor annotation for these species prevents us from performing analyses equivalent to those performed on E. coli. The clear relation that was found between the stability of the TSS region and the functional class of the downstream gene in E. coli suggests that gene ontology information could be used to test whether such a functional relation also exists in these species. For this test, the genes with a low base stacking energy value at the -4 position of their promoter (the site of maximum difference between the classes in E. coli) were extracted and their gene ontology enrichment was calculated. This analysis revealed that the promoters of the two Enterobacteria that featured high stability at this position could be linked to several ontology terms. The K. pneumoniae genes with a stable region upstream from the TSS were found to be enriched only in processes that were also enriched in the E. coli growth functional expression class (as can be found in Data S1), namely carbohydrate derivative metabolic process (p-value of 6.58×10^-5), tRNA processing (p-value of   ). Again, each of these four terms was also enriched in the E. coli growth functional expression class, as shown above.
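A gene ontology enrichment of this kind is commonly scored with a one-sided hypergeometric test, sketched below in plain Python. This is an assumption made for illustration: the text does not state which enrichment statistic or multiple-testing correction was used, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(pop_size: int, pop_hits: int,
                           sample_size: int, sample_hits: int) -> float:
    """P(X >= sample_hits) when sample_size genes are drawn without replacement
    from pop_size genes, of which pop_hits carry the ontology term."""
    if not (0 <= pop_hits <= pop_size and 0 <= sample_size <= pop_size):
        raise ValueError("inconsistent population or sample sizes")
    if sample_hits < 0:
        raise ValueError("sample_hits must be non-negative")
    upper = min(pop_hits, sample_size)
    total = comb(pop_size, sample_size)
    # math.comb returns 0 when k > n, so impossible configurations add nothing.
    tail = sum(comb(pop_hits, k) * comb(pop_size - pop_hits, sample_size - k)
               for k in range(sample_hits, upper + 1))
    return tail / total
```

Here the population would be all genes with an annotated promoter in a species, the sample would be the genes selected by their base stacking value at the -4 position, and the test would be repeated for every ontology term, which is why a correction for multiple testing would normally follow.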

The Structural Profiles of E. coli are not Representative for all Prokaryotes
The previous analyses were expanded to other non-Proteobacteria prokaryotic species for which large TSS data sets are available: Bacillus anthracis of the Firmicutes, Chlamydophila pneumoniae of the Chlamydiae (also referred to as Chlamydia pneumoniae), the cyanobacteria Synechocystis sp. and Synechococcus elongatus, and the archaeon Sulfolobus solfataricus. All except S. solfataricus are bacteria, and of these bacteria all except B. anthracis are Gram-negative.
Again, in all cases the promoter regions were found to be significantly less stable and more rigid than the genomic mean. However, the promoter regions of the non-Proteobacteria were not found to display higher curvature, with the exception of C. pneumoniae. The absolute base stacking values are again correlated with the genomic GC content of the species, with the highest being 55% for S. elongatus and the lowest 35% for B. anthracis. The base stacking profiles all show extreme values at the -10 position, with the exception of B. anthracis, as shown in figure 7. The remainder of the profile for the cyanobacteria is very similar to that previously described for E. coli, differing only due to the difference in GC content. However, this conservation around the -10 position is much less clear in the sequence (see Figure S8). The base stacking profiles of C. pneumoniae and S. solfataricus deviate more and feature several additional unstable regions, such as around the -30 position, that were not present in the Proteobacteria. The difference is greater in the bendability profiles, where S. solfataricus features a very flexible TSS and a very rigid -30 position. In addition, the bendability profile of C. pneumoniae contains a region of high rigidity from -50 to -15 relative to the TSS.

Discussion
The structural profiles of promoter regions of eleven prokaryotic species were studied within the context of their diversity and their function. The main anchor points for these promoter regions were TSSs identified through high-throughput experiments or large curated databases. Across all the studied species, the promoter region was found to be less stable and more rigid than the remainder of the genome. The actual structural profiles of the promoter regions were found to differ greatly across the different species, especially at large evolutionary distances. The most consistent pattern was found to be the decreased DNA stability centered around the -10 position across all the Gram-negative bacteria, which likely underlies the success of methods using this structural property for the classification of promoter regions [36]. However, the sequence at this position showed more variation in more distant species despite similarities in the structure, suggesting that some of these structural properties must remain conserved for the function of the promoter even if the sequence is not. This low stability has been suggested to facilitate helix denaturation prior to the transcription event [24]. This pattern was however less clear in the only Gram-positive bacterium tested, namely B. anthracis. While this could signify the absence of this pattern in this class of bacteria, it must be noted that this single data set might not be representative for Gram-positive promoter regions as a whole and/or that the accuracy of TSS determination may not be very high, given the absence of regulatory patterns in both the sequence and the structural profiles.
It is interesting to postulate that each sigma factor, which binds near the -10 position and drives the recognition of the promoter region by the RNA polymerase, might have a unique recognition pattern that is present in the stability or bendability profiles. While some patterns could be found in these structural profiles that are specific to several sigma factors, no clear picture emerged. The correct characterization of these recognition patterns and their statistical significance is likely hampered by interference of several sigma factors binding to the same promoter and by limited high-quality data on sigma factor binding. The described patterns also relate to the average structural profiles, whereas the individual promoter profiles display large diversity. Thus any pattern had to be interpreted within large amounts of background noise. Other patterns in the structural profiles extended well beyond the reach of sigma factors, indicating that these might also result from other functional regulatory elements present in the promoter region.
One recurring pattern in the structural profile of promoter regions, namely a stable or unstable region at the TSS, could not be linked to any specific sigma factor. This type of pattern could, however, be associated with either the expression class of the downstream gene or the experimental method used to determine the TSS. Correcting for the patterns distinct to either the expression class or the experimental method, by plotting the average profiles for the different experimental methods within a single expression class or vice versa (Figures S5 and S6), reveals that these are two separate structural patterns. Indeed, as mentioned before, the low-high stability region related to the functional expression classes is located just ahead of the TSS (from about −5 to −1 upstream), while that arising from the experimental bias is more concentrated downstream from the TSS (from the TSS itself to about +5 downstream). The explanation behind the presence of these stability regions in either case is not immediately clear from the current study. It should be noted that the pattern described for the growth expression class, a high stability region followed by a sudden dip in stability at the TSS, is very similar to the average stability pattern present at the TSS of eukaryotic species, where this stability difference is postulated to aid in the correct positioning of the RNA polymerase at the transcription start site [19]. Additionally, the fact that we also found this low and high stability pattern just upstream from the TSS in K. pneumoniae and S. enterica, with similar functional annotation enrichment to that found in E. coli, not only supports the hypothesis that this is a functional element but also indicates that it might be conserved throughout the Enterobacteriaceae. However, the lack of homogenized expression data and extensive annotation for these specific species currently limits this analysis. Experiments modifying the stability of the TSS to characterize the resulting changes in expression level could provide further insight. Dedicated experiments are indeed the only way to prove that the described structural patterns are in fact functional and not an indirect result of other functional elements, such as those potentially contained in the DNA sequence itself.
Figure 6. [...] black line). Reported profiles span 200 nt upstream to 50 nt downstream from the TSS. Top: bendability profile as log-frequency, where higher values signify more flexible DNA. Middle: base stacking energy profile in kcal/mol, where higher values correspond to less stable DNA. Bottom: curvature profile in angle degrees, where higher values correspond to more curved DNA. doi:10.1371/journal.pone.0088717.g006
Patterns in the DNA curvature profiles diverged strongly from those of the other structural scales, as curvature is influenced by long-range interactions. The findings for the curvature property were not consistent across the studied species; only the promoter regions of some of the Proteobacteria and C. pneumoniae were found to have significantly higher curvature values than the remainder of the genome. Additionally, the species with a higher tendency towards curved promoter regions featured high curvature values in typically less than half of their promoters, as was shown for E. coli. This implies that high curvature values are not a general characteristic of prokaryotic promoter regions. These regions of high curvature may thus support other types of genomic elements that simply occur in the promoter regions of some species. In the past, DNA curvature has been postulated to act as a thermosensor or as a functional support for TFBSs [39][40][41]. Indeed, we found that regulatory binding sites in E. coli were on average more likely to occur in curved regions upstream of the promoter. However, detailed analysis showed that while several transcription factor binding sites had a significant tendency to occur in highly curved regions of the promoter, others displayed no preference or even seemed to avoid curved regions. Thus even in species with significantly curved promoters, such as E. coli, curved DNA regions are not an absolute requirement for TFBSs but are likely specific to only a subset of the TFs. Interestingly, the most significant of the high-curvature TFs (CytR) and the low-curvature TF (GntR) both belong to the LacI-GalR protein family, where induction of DNA curvature is known to play a critical role in the regulatory mechanism [42,43]. CytR is an exception within the LacI-GalR family in this regard, as it cannot induce the required curvature in its target sites and its regulatory effects depend on curvature induced by other DNA-binding proteins [44,45]. The tendency to bind at target sites with high curvature may suggest that CytR could also make use of the intrinsic curvature present in the DNA molecule.

Conclusions
The structural patterns in prokaryotic promoter regions were re-examined in this study. Despite a lack of data in the past, the estimated structural patterns present at prokaryotic promoter regions obtained by TSS determination seem to have been mostly correct for all tested Gram-negative bacteria; promoter regions are less stable, more rigid and often more curved than genomic DNA. However, large interspecies differences in the structural profiles themselves could be observed. Additionally, specific patterns found within the structural profiles of promoters could be linked to the expression behavior of the downstream genes and to regulation by sigma factors. The most significant of these findings was a stable/unstable region just upstream of the TSS that was associated with the expression class of the downstream gene. These findings have possible implications for promoter prediction tools that use structural properties, as such tools may be biased towards a certain type of promoter based on sigma factor recognition or a single gene expression class. Finally, this work offers new prospects for future research into as yet uncharacterized functional elements that are defined by DNA structural properties.

Data Sets
An overview of the TSS data utilized in this study is given in Table 3. The TSSs determined for E. coli were grouped together into a single data set. To compile a complete set of all TSSs from these different data sources without introducing bias by including the same region more than once, we chose to limit our analysis to one TSS per gene. The TSS chosen was always the furthest upstream TSS that had been found for the gene, the rationale being that these TSSs are the least likely to be false positives, e.g. the 5′ end of a processed transcript.
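The selection rule described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the record layout (`gene`, `position`, `strand`) and the function name `furthest_upstream_tss` are assumptions introduced here, and all TSSs of a gene are assumed to lie on the same strand as that gene.

```python
from collections import namedtuple

# Hypothetical record layout; the data format used in the study is not described.
TSS = namedtuple("TSS", ["gene", "position", "strand"])


def furthest_upstream_tss(tss_records):
    """Keep one TSS per gene: the one located furthest upstream of the gene.

    On the forward strand ('+') upstream means a smaller coordinate,
    on the reverse strand ('-') it means a larger coordinate.
    """
    selected = {}
    for tss in tss_records:
        best = selected.get(tss.gene)
        if best is None:
            selected[tss.gene] = tss
        elif tss.strand == "+" and tss.position < best.position:
            selected[tss.gene] = tss
        elif tss.strand == "-" and tss.position > best.position:
            selected[tss.gene] = tss
    return selected


# Toy records (invented coordinates, for illustration only).
records = [
    TSS("geneA", 1200, "+"),
    TSS("geneA", 1150, "+"),  # further upstream on the forward strand
    TSS("geneB", 5000, "-"),
    TSS("geneB", 5080, "-"),  # further upstream on the reverse strand
]
chosen = furthest_upstream_tss(records)
```

With these toy records, the TSS at coordinate 1150 is retained for geneA and the TSS at coordinate 5080 for geneB.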
For the comparison of the data sources, we compiled sets of TSSs found by low-throughput methods (RegulonDB and PromEC curated lists), HT methods without preprocessing steps (RACE by Cho et al., two experiments by Mendoza-Vargas et al.) and HT experiments with enrichment for tri-phosphate-capped RNA (RACE by Kim et al., RNAseq by Gama-Castro et al.). The RNAseq results detailed in Salgado et al. (2013) were not included in the experimental compilation as they represent a mixture between a high-throughput method with and without preprocessing [47]; however, they are included in the complete E. coli TSS data set.
Sigma factor and transcription factor binding sites were obtained from the annotations in RegulonDB (version 7.5) supported by strong evidence. Enrichment calculations were performed based on a hypergeometric distribution. The likelihood that two samples are derived from the same distribution was evaluated with a two-sided Kolmogorov-Smirnov (KS) test. Note that we apply the KS test for two types of analyses in this study. The first is a statistical comparison between distributions of mean values for entire sequences, e.g. in the comparative analysis of average structural values between promoter and non-promoter sequences. The second is a screening of an entire profile to identify positions where the profile differs significantly from a background. In this case the two distributions that are compared are that of the structural values at a certain position within a specific subset of promoter regions, e.g. those associated with a sigma factor or a gene expression class, and that of the structural values at the same position in all remaining promoter regions of the same species. Both the hypergeometric and KS testing were performed using the tools available within Matlab 2013a. The resulting p-values were evaluated for significance at a threshold of 0.05 divided by the number of tests performed within the experiment, as per the Bonferroni correction.
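The position-wise screening with the KS test can be illustrated with a short sketch. The study itself used Matlab; the pure-Python version below is only an approximation under stated assumptions: the p-value comes from the asymptotic Kolmogorov series without any small-sample correction, the number of tests for the Bonferroni correction is taken to be the number of screened positions, and the function names and toy profiles are hypothetical.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (Kolmogorov series, an approximation)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def screen_profile(subset_profiles, background_profiles, alpha=0.05):
    """Return the positions where subset and background values differ significantly."""
    n_positions = len(subset_profiles[0])
    threshold = alpha / n_positions  # Bonferroni: one test per position (assumption)
    significant = []
    for pos in range(n_positions):
        subset_values = [p[pos] for p in subset_profiles]
        background_values = [p[pos] for p in background_profiles]
        d = ks_statistic(subset_values, background_values)
        if ks_pvalue(d, len(subset_values), len(background_values)) < threshold:
            significant.append(pos)
    return significant


# Toy profiles of three positions; only the middle position is shifted in the subset.
subset = [[0.1 * i, 10.0 + 0.1 * i, 0.1 * i] for i in range(30)]
background = [[0.1 * i, 0.1 * i, 0.1 * i] for i in range(40)]
significant_positions = screen_profile(subset, background)
```

In this toy example only the middle position (index 1) passes the corrected threshold.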
Figure 7. [...] S. solfataricus (red solid line) and E. coli (solid black line). Reported profiles span 200 nt upstream to 50 nt downstream from the TSS. Top: bendability profile as log-frequency, where higher values signify more flexible DNA. Middle: base stacking energy profile in kcal/mol, where higher values correspond to less stable DNA. Bottom: curvature profile in angle degrees, where higher values correspond to more curved DNA. doi:10.1371/journal.pone.0088717.g007

Structural Profiles
The structural properties used in this manuscript were derived from experiments or theoretical approaches and stored in structural scales. The structural scale model is based on the neighbor model of DNA structure, which postulates that the primary structural characteristics of the DNA molecule are the result of neighboring base interactions. As such, the structure of a DNA molecule can be derived with reasonable accuracy from the sequence if the contributions of each di- or trinucleotide to the overall structure are known. The structural scales detail this contribution for each di-/trinucleotide. The structural scales can be applied to any DNA sequence to calculate the structural profile, a vector listing the contribution of each di-/trinucleotide in the sequence at the respective position. This profile can readily be used to study the DNA structure, except for the curvature calculation, which requires additional steps (see below). In all cases discussed here, however, a loess smoothing with a range of 10 nt was applied. When dealing with average profiles, the smoothing was done prior to the calculation of the average.
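As an illustration of how a scale is turned into a smoothed and averaged profile, consider the following sketch. It rests on several assumptions: the smoother is a simplified local linear regression with tricube weights over a 10-point window, used as a stand-in for the loess procedure of the study; the scale `toy_scale` is an artificial indicator scale (1 for dinucleotides of two identical bases, 0 otherwise) and not one of the published scales; and all function names are hypothetical.

```python
def structural_profile(sequence, scale, k=2):
    """Look up the scale value of every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return [scale[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


def local_linear_smooth(values, span=10):
    """Simplified loess: tricube-weighted linear fit over a window of `span` points."""
    n = len(values)
    span = min(span, n)
    smoothed = []
    for x0 in range(n):
        start = min(max(x0 - span // 2, 0), n - span)
        # A bandwidth just beyond the furthest window point keeps all weights positive.
        h = max(x0 - start, start + span - 1 - x0) + 1
        sw = swx = swy = swxx = swxy = 0.0
        for x in range(start, start + span):
            dx = x - x0
            w = (1 - (abs(dx) / h) ** 3) ** 3
            sw += w
            swx += w * dx
            swy += w * values[x]
            swxx += w * dx * dx
            swxy += w * dx * values[x]
        denom = sw * swxx - swx * swx
        if denom == 0:  # single-point window: fall back to the weighted mean
            smoothed.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            smoothed.append((swy - slope * swx) / sw)  # fitted value at x0
    return smoothed


def average_profile(profiles, span=10):
    """Smooth every profile first, then average position by position."""
    smoothed = [local_linear_smooth(p, span) for p in profiles]
    return [sum(column) / len(column) for column in zip(*smoothed)]


# Artificial scale for illustration only; NOT a published structural scale.
toy_scale = {a + b: (1.0 if a == b else 0.0) for a in "ACGT" for b in "ACGT"}
promoters = ["AACCGGTTAACC", "ACGTACGTACGT"]
raw = [structural_profile(p, toy_scale) for p in promoters]
mean_profile = average_profile(raw, span=10)
```

A sequence of length n yields n − 1 values for a dinucleotide scale and n − 2 values for a trinucleotide scale (`k=3`), so all sequences that are averaged must have the same length.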
Base stacking energy is a dinucleotide scale derived from the theoretical calculations of Ornstein et al. [30]. This scale contains the minimal free energy, expressed in kilocalories/mol, for each dinucleotide. The more negative this energy is, the more stable the base stacking, and thus the DNA helix, is expected to be.
Denaturation temperature is a dinucleotide scale derived from the experimental observations of Delcourt and Blake [48]. This scale gives the average melting temperature in °C for each dinucleotide. Thus the higher this temperature is, the more stable the DNA molecule.
B-DNA twist is derived from the observations made by Olson et al. [1] and provides, for each dinucleotide, the average twist angle in degrees between successive base pairs. It has been observed that this scale is a good measure of the rigidity of the DNA molecule, as more rigid DNA molecules tend to have a higher twist [33].
Z-DNA-philicity is a dinucleotide scale derived from the calculations by Ho et al. [34], who determined the free energy, expressed in kilocalories/mol, of each dinucleotide when present in the Z-DNA conformation. Sequences with lower values are thus more inclined to adopt the Z-DNA conformation.
Major groove bendability is a trinucleotide scale derived from the DNase-I cutting frequencies as defined by Brukner et al. [35]. The reported values are the log of these cutting frequencies, i.e. the less a sequence is cut by DNase-I, the more negative the assigned value will be. As DNase-I will cut a DNA molecule when it is bent towards the major groove, sequences with high scores on this scale are more likely to have an intrinsic bend towards the major groove or to be flexible in this direction.
Curvature is calculated in the manner of the BEND algorithm as detailed by Goodsell and Dickerson [32]. Here curvature is defined as the sum of the local additive intrinsic deformations of the different base pairs along the helical axis. Its calculation is based on three dinucleotide scales, derived from the original BEND algorithm, for the twist, tilt and roll of the DNA molecule. This information is used to calculate the average trajectory of the DNA axis over 10 base pairs, which is then used to derive the average curvature at each position based on the trajectory 50 base pairs up- and downstream. The reported values are thus the relative angle of the DNA curvature in degrees across these 100 bp. Due to the long-range nature of curvature defined in this manner, the curvature was first calculated for the entire genome and then assigned to each separate sequence based on its location.
Smoothing of the structural properties is performed by applying a loess regression to the numerical vector obtained by applying the structural scale to the DNA sequence. Unless indicated otherwise, all structural profiles were smoothed with a span of 10 nt.

Non-structural Profiles
The GC content profiles are derived from a scale where each dinucleotide is assigned a value identical to the number of strong bases it contains (e.g. AT is assigned a 0, AG is assigned a 1, GC is assigned a 2, etc.). While it is not a true structural scale, it is well known that the GC content influences many aspects of the DNA molecular structure. However, the reported GC content frequencies, e.g. in the cross-species and functional gene expression class analyses, are not derived directly from this scale but are simply the percentage of either G or C in the corresponding genomic sequences.
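The difference between the GC content scale and the reported GC percentage can be made concrete with a small sketch. The scale follows directly from the definition given above (number of G or C bases per dinucleotide); the function names are hypothetical and the example sequence is arbitrary.

```python
def gc_scale():
    """Dinucleotide scale: number of strong (G or C) bases in each dinucleotide."""
    return {a + b: (a in "GC") + (b in "GC") for a in "ACGT" for b in "ACGT"}


def gc_scale_profile(sequence):
    """Position-wise GC profile obtained by applying the dinucleotide scale."""
    scale = gc_scale()
    sequence = sequence.upper()
    return [scale[sequence[i:i + 2]] for i in range(len(sequence) - 1)]


def gc_percentage(sequence):
    """Plain percentage of G or C bases in the whole sequence."""
    sequence = sequence.upper()
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


example = "ATGC"
profile = gc_scale_profile(example)  # values for AT, TG, GC
percent = gc_percentage(example)
```

The profile is a vector along the sequence, whereas the percentage is a single summary value per sequence.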
The transcription factor binding site density was calculated by assigning a value of 'one' to each genomic position where a transcription factor is reported to bind according to the data available in RegulonDB, and 'zero' otherwise [47]. The average profile over several sequences can therefore be interpreted as the frequency of binding sites across all sequences at a specific position.
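A minimal sketch of this density calculation is given below. The representation of binding sites as inclusive, 0-based `(start, end)` intervals within a promoter window, the function names and the toy intervals are assumptions made for illustration.

```python
def binding_indicator(length, sites):
    """Return a 0/1 vector marking every position covered by a binding site.

    `sites` is a list of (start, end) intervals, inclusive and 0-based,
    expressed relative to the start of the promoter window (assumption).
    """
    indicator = [0] * length
    for start, end in sites:
        for pos in range(max(start, 0), min(end, length - 1) + 1):
            indicator[pos] = 1
    return indicator


def density_profile(indicators):
    """Average the indicator vectors: fraction of promoters bound at each position."""
    return [sum(column) / len(column) for column in zip(*indicators)]


# Two toy promoter windows of six positions with invented binding sites.
promoter_1 = binding_indicator(6, [(1, 3)])
promoter_2 = binding_indicator(6, [(2, 4)])
density = density_profile([promoter_1, promoter_2])
```

Here positions 2 and 3 are bound in both toy promoters and therefore obtain a density of 1, while positions 1 and 4 obtain 0.5.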

Functional Gene Expression Classes
The expression classes are derived from the large E. coli expression compendia found in COLOMBOS (release 2, June 2012) [27,49]. Correlation matrices were calculated for all genes in the compendia, where each element ij is the correlation of gene i versus gene j over all conditions. A hierarchical clustering method was then applied, which identified three large categories of genes present in the genome. These three expression classes could then be related to three general biological states: survival in stressful conditions, growth and general metabolism. More information about the nature and construction of these clusters can be found in Meysman et al. [28]. Gene ontology enrichments of these three classes can be found in Data S1.
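The two steps, a gene-by-gene correlation matrix followed by hierarchical clustering, can be sketched as follows. Several choices here are assumptions and not taken from the study: Pearson correlation, a distance of 1 − r, naive average-linkage agglomeration, and toy expression values for four hypothetical genes. Genes with constant expression are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(expression):
    """Element ij is the correlation of gene i versus gene j over all conditions."""
    genes = list(expression)
    matrix = [[pearson(expression[g], expression[h]) for h in genes] for g in genes]
    return genes, matrix


def average_linkage_clusters(genes, corr, n_clusters):
    """Naive agglomerative clustering on the distance 1 - r (average linkage)."""
    clusters = [[i] for i in range(len(genes))]

    def distance(c1, c2):
        return sum(1 - corr[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: distance(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(genes[i] for i in cluster) for cluster in clusters]


# Toy expression values over four conditions (invented for illustration).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
genes, corr = correlation_matrix(expression)
classes = average_linkage_clusters(genes, corr, n_clusters=2)
```

In the toy data, g1 and g2 rise together while g3 and g4 fall together, so the two resulting clusters separate these pairs.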

Gene Ontology Enrichment of the Stability Profiles
To analyze the functional relationship between the stability of the TSS region and the functional annotation of the downstream genes, the TSSs with low stability at the −4 position were extracted from the collection of structural profiles for each species. The cut-off for 'low stability' was determined based on the distribution of the base stacking values at this position and differed based on the GC content. The cut-off for the Enterobacteriaceae was set at −8.5, while that for the other, more AT-rich species was set at −7.5. The extracted TSSs were then mapped onto the first gene in their associated operon. The functional annotation of the resulting gene list was then evaluated based on the gene ontology annotation available in UniProtKB-GOA (accessed on February 15, 2013). Enrichment for all gene ontology terms mapped to these genes and all of their ancestor ontology terms was estimated using a hypergeometric distribution. The background set for comparison consisted of all the genes with gene ontology annotation downstream from a described TSS. A p-value cut-off of 0.1, corrected for multiple testing with a Bonferroni approach by dividing by the number of tested ontology terms, was used for significance.
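The enrichment procedure described above can be sketched in a few lines. The sketch makes several assumptions: base stacking energies are negative values in kcal/mol, so that 'low stability' corresponds to an energy above the cut-off (here −8.5); the annotation of each gene already includes ancestor terms; and the gene names, term labels and energies are invented for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, annotated_in_sample):
    """P(X >= annotated_in_sample) for a hypergeometric draw without replacement."""
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(annotated_in_sample, upper + 1))
    return tail / comb(population, sample)


def low_stability_genes(stacking_at_minus4, cutoff):
    """Genes whose TSS has a stacking energy above the cut-off (less stable DNA)."""
    return {gene for gene, energy in stacking_at_minus4.items() if energy > cutoff}


def enriched_terms(selected, annotation, alpha=0.1):
    """Test every term found in the selected genes; Bonferroni over tested terms."""
    background = set(annotation)
    selected = selected & background
    terms = set()
    for gene in selected:
        terms |= annotation[gene]
    results = {}
    for term in terms:
        annotated = sum(term in gene_terms for gene_terms in annotation.values())
        hits = sum(term in annotation[gene] for gene in selected)
        p = hypergeom_enrichment_pvalue(len(background), annotated, len(selected), hits)
        if p < alpha / len(terms):
            results[term] = p
    return results


# Toy data: 20 hypothetical genes, five of which have a less stable TSS region.
energies = {f"g{i}": (-7.9 if i < 5 else -9.2) for i in range(20)}
annotation = {f"g{i}": ({"term_A"} if i < 5 else {"term_B"}) for i in range(20)}
selected = low_stability_genes(energies, cutoff=-8.5)
enriched = enriched_terms(selected, annotation)
```

In this toy setting all five selected genes carry `term_A`, which gives a p-value of 1/15504 and passes the corrected threshold.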

Figure 1. Average promoter stability profiles grouped by experimental source. Average base stacking energy profiles of TSSs from different experimental sources: low-throughput (LT), high-throughput without preprocessing (HT) and high-throughput with 3P enrichment (HTE). The reported structural profiles span from 200 nt upstream to 50 nt downstream of the TSS. Higher base stacking energies correspond to lower DNA stability. Thus the structural profile derived from the HT experiments reflects lower stability at the TSS when compared to those from the LT and HTE experiments. doi:10.1371/journal.pone.0088717.g001

Figure 2. Comparison of promoter and genomic structural values. Distribution of the mean of the structural DNA properties of the E. coli promoter regions (blue line) compared to randomly selected E. coli genomic sequences of similar length (red line), plotted as kernel-smoothed density plots. The y-axis is thus the density, i.e. a normalized representation of the number of promoters or random sequences with a given mean structural value as given by the x-axis. From top left to bottom right: base stacking energy in kcal/mol (KS-test p-value: 8·10⁻¹⁸⁹), denaturation temperature in °C (KS-test p-value: 7·10⁻¹⁶⁷), curvature angle in degrees (KS-test p-value: 3.5·10⁻¹³), B-DNA twisting angle in degrees (KS-test p-value: 4·10⁻⁷), Z-DNA-philicity in kcal/mol (KS-test p-value: 4·10⁻¹⁸⁷) and bendability as log-frequency (KS-test p-value: 7·10⁻²⁹). doi:10.1371/journal.pone.0088717.g002

Figure 3. Correlation between average E. coli promoter structural profiles. Average structural profiles of the E. coli promoter sequences. A loess smoothing with a range of 10 nt was applied prior to the calculation of the average profile. From top to bottom: DNA curvature, denaturation temperature, GC content of the sequence, bendability, B-DNA twisting, Z-DNA-philicity, and base stacking energy. The promoter sequences illustrated span from 200 nt upstream to 50 nt downstream of the TSS, with the TSS location marked as a grey dashed line. To the right is the heatmap showing the correlation between the average profiles on the left side of this figure. Colors in the heatmap correspond to the correlation values, with 1 (green) signifying strongly correlated average profiles, −1 (blue) signifying strong anti-correlation and 0 (white) signifying no apparent correlation. The distance based on the correlation values between the average profiles is represented by a tree to the right. doi:10.1371/journal.pone.0088717.g003

Figure 4. Average E. coli promoter curvature and TFBS density profiles. Average structural profiles for E. coli promoters from −200 to +50 relative to the TSS. Top: average curvature profile of the promoter regions in degrees, where higher values correspond to larger curvature. Bottom: average transcription factor binding site density profile for the promoter regions, where the values correspond to the frequency of promoters that have a binding site at a given position. doi:10.1371/journal.pone.0088717.g004

Figure 5. Average promoter stability profiles grouped by expression class. Average base stacking energy structural profiles in kcal/mol of the promoter regions associated with one of the three E. coli expression classes, namely the stress class (orange), the general metabolism class (purple) and the growth class (cyan). Higher base stacking energies correspond to lower DNA stability. The structural profiles span 200 nt upstream to 50 nt downstream from the TSS. doi:10.1371/journal.pone.0088717.g005

Figure 6.

Figure 7. Average structural profiles for six distant prokaryotic species. Average structural profiles of the promoter regions from B. anthracis (solid green line), C. pneumoniae (pink dotted line), Synechocystis sp. (dark blue dotted line), S. elongatus (light blue dotted line), S.

Table 1. TSS overlap between experimental methods.

Table 2. Transcription factor binding sites with significant average curvature.
al., two experiments by Mendoza-Vargas et al.) and HT experiments with enrichment for tri-phosphate-capped RNA (RACE by Kim et al., RNAseq by Gama-Castro et al.). The RNAseq results detailed in