Influence of Rotational Nucleosome Positioning on Transcription Start Site Selection in Animal Promoters

The recruitment of RNA-Pol-II to the transcription start site (TSS) is an important step in gene regulation in all organisms. Core promoter elements (CPEs) are conserved sequence motifs that guide Pol-II to the TSS by interacting with specific transcription factors (TFs). However, only a minority of animal promoters contain CPEs. It is still unknown how Pol-II selects the TSS in their absence. Here we present a comparative analysis of promoters’ sequence composition and chromatin architecture in five eukaryotic model organisms, which shows the presence of common and unique DNA-encoded features used to organize chromatin. Analysis of Pol-II initiation patterns uncovers that, in the absence of certain CPEs, there is a strong correlation between the spread of initiation and the intensity of the 10 bp periodic signal in the nearest downstream nucleosome. Moreover, promoters’ primary and secondary initiation sites show a characteristic 10 bp periodicity in the absence of CPEs. We also show that natural DNA variants in the region immediately downstream of the TSS are able to affect both the nucleosome-DNA affinity and the Pol-II initiation pattern. These findings support the notion that, in addition to CPE-mediated selection, sequence-induced nucleosome positioning could be a common and conserved mechanism of TSS selection in animals.


Introduction
An essential step in gene regulation is the recruitment of RNA-Pol-II (Pol-II) to the transcription start sites (TSS) at gene promoters [1][2][3]. This is often facilitated by the presence of conserved sequence motifs known as core promoter elements (CPEs), which are found at a fixed or nearly fixed distance from the TSS [4,5]. Among them, the TATA-box, located 25-30 basepairs (bp) upstream of the TSS, and the Initiator (Inr), located at the TSS, are the best known and most widely conserved CPEs among species [6,7]. The TATA-box is bound by general transcription factors (TFs) that guide and anchor Pol-II to the TSS [8]. As a consequence, promoters with a TATA-box are generally characterized by a focused, almost to the single base, start site [9,10].
In spite of the demonstrated capability of CPEs to select a TSS with high precision, only a minority of promoters have a CPE (in human, 10% have a TATA-box and 30% an Inr motif) [11]. A central question in gene expression is how Pol-II selects the TSS in their absence [12,13]. It has been shown that nucleosomes in promoter regions can regulate gene expression via TF binding site occlusion [14], but their role in TSS selection by Pol-II remains unclear. Promoters have a remarkably conserved chromatin architecture consisting of a nucleosome-free region that spans 100-150 bp upstream of the TSS followed by a well-positioned nucleosome (+1 nucleosome) [15,16]. This general conformation can be altered by diverse factors. Contrary to intuition, so-called broad promoters with dispersed initiation sites have the most focused and regular nucleosome architecture, whereas narrow promoters (also referred to as peak promoters) have less organized nucleosomes [17] and an atypical chromatin architecture [18]. In zebrafish, the chromatin architecture of the same promoter has been shown to change from one developmental stage to another [19], but there again, the conformation with the more structured nucleosome architecture shows a broader initiation site pattern. In mammals, promoters have traditionally been classified according to the presence or absence of CpG islands (CGI), 500-1000 bp long regions enriched in C+G [20][21][22]. CGI-promoters are often TATA-box depleted [23], have broad TSSs [9], exhibit characteristic histone marks [24] and have a precisely positioned +1 nucleosome, which is present even when the promoter is not transcribed [25]. In essence, CGI-promoters resemble the broad promoters described in other species and thus may not be considered a separate class.
An open question in gene regulation is whether the chromatin at promoters is organized by sequence-intrinsic features or indirectly by the transcription machinery occupying the nucleosome-free region and thereby forcing the nucleosome to bind to the nearest free space downstream of the TSS. On a genome level, two types of sequence features have been reported to participate in nucleosome positioning: dinucleotide periodicity and base composition [26]. A theoretical model suggests that the same dinucleotide repeated at 10 bp intervals leads to intrinsic curvature that favors the wrapping of the DNA around the histone octamer [26,27]. This model theorizes that the periodic dinucleotide always occurs with the same orientation relative to the histone-octamer surface, for instance having the major groove facing outwards, and implies a rotational positioning of the nucleosome. Some authors have identified WW (W for A or T) and SS (S for C or G) dinucleotides in counter-phase as major contributors to rotational positioning [28,29], while others have emphasized the importance of RR (R for A or G) and YY (Y for C or T) motifs [27]. DNA base composition can also affect nucleosome positioning. Highly AT-rich sequences, in particular poly(dA:dT) tracts, strongly disfavor nucleosome formation [30,31], whereas G+C-rich sequences tend to have high nucleosome occupancy [32,33]. Unlike dinucleotide periodicity, sequence composition can position nucleosomes in a narrow DNA region without specific preference for rotational setting, a condition termed translational positioning.
As noted above, the role of sequence-intrinsic features in chromatin organization around promoters remains a matter of debate [34]. Zhang and colleagues concluded that nucleosome positioning is the result of Pol-II binding to the DNA [35]. Recent studies in yeast have shown that chromatin remodelers play an important role in organizing chromatin both at a genome [36] and promoter level [37] and that they act synergistically with DNA sequences [38]. Others have reported the presence of nucleosome-favoring and disfavoring sequences in yeast promoters [27,39,40,41], with a high correlation between in-vitro and in-vivo nucleosome organization in these regions [42,43]. Recently, a 10 bp periodic signal has been observed in cumulative WW frequency plots of promoter sequences aligned with respect to the major TSS as defined by CAGE [44]. A similar WW periodicity can be seen in WW heat map plots published in [19]. The phasing of WW periodicity with the TSS is the first indication that the rotational setting of the DNA in the +1 nucleosome guides the TSS selection process.
In this paper, we investigate the molecular mechanisms of TSS selection by jointly analyzing experimentally determined chromatin architectures, DNA-encoded nucleosome signals, Pol-II initiation site patterns and natural genetic variation in promoters stratified by the presence or absence of specific CPEs and/or the breadth of the initiation patterns. The analysis of five model organisms (Homo sapiens, Mus musculus, Danio rerio, Drosophila melanogaster and Caenorhabditis elegans) confirms that different species have an overall similar chromatin organization, with nevertheless some noteworthy species-specific differences. All five organisms have sequence-intrinsic nucleosome-positioning signals that are predictive of in-vivo nucleosome organization, but only in promoters that lack TATA-boxes. Additionally, we show that broad promoters associated with a strong sequence-encoded +1 nucleosome have 10 bp periodic initiation patterns. By analyzing the effects of genetic variants on promoter initiation site patterns and dinucleotide periodicity, we provide genetic evidence that rotational nucleosome positioning is mechanistically involved in TSS selection.

Results
Promoters have DNA rotational properties that influence in-vivo nucleosome organisation and are affected by species-specific biases in DNA composition
To verify that DNA sequences around animal promoters had rotational nucleosome-positioning properties and that 10-11 bp was the prevailing period, 1 kb regions on each side of H. sapiens, M. musculus, D. rerio, D. melanogaster and C. elegans TSSs were scanned for the presence of periodic signals of any length for each individual WW, SS, YY, or RR dinucleotide (S1 Fig). Confirming our expectations, all organisms showed a peak in signal intensity for periods of 10-11 bp (S2 Fig), which are typical of nucleosomal DNA, with a minimum corresponding to the nucleosome-free region and a maximum in the N+1 region (S3 Fig). To further study the rotational properties of single promoter sequences and their effect on chromatin conformation, the strength of 10.3 bp periodic signals for each dinucleotide was evaluated in each promoter and compared to their in-vivo nucleosome maps. As expected, the WW dinucleotide (or SS for D. melanogaster) showed the highest correlation with in-vivo nucleosome signals (Fig 1A and S4 Fig). In H. sapiens, about one third of promoters (top promoters of Fig 1A) had low WW periodicity upstream of the TSS and a peak in periodicity immediately downstream. This was reflected in the chromatin organisation, with a clear nucleosome-free region (NFR) and a focused N+1. As expected, this group of promoters was also depleted of TATA-boxes and enriched in CpG islands. Approximately 25% of promoters showed an opposite signal, with a peak upstream of the TSS and a valley downstream (promoters at the bottom of Fig 1A). They were characterized by a less pronounced NFR, a broader N+1, an enrichment in TATA-boxes and a depletion in CpG islands, in agreement with earlier studies. CpG-enriched promoters were previously reported to have an open chromatin conformation and to be enriched in active histone marks. On the other hand, CpG-depleted promoters were reported to have a closed chromatin conformation and low levels of histone modifications [45,46].

Dinucleotide periodicities have an additive effect on chromatin organisation
Fig 1A shows that a large fraction of human promoters had a WW signal that, although depleted in the NFR, did not show a clear enrichment in the N+1 region. These promoters might have had other dinucleotide signals that peaked in this region, allowing for correct nucleosome positioning. To test this hypothesis, we identified promoters with periodic signal intensity (for each dinucleotide) in the proximal promoter region that could favour the average in-vivo nucleosome distribution. To do so, we compared the average 10 bp periodic signal in the NFR with that of the N+1 region and identified promoters with a higher signal downstream of the TSS (hereafter referred to as a concordant signal). The organisms had heterogeneous numbers of promoters with concordant signals (Fig 1B). H. sapiens and M. musculus promoters were characterised by having the YY and RR dinucleotides as the most common concordant signals, while the WW signal was less frequent. This could have been a consequence of the presence of CpG islands that, with their high GC content, could affect the dinucleotide frequencies and the possibility of generating a periodic signal. The WW signal was more frequent in all other organisms, but only in D. rerio was it the most frequent. In fact, in D. melanogaster more than 40% of promoters had a concordant SS signal, whereas C. elegans promoters were enriched in the YY signal but strongly depleted of the SS signal. Nonetheless, in all organisms 80% of promoters had at least one concordant signal (Fig 1C) and 20% had three or more. The presence of multiple concordant signals in the proximal promoter region was clearly reflected in chromatin organisation (Fig 1D), with more focused nucleosomes even outside the proximal-promoter area used in this analysis.
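The concordance test described above can be sketched in a few lines of Python. This is only an illustration, not the authors' R code from S1 Text: the function names, the window-centre representation and the NFR and N+1 coordinate ranges (-150 to 0 and 0 to 150 relative to the TSS) are assumptions made for the example.

```python
import numpy as np

def is_concordant(scores, centers, nfr=(-150, 0), n1=(0, 150)):
    """Return True when the mean periodicity score in the N+1 range
    exceeds the mean score in the NFR range.

    scores  : periodicity score of each sliding window (one promoter)
    centers : window centre positions relative to the TSS
    nfr, n1 : assumed coordinate ranges (hypothetical defaults)
    """
    scores = np.asarray(scores, dtype=float)
    centers = np.asarray(centers)
    in_nfr = (centers >= nfr[0]) & (centers < nfr[1])
    in_n1 = (centers >= n1[0]) & (centers < n1[1])
    return bool(scores[in_n1].mean() > scores[in_nfr].mean())

def count_concordant(scores_by_dinucleotide, centers):
    """Count how many dinucleotide classes (WW, SS, YY, RR) are concordant."""
    return sum(is_concordant(s, centers) for s in scores_by_dinucleotide.values())

# Toy promoter: WW is stronger downstream of the TSS, SS is stronger upstream.
centers = np.arange(-200, 201, 10)
toy = {
    "WW": np.where(centers >= 0, 1.0, 0.2),
    "SS": np.where(centers >= 0, 0.2, 1.0),
}
n_concordant = count_concordant(toy, centers)
```

Counting concordant classes per promoter in this way gives the quantity summarised in Fig 1C.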

Consensus sequences for promoters' nucleosomes are not always similar to genomic nucleosomes
Our analyses showed that more than one dinucleotide periodic signal was often present in the N+1 region of a promoter (Fig 1). However, it was not clear how the dinucleotides were positioned relative to each other within the same sequence. The mutually exclusive WW and SS are expected to be found in counter-phase [28], as are YY and RR [27]. Trifonov [47] concluded that the general DNA consensus sequence for genomic nucleosomes could be summarized with the following two motifs, SSRRNWWNYY or SSYYNWWNRR (note the relative position of the YY and RR in the two motifs), but little is known about the relative position of the four dinucleotides in the N+1 region. We addressed this using aggregate plots as in [44], where patterns of WW frequency were revealed in the N+1 region of H. sapiens promoters that were remarkably similar to the dinucleotide periodicities seen in MNase-seq data [28]. Using this observation, we evaluated and compared the periodic frequencies of DNA consensus sequences of the N+1 and genomic nucleosomes. To do so, promoters of the five organisms under study were aligned to the TSS and, using aggregate plots, the strength of a 10 bp periodic signal was evaluated in the N+1 region for all possible motifs of length 10 bp generated by permuting the four dinucleotides and two N bases (240 motifs). A similar analysis was performed on genomic nucleosomes defined by high-resolution MNase data and aligned to the inferred center position. In H. sapiens (Fig 2A) there was a very high correlation between the 10 bp frequency strengths measured in DNA sequences from genomic nucleosomes and those measured in DNA sequences of the N+1 region, with a clear separation between motifs with high signal and all the rest. 
Confirming the expectations from [47], motifs with strong periodicity were all characterized by having the WW dinucleotide in counter-phase to SS, as well as YY in counter-phase to RR, and by sharing the same dinucleotide order: the SS dinucleotide was always followed by YY, then by WW and RR. The average intensities of this motif class around H. sapiens promoters showed a pattern that closely resembled in-vivo nucleosome maps (S5 Fig), with signal depletion corresponding to the NFR and a peak at the N+1. Moreover, the class of motifs belonging to the first motif in Trifonov's model (SS-RR-WW-YY) [47] showed very weak signal in both regions. These findings indicated that in H. sapiens, the DNA wrapped around the histones in the N+1 region had dinucleotide periodicity patterns almost identical to those of the DNA found in genomic nucleosomes. M. musculus, D. melanogaster and D. rerio showed a preference for motifs belonging to the same class as H. sapiens (

Dinucleotide periodicities in promoters correlate with Pol-II initiation patterns
The finding that promoters with a broad initiation pattern have phased dinucleotide periodicities in the N+1 region compared to focused promoters [44] that, on the other hand, are enriched in TATA-box motifs [9,17] suggests that TATA-box and chromatin conformation could have different effects on transcription initiation [12,13]. The TATA-box can direct Pol-II to the TSS with high precision [1], whereas in its absence, chromatin organization could guide the Pol-II complex but less precisely. To analyze the quantitative effect of the rotational properties of DNA on Pol-II positioning, the correlation between the strength of the dinucleotide signals in the N+1 region and the spread of Pol-II initiation was studied in greater detail. To do so, promoters were first grouped according to their TATA-box state (with and without the motif) and, for the TATA-less promoters, according to their average dispersion of Pol-II initiation around the TSS (from very focused to very broad promoters), evaluated using CAGE data and summarized with a Dispersion Index (DI; it can be considered as the standard deviation around the most likely initiation site). Then, for each group, the average strength of the four dinucleotide signals in the N+1 region was evaluated. In all organisms tested there was a strong inverse correlation between promoter DI and the average dinucleotide strength (for example for H. sapiens: R² = 0.76 and p-value = 0.0002) (Fig 3A and S9 Fig). Focused promoters without a TATA-box were characterized by the presence of a strong periodic signal, whereas broad promoters showed a weak periodicity. TATA-box promoters were outliers: they showed low DI values and weak periodic signals. In D. melanogaster another large group of promoters (5628 promoters, 1/3 of the total) was characterized by focused initiation and weak periodicity. All these promoters had a DPE [48] and an Inr element, both of which are found at a conserved distance from the TSS. 
Moreover, in all organisms, only promoters without a TATA-box (or Inr-DPE) had the signal in phase with the TSS, suggesting that there was a fixed distance between the TSS
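As a minimal sketch of the Dispersion Index, the Python function below follows the text's description of the DI as a standard deviation around the most likely initiation site. The exact formula used by the authors is in S1 Text and is not reproduced here, so the weighting by CAGE tag counts and the function name are assumptions.

```python
import numpy as np

def dispersion_index(positions, counts):
    """Tag-weighted root-mean-square distance of CAGE tags from the
    most frequently used initiation position (assumed DI definition).

    positions : initiation positions relative to the annotated TSS
    counts    : number of CAGE tags starting at each position
    """
    positions = np.asarray(positions, dtype=float)
    counts = np.asarray(counts, dtype=float)
    mode = positions[np.argmax(counts)]  # most likely initiation site
    return float(np.sqrt(np.sum(counts * (positions - mode) ** 2) / counts.sum()))

# A focused promoter: every tag starts at the same base.
focused = dispersion_index([-1, 0, 1], [0, 10, 0])
# A broader promoter: half of the tags start 10 bp away from the main site.
broad = dispersion_index([-10, 0, 10], [5, 10, 5])
```

Under this definition a promoter with a single initiation base has DI = 0, and the value grows as tags spread away from the dominant site.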

CPE-less promoters show 10 bp periodic initiation patterns
To further elucidate the relationship between periodic DNA signals and Pol-II, we studied the primary and secondary transcription initiation patterns in promoters. In fact, rotational nucleosome positioning due to a 10 bp periodic signal does not require the occurrence of the nucleosome center at exactly the same base: it tolerates shifting by multiples of 10 bp [26,27]. To validate our model that the rotational setting of the +1 nucleosome influences TSS selection by Pol-II, CAGE tags were used to analyze the distribution of transcription starts at promoters. In order to detect secondary Pol-II initiation sites, a "micro-peak" method was applied to the data, which consisted of extracting positions that corresponded to a local maximum in CAGE tag coverage within a window of 5 bp. This method emphasized the stronger initiation sites compared to a simple cut-off value and also reduced the background noise given by spurious signals (S12 Fig). Subsequently, the average distributions of secondary TSSs around promoters grouped by their TATA and DI statuses were evaluated.
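The micro-peak extraction can be illustrated with a short Python sketch. It assumes that the 5 bp window is centred on the candidate position and that ties are resolved in favour of the leftmost position; neither detail is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def micro_peaks(coverage, window=5):
    """Return indices that are local maxima of CAGE tag coverage
    within a centred window (assumed layout: 2 bp on each side)."""
    cov = np.asarray(coverage, dtype=float)
    half = window // 2
    peaks = []
    for i in range(cov.size):
        lo, hi = max(0, i - half), min(cov.size, i + half + 1)
        local = cov[lo:hi]
        # Keep position i only if it carries tags and is the first
        # maximum of its window (leftmost position wins ties).
        if cov[i] > 0 and lo + int(np.argmax(local)) == i:
            peaks.append(i)
    return peaks

# Toy coverage with a primary site and a weaker secondary site 10 bp away.
coverage = [0, 1, 5, 1, 0, 0, 0, 0, 0, 0, 0, 2, 3, 2, 0]
peaks = micro_peaks(coverage)
```

Unlike a fixed cut-off, this keeps the weaker secondary site while discarding the shoulders around each peak.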
In H. sapiens, each promoter subclass showed a similar level of primary TSS activity, with comparable frequencies of micro-peaks at the TSS (Fig 3B). Away from the primary TSS, two opposite Pol-II behaviors were detected. The first had a strong 10 bp periodic pattern in the distribution of secondary initiation sites around promoters and corresponded to TATA-less promoters regardless of their DI values, with both focused and broad promoters showing strong secondary initiation patterns. The second had no clear periodic signal near the central peak and corresponded to TATA-box promoters. This subclass also had poor affinity values (Fig 3A), with the absence of a phase signal downstream of the TSS (S10 Fig). The other organisms showed similar patterns of Pol-II initiation (S13 Fig), with TATA-box containing promoters being the only group that did not show any periodicity in secondary initiation. In D. melanogaster, Inr-DPE promoters had a micro-peak distribution similar to that of TATA-box containing promoters.
The 10-bp periodic distribution of secondary initiation sites could be due to local curving of the DNA at the major initiation site or one-sided protection by components of the pre-initiation complex. To rule out this possibility and to establish a direct link between TSS phasing and the +1 nucleosome signal, we selected promoters with the strongest pattern in secondary initiation sites and studied their DNA properties in the N+1 region. Results showed that promoters with a strong periodic TSS initiation pattern (Fig 3C) also showed high phasing with the +1 nucleosome periodic signal (Fig 3D), further suggesting the presence of a direct relation between the two.

Natural variants that map in the N+1 region alter Pol-II initiation
The strong correlation observed between DNA-encoded nucleosome positioning signals near the TSS and transcription initiation patterns (Fig 3) was an indication that the DNA sequence of promoters had a crucial role in guiding Pol-II to the initiation site via a possible N+1 interaction. To gain further evidence that there was a causative link between DNA sequence and Pol-II initiation and to identify the region that had the greatest influence, we studied the effect of natural variation on promoters' DI. To do so, we used CAGE data from the ENCODE tier 1 cell line GM12878 (a lymphoblastoid cell line), for which the genome had been sequenced by the 1000 Genomes Project Consortium [49]. Using data from this cell line, it was possible to study the effect of natural variation, such as SNPs and indels (deletion or insertion of a few bases), on Pol-II initiation expressed as variation in DI. To address this, we compiled promoter variants for which GM12878 was homozygous for the minor allele. In total there were 15548 SNPs and 1849 indels mapping near promoters (2 kb window around the TSS). The two distributions were similar (S14 Fig): both showed low frequencies near the TSS, but they were not exactly the same. The SNP minimum was centered slightly upstream of the TSS, whereas the indel minimum was downstream, in a region that coincided with the N+1.
GM12878 CAGE tags were then used to evaluate DI values for all promoters. As a reference, we used CAGE data from blood-derived cells from a different origin that should not contain the same mutations [44] and assigned them to a reference genome always containing the major allele (most likely, ML, genome). To identify the promoter region that had the greatest impact on TSS dispersion, we first selected promoters that had natural variants in the GM12878 cell line and grouped them according to the distance of the variants from the TSS (in windows of 150 bp and 10 bp shift). Then the average variation in DI between the two cell lines was evaluated for each group of promoters and plotted as a function of the distance of the window from the TSS (Fig 4A). In this way, it was possible to evaluate the impact on initiation patterns made by natural variants at any given distance from the TSS. Both SNPs and indels had a measurable effect on TSS dispersion if located in the proximal promoter region. Overall, SNPs had a weaker effect on TSS dispersion, with a maximum for SNPs mapping 120 bp downstream of the TSS, in the central region of the N+1 (Fig 4A). Conversely, indels had a much stronger impact in a region that extended from the TSS to the end of the N+1 and peaked within the first half of the N+1. Interestingly, SNPs and indels mapping in the NFR did not coincide with a strong variation in DI.
Fig 4 caption (fragment): [...] Kb region around human promoters were scanned for the presence of natural variants with a sliding window of 150 bp (shift 10) and analyzed for their impact on WW 10 bp frequency intensity in that region. The variation in signal intensity was then correlated to variation in DI for the corresponding promoters using a linear model. The dots represent the slope (angular coefficient) of the linear model for the region centered in that position. Negative slope values correspond to negative correlation between the variation in WW dinucleotide frequency and variation in DI. (C) Effects of mutations in the N+1 region on nucleosome affinity and DI. The plot shows the effect of mutation on nucleosome affinity in the N+1 region (measured as the difference in 10 bp frequency intensity for WW dinucleotide between GM12878 and ML sequences) and its correlation with variation in DI for the same promoter (measured as difference between GM12878 and other blood related cell lines). Each dot represents a different promoter. The p-value of 0.022 corresponds to the F statistic evaluated on the linear regression model.

Variants disrupting dinucleotide periodicity in the N+1 region tend to increase TSS dispersion
We then investigated the relationship between alterations of the nucleosome-DNA affinity (measured as variation in dinucleotide 10 bp frequency) produced by natural variants and their effects on Pol-II initiation. To assess this, we scanned the promoter region with a sliding window of 150 bp (10 bp shift) and investigated the linear relationship between the variation in 10 bp frequency for the WW dinucleotide (produced by GM12878 natural variants that mapped in that region) and the variation in the observed DI for the corresponding promoters. The N+1 region was the only one showing a negative correlation between the variation measured in the nucleosome-DNA affinity and the variation in promoters' DI, with a minimum centered at base +110 (p-value = 0.022, Pearson's r = -0.184) (Fig 4B). On a single promoter level, natural variants that mapped in this region with a disruptive effect on nucleosome binding corresponded to promoters with increased DI compared to WT (Fig 4C). On the other hand, natural variants that increased the nucleosome affinity had the effect of lowering the DI.
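The per-window linear relationship can be sketched as follows. The snippet is a Python illustration under stated assumptions, not the authors' R analysis: it computes only the least-squares slope and Pearson's r for one window (the F-test p-value is omitted), and the function name and toy values are invented for the example.

```python
import numpy as np

def slope_and_r(delta_periodicity, delta_di):
    """Least-squares slope and Pearson's r between the change in WW
    10 bp periodicity and the change in DI across promoters whose
    variants fall in one sliding window."""
    x = np.asarray(delta_periodicity, dtype=float)
    y = np.asarray(delta_di, dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)  # degree-1 fit: y = slope * x + b
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(r)

# Toy data (invented): variants that lower periodicity raise the DI.
d_period = [-2.0, -1.0, 0.0, 1.0, 2.0]
d_di = [4.1, 2.0, 0.1, -2.1, -3.9]
slope, r = slope_and_r(d_period, d_di)
```

Repeating the fit for every window position and plotting the slope against the window centre would give a profile of the kind shown in Fig 4B.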

Discussion
Two pathways for TSS selection by Pol-II have been described in the literature. According to the conventional model, the TSS position is defined by the presence of CPEs [5]. However, the majority of eukaryotic promoters lack CPEs, including a TATA-box and an Inr [11]. Jiang and Pugh proposed that TSS selection in yeast might be linked to the position of the N+1 in the absence of CPEs [12]. Here, through a comparative analysis of DNA-encoded nucleosome signals in animal promoters and Pol-II initiation patterns, we report that the DNA signals underlying both mechanisms are conserved across species and, through the study of natural DNA variants, we show that the level of affinity between the N+1 and DNA affects TSS selection in the absence of CPEs.
The function of sequence-intrinsic features in chromatin organization around promoters is still a matter of discussion [34]. Although studies done in yeast have shown an important role of chromatin remodelers in organizing chromatin at a genome [36] and promoter level [37], a growing body of evidence favors the functional role of sequence-intrinsic features at promoters [27,39,40,41]. Moreover, in a recent study Drillon and colleagues have shown that around 1/3 of nucleosomes in the human genome are positioned based on DNA sequence determinants [50]. Here, through comparative analysis of promoter DNA sequence composition, we show that in five model organisms (H. sapiens, M. musculus, D. rerio, D. melanogaster and C. elegans) the position of nucleosomes at the majority of promoters is at least partly determined by DNA-encoded signals, with some remarkable species-specific differences. Promoters of all organisms show a 10 bp periodic signal for the four dinucleotides tested (WW, SS, YY and RR). H. sapiens is the only organism that also shows a strong signal for YY and RR dinucleotides at a periodicity of 8 bp, which is probably the consequence of the presence of specific CT-rich microsatellite sequences in human promoters [51] (S1 Fig). As expected, the dinucleotide that shows the highest correlation with in-vivo nucleosome maps is WW (Fig 1A). Regardless of this, multiple periodic signals reinforce each other in organizing chromatin around promoters (Fig 1D), suggesting an additive effect of the affinity of the four dinucleotides to histones. When we study the spatial relationships between the four dinucleotides within a promoter sequence, we find the same consensus as in genomic nucleosomes (SS-YY-WW-RR) for all organisms tested with the exception of C. elegans. Interestingly, on a genome level the DNA that is wrapped around C. elegans nucleosomes has the same consensus sequence as all other organisms, but at promoter level we find that there are two distinct groups of promoters characterized by having the SS-YY-WW-RR or SS-RR-WW-YY consensus. This finding is intriguing since the difference between the two sequences is not purely semantic, but has been predicted to alter the affinities to histones [47]. Although SS-RR-WW-YY has been predicted to have the higher affinity to nucleosomes, allowing for perfect bendability of the DNA around the histone octamer [52], our analysis shows that C. elegans promoters with this sequence in the N+1 region do not have any difference in chromatin conformation compared to promoters with the other consensus. The reason for this unexpected observation is unknown and needs further investigation.
The identification of promoters by the transcription machinery is a process that is guided by the general transcription factor TFIID [53], a multi-subunit protein that is not only able to interact with the TATA-box or the DPE element [5] but also with chromatin [54][55][56] via the TAF3 subunit, suggesting the presence of a motif-independent TFIID recruitment at promoters that relies on the N+1 [57]. In agreement with this hypothesis, TATA-box mutation studies have shown a direct effect on Pol-II initiation both in terms of TSS position and level of promoter activity [19,58]. On the other hand, no study, to our knowledge, has investigated the effect that nucleosome-DNA affinity in the N+1 region has on TSS selection. Correlation analysis shows that in all organisms promoters without CPEs have the predicted level of nucleosome-DNA affinity anti-correlated with TSS initiation patterns (Fig 3A and S9 Fig). Broad promoters generally have lower DNA-encoded nucleosome affinity. Conversely, narrow promoters, often presented as a homogeneous class in the literature, vary greatly in this respect, with only the CPE-less subset (TATA-less and Inr-DPE-less in D. melanogaster) showing strong affinity in the N+1 region. Moreover, the 10 bp periodicity seen in Pol-II initiation in all promoters, focused and broad, that lack CPEs (Fig 3B and S13 Fig) is another indication of a direct interaction between Pol-II and the N+1 in the absence of other DNA signals. In fact, a model of Pol-II initiation that relies on the interaction with the N+1, which in turn is rotationally positioned and able to tolerate shifting by multiples of 10 bp [26,27], would allow Pol-II to start transcription at 10 bp intervals. Furthermore, the study of natural DNA variants in H. sapiens has shown that the region with the greatest influence on TSS selection is the N+1 (Fig 4A) and that there is a negative correlation between variation in nucleosome affinity and Pol-II initiation (Fig 4B and 4C). 
That is, the presence of a variant in the N+1 region that decreases the nucleosome-DNA affinity results in an increase in TSS dispersion and vice versa. These results strongly support the model of a motif-independent TFIID recruitment mediated by N+1-TAF3 interaction [57]. We can speculate that, in the absence of the TATA-box or Inr-DPE, the relative stability of the histone-DNA complex in the N+1 region could be transferred to the PIC via interaction with TFIID, leading to a more or less focused transcription initiation by Pol-II. An alternative mechanism of PIC recruitment at promoters in the absence of CPEs has been proposed in recent work in yeast by Afek and Lukatsky, who used a non-consensus based free-energy function to predict PIC affinity instead of nucleosome affinity [59]. Interestingly, they found that the free-energy distribution around promoters (Fig 1 and Fig 2 in [59]) is very similar to the dinucleotide periodicity profile we see in human (S5A Fig), with a minimum located in the nucleosome-free region upstream of the TSS followed by spikes in free energy corresponding to the nucleosome-occupied regions. On the other hand, in all organisms studied, CPE-containing promoters are outliers compared to non-CPE promoters: they are focused but have weak nucleosome affinity and do not show any TSS periodicity. In this class of promoters the initiation site appears to be specified solely by the presence of the CPE [8,10].

Methods
The study is based on experimental evidence present in public datasets. All arithmetic computations were done in R and the corresponding code is presented in the Data Reproduction Guide provided as supplementary material (S1 Text). This document follows high standards of reproducible research; it is a step-by-step guide to precisely reproduce all results presented in this paper and to generate all the figures.

Position weight matrices for CPEs and CpG island annotation
Promoter lists were stratified based on the presence or absence of core promoter elements using the TATA-box and Inr position weight matrices (PWMs) from [6]. Promoter sequences were scanned with these PWMs using the cut-off values suggested in the original paper. Promoters were classified as TATA+ if a TATA-box was present at position -29±3 relative to the TSS, and as Inr+ if the Inr motif occurred exactly at the TSS. The D. melanogaster Inr-DPE matrix, including the recommended cut-off values, is posted at http://epd.vital-it.ch/promoter_elements/init-dpe.php.
CGI coordinates for human and mouse were downloaded from the UCSC genome browser [67]. Promoters with a CGI that spans the TSS (starting before and ending after the TSS) were attributed to the CGI+ class.

Evaluation of periodicity score around promoters
Promoter sequences from position -1074 to position 1075 relative to the TSS were extracted from the corresponding genome assembly (H. sapiens: hg19; M. musculus: mm9; D. rerio: danRer7; D. melanogaster: dm3; C. elegans: ce6) and scanned for the presence of four dinucleotide types (identified by IUPAC codes): WW (W = A or T), SS (S = C or G), RR (R = A or G) and YY (Y = C or T). The resulting binary sequences were individually scanned in a sliding window of 150 bp, shifted by 10 bp at a time. A Fourier transform was applied to each window in order to extract the power spectrum. From the resulting spectrum, the value corresponding to a frequency of 0.097 (corresponding to a period of 10.3 bp) was extracted. This value was directly used as a periodicity score.
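The scoring procedure described above can be sketched as follows. This is a minimal illustration and not the authors' R code (see S1 Text for that): the function names are hypothetical, the mean of each window is subtracted before the transform, and the power is obtained by evaluating the Fourier sum directly at a frequency of 0.097 cycles per bp rather than by reading a bin from a full spectrum. The normalization is also an assumption, so only relative values are meaningful.

```python
import cmath

# IUPAC classes used for the four dinucleotide types
DINUC_CLASSES = {"WW": "AT", "SS": "CG", "RR": "AG", "YY": "CT"}

def dinucleotide_track(seq, kind):
    """Binary track: 1 at each position where a dinucleotide of `kind` starts."""
    allowed = DINUC_CLASSES[kind]
    seq = seq.upper()
    return [1 if seq[i] in allowed and seq[i + 1] in allowed else 0
            for i in range(len(seq) - 1)]

def power_at(signal, freq):
    """Squared magnitude of the Fourier sum at one frequency (cycles per bp).

    Assumption: the window mean is removed and the result is divided by the
    window length; the source does not state the normalization used.
    """
    mean = sum(signal) / len(signal)
    total = sum((x - mean) * cmath.exp(-2j * cmath.pi * freq * n)
                for n, x in enumerate(signal))
    return abs(total) ** 2 / len(signal)

def periodicity_profile(seq, kind, window=150, step=10, freq=0.097):
    """Periodicity score for each 150 bp window, shifted by 10 bp at a time."""
    track = dinucleotide_track(seq, kind)
    return [power_at(track[s:s + window], freq)
            for s in range(0, len(track) - window + 1, step)]
```

On a synthetic sequence in which an AA dinucleotide recurs every 10 bp, every window scores well above the windows of a control sequence in which it recurs every 7 bp.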

Identification of genomic nucleosomes
For paired-end samples, nucleosome positions were restricted to read pairs that formed fragments of exactly 147 bp, as previously reported in [28]. Similarly, to obtain analogous results from single-end samples, reads were selected if they had another read mapped on the opposite strand 147 bp downstream. For both single- and paired-end samples, multiple fragments that mapped to the same location were considered only once, and the midpoints of the fragments were used as the inferred nucleosome positions.
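For the paired-end case, the selection can be sketched as below. The function name and the coordinate convention (0-based, half-open intervals, as in BED files) are assumptions made for illustration; the single-end pairing rule is not shown because the exact offset convention is not specified in the text.

```python
def nucleosome_midpoints(fragments, length=147):
    """Infer nucleosome positions from paired-end fragments.

    fragments: iterable of (chrom, start, end) tuples, 0-based half-open
    (an assumed convention). Only fragments of exactly `length` bp are kept,
    fragments mapping to the same location are counted once, and the
    midpoint of each remaining fragment is returned.
    """
    unique = {(chrom, start, end) for chrom, start, end in fragments
              if end - start == length}
    return sorted((chrom, start + length // 2) for chrom, start, end in unique)
```

For example, two identical 147 bp fragments at chr1:100-247 contribute a single midpoint at position 173, while a 150 bp fragment is discarded.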

Evaluation of consensus motifs scores for nucleosome +1 and genomic nucleosomes
Consensus motifs were generated by permuting the four dinucleotide types (WW, SS, YY, RR) and two Ns. Sequences starting with an N were discarded, resulting in a total of 240 sequences. These consensus motifs were then mapped to promoters and MNase-seq enriched regions.
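The count of 240 can be checked by direct enumeration. The sketch below assumes that each N is a single-nucleotide spacer, so that every motif is 10 nt long; the function name is hypothetical. Six blocks with one repeated element give 6!/2! = 360 distinct orderings, of which 5! = 120 start with an N, leaving 240.

```python
from itertools import permutations

def consensus_motifs():
    """Enumerate distinct orderings of WW, SS, YY, RR and two N spacers,
    discarding those that start with N."""
    blocks = ("WW", "SS", "YY", "RR", "N", "N")
    distinct = set(permutations(blocks))  # collapses the two identical Ns
    return sorted("".join(p) for p in distinct if p[0] != "N")
```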
For the analysis of nucleosome +1, the region from position -99 to 300 relative to the TSS of the corresponding genome assembly was used for mapping each consensus motif, allowing a maximum of 3 mismatches. Then, the average occurrence frequency for each motif was evaluated from base +50 to +200 relative to the TSS and a Fourier transform was applied in order to identify the intensity at a frequency of 0.097 (corresponding to a period of 10.3 bp). This value was then stored as the motif's score for nucleosome +1 and the procedure was repeated for all consensus motifs. For the genomic nucleosomes a similar analysis was performed. In order to speed up the analysis, 80,000 inferred positions were randomly selected from each sample. Subsequently, each consensus motif was mapped around the inferred nucleosome position and the average occurrence frequency was calculated from position -75 to +75 relative to it. A Fourier transform was then applied as before and the value for a period of 10.3 bp was used as the motif score in genomic nucleosomes.
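The mapping step with mismatches can be illustrated as follows. This is a simplified sketch with a hypothetical function name: it scans one strand only and treats any base outside the IUPAC class as a mismatch, which may differ from the tool actually used.

```python
# IUPAC codes that appear in the consensus motifs
IUPAC = {"W": "AT", "S": "CG", "R": "AG", "Y": "CT", "N": "ACGT"}

def motif_hits(seq, motif, max_mismatches=3):
    """Start positions (0-based) where `motif` matches `seq` with at most
    `max_mismatches` positions falling outside the IUPAC class."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        mismatches = sum(seq[i + j] not in IUPAC[code]
                         for j, code in enumerate(motif))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits
```

Averaging the resulting hit positions over all promoters (or over all sampled nucleosomes) gives the occurrence frequency profile to which the Fourier transform is applied.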

Periodicity analysis of Pol-II initiation patterns
CAGE data from different samples belonging to the same species were first merged into one file. TSS profiles were then extracted for promoter regions extending from -103 to +104 relative to the dominant TSS using the ChIP-Extract tool from the ChIP-Seq web server [68]. The resulting integer arrays were then converted into binary "micro-peak" arrays. Briefly, a micro-peak corresponds to a 5 bp window with a minimal number of 100 tags. The position of the micro-peak is then assigned to the position with the highest number of tags within the corresponding window. Each micro-peak was then given a maximum value of 1 tag. The cumulative frequency of micro-peaks was then determined at single-base resolution within a 200 bp region around the TSS.
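One possible reading of the micro-peak conversion is sketched below. Two points are assumptions that the text does not settle: the 5 bp windows are taken as non-overlapping, and the threshold of 100 tags is applied to the total count within the window. The function names are hypothetical.

```python
def micro_peaks(tags, window=5, min_tags=100):
    """Convert a per-base tag count array into a binary micro-peak array.

    Assumptions: windows are non-overlapping and `min_tags` applies to the
    window total. Within a qualifying window, the position with the highest
    count (the first one on ties) is set to 1.
    """
    peaks = [0] * len(tags)
    for start in range(0, len(tags) - window + 1, window):
        chunk = tags[start:start + window]
        if sum(chunk) >= min_tags:
            peaks[start + chunk.index(max(chunk))] = 1
    return peaks

def cumulative_frequency(peak_arrays):
    """Sum equally long binary micro-peak arrays position by position."""
    return [sum(column) for column in zip(*peak_arrays)]
```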
To identify promoters with a strong 10 bp periodicity in their micro-peak signal, promoters were ranked according to the covariance between their micro-peak distribution and a cosine function of period 10 bp. Promoters with a weak micro-peak signal (low covariance values) were selected such that their cumulative covariance was equal to 0.
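The ranking criterion can be sketched as follows. The phase of the cosine is an assumption: here its maximum is placed at the dominant TSS, so that micro-peaks at multiples of 10 bp from the TSS give a positive covariance. Function names are hypothetical.

```python
import math

def cosine_covariance(peaks, tss_index, period=10):
    """Population covariance between a binary micro-peak array and a cosine
    of the given period whose maximum sits at `tss_index` (assumed phase)."""
    n = len(peaks)
    cosine = [math.cos(2 * math.pi * (i - tss_index) / period)
              for i in range(n)]
    mean_p = sum(peaks) / n
    mean_c = sum(cosine) / n
    return sum((p - mean_p) * (c - mean_c)
               for p, c in zip(peaks, cosine)) / n

def rank_by_periodicity(profiles, tss_index, period=10):
    """Promoter names sorted from strongest to weakest covariance.

    profiles: dict mapping promoter name to its micro-peak array.
    """
    return sorted(profiles,
                  key=lambda name: cosine_covariance(profiles[name],
                                                     tss_index, period),
                  reverse=True)
```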

Nucleosome distribution around promoters
Nucleosome distributions for promoter subsets were computed from nucleosome mapping data using the ChIP-Cor program from the ChIP-Seq web server [68]. MNase- or ChIP-seq tags were centered by 70 bp to account for the estimated fragment size of about 140 bp (centering parameter of the ChIP-Seq server). Multiple tags mapping to the same genomic location were removed from the analysis (parameter "Count cut-off" set to 1) and tag frequencies were calculated in a 10 bp sliding window.

Evaluation of Dispersion Index (DI)
The spread of CAGE tags in a window of 100 bp around the TSS was expressed as a Dispersion Index (DI) using the following formula: where N is the total number of tag starts in the window around promoter k, and x_i is the mapped position of the 5' end of tag i. For each species, DI values were calculated for each promoter using CAGE data from individual samples. A DI was calculated only if more than 5 tags mapped in the selected region. The sample-specific DI values were then averaged to obtain a single, robust DI value for each promoter.

Analysis of genomic variants in GM12878 cell line and generation of a Most Likely (ML) genome
VCF files of Indels (version 2010_07) and SNPs (version 2010_03) for the GM12878 cell line were downloaded from the 1000 Genomes FTP server. All homozygous variants were extracted from these files and used to generate a GM12878 genome. Separately, the population frequencies of these variants were taken from the allele frequencies calculated by the final release of the 1000 Genomes Project (phase 3, 20130502). For each variant, the most frequent allele was stored and used to generate the Most Likely genome, which was then used as the reference. The final list of SNPs and Indels for the GM12878 cell line was restricted to the variants that differ from the ML genome.
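The allele selection logic can be illustrated with a small sketch. It assumes biallelic sites, keeps the reference allele when both alleles are equally frequent, and takes as input only sites at which GM12878 is homozygous for the alternative allele; these choices and the function names are assumptions for illustration and not a description of the actual pipeline.

```python
def ml_allele(ref, alt, alt_freq):
    """Most frequent allele at a biallelic site.

    Assumption: ties (alt_freq == 0.5) keep the reference allele.
    """
    return alt if alt_freq > 0.5 else ref

def variants_against_ml(sites):
    """Keep the sites at which the sample differs from the ML genome.

    sites: iterable of (chrom, pos, ref, alt, alt_freq) tuples, each
    homozygous for `alt` in the sample (assumed input layout).
    Returns (chrom, pos, ml_allele, sample_allele) tuples.
    """
    kept = []
    for chrom, pos, ref, alt, alt_freq in sites:
        ml = ml_allele(ref, alt, alt_freq)
        if ml != alt:
            kept.append((chrom, pos, ml, alt))
    return kept
```

Under these assumptions, a homozygous alternative allele carried by 90% of the population is already part of the ML genome and is dropped, whereas one carried by 10% is retained as a variant relative to the ML genome.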