Genomic Organization of Human Transcription Initiation Complexes

A repertoire of transcription initiation factors engages the core promoter of mRNA genes to recruit RNA polymerase (Pol) II to initiate transcription, yet their precise spatial organization remains unclear. Using ChIP-exo, here we detail the interactions and genomic organization of initiation factors TBP, TFIIB, and Pol II at mRNA genes and within CpG islands. We find that when Pol II moves into a transcriptionally paused state, TBP/TFIIB remain at the promoter. We show that TBP and TFIIB bound to the core promoter at two separate, resolvable locations that coincided with sites of divergent transcription initiation. We also examine the precise binding of TBP at Pol III transcribed tRNA genes. We find that TBP crosslinked to tRNA genes in a similar manner as at Pol II transcribed genes. This comprehensive and high resolution genome-wide detection of the initiation machinery produces a consolidated view of transcription initiation events in humans at Pol II coding and Pol III transcribed tRNA genes.


Introduction
The classic paradigm for assembling the minimal core transcription machinery at mRNA promoters starts with the recruitment of the TATA binding protein (TBP). Next is the docking of TFIIB, which straddles and locks onto TBP. Together with TFIIF, TFIIB then engages Pol II in its active site to help set the start site of transcription (TSS) [1,2]. The recruitment of the transcription machinery has long been thought to be an important rate-limiting step in gene expression [3]. Concepts in transcription initiation by all three RNA polymerases (I, II, and III) have been guided by this basic theme [4].
For RNA polymerase II in multicellular eukaryotes, some of the general transcription factors may be largely pre-assembled at promoters. There, Pol II is in a transcriptionally engaged but paused state, approximately 30-50 bp downstream from the TSS [5-7]. Further complicating the classic paradigm of transcription initiation of mRNA genes is the coupling of antisense transcription upstream of the core promoter [8]. These divergent TSSs are spaced roughly 250 bp apart, with some variance, and are driven by separate initiation complexes [9]. However, the precise genomic organization of human transcription complexes within this context remains unclear. Conventional genomic factor mapping approaches, such as ChIP-seq, are not of sufficiently high resolution to address this issue.
Most vertebrate promoters are found within stretches of DNA with a high density of CG dinucleotides, called CpG islands, that can be reversibly methylated [10]. Methylation of CpG islands is associated with transcriptional silencing, whereas unmethylated or hypomethylated CpG islands are thought to contribute to creating a permissive chromatin state for transcriptional activation by destabilizing nucleosomes [11]. The prevalence of CpG islands at promoters and their influence on transcription raise the question of how the transcription initiation machinery is organized within this genomic context.
In contrast, transcription initiation by RNA polymerase III at tRNA genes involves TFIIIC recognition of specific sequences downstream of the TSS; TFIIIC then recruits TFIIIB to a region immediately upstream of the TSS that lacks apparent sequence specificity [12,13]. Pol III then binds to form a pre-initiation complex. TFIIIB contains TBP (and BRF, a factor related to TFIIB), and thus it has been enigmatic as to how TBP in TFIIIB engages the upstream region.
In 2013, we published a manuscript detailing the organization of the TBP, TFIIB, and Pol II (PIC) components of the transcription machinery across the human genome in K562 cells and other transformed cell lines using the ChIP-exo genome-wide assay [14]. That paper was retracted in 2014 due to statistical errors concerning the specificity of DNA sequence elements associated with the identified PICs. Here we report those parts of the study that we deem to remain valid. This includes a characterization of the structural organization of TBP, TFIIB, and Pol II around coding genes and TBP at tRNA genes.

Pre-initiation complex (PIC) occupancy at mRNA genes
To obtain a detailed assessment of pre- and post-initiation complexes, we conducted ChIP-exo on TBP, TFIIB, and Pol II in the human erythroleukemia cell line K562. We focused on TBP and TFIIB to assess PIC formation because in yeast these proteins were the most detail-rich, whereas other initiation factors displayed essentially similar ChIP-exo patterns [15]. To assess post-initiation transcription complexes and the extent to which genes display promoter-proximal pausing, we performed ChIP-exo on the largest Pol II subunit (POLR2A). 8,364 TFIIB ChIP-exo peak-pairs (Table A in S1 File) were found within 500 bp of an mRNA TSS, which corresponds to ~50% of all annotated protein-coding K562-expressed genes (Fig 1A). Seemingly expressed genes that lacked a TBP/TFIIB location may have arisen from multiple sources, including rare but stable mRNAs, detection noise, and antisense transcription arising from a more distal promoter. TBP/TFIIB/Pol II occupancy and mRNA levels displayed a similar trend (Fig 1B), but were weakly correlated (S1 Fig), possibly due to differences in RNA stability.

Divergent mRNA transcripts arise from distinct initiation complexes
To gain detailed insight into the structural organization of human promoter initiation complexes, we focused on the 8,364 K562 TFIIB locations near the TSS of 6,511 coding RNAs as defined by RefSeq [16]. Fig 2A provides one example of the raw tag distribution concentrated 25 bp upstream of the RPS12 ribosomal protein gene TSS. When individual genes were examined (Fig 2B), or averaged (meta-analysis) across all 6,511 genes (Fig 2C), two regions of high TFIIB/TBP/Pol II occupancy were observed. The major right-ward peaks corresponded to primary promoter transcription initiation complexes (Fig 2C, upper panel). Those in the left-ward direction matched divergent TSSs [8,17-19], although the resulting RNA was less abundant than expected from TFIIB/TBP/Pol II occupancy levels (Fig 2C, lower vs upper panel; note that the 2° TSS represents only 24% of the total TSS signal). This may result from RNA instability, as seen in yeast [20-22]. The clear spatial separation of complexes indicates that divergent transcripts arise from distinct initiation complexes.

TBP/TFIIB separation from paused Pol II in CpG islands
CpG islands overlapped with nearly 80% (5,095) of the 6,511 mRNA gene promoters where we detected TFIIB ChIP-exo crosslinking. We found that on average 1.6 TFIIB complexes (8,254 locations within 5,095 CpG islands) were detected per CpG island, regardless of island length, with the center of the island being enriched ~100 bp downstream of the primary TSS (Fig 3A and 3B). Those complexes separated by >80 bp had uncorrelated occupancies (Fig 3C, black), which suggests that they are regulated independently. Those <80 bp apart appeared to be regulated coordinately.
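The comparison of occupancies on either side of the 80 bp separation can be illustrated with a small sketch. It assumes a hypothetical input layout of (distance, occupancy_a, occupancy_b) tuples for neighbouring complexes and uses a plain Pearson coefficient; the source does not state which correlation measure was used, so both the function names and the statistic are assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def correlation_by_separation(pairs, cutoff=80):
    """Correlate occupancies of neighbouring complexes, split by separation.

    pairs: iterable of (distance_bp, occupancy_a, occupancy_b); hypothetical layout.
    Returns (r_near, r_far) for pairs closer than, and farther than, the cutoff.
    """
    pairs = list(pairs)
    near = [(a, b) for d, a, b in pairs if d < cutoff]
    far = [(a, b) for d, a, b in pairs if d > cutoff]
    return pearson(*zip(*near)), pearson(*zip(*far))
```

A coefficient near zero for the distant group and a clearly positive one for the close group would correspond to the independent versus coordinate behaviour described in the text.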
For the vast majority of transcription units, Pol II crosslinked 50 bp downstream of the primary TSS (Fig 2B and 2C), where it is expected to pause after initiating transcription [7]. Pol II was most depleted 20-60 bp upstream of the consensus TSS, indicating that in proliferating K562 cells it likely spends less time there, on average, than at the pause sites. This suggests that in most (but not necessarily all) cases, once Pol II is recruited, it rapidly clears the promoter and assumes a paused state ~30-50 bp downstream of the TSS, consistent with the observation that Pol II pause release is a rate-limiting step in transcription [23,24].
The crosslinking pattern of human TFIIB was of particular interest since TFIIB in budding yeast crosslinks broadly across the relatively stable single-stranded DNA region within the Pol II active site at core promoters [15], in accord with crystallographic models of "open" complexes [25]. However, with the majority of human Pol II residing at pause sites just downstream of the core promoter, and RNA rather than TFIIB filling the Pol II active site region, the human TFIIB crosslinking patterns may not resemble those in yeast. Remarkably, the major crosslinking point for human TFIIB was ~20 bp upstream of the TSS, rather than spread across the TSS as in yeast. This location is precisely where TFIIB contacts DNA immediately downstream of TBP bound to the TATA box. Additional lower intensity crosslinking was observed near the TSS (Fig 2A and 2C, upper panel), which would be consistent with low levels of TFIIB [...]
We identified 150,753 putative low-threshold (>4 tag counts) TFIIB locations in K562 cells that were far (>500 bp) from the TSS of a protein-coding gene. Of these, 21,160 were also identified in the TBP dataset (within 20 bp), 9,192 in the Pol II dataset (within 150 bp), and 2,353 in a no-antibody control (within 150 bp, data not shown). About 57% and 5% of these TBP/TFIIB (11,192 of 19,952) and TBP/TFIIB/Pol II (47 of 1,033) co-occupied locations (and not in the no-antibody control), respectively, resided in ENCODE-annotated [26] heterochromatic regions of the genome. The cohort of 11,192 TBP/TFIIB promoter-distal locations appears to be mostly (57%) isolated heterochromatic complexes whose physiological significance remains to be determined, although 36% of these locations did reside in ENCODE-annotated promoter/enhancer/transcription regions of the genome. In contrast, the vast majority (90%) of TBP/TFIIB/Pol II co-occupied locations (n = 1,033) resided in ENCODE-annotated promoter/enhancer/transcription regions of the genome.
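The co-occupancy counts above depend on matching locations between datasets within fixed distance windows (20 bp for TBP, 150 bp for Pol II). The following is a minimal sketch of such a match, assuming each dataset is a list of hypothetical (chromosome, position) tuples and that the windows are inclusive. It does not reproduce the no-antibody control filtering, and the function names are illustrative rather than taken from the authors' pipeline.

```python
from bisect import bisect_left
from collections import defaultdict

def index_by_chrom(locations):
    """Group (chrom, position) tuples into sorted position lists per chromosome."""
    index = defaultdict(list)
    for chrom, pos in locations:
        index[chrom].append(pos)
    for positions in index.values():
        positions.sort()
    return index

def has_neighbor(index, chrom, pos, window):
    """True if the index holds a position within `window` bp of `pos` on `chrom`."""
    positions = index.get(chrom, [])
    i = bisect_left(positions, pos - window)  # first candidate at or after the left edge
    return i < len(positions) and positions[i] <= pos + window

def co_occupied(tfiib, tbp, pol2, tbp_window=20, pol2_window=150):
    """Return TFIIB sites that have TBP nearby, and the subset that also has Pol II."""
    tbp_idx, pol2_idx = index_by_chrom(tbp), index_by_chrom(pol2)
    with_tbp, with_tbp_and_pol2 = [], []
    for chrom, pos in tfiib:
        if has_neighbor(tbp_idx, chrom, pos, tbp_window):
            with_tbp.append((chrom, pos))
            if has_neighbor(pol2_idx, chrom, pos, pol2_window):
                with_tbp_and_pol2.append((chrom, pos))
    return with_tbp, with_tbp_and_pol2
```

Sorting positions once per chromosome and using a binary search keeps each lookup logarithmic, which matters when matching on the order of 150,000 locations.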

TFIIB locations across cancer cell lines
To assess the extent to which TFIIB occupancy at coding genes differed across cancer cell lines, we conducted ChIP-exo mapping of TFIIB locations across four ENCODE cancer cell lines: HeLa S3, HepG2, MCF7, and K562 (cervical, liver, breast, and blood, respectively). We detected TFIIB at 9,069 mRNA genes in at least one cell line, and at 1,691 genes in all lines (group 1) (Fig 4 and Table B in S1 File). The remaining 7,378 genes were parsed by K-means clustering into three additional groups. For group 1, gene ontology analysis [27] revealed that these genes tend to be housekeeping genes, such as those involved in translation, chromatin assembly, and RNA splicing (P = 10^−55, 10^−12, and 10^−11, respectively). As expected for housekeeping genes, these genes displayed similar levels of TFIIB occupancy across the four cancer cell lines tested. Group 2 displayed some notable differences in occupancy, suggesting tissue-specific TFIIB promoter occupancy for some of the genes. In particular, for group 2 genes, relative TFIIB occupancy was increased in HepG2/MCF7 and decreased in K562/HeLa. Group 2 was enriched with genes encoding transcription activators (P = 10^−6), consistent with the frequent tissue-specific expression of TFs [28,29]. Groups 3 and 4 comprised many of the lowly occupied (blue) genes across the four cancer cell lines, suggesting that these genes may be expressed at a basal level. For groups 3 and 4, gene ontology analysis showed an enrichment of RNA processing, catabolic, and cell cycle genes (P = 10^−15, 10^−15, and 10^−16, respectively) that may not be as highly expressed as the translation (e.g., ribosomal protein subunits) and chromatin assembly (e.g., histones) genes found in group 1, which are among the most highly transcribed genes.
MCF7 and HepG2 TFIIB profiles were most distant from K562 on the dendrogram plot (Fig 4, above heatmap), suggesting that the patterns of TFIIB occupancy differed most between blood (K562) and breast/liver (MCF7/HepG2) tissues.
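The grouping of genes by their occupancy profiles across cell lines can be sketched with a bare-bones K-means (Lloyd's algorithm). The source does not specify the software, distance measure, initialization, or data transformation that was used, so everything below is an assumption: each gene is a hypothetical row of four occupancy values (one per cell line), distance is squared Euclidean, and the starting centroids are passed in explicitly to keep the result deterministic.

```python
def kmeans(rows, centroids, iterations=20):
    """Lloyd's algorithm on rows of floats; returns (labels, centroids).

    rows: one list of occupancy values per gene (hypothetical layout).
    centroids: initial cluster centres, one list per cluster.
    """
    centroids = [list(c) for c in centroids]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(row, centroids[k])))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its member rows.
        for k in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With three initial centroids this yields three groups, matching the number of additional groups reported; in practice an established library implementation with multiple random restarts would be preferable to this illustration.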

TBP binds ~21 bp upstream of tRNA gene TSSs
Given that the initiation of Pol III genes requires a set of complexes distinct from that of Pol II genes, yet both share TBP, we examined whether TBP crosslinked to tRNA genes in a similar manner as at Pol II transcribed genes. We aligned strand-separated exonuclease stop sites to the TSS of all 386 tRNA genes at which TBP was detected (Fig 5B). Remarkably, as evidenced [...] (Fig 5A), which is in line with what was observed at Pol II promoters. Unlike at Pol II promoters, almost no crosslinking was observed downstream of the TSS, which may reflect a lack of Pol III pausing (through which TBP might crosslink), in contrast to Pol II. Similar to TBP crosslinking through TFIIB, TBP might predominantly crosslink through BRF. Indeed, the peaks of BRF and TBP crosslinking are coincident at Pol III genes in mice [30]. If true, then TBP in complex with a TFIIB family member might engage the core promoter similarly in Pol II and III systems.
Fig 2 (caption, partial): [...] non-overlapping 10 bp intervals relative to the TSS, and then the average peak-pair density value across all TFIIB-occupied (n = 6,511) genes was plotted as a percent of the total. The "spikes" of TBP and TFIIB are indiscernible (vertically offset in inset). Lower panel: Distribution of 2° polyadenylated RNA [39], with traces separated by sense (blue) and antisense (red, inverted trace) orientations relative to the corresponding mRNA TSS. doi:10.1371/journal.pone.0149339.g002

Consolidated genomic view of initiation
Genome-wide mapping of the general transcription machinery at near single-base resolution offers a clearer view of certain transcription initiation events from yeast to humans, Pol II to Pol III, and mRNA to tRNA. In general, a TFIIB/BRF family member is recruited to all coding or noncoding core promoters via a TBP family member in a spatially constrained manner. As established elsewhere [24,31-33], Pol II then scans downstream, where it encounters a TSS that allows for productive transcription. In metazoans, Pol II subsequently pauses 30-50 bp further downstream. In yeast, a nucleosome border may help set the start site of productive transcription [34]. Although core promoter regions are seemingly long (~40 bp in human) for sequence-specific binding, they do not appear to be enriched with well-defined sequence elements and so may have evolved to have inherently low specificity, presumably to keep basal transcription low and to maintain high dependence on transcriptional activators. Divergent antisense transcription is a common feature of mammalian promoters [8,18]. Whether these divergent transcription events arise from distinct PICs was unclear and unresolvable with the lower resolution of conventional ChIP-seq. We show that TBP and TFIIB bound to core promoter regions at two separate resolvable locations that coincided with sites of divergent transcription initiation. A recent study confirmed our original findings and interestingly extended them to suggest a unified architecture of bidirectional transcription initiation at promoters and enhancers [9].
Although the transcription of tRNA genes requires an almost entirely distinct set of machinery, the TBP subunit nevertheless crosslinks at the same genomic position relative to the TSSs of Pol II transcribed genes and Pol III transcribed tRNA genes. Therefore, TBP in complex with a TFIIB/BRF family member may engage the core promoter in Pol II and III systems in a fundamentally similar manner.

ChIP-exo and Antibodies
ChIP-exo was carried out as previously described [35], with the following modifications: chromatin extracted from 10 million cells, ProteinG MagSepharose resin (GE Healthcare), and 3 µg of antibody against either TFIIB (Santa Cruz Biotech, sc-225), TBP (Santa Cruz Biotech, sc-204), or Pol II (Santa Cruz Biotech, sc-899, directed against the N-terminus of the Pol II large subunit encoded by POLR2A).

Alignment to Genome, Peak Calling, and Data Access
Libraries were sequenced on an Illumina HiSeq sequencer. The entire length of each sequenced tag was aligned to the human hg18 reference genome using BWA [36] with default parameters. Raw sequencing data are available at the NCBI Sequence Read Archive (SRA067908), and a sequencing statistics summary may be found in Table C in S1 File. The resulting sequence read distribution was used to identify peaks on the forward (W) and reverse (C) strands separately using the peak calling algorithm in GeneTrack (sigma = 20, exclusion zone = 40 bp) [37]. For strand-specific and strand-merged plots, sequencing tags were normalized to input. To obtain RPKM (Reads Per Kilobase per Million mapped reads) normalized counts, input tags were binned relative to TSSs in the same manner as ChIP data, RPKM was computed, and then for each corresponding bin the RPKM for each ChIP data set was divided by the input RPKM. Peaks were paired if they were 0-80 bp in the 3′ direction from each other and on opposite strands. Any peak-pair locations that were present in the ENCODE designated blacklist were removed from the analysis [38]. Since patterns described here were evident among individual biological replicates, and replicates were well correlated, we merged all tags from biological replicate data sets to make final peak-pair calls. TFIIB peak-pairs in K562 cells were initially considered for preliminary analysis if they had a low threshold tag count of >4 in the merged datasets. 159,117 locations met these criteria. Subsequent analysis revealed that 8,364 TFIIB locations were near (within 500 bp) a RefSeq TSS, while the remaining 150,753 TFIIB locations were TSS-distal (>500 bp). Input tag distribution relative to the 8,364 TFIIB locations is shown in S2 Fig. NCBI-curated RefSeq TSSs (n = 26,987) [16] comprising 23,562 nonredundant genes were considered (Table D in S1 File).
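The peak-pairing rule, the tag-count threshold, and the input normalization described above can be illustrated together. This is a sketch under stated assumptions, not the GeneTrack implementation: peaks are hypothetical (position, tag_count) tuples on a single chromosome, each forward-strand peak is paired with the nearest reverse-strand peak 0-80 bp downstream, the pair's tag count is taken as the sum of both peaks, and empty input bins return None because the source does not say how they were handled.

```python
from bisect import bisect_left

def pair_peaks(forward, reverse, max_gap=80):
    """Pair each forward (W) peak with the nearest reverse (C) peak 0..max_gap bp downstream.

    forward, reverse: lists of (position, tag_count) on one chromosome.
    Returns (midpoint, gap, total_tags) tuples.
    """
    reverse = sorted(reverse)
    rev_positions = [pos for pos, _ in reverse]
    pairs = []
    for f_pos, f_tags in sorted(forward):
        i = bisect_left(rev_positions, f_pos)  # first C peak at or after the W peak
        if i < len(reverse) and rev_positions[i] - f_pos <= max_gap:
            r_pos, r_tags = reverse[i]
            pairs.append(((f_pos + r_pos) // 2, r_pos - f_pos, f_tags + r_tags))
    return pairs

def passes_threshold(pair, min_tags=4):
    """Keep peak-pairs whose tag count exceeds min_tags (the >4 criterion)."""
    return pair[2] > min_tags

def rpkm(count, bin_bp, total_mapped):
    """Reads per kilobase of bin length per million mapped reads."""
    return count / ((bin_bp / 1e3) * (total_mapped / 1e6))

def input_normalized(chip_count, input_count, bin_bp, chip_total, input_total):
    """ChIP RPKM divided by input RPKM for one bin; None when the input bin is empty."""
    if input_count == 0:
        return None  # assumption: handling of empty input bins is not given in the source
    return rpkm(chip_count, bin_bp, chip_total) / rpkm(input_count, bin_bp, input_total)
```

Because the bin width appears in both RPKM terms, it cancels in the ratio; it is kept in the signature only so that the intermediate RPKM values remain meaningful on their own.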
Assignment of TFIIB (8,364 peak-pairs) and TBP (7,642 peak-pairs) to the nearest RefSeq TSS required that they be within 500 bp of the TSS, yielding 6,511 nonredundant mRNA genes. Importantly, using a more stringent interval only marginally changed these numbers and did not alter our conclusions. If a gene had >1 TSS, then the TSS nearest to the bound location (peak-pair midpoint) was used as the primary TSS, and other nearby TSSs were considered secondary (Fig 1C, lower panel).
S1 File. This file contains Tables A-D. Table A: TFIIB locations at protein coding genes, Table B: TFIIB occupancy across cancer cell lines, Table C: Sequencing statistics, and Table D: hg18 RefSeq Annotation. (XLSX)
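The nearest-TSS assignment rule (within 500 bp, with the closest TSS of a multi-TSS gene treated as primary) can be sketched as follows. The input layouts are hypothetical: bound locations are (chromosome, midpoint) tuples and TSS annotations are (chromosome, position, gene) tuples. Strand is ignored and the 500 bp limit is treated as inclusive, both of which are simplifying assumptions.

```python
def assign_to_nearest_tss(locations, tss_sites, max_dist=500):
    """Map each bound location to (gene, primary_tss_position) if a TSS lies within max_dist.

    locations: list of (chrom, midpoint) for peak-pairs.
    tss_sites: list of (chrom, position, gene); a gene may appear with several TSSs.
    """
    by_chrom = {}
    for chrom, pos, gene in tss_sites:
        by_chrom.setdefault(chrom, []).append((pos, gene))

    assigned = {}
    for chrom, mid in locations:
        candidates = by_chrom.get(chrom, [])
        if not candidates:
            continue
        # The nearest TSS wins; for a multi-TSS gene this is its primary TSS.
        pos, gene = min(candidates, key=lambda t: abs(t[0] - mid))
        if abs(pos - mid) <= max_dist:
            assigned[(chrom, mid)] = (gene, pos)
    return assigned

def nonredundant_genes(assigned):
    """Set of distinct genes that received at least one bound location."""
    return {gene for gene, _ in assigned.values()}
```

The linear scan over candidates keeps the example short; for genome-scale annotation a sorted index with binary search, or an interval tree, would be the more usual choice.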