A novel core promoter element induces bidirectional transcription in CpG island

How TATA-less promoters such as those within CpG islands (CGI) control gene expression is still a subject of active research. Here, we have identified the “CGCG element”, a ten-base-pair motif with a consensus sequence of TCTCGCGAGA present in a group of promoter-associated CGIs of ribosomal protein and housekeeping genes. This element is evolutionarily conserved in vertebrates, found in DNase-accessible regions and employs RNA polymerase 2 to activate gene expression. Through extensive analysis of several endogenous promoters, we demonstrate that this element activates bidirectional transcription through divergent start sites. Methylation of this element abrogates the associated promoter activity. When the element coincides with a TATA-box, directional transcription remains CGCG-dependent. Because the CGCG element is sufficient to drive transcription, we propose that its unmethylated form functions as a core promoter of TATA-less CGI-associated promoters.


Gene expression is one of the most critical, yet enigmatic, biological processes that defines cellular and organismal identity and mediates cellular responses to internal and external stimuli [1]. Importantly, dysregulation of this process is known to contribute to various human diseases such as cancer [2]. Since the discovery of RNA polymerases, the mechanisms of transcription have been studied extensively in many organisms [3]. In contrast to the relatively simple prokaryotic transcriptional system, metazoan transcription is considerably more elaborate and involves complicated promoter structures, multiple functional DNA elements and a repertoire of specific general transcription factors. These factors and DNA elements are required to facilitate accurate transcriptional initiation, elongation, and termination [4-6].

The best-known DNA element that mediates the initiation of transcription of protein-coding genes is the TATA box, with the consensus sequence TATAA [7]. This element is usually located 25 to 34 base pairs upstream of transcription start sites (TSS). However, most human promoters, including those regulating housekeeping genes, lack this DNA element [8], suggesting that TATA-less promoters are controlled by different yet poorly understood mechanisms. A few novel elements have been described that presumably function as core promoter elements in TATA-less promoters [9-12]. Yet most of these promoter elements (e.g. GC-box or Inr motif) require additional transcriptional activator binding sites in order to drive directional transcription.

Vertebrate genomes contain short G+C rich sequences, typically less than 1 kb long, traditionally termed CpG islands (CGIs) [13,14]. These regions are considered to be [...] demonstrate that the CGCG element suffices as a core promoter element to drive bidirectional transcription. Gene Ontology analysis indicates that this element is enriched in the promoters of housekeeping genes, most notably those controlling RNA metabolism and translation, and in promoters producing long non-coding RNAs. Together, our results indicate that the CGCG element functions as a previously unknown driver of CGI-associated TATA-less promoters.

Motif discovery in DNase-sensitive CpG islands
Roughly 50 percent of human promoters are associated with a CGI [27]. To identify novel CGI-associated promoter elements that can drive transcription independently of other promoter elements and are enriched in human CGIs (~30k), we extracted CGI sequences that overlapped with DNase-accessible regions (~192k DNase-seq peaks) in the K562 cell line. We then performed an unbiased motif discovery to identify the most enriched motifs in transcriptionally active CGI-associated promoters (figure 1a). As expected, the SP1 binding site (GC box) had the highest enrichment score, consistent with its purported role in driving TATA-less promoters. Binding sites for NRF and ETS were also identified, consistent with roles for these transcription factors in the regulation of CGI-associated housekeeping genes [28]. We also identified two novel sequence motifs (#7 and #10) that were highly conserved within vertebrates. More than 400 instances of motif #10 coincided with DNase-seq footprints in multiple cell lines (K562 is shown), suggesting that this motif represents a shared regulatory element (figure 1b, Supplementary figure 1a). Although most CGI-associated promoters contain one copy of the motifs shown in figure 1a, motifs 7 and 10 occur in multiple copies in a given promoter (figure 1c). Genome Ontology and metagene profile analyses showed that motifs 7 and 10 are significantly enriched in annotated human CGI-containing promoters, with motif 10 being far more enriched in promoters of annotated coding and non-coding genes despite being less frequent (figure 1b; motif 7 = 1408 vs. motif 10 = 413 copies) (figure 1d).

CGCG elements recruit transcriptional machinery and activate gene expression
To determine whether motifs 7 and 10 could confer transcriptional activity independently, we cloned the sequence of the most common variant of each motif (ACTACAATTCCC and TCTCGCGAGA, respectively) into the promoterless firefly luciferase reporter construct pGL2-basic. The resulting constructs were then separately cotransfected, along with a control reporter for Renilla luciferase driven by the HSV-1 thymidine kinase promoter (pRL-TK), into human embryonic kidney (HEK293T) cells. Motif 10, but not motif 7, significantly activated firefly reporter gene expression (figure 2a). This result encouraged us to focus on motif 10, which we named the "CGCG element" based on its central motif. A genome-wide analysis found that this element maps within 50 bp of annotated TSSs in the human and mouse genomes (Supplementary figure 1b), suggesting that it could function as a core promoter element [29]. To address the function of a specific naturally occurring CGCG element, we analyzed the CGI-containing promoter of the human Density Regulated gene (DENR). The DENR promoter contains three tandem CGCG elements separated by 21 and 11 nucleotides (figure 2b). To determine the role of each CGCG element in this promoter, we inserted promoter fragments containing CGCG #1, CGCG #1,2 and CGCG #1,2,3 into pGL2-basic. Although a single copy of the CGCG element significantly increased reporter activity, there was a 7- and 17-fold increase in reporter activity with the addition of the second and third CGCG elements, respectively. Introducing G to T mutations in all CGCG elements (CTCG #1,2,3) dramatically decreased promoter activity, suggesting that the CGCG element is necessary and sufficient to drive reporter expression and that multiple CGCG elements act cooperatively (figure 2c).
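
The reporter comparisons above rest on a dual-luciferase readout. The following is a minimal sketch of how such readings are commonly normalized, assuming (this section does not state it) that firefly signal is divided by the co-transfected Renilla control and then expressed relative to the empty vector. The function names and all numeric readings are hypothetical and are not data from this study.

```python
def relative_activity(firefly: float, renilla: float) -> float:
    """Firefly signal normalized to the Renilla transfection control."""
    if renilla <= 0:
        raise ValueError("Renilla reading must be positive")
    return firefly / renilla


def fold_activation(test: tuple, empty: tuple) -> float:
    """Normalized activity of a test construct relative to the empty vector.

    Each argument is a (firefly, renilla) pair of raw luminescence readings.
    """
    return relative_activity(*test) / relative_activity(*empty)


if __name__ == "__main__":
    # Hypothetical readings, chosen only to illustrate the arithmetic.
    empty_vector = (50.0, 50.0)
    test_construct = (700.0, 100.0)
    print(fold_activation(test_construct, empty_vector))
```

Dividing by the Renilla signal corrects for well-to-well differences in transfection efficiency before constructs are compared.
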
To determine if CGCG element-driven gene expression is dependent on RNA polymerase 2 (POL2), we transfected HEK293T cells with reporter constructs that contain either the consensus motif (TCTCGCGAGA) or a CTCG mutation (TCTCTCGAGA) and performed a chromatin immunoprecipitation (ChIP) for POL2 [30] [...] protein levels compared to WT controls (figure 2g). Together with the reporter analyses, these findings suggest that CGCG elements actively recruit transcriptional machinery and promote gene expression in the CGI-associated promoter of the DENR gene.

CGCG element confers bidirectional transcription activity
Due to the palindromic nature of the TCTCGCGAGA motif, we wondered whether the CGCG elements could also activate bidirectional transcription. To test this, we developed a novel bidirectional reporter construct (LuBiDi) to measure promoter activity using firefly and Renilla luciferase genes as reporters of directional transcription from a central control motif (figure 3a).

We inserted one or two copies of the TCTCGCGAGA motif into the LuBiDi plasmid and measured reporter activity. A single CGCG element was sufficient to induce both firefly and Renilla reporters, whereas two CGCG elements induced an additional 4-fold increase (figure 3b). To study the motif sequence requirement for this activation, we introduced mutations that disrupted the wild-type sequence at various locations. First, to determine whether the palindromic structure was more important than sequence content in conferring the bidirectional transcriptional activity, we exchanged the flanking sequences to form AGACGCGTCT, which maintains both symmetry and CpG content. This mutation abrogated the dual activation of reporters (figure 3b), suggesting that the CGCG element has sequence polarity. A CGCG -> CTCG substitution (TCTCTCGAGA, reduced CG content) and an "A" insertion into CGCG (TCTCGACGAGA, unchanged CpG content) also abrogated dual reporter activity (figure 3b). The inclusion of two copies of the A insertion mutant failed to induce transcription. Altogether, these results indicate that the WT element, comprising the CGCG core plus the flanking palindromic sequences found in motif 10, is required for bidirectional transcriptional activity.
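
The sequence properties invoked in this section (reverse-complement symmetry and CpG content) can be checked directly. The sketch below is illustrative only: the helper names are ours, and the four sequences are the wild-type motif and the three mutants quoted in the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_palindromic(seq: str) -> bool:
    """A DNA palindrome reads the same on both strands (5' to 3')."""
    return seq == reverse_complement(seq)


def cpg_count(seq: str) -> int:
    """Number of CG dinucleotides in the sequence."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


VARIANTS = {
    "wild type": "TCTCGCGAGA",
    "flank exchange": "AGACGCGTCT",
    "CTCG mutant": "TCTCTCGAGA",
    "A insertion": "TCTCGACGAGA",
}

if __name__ == "__main__":
    for name, seq in VARIANTS.items():
        print(f"{name:15s} {seq:12s} palindromic={is_palindromic(seq)} CpG={cpg_count(seq)}")
```

The wild-type motif and the flank-exchange mutant are both palindromic with two CpGs, so the loss of activity in the flank-exchange mutant cannot be attributed to symmetry or CpG content alone.
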
To analyze the expression dynamics of CGCG elements in single cells, we developed another promoter-less bidirectional reporter (pmCGFP) that codes for enhanced Green [...]

To study the role of copy number variation on bidirectional transcription activity in more detail, we generated LuBiDi reporters that contain one, two or four copies of TATCGCGAGA, a common variant of the CGCG element with an imperfect palindrome. Reporter activity increased proportionally with the number of motifs as measured by luciferase activity or luciferase transcript levels (figure 3e, f).

Endogenous CGCG elements confer transcriptional activity in CGI-associated promoters and methylation abrogates their promoter activity
To determine if CGCG elements are associated with bidirectional transcription from endogenous promoters, we analyzed a previously published GRO-cap (global run-on sequencing followed by enrichment for 5'-cap structure) analysis performed on K562 cells [33]. GRO-cap allows for the detection of nascent, often unstable, strand-specific RNA transcripts that are usually undetectable by common RNA-seq methods, likely because of the greatly increased sequencing depth near the TSS associated with directional transcription of coding RNAs. We found that bidirectional transcription is associated almost exclusively with CGCG elements that occur in CGI-enriched promoters (figure 4a). Gene Ontology (GO) analysis showed that genes containing a CGCG promoter element produce protein-coding transcripts whose products form discernible protein-protein interaction networks (Supplementary figure 3). Specifically, these genes encode core components of RNA metabolism and the translational apparatus (Table 1) [...]

Given that the CGCG element drives bidirectional transcription, we were interested in determining the frequency of this element in annotated uni- vs. bidirectional promoters. The vast majority of CGCG elements (93%) occur in annotated unidirectional promoters that drive coding RNAs or lncRNAs, while 7% occur in an annotated bidirectional promoter (Table 2). However, recent studies suggest that the majority of what were classically defined as unidirectional promoters produce unstable "promoter upstream transcripts" (PROMPTs) [37]. Based on this, we investigated the role of CGCG elements in three different endogenous promoters that differ in their annotated directionality and in whether they combine a CGCG element with a TATA-box. To determine the role of endogenous CGCG elements, we disrupted the CGCG element but maintained CG content by exchanging the flanking sequences (i.e. TCTCGCGAGA to AGACGCGTCT). We first focused on the POLR1C/YIPF3 bidirectional promoter region, which has two TSSs separated by 30 nucleotides that flank a single CGCG element. We inserted a promoter fragment (~30 bp) containing the wild-type CGCG element into the LuBiDi construct and, as a comparison, generated constructs in which the flanking sequences (AGA and TCT) were exchanged. The WT fragment from the POLR1C/YIPF3 promoter induced bidirectional expression irrespective of its orientation (figure 5a). In contrast, the flank-exchanged mutants, regardless of insert orientation, did not show any discernible reporter activity. [...]

An interesting yet poorly studied feature of vertebrate genomes is the presence of CpG-rich regions known as CGIs [14]. Although CGIs mark transcriptionally active regions of the genome, the mechanism of RNA polymerase recruitment in these regions has been elusive [13]. Through enrichment analysis, we found that CGCG elements are enriched in CGI-containing promoters and that they can recruit transcriptional machinery to promote bidirectional transcription, a feature that most transcriptionally active CpG islands were shown to possess [19]. Additionally, we provide evidence that in some rare cases, the CGCG element could interact functionally with an adjacent TATA-box within a CGI to activate gene expression. Similar synergistic activities have been described previously [43,44], suggesting that the CGCG element shares this attribute with other known core promoter elements.

How housekeeping genes whose products are core components of cellular processes are transcriptionally regulated is poorly understood. In this study, we found that genes whose products play a central role in translation and transcription are enriched for CGCG elements in their CGI-associated promoters.
This analysis led us to identify a group of ribosomal genes whose CpG-rich promoters contain one or multiple copies of CGCG elements (Supplementary Figure 5). These promoters do not contain the previously described TCT motif that is thought to regulate the transcription of the other group of ribosomal genes in humans [40]. These results suggest that TCT and CGCG elements regulate the expression of different sets of ribosomal genes in humans. In addition to genes encoding ribosomal proteins, promoters of key translation initiation factor genes encoding EIF5, EIF3H, and DENR, as well as the essential translation termination factor ETF1, contain copies of the CGCG element. This is consistent with the current perspective that different classes of promoter elements regulate functionally distinct protein-coding genes [1].

Additionally, we directly demonstrated that methylation of CpGs in the CGCG element could suppress its promoter activity. Indeed, roughly 80 percent of CpG sites in the genome, particularly CpGs that occur outside of CGIs, are methylated [45]. We speculate that a switch-like mechanism could activate or repress gene expression based on the methylation status of CGCG elements. Accordingly, we propose a model in which CGCG elements, when occurring in CGIs, are protected from methylation, thereby maintaining promoter activity in housekeeping genes. In contrast, CGCG elements in other regions of the genome would be more subject to methylation, resulting in transcriptional silencing. In theory, DNA methylation of CGCG elements could protect the genome from spurious transcription, as reviewed elsewhere [46]. A similar switch-like mechanism has been described for a group of transcription factors whose binding sites contain CpG motifs, in which CpG methylation affects the binding affinity of transcription factors such as Kaiso [47].
Although the nature of the factor, or factors, that bind to the non-methylated CGCG element has yet to be clarified, our results suggest that ChIP-seq studies should be interpreted with greater consideration of the differential binding of proteins to methylated or non-methylated CpG-containing motif sequences.

In a recent study, Dual Specificity Kinase 1 (DYRK1A) was identified as a novel POL2 C-terminal domain (CTD) kinase and activator of RNA polymerase 2 [48]. Subsequent ChIP-seq analysis of DYRK1A showed that this protein is specifically enriched in CGCG-containing promoters. It has been suggested that RNA polymerases are recruited through various transcriptional preinitiation complexes (PIC) that specifically regulate different promoter classes [1,49]. Therefore, we speculate that CGCG elements directly or indirectly recruit DYRK1A as a component of a novel PIC that remains to be completely elucidated.

In conclusion, this study provides strong evidence that the CGCG element is [...]

Oligonucleotide pull-down assay
To determine whether CGCG elements can bind to Kaiso, we separately synthesized biotin-tagged DNA duplexes that contained unmodified TCTCGCGAGA, TCTCTCGAGA or completely methylated TCT(me)CG(me)CGAGA. 10 µM of each duplex was bound to 100 µl Streptavidin Dynabeads and washed as recommended by the manufacturer (Invitrogen). HEK293T cells were lysed using NET-N buffer containing protease inhibitor cocktail (Sigma) and incubated on ice for 30 min. Lysates were centrifuged at 12000 RPM for 10 min to pellet cellular debris, and supernatant representing 500 µg protein was mixed with duplex-charged beads and incubated at 4 °C overnight. The beads were washed five times with NET-N buffer, incubated with 50 µl Laemmli loading buffer (1X: 0.02% w/v bromophenol blue, 4% SDS, 20% glycerol, 120 mM Tris-Cl, pH 6.8) and boiled for 5 min to elute bound proteins. The proteins were analyzed by immunoblotting for Kaiso (Santa Cruz D-10) and control antibody.

Rapid Amplification of cDNA ends (5´ RACE)
To determine divergent TSSs, we transfected near-confluent HEK293T cells in 10 cm dishes with 5 µg LuBiDi construct along with 0.5 µg pEGFP-C1 to monitor transfection on the following day. RNA was extracted as described before 72 h after transfection. The quality and purity of RNA were evaluated using an Agilent 2100 Bioanalyzer, and samples with RNA integrity number (RIN) values >= 8.0 were selected for further analysis. The SMARTer 5´ RACE (Clontech) protocol was used to determine divergent TSSs from 10 µg of total RNA. Briefly, the RNA was first reverse transcribed at 42 °C for 90 min using poly-dT primers and extended beyond the TSS using RT-mediated template switching, which employs the SMARTer IIA Oligonucleotide as the template only when the 5´ cap is encountered. The resulting cDNA products were amplified using specific internal primers for either firefly or Renilla plus the Clontech Universal Primer Mix (UPM). A GFP primer set was used as an internal control. Primer sequences used in RACE experiments are provided in the Supplementary Information 1. The PCR products containing TSSs were directionally cloned into the linearized pRACE vector using the In-Fusion HD system, and individual bacterial clones were obtained following transformation of the ligated products into Stellar competent cells. Sanger sequencing of the resulting plasmid clones (using M13 primer) was used to identify TSSs.

Motif Discovery
The CpG island annotation track in the human genome (hg38) was downloaded from the UCSC genome browser (https://genome.ucsc.edu), and sequences that overlap with the K562 DNase-seq peaks track were extracted using Bedtools [53]. The resulting sequences were used for motif discovery using the findMotifgenomewide script in the Homer bioinformatics software suite using default command line arguments for the human genome [54].
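
The overlap step above can be pictured with a small stand-in. The sketch below is not Bedtools and does not reproduce its options; it is a plain-Python illustration that assumes BED-style half-open intervals (chromosome, start, end) and keeps each CpG island that overlaps at least one DNase-seq peak. The coordinates in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open like BED


def islands_overlapping_peaks(islands: List[Interval],
                              peaks: List[Interval]) -> List[Interval]:
    """Return the CpG islands that overlap at least one peak by >= 1 bp."""
    peaks_by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in peaks:
        peaks_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in islands:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(p_start < end and start < p_end
               for p_start, p_end in peaks_by_chrom.get(chrom, [])):
            kept.append((chrom, start, end))
    return kept


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    cgis = [("chr1", 100, 500), ("chr1", 1000, 1400), ("chr2", 100, 500)]
    dnase = [("chr1", 450, 600), ("chr2", 500, 700)]
    print(islands_overlapping_peaks(cgis, dnase))
```

The linear scan per island is adequate for a sketch; for genome-scale tracks a sorted sweep or interval tree would be the usual choice.
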

Genomic annotation and Metagene analysis
The scanMotifgenomewide script from the Homer program version 4.8 was used to locate all instances of motifs 7 and 10 in the human (hg38) and mouse (mm9) genomes. The annotatePeaks script (Homer) was used for motif co-occurrence, genomic annotation, metagene, and enrichment analyses.
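
As a rough illustration of locating motif instances on both strands, the sketch below performs exact string matching of a consensus sequence and its reverse complement. This is a simplification and an assumption on our part: Homer scans with position weight matrices and score thresholds, which this example does not attempt to reproduce. The test sequences are made up.

```python
from typing import List, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_motif(sequence: str, motif: str) -> List[Tuple[int, str]]:
    """Return (0-based start, strand) for every exact motif occurrence.

    Overlapping matches are reported. A palindromic motif matches both
    strands at the same position, so it is reported once, on the + strand.
    """
    sequence, motif = sequence.upper(), motif.upper()
    patterns = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:
        patterns.append(("-", rc))

    hits = []
    for strand, pattern in patterns:
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    print(find_motif("AATCTCGCGAGATT", "TCTCGCGAGA"))      # motif 10 consensus
    print(find_motif("AAGGGAATTGTAGTAA", "ACTACAATTCCC"))  # motif 7, minus strand
```
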

ENCODE Conservation, DNase-seq, GRO-Cap, WGBS
Processed data points for hg38 were extracted and processed using Wigman software for 50 bp upstream and downstream windows for each motif occurrence. For ENCODE WGBS (accession number ENCFF867JRG). The PhyloP and PhastCons conservation scores for the hg38 assembly were downloaded from the UCSC genome browser (http://hgdownload.cse.ucsc.edu/downloads.html). ENCODE accession number ENCFF867JRG was used for K562 DNase-seq data. The GRO-Cap dataset for K562 and GM12878 cell lines with GEO accession number GSM1480321 was used to analyze nascent transcripts in promoters. POL2 ChIP-seq from the K562 cell line with accession number ENCFF000YWS was used to determine the POL2 occupancy state on CGCG elements. Heatmap plots were generated using the in-house written Wigman software (https://github.com/AminMahpour/Wigman).

Gene Ontology and gene network analysis
The Bedtools closest feature was used to compile a list of genes whose annotated TSSs are less than 500 bp from CGCG elements on both plus and minus strands, based on the latest hg38 GTF annotation file (http://www.ensembl.org/info/data/ftp/index.html). A custom script was written and used to determine the number of CGCG elements in annotated coding, non-coding, uni- and bi-directional CGI promoters.

Gene Ontology (GO) analysis was performed using the GOrilla gene enrichment analysis platform. A list of CpG island-associated genes was used as the background for enrichment analysis [55]. The GO enrichment score is defined as (b/n)/(B/N), where N is the total number of background CpG island-associated genes that have a GO term, B is the number of genes associated with a specified GO term, n is the number of genes whose promoters contain a CGCG element, and b is the number of genes in the intersection. Gene set interaction networks were generated and analyzed using REACTOME package v53 (http://www.reactome.org/). Networks were visualized graphically using Cytoscape software version 3.5 (http://www.cytoscape.org/).
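
The GO enrichment score used here is a ratio of two proportions. The sketch below assumes the form (b/n)/(B/N), which is how GOrilla defines enrichment, with N, B, n and b as described in the text. The function name and the counts in the usage example are hypothetical.

```python
def go_enrichment(N: int, B: int, n: int, b: int) -> float:
    """Enrichment = (b / n) / (B / N).

    N: background genes (CpG island-associated genes with a GO term)
    B: background genes annotated with the GO term of interest
    n: genes whose promoter contains a CGCG element
    b: genes in the intersection (CGCG-containing and annotated with the term)
    """
    if N <= 0 or B <= 0 or n <= 0:
        raise ValueError("N, B and n must be positive")
    if not (0 <= b <= min(B, n)) or B > N or n > N:
        raise ValueError("counts are inconsistent")
    return (b / n) / (B / N)


if __name__ == "__main__":
    # Hypothetical counts: the term covers 5% of the background
    # but 20% of the CGCG-containing set.
    print(go_enrichment(N=1000, B=50, n=100, b=20))
```

A score of 1 means the term is as frequent among CGCG-containing genes as in the background; values above 1 indicate over-representation.
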

Start-seq analysis
Start-seq from mouse bone-marrow-derived macrophages was published previously and is available for download from the GEO website (GSE62151, https://www.ncbi.nlm.nih.gov/geo/). Data were analyzed as described previously. Briefly, reads were aligned uniquely to the mm9 genome allowing a maximum of two mismatches with Bowtie version 0.12.8 (-m1 -v2). Sense and divergent TSSs were assigned as defined previously. Start-seq heat maps depict Start-RNA reads in 10 bp bins at the indicated distances with respect to the TSS. Heatmap plots were generated using Partek Genomics Suite version 6.12.1012.

Non-coding and coding pair 9 2