Discovery of gene regulatory elements through a new bioinformatics analysis of haploid genetic screens

The systematic identification of regulatory elements that control gene expression remains a challenge. Genetic screens that use untargeted mutagenesis have the potential to identify protein-coding genes, non-coding RNAs and regulatory elements, but their analysis has mainly focused on identifying the former two. To identify regulatory elements, we conducted a new bioinformatics analysis of insertional mutagenesis screens interrogating WNT signaling in haploid human cells. We searched for specific patterns of retroviral gene trap integrations (used as mutagens in haploid screens) in short genomic intervals overlapping with introns and regions upstream of genes. We uncovered atypical patterns of gene trap insertions that were not predicted to disrupt coding sequences, but caused changes in the expression of two key regulators of WNT signaling, suggesting the presence of cis-regulatory elements. Our methodology extends the scope of haploid genetic screens by enabling the identification of regulatory elements that control gene expression.


Introduction
Gene-based insertion enrichment analysis 136 To determine which genes were enriched for total GT insertions in the sorted versus the 137 unsorted cells, all insertions in bins annotated with a given gene and its associated promoter as 138 defined above were aggregated separately for the sorted and unsorted cell populations. Thus, the 139 sum of insertions for a specific gene included both sense and antisense insertions that overlapped 140 with the gene's features, including the promoter. For each gene, a p-value for the significance of 141 enrichment was calculated using a one-sided Fisher's exact test run using the "scipy" package 142

Upstream insertion enrichment analysis
This analysis included bins annotated exclusively as promoter and containing at least one 158 GT insertion regardless of orientation in the sorted cells. An FDR-corrected p-value for the 159 significance of insertion enrichment in each of these bins was determined using a one-sided 160 Fisher's exact test (from "scipy" package for Python) comparing the frequency of insertions in 161 the bin for the sorted versus the unsorted cells. Bins were then ranked in ascending order based 162 on FDR-corrected p-value (Figs 2D and 2E, S1 File). was determined using a one-sided Fisher's exact test (from "scipy" package for Python) 172 comparing the frequency of insertions in the bin for the sorted versus the unsorted cells. Bins 173 were ranked in ascending order based on FDR-corrected p-value (Figs 2F and 2G, S1 File). 174 175 BAIMS pipeline code 176 The BAIMS pipeline code used for the bioinformatics analysis is available through 177 All clonal cell lines containing specified GT insertions were isolated as described in the 181 "Isolation of APC KO-2 mutant cell line containing a GT insertion" section of Materials and 182 methods in [3]. 183 The TFAP4 GT  NaCl, 2% NP-40, 0.25% deoxycholate, 0.1% SDS, 1X SIGMAFAST protease inhibitors, 10% 218 glycerol), sonicated in a Bioruptor 300 (Diagenode) 2 x 30 sec in the high setting, centrifuged 10 219 min at 20,000 x g, and the supernatant was recovered. 220 The protein concentration in the supernatant was quantified using the Pierce 234 This analysis was performed as described in the in the previous section. 75 µg of total 235 protein were loaded in duplicate and electrophoresed in a NuPAGE 4-12% Bis-Tris gel. 236 Following the transfer step, the nitrocellulose membrane was cut between the 50 and 75 kDa 237 molecular weight standards and blocked for 1 hour with Odyssey Blocking Buffer. The top 238 membrane was incubated with rabbit anti-LRP6 primary antibody, and the bottom membrane 239 was incubated with mouse anti-ACTIN primary antibody. Membranes were washed with TBST 240 and incubated with IRDye 800CW donkey anti-rabbit IgG and IRDye 800CW donkey anti-241 mouse IgG secondary antibodies, respectively. Membranes were washed with TBST followed by 242 TBS, and imaged using the Li-Cor Odyssey imaging system. Acquisition parameters in the 243 manufacturer's Li-Cor Odyssey Image Studio software were set so as to avoid saturated pixels in 244 the bands of interest, and bands were quantified using manual background subtraction. The 245 integrated intensity for LRP6 was normalized to that for ACTIN in the same lane and the average 246 +/-SD from duplicate lanes was used to represent the data in Fig 4E.  In order to map GT insertions in a way that would enable us to identify regulatory 282 elements, we devised a bioinformatics pipeline that was completely agnostic to the boundaries of 283 annotated genes and simply tracked the number and orientation of GT insertions in short 284 genomic intervals of arbitrary size, defined as "bins" (Fig 1A). We refer to this approach as 285 "Bin-based Analysis of Insertional Mutagenesis Screens", or BAIMS. Sequencing reads adjacent 286 to the location of GT insertions found in sorted (phenotypically selected) and unsorted (control) 287 cells from haploid genetic screens were aligned to the human genome and assigned to the bin was defined according to the GRCh38 assembly of the human genome. Each bin was also 290 annotated with any relevant genetic features it overlapped with (5' untranslated region (5'UTR), 291 coding domain sequence (CDS), intron and 3' untranslated region (3'UTR)), using the RefSeq 292 annotations from the University of California, Santa Cruz Table Browser [7] for the GRCh38 293 assembly of the human genome. We also defined an additional genetic feature, designated 294 "promoter", as the 2000 base pairs (bp) upstream of the TSS of each gene. This region typically 295 includes the minimal promoter but may also contain other cis-regulatory elements. We annotated 296 bins overlapping with this feature accordingly. The relative orientation of any insertion with 297 respect to any feature can therefore be determined, allowing us to observe patterns of sense and 298 antisense GT insertions across features of interest ( Fig 1C). This information can be displayed in 299 a histogram depicting insertions over any genomic region of interest (Fig 1C), providing a high- found in all bins that overlap with the gene (Fig 1D; see Materials and Methods). We refer to this 332 analysis, which produces a significance score for GT enrichment comparable to that of previous 333 analyses [3], as "gene-based insertion enrichment analysis". insertions in bins exclusively annotated as intron (Fig 2A); we refer to this analysis as "antisense 376 intronic insertion enrichment analysis." To identify regulatory elements in regions immediately 377 upstream of genes, we looked for enrichment of both sense and antisense GT insertions in bins 378 exclusively annotated as promoter (Fig 2A); we refer to this analysis as "upstream insertion 379 enrichment analysis." To distinguish features identified in the previous two analyses from the looked for enrichment of gene-inactivating insertions, as defined above (sense and antisense 382 insertions in bins annotated with 5'UTR, CDS or 3'UTR, and sense insertions in bins annotated 383 exclusively as intron; see Fig 2A); we refer to this analysis as "inactivating insertion enrichment 384 analysis." 385

Identification of regulatory elements through the analysis of bins
These three analyses were applied to the data from two screens for positive regulators of 386 signaling following stimulation with WNT3A, henceforth referred to as the WNT positive 387 regulator high and low stringency screens, which differed only in the stringency of selection [3]. 388 In these screens, HAP1 cells harboring a WNT-responsive GFP reporter (hereafter referred to as 389 "WT HAP1-7TGP") were mutagenized with GT retrovirus, treated with WNT3A and sorted by 390 Materials and Methods). WNT3A-induced reporter activation was nearly eliminated in TFAP4 GT 452 cells when compared to WT HAP1-7TGP cells (Fig 3B). Expression of AXIN2 mRNA, a 453 universal target gene of the pathway, following treatment with WNT3A was also reduced 454 substantially in TFAP4 GT cells (Fig 3C). Given its location within the boundaries of the TFAP4 455 gene, we tested whether the antisense GT insertion affected expression of TFAP4 itself. Both 456 TFAP4 mRNA and protein levels were severely reduced in TFAP4 GT cells, explaining the 457 observed defect in pathway activity (Figs 3D and 3E). A higher exposure of the TFAP4 458 immunoblot from TFAP4 GT cells revealed a faint band corresponding to TFAP4 (Fig 3E), 459 indicating that the antisense GT insertion in the first intron of TFAP4 reduced expression of a 460 full-length transcript and protein as opposed to disrupting the coding sequence. upstream of the TSS (Fig 4B and B in S4 Fig). Importantly, this region was located upstream of 472 the annotated LRP6 promoter in Ensembl (Fig 4B). These GT insertion patterns were not 473 observed in the mutagenized but unsorted cells used as a control for the WNT positive regulator 474 screens (Figs 4A and 4B). These results suggested that antisense insertions upstream of LRP6 475 impaired WNT signaling. indeed what we observed when we measured total and cell-surface levels of LRP6 protein. 515 LRP6 GT -1(Up) and LRP6 GT -2(Up) cells exhibited a 75-84% reduction in total LRP6 protein and 516 a 68-71% reduction in cell-surface LRP6 compared to WT cells (Figs 4E and 4F). LRP6 GT -3(Int) 517 cells exhibited greater, >99% and 94% reductions in total and cell-surface LRP6, respectively, 518 compared to WT cells, as would be expected from the disruption of the LRP6 coding sequence 519 caused by the sense GT insertion in the first intron (Figs 4E and 4F).
Unexpectedly, despite the reduction in LRP6 protein observed in LRP6 GT -1(Up) and 521 LRP6 GT -2(Up) cells harboring antisense GT insertions upstream of the LRP6 promoter, we did 522 not observe a corresponding decrease in LRP6 mRNA (Fig 4G). In an important control, LRP6 523 mRNA levels were indeed markedly reduced in LRP6 GT -3(Int) cells carrying a sense intronic GT 524 insertion that disrupts the coding sequence (Fig 4G). These results suggest that antisense GT 525 insertions upstream of LRP6 diminished signaling by an enigmatic mechanism that reduced 526 LRP6 protein levels without altering mRNA levels, rather than by simply disrupting the LRP6 527 promoter. Interestingly, sequence elements with similar properties have been described upstream 528 of promoter elements for heat shock target genes in yeast [9]. 529

530
We developed a new bioinformatics tool to analyze haploid genetic screens with the 531 explicit goal of uncovering regulatory elements. We analyzed screen data in a way that discerned 532 GT insertion patterns distinct from those predicted to disrupt the coding sequence of genes, and 533 found that atypical insertions in introns and regions upstream of the TSS can cause down-534 regulation of genes, leading to the phenotype selected for during the screen. When we applied 535 this new analysis to haploid genetic screens interrogating the WNT pathway, we found that 536 antisense GT insertions in the first intron of TFAP4 and upstream of the LRP6 promoter resulted 537 in marked changes in the expression of these genes. These types of insertions had not been 538 accounted for in previous analyses of haploid genetic screens. 539 The identified GT insertions could disrupt regulatory elements such as promoters, 540 enhancers, antisense transcripts or splicing sequences. In the case of TFAP4, most of the 541 insertions were located in the first intron and overlapped with a strong enhancer signal (Fig 3A), 542 suggesting they may disrupt an enhancer. Previous studies have shown that TFAP4 is directly 543 regulated by c-MYC and that the first intron of TFAP4 in fact contains four c-MYC binding sites 544 [10,11], three of which are encompassed by the bin identified in our antisense intronic insertion 545 enrichment analysis (Figs 2B and 2C). In future studies, it will be important to test whether the 546 antisense insertions found in the first intron of TFAP4 down-regulate TFAP4 mRNA (Fig 3D) by 547 disrupting c-MYC binding or through an alternative mechanism. 548 Similarly, LRP6 protein was down-regulated in the LRP6 GT -1(Up) and LRP6 GT -2(Up) 549 cell lines containing antisense GT insertions upstream of the LRP6 promoter (Figs 4E and 4F). 550 Surprisingly, LRP6 mRNA levels were not reduced in these same cell lines, suggesting a 551 mechanism regulating LRP6 protein levels. In yeast, genomic sequences upstream of genes that 552 have no effect on mRNA levels can instead regulate protein levels [9]. The selective enrichment 553 of antisense versus sense GT insertions in the region upstream of the LRP6 promoter in cells 554 sorted for low WNT reporter fluorescence (Figs 4A and 4B) suggests that such insertions are not 555 merely disrupting an enhancer or promoter. Instead, we speculate that these GT insertions may 556 disrupt an antisense transcript or another directional element residing on the antisense strand that 557 positively regulates LRP6 expression. Since no such elements have been described, it will be 558 important to elucidate the nature of this regulatory mechanism in future studies. 559 Unlike other more focused efforts to identify regulatory regions associated with a given 560 gene or set of genes [12-18], our untargeted approach enables the genome-wide identification of 561 cis-regulatory elements involved in any phenotype that can be probed through a haploid genetic 562 screen. Identification of such elements may not be feasible with RNA interference-based screens 563 because they require that the target genomic sequences be transcribed. CRISPR-based 564 technologies to screen for regulatory modules on a genome scale are still limited by the focused 565 mutagenesis or transcriptional modulation of predetermined sequences in the genome [19][20][21][22]. 566 However, focused CRISPR-based approaches would be powerful tools to precisely delineate any 567 regulatory element found though our analysis. 568 While we found new regulatory elements in two central regulators of WNT signaling, we 569 note that our current study is most likely under-powered to comprehensively detect all regulatory 570 elements in the genome affecting the WNT pathway for several reasons. First, we used deep 571 sequencing datasets from previous screens [3] that were designed to uncover protein coding 572 genes involved in WNT signaling. The sequencing depth used to map insertions in these 573 previous screens was sufficient to saturate the protein-coding genome, but is insufficient to 574 interrogate the much larger non-coding genome. Second, the propensity of the retroviral-based mutagen used in this study to insert near TSSs, promoters and enhancers limited our search for 576 regulatory elements to regions within and adjacent to genes. Our methodology could in principle 577 be extended to identify regulatory elements located anywhere in the genome by using agents that 578 integrate in a truly unbiased manner and then exhaustively mapping insertions in both the 579 selected and unselected cell populations by sequencing at greater depth. Finally, because we 580 assigned bins disregarding gene boundaries, our analysis may have missed regulatory elements 581 in bins that overlapped with both an exon and an intron (such bins would have been excluded 582 from the antisense intronic insertion enrichment analysis), and elements in bins that overlapped 583 with features located both upstream and downstream of the TSS (such bins would have been 584 excluded from the upstream insertion enrichment analysis). Reducing the size of the bins could 585 ameliorate this problem, but at the expense of statistical power to determine the significance of 586 GT insertion enrichment due to a reduction in GT insertions per bin and an increase in the 587 multiple testing correction for a larger number of bins. Alternatively, computing GT insertions in 588 intervals defined by the boundaries of genetic features such as introns or promoters (rather than 589 bins of a predetermined size) could also help this issue, but would limit the analysis to annotated 590 regions of the genome. 591 The analysis described in this work provides an untargeted and genome-scale method to 592 identify both genes and regulatory elements involved in any biological process that can be 593 probed by a haploid genetic screen. We hope that this bioinformatics analysis, available through 594 S1 File. BAIMS output.  Upstream insertions Significance (-log 10 p-value)