TeXP: Deconvolving the effects of pervasive and autonomous transcription of transposable elements

The Long interspersed nuclear element 1 (LINE-1) is a primary source of genetic variation in humans and other mammals. Despite its importance, LINE-1 activity remains difficult to study because of its highly repetitive nature. Here, we developed and validated a method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned to the many LINE-1 instances in the genome into these two categories. Using our method, we evaluated well-established cell lines, cell-line compartments and healthy tissues and found that the vast majority (91.7%) of transcriptome reads overlapping LINE-1 derive from pervasive transcription. We validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance. Next, we applied our method to comprehensively measure LINE-1 activity across healthy somatic cells, while backing out the effect of pervasive transcription. Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. This finding contrasts with earlier studies showing that LINE-1 has limited activity in healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that the amount of LINE-1 activity was associated with the with the amount of cell turnover, with tissues with low cell turnover rates (e.g. the adult central nervous system) showing lower LINE-1 activity. Altogether, our results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome.

humans and other mammals. Despite its importance, LINE-1 activity remains difficult to 23 study because of its highly repetitive nature. Here, we developed and validated a 24 method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability 25 signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription 26 from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned 27 to the many LINE-1 instances in the genome into these two categories. Using our 28 method, we evaluated well-established cell lines, cell-line compartments and healthy 29 tissues and found that the vast majority (91.7%) of transcriptome reads overlapping 30 LINE-1 derive from pervasive transcription. We validated TeXP by independently 31 estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high 32 concordance. Next, we applied our method to comprehensively measure LINE-1 activity 33 across healthy somatic cells, while backing out the effect of pervasive transcription. 34 Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. 35 This finding contrasts with earlier studies showing that LINE-1 has limited activity in 36 healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that 37 the amount of LINE-1 activity was associated with the with the amount of cell turnover, 38 with tissues with low cell turnover rates (e.g. the adult central nervous system) showing 39 lower LINE-1 activity. Altogether, our results show how accounting for pervasive 40 transcription is critical to accurately quantify the activity of highly repetitive regions of the 41 human genome. 42 of the LINE-1 reads appeared to derive from subfamilies that are thought to be 130 autonomously inactive (genomic fossils) for millions of years. As an explanation for this 131 counterintuitive result, we hypothesized that this "genomic-transcriptomic" correlation 132 might be indicative of pervasive transcription. In this model, the stochastic nature of 133 RNA polymerase II transcription would drive the creation of RNA fragments 134 proportionally to the number of copies of LINE-1 subfamilies in the whole genome. 135 We model pervasive transcription as sufficiently broad transcription component that 136 creates a bias in RNA-seq reads overlapping LINE-1 subfamilies. This bias tends to be 137 correlated to the number of instances (or bases) in the reference genome because 138 pervasive transcription is predicted to create small quantities of reads across large 139 portions of the genome. 140 As opposed to the broad pervasive transcription model, under the narrow LINE-1 141 transcription model, we expect that a small number of LINE-1 instances to produce a 142 much higher signal (number of reads) than the lowly expressed instances. Therefore, in 143 order to investigate this issue, we first calculated the number of reads mapping to LINE-144 1 instances. We observed that, the vast majority of LINE-1 instances have a very small 145 number of uniquely mapped reads (S1A Figure and S1B Figure) suggesting that indeed, 146 pervasive transcription is capable of generating small number of transcripts overlapping 147 many LINE-1 instances across the genome. A typical RNA-seq experiment has 148 approximately 35.000 LINE-1 instance with a single read. In fact, we also observed that, 149 on average, 99.94% of LINE-1 instances with at least one read have very small 150 expression levels (RPKM <= 1). 151 To test the amount of signal generated by the top expressed instances and the lowly 152 expressed instances we calculated the ratio of reads overlapping the most 10 153 expressed instances in each RNA-seq experiment and the sum of reads overlapping all 154 instances with less than 10 reads. We find that the summation of the least expressed 155 instances is, on average, one order of magnitude higher than the 10 mostly expressed 156 instances. Together these results suggest that broad pervasive transcription is an 157 important factor when quantifying LINE-1 subfamily transcription level. 158 We then divided samples by their tissues of origin ( Figure 1B) and noticed that some 159 tissues had smaller genomic-transcriptomic correlations, hinting at another confounding 160 signal other than pervasive transcription creating reads overlapping LINE-1 subfamilies. 161 We hypothesized that deviations from a high genomic-transcriptome correlation could 162 be derived from autonomous transcription of the LINE-1 subfamilies (see Methods for 163 details). We then developed a software pipeline, TeXP, that uses mappability signatures 164 from pervasive and simulated LINE-1 subfamilies autonomous transcription to 165 deconvolve reads overlapping LINE-1 elements. TeXP counts the number of reads 166 overlapping recently expanded LINE-1 subfamilies, and calculates the best combination 167 of signatures that explains the observed read counts. Specifically, TeXP regresses the 168 proportion of reads derived from each signal, ensuring sparsity to estimate autonomous 169 transcription of recent LINE-1 subfamilies (L1Hs, L1P1, L1PA2, L1PA3, L1PA4) and 170 remove the effect of pervasive transcription ( Figure 1C).  We benchmarked TeXP by estimating the autonomous transcription of LINE-1 174 subfamilies in RNA sequencing experiments of well-established human cell lines [31]. 175 Figure 2A shows the proportion of reads mapped to LINE-1 subfamilies using a naïve 176 method (left panel) and proportions of reads from each signature using TeXP (right 177 panel). TeXP estimations were also compared to other transposable element 178 quantification pipelines (S2 Figure). In the naïve method (Figure 2A; left panel), 179 cytoplasmic and whole-cell polyadenylated (polyA)+ samples had an enrichment of 180 reads mapping to L1Hs and L1PA2 when compared to whole-cell transcripts without a 181 polyadenylated tail (whole-cell polyA-) and nuclear RNA samples. The enrichment of 182 L1Hs reads was consistent with increased transcription of full-length L1Hs (S3 Figure). 183 The estimates after applying TeXP ( respectively. Figure 2B and S1 Table). To validate the quantification of L1Hs autonomous transcription, we performed droplet 217 digital PCR (ddPCR) to estimate autonomous and pervasive transcription levels on a 218 reference panel of six cell lines: MCF-7, K-562, HeLa, HepG2, SK-MEL-5, and 219 GM12878. For these experiments, we assumed that expression on the 5' end of the 220 L1Hs transcript can be used as an approximation to autonomous transcription due to 221 the large imbalance of 5' truncated and full-length copies.The expression on the 3' end, 222 on the other hand, is an approximation tothe combination of autonomous and pervasive 223 transcription. We initially designed and tested multiple assays targeting different regions 224 of the L1Hs locus and proceeded with the two best performing assays (S2 Table). The 225 first assay targeted ORF1, directly adjacent to the 5'UTR, representing the 5' end of the 226 transcript. The second assay targeted ORF2 about 1.5 kb upstream of the 3' UTR, 227 representing the 3' end of the transcript. We completed the same design process for 228 ORF2 to find the copy numbers of the truncated L1Hs transcripts (i.e., the transcripts 229 missing the 5' end of L1Hs) ( Figure 2C, S3 Table). Since autonomous transcription 230 results in an enrichment full-length transcript of L1Hs, we estimated an approximation to 231 the level of pervasive transcription by subtracting expression of the 5' end (ORF1) from 232 the 3' end (ORF2). 233 234 Figure 2D shows the relative quantification of L1Hs transcripts in these four cell lines 235 using the HPRT1 5' end as a reference. The ddPCR analysis detected 12,600 copies of 236 full-length transcripts/ng in MCF-7 cells. In agreement with our in-silico result, K562 and 237 SK-MEL-5 had 1,512 and 1,708 copies of full-length transcript/ng, respectively. For the 238 GM12878 cell line, we expected to find no autonomous expression of L1Hs; however, 239 our ddPCR assays detected low levels of autonomous transcription of L1Hs (655 copies 240 of full-length transcript/ng; Figure 2D, S3 Table). Overall, the quantification of L1Hs 241 autonomous transcription using ddPCR was highly correlated with the quantification 242 using TeXP (Spearman correlation, rho=0.99, p-value=3.803e-06, S7 Figure). This 243 suggests that TeXP can remove most of the noise derived from pervasive transcription, 244 although it is insensitive to samples with little LINE-1 autonomous transcription (S8 245 Figure). To address TeXP sensitivity in relation to noise we first tested TeXP under an 246 ideal experimental setup. In this simulation, the number of observed reads overlapping 247 L1 subfamilies is a simple combination of known proportions of these signals (i.e. 248 pervasive and autonomous transcription). For example, we simulate read counts where 249 30% of the reads derive from LIHs autonomous transcription and the remaining 70% 250 derive from pervasive transcription. We use simulated read counts as input to TeXP and 251 calculate the root mean square error (rmse) between the known proportions and 252 estimated proportions. As observed in S15 Figure Table). Similar to the cell lines, we found that L1Hs 274 was autonomously transcribed; L1P1, L1PA2, L1AP3, and L1PA4 only had residual or 275 spurious autonomous transcription in healthy tissues (S9 Figure). Furthermore, we 276 found that pervasive transcription was the major signal in most RNA sequencing 277 datasets, accounting for 91.7%, on average, of the reads overlapping LINE-1 instances 278 (S10 Figure and S14 Figure). Overall, healthy tissues had a narrower range of L1Hs 279 autonomous transcription levels than cell lines, with the peak transcription level of 47 280 RPKM ( Figure 2E) versus 180 RPKM in the cell lines (S1 Table). We found no or very While most (74.6%) of the whole blood samples had no transcriptional activity of L1Hs, 300 only one sample from skin had an L1Hs autonomous transcription level below 1 RPKM. 301 We further selected patients with both primary and EBV-transformed cell lines to assess 302 whether the EBV transformation could change L1Hs autonomous transcription. We 303 found that both skin cells and lymphocytes had a drastic down-regulation of L1Hs 304 autonomous transcription (S12 Figure). This finding suggests that EBV-transformed cell 305 lines partially preserve the L1Hs transcription level from their tissue of origin, potentially 306 explaining why fibroblast-derived induced pluripotent stem cells support higher levels of 307

LINE-1 retrotransposition [37]. 308
Human solid tumors display increased levels of LINE-1 activity [38][39][40]. In order to 309 assess if this increase is a result from pervasive transcription or an increase in 310 autonomous transcription in LINE-1 we run TeXP in RNA-seq from healthy thyroid 311 samples and solid thyroid tumors. We than calculate the distribution of number of reads 312 deriving from pervasive and autonomous transcription. We first observed that there 313 indeed a significant difference in the number of reads from LINE-1 elements (S13 314 Figure). We also found that compared to healthy thyroid samples, tumor samples regions (T-test basal ganglia vs. all other brain tissues, t = -7.0943; p value = 9.867e-12 334 - Figure 2E); importantly, these levels were still low compared to other tissues. Other 335 tissues with low cell turnover rates, such as liver, pancreas, and spleen, also showed 336 very little or no autonomous transcription of L1Hs (91.2%, 82.9%, 88.9% of samples, 337 respectively, had a L1Hs RPKM < 1 - Figure 2E). Conversely, germinative tissues have 338 been proposed to support somatic activity of L1Hs elements [45]. Our results suggest 339 that this trend is more general, and most tissues associated with the reproductive 340 system sustain higher levels of L1Hs autonomous transcription ( Figure 2E). In addition, 341 we found that the tissues with the highest levels of L1Hs autonomous transcription were 342 enriched for high cell turnover; these included the nerve (tibia), skin (both exposed and 343 not exposed to the sun), prostate, lung, and vagina ( Figure 2E). and can be a confounding factor in some quantifications, for example, we observed that 357 over 75% of reads mapping to ancient LINE-1 subfamilies derive from inactive 358 subfamilies (Figure 2A and S14 Figure). We validated many aspects of TeXP using ddPCR probes designed to quantify 375 pervasive and autonomous transcription of L1Hs across human cell lines. These results 376 show that our method lacks the sensitivity to measure autonomous transcription. In is the observed number of reads mapping to L1Hs, T is the total number of 409 reads mapped to L1 instances, defines the proportion of L1 bases in the genome 410 annotated as L1Hs, is the percentage of reads emanating from pervasive 411 We selected these five LINE-1 subfamilies based on the rates of cross-mappability of 418 simulated data (S3 Figure). In particular, we are interested in removing the effect of 419 pervasive transcription from the estimates of L1Hs autonomous transcription. We 420 simulated reads from L1Hs transcripts we observed that most (>90%) of the reads 421 emanating from L1Hs autonomous transcription map to the L1Hs, L1PA2, L1PA3, 422 L1PA4 and L1P1 subfamilies. Older subfamilies such as L1PA5 and L1PA6, for 423 example, correspond to less than approximately 5% of L1Hs cross-mappable reads. 424

425
In contrast to L1Hs which only a fourth of the reads map back to L1Hs, older elements 426 such as L1PA5 and L1PA6 have a much higher self-mapping rates, 60% and 70% 427 respectively. Therefore, these subfamilies should be less affected by confounding 428 factors deriving from pervasive transcription. 429 The number of reads mapped to each subfamily ܱ is measured by analyzing paired-430 end or single-end RNA sequencing experiments independently. TeXP extracts basic 431 information from fastq raw files such as read length and quality encoding. Fastq files are 432 filtered to remove homopolymer reads and low quality reads using in-house scripts and 433   On the other had mappability fingerprints, which represents how reads deriving from 448 LINE-1 transcripts would be mapped to the genome, are created by aligning simulated 449 L1Hs (the most recent -and supposedly active L1) maps back to loci annotated as 472 L1Hs. While older subfamilies such as L1PA4, have a higher proportion of reads 473 mapping back to its instances (~70% -S3 Figure).

503
More ancient elements such as DNA transposons and LINE-2 have been shown to be 504 primarily transcribed pervasively, hitchhiking the transcription of nearby autonomously 505 transcribed regions [32]. Therefore, we tested whether our estimation of L1Hs 506 transcription level correlated with genes containing or adjacent to L1Hs instances. We 507 found no significant difference between the correlation distribution of a random set of 508 genes and genes with L1Hs in exons or introns or within 3kb upstream or 3kb 509 downstream of L1Hs. This finding indicates that our estimation of L1Hs autonomous 510 transcription is not significantly influenced by non-autonomous L1Hs transcription 511 adjacent or contained by protein-coding genes' loci. Furthermore, we tested if and 512 enrichment of pervasive transcription deriving from intronic regions would create a 513 background signal distinct from the pervasive transcription derived from a whole 514 genome model. We correlated the number of LINE-1 instances from each subfamily in 515 intergenic and intronic regions based on GENCODE v29 and found a statistically 516 significant correlation between the number of instances in both regions (Spearman 517 corr=0.979057, p-value < 2.2e-16 -S17 Figure).

542
Droplet Digital PCR (ddPCR) System (Bio-Rad Laboratories) was utilized to quantify the 543 L1Hs transcript expression in the cell lines described above. Since L1Hs is a highly 544 repetitive and heterogeneous target, we had initially designed and tested a panel of 545 primers and probes that targeted the 5' untranslated region (5'UTR), the open reading 546 frame 1 (ORF1), the open reading frame 2 (ORF2), and the 3' untranslated region 547 (3'UTR) of the L1Hs locus, respectively. After a pilot screening study, we selected the 548 two assays covering ORF1 and ORF2, which not only exhibited overall better 549 performance, but also could help us to distinguish autonomous and pervasive L1Hs 550 transcriptions. We also designed two reference assays on the housekeeping gene 551 HPRT1, which targeted the 5' and 3' ends of the transcript, respectively (S2 Table). All 552 the ddPCR primers and probes were designed based on the human genome reference 553 hg19 (GRCh37) and synthesized by IDT (Integrated DNA Technologies, Inc. Coralville, 554 Iowa, USA). 555 The ddPCR reactions were performed according to the protocol provided by the 556 manufacturer. Briefly, 10ng DNA template was mixed with the PCR Mastermix, primers, 557 and probes to a final volume of 20 μ L, followed by mixing with 60 μ L of droplet 558 generation oil to generate the droplet by the Bio-Rad QX200 Droplet Generator. After 559 the droplets were generated, they were transferred into a 96-well PCR plate and then 560 heat-sealed with a foil seal. PCR amplification was performed using a C1000 Touch 561 thermal cycler and once completed, the 96-well PCR plate was loaded on the QX200 562 Droplet Reader. All ddPCR assays performed in this study included two normal human 563 controls (NA12878 and NA10851) and two mouse controls (NSG and XFED/X3T3) as 564 well as a no-template control (NTC, no DNA template). All samples and controls were 565 run in duplicates. Data was analyzed utilizing the QuantaSoft™ analysis software 566 provided by the manufacturer (Bio-Rad). Data were presented in copies of transcript/μL 567 format which was mathematically normalized to copies of transcript/ng to allow for 568 comparison between cell lines. 569 570 Reference house-keeping gene (HPRT1)

571
We designed two assays targeting the 5' and 3' ends of the HPRT1 transcript, 572 respectively, and used as the reference controls in this study (S3 Table). The reference 573 gene expression level was found to be constant within each cell line, but varied between 574 cell lines. In addition, while 4 of the 6 cell lines had similar 5' and 3' end expression, 575 K562 and GM12878 both had increased 3' end expression. This could be from different 576 isoforms being expressed with different frequencies 3 . For the 5' end expression of 577 HPRT, SK-MEL-5, GM12878, and HepG2 were all around 600 copies of transcript/ng. 578 The remaining were all around 1200 copies of transcript/ng. When looking at the 3' end 579 expression, we found that SK-MEL-5 and HepG2 were around 750 copies of 580 transcript/ng, while MCF-7, GM12878, and HeLa were around 1350 copies of 581 transcript/ng, and K562 was close to 1800 copies of transcript/ng. The slight difference 582 between the 5' end and the 3' end expression levels in the same cell line could be 583 explained by a potential 3' end bias in the cDNA synthesis. However, all the reference 584 assays were consistent between experiments and did not affect the target expression.