Elevated pyrimidine dimer formation at distinct genomic bases underlies promoter mutation hotspots in UV-exposed cancers

Sequencing of whole cancer genomes has revealed an abundance of recurrent mutations in gene-regulatory promoter regions, in particular in melanoma where strong mutation hotspots are observed adjacent to ETS-family transcription factor (TF) binding sites. While sometimes interpreted as functional driver events, these mutations are commonly believed to be due to locally inhibited DNA repair. Here, we first show that low-dose UV light induces mutations preferably at a known ETS promoter hotspot in cultured cells even in the absence of global or transcription-coupled nucleotide excision repair (NER). Further, by genome-wide mapping of cyclobutane pyrimidine dimers (CPDs) shortly after UV exposure and thus before DNA repair, we find that ETS-related mutation hotspots exhibit strong increases in CPD formation efficacy in a manner consistent with tumor mutation data at the single-base level. Analysis of a large whole genome cohort illustrates the widespread contribution of this effect to recurrent mutations in melanoma. While inhibited NER underlies a general increase in somatic mutation burden in regulatory elements including ETS sites, our data supports that elevated DNA damage formation at specific genomic bases is at the core of the prominent promoter mutation hotspots seen in skin cancers, thus explaining a key phenomenon in whole-genome cancer analyses.


INTRODUCTION
Whole genome analysis of cancer genomes has the potential to reveal non-coding somatic mutations that drive tumor development, but it remains a major challenge to separate these events from non-functional passengers. The main principle for identifying drivers is recurrence across independent tumors, suggestive of positive selection, which led to the recent identification of frequent oncogenic mutations in the promoter of telomere reverse transcriptase (TERT) that can activate its transcription 1,2 . However, mutation rates vary across the genome, and local elevations may give rise to "false" recurrent events that can be misinterpreted as signals of positive selection. While known covariates of mutation rate, such as replication timing and local trinucleotide context, can be accounted for to improve interpretation 3 , the non-coding genome may be particularly challenging. Mutational fidelity may be generally reduced in this vast and relatively unexplored space, as indicated by the presence of mechanisms directing DNA repair specifically to exonic regions 4 , and yetunexplained mutational phenomena may be at play.
Indeed, recent studies have described a remarkable abundance of recurrent promoter mutations in melanoma and other skin cancers, often noted to overlap with sequences matching the recognition element of ETS family transcription factors (TFs) 5,6,7,8,9,10,11 .
Strikingly, a large proportion of frequently recurring promoter mutations in melanoma occur at distinct cytosines one or two bases upstream of TTCCG elements bound by ETS factors as indicated by ChIP-seq, within a few hundred bases upstream of a transcription start site 12 .
While often interpreted as driver events, we recently showed that these sites exhibit highly elevated vulnerability to UV mutagenesis, as evidenced by their rapid induction following low-dose UV light exposure in cultured cells 12 . The effect has sometimes been attributed to locally impaired nucleotide excision repair (NER) caused by binding of ETS TFs 11,13,14 .
However, our analysis of skin tumors lacking global NER (XPC -/-) contradicted this model 12 and the mechanism remains unclear. An understanding of this phenomenon, which may underlie a large part of all non-coding recurrent events in human tumors beyond TERT 5, 7, 11 , would resolve a key question that continues to confound whole cancer genome analyses.
Here, through analysis of 221 whole tumor genomes, we first demonstrate the widespread impact of TTCCG-related mutagenesis on the mutational landscape of melanoma.
Moreover, through UV exposure of a panel of repair-deficient human cell lines, we rule out inhibited DNA repair as an important mechanistic basis for ETS-related recurrent promoter mutations in UV-related cancers. Finally, we generate the highest resolution map of UVinduced cyclobutane pyrimidine dimers (CPDs) in the human genome to date, which provides clear evidence that ETS-related promoter hotspots instead arise due to an exceptional local elevation in the efficacy of UV lesion formation at specific genomic bases.

Widespread contribution from TTCCG-related sites to recurrent non-coding mutations in 221 melanoma whole genomes
To assess the impact of TTCCG-related mutagenesis on the landscape of recurrent mutations in melanoma in a more sensitive way than previously possible, we assembled a cohort of 221 melanomas characterized by whole genome sequencing by TCGA and ICGC 15,16 . These heavily mutated tumors averaged 110k somatic single nucleotide variants (SNVs) per sample, expectedly dominated by C>T transitions and a mutational signature characteristic of mutagenesis by UV light through formation of pyrimidine dimers (Fig. S1).
Notably, despite the genome-wide scope, nearly all highly recurrent mutations were found near annotated transcription start sites (TSSs) (Fig. 1a). For example, of the 22 most recurrent individual bases (mutated in ≥18 patients), four were known drivers (BRAF, NRAS or TERT promoter mutations) while the rest were at most 524 bp away from a known TSS.
Further, the vast majority of highly recurrent promoter sites were found in conjunction with TTCCG sequences (Fig. 1a-b), indicating a widespread influence from ETS elements to the mutational landscape of melanoma.
Of 51 recurrent promoter mutations (+/-500 bp from TSS) mutated in ≥ 12 tumors, 42 (82%) had a TTCCG element in the immediate (+/-10 bp) sequence context, rising to 86% after excluding the known TERT C228T and C250T promoter mutations ( Fig. 1b and Table   S1) 1,2 . Most were within 200 bp upstream of a known TSS, as expected for functional ETS elements ( Fig. 1b-c) 17 . Among the few remaining sites, two (upstream of AP3D1 and TMEM102) were instead flanked by TTCCT sequences likewise matching the ETS recognition motif (Fig. 1b) 17 and the numbers are thus conservative. The fraction TTCCGrelated sites increased as a function of recurrence, from 291/550 promoter sites (53%) at n ≥ 5 to 7/8 (88%) at n ≥ 20, excluding the known TERT sites (Fig. 1d). For comparison, only 0.60% of C>T mutations in the dataset exhibited TTCCG patterns, underscoring their massive enrichment in recurrent positions.
As noted previously 12 , there was a strong correlation between the number of mutated TTCCG hotspot sites and the total mutational burden in each tumor, compatible with these sites being passive passengers (Spearman's r = 0.94, P = 1.5e-106; Fig. 1e). Also confirming earlier observations, the TTCCG-related promoter hotspots were found preferably near highly expressed genes, as expected under a model where interaction with an ETS TF rather than sequence-intrinsic properties are responsible for elevated mutation rates in these sites (Fig.  1f). Taken together, these analyses clearly demonstrate that ETS-related mutations account for nearly all highly recurrent non-coding hotspots genome-wide in melanoma, as well as hundreds of less recurrent sites not detectable in previous analyses based on smaller cohorts.

TTCCG hotspots show elevated sensitivity to UV mutagenesis in vitro in the absence of repair
Recent studies have shown that NER, the main DNA repair pathway for UV damage, is attenuated in TF binding sites, leading to elevated somatic mutation rates 13,14 . While plausible as a mechanism for TTCCG mutation hotspots 11 , we recently showed that the hotspots appeared to be maintained in skin squamous cell carcinomas lacking global NER (XPC -/-).
We also established that mutations can be easily induced in TTCCG hotspot sites in cell culture by UV light, thus recreating in vitro the process leading to recurrent mutations in tumors 12 . We decided to use the RPL13A -116 bp hotspot site, notably more frequently mutated (58/221 tumors) than both canonical TERT sites and on par with BRAF V600E at 60/221 (Fig. 1ab), as a model to further investigate a possible role for impaired NER.
To this end, we UV-exposed A375 cells with intact NER as well as fibroblasts with homozygous mutations in four key DNA repair components: XPC, required for global NER, ERCC8 (CSA) and ERCC6 (CSB), required for transcription coupled NER (TC-NER), and XPA which is required for lesion verification in both global and TC-NER (Fig. S2). Correct genetic identity and complete homozygosity for the mutant allele was confirmed by wholegenome sequencing of all four mutant cell lines (Table S2). Even limited UV exposure led to high cell mortality in the mutant cell lines, forcing us to limit the exposure to a single low dose of UVB (20 J/m 2 ) during approximately two seconds, after which cells were assessed for RPL13A promoter mutations using error-corrected amplicon sequencing following recovery ( Fig. 2a) 18 . Between 7,332 and 13,774 error-corrected reads at ≥10x oversampling were obtained for each of 10 different libraries (Fig. 2b).
Strikingly, even at this miniscule dose, subclonal somatic mutations appeared preferably at the known hotspot site in A375 cells (Fig. 2c) as well as in all of the mutant cell lines ( Fig.   2e-g), despite abundant possibilities for UV lesion formation in flanking assayed positions.
As expected, absolute mutation frequencies were low, less than 0.5% in all samples, bringing us close to the detection limit in some samples as indicated by noise in the untreated controls ( Fig. 2d,e,g). In combination with earlier data from XPC -/-tumors lacking global NER 12 and the fact that the mutations are almost exclusively positioned upstream of TSSs where TC-NER should not be active (Fig. 1bc), these results argue strongly against impaired NER as a mechanism for TTCCG hotspot formation.

High-resolution mapping of CPDs across the human genome
It was established decades ago that DNA conformational changes induced by interactions with proteins can alter conditions for UV damage formation 19,20 , which prompted us to investigate whether ETS-related promoter hotspots may arise due to locally favorable conditions for UV lesion formation rather than inhibited repair. For this, we adapted a protocol first established in yeast using IonTorrent sequencing 21 to the Illumina platform ( Fig.   3a), to generate a genome-wide map of CPDs in A375 human melanoma cells immediately following UV exposure, before DNA repair processes have had a chance to act.
CPDs were preferably detected at TT, TC, CT and CC dinucleotides as expected (Fig. 3b) and the number of detected CPDs after removal of PCR duplicates scaled nearly linearly with simulated sequencing depth, indicating favorable random representation of CPDs (Fig. 3c). A total of 202.1 million CPDs were mapped to dipyrimidines throughout the genome (Fig. 3c), constituting the highest resolution CPD map to date to our knowledge. Additionally, 95.3 million CPDs were mapped in UV-treated naked (acellular) A375 DNA lacking interacting proteins, while a non-UV-treated control, which expectedly yielded limited material, produced 18.5 million CPDs (Fig. 3c).

CPD formation spikes at TTCCG-related promoter mutation hotspots
We next investigated CPD formation patterns at TTCCG mutation hotspots positions identified above in melanoma ( Fig. 1a-b). 291 recurrently mutated (n ≥ 5/221 melanomas) TTCCG promoter sites (+/-500 bp from TSS) were aligned centered on the mutated base such that CPD density in these regions could be determined. This revealed a striking peak in CPD formation that coincided with the hotspots, which was largely absent in naked DNA lacking bound proteins or in non-UV control DNA (Fig. 4a). Additionally, more recurrently mutated sites showed a stronger CPD signal, compatible with increased CPD formation being the key mechanism ( Fig. 4b).
For a more detailed understanding, we subcategorized the 291 melanoma ETS hotspot sites into four main groups based on sequence and mutated position. The strongest mutation hotspots, such as RPL13A and DPH3, typically occurred at cytosines one of two bases upstream of the TTCCG element ( Fig. 1b and Table S1), which notably is outside of the core motif and therefore not expected to disrupt binding 17 . In CCTTCCG sites (n = 82 unique loci), recurrent C>T transitions would typically appear at both 5ʹ cytosines (underscored) independently or, less frequently, as CC>TT double nucleotide substitutions. Aggregated CPD density overs these sites, centered on the motif, revealed a strong peak bridging these two bases, which notably was absent in naked DNA (Fig. 4c). Thus, when the TF site is occupied, CPDs form efficiently between the two pyrimidines, leading to C>T mutations at either base although with a preference for the second position, in agreement with established models for UV mutagenesis 22 . The same pattern of strongly elevated CPD formation in cellular, but not naked, DNA was observed between the same positions in TCTTCCG and CTTTCCG sites (n = 57 and 27, respectively), with C>T mutations expectedly forming only at the first or second pyrimidine depending on the position of the cytosine (Fig. 4d-e).
Many of the less recurrent bases in melanoma were often found at the first middle cytosine of a TTCCG motif (Table S1). Interestingly, a large fraction of these sites lacked a dipyrimidine at the two key positions identified above thus prohibiting CPD formation there, with ACTTCCG being the most common pattern (44/82 sites), which indeed matches the in vivo ETS consensus sequence 17 . Compatible with the mutation data, the strongest CPD peak was observed at the middle TC dinucleotide, and in agreement with the lower mutation recurrence, this signal was weaker compared to the other site types (Fig. 4f). Of note, elevated CPD formation between these bases could also be clearly seen in the other site categories (Fig. 4c-e). Taken together, these analyses based on genome-wide CPD mapping provide strong evidence that locally elevated CPD formation efficacy underlies the formation of mutation hotspots at ETS binding sites.

Overall elevated mutation rate in regulatory regions is not due to increased CPD formation
Earlier studies have described a general increase in mutation rate in promoter regions, attributed to reduced NER activity at sites of TF binding 13,14 . To investigate a possible contribution from increased CPD formation to this pattern, we first determined the overall mutation rate near TSSs, which confirmed a sharp increase in upstream regions that coincided with reduced NER as determined by XR-Seq ( Fig. 5a-b) 23 . However, aggregated over these regions, CPDs were found to form at near-expected frequencies (Fig. 5c). While individual ETS-related mutation hotspots arise due to elevated CPD formation, impaired NER thus appears to be the main contributor to a general increase in mutation burden in promoters.

DISCUSSION
Proper analysis of recurrent non-coding mutations requires an understanding of how mutations arise and distribute across the genome in the absence of selective pressures. Here, we provide a mechanistic explanation for the passive emergence of recurrent mutations at TTCCG/ETS sites in tumors in response to UV light, and also demonstrate their massive impact on the mutational landscape of melanoma using a large whole genome cohort.
Mutations at -116 bp in the RPL13A promoter were used here as a model to study mutation formation at ETS hotspot sites in vitro in repair-deficient cell lines, which ruled out an important role for inhibited repair. Of note, this site is more recurrently mutated than the individual TERT C228T/C250T sites and nearly as frequent as chr7:140453136 mutations (hg19) pertaining to BRAF V600E, thus representing the second most common mutation in melanoma and likely other skin cancers. Notably, mutations were detectable at this site in cultured cells following a UVB dose of 20 J/m 2 UVB, equivalent to about 1/200th of the monthly absorbed UVB dose in July in Northern Europe 24 . This underscores the extreme UV sensitivity of ETS hotspots and explains their high recurrence in tumors.
Genome-wide mapping of CPDs revealed that TTCCG-related mutation hotspots exhibit highly efficient CPD formation at the two bases immediately 5ʹ of the core TTCC ETS motif.
The effect was lost in naked acellular DNA, showing that structural conditions for elevated CPD formation are induced when the TF binding site is in its protein-bound state.
Interestingly, most functional ETS sites are expected to lack pyrimidines in the two key positions 17 thus prohibiting pyrimidine dimer formation, and conditions for forming a strong mutation hotspot are thus only met in a subset of sites with CC, TC or CT preceding the TTCCG element. Additionally, CPDs form at lower but still elevated frequency at the middle TTCCG bases, consistent with weaker recurrence for mutations in these positions. CPD and cancer genomic data are thus in strong agreement, providing a credible mechanism for the formation of ETS-related mutations hotspots in UV-exposed cancers.
As demonstrated here, frequent mutations at ETS-site hotspots are expected for purely biochemical reasons in UV-exposed cancers. Consequently, several observations are compatible with passenger roles for these mutations: The most recurrent sites arise at cytosines outside of the core TTCC ETS recognition element 17 where they are not expected to disrupt ETS binding. While mutations in the middle of the motif, common among the less frequent hotspots, should disrupt binding, ETS factors tend to be oncogenes that are activated in cancer 25 , and it can be noted that TERT promoter mutations instead enable ETS binding through formation of TTCC elements 1, 2 . The mutations tend to arise near highly expressed housekeeping genes rather than cancer-related genes. Moreover, as would be expected in the absence of selection and in contrast to known driver mutations 12 , the number of mutated ETSsites in a tumor is strongly determined by mutational burden.
Our results complement a recent study by Mao, Brown 26 et al., which was published during the preparation of this manuscript. This study likewise determined CPD formation patterns in ETS binding sites, obtaining results that are in full agreement with ours, and also proposed a structural basis for increased CPD formation in the ETS-DNA complex based on available crystallographic data. While inhibited NER appears to explain the majority of the increased mutation burden in regulatory DNA in UV-exposed cancers, it can thus be concluded with confidence that the strongest individual mutation hotspots arise instead due to base-specific elevations in CPD formation efficacy at ETS TF binding sites.

Whole genome mutation analyses
Whole genome somatic mutation calls from the Australian Melanoma Genome Project (AMGP) cohort 15 were downloaded from the International Cancer Genome Consortium's (ICGC) database 27 . These samples were pooled with whole genome mutation calls from The Cancer Genome Atlas (TCGA) melanoma cohort 16 called as described previously 5 .
Population variants (dbSNP v138) and duplicate samples from the same patient were removed, resulting in a total of 221 tumors.
Gene annotations from GENCODE 28 v19 were used to define TSS positions, encompassing 20,149 and 13,307 uniquely mapped coding genes and lncRNAs, respectively, considering the 5ʹ-most annotated transcripts while disregarding non-coding isoforms for coding genes. Processed RNA-seq data was derived from Ashouri, Sayin 29 .

Ultrasensitive mutation analysis
To detect and quantify mutations we applied SiMSen-Seq (Simple, Multiplexed, PCR-based barcoding of DNA for Sensitive mutation detection using Sequencing) as described in Barcode families with at least 10 reads, where all of the reads were identical (or ≥ 90% for families with >20 reads), were required to compute consensus reads. FastQ files were deposited in the Sequence Read Archive under BioProject ID PRJNA487997.

Genome-wide mapping of cyclobutane pyrimidine dimers
A375 cells were grown in DMEM + 10% FCS + Penicillin/streptomycin (GIBCO) and were treated with 1000 J/m 2 UVC following DNA extraction and DNA from untreated cells was isolated as a control, both in duplicates. Additionally, naked DNA from untreated cells was irradiated with the same dose, to provide an acellular DNA control sample. DNA was extracted with the Blood mini kit (Qiagen). Purified DNA (12 ug) was sheared to 400bp with a Covaris S220 in microtubes using the standard 400 bp shearing protocol. CPD-seq was modified from Mao, Smerdon 21 to adapt it to Illumina sequencing methods using primers described previously in Clausen, Lujan 30 (Table S3)