• Loading metrics

Analysis of 7,815 cancer exomes reveals associations between mutational processes and somatic driver mutations

  • Rebecca C. Poulos ,

    Contributed equally to this work with: Rebecca C. Poulos, Yuen T. Wong

    Roles Conceptualization, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliations Prince of Wales Clinical School and Lowy Cancer Research Centre, Faculty of Medicine, UNSW Sydney, NSW, Australia, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, NSW, Australia

  • Yuen T. Wong ,

    Contributed equally to this work with: Rebecca C. Poulos, Yuen T. Wong

    Roles Formal analysis, Writing – original draft

    Affiliation Prince of Wales Clinical School and Lowy Cancer Research Centre, Faculty of Medicine, UNSW Sydney, NSW, Australia

  • Regina Ryan,

    Roles Formal analysis

    Affiliation Prince of Wales Clinical School and Lowy Cancer Research Centre, Faculty of Medicine, UNSW Sydney, NSW, Australia

  • Herbert Pang,

    Roles Methodology

    Affiliation School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China

  • Jason W. H. Wong

    Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Prince of Wales Clinical School and Lowy Cancer Research Centre, Faculty of Medicine, UNSW Sydney, NSW, Australia, School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China

Analysis of 7,815 cancer exomes reveals associations between mutational processes and somatic driver mutations

  • Rebecca C. Poulos, 
  • Yuen T. Wong, 
  • Regina Ryan, 
  • Herbert Pang, 
  • Jason W. H. Wong

This is an uncorrected proof.


Driver mutations are the genetic variants responsible for oncogenesis, but how specific somatic mutational events arise in cells remains poorly understood. Mutational signatures derive from the frequency of mutated trinucleotides in a given cancer sample, and they provide an avenue for investigating the underlying mutational processes that operate in cancer. Here we analyse somatic mutations from 7,815 cancer exomes from The Cancer Genome Atlas (TCGA) across 26 cancer types. We curate a list of 50 known cancer driver mutations by analysing recurrence in our cohort and annotations of known cancer-associated genes from the Cancer Gene Census, IntOGen database and Cancer Genome Interpreter. We then use these datasets to perform binary univariate logistic regression and establish the statistical relationship between individual driver mutations and known mutational signatures across different cancer types. Our analysis led to the identification of 39 significant associations between driver mutations and mutational signatures (P < 0.004, with a false discovery rate of < 5%). We first validate our methodology by establishing statistical links for known and novel associations between driver mutations and the mutational signature arising from Polymerase Epsilon proofreading deficiency. We then examine associations between driver mutations and mutational signatures for AID/APOBEC enzyme activity and deficient mismatch repair. We also identify negative associations (odds ratio < 1) between mutational signatures and driver mutations, and here we examine the role of aging and cigarette smoke mutagenesis in the generation of driver mutations in IDH1 and KRAS in brain cancers and lung adenocarcinomas respectively. Our study provides statistical foundations for hypothesised links between otherwise independent biological processes and we uncover previously unexplored relationships between driver mutations and mutagenic processes during cancer development. These associations give insights into how cancers acquire advantageous mutations and can provide direction to guide further mechanistic studies into cancer pathogenesis.

Author summary

Cancer develops when cells acquire somatic driver mutations that confer a growth advantage. The origins underlying the development of many of these mutations remain largely unknown. Mutational signatures represent the frequency of different somatic mutations across a genome and can be used to characterise the mutational processes that have operated over time within an individual cancer. In this study, we use mutational signatures as a tool to identify associations between mutational processes and cancer-causing mutations known as drivers. We hypothesised that in some cases a driver mutation would be the underlying cause of an associated signature. In other cases, the altered trinucleotide preferences arising from a signature would have increased the likelihood of the associated driver mutation arising. We determine which scenario is most likely to be the case by examining the trinucleotide context of each driver mutation. Here we identify 39 significant associations using a cohort of 7,815 cancer exomes. We examine known and novel associations between driver mutations and mutational signatures arising from processes such as defective proofreading during DNA replication, AID/APOBEC enzyme-associated mutagenesis and deficient mismatch repair. Our study explores important relationships that can inform our understanding of the complex pathogenic history associated with cancer development.


Cancer occurs following the accumulation of somatic mutations within cellular DNA [1]. Somatic mutations can arise as a result of exposure to external DNA damaging agents, or as a consequence of internal errors in DNA replication or repair [2]. Cells undergo malignant transformation following the acquisition of a subset of somatic mutations, termed driver mutations [3]. Driver mutations confer a growth advantage to cells, and subsequently undergo positive selection in a population. Driver mutations typically affect certain cancer-associated genes by, for example, activating an oncogene or inactivating a tumour suppressor gene. Research in recent years has led to the identification of hundreds of driver mutations in cancer-associated genes, but only a handful of driver mutations are sufficient for oncogenesis in a single cancer sample [4]. Driver mutations form in the cancer genome alongside potentially hundreds of thousands of passenger mutations [3]. Passenger mutations are not directly involved in cancer progression and do not confer a selective advantage.

One method developed to characterise the spectrum of mutations that have accumulated in an individual cancer genome is to determine which mutational signatures are present in that sample’s DNA. Mutational signatures are displayed according to six substitution types (C>A, C>G, C>T, T>A, T>C and T>G) in the context of all trinucleotide combinations, thus representing each of the 96 possible mutation frequencies. Thirty mutational signatures have been curated following Alexandrov et al [5] at the ‘Signatures of Mutational Processes in Human Cancer’ website, hosted by the Catalogue of Somatic Mutations in Cancer (COSMIC) database [6, 7]. These signatures reveal which mutational processes have been operative in a cancer genome, as these processes leave a characteristic imprint on the mutational profile [5, 8]. The underlying aetiology of many of these mutational signatures has already been defined. Some signatures have been attributed to defective DNA replication or repair, and others to particular exogenous or endogenous mutagenic processes [5, 8, 9]. The final mutational landscape in any cancer genome will be determined by the strength and exposure duration of each mutational process, which will result in a unique combination of mutational signatures [5, 8].

A recent study suggested that approximately two-thirds of mutations in human cancers arise due to errors in DNA replication occurring over time [10]. Even cancers that have a strong environmental component therefore still harbour mutations incurred by unavoidable DNA replication errors [10]. Relatively little research has focused on investigating which specific driver mutations typically arise due to factors associated with the environment, heritability or DNA replication. We therefore undertook this study to determine the association between common driver mutations and distinct mutational processes in human cancer. Many mutational processes operating in cellular DNA alter the frequencies of mutation accumulation at certain trinucleotide contexts, thus resulting in these definable mutational signatures [5]. We hypothesised that these mutational preferences would increase the likelihood of certain driver mutations arising in the genome, revealing new insights into cancer pathogenesis. We analysed somatic mutations accumulating in the exomes of 7,815 cancer samples from The Cancer Genome Atlas (TCGA), across 26 cancer types. Combining 50 driver mutations with 30 known mutational signatures, we observed 39 significant associations at a false discovery rate (FDR) of < 5%. Many of these associations have not previously been explored, and we reveal relationships that improve our understanding of cancer development. Our findings have the potential to inform new avenues of research into cancer treatment and prevention.


Landscape of mutational signatures and driver mutations

Our curation and filtering criteria used to select samples for downstream analysis are detailed in the Methods (see also S1 Fig). We selected a cohort of 7,815 samples across 26 cancer types for analysis (Tables 1 and S1). These samples harboured a median of 100 single nucleotide somatic variants per exome.

Table 1. Cancer types, abbreviations and cohort sizes for the whole-exome sequenced samples analysed in this study.

We identified mutational signatures in these samples by using the Sigfit [11] R package. We found that mutational signatures in our cohort were comparable with signatures previously defined in similar cohorts [5] (S2 Fig). For example, Signature 1 exhibits clock-like properties, and the number of mutations in this signature correlates with age across a majority of cancer types [5, 12]. Signature 1 was the most common mutational signature that we identified, contributing an average of 6.21% toward all mutational signatures measured across our cohort. Considering a mutational signature to be present if it contributes > 5% toward the entire mutational load of a sample, we found signature 1 to be present in 35% of all samples (n = 2,750). Signatures 2 and 13 were the next most common mutational signatures that we identified (5.80% and 6.19%, respectively). These signatures are associated with the activity of activation-induced cytosine deaminase (AID) and apolipoprotein B mRNA editing enzyme catalytic polypeptide-like (APOBEC) enzymes [5, 13]. We found signature 2 in 26% (n = 2,016 samples) and signature 13 in 30% (n = 2,343 samples) of all samples examined.

By investigating recurrence of mutations in known cancer driver genes (see Methods and S1 Fig), we selected 50 driver mutations for potential association with mutational signatures (S2 Table). These mutations alter 21 different genes, with TP53 (n = 10), KRAS (n = 7), PIK3CA (n = 4) and PTEN (n = 4) harbouring at total of 50% of all of the driver mutations selected. We next defined the landscape of somatic driver mutations across the cancer samples in our cohort (Fig 1). Focusing on the most common driver mutations, we found ten mutations present in > 10% of samples in at least one cancer type. These mutations include BRAF p.V600E, which was also the most common driver mutation in our cohort (n = 287). The BRAF p.V600E mutation was most frequent in skin cutaneous melanoma and thyroid carcinoma, affecting 43% (n = 193) and 56% (n = 29) of samples respectively. Of the other frequent mutations, four mutations altered KRAS amino acid G12, primarily affecting colorectal, lung and pancreatic adenocarcinomas. PIK3CA (p.E545K and p.H1047R) and TP53 (p.R248Q and p.R273C) harboured two highly frequent mutations each, and IDH1 p.R132H was most frequent in brain lower grade glioma (57%, n = 116). Taken together, we found these driver mutations to generally represent known frequencies in other cancer cohorts [3, 14, 15].

Fig 1. Heat map depicting the frequency of driver mutations within each cancer type analysed.

Mutation frequency within each cancer type ranges from light red (0%) to dark red (≥ 10% of samples), with driver genes indicated on the y-axis. Cancer types are clustered across the x-axis. See Table 1 for the full cancer type name that corresponds to each of the abbreviations listed on the x-axis.

Significant associations between mutational signatures and driver mutations

After quantifying the landscape of driver mutations and mutational signatures amongst the cancer samples in our cohort, we next sought to investigate whether any driver mutations were significantly associated with the presence or contribution of any mutational signatures. To exclude potentially spurious associations, we only examined associations in which ≥ 10 samples in a given cancer type harboured the driver mutation of interest, and ≥ 10 samples in that same cancer type harboured the signature of interest at a frequency of ≥ 20%. Subsequently, we tested 411 associations for statistical significance (S3 Table), examining a total of 15 cancer types and 13 mutational signatures across the 50 driver mutations. We performed a binary univariate logistic regression analysis to test each association and defined significance at P < 0.004 (FDR < 5%; see Methods and S3 Fig). After reciprocal negative associations were excluded (see Methods), we found 39 significant associations between driver mutations and mutational signatures (Table 2 and summary in Fig 2). These associations arose across 11 cancer types, affecting 9 mutational signatures and 18 driver mutations from 11 different genes (Table 2). We found the highest numbers of significant associations in uterine corpus endometrial carcinoma (n = 13) and colorectal adenocarcinoma (n = 6; Table 2). Of the total of 39 significant associations that we observed, two were significant negative associations (odds ratio < 1; Table 2). These negative associations arose between signature 1 and IDH1 p.R132H in brain lower grade glioma and glioblastoma (Table 2).

Fig 2. Significant associations between driver mutations and mutational signatures within cancer types.

Bars are coloured according to cancer type. Negative associations (odds ratio < 1) are indicated with a hash (#). See Table 1 for the full cancer type name that corresponds to each of the abbreviations listed in the legend.

Table 2. Significant associations between mutational signatures and driver mutations within individual cancer types, alongside proposed aetiology and frequency of the trinucleotide context in the associated mutational signature.

We validated a subset of our findings using an independent whole-exome sequenced cancer cohort of 619 colorectal cancer samples published in [16]. Using somatic mutations from this cohort, we examined the statistical relationship between mutational signatures and driver mutations for each of the six associations that we found to be significant in colorectal cancer in our existing analysis using TCGA data (Table 2). Validating these colorectal cancer results from our current study with this independent cohort, we found 5 of the 6 tested associations to be significant at P = 0.0072 or below (S4 Table).

Acquisition of driver mutations associated with signature 10

To validate our methodology, we first investigated the six significant associations that we observed between driver mutations and signature 10 across uterine corpus endometrial carcinoma and colorectal adenocarcinoma (Table 2). Here, existing literature allows us to validate many of the patterns of causality inferred from our regression results. Signature 10 arises in cancers which harbour Polymerase epsilon (POLE) exonuclease domain mutations. In our study, we found the POLE p.P286R (c.857C>G) mutation to be significantly associated with signature 10 in uterine corpus endometrial carcinoma (Table 2). The trinucleotide context of this mutation (C[C>G]T) is not frequently observed in signature 10 (Fig 3A), supporting existing literature that demonstrates the POLE p.P286R mutation to underlie many instances of the presence of signature 10 in cancer [5, 17]. The remaining driver mutations that we found to be significantly associated with signature 10 (PIK3CA p.R88Q, PTEN p.R130Q, ARID1A p.R1989* and TP53 p.R213*) all occur in a T[C>T]G context (Table 2). This trinucleotide context is frequently mutated in signature 10 (Fig 3A), suggesting that these driver mutations likely arose as a direct result of exposure to the mutagenic processes underlying this signature. In fact, in colorectal cancer, TP53 p.R213* mutations have been suggested to arise in response to POLE exonuclease domain mutation, where DNA methylation at this CpG trinucleotide may further enhance the likelihood of mutation occurrence [18]. Endometrial cancers with POLE exonuclease domain mutations have also been shown to exhibit a high prevalence of TP53, ARID1A, PTEN and PIK3CA mutations [19]. By analysing the results of our study in this way, we can demonstrate a likely pathway for oncogenesis via driver mutation accumulation associated with signature 10. This pattern is demonstrated schematically in Fig 3B, and it serves as an example for the associations that we subsequently explore in this study.

Fig 3. Trinucleotide context of driver mutations that are significantly associated with mutational signatures, and schematic depicting proposed mutation accumulation associated with signature 10.

(A) The trinucleotide contexts of significantly associated driver mutation are indicated by numbering on each signature plot. Mutational signature images were generated using the Sigfit [11] R package. Negative associations (odds ratio < 1) are indicated with a hash (#). (B) Schematic diagram depicting the proposed mechanism of driver mutation accumulation in Polymerase Epsilon (POLE) exonuclease domain mutated cancers harbouring signature 10.

AID/APOBEC-associated mutagenesis associated with signatures 2 and 13

We found that a striking 36% (n = 14) of the associations that we identified arose between driver mutations in PIK3CA and signatures 2 or 13 across six different cancer types. (Table 2). Signatures 2 and 13 are attributed to the action of the AID/APOBEC family of cytidine deaminases [5, 13]. APOBEC activity has been implicated in the generation of specific PIK3CA mutations at p.E542K (c.1624G>A) and p.E545K (c.1633G>A) [20], and we identified associations with both mutations in our study. Both mutations occur in the T[C>T]A context that is frequently mutated in signatures 2 and 13 (Table 2 and Fig 3A). In fact, both of these mutations match the extended context of the [T/C]TC[A/G] motif which has been established for APOBEC3A binding in single-stranded DNA (ssDNA) [21]. In addition to the PIK3CA mutations, we found signatures 2 and 13 to be associated with ERBB2 p.S310F (c.929C>T) mutation in bladder cancer, and signature 13 to be associated with PPP2R1A p.P179R (c.536C>G) mutation in uterine corpus endometrial carcinoma (Table 2). The broader contexts of these mutations, T[C>T]C and C[C>G]C respectively, do not match well with the typical APOBEC3A/B mutational context. However, APOBEC enzyme mutagenesis has also been shown to preferentially accumulate on ssDNA associated with lagging strand replication, and with transcription bubbles [2224]. In addition, biophysical studies have shown that APOBEC enzymes also preferentially bind cytosines that are within ssDNA stem-loops [21, 25]. Using ssDNA folding predictions (see Methods), we find that both of the cytosines mutated in the ERBB2 and PPP2R1A drivers are predicted to be located at stem-loops, and that these loops are greater than three bases in size (S4 Fig). Loops with a size greater than three bases have been shown to aid APOBEC enzyme binding [25], with the APOBEC3A binding site requiring bent ssDNA [21]. Confirmation of whether and how APOBEC3A/B binds and mutates these DNA sequences will need to be experimentally validated. However, the association with ssDNA and potential stem loop formation provides one possible explanation that may account for the association of signatures 2 and 13 with these mutations.

Acquisition of driver mutations associated with signatures for mismatch repair deficiency

We found 13 significant associations arising between driver mutations and mutational signatures implicated in mismatch repair deficiency (signatures 6, 14, 20 and 26 [5, 9]; Table 2). These associations arise in stomach and colorectal adenocarcinoma and uterine corpus endometrial carcinoma, and they involve eight different driver mutations (Table 2). Interestingly, only three of these eight mutations occur in trinucleotide contexts that frequently appear in any of the mismatch repair deficiency signatures (Table 2). Among the remaining five mutations is BRAF p.V600E, which is significantly associated with signatures 6 and 26 in colorectal adenocarcinoma (Table 2). Signatures 6 and 26 are the only mutational signatures with significant associations that represent mismatch repair deficiency alone [5, 9]. The mechanism underlying the association between BRAF p.V600E and mismatch repair deficiency has not yet been established to our knowledge. It is possible that this association arises because acquisition of a BRAF p.V600E mutation predisposes otherwise normal cells to developing mismatch repair deficiency, though this hypothesis requires further investigation. In support of this hypothesis, BRAF p.V600E mutations do occur much less commonly in hereditary nonpolyposis colorectal cancers [26]–cancers which frequently arise due to germline mismatch repair defects. Our results suggest that many driver mutations in cancers with mismatch repair deficiencies may arise independently from, or prior to, loss of mismatch repair. However, we note that our analyses cannot eliminate the possibility that we have observed certain significant associations because loss of mismatch repair increases the overall mutation rate in a given cancer genome, regardless of trinucleotide mutation context. Therefore, despite these mutations less frequently arising as a result of mismatch repair defects due to their specific trinucleotide contexts, they may become significantly associated either due to an altered rate of mutation accumulation in mismatch repair deficient cancers or because they confer an overwhelming selective advantage whenever they do occur.

Significant negative associations and inference to causality

Our analyses identified two significant negative associations (odds ratio < 1) where we found no reciprocal positive associations (see Methods). These negative associations arose between the IDH1 p.R132H driver mutation and signature 1, occurring in brain lower grade glioma and in glioblastoma multiforme (Table 2). Demonstrating this negative association, we observed a significantly lower proportion of signature 1 mutations in IDH1 p.R132H mutant rather than wild-type brain lower grade glioma (P < 0.0001) and glioblastoma multiforme (P < 0.001; Fig 4A) by two-sided Mann Whitney U-Test. Signature 1 is ubiquitous amongst all cancer types and results from a common mutational process associated with increasing age [5, 12]. Consistent with our observed association, we found that patients with IDH1 p.R132H mutated tumours in our cohort were generally younger than patients with IDH1 p.R132H wild-type tumours. This association was significant in glioblastoma multiforme (P < 0.0001) and was approaching significance in brain lower grade glioma (P = 0.052 by unpaired t-test; Fig 4B). IDH1 mutations have been found to less commonly occur in older people with glioblastoma [27, 28], and the results of our mutational signature analyses provide molecular support for this finding in glioblastoma multiforme and brain lower grade glioma. While age would increase the likelihood of any mutation arising by chance alone, our results suggest that age might disproportionately favour the occurrence of mutations other than IDH1 p.R132H in these brain cancers, or that this mutation confers a greater selective advantage in younger people. In older people, brain cancers may develop via a pathway that is activated by other mutational events that are age- or replication-associated.

Fig 4. Associations of IDH1 p.R132H driver mutation with mutational signature 1 and with age at diagnosis of people with brain cancers.

(A) Proportion of mutations attributable to signature 1 and (B) age at diagnosis of people with tumours that are wild-type and mutant for IDH1 p.R132H in brain lower grade glioma (LGG; left) and glioblastoma multiforme (GBM; right). Box plots show median and quartiles, with significance by two-sided Mann Whitney U-Test in (A) and unpaired t-test in (B). **** denotes P < 0.0001 and *** denotes P < 0.001.

In our analysis, we noted a negative association between signature 4 and KRAS p.G12D (c.35G>A) in lung adenocarcinoma where P = 0.0075 (S3 Table), which is just above our threshold for defining significance (P < 0.004). KRAS p.G12D transition mutations are the most common KRAS somatic mutation arising in the lung adenocarcinomas of people who have never smoked [29]. Further, people who have never smoked are more likely to have KRAS p.G12D mutations in their lung cancers than are people who have smoked [30]. Signature 4 is associated with exposure to the mutagens in cigarette smoke [5], and for these reasons we examined this observed association in more detail. We found that the association we observed by regression was significant at P < 0.05 by two-sided Mann Whitney U-Test (Fig 5A). Interestingly, in lung adenocarcinomas from people with a recorded history of smoking, we observed no significant difference between the number of pack years smoked by those with and without KRAS p.G12D mutations (P = 0.831 by unpaired t-test; Fig 5B). Our results suggest that increasing proportions of signature 4 mutations do not significantly impact on the likelihood of a lung adenocarcinoma acquiring specifically a KRAS p.G12D mutation.

Fig 5. Association of KRAS p.G12D driver mutation with mutational signature 4 and with pack years smoked by people with lung adenocarcinoma.

(A) Proportion of mutations attributable to signature 4 in lung adenocarcinoma (LUAD) tumours that are wild-type and mutant for KRAS p.G12D. (B) Number of pack years smoked by people with LUAD who have a previous history of smoking, in tumours that are wild-type and mutant for KRAS p.G12D. Box plots show median and quartiles, with significance by two-sided Mann Whitney U-Test in (A) and unpaired t-test in (B). * denotes P < 0.05.


Many driver mutations arise from accelerated replication-associated errors in cancer

It has been suggested that approximately two-thirds of mutations in cancer arise from DNA replication errors [10]. The results of our study suggest that the altered trinucleotide mutation preferences conferred by replication-associated mutational signatures can skew the odds in favour of the acquisition of certain driver mutations (see Fig 3B). 36 of the 37 significant positive associations (odds ratio > 1) that we found in our study are between driver mutations and mutational signatures 2 and 13 (deamination by APOBEC enzymes, which can occur during replication), 6 and 26 (deficient mismatch repair during replication), 10 (proofreading deficiency by POLE during replication) and 14 and 20 (deficient mismatch repair and proofreading by POLE). 24 of the 36 associations (66%) arising across cancer types occur in cases where the trinucleotide context of the driver mutation frequently arises in the associated signature, implying a possibly direct causal relationship between defective DNA replication and occurrence of that driver mutation. In order for these mutational signatures to become apparent in a genome however, normal replication must have become disrupted, such that the process has accelerated within cancer cells to generate that mutational signature. Some of the mutational signatures that we analyse in this study can arise due to existing germline defects (for example, germline mutations in mismatch repair genes or replicative polymerases) [31]. Similarly, it is possible that somatic events that are causal in generating a replication-associated mutational signature arise as a result of exogenous environmental mutagens (for example, APOBEC enzyme activity could be altered following viral infection [20]). In these circumstances, it could be argued that the hereditary or environmental defect was ultimately responsible for the generation of the specific driver mutations that arose due to the defective DNA replication. Thus, the acquisition of many common driver mutations may be the result of a complex pathogenic history originally arising from non-replication associated effects.

Associations between BRAF mutations and mutational signatures in cancer

Ultraviolet (UV) radiation has long been epidemiologically associated with the development of melanoma. Mutagenesis associated with signature 7 is directly related to exposure of cells to UV light [5]. We observed a significant association in our study between BRAF p.V600M (c.1798G>A) and signature 7 in skin cutaneous melanoma (Table 2). While the A[C>T]T trinucleotide context of the BRAF p.V600M mutation is infrequent (0.4%) within signature 7, C>T transition mutations within a pyrimidine dimer context do generally characterise this signature [5]. Of note, BRAF p.V600E mutations more commonly arise than p.V600M mutations in melanomas [32]. The BRAF p.V600E (c.1799T>A) mutation is not a characteristic C>T transition and various models have been proposed for how such mutations may result from exposure to UV radiation [33, 34]. Interestingly, we did not find a significant association between signature 7 and BRAF p.V600E in our study (P = 0.7086, S3 Table). Melanomas arising on skin without chronic sun-induced damage often harbour BRAF p.V600E mutations, while those arising on skin with chronic sun-induced damage typically harbour other BRAF mutations [35]. Additionally, BRAF p.V600E mutations are commonly found in tumours from non-sun-exposed tissues such as thyroid and colorectal cancers, demonstrating that this mutation can arise following mutagenic processes other than UV radiation exposure. In some melanomas, and particularly those with a low contribution from signature 7, we suggest that BRAF p.V600E mutations may also arise independently from UV radiation-associated mutagenesis. We note the possibility though, that some BRAF p.V600E mutations do arise as a result of mutagenesis following UV radiation exposure, and that this mutation could then confer a particularly strong selective advantage over other pyrimidine dimer-associated mutations, accounting for its observed recurrence in melanoma.

Relationship to recent published study

During the preparation of our manuscript we became aware of the publication of a similar analysis [36]. In this study, Temko et al made use of TCGA exome sequencing data, together with data from the International Cancer Genome Consortium (ICGC) for cancer driver and mutational signature analysis. The authors used one-sided Mann Whitney U-Tests to examine a selection of potentially correlated driver mutations and mutational signatures, discovering a total of 56 unique significant associations across two analysis approaches. Our study applied different filtering criteria to those of Temko et al when selecting which samples and driver mutations to examine. Hence, in our study only 19 of those 56 associations were tested, of which we found 16 of these 19 tested associations to be significant in our cohort. Conversely, 23 of the 39 significant associations in our study were not identified by Temko et al. Further examination of their analysis reveals that they did not identify these associations due to the association not being tested (8 associations), Temko et al specifically searching for positive associations (2 associations) and different samples and signatures being used when testing for an association (13 associations). Of the latter 13 associations, 11 remained significant at P < 0.05 and 5 remained significant at P < 0.004 when we conducted a one-sided Mann Whitney U-Test using our cohort (results of one-sided Mann Whitney U-Test shown for all associations in S3 Table). Thus, the major differences in the two studies can be accounted for by the different statistical and biological assumptions that we incorporated into our analyses, as well as by considering different sample set and mutational signature identification approaches. Despite this, the similarities in the results attained by both studies provides validation for our collective findings. The differing criteria applied in selection of samples, driver mutations, biological and statistical assumptions ultimately leads to complementary results which collectively provide a more complete list of true significant associations.


In summary, we performed binary univariate logistic regression analyses to establish a statistical relationship between driver mutations and mutational signatures in 7,815 cancer samples across 26 cancer types. Our analysis led to the identification of 39 significant associations between driver mutations and mutational signatures (P < 0.004, FDR < 5%). Our study provides statistical foundations for hypothesised links between otherwise independent biological processes and explores relationships arising between driver mutations and mutagenic processes during cancer formation. These associations provide new insights into how some cancers acquire advantageous mutations and can provide direction to guide further mechanistic studies into mutational processes and cancer development.


Ethics statement

This study was approved by the University of New South Wales (UNSW) Human Research Ethics Advisory Panel (approval no. HC17187). This study analysed data generated by TCGA, which were collect from patients with written informed consent (

Data sources

Somatic mutations from whole exome sequencing data (generated by MuTect [37] and aligned to the GRCh38 reference genome) were downloaded from TCGA via the Genomic Data Commons (GDC) data portal [38] for 10,539 cancer samples. These mutations were used to define mutational signatures and to determine which driver mutations to incorporate into regression analyses. S1 Fig presents a flow chart that accompanies this methods description.

Age at diagnosis [“age_at_initial_pathologic_diagnosis”] and number of pack years smoked [“number_pack_years_smoked”] were obtained for relevant cancer types from the UCSC Xena Browser [39]. Samples with no data recorded in these fields were excluded from plots at Figs 4 and 5.

Detection of mutational signatures

To select which samples to include in regression analyses, we excluded any mutations that were annotated by MuTect [37] as present in a ‘panel of normals’, and then kept only cancer samples that harboured ≥ 30 single nucleotide somatic variants in cancer types with ≥ 40 samples. Next, where multiple samples existed for a single patient, we randomly selected only one sample per patient to retain, and we merged TCGA-COAD and TCGA-READ into a single colorectal adenocarcinoma (CRC) cancer type. Our final sample list contained 7,815 samples from 26 cancer types (Tables 1 and S1). We then applied Sigfit (version 1.2.0) [11] R package to determine the proportion of mutations attributable to each of the 30 mutational signatures from the COSMIC [6, 7] ‘Signatures of Mutational Processes in Human Cancer’ database. Sigfit was run in the ‘fit_signatures’ mode using exome-normalised COSMIC signatures and default parameters. To validate the accuracy of our signature generation by Sigfit, we also generated mutational signatures using the DeconstructSigs [40] R package. DeconstructSigs was run using COSMIC signatures, with normalisation via ‘exome2genome’ and a maximum of five signatures allowed per sample. We compared mutational signature proportions for each sample, as generated by both Sigfit and DeconstructSigs, for the nine signatures for which we identified significant associations with driver mutations (S5 Fig). We determined the Pearson’s correlation for each signature, finding an average Pearson’s correlation of 0.84 across the nine mutational signatures.

Selection of driver mutations

To ensure that we did not erroneously exclude any true driver mutations, we used MuTect [37] mutation calls without the ‘panel of normals’ filter when selecting driver mutations. We first selected only missense and stop-gain variants that were present in > 3.5% of samples in at least one cancer type (where TCGA-COAD and TCGA-READ were considered collectively as CRC). Next, we retained only mutations that altered genes listed in the COSMIC ‘Cancer Gene Census’ (Tier 1; retrieved 24 November 2017) [6, 7, 41]. Using only the 7,815 samples described above, we then selected mutations that were present in > 10 samples in at least one cancer type, resulting in a list of 34 driver mutations. We next obtained additional driver mutations from the ‘Catalog of driver mutations’ curated by Integrative Onco Genomics (IntOGen) [42] (retrieved 24 November 2017). We excluded any insertions and deletions as these cannot be directly associated with standard trinucleotide mutational contexts, and any splice site variants as these may not be uniformly captured by exome sequencing. We then selected only mutations present in ≥ 5 samples within the IntOGen database, and > 10 samples from at least one cancer type from our TCGA cohort. This analysis resulted in a list of 47 driver mutations. We merged these two lists of mutations and analysed each using the Cancer Genome Interpreter [43]. We removed any mutations that were not designated as being a tumour driver by the Cancer Genome Interpreter [43]. Our analysis produced a final list of 50 unique driver mutations affecting 21 different genes (S2 Table). The trinucleotide contexts of each driver mutation were obtained using BEDTools [44].

Regression analyses

We applied a binary univariate logistic regression model using the glm package in R to evaluate associations between mutational signatures and driver mutations within each cancer type. The regression model used the following formula, where represents the proportion of mutations attributed to a given mutational signature in a sample, represents the probability that a given driver mutation is present or absent in that sample and β values denote estimates from logistic regression:

The odds ratio was calculated by exponentiating the β1 coefficient estimated from the logistic regression model. To limit false discoveries and spurious significant associations arising from limited sample size, we did not test any associations between mutational signatures and driver mutations in cancer types with < 10 samples harbouring either the relevant driver mutation or < 10 samples harbouring ≥ 20% of mutations attributable to that mutational signature. To determine the P-value required for FDR < 5%, we performed a randomisation test by randomising driver mutations across samples. For each association for which we tested significance, we randomly shuffled the presence of the driver mutation across samples within that cancer type and then conducted our regression modelling again. We repeated this process across 1,000 iterations, and calculated the mean number of significant associations that we identified (S3A Fig). We note that the P-values obtained from our regression analyses differ from those obtained following randomisation, with the former including a number of associations that satisfy the alternative hypothesis (S3B Fig). At P < 0.004, we found that the FDR from these randomised iterations was < 5%. We observed that a significant positive association (odds ratio > 1) between a driver mutation and mutational signature was often accompanied by reciprocal negative associations (odds ratio < 1) between the driver mutation and other mutational signatures tested in that cancer type. Many of these associations are likely to arise due to positively associated signatures necessarily reducing the percentage contribution of all other mutational signatures present in affected samples. For this reason, we excluded any reciprocal negative associations when counting significant associations arising in our regression modelling using actual data, and in our randomisation analysis when establishing FDR.

For validation of a subset of our findings, we obtained single nucleotide somatic mutations for an independent cohort of 619 whole-exome sequenced colorectal cancers from a previously published study [16]. We detected mutational signatures using Sigfit and identified significant associations by logistic regression as described for the TCGA cohort.

Prediction of ssDNA secondary structure

DNA sequences ± 20 bp of the mutation site in ERBB2 p.S310F and PPP2R1A p.P179R were obtained from the UCSC genome browser. This choice of sequence length was based on an approximation of the size of the remaining single stranded lagging strand template prior to the completion of an Okazaki fragment [45]. To estimate the ssDNA secondary structure, the mFold tool (version 3.6) [46] was used with default parameters.

Supporting information

S1 Table. Cancer samples analysed in this study, along with cancer type and designation of use in regression analysis.


S2 Table. Driver mutations analysed in this study, curated as described in methods and S1 Fig.


S3 Table. Associations that were tested for significance, together with results from binary univariate logistic regression and one-sided Mann Whitney U-Test.


S4 Table. Results of regression analysis using an independent cohort of 619 whole-exome sequenced colorectal cancers.


S1 Fig. Flow chart depicting the methodology applied to identify significant associations between mutational signatures and cancer driver mutations in this study.

See Methods for further details.


S2 Fig. Heat map depicting the proportion of each mutational signature arising within cancers analysed.

The proportion of mutations attributable to each mutational signature within individual samples for each cancer type, ranging from light red (0%) to dark red (≥ 30% of mutations attributable to mutational signature). Mutational signatures are clustered across the y-axis. Cancer types are named and coloured along the x-axis. See Table 1 for the full cancer type name corresponding to each of the abbreviations.


S3 Fig. Randomisation analysis for false discovery rate and significance evaluation.

(A) Proportion of significant results at P < 0.004 obtained from 1,000 iterations of randomly shuffled driver mutations within each cancer type. Bars indicate the proportion from 1,000 iterations that each count of significant associations was observed (see Methods), with the number found using actual data indicated by a dotted line. (B) Frequency at which P-values were observed from binary univariate logistic regression of 411 associations using actual (red) and one instance of randomly shuffled mutations (blue).


S4 Fig. Predicted single stranded DNA (ssDNA) secondary structure for selected driver mutations.

Predicted ssDNA secondary structure shown for (A) PPP2R1A p.P179R and (B) ERBB2 p.S310F mutations. The mutated base is denoted by the mutation label. Predictions were made using the mFold prediction tool [46] with default parameters.


S5 Fig. Mutational signatures generated by Sigfit and DeconstructSigs.

The proportion of mutations attributed to each mutational signature is shown for Sigfit (y-axis) and DeconstructSigs (x-axis), where dots indicate individual samples. The Pearson’s correlation (r) is indicated for each signature on individual plots.


S1 Data. The data underlying the generation of all figures in the manuscript is available as a supplementary file.



The authors thank D. Temko and T. Graham for making their datasets available, and the TCGA who have made their data public for analysis.


  1. 1. Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194: 23–8. pmid:959840
  2. 2. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458: 719–24. pmid:19360079
  3. 3. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013;339: 1546–58. pmid:23539594
  4. 4. Martincorena I, Raine KM, Gerstung M, Dawson KJ, Haase K, Van Loo P, et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 2017;171: 1029–41. pmid:29056346
  5. 5. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500: 415–21. pmid:23945592
  6. 6. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, et al. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2015;43: D805–11. pmid:25355519
  7. 7. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011;39: D945–50. pmid:20952405
  8. 8. Alexandrov Ludmil B, Nik-Zainal S, Wedge David C, Campbell Peter J, Stratton Michael R. Deciphering Signatures of Mutational Processes Operative in Human Cancer. Cell reports. 2013;3: 246–59. pmid:23318258
  9. 9. Haradhvala NJ, Kim J, Maruvka YE, Polak P, Rosebrock D, Livitz D, et al. Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair. Nature Communications. 2018;9: 1746. pmid:29717118
  10. 10. Tomasetti C, Li L, Vogelstein B. Stem cell divisions, somatic mutations, cancer etiology, and cancer prevention. Science. 2017;355: 1330. pmid:28336671
  11. 11. Gori K, Baez-Ortega A. sigfit: flexible Bayesian inference of mutational signatures. bioRxiv. 2018.
  12. 12. Alexandrov LB, Jones PH, Wedge DC, Sale JE, Campbell PJ, Nik-Zainal S, et al. Clock-like mutational processes in human somatic cells. Nat Genet. 2015;47: 1402–7. pmid:26551669
  13. 13. Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149: 979–93. pmid:22608084
  14. 14. Cohen AL, Holmen SL, Colman H. IDH1 and IDH2 mutations in gliomas. Current neurology and neuroscience reports. 2013;13: 345. pmid:23532369
  15. 15. Xing M. BRAF mutation in thyroid cancer. Endocrine-related cancer. 2005;12: 245–62. pmid:15947100
  16. 16. Giannakis M, Mu Xinmeng J, Shukla Sachet A, Qian Zhi R, Cohen O, Nishihara R et al. Genomic Correlates of Immune-Cell Infiltrates in Colorectal Carcinoma. Cell Reports. 2016;15: 857–65. pmid:27149842
  17. 17. Kane DP, Shcherbakova PV. A common cancer-associated DNA polymerase ε mutation causes an exceptionally strong mutator phenotype, indicating fidelity defects distinct from loss of proofreading. Cancer research. 2014;74: 1895–901. pmid:24525744
  18. 18. Poulos RC, Olivier J, Wong JWH. The interaction between cytosine methylation and processes of DNA replication and repair shape the mutational landscape of cancer genomes. Nucleic Acids Res. 2017;45: 7786–95. pmid:28531315
  19. 19. Hussein YR, Weigelt B, Levine DA, Schoolmeester JK, Dao LN, Balzer BL, et al. Clinicopathological analysis of endometrial carcinomas harboring somatic POLE exonuclease domain mutations. Modern pathology: an official journal of the United States and Canadian Academy of Pathology, Inc. 2015;28: 505–14.
  20. 20. Henderson S, Chakravarthy A, Su X, Boshoff C, Fenton TR. APOBEC-mediated cytosine deamination links PIK3CA helical domain mutations to human papillomavirus-driven tumor development. Cell Rep. 2014;7: 1833–41. pmid:24910434
  21. 21. Silvas TV, Hou S, Myint W, Nalivaika E, Somasundaran M, Kelch BA, et al. Substrate sequence selectivity of APOBEC3A implicates intra-DNA interactions. Scientific reports. 2018;8: 7511–. pmid:29760455
  22. 22. Hoopes JI, Cortez LM, Mertz TM, Malc EP, Mieczkowski PA, Roberts SA. APOBEC3A and APOBEC3B Preferentially Deaminate the Lagging Strand Template during DNA Replication. Cell Rep. 2016;14: 1273–82. pmid:26832400
  23. 23. Seplyarskiy VB, Soldatov RA, Popadin KY, Antonarakis SE, Bazykin GA, Nikolaev SI. APOBEC-induced mutations in human cancers are strongly enriched on the lagging DNA strand during replication. Genome Res. 2016;26: 174–82. pmid:26755635
  24. 24. Lada AG, Kliver SF, Dhar A, Polev DE, Masharsky AE, Rogozin IB, et al. Disruption of Transcriptional Coactivator Sub1 Leads to Genome-Wide Re-distribution of Clustered Mutations Induced by APOBEC in Active Yeast Genes. PLOS Genetics. 2015;11: e1005217. pmid:25941824
  25. 25. Holtz CM, Sadler HA, Mansky LM. APOBEC3G cytosine deamination hotspots are defined by both sequence context and single-stranded DNA secondary structure. Nucleic Acids Res. 2013;41: 6139–48. pmid:23620282
  26. 26. Deng G, Bell I, Crawley S, Gum J, Terdiman JP, Allen BA, et al. BRAF mutation is frequently present in sporadic colorectal cancer with methylated hMLH1, but not in hereditary nonpolyposis colorectal cancer. Clinical cancer research: an official journal of the American Association for Cancer Research. 2004;10: 191–5.
  27. 27. Ichimura K, Pearson DM, Kocialkowski S, Bäcklund LM, Chan R, Jones DTW, et al. IDH1 mutations are present in the majority of common adult gliomas but rare in primary glioblastomas. Neuro-Oncology. 2009;11: 341–7. pmid:19435942
  28. 28. Parsons DW, Jones S, Zhang X, Lin JC, Leary RJ, Angenendt P, et al. An integrated genomic analysis of human glioblastoma multiforme. Science. 2008;321: 1807–12. pmid:18772396
  29. 29. Paik PK, Johnson ML, D’Angelo SP, Sima CS, Ang D, Dogan S, et al. Driver Mutations Determine Survival in Smokers and Never Smokers with Stage IIIB/IV Lung Adenocarcinomas. Cancer. 2012;118: 5840–7. pmid:22605530
  30. 30. Dogan S, Shen R, Ang DC, Johnson ML, D’Angelo SP, Paik PK, et al. Molecular Epidemiology of EGFR and KRAS Mutations in 3026 Lung Adenocarcinomas: Higher Susceptibility of Women to Smoking-related KRAS-mutant Cancers. Clinical cancer research: an official journal of the American Association for Cancer Research. 2012;18: 6169–77.
  31. 31. Jansen AML, van Wezel T, van den Akker BEWM, Ventayol Garcia M, Ruano D, Tops CMJ, et al. Combined mismatch repair and POLE/POLD1 defects explain unresolved suspected Lynch syndrome cancers. European Journal of Human Genetics. 2016;24: 1089–92. pmid:26648449
  32. 32. Heinzerling L, Kühnapfel S, Meckbach D, Baiter M, Kaempgen E, Keikavoussi P, et al. Rare BRAF mutations in melanoma patients: implications for molecular testing in clinical practice. British Journal Of Cancer. 2013;108: 2164. pmid:23579220
  33. 33. Supek F, Lehner B. Clustered Mutation Signatures Reveal that Error-Prone DNA Repair Targets Mutations to Active Genes. Cell. 2017;170: 534–47.e23. pmid:28753428
  34. 34. Besaratinia A, Pfeifer GP. Sunlight ultraviolet irradiation and BRAF V600 mutagenesis in human melanoma. Hum Mutat. 2008;29: 983–91. pmid:18421705
  35. 35. Shain AH, Bastian BC. From melanocytes to melanomas. Nat Rev Cancer. 2016;16: 345–58. pmid:27125352
  36. 36. Temko D, Tomlinson IPM, Severini S, Schuster-Böckler B, Graham TA. The effects of mutational processes and selection on driver mutations across cancer types. Nature Communications. 2018;9: 1857. pmid:29748584
  37. 37. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology. 2013;31: 213. pmid:23396013
  38. 38. Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, et al. Toward a Shared Vision for Cancer Genomic Data. N Engl J Med. 2016;375: 1109–12. pmid:27653561
  39. 39. Goldman M, Craft B, Brooks AN, Zhu J, Haussler D. The UCSC Xena Platform for cancer genomics data visualization and interpretation. bioRxiv. 2018.
  40. 40. Rosenthal R, McGranahan N, Herrero J, Taylor BS, Swanton C. deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biol. 2016;17: 31. pmid:26899170
  41. 41. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, et al. A census of human cancer genes. Nature reviews Cancer. 2004;4: 177–83. pmid:14993899
  42. 42. Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Tamborero D, Schroeder MP, Jene-Sanz A, et al. IntOGen-mutations identifies cancer drivers across tumor types. Nat Meth. 2013;10: 1081–2.
  43. 43. Tamborero D, Rubio-Perez C, Deu-Pons J, Schroeder MP, Vivancos A, Rovira A, et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Medicine. 2018;10: 25. pmid:29592813
  44. 44. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26: 841–2. pmid:20110278
  45. 45. Balakrishnan L, Bambara RA. Eukaryotic Lagging Strand DNA Replication Employs a Multi-pathway Mechanism That Protects Genome Integrity. Journal of Biological Chemistry. 2011;286: 6865–70. pmid:21177245
  46. 46. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31: 3406–15. pmid:12824337