Guanine Holes Are Prominent Targets for Mutation in Cancer and Inherited Disease

Single base substitutions constitute the most frequent type of human gene mutation and are a leading cause of cancer and inherited disease. These alterations occur non-randomly in DNA, being strongly influenced by the local nucleotide sequence context. However, the molecular mechanisms underlying such sequence context-dependent mutagenesis are not fully understood. Using bioinformatics, computational and molecular modeling analyses, we have determined the frequencies of mutation at G•C bp in the context of all 64 5′-NGNN-3′ motifs that contain the mutation at the second position. Twenty-four datasets were employed, comprising >530,000 somatic single base substitutions from 21 cancer genomes, >77,000 germline single-base substitutions causing or associated with human inherited disease and 16.7 million benign germline single-nucleotide variants. In several cancer types, the number of mutated motifs correlated both with the free energies of base stacking and the energies required for abstracting an electron from the target guanines (ionization potentials). Similar correlations were also evident for the pathological missense and nonsense germline mutations, but only when the target guanines were located on the non-transcribed DNA strand. Likewise, pathogenic splicing mutations predominantly affected positions in which a purine was located on the non-transcribed DNA strand. Novel candidate driver mutations and tissue-specific mutational patterns were also identified in the cancer datasets. We conclude that electron transfer reactions within the DNA molecule contribute to sequence context-dependent mutagenesis, involving both somatic driver and passenger mutations in cancer, as well as germline alterations causing or associated with inherited disease.


Introduction
At least fifteen cancer genome sequencing projects were reported between 2007 and 2011 [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15], and this number is now increasing very rapidly. These studies have been critical for addressing mechanisms of somatic mutation, such as those associated with single base substitutions (SBSs), which not only represent the vast majority of lesions in most patients, but also (in the case of some driver mutations) alter gene function, thereby initiating tumor development. Such investigations have demonstrated that SBSs do not occur randomly throughout the genome. Indeed, frequent CRT transitions have been noted at CpG dinucleotides [4,9,[16][17][18], which are attributable to the high rate of spontaneous 5-methylcytosine ( 5m C) deamination at methylated 5m CpG sites [19,20]. In individuals with a history of exposure to cigarette smoke or radio/chemotherapy, high proportions of GRT transversions, GRA and GRC substitutions at GpA and CpG dinucleotides, and ART and ARG substitutions at TpA dinucleotides have also been reported [6,8,9,21], suggestive of DNA damage through exogenous mechanisms [22,23]. Likewise, large numbers of CRT transitions at YpC (Y = C/T) dinucleotides in melanoma in sun-exposed areas of the skin [11,21,24] have been attributed to cyclobutane pyrimidine dimer (CPD) formation following UV photoexcitation [25]. For less common types of substitution, such as TRC at ApT dinucleotides in hepatocellular carcinoma [16], underlying mutational mechanisms have yet to be proposed.
Studies aimed at identifying the mechanisms underlying the sequence context dependency of SBSs observed in inherited disease [26], cancer and phylogenetic analyses [20,[27][28][29][30][31] are few in number. A recent analysis of breast cancer genomes identified five types of trinucleotide motif enriched in SBSs, all of which contained either a CpG or a GpA motif [18]. Substitutions at CpG were attributed to 5m CpG deamination, whereas mutations at the GpA motif, which displayed sporadic clustering, were linked to enzymatic deamination of C at TpC by TC-specific cytosine deaminases [18]. Cluster analyses in other types of cancer led Roberts et al. to propose a similar mechanism for mutations at GpA sequences [32]. Indeed, recent work has identified APOBEC3B as a likely enzymatic source of CRT transitions in breast cancer [33]. In melanoma, Krauthammer et al. reported an enrichment of mutations at C in the context of 59-TTTCGT-39 motifs, a finding which was attributed to energy transfer along the pyrimidine-rich strand upon UV exposure [34]. Thus, although the influence of flanking bases on SBSs appears to extend beyond dinucleotide units, substantial gaps remain in our understanding of the mutational mechanisms involved. Elucidating these mechanisms is crucial, not only because they provide critical information on the earliest steps of cancer-associated mutagenesis, but also because they may account for inter-individual genetic variation as well as somatic age-related changes within the same individual.
Herein, we analyzed the frequencies of mutation at GNC bp in the context of all possible 4-bp 59-NGNN-39 units from .530,000 SBSs representing 21 cancer genomes, .77,000 germline mutations causing or associated with human inherited disease and 16.7 million benign germline single nucleotide variants (SNVs). The 64 combinations of 59-NGNN-39 motifs provided a suitable set size that was not too large to hamper sequence representation while doubling the length of base interactions relative to the commonly employed dinucleotide sequences [35]. In several cancer mutation datasets, but also in the germline mutations, the frequencies of substitutions correlated with the free energy of base stacking along the G-containing strand, as well as with ionization potentials, i.e. the energy required for abstracting an electron from the target guanines. Such behavior is consistent with an electron transfer mechanism as a consequence of one-electron oxidation reactions between the DNA molecule and radical species in the cell [22,36,37]. We conclude that electron transfer contributes to sequence context-dependent SBSs, not only in the context of cancer genomes but also in pathogenic germline mutations.

SBSs Occur Preferentially at GNC Base-Pairs (bp)
We collected the publicly available data from cancer genome studies reported in PubMed between 2007 and 2011, together with the 5 largest datasets from the International Cancer Genome Consortium (ICGC) ( Table 1). Twenty-one datasets, 13 from exome-wide (EWS) and 8 from genome-wide (GWS) sequencing scans, together represented cancers from 14 different tissues comprising 1,149 patient samples (1 to 393 samples per dataset) and 2 cell lines, for a total of 533,482 SBSs (Table S1). Additional SBSs from 3 other datasets were included in the analysis: a subset of nonsense and missense germline mutations of pathological significance in the context of inherited disease derived from the Human Gene Mutation Database (HGMD); a subset of splicing mutations from HGMD causing human inherited disease; and the set of SNVs from the ''1000 Genomes Project'' included in dbSNP build 129 (''rs'' set), to allow direct comparison of the cancerassociated somatic mutations with germline mutations and polymorphisms present in the general population. The median fractions of SBSs that occurred at GNC bp were 0.78 (mean 6 SD, 0.7560.11) for the EWS datasets and 0.56 (mean 6 SD, 0.6060.13) for the GWS datasets (Fig. 1A), significantly higher than the average GC-content exome-wide (0.55) [38] and genomewide (0.41), respectively [39] (both P values were ,2.2610 216 , the smallest computable number by implementation of the binomial exact test in R). Thus, SBSs occurred more frequently at GNC bp, as compared to ANT bp, than expected by chance alone, both in cancer genomes and in the germline as a cause of inherited disease.

SBS Frequencies Are Modulated by Flanking Base Composition
We addressed the sequence-dependent occurrence of SBSs at GNC bp by retrieving the 59-NGNN-39 and their complementary 59-NNCN-39 sequences (henceforth referred to as NGNN), either genome-wide or exome-wide, and calculating the fractions of mutated motifs, f(NGNN), for each of the 64 sequence combinations. There were two confounding factors in computing f(NGNN): the first related to the fact that only certain portions of the human genome may be effectively mapped by nextgeneration sequencing; the second originating from the various methodologies used during base-variant mapping and calling (see Materials and Methods). Therefore, we first assessed the relative representation of NGNN sequences in the homologous portions (Segmental Duplications, repetitive elements and simple repeats) as compared with the unique portions of the human genome (Text S1 and Table S1). These analyses indicated that CGNN motifs are significantly overrepresented in Segmental Duplications (Text S1 and Table S2). We then used the genomic mutation sites (Table  S3) to compute f(NGNN) according to three methods (Duke35, CRG50 and T_hg19) for the GWS datasets and three methods (AgilentV2, CGR50_exons and T_exons) for the EWS datasets (Text S1), and used these fractions to assess the extent to which base composition at positions 1, 3 and 4 (P1, P3 and P4) would influence GNC mutations at position 2 (P2) for different classes of NGNN sequences. Mutations were observed at each NGNN sequence combination with the exception of the two small colorectal cancer datasets (Table 1; Table S4, Panel A). Thus, for these two datasets, z-tests, rather than t-tests, were used to assess statistical significance. Irrespective of the mapping method used, f(CGNN) mean values were significantly greater than f(DGNN) (D = A/G/T) mean values in all cancer (2-11 fold, depending on cancer type with gastric and colorectal cancers

Author Summary
A large number of DNA mutations identified in cells from patients with cancer or human inherited disease were analyzed to address a fundamental issue in human pathology, viz, the mutational mechanisms that cause irreversible changes to DNA. By using bioinformatics and computational methods, we found that mutations do not occur randomly, but instead affect specific bases, most often guanines flanked by other guanines or adenines. We attribute this effect to electron transfer, a chemical reaction known to underlie basic biological processes such as cellular respiration and photosynthesis. Certain types of carcinogens, oxidants or radiation can interact with DNA and abstract an electron. Our results imply that the ensuing sites of electron loss can migrate from their original position in the DNA to neighboring guanines where they become trapped, leading to further chemical modifications that may eventually result in mutations. Many of the mutations known to be important for tumor growth (driver mutations), as well as passenger mutations and mutations associated with inherited disease, appear to be caused by electron transfer. Beyond pathological mutations, electron transfer may represent a universal mechanism by which genetic changes occur in all life forms to drive population fitness over evolutionary time. displaying the largest differences) and germline mutation datasets ( (P-value ,0.04). Again, in no cases were P-values contradictory based upon the mapping method used (Table S4, Panels B-D). The data for the melanomas were particularly striking since P3-A increased the fractions of mutation at P2-G by ,10-fold relative to DGBN (B = C/G/T) ( 23 ) in Melanoma_gws and Melanoma_ews, respectively, although a mutagenic role for CpG methylation was not apparent (i.e. the CGBN and DGBN fractions were indistinguishable, Pvalues ,0.6-0.7). In addition, for the two melanoma datasets, P4-A significantly increased (2.260.2 fold) mutation at P2-G (Fig. 1B) when P2 and P4 were separated by a purine (P-value 5.6610 27 ). Additional analyses in four melanoma datasets [17,21,34,40] confirmed this finding (ratio for the eight possible NGRA/NGRB groups in these four datasets was 2.260.5 and 2.260.3 for the combined six datasets, Table S6). The increase in mutation at P2-G, relative to P4-C and P4-T, was also observed for P4-G; however, this effect was less consistent than P4-A and was observed more frequently when P3 was occupied by a guanine (18/23 cases) rather than an adenine (5/23 cases), Table S6.
In summary, SBSs at P2-G were dependent upon the sequence composition of the 39-nearest neighbor in a number of different cancer types; in melanoma, this effect extended to the next 39 base when bridged by a purine. Thus, GpR and GpRpA sequences constitute mutational hotspots that render the 59 G sensitive to mutation. Further analyses performed in individual cancer samples (Text S1, Figure S1 and Table S4, Panels B-E) indicated that biological mechanisms, rather than differences in variantcalling algorithms or variability between individual samples, were the likely causes of such mutational patterns.

Electron Transfer and Sequence-Dependent SBSs
Guanine is the most readily oxidized base [41] and its ionization energy, i.e. the energy required to abstract an electron, depends upon the identity of the flanking nucleotides [42][43][44][45]. Substantial work performed with model DNA sequences in vitro has shown that, following one-electron oxidation reactions, the sites of electron loss (hole) migrate efficiently (rate constants ,10 7 s 21 ) from the original locations to distant sites, where they become trapped in troughs of low ionization energy, most often at GG and GGG sequences [36,44,[46][47][48]. Because oxidative DNA damage occurs spontaneously in the cell, we tested whether the sequencedependent SBS patterns were consistent with a mutagenesis model that included: a) loss of an electron within the NGNN sequences; b) hole migration to the P2-guanine; and c) chemical modification of the P2-guanines leading to base substitutions [22].

Absolute Free Energy of Base Stacking
The binding energy of single-stranded stacked bases is presumed to be dependent upon the affinity of interactions, or the extent of electron sharing, between p orbitals across bases [49]. The free energy of base stacking, rather than hydrogen bonding, has been reported to be the major source of stability in duplex DNA [50]. Hence, we expected that strongly interacting bases would be more prone to one-electron oxidation and, hence, to higher SBS rates than weakly interacting bases. To this end, we used the absolute free-energy values of base stacking between non-bonded bases, DG(n), derived from a theoretical study [51] using a continuum solvation model and Amber force field to assess the relationships with f(DGNN) values, as we previously employed [52]. For 5/7 datasets with f(DGRN).f(DGYN) ( Table 1), i.e. two melanomas, Lung_nsc, Liver_riken and Mixed, a significant positive correlation existed between the fraction of mutated DGNN sequences and free energies of base stacking (Table S7, Panel A; r 2 0.10-0.71; Pvalues ,0.001-0.031). The normalized mutation fractions for the combined 7 datasets also displayed significant correlation (r 2 0.47; P,0.001; P(a) 0.05 = 1.000; Fig. 1C and Table S7, Panel A; f(DGNN) were according to Duke35 and AgilentV2 mappability).

Vertical Ionization Potentials (VIPs)
VIP, the minimum energy required to abstract an electron, is commonly used as a measure of one-electron oxidation reactivity [41]. We modeled the susceptibilities of G-centered DGN doublestranded trimers to oxidation via quantum chemical computations of VIPs. These analyses are expected to compare favorably with data obtained from the computationally more demanding tetramers; the VIPs for tetramers would be expected to be lower than for trimers while maintaining similar sequence-dependent rankings for P2 [53][54][55][56]. The VIP values for the 12 trimers were ,30-40% lower than that of an isolated guanine ( Table 2) whose VIP estimate was close to the experimentally determined lowest band maximun [57]. The trimer with the lowest VIP was GGG, in agreement with prior calculations [42][43][44][45] and all trimers containing a GG doublet had lower VIPs than those with a single G. In addition, a purine at the 39 position was consistently associated with lower VIPs than a pyrimidine at the 39 position. Thus, DNA sequence context affects VIPs, in accordance with guanine reactivities to oxidative reactions in vitro [42,44].
Inspection of the lowest unoccupied beta molecular orbital (LUBMO) for each DNA trimer cation, [DGN] + , in which the ionization state was modeled by removing an electron, showed that the electron hole invariably had p character with high densities at the central guanine ( Fig. 1D), or at the 59G in the GGH (H = A, C, T) sequences, consistent with previous work [42,55,56,58,59], implying that P2G was a frequent site for oneelectron oxidation reactions. Analyses between f(DGN) and VIP values displayed significant correlations for 4/5 datasets that also revealed a correlation with DG(n), i.e. melanomas, Lung_nsc and Liver_riken (as per Duke35 and AgilentV2 mappability; Figure 1E, Figure S2 and Table S7, Panel B; r 2 0.54-0.75; P-values,0.001-0.007). Notably, robust correlation was also evident when the f(DGN) data from all 18 cancer datasets were normalized and then computed as average values (Table S7, Panel B and Figure S2, Panel D, r 2 0.40; P-value 0.026; P(a) 0.05 0.615). The regression coefficients obtained using the T_hg19 and T_exons mappability data for the datasets with .2,000 SBSs were also used to perform hierarchical clustering based on absolute Manhattan distances (Fig. 1F). At a .90% confidence interval, this yielded three clusters, the largest of which contained the same cancer datasets, with the exception of Ovarian carcinoma, that also displayed f(DGRN).f(DGYN) ratios (Table 1). In summary, both base stacking and VIP data support the conclusion that electron transfer in DNA represents a significant mechanism for sequence context-dependent mutagenesis in cancer.

Sequence-Dependent SBSs in Cancer-Associated Genes
Driver mutations include non-synonymous (NS) substitutions that play a key role in cancer initiation and progression. To assess whether bona fide driver mutations also occurred in a sequence context-dependent manner, we examined the NS substitutions that altered the same genomic coordinate in more than one patient sample, and the genes affected (Text S1, Figure S3 and Table S8). For the 224 recurrent NS substitutions at GNC bps, we calculated the relative enrichment E for each of the 64 NGNN motifs, a value which is expected to approximate to 1 if the base substitutions are completely independent of flanking sequence. E values were greater for the CGNN than for the DGNN (D = A/G/T) sequences ( Figure 2, Panel A). Among the DGNN sequences, P3-purines were associated with significantly more mutations than P3-pyrimidines, a difference that was attributable to the presence Thus, recurrent NS substitutions occurred preferentially at CpG and GpA dinucleotides in cancer genomes, mirroring the sequence context-dependent pattern of SBSs observed both genome-wide and exome-wide ( Figure 1 and Table 1). Of the 18 codon changes that recurred .12 times, 7/10 affected NGNN sequences and 5/ 10 occurred at CGNN sequences, all in well-established cancer genes (Table S9, Panel A). Likewise, the most commonly mutated CGNN (Table S9, Panel B) and DGAN (Table S9, Panel C) motifs affected known driver mutations, alongside several novel candidate genes and driver mutations (Table S9), including p53 R248G , which has been reported to alter protein function (http://www-p53.iarc. fr), and GRHL3, WNK3, EPHB1, ADCY2, GSK3B and LRRN3, which are not currently listed in the cancer gene census (http:// www.sanger.ac.uk/genetics/CGP/Census).

Tissue Distribution and Networks Affected
To examine whether recurrent NS substitutions occurred equally in all tumor tissue types, we determined the relative distributions of the most frequently mutated genomic coordinates after normalizing for both tissue representation and the total number of SBSs per dataset; in the absence of any bias, each tissue would contribute 12.5%. The four genes with $12 recurrently mutated genomic coordinates (TP53, KRAS, PIK3CA and BRAF) ( Table S9, Panel A) were predominantly of breast (34%), intestine (24%) and lung (18%) origin ( Figure 2, Panel B). The three most commonly mutated CGNN sequences (CGTC, CGGA and CGTG; S = 9.5, 6.0 and 5.7, respectively) were found in genes mutationally altered in the intestine (26%), ovary (19%) and breast (15%), whereas the most commonly mutated DGAN sequences (TGAT, GGAA and TGAA; S = 2.8, 2.0 and 2.0, respectively) were found predominantly in genes altered in melanoma (37%), breast and lung (17% each) (Table S9). By contrast, these mutated motifs were underrepresented in the liver (#2%). Of the 64 codons affected, 6 (3 in TP53, 2 in PIK3CA and 1 in GNAS) are known driver mutations, 4 introduced stop codons into TP53, and 26 occurred within genes whose involvement in cancer is strongly suspected (Table S9 and http://www.sanger.ac.uk/genetics/ CGP/Census). Thus, although high-confidence driver mutations occurred preferentially at CGNN and DGAN motifs, their occurrence between tissues was highly asymmetrical, with DGAN mutations occurring predominantly in tumors of the skin.
Finally, we used pathway analyses to survey the 150 recurrently mutated genes ( Figure 2, Panel C). In all tumor tissues, 18 pathways/networks related to cell-cycle checkpoints and the DNA damage response were found to be compromised in all types of tumor, the sole exception being melanomas in which only the MAP kinase signaling pathway was consistently altered. A similar pattern was revealed when all NS substitutions were analyzed, irrespective of whether the data from all patients were merged ( Figure S4, Panel A) or plotted separately ( Figure S4, Panels B and  C). The highest-ranking pathways were dominated by TP53 mutations in most tumor types ( Figure S4, Panels B and C), with the exception of pancreatic cancers in which KRAS mutations dominated. Both p53 and KRAS proteins are known to act on parallel signaling cascades that regulate TERT, the active reverse transcriptase component of telomerase that controls the stability of chromosome ends (http://www.biocarta.com/ pathfiles/h_telPathway.asp). Hence, although critical pathways represent common targets for oncogenic transformation, the altered genes may vary between different patients or organ/ tissue types. In summary, a distinction emerged between melanoma and the other types of cancer, both with regard to the sequence contexts targeted by driver mutations, DGAN vs. CGNN sequences, and to the pathways that hosted these mutations, MAP kinase vs. p53-associated signaling pathways.

Guanine Is Preferentially Targeted by Pathogenic Germline Mutations
In the HGMD missense/nonsense mutation dataset, approximately 68% of SBSs occurred at GNC bps, a proportion similar to the EWS cancer datasets, although correlations with DG(n) or VIPs were absent (Table S7, Panel B) and no enrichment for DGRN sequences was apparent (Table 1). However, r(DGNN), which measured the fraction of mutated DGNN motifs relative to the direction of transcription, revealed that the P2 position was more likely to contain a guanine on the non-transcribed strand, relative to the transcribed strand, when stacking interactions with neighboring bases were high (r 2 0.32; P-value,0.001; P(a) 0.05 0.991; Figure 3, Panel A). No such behavior was evident in the cancer datasets (not shown), whereas limited bias was observed in the 1000 Genomes Project dataset (Figure 3, Panel A). Thus, transcription led to a pattern of sequence context-dependent SBSs among pathogenic germline mutations, which mirrored that observed in several cancer genomes.
The HGMD dataset of inherited splicing mutations contained 9,907 SBSs that may be assumed to adversely affect RNA processing; 8,308 (84%) of these mapped to within 5 bases of donor and acceptor splice junctions (Figure 3, Panel B). Strikingly, although the canonical GT and AG intronic splice junctions at the donor AG ' GTAAGT and acceptor CAG ' GT sequences were found to be .99.9% conserved in the RefSeq dataset of human genes ( Figure S5 top), only three positions, i.e. AG ' GTAAGT and CAG ' GT, were frequently mutated (1,297-2,601 SBSs, 65%), the T-containing donor position being only modestly affected (836 SBSs). We also used the 1000 Genomes Project dataset to assess the extent of splicing variation in the general population (Figure S5 middle); the smallest number of SNVs occurred at all 4 highly conserved positions, as expected. By contrast, in the cancer datasets, the number of SBSs around splice junctions was found to be independent of sequence conservation ( Figure S5 bottom). In summary, pathological germline splicing mutations preferentially targeted those positions that exposed a purine base to the non-transcribed strand during DNA transcription.

Discussion
Large-scale next-generation sequencing projects of cancer genomes are providing an unprecedented opportunity to address the key issue of the nature of the underlying mechanisms of base substitution in tumorigenesis. This issue is generally approached by analyzing the types of base substitution specific to the cancer tissue, i.e. mutation spectra, based on the assumption that different types of base substitution originate via different mutational mechanisms, as assessed by animal model systems [4,6,8,9,[16][17][18]21,[60][61][62][63]. We have chosen the alternative approach of addressing the sequence-context dependency of single base substitution, with the expectation of shedding light on the earliest step(s) in the process of mutagenesis, i.e. the susceptibility of DNA to base modification. Once modified, a base would then undergo various types of substitution based upon the type of modification it incurred, its interactions with DNA repair and replication systems and, possibly, the tissue of origin [62].
As revealed by correlations with VIPs and absolute free energies of base stacking, we uncovered a direct correlation between electronic coupling along the DNA chain, leading to electron transfer, and sequence-dependent SBSs, both in human cancers and as a cause of inherited disease. Thus, charge transfer appears to be the earliest event in the mutational mechanism acting along  the path leading to base substitution in cancer. Electron transfer is the simplest chemical reaction and is known to underlie a number of fundamental biological processes such as cellular respiration and photosynthesis. By establishing the relevance of charge transfer to mutational changes in the DNA molecule, our study enables improved predictions of the relative contribution of individual mutagenic processes and DNA repair activities to cancer (Donohue et al., unpublished data). The somatic and germline settings studied here are qualitatively and quantitatively different and quite distinct from one another. In the former, very large numbers of somatic mutations occur as a consequence of the disease state whereas in the latter, only one or two germline mutations are generally involved in disease etiology. Despite these fundamental differences, similar sequence context-dependencies are evident, which are explicable in terms of the intrinsic physical properties of DNA, i.e., free energy of co-axial base stacking and electronic coupling among flanking bases. We propose a model for SBSs that includes one-electron oxidation reactions (Figure 3, Panel C). In the first step, abstraction of an electron from DNA (base or sugar) by a radical species, either endogenous or exogenous, creates an electron hole. In the second step, the electron hole migrates reversibly to various competing sites, including flanking or more distant bases, as well as other molecules and contacting chromatin-associated amino acids, causing in some instances DNA-protein crosslinks [64]. Guanines with the lowest ionization potentials, as determined by neighboring bases, are the strongest hole-attracting sites. The resulting radical cations (G N+ ) are then expected to undergo a number of chemical modifications, leading to a variety of stably modified bases, including 8-oxoG, a key toxicological lesion, oxazolone, imidazolone and others, some of which can result in base changes during DNA replication if left unrepaired [22,[65][66][67]. Guanine-protein crosslinks may also lead to SBSs [68].
GpA, which we confirmed to be a key mutation hotspot [6,8,9,21,61,63] and found to be enriched in sporadically clustered non-synonymous substitutions (Table S10), would therefore yield mutations through electron transfer [36,37,42,69], tissue-specific deamination [33] and photoexcitation, leading to cyclobutane pyrimidine dimers (CPDs) in melanoma [10,11]. We also identified NGRA, and to a lesser extent NGRG, sequences as mutation hotspots specific to melanoma. Attempts to determine whether mutations at NGRA might have been caused by UVphotosensitization or electron transfer, based on mutation spectra analyses (see Introduction), were uninformative since base substitution patterns were heavily sequence-context dependent. For example, in the context of these melanoma datasets, P2-G in the TGTT motifs underwent GRA:GRT substitutions in the relative proportions 17:75, whereas for the TGCC motifs this ratio was 81:13. Hence, sequence context determines the outcome of single base substitution in a manner that still eludes complete understanding. Nevertheless, if electron transfer reactions were involved, then P4-A would be expected to exert stronger effects than P4-G on mutations at P2-G, since hole trapping is much weaker on adenine than on guanine bases [47]. x-axis, position of SBSs relative to +/220 nt of splice junctions; Panel C, model for sequence-dependent SBSs in cancer and human inherited disease. In the first step, an electron is lost from within a tetranucleotide sequence, leaving a hole. In the second step, the hole migrates to and from various competing sites, including nearby bases and chromatin-associated amino acids (not shown), eventually being trapped by a guanine base. The resulting guanine radical cation either causes DNA-protein crosslinking or undergoes subsequent chemical modifications. If the modified base is not corrected by DNA repair, it may give rise to a mutation (X-Y base-pair) as a result of error-prone DNA polymerase synthesis during DNA replication (dashed arrow). doi:10.1371/journal.pgen.1003816.g003 The CpG dinucleotide was found to be a consistent mutational hotspot, both in cancer and the germline, a result that generalizes the conclusions drawn from previous studies [4,9,[16][17][18]63]. The high frequency of CRT transitions at CpG dinucleotides is generally attributed to high rates of deamination of 5-methylcytosine resulting from methylation of CpG sites [9,19,20,62]. However, other mechanisms have been proposed [60,62], such as enhanced susceptibility of methylated CpG sites to damage by physical and chemical genotoxic agents [62]. This latter interpretation would be consistent with our finding that electronic coupling is an important factor in establishing the hierarchy for base modification in DNA. During the course of normalizing mutation fractions by genome mappability, we noted an enrichment of CGNN sequences in Segmental Duplications. Nakken et al. [70] reported a higher density of CpG islands in Segmental Duplications than in unique chromosomal regions, whereas Xie et al. found methylation-associated SNP clusters to be more prevalent in Segmental Duplications than in unique regions [71]. Thus, the prevalence of CGNN-associated SBSs may well be greater than our study indicates.
A confounding factor in our analyses is the relatively small number of SBSs, particularly in EWS datasets, which caused large variations in f i values. Indeed, three of the four datasets that displayed high-confidence (i.e. P,0.05 and P(a) 0.05 .0.800) correlations between SBSs and DG(n) or VIPs were obtained from genome-wide studies. Combining all f i values into a single group ( Figure S2, Panel D) only alleviates the problem, since the f i values for each dataset are given the same weight. Nevertheless, the ensuing ''cautiously significant correlation'' is consistent with a role for electronic coupling in cancer-related mutagenesis. A second confounding factor is the multiple roles that the GpA dinucleotide plays in mutagenesis, as eluded to earlier. In the case of melanomas, if the numbers of mutated NGAN (and NGA) sequences were dominant, this might cause chance correlation with DG(n) or VIP values, when in fact most mutations could arise from CPDs on the complementary strand. Correlations for both the Melanoma_gws and Melanoma_ews datasets remained highly significant (P,0.002; P(a) 0.05 0.920-1.000) when the f i values for the NGAN (or NGA) sequences were excluded from the analyses, thereby confirming a role for charge transfer. This conclusion is further supported by the observation that electronic coupling and photo-induced energy transfer reactions at pyrimidine dimers occur simultaneously and impinge on one another [72][73][74].
In cancer, the subset of mutational changes resulting from NS substitutions that recurred in different patient samples displayed the same enrichment of mutations at CpG and GpA sequences as the exome-wide and genome-wide sequence alterations, supporting the notion of common underlying causes, i.e. cytosine methylation, electron transfer (this study), enzymatic cytosine deamination and CPD formation (in melanoma) [10,11,17]. These commonalities suggest that the mechanisms involved in generating ''driver'' tumor initiating mutations are likely to be similar to those involved in generating the bulk of subsequent ''passenger'' mutations. Hodis et al. [75] reached a similar conclusion using a quite different approach. Thus, electron transfer appears to be involved in both the early (driver mutations) and late (passenger mutations) phases of tumorigenesis, particularly in tissues of epithelial origin. Recurrent NS substitutions were observed predominantly in gene networks associated with p53 function in all tumor types, the exception being melanoma where a preponderance of mutations at GpA segregated with genes of the MAP kinase signaling pathway. The reason for this distinction remains unclear; however, the critical role played by the MAP kinase signaling pathway in melanocyte proliferation in response to UV damage [76] suggests that positive selection may have been a contributory factor.
The results of the HGMD data analysis support the occurrence of electron transfer in germline mutagenesis associated with human inherited disease, although sequence context-dependent mutagenesis was evident only when mutations were mapped onto the non-transcribed strands of genes. Guanines modified by oxidative DNA damage are repaired predominantly by base excision repair (BER) [77,78]. Since oxidative DNA damage occurs more efficiently in single-stranded DNA than in doublestranded DNA [79,80], oxidative guanine lesions may have formed more frequently on the single-stranded, non-transcribed, strand than on the DNA:RNA duplex during transcription. Thus, a greater number of lesions would be expected to escape BER on the non-transcribed strand than on the transcribed strand. In cancer cells, the large number of mutations that generally accumulate during tumor growth could have masked this bias. An alternative or additional possibility is that transcriptioncoupled nucleotide excision repair, a mechanism that processes bulky DNA adducts and which selectively corrects errors on the transcribed DNA strand [81], might have contributed to the strand asymmetric mutations observed in the HGMD dataset [9,11,12,82,83]. In similar vein, we interpret the selectivity of mutations at purines on the non-transcribed strands of splice junctions as a consequence of oxidative damage, whose effect could have been prolonged by the pausing of transcriptioncoupled splicing at splice junctions [84]. With the number of sequenced genomes rapidly increasing, it will be of great interest to ascertain whether electron transfer constitutes a general mutational mechanism that is common to all forms of life.

Datasets
We collected the publicly available data from cancer genome studies reported in PubMed from 2007 through December 2011 [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15] together with the 5 largest datasets available from the International Cancer Genome Consortium (ICGC). The cancer genome datasets varied widely in terms of sequencing strategies, mapping techniques and variant-calling algorithms, implying that the power to detect SBSs may differ depending upon the datasets and methodologies used [85]. However, all studies excluded base variants present in matched-control tissues, such that the reported SBSs were changes attributed to somatic mutations in the tumor tissue. Matched controls were used for all patient samples. On average, between 6 [15] and 1834 [1] tumor-specific SBSs were reported in the EWS studies (between 1012 [3] and .50,000 [8] in the GWS studies) ( Table 1), which is ,1-3 orders of magnitude lower than the numbers of non-synonymous and splice-site variants noted on average in whole-exome studies [86]. In addition to normal-tumor matched samples, single nucleotide polymorphisms present in dbSNP databases or in the Venter and Watson genomes [1,10,11] were also used to exclude common base variants. Differences in variant-calling power were mitigated in our study since we examined relative proportions of mutated sequences, rather than absolute mutation fractions.
A second source of variation in detecting SBSs among the cancer genome studies was the sequencing instrument used. Illumina sequencers have been reported to yield systematic basecall errors, especially at the last base of context-specific GGC and GGT sequences, which affect either the forward or reverse strand, and at inverted repeats [87,88]. The sequencing technologies employed included Illumina genome analyzers, SOLiD nextgeneration DNA sequencing, ion semiconductor sequencing, dubbed cPAL (combinatorial probe-anchor ligation) nanoballs, capillary electrophoresis, 454 pyrosequencing and mass spectrometry, often used in combination to verify variant calling. Illumina sequencers were the most commonly instruments employed in the studies whose data we used [1,3,4,6,7,[10][11][12][13][14]. The frequency of such base-call errors has been estimated at ,0.1-0.3% before filtering, and even lower after filtering (SAMtools) [87]. Considering that sequencing errors tend to occur over long simple repeat tracts, which have low mappability, and that systematic errors at GGT were ignored (we analyzed mutations at GNC bps only), it seems unlikely that base-call errors have biased our analyses by .0.1%, an acceptable limit.

Mappable Mutations
Approximately half of the human genome sequence comprises highly homologous repetitive DNA elements (Alu repeats, LINE elements etc.) and simple repeats, and an additional ,3.6% contains Segmental Duplications, i.e. segments of .1 kb in length that are present at multiple loci and which share ,90-98% sequence similarity (http://genome.ucsc.edu). Thus, because only the mappable genome may be scored for mutations, we used various methods to estimate the total number of mappable NGNN sequences to use as denominators in the f i fractions (see below). Three methods were used for the GWS studies: 1) the entries with a mappability index of 1 (representing unique sequences) from file wgEncodeDukeMapabilityUnique-ness35bp.bigWig (http://hgdownload.soe.ucsc.edu/goldenPath/ hg19/database/) generated for the ENCODE project by the Duke University Institute for Genome Sciences and Policy (IGSP) and at the European Bioinformatics Institute (EBI), which we refer to as Duke35; 2) we selected sequences from the mappability file wgEncodeCrgMapabilityAlign50mer.bw ((http://hgdownload.soe. ucsc.edu/goldenPath/hg19/database/) [89] (Donohue et al., unpublished data), referred to as CRG50; and 3) we retrieved all NGNN sequences in the GRCh37/hg19 release of the human genome assembly (chromFa.tar.gz file at http://hgdownload.soe. ucsc.edu/goldenPath/hg19/bigZips/) (T_hg19).

Definitions
We defined f i = m i /t i , where m i was the number of mutations at a specific NGNNNNNCN sequence (henceforth designated as NGNN) and t i was the total number of that sequence in one of the six ''mappable'' sets described above. The total number of NGNN sequences was doubled for the self-complementary AGCT, CGCG, GGCC and TGCA sequences since, like all NGCN sequences, they contain two mutations sites, one on the forward and one on the reverse strand. In relation to the counts of mutated NGNN motifs, if the .G.. occurred at the same genomic coordinate more than once within a cancer dataset, or if it was a homozygous mutation, it was considered as one count. Custom shell and FORTRAN scripts were used to obtain the total numbers of mappable NGNN and f i fractions (see Text S1 for sample scripts). The normalized fractions of mutated DGNN sequences were defined as F i = f i /gf i , thus, gF i scaled to 1. N indicates any base (A/C/G/T); D indicates A/G/T; B indicates C/G/T. As mentioned, sequence designation implies doublestranded DNA (i.e. AGTC = (59-AGTC-39)N(59-GACT-39)). The average base stacking free energies ,DG(n). were obtained from Friedman and Honig [51] by using the DG(n) (e i = 2) values for the three base steps (DpG + GpN + NpN)/3. The free energy of base stacking DG(n) is an estimate of the absolute contribution of base stacking to nucleic acid stability in the absence of hydrogen bonding interactions, and contains a contribution from nonpolar plus electrostatic forces, as assessed from a theoretical approach using the Amber force field and a continuum solvation model of water. The largest contribution to DG(n) was found to arise from nonpolar [51], as opposed to electrostatic, interactions. Nonpolar interactions were contributed for the most part by enhancement in the Lennard-Jones component as a result of close packing, and to a smaller extent from hydrophobic interactions. Thus, the DG(n) values follow the same trend as the nonpolar contributions to free energies of base stacking DG np (n), i.e. purine-purine .. purinepyrimidine . pyrimidine-purine . pyrimidine-pyrimidine, in qualitative agreement with experimental determinations [51]. The relative enrichment E of sequence i, E i , was defined as the ratio D i /T i , where D i = d i /gd i , d i being the number of times sequence i was mutated at least twice at the same hg19 coordinate and T i = t i / gt i , t i being the total number of occurrences of sequence i exomewide (T_exons). Finally, S = s n /gs n and s n = t n /c n , t n being the number of times the combined (top) sequences were recurrently mutated and c n being the total number of NS substitutions in a particular type of cancer.

Molecular Modeling
Three-dimensional structures of the 12 possible double-stranded DGN trinucleotides were constructed using w3DNA [90]. Hydrogen atoms, atomic charges and four neutralizing Na + counterions were assigned to each sequence according to the amber99 force field [91], using UCSF CHIMERA [92]. Na + counterions were positioned next to the four DNA backbone phosphates. Each trinucleotide was energy minimized in vacuo using the 10,000 steps steepest descent algorithm and the amber99 force field in GROMACS 4.5.1 [93]. Ten and 14 Å cutoffs were used for Coulomb and van der Waals interactions, respectively.

Vertical Ionization Potentials (VIPs)
VIPs were computed using Kohn-Sham density functional theory (DFT) [94] employing the Minnesota M06-26 functional [95,96] with all-electron 6-31G(d) basis sets [97,98], as implemented in the GAMESS electronic structure package [99], and including backbone phosphate groups and sodium counter ions in addition to the DGN double-stranded bases. The M06-26 functional was used since this method provides accurate descriptions of hydrogen bonding and stacking interactions between basepairs. We reasoned that the DGN set would provide the same type of information as the computationally more demanding NDGN set. Molecular orbitals were depicted using the MacMolPlt graphics program [100].

Pathway Analysis
For individual patient samples, mutations were collated and sorted into lists of genes carrying mutations using customized R scripts (http://www.r-project.org/). The gene lists for each sample were entered into our pattern extraction pipeline analysis (PPEP) [101], as implemented in the WPS package [102], to obtain the ListHit of genes (number of genes from each list that are annotated to each pathway) for each of the BioCarta pathways. For each tumor type, each pathway was ranked on the basis of how frequently it was ''hit'' by individual patient samples and the ranking scores were obtained as the percentages of patient samples that had at least one hit in the corresponding pathway, using customized R scripts. The tumor type ranking scores for each pathway were combined and used to rank the pathways for all tumor types. The highest ranked pathways represent the most ''popularly'' hit pathways amongst all types of tumors. For each highly ranked pathway, the genes carrying the mutations were retrieved from each patient sample, ranked and displayed as genelevel heatmaps. For the pathway analysis of recurrent NS substitutions, the relevant genes for each tumor type were collated into lists and subjected to PPEP analysis, as described above.

Hierarchical Clustering
Agglomerative hierarchical clustering dendrograms [103] were built using either the regression coefficients, r, between the fractions of mutated DGN sequences, f(DGN), and the VIP values, or the absolute orthogonal distances (Manhattan distances) between each f(NGNN) data point for all datasets. All-to-all comparisons were performed, allowing the relative estimation of all components of the systems, including the reference VIP branch.  Table S4 f i fractions and associated statistics. Panel A, counts of SBSs occurring at each NGNN sequence in all datasets and ratios of SBSs affecting GNC base-pairs; Panel B, f i fractions for the GWS datasets according to three methods and associated statistics (twotailed paired t-test); Panel C, f i fractions for the EWS datasets according to three methods and associated statistics (two-tailed paired t-test); Panel D, f i fractions for the 2 small EWS datasets (Colorectal_1 and Colorectal_2) and the 14 Melanoma_ews samples according to three methods and associated statistics (ztest; P(a) 0.05 statistics); Panel E, chromosomal positions and counts of SBSs occurring at each NGNN sequence in the 14 patient samples from the Melanoma_ews dataset (*T series) and in the 5 patient samples from the Pancreatic_ca dataset (PCSI* series).   Hg19 coordinates containing the largest number of recurrent NS substitutions for various types of motifs and their potential as driver mutations; 1 sequences are reported with the mutated base (A or G) at P2; 2 support from PubMed for a driver mutation following searches for ''gene_name & mutated_codon'' or ''gene_name & cancer'' and a manual review of the articles found: *** mutated codon that has been reported as a driver mutation; ** (!) mutated codon in p53 that has not been reported to harbor a driver mutation; ** strong support for gene mutation or change in gene expression being involved in tumorigenesis; * weak support for gene mutation or change in gene expression being involved in tumorigenesis; unk, insufficient information to assess whether a gene mutation or change in gene expression was involved in tumorigenesis. (DOCX) Text S1 Supporting information. The text contains information on the distribution of NGNN sequences in the mappability files and in Segmental Duplications, a description of mutational mechanisms shared by similar cancer types and individual samples, the analysis of recurrent non-synonymous substitutions, and exemplary scripts for obtaining the fractions f i of mutated NGNN sequences. (DOCX)