Prominent features of the amino acid mutation landscape in cancer

Cancer can be viewed as a set of different diseases with distinctions based on tissue origin, driver mutations, and genetic signatures. Accordingly, each of these distinctions has been used to classify cancer subtypes and to reveal common features. Here, we present a different analysis of cancer based on amino acid mutation signatures. Non-negative Matrix Factorization and principal component analysis of 29 cancers revealed six amino acid mutation signatures, including four signatures that were dominated by either arginine to histidine (Arg>His) or glutamate to lysine (Glu>Lys) mutations. Sample-level analyses reveal that while some cancers are heterogeneous, others are largely dominated by one type of mutation. Using a non-overlapping set of samples from the COSMIC somatic mutation database, we validate five of six mutation signatures, including signatures with prominent Arg>His or Glu>Lys mutations. This suggests that classifying cancers by amino acid mutation patterns may open avenues of inquiry into specific protein mutations and generate novel insights into cancer biology.

Reflecting this, cancer progression is determined in part by genetic diversification and clonal selection within complex tissue landscapes and with changing tumor properties and microenvironment features [2,3]. Genetic sequencing of tumor samples has been critical in developing the evolutionary theory of cancer. While cancers traditionally have been, and continue to be, classified by tissue of origin, genetic sequencing has allowed

Several cancers are enriched for R>H and E>K amino acid mutations
Multiple studies have interrogated nucleotide mutation biases by analyzing somatic variation across a wide range of cancers [4,5]. However, in protein coding regions of the genome (i.e., the exome), it is essential to study patterns of amino acid variation to reveal information about potential functional effects at the protein level. We characterized the global properties of amino acid mutations encoded by somatic mutations across a range of cancers by analyzing a tumor-normal paired mutation database [5] consisting of 6,931 samples across 29 cancer types. We applied filtering to remove sequencing artifacts and restricted mutation data to nonsynonymous amino acid mutations (see Materials and Methods, S1 Table and S2 Table for details).
Using this amino acid mutation database, we performed an unbiased characterization of mutation signatures across cancer types using Non-negative Matrix Factorization (NMF), which has proven to be a useful tool for pattern discovery in cancer tissue mutation datasets [5] and other biological systems [9]. Applying NMF to the pooled mutation data reveals six mutation signatures at the amino acid level (S1G Fig), including two with strong Arg>His components and two with strong Glu>Lys components (Fig 1A, S1 Fig). Although the cancers are composed of a mixture of the signatures identified, ten cancers (AML, colorectal, esophageal, low-grade glioma, kidney chromophobe, medulloblastoma, pancreatic, prostate, stomach, and uterine) have majority contributions from Arg>His-prominent mutation signatures (R>H and A>T/R>H). We also identify four cancers (bladder, cervix, head and neck, and melanoma) that have majority contributions from Glu>Lys-prominent mutation signatures (E>K and E>K/E>Q). Additionally, there are two complex signatures not dominated by any particular amino acid mutation. Glioblastoma, kidney papillary, liver, and thyroid cancers have majority contribution from the Complex 1 signature, and lung adenocarcinoma, small cell lung, squamous cell lung, and neuroblastoma cancers all have majority contribution from the Complex 2 signature. Finally, seven cancers from a variety of tissues (ALL, breast, CLL, clear cell kidney, B-cell lymphoma, myeloma, and ovarian) have heterogeneous mutation signature contributions.
As an alternative visualization of the amino acid mutation spectrum, we use principal component analysis, which shows cancers clustering by dominant mutation class (Fig 1B).
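The per-cancer amino acid mutation spectra that serve as input to NMF and PCA can be illustrated with a minimal sketch. This is not the pipeline used in the study: the record layout (cancer type, reference residue, mutant residue), the function names, and the toy records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (cancer_type, reference_aa, mutant_aa) tuple per
# nonsynonymous mutation. These records are made up for illustration only.
mutations = [
    ("melanoma", "Glu", "Lys"),
    ("melanoma", "Glu", "Lys"),
    ("melanoma", "Arg", "His"),
    ("colorectal", "Arg", "His"),
    ("colorectal", "Arg", "His"),
    ("colorectal", "Arg", "His"),
    ("colorectal", "Ala", "Thr"),
]


def mutation_spectra(records):
    """Return {cancer: {"Ref>Alt": fraction}}; fractions sum to 1 per cancer."""
    counts = defaultdict(Counter)
    for cancer, ref, alt in records:
        if ref != alt:  # keep only records that change the amino acid
            counts[cancer][f"{ref}>{alt}"] += 1
    spectra = {}
    for cancer, counter in counts.items():
        total = sum(counter.values())
        spectra[cancer] = {mut: n / total for mut, n in counter.items()}
    return spectra


def dominant_mutation(spectrum):
    """Return the most frequent mutation class and its fraction."""
    mutation = max(spectrum, key=spectrum.get)
    return mutation, spectrum[mutation]


spectra = mutation_spectra(mutations)
for cancer, spectrum in spectra.items():
    print(cancer, dominant_mutation(spectrum))
```

Stacking these spectra as columns, with one row per mutation class, gives a non-negative frequency matrix of the kind that NMF and PCA operate on.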

Individual Cancer Samples Recapitulate Amino Acid Mutation Patterns
We also analyze samples individually with NMF and find that Arg>His and Glu>Lys features continue to dominate (Fig 2A and S3 Fig). For many cancer subtypes (melanoma, bladder, uterine, colorectal, low-grade glioma, cervix, neuroblastoma, and the three different lung cancers), individual patients within each cancer exhibit consistent amino acid signatures (Fig 2B). This is true even within clinically diverse cancers such as bladder, uterine, colorectal, and lung cancer, which all have multiple identified driver mutations. This suggests that the amino acid signatures we identified may be independent of underlying driver mutations and may instead be a consequence of common features of the cancer, tumor microenvironment, or selective pressures, all of which may be targeted therapeutically.
We calculated correlation coefficients between all COSMIC Data signatures and each Alexandrov Data signature. A very high correlation indicates that NMF has identified the same general mutation signature in the two data sets. Indeed, we found high correlation between the COSMIC signatures and our initially identified signatures for five of the six (Fig 4).
Charge-changing mutations, whether buried or surface-exposed, can alter protein charge, electrostatics, and conformation [16]. Electrostatics of surface residues have been shown to play a key role in protein-protein interactions [17], protein-membrane interactions [18,19], and kinase substrate recognition [20]. While it is important to note that our analyses are agnostic to the location of the mutation within the proteome and within a protein, the strong bias towards amino acid mutations that alter charge in our identified mutation signatures may suggest an adaptive advantage conferred by these mutations.
Glu>Lys mutations swap a negatively charged amino acid for a positively charged amino acid, which may in some cases affect protein function. Indeed, in some cases buried lysine mutations can induce global protein unfolding upon charging that alters mutant protein stability and function [21]. Furthermore, Glu>Lys mutations are known to affect the function of PIK3CA [22-24].
Arg>His mutations swap a positively charged amino acid for a titratable amino acid. Whereas arginine (pKa ~12) should always be protonated, histidine (pKa ~6.5) can titrate within the narrow physiological pH range. Indeed, the pH-sensitive function of many wild-type proteins has been shown to be mediated by titratable histidine residues [25-27]. Moreover, recent work has shown that some Arg>His mutations can confer pH sensitivity to the mutant protein and alter function [28]. We predict that some Arg>His mutations may be adaptive to increased intracellular pH (pHi), conferring a gain in pH sensing to the
From our analyses, Arg>His mutations define the mutation landscape of a diverse set of cancers across a range of tissues including brain (low-grade glioma), digestive (colorectal), reproductive (uterine), and blood (AML) cancers. Importantly, these cancers do not have overlapping nucleotide mutation signatures [5], which suggests that the amino acid mutation signatures we identified may reflect other aspects of the cancers including distinct physiological pressures, microenvironment features, or functional requirements. Indeed, these results may help inform studies in the emerging field of Molecular Pathologic Epidemiology (MPE) [29,30], which seeks to integrate knowledge across disciplines to inform personalized approaches to cancer prevention and therapy. Linking amino acid signatures to physiological or pathological features of the cancer could be important for identifying selective pressures that may be driving or sustaining the cancer as well as for limiting disease progression, particularly where targeted approaches fail [31-33].
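The titration argument for Arg>His mutations can be made concrete with the Henderson-Hasselbalch relation, under which the protonated fraction of a titratable group is 1 / (1 + 10^(pH - pKa)). The sketch below uses the approximate pKa values quoted above. The two pH values are illustrative assumptions, not measurements from this study, and side-chain pKa values in a folded protein can shift with the local environment.

```python
def protonated_fraction(pka, ph):
    """Fraction of a titratable group that is protonated at a given pH
    (Henderson-Hasselbalch): 1 / (1 + 10**(pH - pKa))."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Approximate side-chain pKa values as quoted in the text.
PKA = {"Arg": 12.0, "His": 6.5}

# Illustrative pH values (assumed for this example only).
for ph in (7.2, 7.6):
    for residue, pka in PKA.items():
        fraction = protonated_fraction(pka, ph)
        print(f"pH {ph}: {residue} protonated fraction = {fraction:.3f}")
```

Under these assumptions, arginine stays essentially fully protonated at both pH values, whereas the protonated fraction of histidine falls from roughly 17% to roughly 7%. This is the sense in which an Arg>His substitution replaces a fixed positive charge with a pH-dependent one.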

Mutation Dataset Filtering
We validated the dataset [5] by comparing known frequencies of well-studied cancer driver genes with observed frequencies in the dataset. Specifically, BRAF is mutated in 40-50% of melanoma samples, and IDH1 is mutated in 75-85%, 8-12%, and 1-5% of low-grade glioma, AML, and glioblastoma samples, respectively. We used the p53 database (http://p53.fr/index.html) to find expected p53 mutation frequency for various cancers: colorectal, head and neck, pancreatic, stomach, liver, and breast cancer have 43%, 42%, 34%, 32%, 31%, and 22% p53 mutation rates, respectively. The observed mutation frequencies were consistently lower than expected for the genes/cancers we assessed, which suggests that the dataset authors [5] were perhaps too stringent in quality control (QC) filtering. Different levels of QC filtering were performed, and we systematically relaxed filters to recapitulate the expected mutation frequencies of the selected canonical driver genes. Applying only the 'sequencing artifact' QC filter (from [5]) most closely recapitulated expected mutation frequencies for the canonical driver genes, and this filter alone was used for the remainder of the bioinformatics analyses.
NMF is an unsupervised learning method used to decompose a data matrix into a product of two non-negative matrices representing a set of k signals and mixture coefficients. For example, if $X$ is an $n \times m$ matrix representing the nonsynonymous mutation frequency data, then the NMF of the data is given by
$$X = WH$$
where $W$ is an $n \times k$ matrix with the $k$ columns representing mutation signatures and $H$ is a $k \times m$ matrix representing the mixture coefficients that best reconstruct $X$. Often it is not possible to factor $X$ exactly, so a typical approach to solving the decomposition will
We use the R function prcomp to perform all PCA analyses.
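The approximate factorization of the mutation frequency matrix X into non-negative factors W and H can be sketched as follows. The text here does not say which solver was used, so this example swaps in one common choice as an assumption: Lee-Seung multiplicative updates that reduce the Frobenius-norm reconstruction error. The toy matrix, the orientation of rows as mutation classes and columns as samples, and all function names are likewise illustrative.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2
               for xr, ar in zip(x, wh) for xv, av in zip(xr, ar)) ** 0.5


def nmf(x, k, iterations=500, seed=0, eps=1e-12):
    """Approximate X (n x m) by W (n x k) times H (k x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Toy data: 3 mutation classes (rows) x 4 samples (columns), made up.
x = [
    [0.7, 0.6, 0.1, 0.2],
    [0.2, 0.3, 0.8, 0.6],
    [0.1, 0.1, 0.1, 0.2],
]
w, h = nmf(x, k=2)
print("reconstruction error:", round(frobenius_error(x, w, h), 4))
```

In this layout each column of `w` is a candidate signature over mutation classes, and each column of `h` gives the mixture of signatures that reconstructs one sample. Because the updates are multiplicative and start from positive values, both factors remain non-negative throughout.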

Validation of NMF Mutation Signatures
To validate the mutation signatures that we discovered in our data, we sought an orthogonal data set in which to replicate our analysis. We used the COSMIC