A framework for mutational signature analysis based on DNA shape parameters

The mutation risk of a DNA locus depends on its oligonucleotide context. In turn, mutability of oligonucleotides varies across individuals, due to exposure to mutagenic agents or due to variable efficiency and/or accuracy of DNA repair. Such variability is captured by mutational signatures, a mathematical construct obtained by a deconvolution of mutation frequency spectra across individuals. There is a need to enhance methods for inferring mutational signatures to make better use of sparse mutation data (e.g., resulting from exome sequencing of cancers), to facilitate insight into underlying biological mechanisms, and to provide more accurate mutation rate baselines for inferring positive and negative selection. We propose a conceptualization of mutational signatures that represents oligonucleotides via descriptors of DNA conformation: base pair, base pair step, and minor groove width parameters. We demonstrate how such DNA structural parameters can accurately predict mutation occurrence due to DNA repair failures or due to exposure to diverse mutagens such as radiation, chemical exposure, and the APOBEC cytosine deaminase enzymes. Furthermore, the mutation frequency of DNA oligomers classed by structural features can accurately capture systematic variability in mutagenesis of >1,000 tumors originating from diverse human tissues. A nonnegative matrix factorization was applied to mutation spectra stratified by DNA structural features, thereby extracting novel mutational signatures. Moreover, many of the known trinucleotide signatures were associated with an additional spectrum in the DNA structural descriptor space, which may aid interpretation and provide mechanistic insight. Overall, we suggest that the power of DNA sequence motif-based mutational signature analysis can be enhanced by drawing on DNA shape features.

Nucleic Acids Research 2017) and thus by analogy we focus on these features also for mutation risk prediction. Additionally, using a more focused set of DNA structure features allowed us to match the number of features on the sequence-context side, and to reduce the numbers of free parameters in the signature analyses." We added the following text to Discussion: "One caveat of the set of structural features currently employed is that their intended use is for ds DNA, however they are ill-defined in ss DNA segments. This means that the framework presented here, in its current implementation, would not be able to adequately address those mutational processes mediated by structured ss DNA. One known example are the hypermutable hairpin structures in DNA, where the loop is a target of the APOBEC3A cytosine deaminase (Buisson et al Science 2019) and is thus hypermutable (we note that only a minority of APOBEC3A mutations genome-wide is mediated by these stem-loop structures (Mas-Ponte Nature Genetics 2020)). Future work building upon our study may incorporate DNA structural descriptors pertinent to ss DNA, as well as increase the DNA oligomer length by drawing upon the ever-increasing amounts of cancer genomic data to increase statistical power." These are the references cited:

2, For Fig2 and Fig3
, the author only used one sample to represent the typical mutational processes. It would be good to see if the trend still holds for additional individuals as there are a lot of such samples available from TCGA/ICGC. In addition, POLE mutants have different hotspots (like V411L and P286R) with differential spectrum (Fang et al, 2020, Plos Gene;Hodel et al, 2020, Mol Cell), the sample used in this study should be indicated.

Response:
We thank the Reviewer for a suggestion to support our findings with genomes from additional tumor samples. We previously considered all 1637 WGS samples and selected only one hypemutated representative for each of the six representative mutational processes.
However, we do indeed agree that it is interesting to examine multiple tumor samples affected by the same process in order to check consistency of associations we reported between DNA structural features and mutagenesis by that process.
The tables below show analyses of the original representative tumor sample (n=6 total) and several additional samples (n=32), for a total of 38 tumor WGS analyzed by the Poisson regression.
Overall, these additional samples are remarkably consistent with the original sample in terms of overall model fit (as McFadden pseudo-R 2 ; Supplementary Table 1), and perhaps more interestingly, associations with structural features (regression coefficient profiles; Supplementary Fig. 1).
Related with this point: for the tissues considered (colon/rectum, bladder, lung adenocarcinoma, skin melanoma, glioblastoma) we have now included two samples per tissue (total n=10 out of 32) that we know do not exhibit the same dominant mutational process. This resulted in lower pseudo-R 2 measures for these samples (Supplementary Table S1) and changes in the regression coefficients ( Supplementary Fig. S1) in those samples. This is consistent with the DNA structural features associating strongly with each single mutational process (as in the hypermutators), but not a nondescript mix of mutational processes.
Additionally, we also indicate the known POLE hotspot identified for the POLE samples (as mentioned in Shinbrot et al, 2014, Genome Research;Fang et al, 2020, Plos Genetics). The one POLE sample we originally studied had a less common S297F mutation and mutations of the remaining POLE samples are listed in the Supplementary Table S1.
-We added text describing the analysis of multiple hypermutated samples and the consistency of results therein, as well as adding text -We added this text to the Results: -On Page 10 (now 11): "To check consistency of results across individuals, we considered additional samples affected by the same mutational processes (S1 Table, S1 Figure; 4 additional tumors per process, except TMZ where there were 2 additional tumor samples available). As a negative control, we further considered tumor samples originating from matched tissues, but not affected by the single-process hypermutation (S1 Table, S1 Figure)." -On Page 17: "For each of the six types of hypermutators studied, we considered additional tumor genomes affected by the same mutational process. The outcomes of the analyses were consistent across individuals, in terms of the overall fit of regression models, both the structure-based and the sequence-based ones (S1 Table), with some variability noted across MSI samples. Furthermore, the statistical associations of individual structural features with mutation rates appeared overall consistent across tumors affected by the same hypermutation mechanism (S1 Fig). As a control, we considered tumor samples matched by tissue but not affected by the same hypermutation mechanism; instead, the mutations in these control samples likely originated from a mixture of similarly prevalent mutagenic mechanisms. As expected, the fit of the regression modelseither sequence or structure-based onewas poorer on the control tumors (S1 Table). Additionally, these control samples did not exhibit the same patterns associations of structural features with mutation rates as the original hypermutator samples (S1 Fig "Poisson regression is commonly used for modeling count data. The model is given by: where λ i is a countable response variable, modelled as the linear combination of the covariates corresponding to the i th observation of the predictor variable x. Similarly to linear and logistic regression, the covariates are fixed, and the regression coefficients β = (β 0 ,…, β p ) are the model parameters to be estimated. The countable response variable λ can be substituted with the ratio λ/D of the count of the events (here mutations) and total number of opportunities D that the event had to occur. This leads to an extra term an offset log Dadded on the right-hand site of the equation and an adjustment of the regression coefficients." 4, In page 34, I think the second paragraph is out of scope of this paper. Maybe it is better to discuss how the shape parameters incorporated signature would benefit to personalized therapy.
Response: We agree. We have now edited the paragraph to make it more concise, omitting peripheral information, and thus better focused on the potential for incorporation of shape parameters for patient stratification using genomics: "There is a growing awareness that analysis of somatic mutation data may be able to provide markers to guide personalized therapy for cancer patients. Current focus is on refining drug selection to match the underlying genetic profile of tumors, particularly in terms of driver changes. 58,59,60 However, because mutational signatures reflect ongoing genomic instability, which is common in tumors and recognized as a targetable vulnerability of cancer, mutational signatures could help better stratify patients for targeted therapies, e.g., immunotherapy for MMR-deficient tumors. 61,62 The mutational Signature SBS3 and also a pattern of deletions with microhomology signal failures in the homologous recombination repair (HRR) pathway, which is targetable by PARP inhibitors. 19,20 This principle may extend beyond MMR and HRR deficiencies: in an analysis of mutational signatures across cancer cell line panels screened for drug activity, many signatures were associated with drug response, suggesting that some of the signatures might constitute useful genomic markers in patients. 21 For example, given that the error-prone DNA polymerase eta (POLH protein) can cause resistance to treatment with cisplatin, 63 identification of its mutational signaturespreviously via DNA sequence features 47 but possibly also by the affinities of this error-prone DNA polymerase(s) towards certain DNA shape features (Fig 5) could inform treatment decisions. In summary, there is great promise for clinical use of genomic predictive markers in tumors, even though often the underlying mechanisms remain elusive. 60 Among such genomic markers, the utility of mutational processes in particular merits more attention."

Reviewer #2:
In this study Karolak et al. described how data on DNA structure can predict mutation probability. They firstly analyzed the role of DNA structure and nucleotide context in mutation probability independently and then tried to extend the idea of standard mutational signatures (based on trinucleotide mutational spectrum) by including information about structural features. The idea itself looks very exciting. My main concern about it is prediction of structural features from DNA sequence and very strong correspondence between contexts and structures at such a small scale. Mutagenic processes in investigated cancers are highly context-dependent, so by definition you will have structure-dependence as well if each context corresponds to very specific structure. It would be informative to understand how much different contexts vary in described structural parameters. It would also be interesting to see whether different contexts with similar structure can have similar changes in mutation probability. Probably it would be helpful to investigate more large-scale structures to see whether there are differences in mutability of particular context in different structures? Similar to example with APOBEC in stem loops. For investigated cancers, models with structural features performed similarly well to contexts. It would be interesting to investigate whether this is the case for cancers without such high context-specificity. The most interesting part of the paper is prediction of new signatures that could not be extracted based only on nucleotide context. Although the method seems not to be ideal and the biological interpretation is quite limited, this work is a good start point in this direction.

Response to the Reviewer:
We thank the Reviewer for the positive evaluation of our work in terms of the idea being exciting, and in terms of our work being a good starting point for development of structural signature methods and of their biological interpretation. For the latter point (interpretation), we certainly agree that our study provides results that are preliminary in nature however, we would also argue that extending this study further would be more opportune for future work, rather than a revision for PLOS One.
Additionally, we appreciate the comments on the interest in sequence-structure correspondence patterns, in examining larger-scale structures' role in mutagenesis, and stem-loop structures' role. Some of these points require non-trivial extensions of the methodology, e.g., the hypermutable ssDNA segments within stem-loops, which are not well described with the current dsDNA-based DNA structure descriptors. These are certainly important directions for future studies, which we now make note of in the Discussion (see the response to Reviewer 1 question 1 for details, and the text we added).
Overall, we think the reviewer's specific comments were constructive and helped improve our manuscript; our point-by-point responses are below. In addition, we would like to draw the reviewer's attention to the new Supplementary Table 1  Comments: 1. Figure 1 is not fully understandable. From the scheme, it seems that far right and far left nucleotides are not engaged in any parameter estimation that seems not to be true.

Response:
We thank the Reviewer for the comment on schematic in Figure 1; we see how this could be seen as confusing. We have now added the explanation why five nucleotides are needed to calculate minor groove width (mgw), base pair, and base pair step parameters to the caption of Figure 1. Although the flanking nucleotides appear unused in the calculations of the parameters, the whole pentamer DNA motif is necessary to calculate mgw and two of the base pair step parameters (Zhou et al, 2013) and the flanking nucleotides do enter the calculation this way. Figure 1 legend by appending: "Considering the DNA pentamer sequence is an approach to derive DNA structural features at a locus, while accounting for the nearest and next-nearest neighbors of the central nucleotide. The structural features: mgw, the one bp, and the two bp step parameters, at a given central base pair are the function of its pentamer environment."

2.
Is it correct that Poisson regressions do not take information about mutation type into account? I mean that coefficients on Figure 2 and Figure 3 show influence on mutation probability of central nucleotide independently of mutation type. In this case, they seem to be biased by the main mutation type in cancer.
Response: Indeed, this should be clarified. The reviewer is correct in that the regression coefficients show association with mutation probability in the central position, while jointly considering all mutation types (e.g. in case of APOBEC mutagenesis that has both C>T and C>G in the spectrum, both the neighborhoods of C>T and C>G mutations are considered equally). We think the underlying assumption is reasonable: in the same mutagenic process, different mutation types are likely to originate from the same or at least a shared mechanism (e.g., APOBEC mutagenesis always starts by deamination of C to U, which can later either be repaired correctly [C>C] or it can resolve to either C>T or C>G mutations). Based on this assumption, the mutational events originating from one process can be treated as a set; this has the practical advantage of increasing statistical power to find associations due to larger sample sizes. To evaluate if this assumption broadly holds, for the APOBEC example we checked the structural and sequence features associated with C>T mutations separately, and with C>G mutations separately, in the Poisson regression (see new S2 Fig). We observe broadly similar trends across the two APOBEC mutation types, C>T and C>G, suggesting this assumption is reasonable.
With respect to these regression coefficients being biased by the major mutation type in that cancer: we would not call it a bias. In determining the coefficients, regression will place a higher weight on the more common mutation type in that tumor, which we think is the sensible thing to do if the goal is to better represent all mutations in that tumor. We have now edited the the Methods section of the text to clarify how the mutation type is treated in the regression. This is the text we added: "The mutation type is not explicitly considered in the Poisson regression as implemented, but instead all mutations occurring at a C:G pair are considered jointly (C>A, C>T and C>G), and all mutations considered at an A:T pair are considered jointly (T>A, T>C and T>G). The assumption underlying this is that in a tumor predominantly affected by a single mutagenic process, different mutation types are likely to originate from the same or a shared mechanism. This provides a rationale for treating various mutation types as a single set, thus increasing statistical power to find associations. Of note, the more commonly observed mutation types in a particular tumor sample will have a higher weight towards determining coefficients in the regression run on that tumor sample." Finally, we would also like to note that the NMF analysis for mutation signature extraction does treat the structural features separately by mutation type. Indeed, if there were any differences in, for example taking the APOBEC mutagenesis, the structural features relevant for the C>G versus the C>T mutations, this would be evident in the signatures extracted by NMF.

3.
Could you please discuss in text why 3nt-seq model always performs better than 5nt-seq model? (Table 1).

Response:
We thank the Reviewer for this question. In fact, we had discussed this extensively while preparing our manuscript. We now have added a brief discussion of this to the Results section: "Consistently across samples, we noticed that performance of the trinucleoide DNA sequence models (3ntseq) outperforms performance of pentanucleotide sequence models (5nt-seq) ( Table 1, Supplementary  Table 1). This may be due to data that are sparser, i.e., many more of the mutation count values in the regression tables for 5nt-seq are zero, due to pentanucleotide motifs being longer and an individual pentanucleotide occurring less frequently genome-wide compared to an individual trinucleotide."

4.
Page 13 "the structural DNA features compared favorably (pR2=0.81 for 7nt-str, Table 1) to the sequence DNA features (pR2=0.81 for 5nt-seq)". Why do you call this comparison favorable if they have equal pR2?
Response: This comment has been addressed by amending the Results text, which now states: "As with the MSI tumor, also in the POLE tumor the structural DNA features yielded a similarly wellfitting model (pR 2 =0.81 for 7nt-str, Table 1) as the sequence DNA features did (pR 2 =0.81 for 5nt-seq). This further supported that the structural features we examined are appropriate to describe the propensity of DNA sites to mutate in DNA repair-deficient tumors."

5.
Page 13 "The overall predictive accuracy was higher for the POLE tumor than for MSI tumor ( Table 1), suggesting that mutational risk can be predicted from DNA shape to a variable degree across different mutational processes." From Table 1 it is seen that context-model also performs better for PolE compared to MSI. As structural features were predicted from nucleotide sequence this fact can probably be explained by lower contextspecificity of MMR compared to PolE.

Response:
We understand the point the reviewer is making, and we agree. The statement as originally written did not imply that the context-model did not perform better, although we do see how the wording could be understood as such. To clarify, the Reviewer's interpretation is now made explicit in this text segment: "The overall predictive accuracy was higher for the POLE tumor than for MSI tumor (Table 1, S1 Table). This suggests that mutational risk can be predicted from DNA shape to a variable degree across different mutational processes, similarly so as when predicting mutational risk from DNA sequence."

6.
Page 14 "Considered together with overand under-twisting at positions -1 and +1, respectively, this suggests that DNA motifs experiencing APOBEC mutagenesis may be more prone to have flipped out bases". It is interesting to investigate the co-occurrence and simultaneous influence of structures found significant in Poisson regression.

Response:
We agree, it is indeed interesting to analyze the co-occurrence and simultaneous influence of structural features found significant through Poisson regression (PR For some of the shifts in structural feature values (e.g., over-and under-twisting of the adjacent bases) our current knowledge allows to understand the conformational changes in DNA taking place (e.g. flipped out bases, in this particular example). However, to interpret other co-occurrences of structural features, it would be necessary to have an analysis using molecular simulations (on longer time scales) to better understand structural changes taking place. This would be, in our opinion, out of scope for the current study, but is certainly an interesting focus for future work.
7. Figure 5. X-axis is impossible to read. Structural signatures seem to be very similar between different signatures in contrast to difference between context signatures. For example, structural elements for SBS1, SBS3L, SBS6, SBS7b seems very similar for C>T mutation but contexts differ extremely. Could you explain this? I lack the interpretation as I was not able to understand the correspondence of bars to structural features.

Response:
We thank the Reviewer for this comment. We have changed the two-column arrangement of Figure 5 to one-column to display individual signatures more clearly and to make each structural parameter easily recognizable. We also improved the legend within the figure.
We agree with the reviewer's observations that the contributions of structural elements for some of the signatures appear similar (e.g., when looking at C>T changes), even though the contribution of the sequence trinucleotide contexts appears to be much different.
The reason for this is quite simple: the baseline frequencies for trinucleotides (i.e., the number of times they are seen in the human genome) are quite uniform, except for the NCG dinucleotide which is rarer than others. However, when considering various combinations of structural features across the genome, their abundance is rather diverse: some combinations of structural features are quite rare. (Speculatively, reasons for that could be that they are disfavored by physics and/or by evolution, however this does also depend a lot on the implementation of the methodology, in particular the discretization thresholds used to make low/medium/high bins). Thus, for a certain combination of physical features, the baseline "nucleotides-at-risk" may be much lower and for another higher, and this will be reflected in all mutational signatures. This is because NMF, in its usual setup for mutational signature analysis, operates on the raw (unnormalized) counts, and the raw counts are trivially correlated with the number of "nucleotides-at-risk" for each feature.
To address this question with data, we now include a chart with the "nucleotides-at-risk" counts for the trinucleotide spectra, and also separately for the loci stratified by the DNA structural features; this provides a baseline for interpreting the signatures. Indeed it can be observed some of the mutation contexts, as currently stratified by the structural features, are much less common than others, which can give an impression of high correlation between the structural signatures (we note that the signatures could also easily be normalized by this baseline, which we provide in a supplementary table).
We now include the following text in the MS to explain this matter: "A further methdological consideration is that these NMF signatures were derived from raw mutation counts, thus their shapes in part reflect also the number of genomic nucleotides-at-risk encompassed by each feature in the spectrum (S5 Fig). This holds true both for the trinucleotide part of the spectrum (e.g. the NCG trinucleotides are rare in the human genome), and also for the structural features (e.g. mgwL bin, using the current definition of L, M and H thresholds, is rarer than mgwM and mgwH bins, S5 Fig). This may result in an apparent similarity of the profiles across different signatures." 8. Figure 6 looks confusing. For example, SBS4 that you describe as tobacco smoking induced was not found in any lung cancer. For LUSC you determined high exposure to SBS8L/4L that is also similar to SBS4 but for luad you didn't observe any of them. SBS7a,b,c are not pronounced in skcm.

Response:
We agree with the reviewer that the former Figure 6 (referring to the 'exposures' of our mutational signatures) may be able to be improved in future work.
It was recognized in the field that inferring mutational signatures, as well as estimating their 'exposures' in genomes, can be difficult. For instance it is clear that signature SBS8 (that the reviewer mentions) is in analysis usually difficult to unmix from SBS5 and SBS3 (recently reviewed e.g. PMID 34316057).
We think that in our analysis the tobacco-smoking C>A rich signature SBS4 is likely mixed with a similar tobacco-chewing signature SBS29L which is also C>A rich. This latter signature we observe rather enriched in both LUSC and also LUAD, just as the reviewer notes should be the case for a tobacco-associated signature. We edited the text to acknowledge that we might be confounding the two tobacco signatures into SBS29L in our analysis; given their similarity we don't necessarily see it as critical issue at this point. We do acknowledge that ideally --perhaps with more tumor samples analyzedsuch related signatures may be successfully disentangled.
We added this text: "While many extracted mutational signatures closely resembled their Cosmic counterparts in the trinucleotide spectrum, we recognize that inference of some of known signatures may be able to be further improved. For example, the signature SBS29 was previously associated with tobacco chewing. In our analysis, SBS29L may be a mix of the two similar, C>A-rich signatures: the tobacco chewing SBS29, and the tobacco smoking signature SBS4 (judging by its exposure in two lung cancer types, S Figure)." About the example of the SBS7 (UV-associated) signatures the referee mentions: we do not think there is a problem there -SBS7a and SBS7b are indeed very strongly enriched in skin overall, a large cohort of which was labelled 'mela' in our previous plot (previously we had a 'mela' and 'skcm' datasets separation which was not necessary, now these samples were merged under 'skcm' and the plot updatednow it is in S4 Fig).