Abstract
The rates at which mutations accumulate across human cell types vary. To identify causes of this variation, mutations are often decomposed into a combination of the single-base substitution (SBS) “signatures” observed in germline, soma, and tumors, with the idea that each signature corresponds to one or a small number of underlying mutagenic processes. Two such signatures turn out to be ubiquitous across cell types: SBS signature 1, which consists primarily of transitions at methylated CpG sites thought to be caused by spontaneous deamination, and the more diffuse SBS signature 5, which is of unknown etiology. In cancers, the number of mutations attributed to these 2 signatures accumulates linearly with age of diagnosis, and thus the signatures have been termed “clock-like.” To better understand this clock-like behavior, we develop a mathematical model that includes DNA replication errors, unrepaired damage, and damage repaired incorrectly. We show that mutational signatures can exhibit clock-like behavior because cell divisions occur at a constant rate and/or because damage rates remain constant over time, and that these distinct sources can be teased apart by comparing cell lineages that divide at different rates. With this goal in mind, we analyze the rate of accumulation of mutations in multiple cell types, including soma as well as male and female germline. We find no detectable increase in SBS signature 1 mutations in neurons and only a very weak increase in mutations assigned to the female germline, but a significant increase with time in rapidly dividing cells, suggesting that SBS signature 1 is driven by rounds of DNA replication occurring at a relatively fixed rate. In contrast, SBS signature 5 increases with time in all cell types, including postmitotic ones, indicating that it accumulates independently of cell divisions; this observation points to errors in DNA repair as the key underlying mechanism. 
Thus, the two “clock-like” signatures observed across cell types likely have distinct origins, one set by rates of cell division, the other by damage rates.
Citation: Spisak N, de Manuel M, Milligan W, Sella G, Przeworski M (2024) The clock-like accumulation of germline and somatic mutations can arise from the interplay of DNA damage and repair. PLoS Biol 22(6): e3002678. https://doi.org/10.1371/journal.pbio.3002678
Academic Editor: Laurence D. Hurst, University of Bath, UNITED KINGDOM
Received: November 17, 2023; Accepted: May 14, 2024; Published: June 17, 2024
Copyright: © 2024 Spisak et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files. The code is available at github.com/n-t-n-el/clock-like and archived at zenodo.org/doi/10.5281/zenodo.11188647.
Funding: This work was supported by NIH R01 GM83098 to MP (https://www.nih.gov), HFSP postdoctoral fellowship LT000257 to MdM (https://www.hfsp.org), and NIH R01 GM115889 to GS (https://www.nih.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: BER, base excision repair; EM, expectation–maximization; MMR, mismatch repair; SBS, single base substitution
Introduction
Mutations are the net result of multiple processes, including endogenous and exogenous DNA damage, errors in DNA replication, and DNA repair. The rate at which they accumulate varies substantially across the human body: mutation rates in germline lineages are typically very low, averaging fewer than one mutation per haploid genome per year, whereas hundreds of mutations accrue yearly in tissues exposed to extensive exogenous mutagens [1–3]. The origins of these pronounced differences remain poorly understood: while the biochemical mechanisms that underlie mutations have been well characterized (e.g., [4,5]), their relative importance in a given cell type or tissue remains largely unknown.
Over the past decade, a new approach to this question has become feasible, with the analysis of whole genome mutational spectra in cancer genomes and the identification of repeatable patterns of mutations termed mutational signatures [6–8]. In these analyses, single base substitutions (SBSs) are classified into 96 types based on the identity of the substitution and the flanking nucleotides, and each tumor sample is characterized by the distribution over the 96 types. Variation across samples is then modeled as a linear combination of relatively few signatures, each characterized by a different profile of the 96 SBS types. The proportion of mutations attributed to a given signature is often described as its “activity” or “exposure” (e.g., [8,9]), though in what follows, we use the term loading. The idea behind this decomposition is that the signatures reflect one or more mutational processes occurring in different types of samples. In practice, signature profiles and signature loadings across samples are inferred jointly using nonnegative matrix factorization [6]. Nearly 100 signatures have been identified to date [8], which are cataloged and updated in the COSMIC database [10]. Only a subset of COSMIC signatures have a known or partially known etiology, while the sources of most remain elusive. Part of the difficulty is that all signatures reflect the interplay between the initial DNA damage or replication error and DNA repair, and disentangling these from their net effects is not straightforward.
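As a minimal sketch of the 96-type classification described above, the substitution types can be enumerated as 6 substitution classes (conventionally written with the pyrimidine as the reference base) times 4 possible 5' neighbors times 4 possible 3' neighbors. The function name `sbs96_types` is hypothetical, and the `A[C>T]G` label style is the one commonly used for COSMIC signature profiles.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written with the pyrimidine (C or T) as reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def sbs96_types():
    """Return the 96 trinucleotide-context substitution labels, e.g. 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]

types = sbs96_types()
# The four CpG transition contexts, which dominate SBS1.
cpg_transitions = [t for t in types if t.endswith("[C>T]G")]
print(len(types), cpg_transitions)
```

A sample's mutational spectrum is then a vector of counts over these 96 labels, and a signature is a probability distribution over the same labels.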
In studies conducted to date, two such COSMIC SBS signatures, signature 1 (SBS1) and signature 5 (SBS5), are ubiquitous. They are detected in tumor samples as well as in healthy somatic cells in humans and other mammals [11], in many cell types contributing the majority of mutations [2]. They are also the predominant contributors to the spectrum of germline mutations [12]. The number of mutations attributed to SBS1 correlates with the age of cancer diagnosis in many cancer types, as do mutations assigned to SBS5, leading the 2 signatures to be described as “clock-like” [13]. Because they share this behavior, mutations attributed to these 2 signatures are sometimes combined and analyzed together [14–16].
SBS1 is dominated by cytosine transitions in the CpG context. The rate of SBS1 accumulation varies among cell types and increases with the rate of cellular turnover [3,17]. The loading of SBS1 mutations has also been found to increase in metastatic tumors relative to their primary tumor counterparts, plausibly due to accelerated cell division rates in metastasis [18].
In contrast, SBS5 has a more diffuse distribution among the 96 SBS types. Its ubiquity across cell types, cancerous and healthy, has led to the suggestion that the signature reflects “background” endogenous cellular damage to DNA [19,20]. However, the number of SBS5 mutations is also affected by the presence of at least one exogenous mutagen, tobacco smoke [21,22].
Here, we consider conditions under which mutations can accumulate in a clock-like fashion and how the rate of accumulation depends on underlying processes of damage and repair as well as DNA replication. Analyzing whole-genome sequencing data sets of somatic and de novo germline mutations in cell types with varying rates of cell division, ranging from postmitotic neurons to rapidly dividing intestinal epithelium, we identify mechanisms that likely underlie the clock-like mutation accumulation of SBS1 and SBS5 and highlight key differences between them.
Results
General model of mutation accumulation
We study the interplay between DNA replication, damage and repair, by extending the model introduced in [23]. We first consider the mutational processes that occur between cell divisions (Fig 1A). A given source of damage causes lesions at a rate u per base pair, which are detected and repaired at a rate r. Repair leads to mismatches with probability ϵ, due to the misincorporation of a nucleotide. We assume that once repair is complete, there are no mechanisms that differentiate the newly synthesized strand from the strand used as a template by the DNA repair polymerase. We denote the rate of mismatch resolution by q and the probability of correct resolution by p. The outcome of mismatch resolution can vary depending on the type of the mismatch and the local sequence context. We discuss the special case of T:G mismatches in CpG contexts in more detail below.
(A) Interplay of DNA damage and repair during the cell cycle. DNA damage leads to lesions at rate u per base pair. Lesions are repaired at rate r and lead to mismatches with probability ϵ, due to the misincorporation of nucleotides by the DNA polymerase used in repair. Mismatches are resolved at rate q, resulting in the incorrect base pair and a mutation with probability 1−p. (B) Consequences of DNA replication. Replicating DNA over a lesion requires translesion synthesis. This process is not always accurate: it causes an error and a mutation in one of the 2 daughter cells with probability R (assuming that the lesion is repaired in the next cell cycle, i.e., that r≫ϕ). Unresolved mismatches cause a mutation in one of the 2 daughter cells. (C) The predicted number of mutations, m, at age x contributed by the different mechanisms. The genome length is denoted by l and the rate of cell division by ϕ.
Based on these assumptions, we derive the expected numbers of lesions, mismatches, and mutations, and their variances, as a function of the time t since the last round of DNA replication in a genome of length l. The full analysis of the model is presented in Methods; here, we describe the main findings. Over short time periods, t < r⁻¹, during which repair has had little chance to occur, the number of lesions grows at rate ul and the number of mismatches at rate ulϵ. Over longer times, t ≫ r⁻¹ and t ≫ q⁻¹, the expected numbers of lesions and mismatches reach a steady state, at which they are approximately equal to ul/r and ulϵ/q, respectively. In turn, the number of mutations due to incorrect repair increases at rate ulϵ(1−p).
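The steady-state values quoted above can be checked numerically with a deterministic sketch of the expected numbers of lesions and mismatches. The code below is our own simplified reading of the verbal model, not the full stochastic analysis presented in Methods: it assumes lesions arise at rate ul and are repaired at rate r each, a fraction ϵ of repairs leaves a mismatch, and mismatches are resolved at rate q with a fraction 1−p resolved incorrectly. All parameter values are made up for illustration.

```python
def simulate_between_divisions(u, l, r, q, eps, p, t_end, dt):
    """Forward-Euler integration of the expected numbers of lesions (L),
    mismatches (M), and repair-induced mutations (m) since the last replication."""
    L = M = m = 0.0
    for _ in range(int(round(t_end / dt))):
        dL = u * l - r * L          # damage creates lesions, repair removes them
        dM = r * eps * L - q * M    # a fraction eps of repairs leaves a mismatch
        dm = q * (1.0 - p) * M      # a fraction 1-p of resolutions fixes the wrong base
        L, M, m = L + dL * dt, M + dM * dt, m + dm * dt
    return L, M, m

# Illustrative, made-up parameter values (not estimates from the paper).
u, l, r, q, eps, p = 1e-6, 1e6, 50.0, 50.0, 1e-2, 0.5
L2, M2, m2 = simulate_between_divisions(u, l, r, q, eps, p, t_end=2.0, dt=1e-4)
L3, M3, m3 = simulate_between_divisions(u, l, r, q, eps, p, t_end=3.0, dt=1e-4)

print("lesions:", L2, "expected steady state:", u * l / r)
print("mismatches:", M2, "expected steady state:", u * l * eps / q)
print("mutation rate:", m3 - m2, "expected:", u * l * eps * (1 - p))
```

Once t is much larger than both 1/r and 1/q, the simulated lesion and mismatch numbers settle at ul/r and ulϵ/q, and mutations accrue at the constant rate ulϵ(1−p) per unit time.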
From independent lines of evidence, we know that the vast majority of DNA damage in healthy cells does not lead to mutations: the numbers of mutations in healthy cells are substantially lower than estimated damage rates would suggest [24], and mutation rates in individuals with DNA repair deficiencies are orders of magnitude higher [25]. These observations imply that in healthy cells, the repair rate is on the order of the total damage rate, i.e., that r∼ul. Therefore, the steady state between damage and repair likely is established long before the next cell division. For simplicity, we further assume the rate of mismatch resolution is of the order of the repair rate, q∼r. We note that the dynamics of cancer cells may be different from those of the healthy cells on which we focus here; in particular, the repair machinery may be debilitated or overwhelmed and lesions may persist over multiple cell divisions, a phenomenon termed “lesion segregation” [26,27].
The next part of the model describes the mutational processes that occur during replication (Fig 1B). For simplicity, we assume that the cell divides immediately after DNA replication and treat the two processes as simultaneous, occurring at a fixed rate ϕ. An unrepaired lesion stalls DNA replication and triggers the recruitment of polymerases that can replicate over the lesion, through translesion synthesis [28]. This synthesis leads to the incorporation of the incorrect nucleotide opposite the lesion on the template strand with probability R, which depends on the type of lesion and possibly on replication timing [29]. If, with probability 1-R, the correct nucleotide is incorporated then, given our assumption that repair is rapid relative to the rate of cell division (i.e., that r≫ϕ), the lesion is likely to be repaired during the next cell cycle. If the translesion polymerase incorporates an incorrect nucleotide, however, the repair process will propagate the error to the complementary strand, generating a mutation. Overall, erroneous translesion synthesis introduces mutations at a rate of ulRϕ/2r.
Mismatches unresolved during the cell cycle cause a mutation in one of the 2 daughter cells. Given our assumption that the rate of mismatch resolution far exceeds the rate of cell division (i.e., q≫ϕ), these mutations track cell division and accumulate at a rate of ulϵϕ/2q. In contrast, mutations caused by repair errors are independent of cell divisions and accumulate with absolute time, at a rate of ulϵ(1−p).
Lastly, mismatches also arise during DNA replication due to the misincorporation of nucleotides by replicative polymerases. We denote the probability of a misincorporation per base pair by w and the probability that such a misincorporation leads to a mutation by P. Importantly, replicating DNA carries transient features that distinguish the newly synthesized strand from its template [30] and help mismatch repair substantially decrease the number of replication errors that become mutations, i.e., P≪1. In sum, the number of mutations due to replication errors increases with age at a rate of wlPϕ.
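Considering these different processes together, our model predicts that the expected number of mutations m at age x is given by

$$m(x) = \left(\frac{uR}{2r} + \frac{u\epsilon}{2q} + wP\right) l\,\phi\,x + u\,l\,\epsilon\,(1-p)\,x. \tag{1}$$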
The two terms in this equation correspond to two kinds of clock-like behaviors. The first depends on the cell division rates and includes damage-induced mutations as well as DNA replication errors; assuming, as we do, that cell division rates are fixed, these mutations accrue with age at a constant rate. The second type of mutation is driven by errors of DNA repair in response to damage; assuming damage rates are constant, it too depends on absolute time. However, a distinguishing feature of the two types of clock-like mutations is how they behave as a function of cellular turnover rates; in particular, postmitotic cells should show no increase of the first type of mutations.
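To make the distinction between the two kinds of clock concrete, the per-mechanism rates stated in the preceding paragraphs (ulRϕ/2r for translesion synthesis, ulϵϕ/2q for unresolved mismatches, wlPϕ for replication errors, and ulϵ(1−p) for repair errors) can be combined in a short function. This is a sketch under the model's assumptions (r≫ϕ, q≫ϕ, constant rates); the function name and all numerical values are hypothetical.

```python
def expected_mutations(x, *, u, l, r, q, eps, p, R, w, P, phi):
    """Expected mutations at age x, split into the division-driven and
    time-driven contributions."""
    translesion = u * l * R * phi / (2 * r)    # erroneous translesion synthesis
    unresolved = u * l * eps * phi / (2 * q)   # mismatches unresolved at division
    replication = w * l * P * phi              # replication errors that escape repair
    repair = u * l * eps * (1 - p)             # incorrect repair, independent of phi
    division_clock = (translesion + unresolved + replication) * x
    time_clock = repair * x
    return division_clock, time_clock

# Made-up parameter values, for illustration only.
params = dict(u=1e-6, l=1e6, r=50.0, q=50.0, eps=1e-2, p=0.5, R=0.1, w=1e-8, P=1e-2)
dividing = expected_mutations(40, phi=100.0, **params)
postmitotic = expected_mutations(40, phi=0.0, **params)
print("dividing:", dividing)
print("postmitotic:", postmitotic)
```

With ϕ = 0, the division-driven contribution vanishes while the time-driven contribution is unchanged, which is the contrast exploited when comparing postmitotic and rapidly dividing cell types.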
Accumulation of CpG transitions.
The process of spontaneous cytosine deamination contributes substantially to the mutation rate [31]. The standard explanation is that, because deamination of a methylated cytosine yields a thymine, one of the canonical bases, the efficiency of repair is low [32]. To investigate the dynamics underlying the accumulation of CpG transitions, we first consider the consequences of methylated cytosine deamination during the cell cycle. Given that deamination leads directly to a mismatch, we can employ a simpler model, originally introduced in [23]. The dynamics is analogous to the general model and can be recovered in the q→0 limit (see Methods, Eq 12).
Methylated cytosines in double-stranded DNA deaminate spontaneously at a high rate (estimated as ud = 2.3×10⁻⁵ per year [33]). The resulting T:G mismatches can be detected and repaired by base excision repair (BER, [34]) or mismatch excision repair (MMR, [35]) at effective rate r. With probability ϵ, the repair machinery erroneously substitutes the guanine for an adenine, leading to a mutation (Fig 2A).
(A) Model of the consequences of spontaneous deamination of methylated cytosines during the entire cell cycle [23]. Double-stranded cytosines deaminate at a constant rate ud per base pair and the resulting mismatches are repaired at a rate r. With probability ϵ, the mismatch resolution is incorrect and leads to a mutation. We assume that the cell divides immediately after DNA replication and treat the two processes as simultaneous (occurring at a rate ϕ). Unresolved mismatches at cell division lead to a mutation in one of the two daughter cells. We provide the prediction of the model for the number of mutations m at a given age x. Inefficiently repaired mismatches accumulate with the cell divisions, which occur at rate ϕ, and repair errors accumulate with absolute time, independent of cell divisions. The number of methylated cytosines is denoted by l. (B) An alternative source of CpG transitions is the deamination of methylated cytosines in single-stranded DNA (at a rate of us) during DNA replication. The model assumes that a deamination immediately preceding the polymerization of the second strand is not repaired and leads to a mutation in one of the daughter cells. The expected number of mutations is proportional to the rate of cell divisions and the time of transient single-strandedness Δt.
As in the more general model, at early times, t < r⁻¹, the number of mismatches grows linearly with the rate ud, eventually reaching a steady state with the expected number of mismatches equal to udl/r, where l stands for the total number of methylated cytosines. Given the high estimated rate of spontaneous deamination and the relatively low observed rates of SBS1 mutations [13], repair must be efficient and have time to act before the cell division. The implication is that the repair rate is higher than the cell division rate, r≫ϕ, and that the cell divides after the number of mismatches has reached a steady state. At DNA replication, unrepaired mismatches lead to mutations in one of the two daughter cells. Therefore, the number of mutations m at age x depends on the cell division rate ϕ. On the other hand, the number of mutations due to defective repair follows absolute time, and such mutations accumulate regardless of cell divisions.
Under an alternative hypothesis that is not mutually exclusive (Fig 2B), CpG transitions result from the deamination of methylated cytosines in single-stranded DNA immediately prior to replication. Mismatches arising at this point cannot be repaired. The number of mutations that arise from such deamination events is proportional to the expected number of cell divisions ϕx and the time of transient single-strandedness Δt. The rate of deamination of methylated cytosines in single-stranded DNA is estimated to be orders of magnitude higher than in double-stranded DNA (us = 3.5×10⁻³ per year [33]), making this scenario a plausible explanation for the observed SBS1 mutation rates: to account for 1 mutation in 10 divisions, the time of transient single-strandedness would need to be of order Δt∼1 min. This order of magnitude can be compared with the typical velocity of the replication fork in human cells, of order 1 kb per minute [36].
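The order-of-magnitude estimate for Δt can be reproduced with a few lines of arithmetic. The sketch below assumes that the expected number of mutations per division is simply the product of the single-stranded deamination rate, the number of methylated cytosines l, and Δt, ignoring any factor for which daughter cell inherits the change. The value of l is not given in this section; the figure used here (2×10⁷) is purely an illustrative assumption.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

u_s_per_year = 3.5e-3        # single-stranded deamination rate quoted in the text
l_methylated = 2e7           # ASSUMPTION: illustrative number of methylated cytosines
target_per_division = 0.1    # 1 mutation in 10 divisions, as in the text

# Convert the rate to per-minute units, then solve u_s * l * dt = target for dt.
u_s_per_minute = u_s_per_year / MINUTES_PER_YEAR
delta_t_minutes = target_per_division / (u_s_per_minute * l_methylated)
print(f"required single-stranded time: {delta_t_minutes:.2f} min")
```

Under these assumptions, the required time is a fraction of a minute to a few minutes, consistent with the order of magnitude stated in the text; a different choice of l rescales the answer proportionally.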
Some of the CpG transitions could arise from DNA replication errors, if methylated cytosines are a difficult template for DNA polymerases [37,38]. On average, replicatory errors contribute wlP transitions per cell division, where w denotes the probability of a misincorporation per base pair, and P the probability that such a misincorporation is left unrepaired by mismatch repair and leads to a mutation.
Taken together, the expected number of CpG transitions m at age x is given by (2) where l denotes the number of methylated cytosines in a CpG context.
Relating model predictions to data.
In any given cell, a combination of clock-like and non-clock-like mutagenic processes contribute to the mutation rate. Thus, although in our model, we focus on the conditions under which a clock-like accumulation of mutations will arise, in practice, the total number of mutations of any signature observed in a cell will likely not be strictly proportional to time (Eqs 1 and 2). Instead, there may be a non-negligible contribution of mutations that do not increase with age (i.e., are “age-independent”). These mutations can have several different sources: for example, they may have accrued during the first few cell divisions of development [39], the last stages of cell differentiation [21], or in response to an acute exposure to exogenous mutagens [40]. If these mutations occur in a burst, i.e., over a short period of time, then they will contribute a set number of mutations regardless of age, and a regression of the number of mutations against age can lead to a positive intercept.
A nonzero intercept can also be generated by a clock-like process occurring at a varying pace over development. We discuss model expectations in this case with a toy piece-wise linear model, in which mutations accumulate at 2 different rates over 2 time periods (S1 Fig): initially, the number of mutations grows at rate μ0, and after some time x0, the underlying parameters of mutagenesis (Eq 1) change and mutations accumulate at rate μ≠μ0. For data points collected after time x0, the expected number of mutations is then

$$m(x) = \mu_0 x_0 + \mu\,(x - x_0) = a x + b, \tag{3}$$

where a = μ and b = (μ0−μ)x0. Hence, the intercept is positive if μ0>μ, as would occur, for example, if cells enter a postmitotic state at x0.
Conversely, an increase in the rate of cell division or the rate of damage at x>x0 (such as occurs in the lung epithelia of regular smokers [21]) will lead to an increased slope in the number of mutations with age. If the mutation rate was significantly lower at earlier stages (say, before the person smoked), i.e., if μ0<μ, then a regression of the total number of mutations on age will yield a negative intercept.
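The sign of the intercept in the two scenarios can be illustrated with a small sketch of the toy piece-wise linear model, using the relations a = μ and b = (μ0−μ)x0 given above. Function names and rate values are hypothetical and chosen only to show the two cases.

```python
def piecewise_mutations(x, mu0, mu, x0):
    """Expected mutations at age x when the rate switches from mu0 to mu at x0."""
    if x <= x0:
        return mu0 * x
    return mu0 * x0 + mu * (x - x0)

def late_regression_line(mu0, mu, x0):
    """Slope a and intercept b of the line followed by samples older than x0."""
    return mu, (mu0 - mu) * x0

# Case 1: rate drops after x0 (e.g., entering a postmitotic state): b > 0.
a1, b1 = late_regression_line(mu0=30.0, mu=15.0, x0=1.0)
# Case 2: rate rises after x0 (e.g., onset of an exposure): b < 0.
a2, b2 = late_regression_line(mu0=10.0, mu=30.0, x0=20.0)
print((a1, b1), (a2, b2))
```

For any age beyond x0, `piecewise_mutations` agrees with the line a·x + b, so a regression restricted to such samples recovers the later rate as its slope and encodes the earlier history only in its intercept.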
In practice, sequencing errors are another possible source of a positive intercept: they arise from technical artifacts and do not depend on the age of the donor.
In reality, more than one of these phenomena is likely operating at once, and it is therefore important, in disentangling the origins of clock-like mutations, to distinguish those that accrue with age from those that contribute to the intercept, and may or may not be clock-like. Concretely, in data sets for which there is a significant positive intercept, the goal is to tease apart signatures that contribute at a constant level in all samples from signatures for which mutations increase in number with age. To this end, we extend the standard signature decomposition method to allow for a mixture of 2 components, a constant and an age-dependent one, and attribute mutational signatures to the two jointly.
Clock-like signatures across cell types
In order to gain insight into the origins of mutations that accumulate with age, we analyze patterns of mutation accumulation across cell types with different characteristics. To this end, we consider data sets that provide single-cell resolution mutation data, collected using a variety of experimental approaches (see Methods), including mutations in neurons and muscle cells [3], liver hepatocytes [41], lung epithelium [21], small bowel epithelium [42], colonic epithelium, and testis seminiferous tubules [2], as well as germline mutations identified from blood samples of pedigrees [43,44]. We rely on mutation data from donors without a disease diagnosis and on lung samples from non-smokers (see Methods).
For each cell or tissue type, we attribute mutational signatures by relying on the COSMIC database of signatures inferred from a large collection of cancer samples [8], as also done previously to describe mutational landscapes in noncancerous soma (e.g., in ref. [41]) and germline mutations (e.g., in [12]). Most of these signatures have been linked to specific mechanisms or are associated with exposures to mutagens, and they therefore provide a useful basis for analyzing and comparing mutation accumulation across tissues and cell types. Nonetheless, because the signatures were originally inferred from tumor samples, they may not fully capture the mutational processes acting in normal cell types, particularly in the male and female germ cells. This limitation could lead to a poorer fit, as well as to incorrect assignments of mutations to signatures to which they do not, in fact, belong. Here, we focus on clock-like signatures and choose a method of signature attribution that limits erroneous assignments of mutations to SBS1 and SBS5 and thus avoids overestimating their contributions (see Methods for details).
To compare observations with our model predictions, we develop an approach to focus on the subset of mutations in a given cell type that accumulate with age. This is done in 2 steps. First, we fit a linear model for all mutations jointly, i.e., y = ax+b, where y denotes the number of mutations per genome and x denotes age (see Fig 3A for mutations in neurons). Second, we model the distribution over the 96 substitution types as a mixture distribution of 2 components: the constant component (Fig 3B, yellow) contributes on average the same number of mutations in samples of all ages, whereas the age-dependent component (Fig 3B, blue) contributes an increasing number of mutations with age. We decompose the slope and the intercept into COSMIC signatures jointly, such that the dependence of the number of mutations y_s attributed to a given signature s on age takes the form

$$y_s(x) = a\,P_a(s)\,x + b\,P_b(s), \tag{4}$$

where Pa(s) and Pb(s) denote the loadings of signature s in age-dependent and constant components, respectively. We estimate the loadings by extending the standard methods of signature attribution (see Methods); see Fig 3C and 3D for the example of loadings estimated for mutations in neurons. We note that applying this method is only possible if the intercept b is large enough (i.e., if the data contain enough age-independent mutations to attribute mutations to signatures in the constant component).
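Once the slope, intercept, and loadings are in hand, the per-signature prediction is simple to evaluate. The sketch below assumes the relation described above, in which the slope a is apportioned by the age-dependent loadings Pa(s) and the intercept b by the constant loadings Pb(s); it does not implement the estimation of the loadings themselves (see Methods). The function names and all numbers are hypothetical and are not the values estimated in the paper.

```python
def signature_counts(x, a, b, P_a, P_b):
    """Expected mutations per signature at age x: a * P_a(s) * x + b * P_b(s)."""
    signatures = sorted(set(P_a) | set(P_b))
    return {s: a * P_a.get(s, 0.0) * x + b * P_b.get(s, 0.0) for s in signatures}

def age_component_weight(x, a, b):
    """Fraction of mutations at age x contributed by the age-dependent component."""
    return a * x / (a * x + b)

# Made-up slope, intercept, and loadings, for illustration only.
a, b = 15.0, 100.0
P_a = {"SBS5": 0.7, "SBS16": 0.2, "SBS12": 0.1}   # age-dependent component
P_b = {"SBS1": 0.4, "SBS5": 0.6}                  # constant component

counts = signature_counts(40.0, a, b, P_a, P_b)
print(counts)
print(age_component_weight(40.0, a, b))
```

In this made-up example, SBS1 contributes a fixed number of mutations at every age, whereas SBS5 contributes to both the intercept and the slope, mirroring the qualitative pattern that the decomposition is designed to detect.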
(A) Age-dependent signature attribution of mutations in neurons. Shown is the increase in the number of mutations with age (reported per haploid genome); each point corresponds to a single donor. (B) The relative contributions of different signatures vary with age. The decomposition of the mutation spectrum into age-dependent (blue, C) and constant signature distributions (yellow, D). Mutational signatures are indicated by their COSMIC label; asterisks indicate unattributed signatures (see Methods). (E–L) The number of mutations assigned to clock-like signatures, SBS1 (red) and SBS5 (turquoise) in: (E) neurons, (F) maternal germline mutations, (G) paternal germline mutations, (H) smooth muscle from bladder, (I) liver hepatocytes, (J) lung epithelium, (K) small bowel epithelium, (L) colon epithelium (*for this data set, in which the intercept is negative, we assume both signatures increase with age). Throughout (E–L), shaded areas represent 95% confidence intervals, estimated by bootstrapping (see Methods). Underlying data for this figure can be found in S2 Data.
In all cell types except for colonic epithelium, we find significant positive intercepts in the regression of the number of mutations on age, consistent with a burst of mutations in early development, for example. In these cases, we decompose mutations into the age-dependent distribution Pa and the constant component distribution Pb. In the case of colonic epithelium (Fig 3L), the intercept is significantly negative, possibly because the mutation rate during ontogenesis is lower than in adult life (S1B Fig), when the exposure to damage and the cell division rate are higher [45]. To proceed with our analysis in this case, we assume (rather than infer, as for other cell types) that all signatures increase at constant rates with age.
Our analysis reveals multiple signatures that increase with age (see S2 Fig), including the 2 that were previously reported [13], SBS1 and SBS5, and 2 additional signatures that are common across the cell types considered, SBS12 and SBS16. SBS16 may not be independent from SBS5 [8,10]. In turn, SBS12 contributes approximately 7% to 10% of mutations in neurons and the female germline, approximately 7% of mutations in lung and liver, approximately 3% in small bowel and colon, and approximately 1% in the paternal germline, but is not found at detectable levels in muscle. Both SBS12 and SBS16 are of unknown etiology and dominated by T to C/A to G transitions.
In examining the mutation accumulation across cell types that vary in their division rates, we focus on SBS1 and SBS5, the two ubiquitous clock-like signatures, and use our decomposition method to examine possible sources for their age dependencies. If driven by cell division, we predict that the rate at which they will accumulate should vary substantially with cellular turnover rates. In contrast, if driven by damage rates, the rate should be much less sensitive to turnover rates, but may vary among tissues owing to differences in endogenous and exogenous damage rates.
In this regard, the accumulation of mutations with age observed in neurons is particularly informative, given that neurons are fully postmitotic cells. Despite the lack of cell divisions, mutations accumulate at rates similar to actively dividing lineages [3]. Using our decomposition, signatures whose mutation numbers increase with age are distinct from those that do not (Fig 3C and 3D). Notably, the increase with age is predominantly driven by mutations assigned to signature SBS5, with secondary contributions from SBS16 and SBS12. Strikingly, there is no discernible contribution of SBS1, as we discuss in more detail below. In turn, mutations in the constant component are attributed primarily to signatures SBS5 and SBS1, as well as signature SBS89, which is of unknown etiology but has been reported to be active in the first decade of life [45].
Mutations assigned to the clock-like signature SBS5 are found across cell types and increase significantly with age in every one (Fig 3E–3L). Moreover, SBS5 is the prevalent mutation signature in all cell types, except for small bowel and colon, for which more mutations are attributed to SBS1. That SBS5 is the dominant signature in postmitotic cells such as neurons, as well as in maternal mutations, most of which arose in oocytes, indicates that such mutations can arise independently of DNA replication cycles and points to errors in DNA repair, which accumulate with damage rates (Eq 1). Similarly, SBS12 and SBS16 contribute to both neurons and female germline mutations as well as to mutations in rapidly dividing cells (S2 Fig), suggesting that the age dependencies of these signatures are not driven by DNA replication cycles either.
For mutations that do not arise from replication errors, our model predicts that accumulation will be clock-like only if the damage rate u is constant. If we assume that the probabilities ϵ and p are fixed, as seems sensible if they are primarily determined by inherent properties of DNA repair (e.g., the error rate of a polymerase), then the variation in the rate of SBS5 mutation across cell types reflects differences in rates of endogenous and exogenous damage. Consistent with this notion, the rate of SBS5 mutation is highest for epithelia in the colon and lung (Fig 3), which plausibly experience high rates of damage, and lowest for mutations assigned to the maternal genome, potentially reflecting the fact that oocytes are particularly well protected [46,47]. This model also helps to explain the observation that increasing the damage rate by exogenous factors, such as long-term exposure to tobacco, significantly increases the SBS5 mutation rate in lung cells [21,48].
In that light, it may seem puzzling that in such different cell types, which presumably experience distinct sources of damage, a large fraction of mutations is consistently attributed to SBS5. As an explanation, we propose that SBS5 reflects errors in DNA synthesis during repair, a critical step in many repair pathways (e.g., nucleotide excision repair or homologous recombination) [4]. These pathways often involve the synthesis of multiple nucleotides surrounding the lesion, using the intact strand as a template. The errors of the gap-filling polymerase may be displaced from the position of the original lesion, dissociating the mutational signature from the context of the original damage. We therefore hypothesize that this mutational signature reflects the error profile of the polymerase (ϵ) and the asymmetry of mismatch resolution (p).
The second signature to increase with age, SBS1, does so in all cell types considered, except for liver, where hepatocytes are routinely dormant in the cell cycle [49], and neurons, which are postmitotic. More generally, the rate at which mutations assigned to SBS1 increase with age varies widely among cell types and is highest in those characterized by the highest turnover rates (such as intestinal epithelia, where turnover time estimates are of the order of 3 days [50]). Thus, SBS1 appears to be driven by cell division rates. A possible exception is the observation of a slight increase with age in maternal germline mutations, most of which arose in oocytes (see the discussion of germline mutations below). These observations are in agreement with previous observations from cancer studies [6,13,51].
The origin of CpG transitions remains unclear. If they arise because methylated CpGs are a poor template for replication or from spontaneous deamination of single strands during replication, their dependence on cell divisions is expected. Less intuitively perhaps, the same expectation holds if they arise from spontaneous deamination and are efficiently and accurately repaired during the cell cycle (Fig 2). Current data do not allow us to pinpoint when in the cell cycle the damage accrues, however. Two plausible sources are unrepaired mismatches that accrue during the cell cycle and deamination of single-stranded cytosines during DNA replication. As we show, their relative importance will depend on efficiency of mismatch repair as well as the length of time spent single-stranded during replication, parameters that are to our knowledge unknown.
Regardless, we can use the fact that SBS1 does not discernibly increase with age in postmitotic neurons in order to estimate an upper bound on the rate of SBS1 mutations due to repair errors. Given no cell divisions, ϕ = 0, we estimate the upper limit of the error rate of the resolution of T:G mismatches at CpG sites to be (here, we assume the detection threshold to be ≤5%, the lowest loading of an attributed signature in neurons). Transient single-strandedness, as could possibly arise during transcription, or double strand break repair [52], could enhance the rate of deamination [33], in which case the error rate of repair would need to be lower. Our estimate is similar to the measurements of the fidelity of polymerase beta [53], employed by BER, one of the pathways that repairs T:G mismatches [34]. It is also of the order of the lower bound on the error rate for DNA synthesis without proofreading, ϵ0∼10^−4, found by considering the equilibrium kinetics of DNA synthesis, given the energy difference between a mismatch and a correct base pair [54,55]. These calculations show that, despite a high rate of spontaneous deamination, repair mechanisms should be accurate enough for incorrect repair to be an insignificant source of mutations. Instead, repair of this type of mutation is likely both very efficient and accurate in all cell types, leading the number of SBS1 mutations to track cell divisions.
Mutation accumulation in the germline
While SBS1 and SBS5 are known to predominate among germline mutations [12,40], we expect the rates of mutation accumulation to differ between the sexes, given the pronounced differences in gametogenesis. In mothers, any mutation that increases with maternal age should have arisen in an oocyte, a postmitotic cell (although a small fraction may also arise in the early development of the child, if children of older mothers have more mutations in the first few cell divisions [56]), whereas in fathers, age-dependent mutations should arise in dividing spermatogonia. In turn, the mutations that do not depend on parental ages in either sex originated either prior to the onset of puberty in the parents or soon after fertilization of the offspring [57]. Given that in humans, early development is the same in both sexes until the ∼6th week of embryonic life [58], we might expect some similarity between the mutation types that contribute to the constant component in the two sexes.
In agreement with these expectations, for maternal mutations, the distributions of mutational signatures differ markedly between the age-dependent and constant components: in the constant component distribution (Fig 4D and 4H), C to T/G to A transitions dominate, with leading contributions of SBS1, SBS6 (associated with defective DNA mismatch repair [8]), and SBS30 (associated with defective base excision repair [59]). The age-dependent distribution is significantly more diffuse (Fig 4C and 4G), with top contributions from SBS5, SBS12, SBS16, and SBS39. SBS39 predominantly features C to G/G to C transversions, a substitution type known to increase sharply with maternal age and associated with double strand break repair [56,60]. Given that mutational signatures have been identified from cancer somatic tissues [8], it is conceivable that SBS39 absorbs a significant portion of the C to G/G to C substitutions characteristic of maternal mutations, even if the process that generated them in the germline has a distinct etiology. Qualitative conclusions are similar when adding the father’s age as a covariate in the model (see S3 Fig).
(A) Effect of maternal age on mutations assigned to the maternal germline in pedigree data. (B) Effect of paternal age on mutations assigned to the paternal germline in pedigree data. Paternal germline mutations (purple) accumulate at a similar rate to somatic mutations in testis seminiferous tubules (blue). (C–F) The decomposition of the mutation spectrum into age-dependent (C, E) and constant signatures (D, F) in maternal and paternal mutations; see (A) and (B) for the color code. The inset in (E) shows the decomposition of somatic mutations in the testis seminiferous tubules. SBS signatures are indicated by their COSMIC label; asterisks indicate the contribution of unattributed mutations (see Methods). (G–J) The distribution of SBS types reconstructed using the signature attributions (C–F). SBS types are grouped by the 6 substitution types (see J for color code) and sorted alphabetically by the sequence context (ACA to TTT). Underlying data for this figure can be found in S2 Data.
Surprisingly, there is a small but significant increase of SBS1 mutations with maternal age (0.019 mutations per year per gamete), which seems at odds with the lack of cell divisions in oocytes. One possibility is that there is an increase in the steady state number of T:G mismatches in aging oocytes, potentially due to reductions in repair efficiency r in older mothers [56]. Uncorrected T:G mismatches in oocytes would lead to zygotic mutations in the first division after fertilization and be detected by pedigree sequencing of trios. Alternatively, the slight increase of SBS1 mutations with age in maternal mutations, in contrast to what is seen in neurons, could be explained by a higher error rate of the repair of T:G mismatches in oocytes. In other words, while the rate of erroneous repair of mismatches due to cytosine deamination may be minimal in neurons, it could be higher and detectable in oocytes. In principle, a similar outcome would be observed if the rate of erroneous repair is the same but cytosines deaminate more often in oocytes relative to neurons, but this explanation seems less likely given the lower levels of genome-wide DNA methylation in oocytes until they enter the growth phase [61].
Among paternal mutations, the age-dependent and age-independent distributions are both dominated by signature SBS5. Most of the mutations in the constant component (Fig 4F) are attributed to SBS5, but there is also an enrichment of C to T/G to A transitions (SBS32, SBS1, and SBS30). Unlike in mothers, the increase with age is driven by SBS5 and SBS1 jointly (Fig 4E). To test whether age-dependent paternal germline mutations accumulate primarily in spermatogonia, as expected, we compare these findings with the decomposition of somatic mutations detected in the seminiferous tubules of the testes [2]. As reported by [2], the slope of the regression line is very similar for the 2 data sets (Fig 4B). In contrast to their study, however, the intercept for testis is not significantly different from 0, while the intercept for germline mutations is approximately 10 mutations per haploid genome; the reasons for this difference are unclear to us. Regardless, in our analysis, the age-dependent component of paternal mutations has a very similar distribution of signatures to those inferred for testis (Fig 4E and inset).
Overall, the age-dependent distribution is highly similar in the two sexes (Fig 4G and 4I), except for SBS1, which is significantly more pronounced in the paternal germline. The similarity of age-dependent mutational signatures between the two sexes suggests that it is not only the continuous cell divisions but also a higher damage rate that distinguishes the paternal germline from the maternal one. While the constant components differ between maternal and paternal mutations, SBS1 and SBS30 are found in both, likely reflecting a contribution of gonosomal mutations.
Discussion
As we show by modeling, distinct mutagenic processes can give rise to the clock-like accumulation of mutations observed in the germline and soma, so long as cell divisions and damage occur at reasonably constant rates. To tease these processes apart, we estimate the rate of accrual of clock-like mutations across dividing and non-dividing cell types. Our analysis reveals that the ubiquitous SBS1 and SBS5 originate predominantly from different sources: whereas SBS1 tracks cell divisions, SBS5 accumulates in postmitotic as well as dividing cells, and appears to track DNA damage levels.
Based on the behavior of SBS5 across cell types, we hypothesize that such mutations arise from errors in DNA synthesis during repair. This hypothesis could be explored further by examining, for example, if SBS5 mutations are enriched at loci with high repair rates. In turn, the dependence of SBS1 on cell divisions suggests that the rate of accumulation of such mutations could serve as a “counter” for cell divisions [13,62], applicable to different cell lineages. For now, however, cell division rates at different stages of human development remain poorly characterized, stymieing such efforts.
In gaining a better understanding of cell division rates, it will be interesting to examine if rates of DNA damage and cell division covary. As an example, reactive oxygen species—an important source of DNA damage—influence cell cycle progression [63]. More broadly, cell metabolism and cell cycle progression are tightly related (reviewed in [64]). This relationship could contribute to the observed variations in clock-like mutation rates among different cell types within an organism and potentially to differences across species. For example, cells in smaller, shorter-lived mammals typically exhibit higher mass-specific metabolic rates [65], shorter cell cycles [66], and higher mutation rates for both SBS1 and SBS5 [11]. While establishing causality remains a challenge, these findings hint at a potential link between rates of DNA damage and of cell division.
Our analyses further confirm the importance of damage as a source not only of somatic mutations, but also for germline mutations [56,67,68]: over two-thirds of germline mutations are assigned to SBS1 and SBS5, signatures that arise from damage that is either not repaired or repaired incorrectly. The source of such DNA damage is most likely endogenous cellular processes, accounting for the omnipresence of these two signatures across cell types [2] and species [69], as well as their characteristic clock-like behavior under most conditions [48].
While our analyses help to make sense of a number of disparate observations in human germline and soma, they also raise a number of new questions. In particular, it is puzzling that CpG transitions accumulate with absolute time in phylogenetic data from mammals [70,71], when SBS1 mutations accumulate with cell divisions within humans, and there are dramatic differences in germ cell division rates across mammalian species.
It is also unclear how species-specific rates are set for other types of mutations. The dominance of SBS5 in the germline implies that most of the variation in mutation rates across species is likely explained by differences in the rates of DNA damage and repair (e.g., [72,73]). To what extent this balance is directly shaped by selection versus a byproduct of changes in cellular activity remains to be explored, however.
Methods
Model of mutation accumulation
We first consider the interplay of damage and repair during the cell cycle. The time since the last DNA replication is denoted by t. We denote the number of intact sites by n0, the number of lesions by n1, the number of mismatches by n2, and the number of mutations by n3, where all these sum up to the genome size, i.e., n0+n1+n2+n3 = l. DNA damage leads to new lesions at rate u. Lesions are repaired at rate r, with error rate ϵ. Incorrectly repaired lesions lead to mismatches, which are resolved at rate q. We assume the mismatch resolution mechanism cannot discern the correct base in the mismatch site and it results in a correct base pair with probability p (see schema in Fig 5). The model parameters and variables are summarized in Table 1.
(A) Kinetics of the interplay of damage and repair in the general case. The two-state model used for CpG transitions accumulation is obtained by taking the limit q→0. (B) Example solution of the model for the number of lesions (left), mismatches (center), and mutations (right).
The expected number of sites in each state is well approximated by the following system of equations: (5) where we have neglected the probability of damage affecting the same site multiple times, as is realistic. The number of intact sites is always much greater than the number of lesions, mismatches, and mutations combined, i.e., n0≈l≫n1+n2+n3. This dynamics is therefore well approximated by (6)
The solution to this system with initial condition n0(0) = l (and thus n1(0) = n2(0) = n3(0) = 0) is (7)
At steady state, which is achieved at time t≫r^−1 and q^−1, the expected numbers of lesions and mismatches are constant, and the number of mutations n3 grows linearly with time. Specifically, (8) where is the delay due to lesion and mismatch processing. We assume, as is plausible, that a steady state is established before cell division, which in terms of model parameters means that the rate of cell division obeys ϕ≪r and q, and consequently is negligible at timescales of ϕ^−1 or longer.
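The approach to steady state can be illustrated numerically. The sketch below is not the paper's Eqs 5 to 8; it assumes the simplest linear kinetics consistent with the verbal description of the model: lesions form at total rate ul and are repaired at rate r, a fraction ϵ of repairs yields a mismatch, mismatches are resolved at rate q, and a fraction 1−p of resolutions yields a mutation. The function name, the parameter values, and the forward-Euler integrator are all illustrative choices.

```python
def simulate(u_l, r, q, eps, p, t_end, dt=1e-3):
    """Forward-Euler integration of an ASSUMED linear damage/repair scheme.

    u_l : total rate of new lesions per unit time (damage rate u times genome size l)
    r   : lesion repair rate; eps : probability that a repair creates a mismatch
    q   : mismatch resolution rate; p : probability that resolution restores the correct pair
    Returns a list of (t, lesions, mismatches, mutations) tuples.
    """
    n1 = n2 = n3 = 0.0  # start of the cell cycle: no lesions, mismatches, or mutations
    trajectory = []
    steps = int(round(t_end / dt))
    for k in range(steps):
        dn1 = u_l - r * n1              # lesions: created by damage, removed by repair
        dn2 = eps * r * n1 - q * n2     # mismatches: created by erroneous repair
        dn3 = (1.0 - p) * q * n2        # mutations: mismatches resolved to the wrong pair
        n1 += dt * dn1
        n2 += dt * dn2
        n3 += dt * dn3
        trajectory.append(((k + 1) * dt, n1, n2, n3))
    return trajectory
```

Under these assumed kinetics, for t much larger than 1/r and 1/q the lesion and mismatch counts level off at ul/r and ϵul/q, and the mutation count grows linearly with slope (1−p)ϵul after a lag of 1/r + 1/q, which matches the qualitative behavior described in the text.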
Next, we consider the expected number of mutations m over a period x that spans multiple cell divisions occurring at rate ϕ. We additionally account for unrepaired replicatory errors, of which there are wlP per cell division, where w denotes the error rate of a replicative polymerase and P is the probability that a mismatch is unrepaired by mismatch repair. Altogether, the expected number of mutations is given by (9) where we assume that an unrepaired lesion leads to a mutation in one of the two dividing cells with probability R.
To quantify variation in the number of mutations, we note that the set of Eq (6) corresponds to a monomolecular reaction system and therefore that the equivalent chemical master equation is solved by a product Poisson distribution [74]. Thus, the variances of the random variables N1,N2, and N3 (numbers of lesions, mismatches, and mutations) are equal to their expected values and given by (10)
We further assume that cell division dynamics is independent of the dynamics within a cell cycle, and that the number of cell divisions in time x is Poisson distributed with mean ϕx. We can therefore estimate the variance of the number of mutations M at age x using the law of total variance, (11)
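To make the structure of this calculation concrete, consider a deliberately simplified case that is not Eq 11 itself. Suppose, as an illustrative assumption, that the number of divisions K in time x is Poisson with mean ϕx and that, given K, the number of mutations M is Poisson with mean cK, where c is a hypothetical constant number of mutations per division. The law of total variance then gives:

```latex
\begin{aligned}
\mathrm{Var}(M) &= \mathbb{E}\left[\mathrm{Var}(M \mid K)\right] + \mathrm{Var}\left(\mathbb{E}[M \mid K]\right) \\
                &= \mathbb{E}[cK] + \mathrm{Var}(cK) \\
                &= c\,\phi x + c^{2}\,\phi x \;=\; c\,\phi x\,(1 + c).
\end{aligned}
```

In this toy case, the variance exceeds the mean cϕx by the factor 1 + c, so randomness in the number of divisions makes the mutation count overdispersed relative to a Poisson distribution.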
To obtain the reduced model describing the dynamics underlying CpG transitions, we take the limit q→0. This results in (12) where n1 now stands for the number of T:G mismatches, and n2 is the number of mutations. This system is solved by (13)
Unrepaired mismatches lead to a mutation in one of the 2 dividing cells. Additionally, we consider that cytosines can deaminate at rate us during the transient single-strandedness that occurs at DNA replication, for time Δt. Analogously to the general model, assuming the dynamics reaches a steady state before the cell division, the expected number of mutations is then given by (14)
Data analysis
We analyze the patterns of mutation accumulation in different cell types, using a variety of publicly available data sets that each include 7 or more individuals of different ages. The studies used a variety of approaches in order to characterize mutations that accrue in a given cell type, including single molecule sequencing (S), sequencing of small monoclonal samples (C), and sequencing of cell colonies derived from a single cell (D). In addition, we consider de novo mutation calls based on sequenced genomes from blood samples of human trios (P). In this case, mutations are assigned to maternal and paternal genomes using “read-backed phasing” [60,75] as well as by transmission to a third generation [44,60]; such mutations are assumed to have arisen in the oocyte and during spermatogenesis, although there is also a small contribution of gonosomal mutations and mutations that arose in the early development of the child [44]. We only select probands in which the parental origin has been determined for over 90% of germline mutation calls. The data set (see Table 2 below) includes cell types with varying cell division rates, ranging from postmitotic neurons to rapidly dividing epithelial cells from small bowel and colon [76]. We excluded data from donors with cancers in the focal tissue, and past and current smokers from the lung data set [21]. Download links to the data are provided in S1 Data.
We annotated each SBS substitution with its local context, i.e., the 5′ and 3′ flanking nucleotides, using the GRCh37 genome reference [77]. Using this annotation, we classified each substitution into one of the 192 types (there are 4^3 = 64 different 3-mer contexts, and for each of them, there are 3 possible substitutions). This classification was then collapsed to give the standard 96 categories, which are strand-invariant [6,7].
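The collapse from 192 to 96 categories can be sketched as follows. This is a minimal illustration, assuming the common convention of reporting each substitution on the strand where the reference base is a pyrimidine (C or T); the function names and the `A[C>T]G` label format are illustrative and not taken from the analysis code.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sbs96_category(context, alt):
    """Map a substitution to its strand-invariant category.

    context : 3-mer on the reference strand, with the mutated base in the middle
    alt     : the substituted base on the same strand
    """
    context, alt = context.upper(), alt.upper()
    ref = context[1]
    if ref == alt:
        raise ValueError("reference and alternate bases must differ")
    if ref in "AG":
        # Purine reference: describe the same event on the complementary strand.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def all_categories():
    """Enumerate the categories reached from all 64 x 3 = 192 substitution types."""
    bases = "ACGT"
    found = set()
    for five in bases:
        for ref in bases:
            for three in bases:
                for alt in bases:
                    if alt != ref:
                        found.add(sbs96_category(five + ref + three, alt))
    return sorted(found)
```

For instance, a G to A change in the context CGT and a C to T change in the context ACG describe the same event seen from opposite strands, and both map to the same category.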
Mutational signatures attribution
First, we describe how we decompose a set of mutations as a linear combination of 79 COSMIC version 3.3 SBS mutational signatures [10]. For any mutation of SBS type z, the probability that it is attributed to a given signature class s is given by Bayes’ formula, P(s|z) = P(z|s)P(s)/P(z) (15), where P(z) = Σs P(z|s)P(s). P(z|s) is known, but we need to infer the set of loadings P = {P(s)} from the data.
The log-likelihood of the loadings P given the observed mutations is given by (16) where P(zj,sj) is the joint probability of mutation of type zj and the underlying signature sj (a hidden variable). We maximize the likelihood iteratively using the expectation–maximization method (EM, [78]), under the constraint of normalization, ΣsP(s) = 1, and P(s)≥0 for all s.
Initial condition: We initialize the optimization with a uniform distribution.
E step: Provided the distribution at iteration i, Pi = {Pi(s)}, we compute the pseudo-log-likelihood of the set of loadings P, (17) where (18)
M step: We find the distribution in the next iteration, Pi+1, by maximizing the pseudo-log-likelihood Q(P|Pi) under the constraint Σs P(s) = 1, i.e., (19) where Z is a Lagrange multiplier. In this way, we find that (20) and the value of the multiplier can be found by applying the condition of normalization, Σs Pi+1(s) = 1.
We apply regularization to avoid overfitting. Specifically, we impose sparsity on the distribution Pi by introducing a Dirichlet prior over the set of distributions P, (21) where β≥0 parametrizes the degree of sparsity. As discussed in the main text, the COSMIC catalog of signatures is based on cancer samples, and therefore some of the processes that shape the mutational spectra in noncancerous cells may be missing from this data set. This may lead to errors in signature attribution: mutations corresponding to undefined signatures will be incorrectly assigned to COSMIC signatures. One strategy to overcome this limitation might be to impose a high degree of sparsity of signature loadings, leading to a small number of attributed signatures and avoiding spurious signature attributions. However, because signature SBS5 comprises many mutation types, strong regularization could lead to the misassignment of many mutations to this signature, and consequently, its loading may be overestimated. Given our focus on signature SBS5, we instead chose to apply a weak regularization that avoids this problem, but at the potential cost of spuriously assigning mutations to other cancer-specific signatures that do not in fact operate in the germline.
With this regularization approach, the distribution Pi+1 in the next iteration takes a closed form. Taking advantage of the fact that a Dirichlet prior is conjugate to multinomial likelihood (16), it can be shown [79] that (22) where the normalization factor reads . This expression implies that if the estimated loading of a signature s is too low, it is set to 0.
Convergence criterion: We iterate until convergence, i.e., when for all signatures s we have |Pi+1(s)−Pi(s)|<10^−4.
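The iteration described above can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis code: mutations are summarized as counts per SBS type, the unregularized update is the standard EM step for mixture weights, and the sparsity step is written as a hypothetical thresholding, max(0, expected count − β) followed by renormalization, which is one form consistent with the statement that sufficiently low loadings are set to 0. The exact form of Eq 22 may differ.

```python
def em_loadings(counts, signatures, beta=0.0, tol=1e-4, max_iter=10000):
    """Estimate signature loadings P(s) by expectation-maximization.

    counts     : counts[z] = number of observed mutations of SBS type z
    signatures : signatures[s][z] = P(z | s), each row summing to 1
    beta       : sparsity strength (ASSUMED thresholding form; beta = 0 is plain EM)
    """
    n_sig = len(signatures)
    loadings = [1.0 / n_sig] * n_sig  # uniform initial condition
    for _ in range(max_iter):
        # E step: expected number of mutations attributed to each signature,
        # using the membership probabilities P(s | z) from Bayes' formula.
        expected = [0.0] * n_sig
        for z, n_z in enumerate(counts):
            if n_z == 0:
                continue
            joint = [signatures[s][z] * loadings[s] for s in range(n_sig)]
            p_z = sum(joint)
            if p_z == 0.0:
                continue  # type z cannot be produced by any active signature
            for s in range(n_sig):
                expected[s] += n_z * joint[s] / p_z
        # M step: shrink (assumed form), then renormalize so the loadings sum to 1.
        shrunk = [max(0.0, e - beta) for e in expected]
        total = sum(shrunk)
        if total == 0.0:
            raise ValueError("all loadings were shrunk to zero; beta is too large")
        updated = [v / total for v in shrunk]
        converged = max(abs(u - v) for u, v in zip(updated, loadings)) < tol
        loadings = updated
        if converged:
            break
    return loadings
```

With two well-separated toy signatures and counts generated from a 70:30 mixture, the procedure recovers loadings close to 0.7 and 0.3; with real COSMIC signatures, which overlap substantially, convergence is slower and the choice of β matters more.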
In what follows, we extend this signature attribution method to infer age-dependent signatures in data consisting of mutations from donors of varying ages. First, we perform a linear regression of the number of mutations per haploid genome, y, on age, x: y = ax + b (23)
We find a and b for each cell type, using the scipy package [80] (see S4 Fig). Next, we assign the mutational signatures that make up the slope and the intercept. Namely, we assume that independent distributions govern the signature compositions of the age-dependent and age-independent mutations and denote them Pa = {Pa(s)} and Pb = {Pb(s)}, respectively. The expected proportion of mutations of type z at age x is then given by (24) where we sum over COSMIC SBS signatures.
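For reference, the least-squares slope and intercept can be computed in closed form. The sketch below uses only the Python standard library and is an illustration, not the analysis code; a routine such as `scipy.stats.linregress` returns the same slope and intercept. The function name and the toy data in the usage note are hypothetical.

```python
def fit_line(ages, mutation_counts):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(ages)
    if n < 2 or n != len(mutation_counts):
        raise ValueError("need at least two (age, count) pairs of equal length")
    mean_x = sum(ages) / n
    mean_y = sum(mutation_counts) / n
    s_xx = sum((x - mean_x) ** 2 for x in ages)
    if s_xx == 0.0:
        raise ValueError("all ages are identical; the slope is undefined")
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, mutation_counts))
    a = s_xy / s_xx          # slope: mutations per year
    b = mean_y - a * mean_x  # intercept: mutations at age 0
    return a, b
```

Calling `fit_line([20, 30, 40, 50], [50, 70, 90, 110])` on these made-up points returns a slope of 2 mutations per year and an intercept of 10 mutations.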
In order to pose the problem of estimating the two components, Pa and Pb, we assign 2 hidden variables to each mutation j: the signature to which the mutation is attributed, sj, and an indicator variable Ij, which equals 1 for age-dependent mutations and 0 for age-independent mutations. In these terms, the log-likelihood of the data given the two sets of loadings is (25)
We estimate the loadings using expectation–maximization, as previously described. In this case, the pseudo-log-likelihood is (26) where the two sets of membership probabilities are computed analogously to Eq 18.
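The two-component structure can be sketched as follows. The code assumes, as a natural reading of the regression model rather than a transcription of Eqs 24 to 26, that a mutation observed at age x belongs to the age-dependent component with prior weight ax/(ax + b) and to the constant component with weight b/(ax + b). It further assumes a ≥ 0 and b ≥ 0. All function names are illustrative.

```python
def age_weight(x, a, b):
    """ASSUMED prior probability that a mutation at age x is age-dependent."""
    total = a * x + b
    if a < 0 or b < 0 or total <= 0:
        raise ValueError("this sketch requires a >= 0, b >= 0 and a*x + b > 0")
    return a * x / total


def type_probability(z, x, a, b, signatures, p_age, p_const):
    """Expected proportion of mutations of type z at age x under the mixture."""
    w = age_weight(x, a, b)
    return sum(
        signatures[s][z] * (w * p_age[s] + (1.0 - w) * p_const[s])
        for s in range(len(signatures))
    )


def memberships(z, x, a, b, signatures, p_age, p_const):
    """Posterior over (signature, component) for one mutation of type z at age x.

    Returns two lists indexed by signature: probabilities for the age-dependent
    component (I = 1) and for the constant component (I = 0).
    """
    w = age_weight(x, a, b)
    p_z = type_probability(z, x, a, b, signatures, p_age, p_const)
    n_sig = len(signatures)
    age_part = [w * signatures[s][z] * p_age[s] / p_z for s in range(n_sig)]
    const_part = [(1.0 - w) * signatures[s][z] * p_const[s] / p_z for s in range(n_sig)]
    return age_part, const_part
```

In an EM iteration of this form, the two lists returned by `memberships` would be summed over mutations to update Pa and Pb separately, in the same way as the single-component update.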
Inclusion of the age-dependent component leads to consistently higher likelihoods in all the data sets examined and lower Bayesian information criterion values in all data sets except for muscle and paternal mutations (S5 Fig), for which we find few differences between the age-dependent and age-independent distributions. We assess the uncertainty of our estimates by bootstrapping. For each data set, we resample cells (or trios in the case of pedigree data) with replacement 1,000 times. For each replicate, we estimate the slope and intercept of the age dependence and infer the corresponding signature attribution. We test for the significant presence of a given signature by performing the inference with weak regularization (i.e., with β = 0.1). This choice of regularization strength does not affect the main components and only zeroes out the components of the decomposition with very low membership probabilities (see S6 Fig), as specified by Eq 22. We include a signature s from a given data set if the probability Pa/b(s) is nonzero in at least 95% of bootstrap replicates; in the figures, we denote the residual contribution of signatures that do not meet this condition with an asterisk.
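The bootstrap over donors can be sketched as below for the regression slope alone. This is a simplified illustration: the resampling unit is assumed to be one (age, mutation count) pair, whereas the analysis resamples cells or trios and also repeats the signature attribution on each replicate. The function names, the fixed seed, and the percentile interval are illustrative choices.

```python
import random


def fit_slope(points):
    """Least-squares slope of mutation count on age for a list of (age, count) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return s_xy / s_xx


def bootstrap_slopes(points, n_replicates=1000, seed=0):
    """Resample (age, count) pairs with replacement and refit the slope each time."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_replicates:
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate replicate: a single age leaves the slope undefined
        slopes.append(fit_slope(sample))
    return slopes


def percentile_interval(values, level=0.95):
    """Central percentile interval of the bootstrap distribution."""
    ordered = sorted(values)
    lower = ordered[int((1.0 - level) / 2.0 * (len(ordered) - 1))]
    upper = ordered[int((1.0 + level) / 2.0 * (len(ordered) - 1))]
    return lower, upper


def fraction_nonzero(replicate_loadings):
    """Share of replicates in which a signature's loading is nonzero."""
    return sum(1 for value in replicate_loadings if value > 0.0) / len(replicate_loadings)
```

The last helper mirrors the inclusion rule stated above: a signature would be retained when `fraction_nonzero` is at least 0.95 across the replicates.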
Supporting information
S1 Fig. Piecewise linear model.
Until age x0 = 18, mutations accumulate at rate μ0 (in red) and after age x0 at rate μ (blue). (A) When μ0>μ, for data from donors of ages x>x0, a regression would yield a positive intercept, b>0. (B) When μ0<μ, such regression would yield a negative intercept, b<0. See discussion in section “Relating model predictions to data” in the main text. Underlying data for this figure can be found in S2 Data.
https://doi.org/10.1371/journal.pbio.3002678.s001
(TIFF)
S2 Fig.
Inference results with weak regularization (β = 0.1): loadings for the age-dependent component, Pa, and the constant component, Pb, in (A, B) neurons, (C, D) maternal germline mutations, (E, F) paternal germline mutations, (G, H) smooth muscle from bladder, (I, J) liver hepatocytes, (K, L) lung epithelium, (M, N) small bowel epithelium, (O) colon epithelium (*for this data set, in which the intercept is negative, we assume all signatures increase with age). Signatures are indicated by their COSMIC label; we only show top 10 attributed signatures, and “o.” stands for other attributed signatures. An asterisk indicates the contribution of unattributed signatures (see Methods). Underlying data for this figure can be found in S2 Data.
https://doi.org/10.1371/journal.pbio.3002678.s002
(TIFF)
S3 Fig. Age-dependent signature attribution for germline maternal mutations.
(A) Effect of maternal age on mutations assigned to the maternal germline in pedigree data. Paternal age was included as a covariate in linear regression. (B, C) The decomposition of the maternal mutation spectrum into age-dependent (B) and constant signatures (C); see (A) for the color code. SBS signatures are indicated by their COSMIC label; asterisks indicate unattributed signatures (see Methods). Underlying data for this figure can be found in S2 Data.
https://doi.org/10.1371/journal.pbio.3002678.s003
(TIFF)
S4 Fig.
The total number of mutations as a function of age in: (A) neurons, (B) maternal germline mutations, (C) paternal germline mutations, (D) smooth muscle from bladder, (E) liver hepatocytes, (F) lung epithelium, (G) small bowel epithelium, (H) colon epithelium. We report the coefficient of determination R2 in each panel. Underlying data for this figure can be found in S2 Data.
https://doi.org/10.1371/journal.pbio.3002678.s004
(TIFF)
S5 Fig.
Comparison of the constant (P(s)) and age-dependent () signature attributions for (A) neurons, (B) maternal germline mutations, (C) paternal germline mutations, (D) smooth muscle from bladder, (E) liver hepatocytes, (F) lung epithelium, (G) small bowel epithelium. We report the mean log-likelihood per mutation, L(P)/N, and the Bayesian information criterion, BIC = klogN−2L(P), where k denotes the number of nonzero parameters. Underlying data for this figure can be found in S2 Data.
https://doi.org/10.1371/journal.pbio.3002678.s005
(TIFF)
S6 Fig.
Inference results with weak regularization (β = 0.1) compared to no regularization (β = 0): loadings for the age-dependent component, Pa, and the constant component, Pb (insets), in (A) neurons, (B) maternal germline mutations, (C) paternal germline mutations, (D) smooth muscle from bladder, (E) liver hepatocytes, (F) lung epithelium, (G) small bowel epithelium, (H) colon epithelium (*for this data set, in which the intercept is negative, we assume all signatures increase with age). Underlying data for this figure can be found in S2 Data.
https://doi.org/10.1371/journal.pbio.3002678.s006
(TIFF)
S1 Data. Download links to the publicly available data sets used in this study.
https://doi.org/10.1371/journal.pbio.3002678.s007
(XLSX)
Acknowledgments
We thank Ziyue Gao and members of the Andolfatto, Przeworski, and Sella labs for useful discussions.
References
- 1. Ségurel L, Wyman MJ, Przeworski M. Determinants of mutation rate variation in the human germline. Annu Rev Genomics Hum Genet. 2014;15:47–70. pmid:25000986
- 2. Moore L, Cagan A, Coorens THH, Neville MDC, Sanghvi R, Sanders MA, et al. The mutational landscape of human somatic and germline cells. Nature. 2021;597:381–386. pmid:34433962
- 3. Abascal F, Harvey LMR, Mitchell E, Lawson ARJ, Lensing SV, Ellis P, et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021;593:405–410. pmid:33911282
- 4. Chatterjee N, Walker GC. Mechanisms of DNA damage, repair, and mutagenesis. Environ Mol Mutagen. 2017;58:235–263. pmid:28485537
- 5. Kucab JE, Zou X, Morganella S, Joel M, Nanda AS, Nagy E, et al. A compendium of mutational signatures of environmental agents. Cell. 2019;177:821–836.e16. pmid:30982602
- 6. Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149:979–993. pmid:22608084
- 7. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500:415–421. pmid:23945592
- 8. Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578:94–101. pmid:32025018
- 9. Koh G, Degasperi A, Zou X, Momen S, Nik-Zainal S. Mutational signatures: emerging concepts, caveats and clinical applications. Nat Rev Cancer. 2021;21:619–637. pmid:34316057
- 10. Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2019;47:D941–D947. pmid:30371878
- 11. Cagan A, Baez-Ortega A, Brzozowska N, Abascal F, Coorens THH, Sanders MA, et al. Somatic mutation rates scale with lifespan across mammals. Nature. 2022;604:517–524. pmid:35418684
- 12. Rahbari R, Wuster A, Lindsay SJ, Hardwick RJ, Alexandrov LB, Turki SA, et al. Timing, rates and spectra of human germline mutation. Nat Genet. 2016;48:126–133. pmid:26656846
- 13. Alexandrov LB, Jones PH, Wedge DC, Sale JE, Campbell PJ, Nik-Zainal S, et al. Clock-like mutational processes in human somatic cells. Nat Genet. 2015;47:1402–1407. pmid:26551669
- 14. Matsuno Y, Kusumoto-Matsuo R, Manaka Y, Asai H, Yoshioka KI. Echoed induction of nucleotide variants and chromosomal structural variants in cancer cells. Sci Rep. 2022;12:20964. pmid:36470958
- 15. Schiantarelli J, Pappa T, Conway J, Crowdis J, Reardon B, Dietlein F, et al. Mutational footprint of platinum chemotherapy in a secondary thyroid cancer. JCO Precis Oncol. 2022;6:e2200183. pmid:36075011
- 16. Caballero M, Boos D, Koren A. Cell-type specificity of the human mutation landscape with respect to DNA replication dynamics. Cell Genomics. 2023;3:100315. pmid:37388911
- 17. Blokzijl F, de Ligt J, Jager M, Sasselli V, Roerink S, Sasaki N, et al. Tissue-specific mutation accumulation in human adult stem cells during life. Nature. 2016;538:260–264. pmid:27698416
- 18. Martínez-Jiménez F, Movasati A, Brunner SR, Nguyen L, Priestley P, Cuppen E, et al. Pan-cancer whole-genome comparison of primary and metastatic solid tumours. Nature. 2023;618:333–341. pmid:37165194
- 19. Ivanov D, Hwang T, Sitko LK, Lee S, Gartner A. Experimental systems for the analysis of mutational signatures: no ‘one-size-fits-all’ solution. Biochem Soc Trans. 2023;51:1307–1317. pmid:37283472
- 20. Massaar S, Sanders MA. The etiology of clonal mosaicism in human aging and disease. Aging Cancer. 2023;4:3–20.
- 21. Yoshida K, Gowers KHC, Lee-Six H, Chandrasekharan DP, Coorens T, Maughan EF, et al. Tobacco smoking and somatic mutations in human bronchial epithelium. Nature. 2020;578:266–272. pmid:31996850
- 22. Ernst SM, Mankor JM, van Riet J, von der Thüsen JH, Dubbink HJ, Aerts JGJV, et al. Tobacco Smoking-Related mutational signatures in classifying Smoking-Associated and Nonsmoking-Associated NSCLC. J Thorac Oncol. 2023;18:487–498. pmid:36528243
- 23. Gao Z, Wyman MJ, Sella G, Przeworski M. Interpreting the dependence of mutation rates on age and time. PLoS Biol. 2016;14:e1002355. pmid:26761240
- 24. Tubbs A, Nussenzweig A. Endogenous DNA damage as a source of genomic instability in cancer. Cell. 2017;168:644–656. pmid:28187286
- 25. Sanders MA, Vöhringer H, Forster VJ, Moore L, Campbell BB, Hooks Y, et al. Life without mismatch repair. 2021.
- 26. Aitken SJ, Anderson CJ, Connor F, Pich O, Sundaram V, Feig C, et al. Pervasive lesion segregation shapes cancer genome evolution. Nature. 2020;583:265–270. pmid:32581361
- 27. Anderson CJ, Talmane L, Luft J, Nicholson MD, Connelly J, Pich O, et al. Strand-resolved mutagenicity of DNA damage and repair. 2022.
- 28. Zhao L, Washington MT. Translesion synthesis: Insights into the selection and switching of DNA polymerases. Genes. 2017;8. pmid:28075396
- 29. Powers KT, Washington MT. Eukaryotic translesion synthesis: Choosing the right tool for the job. DNA Repair. 2018;71:127–134. pmid:30174299
- 30. Li GM. Mechanisms and functions of DNA mismatch repair. Cell Res. 2008;18:85–98. pmid:18157157
- 31. Guo Q, Lakatos E, Bakir IA, Curtius K, Graham TA, Mustonen V. The mutational signatures of formalin fixation on the human genome. Nat Commun. 2022;13:4487. pmid:36068219
- 32. Schmutte C, Yang AS, Beart RW, Jones PA. Base excision repair of U:G mismatches at a mutational hotspot in the p53 gene is more efficient than base excision repair of T:G mismatches in extracts of human colon tumors. Cancer Res. 1995;55:3742–3746. pmid:7641186
- 33. Zhang X, Mathews CK. Effect of DNA cytosine methylation upon deamination-induced mutagenesis in a natural target sequence in duplex DNA. J Biol Chem. 1994;269:7066–7069. pmid:8125913
- 34. Kunz C, Saito Y, Schär P. DNA repair in mammalian cells: Mismatched repair: variations on a theme. Cell Mol Life Sci. 2009;66:1021–1038. pmid:19153655
- 35. Fang H, Zhu X, Yang H, Oh J, Barbour JA, Wong JWH. Deficiency of replication-independent DNA mismatch repair drives a 5-methylcytosine deamination mutational signature in cancer. Sci Adv. 2021;7:eabg4398. pmid:34730999
- 36. Conti C, Saccà B, Herrick J, Lalou C, Pommier Y, Bensimon A. Replication fork velocities at adjacent replication origins are coordinately modified during DNA replication in human cells. Mol Biol Cell. 2007;18:3059–3067. pmid:17522385
- 37. Tomkova M, McClellan M, Kriaucionis S, Schuster-Böckler B. DNA replication and associated repair pathways are involved in the mutagenesis of methylated cytosine. DNA Repair. 2018;62:1–7. pmid:29223032
- 38. Seplyarskiy VB, Sunyaev S. The origin of human mutation in light of genomic data. Nat Rev Genet. 2021;22:672–686. pmid:34163020
- 39. Ju YS. Somatic mutations reveal asymmetric cellular dynamics in the early human embryo. Nature. 2017;543:714–718. pmid:28329761
- 40. Kaplanis J. Genetic and chemotherapeutic influences on germline hypermutation. Nature. 2022;605:503–508. pmid:35545669
- 41. Brunner SF. Somatic mutations and clonal dynamics in healthy and cirrhotic human liver. Nature. 2019;574:538–542. pmid:31645727
- 42. Wang Y. APOBEC mutagenesis is a common process in normal human small intestine. Nat Genet. 2023;55:246–254. pmid:36702998
- 43. Halldorsson BV. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science. 2019;363.
- 44. Sasani TA. Large, three-generation human families reveal post-zygotic mosaicism and variability in germline mutation accumulation. Elife. 2019;8. pmid:31549960
- 45. Lee-Six H. The landscape of somatic mutation in normal colorectal epithelial cells. Nature. 2019;574:532–537. pmid:31645730
- 46. Seplyarskiy VB, Soldatov RA, Koch E, McGinty RJ, Goldmann JM, Hernandez RD, et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science. 2021;373:1030–1035. pmid:34385354
- 47. Rodríguez-Nuevo A. Oocytes maintain ROS-free mitochondrial metabolism by suppressing complex I. Nature. 2022;607:756–761. pmid:35859172
- 48. Alexandrov LB. Mutational signatures associated with tobacco smoking in human cancer. Science. 2016;354:618–622. pmid:27811275
- 49. Miyaoka Y. Hypertrophy and unconventional cell division of hepatocytes underlie liver regeneration. Curr Biol. 2012;22:1166–1175. pmid:22658593
- 50. Darwich AS, Aslam U, Ashcroft DM, Rostami-Hodjegan A. Meta-analysis of the turnover of intestinal epithelia in preclinical animal species and humans. Drug Metab Dispos. 2014;42:2016–2022. pmid:25233858
- 51. Liu MH. Single-strand mismatch and damage patterns revealed by single-molecule DNA sequencing. 2023. pmid:36824744
- 52. Hinch R, Donnelly P, Hinch AG. Meiotic DNA breaks drive multifaceted mutagenesis in the human germ line. Science. 2023;382:eadh2531. pmid:38033082
- 53. Brown JA, Pack LR, Sanman LE, Suo Z. Efficiency and fidelity of human DNA polymerases λ and β during gap-filling DNA synthesis. DNA Repair. 2011;10:24–33.
- 54. Hopfield JJ. Kinetic proofreading: a new mechanism for reducing errors in biosynthetic processes requiring high specificity. Proc Natl Acad Sci U S A. 1974;71:4135–4139. pmid:4530290
- 55. Bialek W. Biophysics: Searching for Principles. Annotated edition. Princeton University Press; 2012.
- 56. Gao Z. Overlooked roles of DNA damage and maternal age in generating human germline mutations. Proc Natl Acad Sci U S A. 2019;116:9491–9500. pmid:31019089
- 57. Moorjani P, Gao Z, Przeworski M. Human germline mutation and the erratic evolutionary clock. PLoS Biol. 2016;14:e2000744. pmid:27760127
- 58. Rey R, Josso N, Racine C. Sexual Differentiation (MDText.com, Inc.). 2020.
- 59. Grolleman JE. Mutational signature analysis reveals NTHL1 deficiency to cause a multi-tumor phenotype. Cancer Cell. 2019;35:256–266.e5. pmid:30753826
- 60. Jónsson H, Sulem P, Kehr B, Kristmundsdottir S, Zink F, Hjartarson E, et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature. 2017;549:519–522. pmid:28959963
- 61. Yan R. Decoding dynamic epigenetic landscapes in human oocytes using single-cell multi-omics sequencing. Cell Stem Cell. 2021;28:1641–1656.e7. pmid:33957080
- 62. Gerstung M. The evolutionary history of 2,658 cancers. Nature. 2020;578:122–128. pmid:32025013
- 63. Verbon EH, Post JA, Boonstra J. The influence of reactive oxygen species on cell cycle progression in mammalian cells. Gene. 2012;511:1–6. pmid:22981713
- 64. Diehl FF, Sapp KM, Vander Heiden MG. The bidirectional relationship between metabolism and cell cycle control. Trends Cell Biol. 2023. pmid:37385879
- 65. Diaz-Cuadros M. Metabolic regulation of species-specific developmental rates. Nature. 2023;613:550–557. pmid:36599986
- 66. Lázaro J. A stem cell zoo uncovers intracellular scaling of developmental tempo across mammals. Cell Stem Cell. 2023. pmid:37343565
- 67. Seplyarskiy VB. Error-prone bypass of DNA lesions during lagging-strand replication is a common source of germline and cancer mutations. Nat Genet. 2019;51:36–41. pmid:30510240
- 68. Wu FL. A comparison of humans and baboons suggests germline mutation rates do not track cell divisions. PLoS Biol. 2020.
- 69. Gelova SP, Doherty KN, Alasmar S, Chan K. Intrinsic base substitution patterns in diverse species reveal links to cancer and metabolism. Genetics. 2022;222. pmid:36149294
- 70. Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A. 2004;101:13994–14001. pmid:15292512
- 71. Moorjani P, Amorim CEG, Arndt PF, Przeworski M. Variation in the molecular clock of primates. Proc Natl Acad Sci U S A. 2016;113:10607–10612. pmid:27601674
- 72. Hart RW, Setlow RB. Correlation between deoxyribonucleic acid excision-repair and life-span in a number of mammalian species. Proc Natl Acad Sci U S A. 1974;71:2169–2173. pmid:4526202
- 73. Tian X. SIRT6 is responsible for more efficient DNA double-strand break repair in long-lived species. Cell. 2019;177:622–638.e22. pmid:31002797
- 74. Jahnke T, Huisinga W. Solving the chemical master equation for monomolecular reaction systems analytically. J Math Biol. 2007;54:1–26. pmid:16953443
- 75. Goldmann JM. Parent-of-origin-specific signatures of de novo mutations. Nat Genet. 2016;48:935–939. pmid:27322544
- 76. Gehart H, Clevers H. Tales from the crypt: new insights into intestinal stem cells. Nat Rev Gastroenterol Hepatol. 2019;16:19–34. pmid:30429586
- 77. Church DM. Modernizing reference genome assemblies. PLoS Biol. 2011;9:e1001091. pmid:21750661
- 78. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc. 1977;39:1–22.
- 79. Figueiredo MAT, Jain AK. Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell. 2002;24:381–396.
- 80. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261–272. pmid:32015543