Single-cell mutational burden distributions in birth–death processes

Christo Morison; Dudley Stark; Weini Huang

doi:10.1371/journal.pcbi.1013241

Abstract

Genetic mutations are footprints of cancer evolution and reveal critical dynamic parameters of tumour growth, which otherwise are hard to measure in vivo. The mutation accumulation in tumour cell populations has been described by various statistics, such as site frequency spectra (SFS), single-cell division distributions (DD) and mutational burden distributions (MBD). While DD and SFS have been intensively studied in phylogenetics especially after the development of whole genome sequencing technology of bulk samples, MBD has drawn attention more recently with the single-cell sequencing data. Although those statistics all arise from the same somatic evolutionary process, an integrated understanding of these distributions is missing and requires novel mathematical tools to better inform the ecological and evolutionary dynamics of tumours. Here we introduce dynamical matrices to analyse and unite the SFS, DD and MBD and derive recurrence relations for the expectations of these three distributions. While we successfully recover classic exact results in pure-birth cases for the SFS and the DD through our new framework, we derive a new expression for the MBD and approximate all three distributions when death is introduced. We demonstrate a natural link between the SFS and the single-cell MBD, and show that the MBD can be regenerated through the DD. Counter-intuitively, the single-cell MBD is mainly driven by the stochasticity arising in the DD, rather than the extra stochasticity in the number of mutations at each cell division.

Author summary

Somatic mutations accumulated in tissue growth and maintenance lead to genetic variation in tumours and healthy tissues. The patterns of those mutations have been used to reveal tumour history. Here, we developed a general framework to unite different statistical properties of mutation distributions between bulk sequencing data and single-cell data. The site frequency spectra from bulk data, division distributions and single-cell mutational burden distributions from single-cell data can be connected using dynamic matrices and recurrence relations. Counter-intuitively, the stochasticity in the number of mutations acquired in each cell division does not play a critical role in the single-cell mutation burden distribution.

Citation: Morison C, Stark D, Huang W (2025) Single-cell mutational burden distributions in birth–death processes. PLoS Comput Biol 21(7): e1013241. https://doi.org/10.1371/journal.pcbi.1013241

Editor: Ivana Bozic, University of Washington, UNITED STATES OF AMERICA

Received: August 28, 2024; Accepted: June 14, 2025; Published: July 7, 2025

Copyright: © 2025 Morison et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Code can be found on GitHub at https://github.com/crmorison/mbds-in-bds.

Funding: This work was supported by the European Union (grant number 955708 to CM). CM is fully funded by European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie EvoGamesPlus. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Somatic mutations are important for the evolution of biological systems with clonal reproduction, including the development from healthy tissues to cancer [1,2]. While less is known about the somatic mutation rates in clonal species such as plants and corals, they have been studied extensively in human tissues. Healthy cells may accumulate in the order of 1 to 2 mutations per cell per division, which is directly observable in early development [3–5]. The mutational rate of tumour cells is often thought to be higher, which can be caused for example by genomic instability [6–8]. This large number of mutations accumulated in tumours serves as a genetic footprint to reveal their evolutionary history. Since the majority of these mutations are neutral [9], not impacting the fitness of a cell compared to its parental cell, neutral theory has been used to explain mutational patterns in many patient samples across different tumour types [10–12]. These measurements often demonstrate an early expansion of tumour populations, wherein driver mutations are clonal and the intratumour heterogeneity arises from neutral passenger mutations accumulated after cancer initiation. Although clonal interference, where cells carrying different sets of driver mutations intercompete, is a likely alternative scenario especially in large populations [13,14], here we focus on a further understanding of mutation accumulation under neutral selection as an important baseline dynamics.

Distributions of genetic heterogeneity under neutral selection have been studied in population genetics for over half a century [15,16]. One such statistic is the site frequency spectrum (SFS), which describes the frequencies of mutations in a population [17]. Because the SFS deals with population-level information, it can be compared to bulk genomic data or pooled single-cell data [12,18,19]. For an exponentially-growing population, a rescaling of the SFS, the variant allele frequency spectrum, has been shown to follow a power law, f⁻², for f the frequency of a mutation in the population [12,18,20,21]. More recently, exact expressions for the SFS were found under the assumption of neutral evolution [22].

The advent of single-cell sequencing [23,24] opens the door for combining bulk and single-cell data to understand the growth history and dynamic traits of (healthy or tumorous) tissues, which are otherwise difficult to measure directly [19,25]. There is a great need for new mathematical and computational machinery to cope with single-cell data, which provides different mutational distributions beyond the SFS. The number of unique mutations in the population, also known as the overall tumour mutational burden (TMB) [26], has been studied both in a single tumour [22] and (its distribution) between tumours [27,28]. However, the distribution of mutational burdens between cells, the so-called single-cell mutational burden distribution (MBD), has only recently been experimentally observable through single-cell sequencing. Understanding the MBD may further help in inferring important evolutionary parameters, determining the growth history of the tumour and the level of selection at play, with neutral selection as a baseline with which to compare. Using data from healthy haematopoietic stem cells and oesophageal epithelial cells, Moeller, Mon Père et al. showed that analysis of single-cell and bulk data complement each other and narrowed down the parameter inference of the mutation rate and stem cell population size [19]. More specifically, the mean and variance of the MBD for a growing population were derived and used to estimate the underlying mutation rate [19]. However, the exact analytical shape of the MBD has not yet been explicitly found.

The MBD evolves during the cell division process, and thus an instructive object to study it with is the cell lineage tree [29], whose leaves symbolise living cells and whose root is the progenitor of the population. Branching processes can then be viewed as growing trees, where cell division is represented by a leaf bifurcating into two leaves, and cell death is the removal of a leaf. Because this framework can generate phylogenetic trees [30], cell lineage trees have properties that have been extensively studied [31]. One such property is the distribution of leaf heights (or the distances in edges from root to leaves), known as the division distribution (DD) of individual cells. By including the accumulation of new mutations at internal nodes of the tree, the MBD is obtained [29]. An expression for the DD generated by a pure-birth, or Yule, process has been found [32]; though when death is included it has not yet been solved exactly. Our goal is to build upon knowledge of the DD and the SFS to better understand the MBD, by formulating a discrete-time approach that integrates all three distributions.

We introduce a new framework via dynamical matrices to investigate mutation accumulation in a birth–death process and explain how key quantities such as the SFS, DD and MBD are obtained from these mutational matrices. This framework allows us to derive exact solutions of these distributions by recurrence relations in the pure-birth case, as well as first-order approximations when death is introduced, which hold in the low-death and large-population limits. By comparing our solutions for the SFS and DD to known results in population genetics, we first demonstrate the efficacy of our framework. We then show new results in expressions for the birth–death DD (Eq 7) and both the pure-birth and birth–death MBD (Eq 9). Our analytical results for all three distributions agree well with stochastic simulations. We find that the MBD can be generated via the DD and the mean mutation rate per cell division, independently of the stochasticity in the number of mutations per cell division.

Results

We begin by describing a birth–death process with stochastic mutation accumulation, before deriving expected distributions for various summary statistics of interest.

Dynamical matrices to unite the SFS, DD and MBD in a birth–death process

In a birth–death process where a uniformly randomly chosen cell either divides with probability β or dies with probability δ = 1 − β, the population size N_i at time step i can be described by a discrete-time Markov chain (Fig 1a). The state space in this Markov chain is the finite integer set {0, 1, …, N}, where N is the largest possible population size. In some cases, we are interested in the limit N → ∞.

Fig 1. Discrete-time Markov chain model with binary tree representation of a sample realisation.

(a) Discrete-time Markov chain description of the population size. (b) The population size N_i (solid red line) and number of unique mutations M_i (dashed black line) versus the step count i for the example realisation in (c). (c) Growing binary tree representations of an example realisation of the birth–death process with mutations described in the main text, with birth probability β, death probability δ = 1 − β, mutational mean μ and initial population N₀ = 1. Cells that are crossed out have died. Edges are labelled by the number of new mutations occurring during that division. Leaves (living cells) are labelled by their mutational burden, which is equal to the sum of the edges that connect them to the root, or the mutation-free progenitor cell. The three sub-panels show snapshots of the process at steps i = 4, i = 8 and i = 12, with their population sizes N_i and M_i labelled.

https://doi.org/10.1371/journal.pcbi.1013241.g001

Often, the stochastic birth–death process explicitly involves a continuous time parameter t instead of a discrete step count i. This allows for rates to be considered instead of probabilities; however, as long as events are assumed not to be simultaneous, these two schemes can be mapped to one another by choosing a distribution of times between events. Most often, the times between events are assumed to be exponentially distributed, with event frequencies that grow with the population size.

Here, we focus on the growing-population case and assume N₀ = 1 unless otherwise mentioned. In this case, the birth–death process is a growing rooted binary tree, where the root is the lone progenitor cell (assumed to be mutation-free, as any of its mutations will be clonal in the population), leaves are living cells, and pruned leaves are dead cells. See Fig 1c for an example realisation of such a process. Novel mutations are accumulated during cell divisions and old mutations may be lost in cell death. We deem mutations unique under the infinite sites approximation, where the probability of two point mutations occurring at the same location along the large genome is supposed vanishingly small [33]. Note that the infinite sites approximation has been disputed in the cancer context: Kuipers et al. [34] presented data that called into question the rarity of multiply-mutated sites; see Cheek and Antal [35,36] for branching process models that do not rely on the infinite sites approximation. Point mutations occurring during each duplication of the genome are modelled as occurring with a constant rate and independently from one another, a common modelling assumption [19]. Their occurrences therefore follow a Poisson process, and their number is Poisson distributed. Thus, when a cell divides, its two daughter cells inherit all mutations carried by the mother cell and each acquire a random number of new mutations drawn independently from a Poisson distribution with mean μ.
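The process described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' simulation code (which uses a Gillespie algorithm; see Methods): the function name `simulate`, its arguments and the per-cell representation (a set of mutation identifiers plus a division count) are assumptions made here for illustration.

```python
import numpy as np

def simulate(beta, mu, steps, rng):
    """Discrete-time birth-death process with Poisson mutation accumulation.

    Hypothetical helper: returns one (mutation_ids, division_count) pair
    per living cell after at most `steps` events.
    """
    cells = [(frozenset(), 0)]  # mutation-free progenitor, N_0 = 1
    next_id = 0                 # infinite sites: every new mutation is unique
    for _ in range(steps):
        if not cells:
            break               # the population went extinct
        idx = rng.integers(len(cells))   # uniformly chosen cell
        muts, divs = cells.pop(idx)
        if rng.random() < beta:
            # Division: each daughter inherits all mutations of the mother
            # and gains an independent Poisson(mu) number of new ones.
            for _ in range(2):
                k = rng.poisson(mu)
                new = frozenset(range(next_id, next_id + k))
                next_id += k
                cells.append((muts | new, divs + 1))
        # Otherwise the chosen cell dies and is simply not re-added.
    return cells

rng = np.random.default_rng(0)
cells = simulate(beta=1.0, mu=2.0, steps=10, rng=rng)
burdens = [len(m) for m, _ in cells]    # single-cell mutational burdens
divisions = [d for _, d in cells]       # single-cell division counts
```

With `beta=1.0` the population size after `steps` events is deterministic (`steps + 1`), which gives a quick sanity check on the bookkeeping.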

We are most interested in three key quantities of this birth–death process with mutations at each step i: (i) the site frequency spectrum (SFS) S_i, whose elements S_j,i denote the number of mutations which occur j times in the population [17]; (ii) the single-cell mutational burden distribution (MBD) B_i, whose elements B_k,i are the number of cells having a mutational burden of k [19]; and (iii) the division distribution (DD) D_i, whose elements D_ℓ,i indicate the number of cells having undergone ℓ divisions during the process. Note that D_ℓ,i is also the number of leaves lying at a distance of ℓ edges from the root in the growing tree framework.

While the importance of the SFS and the DD has been investigated in growing populations [12,18,20–22,31], we are interested in the relationship between them and how it can help us understand the MBD. Here, we introduce a novel discrete-time framework to demonstrate the symmetry between these distributions. The number of unique mutations at step i is M_i = Σ_j S_j,i, and the population size is N_i = Σ_k B_k,i = Σ_ℓ D_ℓ,i, both of which are plotted in Fig 1b for the example found in Fig 1c. Thus, the MBD and the DD form partitions of the number of cells in a way similar to the SFS partitioning the number of unique mutations. Next, we introduce dynamical matrices to connect those quantities arising from the same population of individual cells.

We consider a collection of matrices Y_i, where the rows refer to cells and the columns refer to mutations, known as genotype matrices or SNP (single nucleotide polymorphism) matrices in bioinformatics [37]. Our matrices are dynamical in that their entries are updated at each step by a binary filling in the following manner: the (n,m)th entry of the matrix is equal to 1 if the nth cell possesses the mth mutation at step i and equal to 0 otherwise. When a cell dies, its row is removed from the matrix. Fig 2a shows an example of the matrix Y_i associated to the tree example in Fig 1c. We extend the concept of genotype matrices by marking mutations arising during a single (past) division by grey shaded areas. Note that if no mutations arise during a division, a placeholder column must be added with only zeros. The 0 entry corresponding to the cell that did not gain any new mutations would still be shaded in grey, as this shading tracks the division burden. These matrices are also how mutational data can be stored in stochastic simulations.

Fig 2. Dynamical matrix representation of the model with sample plots of the SFS, DD and MBD.

The matrix framework described in the main text, where i refers to the Markov step count (in this example, i = 12). (a) The matrix Y_i corresponding to the example realisation of the birth–death process depicted in Fig 1c, where entry (n,m) is 1 if cell n possesses mutation m and 0 otherwise. Grey shaded mutational entries arose during the same division and thus always occur together in the descendants of their progenitor. For example, mutations m–1, m and m + 1 in cell n (shown outlined in black) were generated in the same past division. (In Fig 1c, we can determine by inspection this ancestor cell to be the one that later divided into two cells with mutational burdens of 6.) The row sum of the entries of Y_i is shown in orange, the column sum of the number of grey areas is in pink and the column sum of the entries of Y_i is in blue. (b) The site frequency spectrum (SFS) S_i: a histogram of the bottommost (orange) vector of (a). (c) The division distribution (DD) D_i: a histogram of the middle (pink) vector of (a). (d) The single-cell mutational burden distribution (MBD) B_i: a histogram of the rightmost (blue) vector of (a), or equivalently of the sums of the weights of the edges in Fig 1c from the root to the leaves.

https://doi.org/10.1371/journal.pcbi.1013241.g002

We can obtain the distributions of our key quantities, the SFS, DD and MBD (Fig 2b–2d), from our dynamical mutation matrices (Fig 2a). For each mutation (column), the number of cells carrying this mutation is the row sum of the entries of Y_i (orange vector in Fig 2a). Thus, the histogram of this vector is the SFS at step i. For each cell (row), the number of divisions that the cell has undergone is the column sum of the number of grey areas (pink vector in Fig 2a), and the number of mutations in this cell is the column sum of the entries of Y_i (blue vector in Fig 2a). Correspondingly, the histograms of these two vectors lead to the other distributions obtainable from single-cell information: the DD and the MBD.

The symmetry provided by this mutation matrix Y_i gives rise to the following relationship between the site frequency spectrum and the single-cell mutational burden distribution:

$$\sum_{j} j\,S_{j,i} = \sum_{k} k\,B_{k,i} \tag{1}$$

We call this quantity the number of mutational occurrences: that is, the sum of the entries of Y_i.
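As a small illustration of how the SFS and the MBD arise from the same binary matrix, the sketch below uses a made-up 4-cell, 5-mutation matrix (not the one in Fig 2a) and NumPy. The variable names are chosen here for illustration; the DD is omitted because it also requires the division grouping (the grey shading), which the binary entries alone do not encode.

```python
import numpy as np

# Hypothetical genotype matrix: rows are cells, columns are mutations.
Y = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
])

abundance = Y.sum(axis=0)  # per mutation: number of cells carrying it
burden = Y.sum(axis=1)     # per cell: number of mutations it carries

sfs = np.bincount(abundance)  # sfs[j] = number of mutations found in j cells
mbd = np.bincount(burden)     # mbd[k] = number of cells with burden k

j = np.arange(len(sfs))
k = np.arange(len(mbd))

# Both weighted sums count the ones in Y: the mutational occurrences.
occurrences_from_sfs = int((j * sfs).sum())
occurrences_from_mbd = int((k * mbd).sum())

n_mutations = int(sfs.sum())  # number of unique mutations
n_cells = int(mbd.sum())      # population size
```

Here both weighted sums equal the total number of ones in `Y`, while the unweighted sums return the number of mutations and the number of cells, respectively.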

Law of total expectation and conditioning on survival

Our primary approach for deriving the distributions of our key quantities from the discrete-time model is as follows. We use the law of total expectation (E[X] = E[E[X | Y]], for any random variables X and Y) to equate an expected quantity at step i + 1 to a conditional expectation. This usually is a function of the quantity at step i (conditional on knowledge of this information at step i), as earlier knowledge is never needed due to the Markov nature of the model. From this, we derive a recurrence relation for the expected values of our desired quantity, which can be solved.

We first note that conditioning on the survival of the entire population plays a role in all of our subsequent expected values. In the pure-birth case, the population is deterministic and equal to N_i = i + 1. Once death is included, however, the population size becomes a random variable. For the birth–death chain of Fig 1a, its expected value at step i, both conditioned on survival and not, can be exactly calculated, which is done in Proposition B of S1 Appendix. Fig A of S1 Appendix shows that the expected population size conditioned on survival can be linearly approximated by E[N_i] ≈ (β − δ)i + 1, valid for low death, since in this limit the expected gain in population in one step is β − δ. All of our ensuing expectations are conditioned on survival and the initial condition N₀ = 1, which we will omit from our notation for brevity. Table 1 displays our notation.
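The expected population size conditioned on survival can also be checked numerically by propagating the distribution of N_i through the Markov chain. The sketch below is not the calculation of Proposition B; it is a direct dynamic-programming pass under the assumptions that N₀ = 1, that state 0 is absorbing, and that no upper cap on the population is reached within the simulated steps. Function names are chosen here for illustration.

```python
import numpy as np

def pop_size_distribution(beta, steps):
    """P(N_i = n) after `steps` events, starting from N_0 = 1."""
    delta = 1.0 - beta
    p = np.zeros(steps + 2)      # states 0 .. steps + 1
    p[1] = 1.0
    for _ in range(steps):
        q = np.zeros_like(p)
        q[0] = p[0]              # extinction is absorbing
        q[2:] += beta * p[1:-1]  # birth: n -> n + 1
        q[:-1] += delta * p[1:]  # death: n -> n - 1
        p = q
    return p

def mean_given_survival(beta, steps):
    """E[N_i | N_i > 0] for the chain above."""
    p = pop_size_distribution(beta, steps)
    n = np.arange(len(p))
    return (n * p).sum() / (1.0 - p[0])

beta, steps = 0.9, 50
exact = mean_given_survival(beta, steps)
linear = (2 * beta - 1) * steps + 1  # (beta - delta) * i + 1
```

For low death the two values stay close, and for `beta = 1.0` the conditioned mean reduces to `steps + 1`.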

Table 1. Notation used in this manuscript.

https://doi.org/10.1371/journal.pcbi.1013241.t001

Site frequency spectrum

With the recurrence relation method outlined above, we can formally derive the pure-birth (β = 1) site frequency spectrum. For any birth probability β, the total number of instances of j-abundant mutations in the population at step i is equal to jS_j,i. For example, in the leftmost sub-panel of Fig 1c, one can verify that there are 1 · S_1,4 = 7 instances of 1-abundant mutations and 2 · S_2,4 = 4 instances of 2-abundant mutations in the population at step i = 4. When there is no death, N_i = i + 1. Thus, at step i, the expected number of j-abundant mutations in a (randomly chosen) dividing cell is jS_j,i/(i + 1), since we average the total number of instances over the population. After the division in step i, any j-abundant mutations in the dividing cell will become (j + 1)-abundant and thus no longer contribute to the j-site. Similarly, (j–1)-abundant mutations in the dividing cell now contribute to the j-site. We therefore have

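$$\mathbb{E}[S_{j,i+1}] = \mathbb{E}\left[S_{j,i} - \frac{j\,S_{j,i}}{i+1} + \frac{(j-1)\,S_{j-1,i}}{i+1}\right] + 2\mu\,\delta_{1,j} \tag{2}$$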

for δ_1,j the Kronecker delta symbol, whose term arises from the new 1-abundant mutations occurring during division, which are independently drawn from a Poisson distribution with mean μ for each of the two daughter cells, giving 2μ new mutations in expectation.

We make the change of variables E[S_j,i] = (i + 1)Q_j, absorbing the source term into the boundary condition Q_1 = μ. That Q_j is independent of i can be argued in the following manner: in expectation, during neutral exponential growth, mutations preserve their frequency in the population, as all cells (both those with and without a given mutation) grow at the same rate [12]. (For a brute-force demonstration of this, see the Supplementary Information of [19].) Thus, using the linearity of expectation, Eq (2) becomes simply (j + 1)Q_j = (j − 1)Q_{j−1} for j ≥ 2, which telescopes to obtain the known result

$$\mathbb{E}[S_{j,i}] = \frac{2\mu\,(i+1)}{j\,(j+1)}.$$
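The pure-birth recurrence can be iterated numerically as a check on the telescoped form. The sketch below assumes the recurrence reads E[S_j,i+1] = E[S_j,i] − jE[S_j,i]/(i + 1) + (j − 1)E[S_j−1,i]/(i + 1) + 2μδ_1,j with a mutation-free progenitor, and compares the result with 2μ(i + 1)/(j(j + 1)) for 1 ≤ j ≤ i. The function name is hypothetical.

```python
import numpy as np

def expected_sfs_pure_birth(mu, steps):
    """Iterate the pure-birth SFS recurrence; returns s with s[j] = E[S_j]."""
    s = np.zeros(steps + 3)          # s[0] is padding and stays zero
    j = np.arange(len(s))
    for i in range(steps):
        n = i + 1                    # population size before the division
        new = s.copy()
        # j-abundant mutations in the dividing cell move to the (j+1)-site,
        # (j-1)-abundant ones arrive at the j-site.
        new[1:] += (-j[1:] * s[1:] + (j[1:] - 1) * s[:-1]) / n
        new[1] += 2 * mu             # new private mutations in both daughters
        s = new
    return s

mu, steps = 1.5, 30
s = expected_sfs_pure_birth(mu, steps)
jj = np.arange(1, steps + 1)
closed_form = 2 * mu * (steps + 1) / (jj * (jj + 1))
max_abs_error = np.abs(s[1:steps + 1] - closed_form).max()
```

Under these assumptions the two agree to floating-point precision for j up to the step count; the site j = i + 1 stays empty because the progenitor carries no mutations.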

An identical procedure can be applied in the birth–death case to recover the large-population expected SFS, though a difficulty here is that now N_i becomes a random variable itself. This means that the denominators i + 1 in Eq (2) become N_i, and so we are left with terms of the form E[S_j,i/N_i].

When random variables A and nonzero B are close to their expected values, expanding around the point (E[A], E[B]) provides the first-order approximation E[A/B] ≈ E[A]/E[B] (see Corollary C of S1 Appendix). When we state to first order, this is the approximation we are making and the region of interest, near the point (E[A], E[B]), unless otherwise specified.
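One standard way to see where this first-order statement comes from is a Taylor expansion of the ratio. The sketch below is a generic argument rather than a reproduction of Corollary C, with the shorthand a = E[A] and b = E[B] introduced here for readability.

```latex
% First-order Taylor expansion of f(A, B) = A / B around (a, b),
% with a = E[A] and b = E[B] (shorthand assumed for this sketch).
\frac{A}{B} \approx \frac{a}{b} + \frac{1}{b}\,(A - a) - \frac{a}{b^{2}}\,(B - b)

% Taking expectations, the two linear terms vanish because
% E[A - a] = 0 and E[B - b] = 0, leaving
\mathbb{E}\!\left[\frac{A}{B}\right] \approx \frac{\mathbb{E}[A]}{\mathbb{E}[B]}
```

The neglected second-order terms involve the variance of B and the covariance of A and B, which is why the approximation relies on both variables staying close to their expected values.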

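Here, we outline the birth–death derivation of the SFS, leaving the details for Proposition D of S1 Appendix. Let R (for “reaction”) be a random variable equal to 1 when a division event takes place (so that P(R = 1) = β) and –1 when a death event occurs (P(R = −1) = δ). By conditioning on these two outcomes and multiplying their probabilities of occurrence, the law of total expectation provides the corresponding recurrence relation to Eq (2):

$$\mathbb{E}[S_{j,i+1}] = \mathbb{E}\left[S_{j,i} + \mathbb{1}_{\{R=1\}}\left(\frac{(j-1)\,S_{j-1,i} - j\,S_{j,i}}{N_i} + 2\mu\,\delta_{1,j}\right) + \mathbb{1}_{\{R=-1\}}\,\frac{(j+1)\,S_{j+1,i} - j\,S_{j,i}}{N_i}\right],$$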

where we have used the indicator function 1_A to be 1 on the set A and 0 elsewhere. Since R is independent from the other random variables at play, the expectations of the indicator functions become β and δ, respectively.

By expanding to first order, we can make the ansatz E[S_j,i] = E[N_i]Q_j, as in the pure-birth case. After some rearranging, this produces a homogeneous second-order recurrence relation of the form a_j Q_{j+1} + b_j Q_j + c_j Q_{j−1} = 0, for some linear functions a_j, b_j and c_j of j. Solutions of such recurrence relations are known [38], and we obtain the following first-order approximation, which matches the result from [22]:

(3)

where all expectations are conditioned on non-extinction of the whole population [22]. In the limit of low death, this approximation is sound, as then the variance in population size is small (see Proposition F of S1 Appendix for details).

Division distribution

The expected division distribution in the pure-birth case can be obtained in a similar manner as the site frequency spectrum. The probability of selecting a cell with ℓ divisions in its history is D_ℓ,i/(i + 1): that is, the number of cells that have divided ℓ times, divided by the total population. The selected cell then no longer contributes to the count of cells with ℓ divisions, dividing into two cells with one more division in their history than before. The law of total expectation then becomes

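$$\mathbb{E}[D_{\ell,i+1}] = \mathbb{E}\left[D_{\ell,i} - \frac{D_{\ell,i}}{i+1} + \frac{2\,D_{\ell-1,i}}{i+1}\right] \tag{4}$$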

which can be solved to recover the result from [32]:

$$\mathbb{E}[D_{\ell,i}] = \frac{2^{\ell}}{i!} \begin{bmatrix} i \\ \ell \end{bmatrix} \tag{5}$$

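where the unsigned Stirling numbers of the first kind are defined by the relation

$$\begin{bmatrix} i+1 \\ \ell \end{bmatrix} = i \begin{bmatrix} i \\ \ell \end{bmatrix} + \begin{bmatrix} i \\ \ell-1 \end{bmatrix}$$

with boundary conditions [0, 0] = 1 and [i, 0] = [0, ℓ] = 0 if i > 0 or ℓ > 0, writing [i, ℓ] inline for the bracket symbol above. Eq (5) can be substituted into Eq (4) to show that it satisfies the desired recurrence relation; by uniqueness of solutions that agree with boundary conditions, we have the result. Distributions generated by stochastic simulations using a Gillespie algorithm (see Methods) agree well with this expression (see Fig 3).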
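The Stirling-number expression for the pure-birth DD can be evaluated exactly with integer arithmetic. The sketch below assumes the closed form E[D_ℓ,i] = 2^ℓ [i, ℓ]/i!, with [i, ℓ] the unsigned Stirling numbers of the first kind built from their standard recurrence; the function names are chosen here for illustration.

```python
from fractions import Fraction
from math import factorial

def stirling_first_unsigned(n_max):
    """Table c[n][l] of unsigned Stirling numbers of the first kind."""
    c = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    c[0][0] = 1
    for n in range(n_max):
        for l in range(1, n + 2):
            # Standard recurrence: c(n+1, l) = n * c(n, l) + c(n, l-1).
            c[n + 1][l] = n * c[n][l] + c[n][l - 1]
    return c

def expected_dd_pure_birth(i):
    """Exact E[D_l] after i divisions, as Fractions, for l = 0 .. i."""
    c = stirling_first_unsigned(i)
    return [Fraction(2 ** l * c[i][l], factorial(i)) for l in range(i + 1)]

dd = expected_dd_pure_birth(10)
population = sum(dd)  # the DD partitions the population size
mean_divisions = sum(l * d for l, d in enumerate(dd)) / population
```

Using `Fraction` avoids overflow and rounding issues for moderate i; for large i a floating-point iteration of the recurrence is likely more practical.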

Fig 3. Expected division distribution for a pure-birth process matches stochastic simulation results.

Average (solid dark pink line) of 200 simulation realisations (representatives in solid pale pink lines) of the division distribution (DD) for a pure-birth process up to final population size N = 10³, along with the predicted expected DD obtained from Eq (5) (dashed black line).

https://doi.org/10.1371/journal.pcbi.1013241.g003

We outline the birth–death derivation of the DD, leaving the details to Proposition E of S1 Appendix. Again, for R the random variable denoting division or death, the recurrence relation corresponding to Eq (4) can be written

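$$\mathbb{E}[D_{\ell,i+1}] = \mathbb{E}\left[D_{\ell,i} + \mathbb{1}_{\{R=1\}}\,\frac{2\,D_{\ell-1,i} - D_{\ell,i}}{N_i} - \mathbb{1}_{\{R=-1\}}\,\frac{D_{\ell,i}}{N_i}\right] \tag{6}$$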

As with the SFS, we can expand to first order: E[D_ℓ,i/N_i] ≈ E[D_ℓ,i]/E[N_i]; this is valid near (E[D_ℓ,i], E[N_i]), so the expansion will hold when D_ℓ,i and N_i are close to their expected values. We also make the linear approximation E[N_i] ≈ (β − δ)i + 1, valid for low death; see Fig A of S1 Appendix and surrounding remarks for details. This allows us to rewrite an ansatz, which can be shown to satisfy Eq (6), into the following neat first-order approximation for the birth–death DD:

(7)

Here, it is evident that the division distribution partitions the population size (since summing the fraction on the right-hand side over ℓ gives unity) and that this partitioning is orchestrated by the functions .

Mutational burden distribution

The single-cell mutational burden distribution differs from the division distribution because there is an additional stochasticity at each cell division due to mutational (Poisson) distributions. To obtain an expected MBD from a DD, we can employ a procedure to introduce this stochasticity as follows: each cell contributing to a division burden D_ℓ,i will have undergone ℓ divisions, so will have acquired U_1 + … + U_ℓ mutations, where U_p represents the number of mutations acquired during the cell’s pth division. Since the Poisson distribution is additive, this sum is in turn a Poisson-distributed random variable with mean ℓμ. The left-hand side of Fig 4a qualitatively depicts the elements of the DD being converted into Poisson probability mass functions associated to these sums of Poisson variables. These are then summed to obtain the MBD, shown on the right-hand side of Fig 4a, in the following manner.

Fig 4. Conversion of the division distribution into the single-cell mutational burden distribution.

(a) The elements of a DD are translated into Poisson distributions with means ℓμ, weighted by D_ℓ,i (such that the sum over each coloured distribution, teal or cyan for example, equals the corresponding element of the DD), and then summed to obtain the corresponding MBD. Note that the mean of the resulting MBD is μ times the mean of the DD. (b) Average (solid dark pink line) of 200 simulation realisations (representatives in solid pale pink lines) of the DD for a pure-birth process up to final population size N = 10⁴ with mutational mean μ. (c) Average (solid dark blue line) of the MBD for the same simulation realisations as (b) (representatives in solid pale blue lines), along with the MBD obtained from converting the average DD as explained in (a) and the main text (dashed red line).

https://doi.org/10.1371/journal.pcbi.1013241.g004

Writing U_p,q for the number of mutations acquired during the pth division of the qth cell (for some labelling of cells q = 1, …, D_ℓ,i) having undergone ℓ divisions, we can sum over the elements of the DD labelled by ℓ to obtain

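$$\mathbb{E}[B_{k,i}] = \mathbb{E}\left[\sum_{\ell=0}^{i} \sum_{q=1}^{D_{\ell,i}} \mathbb{1}\left\{\sum_{p=1}^{\ell} U_{p,q} = k\right\}\right] \tag{8}$$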

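Now, using the linearity of expectation, that i is fixed and just an index, and the independence of the random variables D_ℓ,i and U_p,q, the right-hand side of Eq (8) becomes

$$\sum_{\ell=0}^{i} \mathbb{E}[D_{\ell,i}]\;\mathbb{P}\left(\sum_{p=1}^{\ell} U_{p,1} = k\right).$$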

Finally, substituting the expression in Eq (5) for the expected DD and the probability mass function for the Poisson distribution with mean ℓμ, we find the pure-birth expected MBD:

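$$\mathbb{E}[B_{k,i}] = \sum_{\ell=0}^{i} \frac{2^{\ell}}{i!} \begin{bmatrix} i \\ \ell \end{bmatrix} \frac{(\ell\mu)^{k}\,e^{-\ell\mu}}{k!} \tag{9}$$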

Fig 4b and 4c verify, with simulations, the conversion from a DD to an MBD described in the previous discussion.
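The conversion from a DD to an MBD amounts to a Poisson mixture weighted by the DD. The sketch below is one possible implementation under stated assumptions: the expected pure-birth DD is obtained by iterating the recurrence E[D_ℓ,i+1] = (1 − 1/(i + 1))E[D_ℓ,i] + 2E[D_ℓ−1,i]/(i + 1) in floating point, and each element is then spread over a Poisson distribution with mean ℓμ. Function names are hypothetical.

```python
import math
import numpy as np

def expected_dd(i_max):
    """E[D_l] for l = 0 .. i_max after i_max pure-birth divisions."""
    d = np.zeros(i_max + 1)
    d[0] = 1.0                       # the undivided progenitor
    for i in range(i_max):
        n = i + 1                    # population size before the division
        new = d * (1 - 1 / n)        # the chosen cell leaves its class
        new[1:] += 2 * d[:-1] / n    # two daughters with one more division
        d = new
    return d

def poisson_pmf(k, lam):
    """Poisson probability mass function, computed in log space."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def expected_mbd(mu, i_max, k_max):
    """Mixture of Poisson(l * mu) distributions weighted by E[D_l]."""
    d = expected_dd(i_max)
    return np.array([
        sum(d[l] * poisson_pmf(k, l * mu) for l in range(len(d)))
        for k in range(k_max + 1)
    ])

mu, i_max = 2.0, 50
dd = expected_dd(i_max)
mbd = expected_mbd(mu, i_max, k_max=400)
mean_dd = (np.arange(len(dd)) * dd).sum() / dd.sum()
mean_mbd = (np.arange(len(mbd)) * mbd).sum() / mbd.sum()
```

In this sketch `mean_mbd` comes out as `mu * mean_dd`, and only the mean μ enters through the Poisson mixture; `k_max` must be large enough that the truncated tail is negligible.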

The same conversion procedure can be implemented in the birth–death case. Again, working with first-order approximations, the expression in Eq (7) for the expected birth–death DD can be used instead of the pure-birth expression in Eq (5) during the final step to obtain a first-order approximation of the expected birth–death MBD.

Finally, consider the number of mutational occurrences: that is, the sum of the entries of the mutational matrix Y_i or, equivalently, either side of Eq (1). If this quantity is divided by the number of mutations M_i, we obtain the mean of the SFS; and if it is divided by the population N_i, we obtain the mean of the MBD. We can derive the expected number of mutational occurrences using our recurrence relation approach, from which we deduce that this mean, representing the expected mutational burden of a cell, grows logarithmically with the step count i (see Propositions G and H of S1 Appendix). In the pure-birth case, it is simply a rescaling of the harmonic numbers.
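The harmonic-number statement can be illustrated with a short calculation. The sketch below is not the derivation of Propositions G and H; it assumes the pure-birth expected SFS takes the form E[S_j,i] = 2μ(i + 1)/(j(j + 1)) for 1 ≤ j ≤ i, and writes H_n for the nth harmonic number.

```latex
% Expected number of mutational occurrences in the pure-birth case,
% assuming E[S_{j,i}] = 2\mu (i+1) / (j (j+1)) for 1 <= j <= i.
\sum_{j=1}^{i} j\,\mathbb{E}[S_{j,i}]
  = 2\mu\,(i+1) \sum_{j=1}^{i} \frac{1}{j+1}
  = 2\mu\,(i+1)\,\bigl(H_{i+1} - 1\bigr)

% Dividing by the deterministic population size N_i = i + 1 gives the
% expected mutational burden of a cell, which grows like 2\mu \ln i.
\frac{1}{i+1} \sum_{j=1}^{i} j\,\mathbb{E}[S_{j,i}]
  = 2\mu\,\bigl(H_{i+1} - 1\bigr)
```

For example, with i = 2 this gives 2μ(1/2 + 1/3) = 5μ/3, which matches μ times the mean of the expected DD (one cell with one division and two cells with two divisions, averaged over three cells).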

Discussion

The distribution of genetic mutations in cell populations has been studied both in the cases of constant [17,19,37,39] and growing populations [12,21,40–45]. With the development of single-cell sequencing technologies, exploration of more precise information in single cells is sure to follow in the footsteps of population-level research [23,24,46]. At the population level, both site frequency spectra (SFS) and overall tumour mutational burden (TMB) have been investigated analytically [12,18–22]. Here we focus on the single-cell distribution of the latter (the single-cell mutational burden distribution, or MBD), and use the foundation of the SFS to better understand the MBD analytically.

A new framework uniting the SFS and the MBD is presented, relying on a simple procedure: dynamical matrices store the mutational information of a population of cells, whose size is dictated by a birth–death process. Our approach of encoding the data in binary matrices, where the entry (n,m) is 1 when cell n has mutation m and 0 otherwise, naturally emerges from the (neutral) evolution-motivated idea wherein a cell is identified by its mutation load [37]. Two different ways of partitioning the entries of this mutational matrix provide definitions of both the SFS and the MBD as histograms of the row- and column-sums, respectively, as shown in Fig 2. With this symmetry in mind, which gives rise to Eq (1), an identical analytical approach depending on the discrete-time Markov nature of the model can be applied to both cases, along with an intermediary case of the division distribution (DD), to obtain recurrence relations for the distributions of interest: we employ the law of total expectation to write the expected value of a quantity of interest in terms of expected values at the previous time step. These recurrences are solved exactly in the pure-birth case and approximately in the birth–death case, giving rise to analytical predictions for the SFS, DD and MBD, which are compared to stochastic simulations as well as previous work on the SFS and the DD.

Indeed, in Propositions D and H of S1 Appendix, we recover the expected values of the SFS and TMB derived by Gunnarsson et al. [22] (their Propositions 2 and 3). Our stochastic-time first-order approximation in Eq (3) matches theirs from the stochastic-population scenario with a fixed elapsed time in the large-population limit, where the regimes coincide according to their convergence analysis [22]. Our derivation for the pure-birth DD in Eq (5) recovers a result from previous work on phylogenetic trees produced by Yule processes: combinatorics results relating to binary search trees [32] were then applied to the phylogenetic context [31].

The reverse-time coalescent approach supplies complementary tools to branching processes (though more often compared to the continuous-time setting [47]), which can also provide information on the summary statistics we discuss here [48]. Coalescent theory allows one to reconstruct phylogenetic trees that describe genetic information found in individuals sampled at present. Phylogenetic tree branch lengths can then be used to determine the mutations accumulated during this time interval, along with informing the population growth rate [47]. For example, Popovic [49] introduced the coalescent point process (CPP), which reconstructs phylogenetic trees using independent and identically-distributed coalescent times for a sample of individuals; when populations’ genealogies can be represented with this formalism, their population size satisfies a geometric distribution [49]. Lambert [50] used the CPP to derive the expected SFS under certain conditions and later integrated the effects of sampling: both Lambert and Stadler [51] and Lambert [52] derived distributions of node depths of a sample within a phylogenetic tree. This relates to our DD, although, again, in a differently-conditioned, continuous-time process. Still more recently, Schweinsberg and Shuai [47] recovered the supercritical SFS result of Durrett [20] (that is, the form in Eq (3)) using CPPs; with Johnson and Curtius, they applied these results to haematopoietic data to infer growth rates of clones with one or multiple driver mutations [53]. While phylogenetic trees inferred using coalescent theory can be mapped to DDs (or directly to the MBD, if every division event corresponds to a point of coalescence), to our knowledge the triple connection between the SFS, the DD and the MBD has not yet been made in either the coalescent or the branching process literature.

When comparing the theoretical expected distributions discussed here with experimental data, two further factors come into play: noise and sampling. The impact of noise on bulk whole-genome or exome sequencing data has been investigated at length [12,54]; there, increasing the depth of coverage in sequencing and filtering out possible false mutations supported by single reads help to reduce noise. However, single-cell DNA sequencing faces much higher levels of noise due to the limited amount of DNA in single cells and the consequently high amplification errors and bias generated in multiple polymerase chain reactions (PCR) [55–57]. While bioinformatic tools have been developed to handle the noise in calling mutations from single-cell DNA sequencing data [57], obtaining reliable single-cell MBDs directly from such data remains challenging. Consequently, constructing single-cell phylogenies and DDs can also be difficult when using single-cell data generated by PCR-based sequencing technologies. While a wide application of the theoretical tools developed herein would rely on improved technologies that generate more reliable data, a few designed experiments already provide robust “single-cell” MBDs through whole-genome sequencing of single cell-derived colonies [4,58]. Sequencing errors are avoided by using only the clonal, high-frequency mutations in each single cell-derived colony, which correspond to the private mutations of that colony’s single ancestral cell. Pooling those mutations across all single cell-derived colonies from the same donor yields reliable “single-cell” mutational burden distributions, with the sample size equal to the number of sequenced colonies.

Next, the effect of the sampling size on the SFS is well-documented: following the aforementioned coalescent approach of Lambert [50], Dinh et al. [48] derived hypergeometric terms in the SFS under sampling. In addition, Durrett (see Theorem 3 of [21]) approximated the impact of sampling on the SFS, an approach that Stein and Werner [59] have recently used to model cancer treatment and its impact on genetic heterogeneity within a tumour. The MBD, however, does not suffer from the same sampling distortions as the SFS: Moeller, Mon Père et al. [19] demonstrated that the MBD provides a way of inferring evolutionary parameters regardless of sample size. They showed that sampling increases the noise, resulting in higher errors, but that the expectations provided by the inferences remain unchanged [19].
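As a rough illustration of how hypergeometric terms can enter the SFS under sampling, the sketch below maps a full-population SFS onto the expected SFS of a uniform sample drawn without replacement. The function name `expected_sampled_sfs` and the list-based layout of the SFS are assumptions made for this example; they are not the notation or the derivation of [48] or [50].

```python
from math import comb


def expected_sampled_sfs(sfs, n):
    """Expected SFS of n cells sampled uniformly without replacement.

    sfs[k - 1] is the number of mutations present in exactly k of the
    N = len(sfs) cells of the full population (hypothetical layout).
    Entry j - 1 of the result is the expected number of mutations found
    in exactly j of the n sampled cells.
    """
    N = len(sfs)
    if not 0 < n <= N:
        raise ValueError("sample size must satisfy 0 < n <= N")
    total = comb(N, n)
    sampled = [0.0] * n
    for k, count in enumerate(sfs, start=1):
        if count == 0:
            continue
        for j in range(1, min(k, n) + 1):
            # Hypergeometric probability that a mutation carried by k of
            # the N cells is seen in exactly j of the n sampled cells.
            p = comb(k, j) * comb(N - k, n - j) / total
            sampled[j - 1] += count * p
    return sampled
```

Sampling the whole population (n equal to N) returns the original spectrum, while smaller samples shift weight towards lower frequencies and lose mutations that happen not to be sampled at all.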

Our analysis holds for a single clone. Neutral subclones can be identified in our dynamic matrices as follows: rows (corresponding to cells) with an entry of 1 in a particular column (corresponding to a given mutation m) form a subclone, all of whose cells possess mutation m. We can therefore extract, from the matrices, summary statistics for that subclone. To see that we can recover the expected analytical distribution, consider the following argument. Each division event in the branching process gives rise to two new, identical processes, with the daughter cells acting as progenitors, and their mutations already clonal within their sub-processes. Let m be the mutation that defines a subclone (that is, all members of the subclone possess mutation m); the number of sites at which this mutation occurs in the population is the population size of the subclone. We then simply modify our expected distributions by conditioning on the population size being that of the subclone rather than the total population size (up to possible sampling effects, as previously discussed). For example, by estimating the number of mutations M₀ possessed by the progenitor cell of the subclone, we can shift the MBD correspondingly, increasing all mutational burdens by M₀.
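The row-and-column selection described above can be sketched in a few lines. This is a minimal illustration, assuming the dynamic matrix is stored as a list of 0/1 rows (one row per cell, one column per mutation); the function name `subclone_statistics` and the returned dictionary keys are hypothetical and do not come from S1 Appendix.

```python
from collections import Counter


def subclone_statistics(matrix, m):
    """Summary statistics of the subclone defined by mutation column m.

    matrix is a list of rows (one per cell) of 0/1 entries (one per
    mutation); this layout is an assumption made for illustration.
    """
    # Cells (rows) carrying mutation m form the subclone.
    rows = [row for row in matrix if row[m] == 1]
    size = len(rows)
    # Mutational burden of each subclone cell: its row sum.
    burdens = [sum(row) for row in rows]
    # Number of subclone cells carrying each mutation: column sums.
    n_mut = len(matrix[0]) if matrix else 0
    col_counts = [sum(row[j] for row in rows) for j in range(n_mut)]
    # SFS within the subclone: how many mutations occur in exactly k cells.
    sfs = Counter(c for c in col_counts if c > 0)
    # Mutations shared by every subclone cell, mutation m included.
    clonal = sfs.get(size, 0) if size else 0
    return {"size": size, "burdens": burdens, "sfs": dict(sfs), "clonal": clonal}


# Toy matrix: three cells, four mutations; mutation 1 defines a two-cell subclone.
example = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
stats = subclone_statistics(example, 1)
```

In this toy case the subclone has two cells with burdens 2 and 3, and two mutations are shared by both cells. The count of such shared mutations could serve as a crude stand-in for M₀, although that reading is an assumption of this sketch rather than a procedure taken from the analysis.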

A natural extension of our work is to consider clonal competition, where different subclones have different fitnesses. In the cancer context, this might correspond to a subclone that is resistant to treatment; see, for example, [59]. While the two-type branching process, i.e. a branching process containing a wild-type and a differently-fit mutant type, has been solved by Antal and Krapivsky [60,61], their methods do not allow for the accrual of many neutral mutations, as is the goal of our analysis. Here, we can include clonal competition by labelling cells with an index n, where wild-type cells are characterised by n = 0 and mutants have n = 1. We would thus allow transitions from n = 0 to n = 1 during division events and let birth and death probabilities be type-dependent, so that the clones have different fitnesses. Our approach then produces recurrence relations coupled via n, whose solutions are not tractable with the current methods (see S1 Appendix for further discussion).

Intuitively, we would think that the explicit single-cell MBD results from both the DD and the extra stochasticity arising from the mutational distribution at each past cell division (the internal nodes of the cell lineage tree). Surprisingly, we found that the latter nodal stochasticity does not play a large role in the MBD. While there is certainly higher variance in the MBD than in the DD, as evidenced by Fig 4b and 4c, the shapes of the two distributions remain similar and we can construct the MBD based on the DD and the mean number of mutations acquired per cell in each past cell division. The derivation from Eq (8) to Eq (9) demonstrates that only the mean of the mutational distribution matters when obtaining the MBD, rather than its higher moments. We further tested this conclusion by applying mutational distributions other than the Poisson in stochastic simulations, which led to the same predicted MBD, as shown in Fig B of S1 Appendix. Employing the binning procedure described in Eq (A27) of S1 Appendix allows us to retrace our steps from the MBD to the DD, which reinforces that it is only the mean of the mutational distribution that is of critical importance to the shape of the MBD, not the exact form of the distribution. By considering the variances of the two distributions, we note that the variance in the single-cell MBD itself is growing while that of the mutational distribution is fixed. We thus expect that after sufficient events, the former will dominate.

We showed that the expected mutational burden for an arbitrary cell in a population (the mean of the MBD) increases logarithmically with the step count i in our model (see Propositions G and H of S1 Appendix). In Moeller, Mon Père et al.’s continuous-time framework, this mean is shown to be the product of the expected number of divisions in the cell’s past and the mutational mean [19], much as we have argued in Fig 4 for our conversion from the DD to the MBD. Under their intuitive assumption of mutation burdens arising from a compound Poisson distribution, the variance of the MBD is dependent on the means and the variances of the DD and the mutational (Poisson) distribution [19], whereas our derivations and simulations show that only the mean of the mutational distribution plays a significant role, not its higher moments.
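For reference, the compound-sum moments invoked in this comparison follow from the laws of total expectation and total variance. The symbols below are assumed for illustration and may differ from those of [19] or S1 Appendix: D is the number of divisions in a cell's past, X_k the number of mutations gained at the k-th of those divisions (independent, identically distributed and independent of D), B the resulting burden, and μ the mutational mean.

```latex
B = \sum_{k=1}^{D} X_k, \qquad
\mathbb{E}[B] = \mathbb{E}[D]\,\mathbb{E}[X], \qquad
\operatorname{Var}(B) = \mathbb{E}[D]\,\operatorname{Var}(X)
                      + \operatorname{Var}(D)\,\mathbb{E}[X]^{2}.

% Poisson special case, X_k \sim \mathrm{Poisson}(\mu):
\mathbb{E}[B] = \mu\,\mathbb{E}[D], \qquad
\operatorname{Var}(B) = \mu\,\mathbb{E}[D] + \mu^{2}\operatorname{Var}(D).
```

In this form, the variance of the mutational distribution enters only through the first term of Var(B), whereas its mean appears in both the expectation and the second term.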

Knowledge of the connection between the DD and the MBD also provides a means of evaluating the divisions in a cell’s history. By reversing the argument in Fig 4, MBD data can provide the distribution of divisional histories in a cell population, without resorting to direct measurements (for example, via telomere shortening [62]).

While single-cell sequencing is still in its adolescence, grappling with hurdles such as trade-offs between sequencing noise, sample size and cost [63,64], there is a growing need for, and a theoretical gap in, mathematical and computational machinery to handle the vast quantities of data being produced [23,46]. Our model serves as a new framework to integrate single-cell and bulk information, and shows how various distributions of accumulated mutations are linked through the same stochastic process.

Methods

Besides the analysis described in the Results section, we employed a modified Gillespie algorithm to stochastically simulate our system and verify our expressions [65]. The original Gillespie formulation is used to simulate (in a statistically exact manner) continuous-time reactions that have specified rates within one or multiple populations. Rather than independently drawing an exponentially-distributed random number for each reaction (here the reactions would be birth and death within the single population of cells), the Gillespie algorithm leverages the fact that the time until the first reaction occurs is also exponentially-distributed, with rate equal to the sum of the rates of all of the reactions. The algorithm evolves by drawing one such number, then randomly selecting (proportional to their rates) which reaction takes place at that time, before updating the populations.
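The original continuous-time scheme described above can be sketched as follows for a single population with birth and death reactions. This is a minimal illustration rather than the simulation code used for this work; the function name `gillespie_birth_death` and its parameters (per-cell rates, initial size, time horizon) are assumptions made for the example.

```python
import random


def gillespie_birth_death(birth_rate, death_rate, n0, t_max, rng=None):
    """Continuous-time birth-death trajectory via the Gillespie algorithm.

    birth_rate and death_rate are per-cell rates (assumed parameters).
    Returns a list of (time, population size) pairs up to time t_max.
    """
    if rng is None:
        rng = random.Random()
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while n > 0:
        birth_total = birth_rate * n
        death_total = death_rate * n
        total = birth_total + death_total
        if total <= 0:
            break
        # Waiting time to the next reaction: exponential with the summed rate.
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Select which reaction fires, proportionally to its rate.
        if rng.random() * total < birth_total:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory
```

Only two random numbers are drawn per event: one for the waiting time and one for the choice of reaction.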

In our discrete-time model, we need only draw the second of these random numbers, determining whether a birth or death occurs at the given step. The cell that is dividing or dying is then (uniformly, since mutations are neutral) randomly selected, replicating itself or being removed from the system, respectively. If the event was a birth, new mutations are added to the two daughter cells according to the mutational distribution considered (we use a Poisson distribution unless otherwise mentioned, such as in S1 Appendix).
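The discrete-time steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the names `poisson`, `simulate` and `distributions` are hypothetical, `birth_prob` is taken to be the probability that a given step is a birth, each daughter is assumed to receive its own independent Poisson draw, and the process starts from a single mutation-free cell with no conditioning on survival.

```python
import math
import random
from collections import Counter


def poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small means)."""
    limit, k, p = math.exp(-mean), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(steps, birth_prob, mean_mutations, rng=None):
    """Discrete-time birth-death process started from one mutation-free cell.

    Each cell is stored as (divisions, burden). Returns the cells alive
    after `steps` events, or earlier if the population goes extinct.
    """
    if rng is None:
        rng = random.Random()
    cells = [(0, 0)]
    for _ in range(steps):
        if not cells:
            break  # extinction
        # Decide whether this step is a birth or a death.
        is_birth = rng.random() < birth_prob
        # Pick the affected cell uniformly (mutations are neutral).
        divisions, burden = cells.pop(rng.randrange(len(cells)))
        if is_birth:
            # Each daughter inherits the mother's record and gains new mutations.
            for _ in range(2):
                new = poisson(mean_mutations, rng)
                cells.append((divisions + 1, burden + new))
        # On a death, the removed cell is simply not replaced.
    return cells


def distributions(cells):
    """Division distribution and mutational burden distribution as counts."""
    dd = Counter(d for d, _ in cells)
    mbd = Counter(b for _, b in cells)
    return dd, mbd
```

For instance, `distributions(simulate(1000, 0.8, 2.0, random.Random(1)))` returns one realisation of the DD and the MBD; averaging over many seeds would approximate the expected distributions.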

Supporting information

S1 Appendix. Further mathematical proofs and discussion.

https://doi.org/10.1371/journal.pcbi.1013241.s001

(PDF)

Acknowledgments

We thank Tibor Antal, Sabin Lessard, Nathaniel Mon Père and Alexander Stein for fruitful discussions and two reviewers for their suggestions that improved the text.

References

1. Weinberg RA. The biology of cancer. Garland Science; 2013.
2. Reusch TBH, Baums IB, Werner B. Evolution via somatic genetic variation in modular species. Trends Ecol Evol. 2021;36(12):1083–92. pmid:34538501
3. Bae T, Tomasini L, Mariani J, Zhou B, Roychowdhury T, Franjic D, et al. Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science. 2018;359(6375):550–5. pmid:29217587
4. Lee-Six H, Øbro NF, Shepherd MS, Grossmann S, Dawson K, Belmonte M, et al. Population dynamics of normal human blood inferred from somatic mutations. Nature. 2018;561(7724):473–8. pmid:30185910
5. Werner B, Case J, Williams MJ, Chkhaidze K, Temko D, Fernández-Mateos J, et al. Measuring single cell divisions in human tissues from multi-region sequencing data. Nat Commun. 2020;11(1):1035. pmid:32098957
6. Frank SA, Nowak MA. Problems of somatic mutation and cancer. Bioessays. 2004;26(3):291–9. pmid:14988930
7. Komarova NL. Cancer, aging and the optimal tissue design. Semin Cancer Biol. 2005;15(6):494–505. pmid:16143543
8. Burrell RA, McGranahan N, Bartek J, Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013;501(7467):338–45. pmid:24048066
9. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proc Natl Acad Sci U S A. 2010;107(43):18545–50. pmid:20876136
10. Ling S, Hu Z, Yang Z, Yang F, Li Y, Lin P, et al. Extremely high genetic diversity in a single tumor points to prevalence of non-Darwinian cell evolution. Proc Natl Acad Sci U S A. 2015;112(47):E6496-505. pmid:26561581
11. Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, Zhao J, et al. A Big Bang model of human colorectal tumor growth. Nat Genet. 2015;47(3):209–16. pmid:25665006
12. Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Identification of neutral tumor evolution across cancer types. Nat Genet. 2016;48(3):238–44. pmid:26780609
13. Park S-C, Krug J. Clonal interference in large populations. Proc Natl Acad Sci U S A. 2007;104(46):18135–40. pmid:17984061
14. Karlsson K, Przybilla MJ, Kotler E, Khan A, Xu H, Karagyozova K, et al. Deterministic evolution and stringent selection during preneoplasia. Nature. 2023;618(7964):383–93. pmid:37258665
15. Ewens WJ. The pseudo-transient distribution and its uses in genetics. J Appl Prob. 1964;1(1):141–56.
16. Kimura M. Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles. Genet Res. 1968;11(3):247–69. pmid:5713805
17. Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995;48(2):172–97. pmid:7482370
18. Bozic I, Gerold JM, Nowak MA. Quantifying clonal and subclonal passenger mutations in cancer evolution. PLoS Comput Biol. 2016;12(2):e1004731. pmid:26828429
19. Moeller ME, Mon Père NV, Werner B, Huang W. Measures of genetic diversification in somatic tissues at bulk and single-cell resolution. Elife. 2024;12:RP89780. pmid:38265286
20. Durrett R. Population genetics of neutral mutations in exponentially growing cancer cell populations. Ann Appl Probab. 2013;23(1):230–50. pmid:23471293
21. Durrett R. Branching process models of cancer. In: Durrett R, editor. Branching process models of cancer. Cham: Springer; 2015. p. 1–63.
22. Gunnarsson EB, Leder K, Foo J. Exact site frequency spectra of neutrally evolving tumors: a transition between power laws reveals a signature of cell viability. Theor Popul Biol. 2021;142:67–90. pmid:34560155
23. Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013;14(9):618–30. pmid:23897237
24. Wang Y, Navin NE. Advances and applications of single-cell sequencing technologies. Mol Cell. 2015;58(4):598–609. pmid:26000845
25. Abascal F, Harvey LMR, Mitchell E, Lawson ARJ, Lensing SV, Ellis P, et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021;593(7859):405–10. pmid:33911282
26. Chalmers ZR, Connelly CF, Fabrizio D, Gay L, Ali SM, Ennis R, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med. 2017;9(1):34. pmid:28420421
27. Fernandez EM, Eng K, Beg S, Beltran H, Faltas BM, Mosquera JM, et al. Cancer-specific thresholds adjust for whole exome sequencing-based tumor mutational burden distribution. JCO Precis Oncol. 2019;3:PO.18.00400. pmid:31475242
28. Martínez-Pérez E, Molina-Vila MA, Marino-Buslje C. Panels and models for accurate prediction of tumor mutation burden in tumor samples. NPJ Precis Oncol. 2021;5(1):31. pmid:33850256
29. Derényi I, Demeter MC, Pérez-Jiménez M, Grajzel D, Szöllősi GJ. How mutation accumulation depends on the structure of the cell lineage tree. Phys Rev E. 2024;109(4–1):044407. pmid:38755817
30. Page RD, Holmes EC. Molecular evolution: a phylogenetic approach. Wiley; 2009.
31. Steel M, McKenzie A. Properties of phylogenetic trees generated by Yule-type speciation models. Math Biosci. 2001;170(1):91–112. pmid:11259805
32. Lynch WC. More combinatorial properties of certain trees. Comput J. 1965;7(4):299–302.
33. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. pmid:5364968
34. Kuipers J, Jahn K, Raphael BJ, Beerenwinkel N. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 2017;27(11):1885–94. pmid:29030470
35. Cheek D, Antal T. Mutation frequencies in a birth–death branching process. Ann Appl Probab. 2018;28(6).
36. Cheek D, Antal T. Genetic composition of an exponentially growing cell population. Stochast Process Appl. 2020;130(11):6580–624.
37. Ronen R, Udpa N, Halperin E, Bafna V. Learning natural selection from the site frequency spectrum. Genetics. 2013;195(1):181–93. pmid:23770700
38. Deijfen M, Lindholm M. Growing networks with preferential deletion and addition of edges. Phys A: Statist Mech Appl. 2009;388(19):4297–303.
39. Durrett R. Probability models for DNA sequence evolution. 2nd ed. New York, NY: Springer.
40. Simons BD. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis. Proc Natl Acad Sci U S A. 2016;113(1):128–33. pmid:26699486
41. Loeb LA, Kohrn BF, Loubet-Senear KJ, Dunn YJ, Ahn EH, O’Sullivan JN, et al. Extensive subclonal mutational diversity in human colorectal cancer and its significance. Proc Natl Acad Sci U S A. 2019;116(52):26863–72. pmid:31806761
42. Watson CJ, Papula AL, Poon GYP, Wong WH, Young AL, Druley TE, et al. The evolutionary dynamics and fitness landscape of clonal hematopoiesis. Science. 2020;367(6485):1449–54. pmid:32217721
43. Poon GYP, Watson CJ, Fisher DS, Blundell JR. Synonymous mutations reveal genome-wide levels of positive selection in healthy tissues. Nat Genet. 2021;53(11):1597–605. pmid:34737428
44. Tung H-R, Durrett R. Signatures of neutral evolution in exponentially growing tumors: a theoretical perspective. PLoS Comput Biol. 2021;17(2):e1008701. pmid:33571199
45. Kurpas MK, Kimmel M. Modes of selection in tumors as reflected by two mathematical models and site frequency spectra. Front Ecol Evol. 2022;10:889438. pmid:37333691
46. Cho H, Kuo Y-H, Rockne RC. Comparison of cell state models derived from single-cell RNA sequencing data: graph versus multi-dimensional space. Math Biosci Eng. 2022;19(8):8505–36. pmid:35801475
47. Schweinsberg J, Shuai Y. Asymptotics for the site frequency spectrum associated with the genealogy of a birth and death process. Ann Appl Probab. 2025;35(1).
48. Dinh KN, Jaksik R, Kimmel M, Lambert A, Tavaré S. Statistical Inference for the Evolutionary History of Cancer Genomes. Statist Sci. 2020;35(1).
49. Popovic L. Asymptotic genealogy of a critical branching process. Ann Appl Probab. 2004;14(4).
50. Lambert A. The allelic partition for coalescent point processes. Markov Process Relat Fields. 2009;15(3):359–86.
51. Lambert A, Stadler T. Birth-death models and coalescent point processes: the shape and probability of reconstructed phylogenies. Theor Popul Biol. 2013;90:113–28. pmid:24157567
52. Lambert A. The coalescent of a sample from a binary branching process. Theor Popul Biol. 2018;122:30–5. pmid:29704514
53. Johnson B, Shuai Y, Schweinsberg J, Curtius K. cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory. Bioinformatics. 2023;39(9):btad561. pmid:37699006
54. Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat Genet. 2018;50(6):895–903. pmid:29808029
55. Roerink SF, Sasaki N, Lee-Six H, Young MD, Alexandrov LB, Behjati S, et al. Intra-tumour diversification in colorectal cancer at the single-cell level. Nature. 2018;556(7702):457–62. pmid:29643510
56. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21(1):31. pmid:32033589
57. Valecha M, Posada D. Somatic variant calling from single-cell DNA sequencing data. Comput Struct Biotechnol J. 2022;20:2978–85. pmid:35782734
58. Mitchell E, Spencer Chapman M, Williams N, Dawson KJ, Mende N, Calderbank EF, et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature. 2022;606(7913):343–50. pmid:35650442
59. Stein A, Werner B. On the patterns of genetic intra-tumour heterogeneity before and after treatment. Genetics. 2025:iyaf101. pmid:40439127
60. Antal T, Krapivsky PL. Exact solution of a two-type branching process: clone size distribution in cell division kinetics. J Stat Mech. 2010;2010(07):P07028.
61. Antal T, Krapivsky PL. Exact solution of a two-type branching process: models of tumor progression. J Stat Mech. 2011;2011(08):P08018.
62. Blasco MA. Telomere length, stem cells and aging. Nat Chem Biol. 2007;3(10):640–9. pmid:17876321
63. Goldman SL, MacKay M, Afshinnekoo E, Melnick AM, Wu S, Mason CE. The impact of heterogeneity on single-cell sequencing. Front Genet. 2019;10.
64. Lim B, Lin Y, Navin N. Advancing cancer research and medicine with single-cell genomics. Cancer Cell. 2020;37(4):456–70. pmid:32289270
65. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys. 1976;22(4):403–34.

[ref1] 1. Weinberg RA. The biology of cancer. Garland Science; 2013.

[ref2] 2. Reusch TBH, Baums IB, Werner B. Evolution via somatic genetic variation in modular species. Trends Ecol Evol. 2021;36(12):1083–92. pmid:34538501
View Article
PubMed/NCBI
Google Scholar

[3] View Article

[4] PubMed/NCBI

[5] Google Scholar

[ref3] 3. Bae T, Tomasini L, Mariani J, Zhou B, Roychowdhury T, Franjic D, et al. Different mutational rates and mechanisms in human cells at pregastrulation and neurogenesis. Science. 2018;359(6375):550–5. pmid:29217587
View Article
PubMed/NCBI
Google Scholar

[7] View Article

[8] PubMed/NCBI

[9] Google Scholar

[ref4] 4. Lee-Six H, Øbro NF, Shepherd MS, Grossmann S, Dawson K, Belmonte M, et al. Population dynamics of normal human blood inferred from somatic mutations. Nature. 2018;561(7724):473–8. pmid:30185910
View Article
PubMed/NCBI
Google Scholar

[11] View Article

[12] PubMed/NCBI

[13] Google Scholar

[ref5] 5. Werner B, Case J, Williams MJ, Chkhaidze K, Temko D, Fernández-Mateos J, et al. Measuring single cell divisions in human tissues from multi-region sequencing data. Nat Commun. 2020;11(1):1035. pmid:32098957
View Article
PubMed/NCBI
Google Scholar

[15] View Article

[16] PubMed/NCBI

[17] Google Scholar

[ref6] 6. Frank SA, Nowak MA. Problems of somatic mutation and cancer. Bioessays. 2004;26(3):291–9. pmid:14988930
View Article
PubMed/NCBI
Google Scholar

[19] View Article

[20] PubMed/NCBI

[21] Google Scholar

[ref7] 7. Komarova NL. Cancer, aging and the optimal tissue design. Semin Cancer Biol. 2005;15(6):494–505. pmid:16143543
View Article
PubMed/NCBI
Google Scholar

[23] View Article

[24] PubMed/NCBI

[25] Google Scholar

[ref8] 8. Burrell RA, McGranahan N, Bartek J, Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013;501(7467):338–45. pmid:24048066
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref9] 9. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proc Natl Acad Sci U S A. 2010;107(43):18545–50. pmid:20876136
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref10] 10. Ling S, Hu Z, Yang Z, Yang F, Li Y, Lin P, et al. Extremely high genetic diversity in a single tumor points to prevalence of non-Darwinian cell evolution. Proc Natl Acad Sci U S A. 2015;112(47):E6496-505. pmid:26561581
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref11] 11. Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, Zhao J, et al. A Big Bang model of human colorectal tumor growth. Nat Genet. 2015;47(3):209–16. pmid:25665006
View Article
PubMed/NCBI
Google Scholar

[39] View Article

[40] PubMed/NCBI

[41] Google Scholar

[ref12] 12. Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Identification of neutral tumor evolution across cancer types. Nat Genet. 2016;48(3):238–44. pmid:26780609
View Article
PubMed/NCBI
Google Scholar

[43] View Article

[44] PubMed/NCBI

[45] Google Scholar

[ref13] 13. Park S-C, Krug J. Clonal interference in large populations. Proc Natl Acad Sci U S A. 2007;104(46):18135–40. pmid:17984061
View Article
PubMed/NCBI
Google Scholar

[47] View Article

[48] PubMed/NCBI

[49] Google Scholar

[ref14] 14. Karlsson K, Przybilla MJ, Kotler E, Khan A, Xu H, Karagyozova K, et al. Deterministic evolution and stringent selection during preneoplasia. Nature. 2023;618(7964):383–93. pmid:37258665
View Article
PubMed/NCBI
Google Scholar

[51] View Article

[52] PubMed/NCBI

[53] Google Scholar

[ref15] 15. Ewens WJ. The pseudo-transient distribution and its uses in genetics. J Appl Prob. 1964;1(1):141–56.
View Article
Google Scholar

[55] View Article

[56] Google Scholar

[ref16] 16. Kimura M. Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles. Genet Res. 1968;11(3):247–69. pmid:5713805
View Article
PubMed/NCBI
Google Scholar

[58] View Article

[59] PubMed/NCBI

[60] Google Scholar

[ref17] 17. Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995;48(2):172–97. pmid:7482370
View Article
PubMed/NCBI
Google Scholar

[62] View Article

[63] PubMed/NCBI

[64] Google Scholar

[ref18] 18. Bozic I, Gerold JM, Nowak MA. Quantifying clonal and subclonal passenger mutations in cancer evolution. PLoS Comput Biol. 2016;12(2):e1004731. pmid:26828429
View Article
PubMed/NCBI
Google Scholar

[66] View Article

[67] PubMed/NCBI

[68] Google Scholar

[ref19] 19. Moeller ME, Mon Père NV, Werner B, Huang W. Measures of genetic diversification in somatic tissues at bulk and single-cell resolution. Elife. 2024;12:RP89780. pmid:38265286
View Article
PubMed/NCBI
Google Scholar

[70] View Article

[71] PubMed/NCBI

[72] Google Scholar

[ref20] 20. Durrett R. Population genetics of neutral mutations in exponentially growing cancer cell populations. Ann Appl Probab. 2013;23(1):230–50. pmid:23471293
View Article
PubMed/NCBI
Google Scholar

[74] View Article

[75] PubMed/NCBI

[76] Google Scholar

[ref21] 21. Durrett R. Branching process models of cancer. In: Durrett R, editor. Branching process models of cancer. Cham: Springer; 2015. p. 1–63.

[ref22] 22. Gunnarsson EB, Leder K, Foo J. Exact site frequency spectra of neutrally evolving tumors: a transition between power laws reveals a signature of cell viability. Theor Popul Biol. 2021;142:67–90. pmid:34560155
View Article
PubMed/NCBI
Google Scholar

[79] View Article

[80] PubMed/NCBI

[81] Google Scholar

[ref23] 23. Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet. 2013;14(9):618–30. pmid:23897237
View Article
PubMed/NCBI
Google Scholar

[83] View Article

[84] PubMed/NCBI

[85] Google Scholar

[ref24] 24. Wang Y, Navin NE. Advances and applications of single-cell sequencing technologies. Mol Cell. 2015;58(4):598–609. pmid:26000845
View Article
PubMed/NCBI
Google Scholar

[87] View Article

[88] PubMed/NCBI

[89] Google Scholar

[ref25] 25. Abascal F, Harvey LMR, Mitchell E, Lawson ARJ, Lensing SV, Ellis P, et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021;593(7859):405–10. pmid:33911282
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref26] 26. Chalmers ZR, Connelly CF, Fabrizio D, Gay L, Ali SM, Ennis R, et al. Analysis of 100,000 human cancer genomes reveals the landscape of tumor mutational burden. Genome Med. 2017;9(1):34. pmid:28420421
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref27] 27. Fernandez EM, Eng K, Beg S, Beltran H, Faltas BM, Mosquera JM, et al. Cancer-specific thresholds adjust for whole exome sequencing-based tumor mutational burden distribution. JCO Precis Oncol. 2019;3:PO.18.00400. pmid:31475242
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref28] 28. Martínez-Pérez E, Molina-Vila MA, Marino-Buslje C. Panels and models for accurate prediction of tumor mutation burden in tumor samples. NPJ Precis Oncol. 2021;5(1):31. pmid:33850256
View Article
PubMed/NCBI
Google Scholar

[103] View Article

[104] PubMed/NCBI

[105] Google Scholar

[ref29] 29. Derényi I, Demeter MC, Pérez-Jiménez M, Grajzel D, Szöllősi GJ. How mutation accumulation depends on the structure of the cell lineage tree. Phys Rev E. 2024;109(4–1):044407. pmid:38755817
View Article
PubMed/NCBI
Google Scholar

[107] View Article

[108] PubMed/NCBI

[109] Google Scholar

[ref30] 30. Page RD, Holmes EC. Molecular evolution: a phylogenetic approach. Wiley; 2009.

[ref31] 31. Steel M, McKenzie A. Properties of phylogenetic trees generated by Yule-type speciation models. Math Biosci. 2001;170(1):91–112. pmid:11259805
View Article
PubMed/NCBI
Google Scholar

[112] View Article

[113] PubMed/NCBI

[114] Google Scholar

[ref32] 32. Lynch WC. More combinatorial properties of certain trees. Comput J. 1965;7(4):299–302.
View Article
Google Scholar

[116] View Article

[117] Google Scholar

[ref33] 33. Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61(4):893–903. pmid:5364968
View Article
PubMed/NCBI
Google Scholar

[119] View Article

[120] PubMed/NCBI

[121] Google Scholar

[ref34] 34. Kuipers J, Jahn K, Raphael BJ, Beerenwinkel N. Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors. Genome Res. 2017;27(11):1885–94. pmid:29030470
View Article
PubMed/NCBI
Google Scholar

[123] View Article

[124] PubMed/NCBI

[125] Google Scholar

[ref35] 35. Cheek D, Antal T. Mutation frequencies in a birth–death branching process. Ann Appl Probab. 2018;28(6).
View Article
Google Scholar

[127] View Article

[128] Google Scholar

[ref36] 36. Cheek D, Antal T. Genetic composition of an exponentially growing cell population. Stochast Process Appl. 2020;130(11):6580–624.
View Article
Google Scholar

[130] View Article

[131] Google Scholar

[ref37] 37. Ronen R, Udpa N, Halperin E, Bafna V. Learning natural selection from the site frequency spectrum. Genetics. 2013;195(1):181–93. pmid:23770700

[ref38] 38. Deijfen M, Lindholm M. Growing networks with preferential deletion and addition of edges. Phys A: Statist Mech Appl. 2009;388(19):4297–303.

[ref39] 39. Durrett R. Probability models for DNA sequence evolution. 2nd ed. New York, NY: Springer.

[ref40] 40. Simons BD. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis. Proc Natl Acad Sci U S A. 2016;113(1):128–33. pmid:26699486

[ref41] 41. Loeb LA, Kohrn BF, Loubet-Senear KJ, Dunn YJ, Ahn EH, O’Sullivan JN, et al. Extensive subclonal mutational diversity in human colorectal cancer and its significance. Proc Natl Acad Sci U S A. 2019;116(52):26863–72. pmid:31806761

[ref42] 42. Watson CJ, Papula AL, Poon GYP, Wong WH, Young AL, Druley TE, et al. The evolutionary dynamics and fitness landscape of clonal hematopoiesis. Science. 2020;367(6485):1449–54. pmid:32217721

[ref43] 43. Poon GYP, Watson CJ, Fisher DS, Blundell JR. Synonymous mutations reveal genome-wide levels of positive selection in healthy tissues. Nat Genet. 2021;53(11):1597–605. pmid:34737428

[ref44] 44. Tung H-R, Durrett R. Signatures of neutral evolution in exponentially growing tumors: a theoretical perspective. PLoS Comput Biol. 2021;17(2):e1008701. pmid:33571199

[ref45] 45. Kurpas MK, Kimmel M. Modes of selection in tumors as reflected by two mathematical models and site frequency spectra. Front Ecol Evol. 2022;10:889438. pmid:37333691

[ref46] 46. Cho H, Kuo Y-H, Rockne RC. Comparison of cell state models derived from single-cell RNA sequencing data: graph versus multi-dimensional space. Math Biosci Eng. 2022;19(8):8505–36. pmid:35801475

[ref47] 47. Schweinsberg J, Shuai Y. Asymptotics for the site frequency spectrum associated with the genealogy of a birth and death process. Ann Appl Probab. 2025;35(1).

[ref48] 48. Dinh KN, Jaksik R, Kimmel M, Lambert A, Tavaré S. Statistical Inference for the Evolutionary History of Cancer Genomes. Statist Sci. 2020;35(1).

[ref49] 49. Popovic L. Asymptotic genealogy of a critical branching process. Ann Appl Probab. 2004;14(4).

[ref50] 50. Lambert A. The allelic partition for coalescent point processes. Markov Process Relat Fields. 2009;15(3):359–86.

[ref51] 51. Lambert A, Stadler T. Birth-death models and coalescent point processes: the shape and probability of reconstructed phylogenies. Theor Popul Biol. 2013;90:113–28. pmid:24157567

[ref52] 52. Lambert A. The coalescent of a sample from a binary branching process. Theor Popul Biol. 2018;122:30–5. pmid:29704514

[ref53] 53. Johnson B, Shuai Y, Schweinsberg J, Curtius K. cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory. Bioinformatics. 2023;39(9):btad561. pmid:37699006

[ref54] 54. Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nat Genet. 2018;50(6):895–903. pmid:29808029

[ref55] 55. Roerink SF, Sasaki N, Lee-Six H, Young MD, Alexandrov LB, Behjati S, et al. Intra-tumour diversification in colorectal cancer at the single-cell level. Nature. 2018;556(7702):457–62. pmid:29643510

[ref56] 56. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21(1):31. pmid:32033589

[ref57] 57. Valecha M, Posada D. Somatic variant calling from single-cell DNA sequencing data. Comput Struct Biotechnol J. 2022;20:2978–85. pmid:35782734

[ref58] 58. Mitchell E, Spencer Chapman M, Williams N, Dawson KJ, Mende N, Calderbank EF, et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature. 2022;606(7913):343–50. pmid:35650442

[ref59] 59. Stein A, Werner B. On the patterns of genetic intra-tumour heterogeneity before and after treatment. Genetics. 2025:iyaf101. pmid:40439127

[ref60] 60. Antal T, Krapivsky PL. Exact solution of a two-type branching process: clone size distribution in cell division kinetics. J Stat Mech. 2010;2010(07):P07028.

[ref61] 61. Antal T, Krapivsky PL. Exact solution of a two-type branching process: models of tumor progression. J Stat Mech. 2011;2011(08):P08018.

[ref62] 62. Blasco MA. Telomere length, stem cells and aging. Nat Chem Biol. 2007;3(10):640–9. pmid:17876321

[ref63] 63. Goldman SL, MacKay M, Afshinnekoo E, Melnick AM, Wu S, Mason CE. The impact of heterogeneity on single-cell sequencing. Front Genet. 2019;10.

[ref64] 64. Lim B, Lin Y, Navin N. Advancing cancer research and medicine with single-cell genomics. Cancer Cell. 2020;37(4):456–70. pmid:32289270

[ref65] 65. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys. 1976;22(4):403–34.
