As a cancer develops, its cells accrue new mutations, resulting in a heterogeneous, complex genomic profile. We make use of this heterogeneity to derive simple, analytic estimates of parameters driving carcinogenesis and reconstruct the timeline of selective events following initiation of an individual cancer, where two longitudinal samples are available for sequencing. Using stochastic computer simulations of cancer growth, we show that we can accurately estimate mutation rate, time before and after a driver event occurred, and growth rates of both initiated cancer cells and subsequently appearing subclones. We demonstrate that in order to obtain accurate estimates of mutation rate and timing of events, observed mutation counts should be corrected to account for clonal mutations that occurred after the founding of the tumor, as well as sequencing coverage. Chronic lymphocytic leukemia (CLL), which often does not require treatment for years after diagnosis, presents an optimal system to study the untreated, natural evolution of cancer cell populations. When we apply our methodology to reconstruct the individual evolutionary histories of CLL patients, we find that the parental leukemic clone typically appears within the first fifteen years of life.
By the time a patient’s cancer is diagnosed, it has been growing undetected for years, or even decades. A cancer’s initiation, development, and progression are driven by a sequence of driver mutations, genetic alterations that confer a fitness advantage to the cells containing them. As a cancer expands, it also accumulates many neutral mutations that don’t confer a growth advantage. As a result, tumors are highly heterogeneous, made up of different genetically distinct populations, or subclones, of cancer cells. Most cancers will require immediate treatment upon diagnosis, making study of their natural progression over time difficult. However, the blood cancer chronic lymphocytic leukemia (CLL) often does not require immediate treatment and is closely monitored for years, which makes it ideal for studying cancer evolution before treatment radically alters the cancer’s dynamics. We make use of the complex tumor heterogeneity to reconstruct the timing of key driver events in the tumor’s development, showing that the initial leukemic clone often appears early in life. Additionally, we estimate mutation rate, subclone growth rates, and fitness advantage provided by driver mutations.
Citation: Lee ND, Bozic I (2022) Inferring parameters of cancer evolution in chronic lymphocytic leukemia. PLoS Comput Biol 18(11): e1010677. https://doi.org/10.1371/journal.pcbi.1010677
Editor: Jasmine Foo, University of Minnesota, UNITED STATES
Received: May 26, 2022; Accepted: October 24, 2022; Published: November 4, 2022
Copyright: © 2022 Lee, Bozic. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All code for analysis and simulations can be found at https://github.com/nathanlee543/Cancer_Inf_Sims. The manuscript is a computational study and no new data -- besides simulated data -- has been generated. Simulated data is available in the Github repository. CLL data analyzed is from previously published work (Gruber, M., Bozic, I., Leshchiner, I. et al. Growth dynamics in naturally progressing chronic lymphocytic leukaemia. Nature 570, 474-479 (2019). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6630176/) and is publicly available in that work’s supplementary tables.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
When a cell accrues a sequence of driver mutations—genetic alterations that provide a proliferative advantage relative to surrounding cells—it can begin to divide uncontrollably and eventually develop the complex features of a cancer [1–3]. Thousands of specific driver mutations have been implicated in carcinogenesis, with individual tumors harboring from few to dozens of drivers, depending on the cancer type . Mutations that don’t have a significant effect on cellular fitness also arise, both before and after tumor initiation . These neutral mutations, or “passengers”, can reach detectable frequencies by random genetic drift or the positive selection of a driver mutation in the same cell [6–9]. Mutational burden detectable by bulk sequencing reveals tens to thousands of passengers per tumor [10, 11].
Genome sequencing technologies have revealed the heterogeneous, informative genetic profiles produced by the evolutionary process driving carcinogenesis [12, 13]. These genetic profiles have been used to obtain insight into specific features of the carcinogenic process operating in individual patients. For example, the molecular clock feature of passenger mutations has been employed to measure timing of early events in tumor formation, as well as identify stages of tumorigenesis and metastasis [14–22]. Other studies have estimated mutation rates [5, 23, 24], selective growth advantages of cancer subclones [25–28], and the effect of spatial structure on cancer evolution [29–31]. We note that previous approaches typically only estimate one or a few parameters of cancer evolution. In addition, many state-of-the-art methods make use of computationally expensive approaches [24, 30, 32] or simplifying assumptions, such as approximating tumor expansion as deterministic or ignoring cell death [27, 32]. Our approach relies on analytic formulas and sampling, which for realistic numbers of subclones and time points is efficient, and does not require simulation of tumor growth or computationally expensive model fitting.
Mathematical models of cancer progression, especially when used in conjunction with experimental and clinical data, can provide important insights into the evolutionary history of cancer [9, 19, 33–37]. Branching processes—a type of a stochastic process—can be used to model how different populations of dividing, dying, and mutating cells in a tumor evolve over time . Their theory and applications have been well developed to model the multistage nature of cancer development [25, 29, 35, 38–40]. Here we use a branching process model of carcinogenesis to derive a comprehensive reconstruction of an individual tumor’s evolution.
Tumors can grow for many years, even decades, before they reach detectable size . Typically, tumor samples used for sequencing would be obtained at the end of the tumor’s natural, untreated progression. More recently, longitudinal sequencing, where a tumor is sequenced at multiple times during its development, has provided better resolution of tumor growth dynamics and evolution in various cancer types [27, 41–44]. Chronic lymphocytic leukemia (CLL) is an ideal system for studying cancer evolution because it can be monitored, via peripheral blood samples, without treatment until disease progression .
We establish that two longitudinal bulk sequencing and tumor size measurements are sufficient to reconstruct virtually all parameters (mutation rate, growth rates, times of appearance of driver mutations, and time since the driver mutation) of cancer evolution in individual patients. Our analytic approach yields simple formulas for the parameters; thus, estimation of the parameters governing cancer growth is not computationally intensive, regardless of tumor size. Our framework makes possible a personalized, high-resolution reconstruction of a cancer’s timeline of selective events and quantitative characterization of the evolutionary dynamics of the subclones making up the cancer cell population.
We consider a multi-type branching process of tumor expansion (Fig 1A). Tumor growth is started with a single initiated cell at time 0. Initiated tumor cells divide with rate b and die with rate d. These cells already have the driver mutations necessary for expansion, so we assume b > d. The population of initiated cells can go extinct due to stochastic fluctuations, or survive stochastic drift and start growing (on average) exponentially with net growth rate r = b − d. We will focus only on those populations that survived stochastic drift.
(a) Stochastic branching process model for tumor expansion. Initiated tumor cells (blue) divide with birth rate b, die with death rate d, and accrue passenger mutations with mutation rate u. Type-1 cells, which carry the driver mutation, divide with birth rate b1, die with death rate d1, and accrue passenger mutations with mutation rate u. (b) The initiated tumor, or type-0, (blue) population growth is initiated from a single cell. A driver mutation occurs in a single type-0 cell at time t1, starting the type-1 population (red). The tumor sample is collected and bulk sequenced at times t1 + t and t1 + t + Δ, where the driver fraction is α1 and α2, respectively. Tumor size (in number of cells) is M1 and M2 at first and second sample collection dates. (c) By the time the tumor is observed, it has a high level of genetic heterogeneity due to the mutations that have accrued in both type-0 (blue) and type-1 populations (red). Each yellow star represents a different passenger mutation.
At some time t1 > 0 a new driver mutation occurs in a single initiated tumor cell, starting a new independent birth-death process, with birth rate b1 and death rate d1 (Fig 1B). Net growth rate of cells with the new driver is r1 = b1 − d1. The new driver increases the rate of growth, i.e., r1 > r. We define the driver’s selective growth advantage by g = (r1/r − 1). In addition, both populations of cells (with and without the driver) accrue passenger mutations with rate u (Fig 1C).
After the driver mutation occurs, an additional time t passes before the tumor is observed. Type-0 cells are original initiated tumor and type-1 cells contain the driver mutation. In Materials and methods we also analyze the more general case of two nested or sibling driver mutations, as well as the fully generalized case of any clonal structure that might arise during tumor expansion.
Parameter estimates from two longitudinal measurements
We demonstrate that with two longitudinal bulk sequencing measurements, it is possible to accurately estimate net growth rates, time of appearance of a driver mutation, time between a driver mutation and observation, and mutation rate in the tumor. The tumor is first sequenced at time of observation, t1 + t, where both time of driver mutation, t1, and time from driver mutation to observation, t, are yet unknown (Fig 1B). A second bulk sequencing is performed at t1 + t + Δ, a known Δ time units after the tumor is first observed (Fig 1B). Later, we apply our method to the CLL data from Ref. , where the average size of Δ for all the pre-treatment samples sequenced is 1.8 years (0.6–4.9 years). In general, we expect that in the case of smaller Δ values measurement errors would have a larger effect on the estimated growth rates, due to an expected smaller change in cancer cell count and subclonal structure during a smaller time interval. From the bulk sequencing data, the fraction of cells carrying the driver mutation, α1 and α2, can be measured at the time points t1 + t and t1 + t + Δ, respectively. We denote total number of cells in the tumor at the two bulk sequencing time points as M1 and M2. For liquid cancers, cell counts of the relevant cancer cell population serve as indicators of cancer progression. In the case of CLL, white blood cell (WBC) count is useful as a measure of tumor burden in peripheral blood, as it is routinely taken and includes the cancerous cell population. More precise estimates of tumor burden would include absolute lymphocyte count (ALC) and number of B lymphocytes. Both ALC and WBC counts can suffer from inaccuracies due to the prevalence of smudge cells in CLL, often resulting in an underestimate of these counts .
Equating expected values of the sizes of the type-0 and type-1 population at the two bulk sequencing time points with the measured numbers of cells present in clones 0 and 1, we obtain estimates of the net growth rates of the two subclones: (1) (2) From the growth rate estimates and subclone sizes, we can approximate the expected value of the time a population in a branching process takes to reach an observed size . This yields an estimate of the time t from the appearance of driver mutation until observation: (3) Using the bulk sequencing data from the second time point, γ, the number of subclonal passengers between the specified frequencies f1 and f2, can be measured. Using results from previous work , we derive the expected value of γ (Materials and methods), which can be used to estimate the mutation rate u: (4) The m passenger mutations that were present in the original type-1 cell when the driver mutation occurred (Fig 1C) are present in all type-1 cells. m can be estimated from bulk sequencing data and used to estimate time of appearance of the driver. We maximize the likelihood function P(m|t1) with respect to time of appearance of the driver, t1, (see Materials and methods) to obtain the maximum likelihood estimate (5) Using formulas (4) and (5), we can now estimate t1.
Estimates verified in simulated tumors
To assess the accuracy of the parameter estimates for several modes of tumor evolution, we simulate tumor growth by performing a Monte Carlo simulation, which simulates the birth, death, and accumulation of mutations in the individual cells that make up a tumor. This simulation generates the mutation frequency and tumor size data used by the estimates (see Methods section for details of simulation). We simulate three different types of tumors (slow growing, fast growing, and no cell death), with a high and a low mutation rate for each (S1 Table).
In a simulation of a fast-growing tumor with a single subclonal driver mutation that confers a strong selective growth advantage of 100%, we can accurately estimate growth rates, mutation rate, time of driver event, and time since driver event (Fig 2A and 2B). Growth rates of both initiated tumor and driver subclones can be estimated with a high degree of accuracy, achieving mean percentage error (MPE) of 0.03% and -0.07% for the lower mutation rate (u = 1) scenario. The mutation rate u and estimates for time of driver appearance, t1, and time since driver, t, can also be estimated accurately, with MPEs of -0.9%, 3.8%, and -0.4%, respectively. Estimates for u, t1, and t have a somewhat greater degree of variation compared to the growth rate estimates, due to the inherent randomness of the number of mutations and time to reach the observed size that occur in each realization of the stochastic process.
For the parameter regime with no cell death and the regime for a slow-growing tumor, we again achieve high accuracies for the net growth rates (S1(A), S1(B), S2(A) and S2(B) Figs). In the lower mutation rate (u = 1) scenario, parameter estimates for the mutation rate u and time of driver appearance t1 can be accurately estimated for both regimes, with MPEs of -1.3% and 4.9% for the no cell death case, and MPEs of -3% and 3.7% for the slow-growing tumor.
We note that the estimator for t (time since driver event) is biased, with the extent depending on the ratio of birth rate to net growth rate, and the tumor size. The underlying cause of the bias is due to a simplifying assumption in the estimator’s derivation (see Methods, “Derivation of estimates of evolutionary parameters”), and this bias decreases as tumor size increases and as the ratio of growth and division rate gets closer to 1. For the three main modes of growth in our study, we performed additional Monte Carlo simulations to precisely quantify the effect of death:birth ratio and tumor size on the estimator’s accuracy (S5 Fig). For all three modes of growth, we observe a monotonic decrease in error as tumor size increases to more clinically realistic sizes. For a tumor size of 109, all modes of growth have a MPE of less than 4%, so for a clinically realistic cancer size—1011 for the CLL dataset—we expect an even better accuracy.
We also perform Monte Carlo simulations for the more complex cases of two nested and two sibling driver subclones (see Methods for derivations of estimators) for the same three modes of cancer growth used for the single driver subclone case above: fast growth (Fig 2C and 2D), no cell death (S1(C) and S1(D) Fig), and slow growth (S2(C) and S2(D) Fig). For two nested driver subclones, the second driver subclone also carries its parental subclone’s driver mutation (S4(A) Fig). For two sibling driver subclones, the drivers occur in separate subclones (S4(B) Fig). The growth rate estimates show good agreement with the ground truth values, with MPEs close to 0. The mutation rate estimates also have good accuracy, with the absolute values of their MPEs all ≤4%. As for the single subclone cases already discussed, the time estimates for the nested and sibling subclone simulations have a greater variance. The estimate for t—time between the last driver mutation and diagnosis—shows good accuracy for the fast-growing tumors, but larger errors for the no cell death and slow growth cases. For both the nested and sibling simulations, the estimates for the times of driver mutations 1 and 2 (t1 and t2, respectively) have MPEs less than 6%.
We simulated tumor growth by performing a Monte Carlo simulation, which simulates the birth, death, and accumulation of mutations in the individual cells that make up a tumor, and generates the mutation frequency and tumor size data used by the estimates. Simulations are of fast-growing tumors with (a) single driver subclone and mutation rate u = 1, (b) single driver subclone and u = 3, (c) two nested driver subclones with u = 1, and (d) two sibling driver subclones with u = 1. Mean percent errors (MPEs) of estimates are shown in black above the plots, and mean absolute percent errors (MAPEs) are shown in gray. Boxes contain 25th-75th quartiles, with median indicated by thick horizontal black line. Whiskers of boxplots indicate 2.5 and 97.5 percentiles. Violins are smoothed density estimates of the percent error data points. Complete parameter values and number of runs are included in S1 Table.
Correcting mutation counts observed from genome sequencing data
We note that in our estimate for the time of appearance of the driver, t1 (see formula (5)), used for comparison to simulated data, we employed a correction to m, the number of mutations that were present in the founder type-1 cell at t1. From sequencing data, these m mutations are indistinguishable (Fig 3A) from mutations that occurred after t1 in type-1 cells and reached fixation in the type-1 population . Thus, the value of m observed from sequencing data, mobs, will overestimate the true m. In Materials and methods we show that the expected value of the number of passengers that occurred after t1 and reached fixation in the type-1 population is u/r1. We subtract this correction factor from mobs: (6)
(a) If passenger mutations (circles with stars) that occur after the driver reach fixation in the driver population (red), then they are indistinguishable from the passengers that were present in the first cell with the driver, which accrued in the type-0 population (blue). The estimate of when the driver occurred needs to account for these mutations (circled). In (b), we compare percent errors of parameter estimates for time from tumor initiation until appearance of a driver subclone, t1, with and without this correction (Eq (6)). Errors for estimate with correction are shown in blue, and for estimate without correction (Eq. (5)) in orange. Errors are plotted as a kernel density estimate for Monte Carlo simulations of slow-growing tumor with mutation rate u = 5. Mean percent errors (MPEs) and mean absolute percent errors (MAPEs) are listed. (c) Mutations present on two or fewer variant reads (red) are filtered out in post-processing. Mutations with more than two variant reads (black) are included. The number of subclonal mutations between frequencies f1 and f2, γ, which is used in the mutation rate estimate, must be corrected for mutations that are filtered out. In (d), the percent errors for the observed (orange) and corrected (blue) γ (Eq (7)) are plotted as kernel density estimates. Observed mutations are those that passed post-processing, i.e. those that have more than L = 2 mutant reads. True mutation frequencies were generated from 135 surviving runs of a Monte Carlo simulation of a fast-growing tumor with mutation rate u = 1, from which sequencing reads were simulated with 200x average coverage (see Materials and methods). Percent errors are calculated relative to the true γ measured from the true mutation frequencies.
The correction for the m mutations present in the original type-1 cell (6) at time t1 improves the accuracy of the estimate for time of appearance of driver mutation t1. For the fast-growing tumor with mutation rate u = 1 (S3(A) Fig), the correction lowers the mean percent error (MPE) of the t1 estimate from 14.0% to 3.8%. For the slow-growing tumor with mutation rate u = 5 (Fig 3B), the correction lowers the MPE of the t1 estimate from 22.0% to 5.7% (Fig 3B).
Another issue arises from obtaining mutation count γ, number of mutations with frequency between f1 and f2, from genome sequencing data. When sequencing data is post-processed by filtering out mutations with L or fewer variant reads, low-frequency mutations will be difficult to detect  (Fig 3C). For a sample with average sequencing coverage of R and tumor purity p, mutations with mutant allele frequency below L/(pR) will typically not be observable. As a result, since mutations with frequencies between f1 and f2 count towards γ, if f1 ≤ 2L/(pR), the observed number of subclonal mutations between frequencies f1 and f2, γobs, will underestimate the true value, γ. For cancers with low mutational burden, such as CLL, we set a relatively low f1 (1%) to have sufficient resolution to infer mutation rate. Consequently, some mutations with frequency above f1 will likely be filtered out, and we account for this by correcting for the expected number of such subclonal mutations present at cancer cell frequencies (CCFs) between f1 and 2L/(pR) (see Materials and methods): (7)
Before applying our methodology to patient sequencing data, we estimated the validity of the above correction applied to observed simulated mutation counts. When we simulate sequencing reads from simulated mutation frequencies (see Materials and methods) and post-process by removing mutations with L = 2 or fewer variant reads, the adjustment we derived for mutation count γ (7) is critical, even for average sequencing coverage of 200x (Fig 3D). Without any correction, the observed γ has MPE of -53.3% compared to true γ, but with the correction, the computed γ has MPE of -1.4%. When average coverage is 100x, this correction becomes even more important, as many of the low-frequency mutations are discarded (S3(B) Fig). Without any correction, the observed γ has MPE of -79.7%. With the correction the computed γ has MPE of -3.4%. The accuracy of the γ measurement affects our estimate of the mutation rate (4).
Estimating parameters for individual patients with CLL
We use our formulas to infer the patient-specific parameters of cancer evolution for four patients with CLL whose growth patterns and clonal dynamics were analyzed in . These CLLs had peripheral WBC counts measured and whole exome sequencing (WES) performed at least twice before treatment. We consider patients whose WBC counts were classified as having an exponential-like growth pattern, with average γobs > 2, and with 3 or fewer macroscopic subclones (i.e. subclones with cancer cell fractions of 20% or greater for at least one pre-treatment time point). Our framework is designed specifically to study naturally evolving cancer dynamics, unperturbed by treatment, which will drastically alter the cancer’s dynamics and size. For calculation of the γobs mutations between frequencies f1 and f2, we set f1 = 1% due to the difficulty of detecting low frequency variants <1% [48, 49]. We set f2 to 20% to minimize overlap with potential driver mutations of the macroscopic subclones. The average γobs for the four analyzed patients ranges from 2.5 to 19.3, with a median of 5.2. As in Ref. , we perform subclonal reconstruction for each patient using PhylogicNDT . To obtain confidence intervals for our parameter estimates, we utilize a sampling procedure to account for model and measurement uncertainties, including uncertainties in subclone frequencies, fitted growth curves, and the Poisson process for mutation accumulation (see Materials and methods). For each patient’s tumor, we compute estimates of the growth rate of each clone, exome mutation rate, the times that each subclone arose, and how long each subclone expanded before the tumor was detected (Tables 1 and 2). We also estimate what time the cancer was clinically detectable, by sampling from the distribution of fitted growth parameters and solving the resulting root-finding problem for time to reach detectable size under our growth model (see Materials and methods). For CLL specifically, we compute time of leukocytosis—an abnormally high WBC count. We reconstruct these histories for tumors with various clonal structures.
Estimates are computed from tumor size measurements and mutation frequencies from whole exome sequencing. Mutation rates are for the exome only. The time estimates are in terms of the patient’s age in years.
Estimates are computed from tumor size measurements and mutation frequencies from whole exome sequencing. Mutation rates are for the exome only. The time estimates are in terms of the patient’s age in years.
Patients 3 and 21 are examples of a CLL with a single subclone (Fig 4). For Patient 3, Clone 0, the most recent common ancestor (MRCA) of this patient’s CLL, was initiated when the patient was 14.6 [1.4, 26.8] years old (median and [95% confidence interval] of estimate). Clone 0 grew with a net growth rate of 0.51 [0.20, 0.85] per year. Approximately two decades later, Clone 1 was initiated when the patient was 33.5 [24.1, 39.2] years old. Clone 1 expanded with a growth rate of 0.85 [0.65, 1.04] per year (corresponding to a selective growth advantage of 68.7% over Clone 0), and the patient was diagnosed approximately three decades later at age 63.
We applied our methodology to estimate subclonal growth rates, mutation rates and evolutionary timelines in CLL tumors from Ref. . Vertical height of a clone represents its log10-scaled size. Mutations were clustered into clones and phylogenetic trees were inferred using PhylogicNDT . Tree edges are colored by clone number and are labeled with driver mutations, if any. For each patient, we show estimates for patient age at CLL initiation and times of appearance of CLL subclones. Dashed white line indicates when the patient was diagnosed. Solid black arrows indicate times of bulk sequencing measurements.
For patient 21, we estimate that the parental clone (MRCA, Clone 0) of this patient’s CLL was initiated when the patient was 6.4 [0.3, 16.7] years old, and grew with a net growth rate of 0.79 [0.30, 1.14] per year. Clone 1 appeared when the patient was 19.6 [10.8, 24.0] years old, and grew more quickly than Clone 0, with a selective growth advantage of ∼90% over Clone 0). Clone 1 contained a FGFR1 mutation, which might have been acting as a driver of the increased net proliferation. Clone 1 then grew for ∼15 years before the patient was diagnosed at age 35.
Patients 6 and 9 present more complex clonal structures (Fig 4). Clone 0, the parental clone of the CLL of Patient 9, arose when the patient was 4.9 [1.2, 10.8] years old, and had a growth rate of 0.28 [0.17, 0.42] per year. Clone 1 arose when the patient was 18.8 [8.8, 35.1] years old. Interestingly, during clinical observation between diagnosis and treatment, Clone 1 was declining in size, with a growth rate of -0.40 [-0.45, -0.19] per year. In line with recent findings , we found that sometimes the estimated growth rate during the period of observation, such as the negative growth rate of Clone 1, is smaller than the minimal possible growth rate necessary to reach the observed clone size. In that case, for calculating mutation rate, time of the driver(s), time of detectability, and time between driver(s) and diagnosis we use the minimal growth rate. Clone 2, containing a KRAS mutation, had the largest net growth rate of the three clones (0.67 [0.49, 0.94] per year), corresponding to a selective growth advantage of 140.9% over the parental clone. Clone 2 arose when the patient was 21.3 [7.7, 31.7] years old.
We estimate that the CLL of Patient 6 was initiated when the patient was 2.8 [0.1, 13.2] years old. The leukemic parental clone, Clone 0, then grew at a rate of 0.68 [0.15, 1.30] per year. Approximately 33 years after the appearance of Clone 0, when the patient was 35.4 [21.7, 46.1] years old, the first subclone, Clone 1 appeared. Clone 3 arose from within Clone 1 when the patient was 45.9 [31.3, 54.6] years old. Clone 3 harbored a driver mutation in ASXL1 and had selective growth advantage of 60.8% over Clone 0. The patient was diagnosed at age 58, eventually needing treatment 12.0 years after diagnosis.
The average mutation rate in the four CLL patients we analyze is 0.30 mutations/year. This rate is over the exome, which accounts for ∼1% of the human genome. Our average estimated mutation rate in CLL exomes is similar to the measured rate of accumulation of mutations in human tissues of 40 mutations per year over the entire genome . Other recent work has estimated a mutation rate of 17 mutations per year in human haematopoietic stem cell/multipotent progenitors . Our estimated mutation rates during CLL progression are on par or higher than the recent estimates in healthy hematopoietic cells , in line with the expectation that mutation rates may be increased in cancer. The estimated times of appearance of CLL subclones are very long, on the order of 10 years or more. This finding is in agreement with results from Gruber et al. , who find few new CLL subclones over years to a decade of evolution. We observe that CLL initiation occurred early in most patients, within the first fifteen years of their lives, consistent with recent work in other cancer types [19, 36]. We find that CLL patients reach leukocytosis an average of 1.5 years before the first timepoint at which cancer genome sequencing was performed. For three of the patients, our estimated time of leukocytosis was before diagnosis, on average 1.3 years prior to diagnosis.
We use a stochastic branching process model to reconstruct the timing of driver events and quantify the evolutionary dynamics of different subclonal populations of cancer cells. We estimate growth rates of tumor subclones, selective growth advantage of individual driver mutations, mutation rate in the tumor, time between tumor initiation and appearance of a subclonal driver mutation, and time between driver mutation and tumor observation. Together, this allows us to estimate the age of the patient at tumor initiation, as well as the age at appearance of a subclonal driver.
Previous work has computed relative order of driver events [18, 21, 53], while other studies have given estimates for scaled mutation rates and time of events [24, 32]. However, we present estimates for absolute, unscaled mutation rates and times, which are easily interpretable and don’t implicitly depend on unknown parameters. We assume that mutations accrue with time, which simplifies derivations and is supported by recent experimental data that shows that non-dividing cells may accrue mutations at a similar rate as dividing cells . Other potential assumptions regarding mutation accumulation include mutations occurring at cell division  or assuming mutation rate is proportional to the copy number state . For example, recent work reported that some mutational signatures in human cancers are generated during mitosis . Other work has shown that the rate of accumulation of somatic single nucleotide variants is proportional to copy number . We further assume that all cancer subpopulations have the same passenger mutation rate. In the case that mutations occur predominantly at cell division, assuming that the rate of cell division is comparable across all tumor subclones, our estimates would still be valid. In the case of a subclone that has an elevated mutation rate (e.g. due to a chromosomal amplification, mutation in a DNA repair pathway gene or an increased cell division rate), we would underestimate the mutation rate and overestimate the time of driver mutation(s) in that subclone. In the other subclones, the opposite would be true.
For individual CLLs that underwent bulk sequencing at two time points , we infer growth rates of individual subclones, mutation rate in the tumor, the times when cancer subclones began growing, the time between driver mutations and the patient’s diagnosis, and time when the cancer is clinically observable. Our inferences are limited by the relatively low number of mutations present in CLL, as well as sequencing coverage , so we set a minimum passenger mutation count when selecting specific cases to analyze. The accuracy of estimates presented here is expected to be higher with whole genome sequencing available, with higher sequencing coverage, or in cancer types with more mutations, with some important limitations. Exponential growth—the mean behavior of our branching process model—has been well documented in vivo [27, 57–59], but tumors can also often exhibit sigmoidal growth (e.g. logistic, Gompertz models), where initial exponential growth is followed by a deceleration in growth [58, 60–63]. Our estimators should only be used for cancers exhibiting exponential growth; for other modes of growth, such as the logistic-growing class of CLL patients in Ref. , the parameter estimates would have to be derived specifically for the particular mode of growth observed. Exponential growth is the simplest common cancer growth pattern, and yet, estimating the exponential growth rates requires at least two longitudinal timepoints. To fit all parameters for patients with more complex growth dynamics, additional longitudinal samples will be needed; this type of analysis would be further limited due to the scarcity of longitudinal pre-treatment samples in many cancer types. In the case of solid tumors, the number of cells can be estimated from measurements of tumor volume , however multiple biopsies would potentially be needed to fully account for the existing genetic heterogeneity. Furthermore, a solid tumor’s spatial structure, mode of evolution, and biopsy collection influence how well selection and mutation spectra can be observed [30, 31, 65]. Recent modeling and computational work, in combination with careful multi-region sequencing and single cell sequencing, have begun to disentangle these confounding factors [26, 29, 30].
Our model and derivations assume a fixed mutation rate u after transformation and fixed growth rates of cancer subclones, similar to previous approaches [24, 30, 35]. Some individual cancer subclones (such as Clone 1 from Pt. 9) not only do not grow exponentially, they actually decline in absolute cell numbers, even if the overall tumor is undergoing expansion. This phenomenon has been previously observed [27, 66], and could be caused by the declining subclone getting outcompeted by more fit subclones. Sudden genomic instability events, or a change in cancer mutation and/or growth rate over time could also introduce errors into our parameter inferences. Recent sequencing data points to mutational processes that change over time during cancer evolution [20, 67]; incorporating possible changes in the mutation and/or growth rate into the model would require much higher density of sequencing and clinical data , as would employing a more complex growth model (e.g. boundary-driven or sigmoidal growth).
Materials and methods
Branching process model of tumor evolution
We employ a continuous, multi-type branching process model of cancer evolution. For the case of a single driver subclone, there are two cell types, type-0 and type-1. Tumor expansion is initiated by a single type-0, or initiated tumor cell. Type-0 cells divide with rate b and die with rate d, yielding a net growth rate of r = b − d. At time t1, a single driver mutation is introduced into a randomly selected cell in the type-0 population, founding a new type-1 population of cells. This type-1 population undergoes its own independent branching process. They divide with rate b1, die with rate d1, and have net growth rate r1 = b1 − d1. If the driver mutation gives type-1 cells a selective growth advantage over the type-0 population, then r1 > r. With the ratios of the growth rates denoted as s = r1/r, the growth advantage can be quantified as g = (s − 1) ⋅ 100%. In the case of neutral evolution, g = 0. If there is a selective advantage, g > 0. Neutral mutations, or passengers, have no effect on the cell’s fitness, and accrue according to a Poisson process with rate u. We assume an infinite alleles model such that there is no back mutation and an infinite sites model such that every new passenger mutation is unique. Only surviving populations are considered. All derivations below will condition on survival. The type-0 and type-1 populations at time t will be denoted as X0(t) and X1(t), respectively.
Measurements sufficient to determine evolutionary history
Here we derive estimates for parameters describing the carcinogenic process for a single driver subclone, using measurements taken from two time points late in the tumor’s development. We require sequencing of the tumor at the two time points, when the tumor is first observed at the unknown time t1 + t and a specified Δ later, at t1 + t + Δ. From these two bulk sequencing measurements, we obtain measurements of α1 and α2, the fraction of cells carrying the driver mutation at t1 + t and t1 + t + Δ, respectively. In addition, from the bulk sequencing at t1 + t + Δ, we obtain measurements of m, the number of mutations present in the founder type-1 cell, as well as γ, the number of mutations with frequency between the specified f1 and f2. The total population size at these times, M1 and M2, is also measured.
Expected value of γ, number subclonal mutations
For a population consisting of a single clone with birth and death rates b and d, the expected number of subclonal mutations present at a frequency larger than f is shown to be  (8) where δ = d/b and is the probability that a daughter cell gains a new passenger mutation at cell division. In this paper, we allow mutations to occur at any point in time and consider the absolute mutation rate per cell, u, which is equal to . Then the expected number of subclonal mutations between f1 and f2, , is (9) (10) where r = b − d > 0.
Now we derive in the case of clones 0 through k, each clone with growth rate ri > 0 and fraction . Each clone i has expected subclonal passengers between frequencies f1 and f2. Thus, the total expected number of passengers with frequencies between f1 and f2 is (11)
For the simplest case we consider, a tumor with a single driver mutation occurring in the initiated tumor population, there is a type-0 population with growth rate r and a type-1 population with growth rate r1. Eq (11) reduces to (12) where α is the fraction of cells having the driver mutation.
Derivation of estimates of evolutionary parameters for single driver subclone
With the cancer bulk sequenced at the two time points t1 + t and t1 + t + Δ, we are able to derive estimates for t1, t, r, r1, and u. First we solve for r and r1, based on the estimated cell counts at t1 + t and t1 + t + Δ. The observed type-i cell count is equated to the expected value of the type-i population size, conditioned on survival. For a birth-death process started with a single type-i cell at time 0, we have . That process has extinction probability di/bi . Then, (13) (14) (15) where is a random variable and indicator function defined as Thus, from (15), for large enough time t, (16) It then follows that for the type-0 population, (17) (18) Proceeding similarly for the type-1 population, we obtain (19) (20) The expected value of the first time a population of type-1 cells in a branching process reaches the observed size α1M1 is  (21) (22) (23) (24)
The last approximation is justified because for realistic cell counts, the first term in (23) dominates the other two, which is also evident in simulation studies (S5 Fig). For example, if , then the second term log(r1/b1) = −0.69, compared to the first term log(α1M1) = 19.11. Even if r1 is as low as 0.1b1, the second term is -2.30. In this case, the percent error of the approximation (24) is 7.3%. In general, the accuracy increases with increased tumor size.
Using the estimates of r and r1 from (19) and (20), and the measured value of γ from the second bulk sequencing, Eq (25) can be solved for the mutation rate u, (26) When estimating mutation rate for the CLL patients from Ref. , for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (26) is applied for each time point with the respective CCFs and observed γ values for each time point.
To derive the maximum likelihood estimates of t1, we consider the likelihood function P(m|t1). The number of passenger mutations present in the founder type-1 cell that appeared at time t1 is a Poisson process with rate u. Thus, (27) Maximizing the logarithm of the likelihood function with respect to t1 yields a MLE for t1 in terms of estimated or measured quantities: (28)
Estimating number of unobserved subclonal mutations from sequencing data
When sequencing data is post-processed by filtering out any mutations with L or fewer variant reads, the number of mutations between f1 and f2 will likely be underestimated if 2L/(Rp) > f1, where R is average sequencing coverage and p is tumor purity. Define γobs as the observed number of mutations between frequencies f1 and f2, after post-processing has been performed that filtered out any mutations with L or fewer variant reads. The expected number of subclonal mutations between frequencies f1 and x is given by (29) where c is a constant that will vary depending on the patient and sample. It can be fit on the sequencing data by noting (30) (31) Therefore, c can be estimated from the sequencing data as (32) Then, we can estimate γ as (33)
Number of passengers reaching fixation after t1
We estimate the number of passengers that occurred after t1 and reached fixation in the type-1 population in order to adjust the mobs mutation count. From , when mutations occur at cell division, the expected number of clonal passengers is . is the probability that a daughter cell gains a new passenger mutation at cell division, so the mutation rate is . For the type-1 population, δ = d1/b1 < 1. When mutations accrue over time, and not only at divisions, the expected number of clonal passengers is thus (34) Similarly, for a clone i, the expected number of passengers that occur after time ti and reach fixation is (35) where ri = bi − di > 0.
Simulation of tumor evolution and sequencing data
To assess the accuracy of the analytic results, we perform a continuous time Monte Carlo simulation to model tumor evolution and collection of sequencing data with an implementation of the Gillespie algorithm . Simulations are written in C/C++.
The type-j population has division rate bj, death rate dj, and mutation rate u. Mutations can occur at any point of the cell cycle, not just during division. zn is the number of type-j cells with passenger n as their most recent passenger mutation. The type-0 population is initiated with a single cell at time 0, and the type-j population for k ≥ j > 0 is initiated with a single cell at time tj. Let a be the vector recording the ancestor of new mutations. Element ai is the subclonal ancestor of the ith passenger mutation. For each j ∈ 0, 1, …, k, repeat 1–4 while time is less than tk + t + Δ.
- Set Γ = Nj(bj + dj + u). Time increment to next event time is randomly sampled from Exp[Γ].
- If j < k, if time is greater than or equal to tj+1 for first time, randomly select type-j subclone i to have driver mutation, remove one cell from type-j population count, and set Nj+1 = 1. Record the true value of mj+1, the number of passenger mutations present in the founder type-(j + 1) cell.
- Randomly select cell, with most recent passenger mutation i, to have the event.
- Determine which type of event and update population and mutation frequencies. Sample Y from Uniform[0, Γ] to determine event type:
- y ∈ (0, bj) → birth. , .
- y ∈ (bj, bj + dj) → death. , .
- y ∈ (bj + dj, bj + dj + u) → passenger mutation. Suppose it’s the pth passenger, , zp = 1. Update ancestor: ap = i.
- For j = 0, if time is less than t1 and population goes extinct, restart simulation. For j ≥ 1, if time is greater than tj and population goes extinct, restart type-j simulation at tj with a single cell.
- Reindex to remove extinct passenger mutations, and traverse back through ancestor vector a to sum total number of cells with each passenger.
Measurements are taken at bulk sequencing times tk + t and tk + t + Δ. If time is greater than or equal to tk + t, we measure and CCF of clone j as Nj/M1. Then an additional bulk sequencing measurement is taken at the final time tk + t + Δ, where we measure and the CCF of clone j as Nj/M2. At tk + t + Δ, we measure γ, the number of mutations with frequency between f1 and f2.
To measure mj,obs, the observed number of passengers in the founder type-j cell, we count the number of passengers present in all type-j cells. We also save the true value of mj.
For when we calculate a percent error of corrected and observed γ values in Fig 3D and S3(B) Fig, we simulate sequencing data by sampling from the mutation frequencies obtained in the Monte Carlo simulation, outlined above, using the approach of . Define average sequencing coverage as R, number of cells at time of sequencing as M, Zi as the number of cells with mutation i, Ri as read coverage, and χi as the true mutation frequency from Monte Carlo simulation. For each saved Monte Carlo simulation run, repeat the following 100 times:
- Generate read coverage: Ri ∼ Binomial[M, R/M].
- Generate number of cells carrying mutation i: Zi ∼ Binomial[Ri, χi/2].
- Post-processing. If there are L = 2 or fewer variant reads, discard mutation.
- Measure γobs, the observed number of subclonal mutations between frequencies f1 and f2: γobs = ∑i I(f1 ≤ 2Zi/R ≤ f2, Zi > L).
- Calculate the truth, γtrue, from the true mutation frequencies: γtrue = ∑i I(f1 ≤ χi ≤ f2).
Parameter values for simulations
For the simulations we consider three parameter sets corresponding to three modes of tumor evolution: a fast-growing tumor, slow-growing tumor, and tumor with no cell death, each with multiple mutation rates. We simulate three clonal structures: single driver subclone, two nested driver subclones, and two sibling driver subclones. All parameter values are listed in S1 Table. Mutation rate parameter values lie within observed genome wide point mutation rates per day . For simulation of parental clone and subclone, the fast-growing tumor dynamics are from . The slower growing tumor parameter regime has a reduced net growth of r = 0.025, compared to the fast-growing tumor’s net growth rate of r = 0.07.
Subclonal reconstruction of CLL sequencing data
The sequencing data from all CLLs analyzed is from Ref. , Supplementary Tables 2–4. As in that publication, we use PhylogicNDT  to perform subclonal reconstruction. We run the Cluster and BuildTree modules of PhylogicNDT on the longitudinal mutation data from Supplementary Table 3 of , using mutation alternate/reference counts, copy number, and tumor purity at all pre-treatment time points. Then for each patient, PhylogicNDT outputs a clonal reconstruction, which includes a phylogenetic tree of the subclones and posterior distributions of subclone CCFs. Additionally, it clusters mutations and assigns them to clones. We directly use subclone assignments and posteriors generated from PhylogicNDT. In our analysis we focus on estimating timing and growth rates of macroscopic subclones whose CCFs are greater than 20% for at least one pre-treatment time point.
Accounting for uncertainties in subclone frequencies and growth rates
Our estimates for parameters of cancer evolution require as input the information on the number of subclonal populations in the tumor, their CCFs and their phylogenetic relationships. In order to obtain this information, we use PhylogicNDT , which performs subclonal reconstruction of longitudinal cancer sequencing data. The uncertainty in subclone CCFs reported by PhylogicNDT affects our estimates for subclone growth rates, which in turn affect the estimates of mutation rate and time t between driver(s) and diagnosis. We account for this uncertainty by drawing from the CCF posterior distributions that are output by PhylogicNDT. Using these sampled CCF values, we then calculate growth rates, mutation rate u, and time t between driver(s) and diagnosis, thereby generating confidence intervals for these parameters due to CCF uncertainty.
To estimate subclonal growth rates, we fit an exponential growth curve to subclonal sizes measured at two or more time points. This regression yields fitted values for each clone’s growth rate and age. To account for uncertainty in the curve fit (in the case of more than two longitudinal samples), we sample the growth rates and age of clone from a bivariate normal distribution with mean equal to the fitted parameters and variance equal to the covariance matrix of the fitted parameters. When the estimated growth rate during the period of observation—including negative growth rates—is smaller than the minimal possible growth rate necessary to reach the observed clone size, we use the minimal growth rate for calculating mutation rate, time of the driver(s), time between driver(s) and diagnosis, and time of detectability.
Estimating time of cancer detectability
The time a cancer is detectable is the time at which the cancer exceeds the minimum observable size. For the CLL data, we estimate the time that the patients first exhibited an abnormally high WBC count, or leukocytosis, characterized by a WBC count of 11,500/μL , or approximately 5.75 x 1010 total WBCs, assuming a total blood volume of 5 L. In the previous section, we describe how we fit the growth dynamics for the CLL data and obtain a distribution of the fitted growth parameters. Here, we sample from the distribution of the fitted parameters 10,000 times (using the minimal growth rate in the case of a growth rate too low to give rise to the observed WBC count), and numerically solve for the time at which the total WBC count was equal to 5.75 x 1010. i.e., we numerically find the root with respect to ti of (36) where ti is the ith estimated time out of 10,000 estimates, f(⋅) is the exponential function describing the mean cancer growth, and is the ith random sample from the fitted growth parameters (intercept and growth rate).
Accounting for model uncertainty
The largest source of model uncertainty is the Poisson process for how mutations accumulate, which is used to estimate the time t1 of the driver mutation. In the fast-growing tumor simulation experiments, the time t1 had the largest error and variation (Fig 2). The estimate for t1 depends on the m mutations present in all cells in the driver subclone. The observed m is a single random sample from a Poisson distribution. To account for the uncertainty in t1 arising from m in the CLLs analyzed, we sample t1 from the posterior distributions P(t1|m). This source of model uncertainty due to the Poisson process will be most significant for cancers like CLL with a smaller number of mutations.
The time t between driver mutation and diagnosis is a random variable due to the stochasticity of cancer cell growth, and will naturally have a certain amount of variation. Time between driver event and diagnosis in a branching process follows a Gumbel distribution  and will have a constant variance. The mean, however, will increase with the logarithm of the cancer cell counts, which for the CLLs analyzed are ∼ 1011. The simulations of cancer evolution grow to smaller tumor sizes (∼ 105) and, as a result, the estimate for t has a significant amount of uncertainty (Fig 2). However, for time scales necessary to generate a tumor, the estimate for t will be quite accurate. For commonly observed tumor sizes, the stochastic fluctuations in the time for the cancer to reach that size will be smaller relative to the magnitude of the time. For a cancer with cell count ∼ 1011, the standard deviation of the time t will be less than 5% of its expected value.
Tumor with two nested driver subclones
Here we consider the case where there are two nested driver subclones (S4(A) Fig). “Nested” means that all cells carrying the second driver mutation also carry the first. Type-0, or initiated tumor, cells have birth rate b0, death rate d0, and net growth rate r0 = b0 − d0. Type-1 cells, which only have the first driver, have birth rate b1, death rate d1, and net growth rate r1 = b1 − d1. Type-2 cells, which carry both drivers, have birth rate b2, death rate d2, and net growth rate r2 = b2 − d2. The first driver occurred in a type-0 cell at time t1. The second driver occurred in a type-1 cell at . The mutation rate u is the same for all subclones.
At times and , the tumor is bulk sequenced. The bulk sequencing allows the measurement of the fraction of cells with driver 1 at time , α1; the fraction of cells with driver 2 at , α2; fraction of cells with driver 1 at time , β1; the fraction of cells with driver 2 at , β2; and the observed number of subclonal passenger mutations between frequencies f1 and f2, γobs. Note that the fraction of the population that is a type-1 cell at the two times is α1 − α2 and β1 − β2. The fraction of type-0 cells at the two bulk sequencing time points are 1 − α1 and 1 − β1. The total number of cells at bulk sequencing time points are M1 and M2. We then equate the estimated cell counts to the expected value of the type-i population size Xi, conditioned on survival. (37) (38) (39) (40) Solving the above equations for ri, we obtain the growth rate estimates: (41) (42) (43) The expected value of the first time a population of type-2 cells in a branching process reaches the observed size α2M1 , (44) (45) (46) where the approximation in (46) is justified as for (24).
By (11), (47) Using the estimates for r0, r1, and r2 from (41)–(43), and setting (47) equal to the value of γ obtained from (33) and the second bulk sequencing, u can be estimated: (48) When estimating mutation rate for the CLL patients from Ref. , for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (48) is applied for each time point with the respective CCFs and observed γ values for each time point.
Every type-1 cell carries the m1 passenger mutations that were present in the original type-1 cell when the first driver mutation occurred at t1. Similarly, every type-2 cell carries the m2 passengers that were present in the founder type-2 cell when the second driver mutation occurred at t2. Note, none of the m1 mutations are counted towards m2. Now we consider the likelihood function (49) (50) (51) Now, maximizing the logarithm of (51) with respect to t1 and , (52) (53)
The number of passengers present in the founder type-i cell cannot be directly observed, but we can measure mi obs, the number of passengers present in all type-i cells. An expected u/r1 passengers occurring after t1 in type-1 cells and reaching fixation in the type-1 subclone will be incorrectly included in m1 obs, rather than in m2 obs (see Methods). Similarly, an expected u/r2 passengers occurring after t2 in type-2 cells and reaching fixation in the type-2 subclone will be incorrectly included in m2 obs. Thus, (54) (55)
Tumor with two sibling driver subclones
Here we consider a tumor with two “sibling” driver mutations (S4(B) Fig). Sibling driver mutations are drivers that occur in separate subclones. In this case, cells are either initiated tumor cell (type-0), carry driver 1 (type-1), or carry driver 2 (type-2). No cells contain both drivers. Driver 1 occurred in a type-0 cell at time t1. Driver 2 occurred in a type-0 cell at t2. Type-0 cells have birth rate b0, death rate d0, and net growth rate r0 = b0 − d0. Type-1 cells, which carry driver 1, have birth rate b1, death rate d1, and net growth rate r1 = b1 − d1. Type-2 cells, which carry driver 2, have birth rate b2, death rate d2, and net growth rate r2 = b2 − d2. The mutation rate u is the same for all subclones.
Suppose time τi elapses between driver mutation i and tumor observation. Bulk sequencing of the tumor is performed at t1 + τ1 (or equivalently t2 + τ2), and a known Δ later. Sequencing the tumor allows the measurement of the fraction of cells with driver 1 at the first sequencing, α1; the fraction of cells with driver 2 at the first sequencing, α2; fraction of cells with driver 1 at the second sequencing, β1; the fraction of cells with driver 2 at the second sequencing, β2; and the number of subclonal passenger mutations between frequencies f1 and f2, γ. The fraction of type-0 cells at the two bulk sequencing time points are 1 − α1 − α2 and 1 − β1 − β2. The total number of cells at the two sequencing time points are M1 and M2.
The expected value of the first time a population of type-i cells in a branching process reaches the observed size αiM1 is  (62) (63) (64) where the approximation in (64) is justified as for (24).
By (11), (65)
Using the estimates for r0, r1, and r2 from (60) and (61), and setting (65) equal to the value of γ obtained from (33) and the second bulk sequencing, u can be estimated. (66) When estimating mutation rate for the CLL patients from Ref. , for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (66) is applied for each time point with the respective CCFs and observed γ values for each time point.
Every type-1 cell carries the m1 passenger mutations that were present in the original type-1 cell when the first driver mutation occurred at t1. Similarly, every type-2 cell carries the m2 passengers that were present in the founder type-2 cell when the second driver mutation occurred at t2. We assume that m1 and m2 don’t contain any shared mutations. In the CLL dataset we use, this is true. We consider the likelihood function P(m1, m2|t1, t2) (67) (68) Maximizing the logarithm of (68) with respect to t1 and t2 yields the maximum likelihood estimates: (69) (70) Using the same approach as in the case of a single driver, we obtain the corrections for the observed number of mutations present in all cells of each subclone: (71) (72)
Fully generalized estimates for any phylogeny of k drivers
Here we derive estimates for a completely general tumor phylogeny. Suppose a tumor has k driver mutations. In this general case, define a type-i cell as a cell where its most recent driver mutation was driver i. Note that a type-i cell can have between 0 and k − 1 other driver mutations. A phylogenetic reconstruction of the k driver mutations is necessary for the completely general case. From this phylogenetic tree, the ancestor of each subclone can be obtained. Define the function a(i) as the ancestor of the type-i population. That is, if all driver mutations contained in the type-i population are ordered, a(i) gives the driver mutation that occurred prior to i. Define ti as the time between when driver i occurred and when the type-i cells’ previous driver mutation occurred. At time of observation, assume the type-i population has κi total driver mutations, where 1 ≤ κi ≤ k for all 1 ≤ i ≤ k. Denote the time between the type-i’s κi, or last, driver mutation and when the tumor is observed as τi. This is the time between the founder type-i cell’s birth and tumor observation. Then the tumor is first observed and bulk sequenced at (equivalently τ0 for i = 0), where we denote aj as the jth iterate of the function a: (73) (74) The tumor is also bulk sequenced at (equivalently τ0 + Δ for i = 0). These assumptions allow for any subclone phylogeny, including combinations of the previously discussed sibling and nested subclone types.
The bulk sequencing allows the measurement of the fraction of cells with driver i at T1, αi; the fraction of cells with driver i at time T2, βi; and the number of subclonal passenger mutations between frequencies f1 and f2, γ. Again, the total number of cells at measurement times T1 and T2 are M1 and M2. To write the type-i frequencies, and , in terms of the driver frequencies, we subtract the fraction of cells descending from type-i cells but gaining additional driver mutation(s) after i, from the fraction of cells containing driver i: (75) (76) where δi,a(j) is the Kronecker delta, defined as We equate the estimated cell counts at the first bulk sequencing time point to the expected value of the type-i population size Xi, conditioned on survival. (77) And similarly, at the second bulk sequencing time point, (78) (79) Solving the above equations for ri, we obtain (80) By (11) (81)
Now, using the growth rate estimates ri and the subclone sizes, we can estimate each τi. The expected value of the first time a population of type-i cells in a branching process reaches the observed size is  (82) (83) (84) where the approximation in (84) is justified as for (24).
Using the (k + 1) ri estimates from (80), and setting (81) equal to the value of γ obtained at the second bulk sequencing from (33), u can be estimated: (85) When estimating mutation rate for the CLL patients from Ref. , for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (85) is applied for each time point with the respective CCFs and observed γ values for each time point.
The number of passengers present in the original type i founder cell cannot be directly observed, but we can measure mi, the number of clonal passengers present in the type i population, only including passengers not present in other clones. We will assume that the mi don’t contain any shared mutations, which is true for the CLL dataset we consider. The likelihood function P(m1, …, mk|t1, …, tk) is proportional to (86) Then, maximizing the logarithm of (86) with respect to t1, t2, …, tk, (87) The observed clonal passengers in the founder type-i cell will incorrectly include passengers that reached fixation in the type-i population after driver mutation i occurred, instead of correctly being counted toward the descendant of clone i. As a result, we again correct for the expected number of these passengers, u/ri. That is, (88)
S1 Fig. Percent errors (PEs) for case with no death.
Accuracy of parameter inferences for Monte Carlo simulation of tumor with no cell death for (a) single driver subclone with mutation rate u = 1, (b) single driver subclone with u = 10, (c) two nested subclones with u = 1, and (d) two sibling subclones with u = 1. Mean percent error (MPEs) are the black numbers above the plots, and mean absolute percent errors (MAPEs) are the grey numbers below the MPEs. Boxes contain 25th-75th quartiles, with median indicated by thick horizontal black line. Whiskers of boxplots indicate 2.5 and 97.5 percentiles. Violins are smoothed density estimates of the percent error datapoints. Complete parameter values and number of runs are included in S1 Table.
S2 Fig. Percent errors (PEs) for slow-growing tumor.
Accuracy of parameter inferences for surviving Monte Carlo simulation runs of slow-growing tumor for (a) single subclone with mutation rate u = 1, (b) single subclone with u = 5, (c) two nested subclones with u = 1, and (d) two sibling subclones with u = 1. Mean percent error (MPEs) are the black numbers above the plots, and mean absolute percent errors (MAPEs) are the grey numbers below the MPEs. Boxes contain 25th-75th quartiles, with median indicated by thick horizontal black line. Whiskers of boxplots indicate 2.5 and 97.5 percentiles. Violins are smoothed density estimates of the percent error data points. Complete parameter values and number of runs are included in S1 Table.
S3 Fig. Corrections for observed mutation counts.
(a) We compare percent errors of parameter estimates for time from tumor initiating until appearance of a driver subclone, t1, with and without the correction for passengers that occur after the driver and reach fixation in the driver population (Eq (6), main text). Errors for estimate with correction are shown in blue, and for estimate without correction (Eq (5), main text) in orange. Errors are plotted as a kernel density estimate for Monte Carlo simulations of fast-growing tumor with mutation rate u = 1. Mean percent errors (MPEs) and mean absolute percent errors (MAPEs) are listed. (b) The percent errors for the observed (orange) and corrected (blue) number of subclonal mutations between frequencies f1 and f2, γ, (Eq (7), main text) are plotted as kernel density estimates. Observed mutations are those that passed post-processing, i.e. those that have more than L = 2 mutant reads. True mutation frequencies were generated from 135 surviving runs of a Monte Carlo simulation of a fast-growing tumor with mutation rate u = 1, from which sequencing reads were simulated with 100x average coverage (see Materials and methods). Percent errors are calculated relative to the true γ measured from the true mutation frequencies.
S4 Fig. Model for tumor expansion with two driver mutations.
(a) Two nested driver subclones. Initiated tumor (type-0) cells in blue, cells with driver 1 (type-1) in red, and cells with both drivers (type-2) in orange. A driver mutation occurs in a type-0 cell at t1. A second driver mutation occurs in a type-1 cell at . Tumor is bulk sequenced at and . (b) Two sibling driver subclones. Type-0 cells (in blue). A driver mutation occurs in a type-0 cell at t1. A second driver mutation occurs in a different type-0 cell at t2. Tumor is bulk sequenced at t1 + τ1 (or, equivalently t2 + τ2) and t1 + τ1 + Δ (equivalently t2 + τ2 + Δ).
S5 Fig. Accuracy for t estimate increases with tumor size.
A Monte Carlo simulation of a birth-death process was performed for (a) fast-growing, (b) slow-growing, and (c) no cell death parameter regimes. For each of the 100 surviving simulated tumors, the percent error of the t estimate (Eq (3)) was calculated when the tumor first reached the specified tumor sizes. Means are indicated by red points and lines, ± one standard deviation is shown by the red region, and individual data points for each simulation run are shown as the grey points (with horizontal jitter for visibility).
S1 Table. Parameter values.
Parameter values and number of surviving runs for Monte Carlo simulations. For all simulations f1 = 0.01, f2 = 0.20, L = 2.
- 1. Nowell PC. The Clonal Evolution of Tumor Cell Populations. Science. 1976;194(4260):23–28. pmid:959840
- 2. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458(7239):719–724. pmid:19360079
- 3. Hanahan D, Weinberg RA. Hallmarks of Cancer: The Next Generation. Cell. 2011;144(5):646–674. pmid:21376230
- 4. Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell. 2018;173(2):371–385.e18. pmid:29625053
- 5. Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Identification of neutral tumor evolution across cancer types. Nature Genetics. 2016;48(3):238–244. pmid:26780609
- 6. Kimura M. Evolutionary Rate at the Molecular Level. Nature. 1968;217(5129):624–626. pmid:5637732
- 7. Kimura M. Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles*. Genetics Research. 1968;11(3):247–270.
- 8. Smith JM, Haigh J. The hitch-hiking effect of a favourable gene. Genetics Research. 1974;23(1):23–35. pmid:4407212
- 9. Turajlic S, Sottoriva A, Graham T, Swanton C. Resolving genetic heterogeneity in cancer. Nature Reviews Genetics. 2019;20(7):404–416. pmid:30918367
- 10. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer Genome Landscapes. Science. 2013;339(6127):1546–1558. pmid:23539594
- 11. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–218. pmid:23770567
- 12. Merlo LMF, Pepper JW, Reid BJ, Maley CC. Cancer as an evolutionary and ecological process. Nature Reviews Cancer. 2006;6(12):924–935. pmid:17109012
- 13. Pepper JW, Findlay CS, Kassen R, Spencer SL, Maley CC. SYNTHESIS: Cancer research meets evolutionary biology. Evolutionary Applications. 2009;2(1):62–70. pmid:25567847
- 14. Tsao JL, Yatabe Y, Salovaara R, Järvinen HJ, Mecklin JP, Aaltonen LA, et al. Genetic reconstruction of individual colorectal tumor histories. Proceedings of the National Academy of Sciences. 2000;97(3):1236–1241. pmid:10655514
- 15. Jones S, Chen Wd, Parmigiani G, Diehl F, Beerenwinkel N, Antal T, et al. Comparative lesion sequencing provides insights into tumor evolution. Proceedings of the National Academy of Sciences of the United States of America. 2008;105(11):4283–4288. pmid:18337506
- 16. Yachida S, Jones S, Bozic I, Antal T, Leary R, Fu B, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature. 2010;467(7319):1114–1117. pmid:20981102
- 17. Naxerova K, Brachtel E, Salk JJ, Seese AM, Power K, Abbasi B, et al. Hypermutable DNA chronicles the evolution of human colon cancer. Proceedings of the National Academy of Sciences. 2014;111(18):E1889–E1898. pmid:24753616
- 18. McGranahan N, Favero F, Bruin ECd, Birkbak NJ, Szallasi Z, Swanton C. Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Science Translational Medicine. 2015;7(283):283ra54–283ra54. pmid:25877892
- 19. Mitchell TJ, Turajlic S, Rowan A, Nicol D, Farmery JHR, O’Brien T, et al. Timing the Landmark Events in the Evolution of Clear Cell Renal Cell Cancer: TRACERx Renal. Cell. 2018;173(3):611–623.e17. pmid:29656891
- 20. PCAWG Evolution & Heterogeneity Working Group, PCAWG Consortium, Gerstung M, Jolly C, Leshchiner I, Dentro SC, et al. The evolutionary history of 2,658 cancers. Nature. 2020;578(7793):122–128. pmid:32025013
- 21. Sundermann LK, Wintersinger J, Rätsch G, Stoye J, Morris Q. Reconstructing tumor evolutionary histories and clone trees in polynomial-time with SubMARine. PLOS Computational Biology. 2021;17(1):e1008400. pmid:33465079
- 22. PCAWG Evolution and Heterogeneity Working Group, PCAWG Consortium, Rubanova Y, Shi R, Harrigan CF, Li R, et al. Reconstructing evolutionary trajectories of mutation signature activities in cancer using TrackSig. Nature Communications. 2020;11(1):731. pmid:32024834
- 23. Tomasetti C, Bozic I. The (not so) immortal strand hypothesis. Stem Cell Research. 2015;14(2):238–241. pmid:25700960
- 24. Werner B, Case J, Williams MJ, Chkhaidze K, Temko D, Fernández-Mateos J, et al. Measuring single cell divisions in human tissues from multi-region sequencing data. Nature Communications. 2020;11(1):1–9. pmid:32098957
- 25. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(43):18545–18550. pmid:20876136
- 26. Sun R, Hu Z, Sottoriva A, Graham TA, Harpak A, Ma Z, et al. Between-region genetic divergence reflects the mode and tempo of tumor evolution. Nature Genetics. 2017;49(7):1015–1024. pmid:28581503
- 27. Gruber M, Bozic I, Leshchiner I, Livitz D, Stevenson K, Rassenti L, et al. Growth dynamics in naturally progressing chronic lymphocytic leukaemia. Nature. 2019;570(7762):474–479. pmid:31142838
- 28. Salichos L, Meyerson W, Warrell J, Gerstein M. Estimating growth patterns and driver effects in tumor evolution from individual samples. Nature Communications. 2020;11(1):1–14. pmid:32024824
- 29. Noble R, Burri D, Le Sueur C, Lemant J, Viossat Y, Kather JN, et al. Spatial structure governs the mode of tumour evolution. Nature Ecology & Evolution. 2021. pmid:34949822
- 30. Chkhaidze K, Heide T, Werner B, Williams MJ, Huang W, Caravagna G, et al. Spatially constrained tumour growth affects the patterns of clonal selection and neutral drift in cancer genomic data. PLOS Computational Biology. 2019;15(7):e1007243. pmid:31356595
- 31. Fu X, Zhao Y, Lopez JI, Rowan A, Au L, Fendler A, et al. Spatial patterns of tumour growth impact clonal diversification in a computational model and the TRACERx Renal study. Nature Ecology & Evolution. 2021.
- 32. Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nature Genetics. 2018;50(6):895. pmid:29808029
- 33. Avanzini S, Kurtz DM, Chabon JJ, Moding EJ, Hori SS, Gambhir SS, et al. A mathematical model of ctDNA shedding predicts tumor detection size. Science Advances. 2020;6(50):eabc4308. pmid:33310847
- 34. Bozic I, Reiter JG, Allen B, Antal T, Chatterjee K, Shah P, et al. Evolutionary dynamics of cancer in response to targeted combination therapy. eLife. 2013;2:e00747. pmid:23805382
- 35. Dinh KN, Jaksik R, Kimmel M, Lambert A, Tavaré S. Statistical Inference for the Evolutionary History of Cancer Genomes. Statistical Science. 2020;35(1):129–144.
- 36. Lahouel K, Younes L, Danilova L, Giardiello FM, Hruban RH, Groopman J, et al. Revisiting the tumorigenesis timeline with a data-driven generative model. Proceedings of the National Academy of Sciences. 2020;117(2):857–864. pmid:31882448
- 37. Bozic I, Wu CJ. Delineating the evolutionary dynamics of cancer from theory to reality. Nature Cancer. 2020;1(6):580–588. pmid:35121980
- 38. Durrett R. Branching Process Models of Cancer. In: Durrett R, editor. Branching Process Models of Cancer. Mathematical Biosciences Institute Lecture Series. Cham: Springer International Publishing; 2015. p. 1–63. Available from: https://doi.org/10.1007/978-3-319-16065-8_1.
- 39. Tavaré S. The linear birth-death process: an inferential retrospective. Advances in Applied Probability. 2018;50(A):253–269.
- 40. Heyde A, Reiter JG, Naxerova K, Nowak MA. Consecutive seeding and transfer of genetic diversity in metastasis. Proceedings of the National Academy of Sciences. 2019;116(28):14129–14137.
- 41. Griffith M, Miller C, Griffith O, Krysiak K, Skidmore Z, Ramu A, et al. Optimizing Cancer Genome Sequencing and Analysis. Cell Systems. 2015;1(3):210–223. pmid:26645048
- 42. Haber DA, Velculescu VE. Blood-Based Analyses of Cancer: Circulating Tumor Cells and Circulating Tumor DNA. Cancer Discovery. 2014;4(6):650–661. pmid:24801577
- 43. Leshchiner I, Livitz D, Gainor JF, Rosebrock D, Spiro O, Martinez A, et al. Comprehensive analysis of tumour initiation, spatial and temporal progression under multiple lines of treatment. Bioinformatics; 2018. Available from: http://biorxiv.org/lookup/doi/10.1101/508127.
- 44. Myers MA, Satas G, Raphael BJ. CALDER: Inferring Phylogenetic Trees from Longitudinal Tumor Samples. Cell Systems. 2019;8(6):514–522.e5. pmid:31229560
- 45. Hallek M, Cheson BD, Catovsky D, Caligaris-Cappio F, Dighiero G, Döhner H, et al. iwCLL guidelines for diagnosis, indications for treatment, response assessment, and supportive management of CLL. Blood. 2018;131(25):2745–2760. pmid:29540348
- 46. Marionneaux SM, Keohane EM, Lamanna N, King TC, Mehta SR. Smudge Cells in Chronic Lymphocytic Leukemia: Pathophysiology, Laboratory Considerations, and Clinical Significance. Laboratory Medicine. 2021;52(5):426–438. pmid:33527134
- 47. Bozic I, Gerold JM, Nowak MA. Quantifying Clonal and Subclonal Passenger Mutations in Cancer Evolution. PLOS Computational Biology. 2016;12(2):e1004731. pmid:26828429
- 48. Kim J, Kim D, Lim JS, Maeng JH, Son H, Kang HC, et al. The use of technical replication for detection of low-level somatic mutations in next-generation sequencing. Nature Communications. 2019;10(1):1047. pmid:30837471
- 49. Song P, Chen SX, Yan YH, Pinto A, Cheng LY, Dai P, et al. Selective multiplexed enrichment for the detection and quantitation of low-fraction DNA variants via low-depth sequencing. Nature Biomedical Engineering. 2021;5(7):690–701. pmid:33941896
- 50. Fabre MA, de Almeida JG, Fiorillo E, Mitchell E, Damaskou A, Rak J, et al. The longitudinal dynamics and natural history of clonal haematopoiesis. Nature. 2022;606(7913):335–342. pmid:35650444
- 51. Blokzijl F, de Ligt J, Jager M, Sasselli V, Roerink S, Sasaki N, et al. Tissue-specific mutation accumulation in human adult stem cells during life. Nature. 2016;538(7624):260–264. pmid:27698416
- 52. Mitchell E, Spencer Chapman M, Williams N, Dawson KJ, Mende N, Calderbank EF, et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature. 2022;606(7913):343–350. pmid:35650442
- 53. Auslander N, Wolf YI, Koonin EV. In silico learning of tumor evolution through mutational time series. Proceedings of the National Academy of Sciences. 2019;116(19):9501–9510. pmid:31015295
- 54. Abascal F, Harvey LMR, Mitchell E, Lawson ARJ, Lensing SV, Ellis P, et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021;593(7859):405–410. pmid:33911282
- 55. PCAWG Mutational Signatures Working Group, PCAWG Consortium, Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578(7793):94–101. pmid:32025018
- 56. Wang Z, Xia Y, Mills L, Nikolakopoulos AN, Maeser N, Sheltzer JM, et al. Evolving copy number gains promote tumor expansion and bolster mutational diversification. Genomics; 2022. Available from: http://biorxiv.org/lookup/doi/10.1101/2022.06.14.495959.
- 57. Friberg S, Mattson S. On the growth rates of human malignant tumors: implications for medical decision making. Journal of Surgical Oncology. 1997;65(4):284–297. pmid:9274795
- 58. Rodriguez-Brenes IA, Komarova NL, Wodarz D. Tumor growth dynamics: insights into evolutionary processes. Trends in Ecology & Evolution. 2013;28(10):597–604. pmid:23816268
- 59. Talkington A, Durrett R. Estimating Tumor Growth Rates In Vivo. Bulletin of Mathematical Biology. 2015;77(10):1934–1954. pmid:26481497
- 60. Norton L. A Gompertzian model of human breast cancer growth. Cancer Research. 1988;48(24 Pt 1):7067–7071. pmid:3191483
- 61. Spratt JA, von Fournier D, Spratt JS, Weber EE. Decelerating growth and human breast cancer. Cancer. 1993;71(6):2013–2019. pmid:8443753
- 62. Gerlee P. The Model Muddle: In Search of Tumor Growth Laws. Cancer Research. 2013;73(8):2407. pmid:23393201
- 63. Vaghi C, Rodallec A, Fanciullino R, Ciccolini J, Mochel JP, Mastri M, et al. Population modeling of tumor growth curves and the reduced Gompertz model improve prediction of the age of experimental tumors. PLOS Computational Biology. 2020;16(2):e1007178. pmid:32097421
- 64. Carlsson G, Gullberg B, Hafström L. Estimation of liver tumor volume using different formulas?An experimental study in rats. Journal of Cancer Research and Clinical Oncology. 1983;105(1):20–23. pmid:6833336
- 65. West J, Schenck RO, Gatenbee C, Robertson-Tessi M, Anderson ARA. Normal tissue architecture determines the evolutionary course of cancer. Nature Communications. 2021;12(1):2060. pmid:33824323
- 66. Marusyk A, Tabassum DP, Altrock PM, Almendro V, Michor F, Polyak K. Non-cell-autonomous driving of tumour growth supports sub-clonal heterogeneity. Nature. 2014;514(7520):54–58. pmid:25079331
- 67. Petljak M, Alexandrov LB, Brammeld JS, Price S, Wedge DC, Grossmann S, et al. Characterizing Mutational Signatures in Human Cancer Cell Lines Reveals Episodic APOBEC Mutagenesis. Cell. 2019;176(6):1282–1294.e20. pmid:30849372
- 68. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry. 1977;81(25):2340–2361.
- 69. Bozic I, Paterson C, Waclaw B. On measuring selection in cancer from subclonal mutation frequencies. PLOS Computational Biology. 2019;15(9):e1007368. pmid:31557163
- 70. Keohane EM, Smith LJ, Walenga JM, editors. Rodak’s hematology: clinical principles and applications. Fifth edition ed. St. Louis, Missouri.: Elsevier/Saunders; 2016.