Inferring parameters of cancer evolution in chronic lymphocytic leukemia

As a cancer develops, its cells accrue new mutations, resulting in a heterogeneous, complex genomic profile. We make use of this heterogeneity to derive simple, analytic estimates of parameters driving carcinogenesis and reconstruct the timeline of selective events following initiation of an individual cancer, where two longitudinal samples are available for sequencing. Using stochastic computer simulations of cancer growth, we show that we can accurately estimate mutation rate, time before and after a driver event occurred, and growth rates of both initiated cancer cells and subsequently appearing subclones. We demonstrate that in order to obtain accurate estimates of mutation rate and timing of events, observed mutation counts should be corrected to account for clonal mutations that occurred after the founding of the tumor, as well as sequencing coverage. Chronic lymphocytic leukemia (CLL), which often does not require treatment for years after diagnosis, presents an optimal system to study the untreated, natural evolution of cancer cell populations. When we apply our methodology to reconstruct the individual evolutionary histories of CLL patients, we find that the parental leukemic clone typically appears within the first fifteen years of life.


Introduction
When a cell accrues a sequence of driver mutations-genetic alterations that provide a proliferative advantage relative to surrounding cells-it can begin to divide uncontrollably and eventually develop the complex features of a cancer [1][2][3]. Thousands of specific driver mutations have been implicated in carcinogenesis, with individual tumors harboring from few to dozens of drivers, depending on the cancer type [4]. Mutations that don't have a significant effect on cellular fitness also arise, both before and after tumor initiation [5]. These neutral mutations, or "passengers", can reach detectable frequencies by random genetic drift or the positive selection of a driver mutation in the same cell [6][7][8][9]. Mutational burden detectable by bulk sequencing reveals tens to thousands of passengers per tumor [10,11].
Genome sequencing technologies have revealed the heterogeneous, informative genetic profiles produced by the evolutionary process driving carcinogenesis [12,13]. These genetic profiles have been used to obtain insight into specific features of the carcinogenic process operating in individual patients. For example, the molecular clock feature of passenger mutations has been employed to measure timing of early events in tumor formation, as well as identify stages of tumorigenesis and metastasis [14][15][16][17][18][19][20][21][22]. Other studies have estimated mutation rates [5,23,24], selective growth advantages of cancer subclones [25][26][27][28], and the effect of spatial structure on cancer evolution [29][30][31]. We note that previous approaches typically only estimate one or a few parameters of cancer evolution. In addition, many state-of-the-art methods make use of computationally expensive approaches [24,30,32] or simplifying assumptions, such as approximating tumor expansion as deterministic or ignoring cell death [27,32]. Our approach relies on analytic formulas and sampling, which for realistic numbers of subclones and time points is efficient, and does not require simulation of tumor growth or computationally expensive model fitting.
Mathematical models of cancer progression, especially when used in conjunction with experimental and clinical data, can provide important insights into the evolutionary history of cancer [9,19,[33][34][35][36][37]. Branching processes-a type of a stochastic process-can be used to model how different populations of dividing, dying, and mutating cells in a tumor evolve over time [38]. Their theory and applications have been well developed to model the multistage nature of cancer development [25,29,35,[38][39][40]. Here we use a branching process model of carcinogenesis to derive a comprehensive reconstruction of an individual tumor's evolution.
Tumors can grow for many years, even decades, before they reach detectable size [16]. Typically, tumor samples used for sequencing would be obtained at the end of the tumor's natural, untreated progression. More recently, longitudinal sequencing, where a tumor is sequenced at multiple times during its development, has provided better resolution of tumor growth dynamics and evolution in various cancer types [27,[41][42][43][44]. Chronic lymphocytic leukemia (CLL) is an ideal system for studying cancer evolution because it can be monitored, via peripheral blood samples, without treatment until disease progression [45].
We establish that two longitudinal bulk sequencing and tumor size measurements are sufficient to reconstruct virtually all parameters (mutation rate, growth rates, times of appearance of driver mutations, and time since the driver mutation) of cancer evolution in individual patients. Our analytic approach yields simple formulas for the parameters; thus, estimation of the parameters governing cancer growth is not computationally intensive, regardless of tumor size. Our framework makes possible a personalized, high-resolution reconstruction of a cancer's timeline of selective events and quantitative characterization of the evolutionary dynamics of the subclones making up the cancer cell population.

Model
We consider a multi-type branching process of tumor expansion (Fig 1A). Tumor growth is started with a single initiated cell at time 0. Initiated tumor cells divide with rate b and die with rate d. These cells already have the driver mutations necessary for expansion, so we assume b > d. The population of initiated cells can go extinct due to stochastic fluctuations, or survive stochastic drift and start growing (on average) exponentially with net growth rate r = b − d. We will focus only on those populations that survived stochastic drift. (blue) divide with birth rate b, die with death rate d, and accrue passenger mutations with mutation rate u. Type-1 cells, which carry the driver mutation, divide with birth rate b 1 , die with death rate d 1 , and accrue passenger mutations with mutation rate u. (b) The initiated tumor, or type-0, (blue) population growth is initiated from a single cell. A driver mutation occurs in a single type-0 cell at time t 1 , starting the type-1 population (red). The tumor sample is collected and bulk sequenced at times t 1 + t and t 1 + t + Δ, where the driver fraction is α 1 and α 2 , respectively. Tumor size (in number of cells) is M 1 and M 2 at first and second sample collection dates. (c) By the time the tumor is observed, it has a high level of genetic heterogeneity due to the mutations that have accrued in both type-0 (blue) and type-1 populations (red). Each yellow star represents a different passenger mutation.
At some time t 1 > 0 a new driver mutation occurs in a single initiated tumor cell, starting a new independent birth-death process, with birth rate b 1 and death rate d 1 (Fig 1B). Net growth rate of cells with the new driver is r 1 = b 1 − d 1 . The new driver increases the rate of growth, i.e., r 1 > r. We define the driver's selective growth advantage by g = (r 1 /r − 1). In addition, both populations of cells (with and without the driver) accrue passenger mutations with rate u (Fig 1C).
After the driver mutation occurs, an additional time t passes before the tumor is observed. Type-0 cells are original initiated tumor and type-1 cells contain the driver mutation. In Materials and methods we also analyze the more general case of two nested or sibling driver mutations, as well as the fully generalized case of any clonal structure that might arise during tumor expansion.

Parameter estimates from two longitudinal measurements
We demonstrate that with two longitudinal bulk sequencing measurements, it is possible to accurately estimate net growth rates, time of appearance of a driver mutation, time between a driver mutation and observation, and mutation rate in the tumor. The tumor is first sequenced at time of observation, t 1 + t, where both time of driver mutation, t 1 , and time from driver mutation to observation, t, are yet unknown (Fig 1B). A second bulk sequencing is performed at t 1 + t + Δ, a known Δ time units after the tumor is first observed (Fig 1B). Later, we apply our method to the CLL data from Ref. [27], where the average size of Δ for all the pre-treatment samples sequenced is 1.8 years (0.6-4.9 years). In general, we expect that in the case of smaller Δ values measurement errors would have a larger effect on the estimated growth rates, due to an expected smaller change in cancer cell count and subclonal structure during a smaller time interval. From the bulk sequencing data, the fraction of cells carrying the driver mutation, α 1 and α 2 , can be measured at the time points t 1 + t and t 1 + t + Δ, respectively. We denote total number of cells in the tumor at the two bulk sequencing time points as M 1 and M 2 . For liquid cancers, cell counts of the relevant cancer cell population serve as indicators of cancer progression. In the case of CLL, white blood cell (WBC) count is useful as a measure of tumor burden in peripheral blood, as it is routinely taken and includes the cancerous cell population. More precise estimates of tumor burden would include absolute lymphocyte count (ALC) and number of B lymphocytes. Both ALC and WBC counts can suffer from inaccuracies due to the prevalence of smudge cells in CLL, often resulting in an underestimate of these counts [46].
Equating expected values of the sizes of the type-0 and type-1 population at the two bulk sequencing time points with the measured numbers of cells present in clones 0 and 1, we obtain estimates of the net growth rates of the two subclones: From the growth rate estimates and subclone sizes, we can approximate the expected value of the time a population in a branching process takes to reach an observed size [38]. This yields an estimate of the time t from the appearance of driver mutation until observation: Using the bulk sequencing data from the second time point, γ, the number of subclonal passengers between the specified frequencies f 1 and f 2 , can be measured. Using results from previous work [47], we derive the expected value of γ (Materials and methods), which can be used to estimate the mutation rate u: The m passenger mutations that were present in the original type-1 cell when the driver mutation occurred (Fig 1C) are present in all type-1 cells. m can be estimated from bulk sequencing data and used to estimate time of appearance of the driver. We maximize the likelihood function P(m|t 1 ) with respect to time of appearance of the driver, t 1 , (see Materials and methods) to obtain the maximum likelihood estimate Using formulas (4) and (5), we can now estimate t 1 .

Estimates verified in simulated tumors
To assess the accuracy of the parameter estimates for several modes of tumor evolution, we simulate tumor growth by performing a Monte Carlo simulation, which simulates the birth, death, and accumulation of mutations in the individual cells that make up a tumor. This simulation generates the mutation frequency and tumor size data used by the estimates (see Methods section for details of simulation). We simulate three different types of tumors (slow growing, fast growing, and no cell death), with a high and a low mutation rate for each (S1 Table).
In a simulation of a fast-growing tumor with a single subclonal driver mutation that confers a strong selective growth advantage of 100%, we can accurately estimate growth rates, mutation rate, time of driver event, and time since driver event (Fig 2A and 2B). Growth rates of both initiated tumor and driver subclones can be estimated with a high degree of accuracy, achieving mean percentage error (MPE) of 0.03% and -0.07% for the lower mutation rate (u = 1) scenario. The mutation rate u and estimates for time of driver appearance, t 1 , and time since driver, t, can also be estimated accurately, with MPEs of -0.9%, 3.8%, and -0.4%, respectively. Estimates for u, t 1 , and t have a somewhat greater degree of variation compared to the growth rate estimates, due to the inherent randomness of the number of mutations and time to reach the observed size that occur in each realization of the stochastic process.
For the parameter regime with no cell death and the regime for a slow-growing tumor, we again achieve high accuracies for the net growth rates (S1(A), S1(B), S2(A) and S2(B) Figs). In the lower mutation rate (u = 1) scenario, parameter estimates for the mutation rate u and time of driver appearance t 1 can be accurately estimated for both regimes, with MPEs of -1.3% and 4.9% for the no cell death case, and MPEs of -3% and 3.7% for the slow-growing tumor.
We note that the estimator for t (time since driver event) is biased, with the extent depending on the ratio of birth rate to net growth rate, and the tumor size. The underlying cause of the bias is due to a simplifying assumption in the estimator's derivation (see Methods, "Derivation of estimates of evolutionary parameters"), and this bias decreases as tumor size increases and as the ratio of growth and division rate gets closer to 1. For the three main modes of growth in our study, we performed additional Monte Carlo simulations to precisely quantify the effect of death:birth ratio and tumor size on the estimator's accuracy (S5 Fig). For all three modes of growth, we observe a monotonic decrease in error as tumor size increases to more clinically realistic sizes. For a tumor size of 10 9 , all modes of growth have a MPE of less than 4%, so for a clinically realistic cancer size-10 11 for the CLL dataset-we expect an even better accuracy.
We also perform Monte Carlo simulations for the more complex cases of two nested and two sibling driver subclones (see Methods for derivations of estimators) for the same three modes of cancer growth used for the single driver subclone case above: fast growth (Fig 2C  and 2D), no cell death (S1(C) and S1(D  Accuracy of parameter inferences from simulated data. We simulated tumor growth by performing a Monte Carlo simulation, which simulates the birth, death, and accumulation of mutations in the individual cells that make up a tumor, and generates the mutation frequency and tumor size data used by the estimates. Simulations are of fast-growing tumors with (a) single driver subclone and mutation rate u = 1, (b) single driver subclone and u = 3, (c) two nested driver subclones with u = 1, and (d) two sibling driver subclones with u = 1. Mean percent errors (MPEs) of estimates are shown in black above the plots, and mean absolute percent errors (MAPEs) are shown in gray. Boxes contain 25th-75th quartiles, with median indicated by thick horizontal black line. Whiskers of boxplots indicate 2.5 and 97.5 percentiles. Violins are smoothed density estimates of the percent error data points. Complete parameter values and number of runs are included in S1 Table. https://doi.org/10.1371/journal.pcbi.1010677.g002 estimates for the nested and sibling subclone simulations have a greater variance. The estimate for t-time between the last driver mutation and diagnosis-shows good accuracy for the fastgrowing tumors, but larger errors for the no cell death and slow growth cases. For both the nested and sibling simulations, the estimates for the times of driver mutations 1 and 2 (t 1 and t 2 , respectively) have MPEs less than 6%.

Correcting mutation counts observed from genome sequencing data
We note that in our estimate for the time of appearance of the driver, t 1 (see formula (5)), used for comparison to simulated data, we employed a correction to m, the number of mutations that were present in the founder type-1 cell at t 1 . From sequencing data, these m mutations are indistinguishable ( Fig 3A) from mutations that occurred after t 1 in type-1 cells and reached fixation in the type-1 population [47]. Thus, the value of m observed from sequencing data, m obs , (a) If passenger mutations (circles with stars) that occur after the driver reach fixation in the driver population (red), then they are indistinguishable from the passengers that were present in the first cell with the driver, which accrued in the type-0 population (blue). The estimate of when the driver occurred needs to account for these mutations (circled). In (b), we compare percent errors of parameter estimates for time from tumor initiation until appearance of a driver subclone, t 1 , with and without this correction (Eq (6)). Errors for estimate with correction are shown in blue, and for estimate without correction (Eq. (5)) in orange. Errors are plotted as a kernel density estimate for Monte Carlo simulations of slow-growing tumor with mutation rate u = 5. Mean percent errors (MPEs) and mean absolute percent errors (MAPEs) are listed. (c) Mutations present on two or fewer variant reads (red) are filtered out in post-processing. Mutations with more than two variant reads (black) are included. The number of subclonal mutations between frequencies f 1 and f 2 , γ, which is used in the mutation rate estimate, must be corrected for mutations that are filtered out. In (d), the percent errors for the observed (orange) and corrected (blue) γ (Eq (7)) are plotted as kernel density estimates. Observed mutations are those that passed post-processing, i.e. those that have more than L = 2 mutant reads. True mutation frequencies were generated from 135 surviving runs of a Monte Carlo simulation of a fastgrowing tumor with mutation rate u = 1, from which sequencing reads were simulated with 200x average coverage (see Materials and methods). Percent errors are calculated relative to the true γ measured from the true mutation frequencies.
https://doi.org/10.1371/journal.pcbi.1010677.g003 will overestimate the true m. In Materials and methods we show that the expected value of the number of passengers that occurred after t 1 and reached fixation in the type-1 population is u/r 1 . We subtract this correction factor from m obs : The correction for the m mutations present in the original type-1 cell (6) at time t 1 improves the accuracy of the estimate for time of appearance of driver mutation t 1 . For the fast-growing tumor with mutation rate u = 1 (S3(A) Fig), the correction lowers the mean percent error (MPE) of the t 1 estimate from 14.0% to 3.8%. For the slow-growing tumor with mutation rate u = 5 (Fig 3B), the correction lowers the MPE of the t 1 estimate from 22.0% to 5.7% (Fig 3B).
Another issue arises from obtaining mutation count γ, number of mutations with frequency between f 1 and f 2 , from genome sequencing data. When sequencing data is post-processed by filtering out mutations with L or fewer variant reads, low-frequency mutations will be difficult to detect [35] (Fig 3C). For a sample with average sequencing coverage of R and tumor purity p, mutations with mutant allele frequency below L/(pR) will typically not be observable. As a result, since mutations with frequencies between f 1 and f 2 count towards γ, if f 1 � 2L/(pR), the observed number of subclonal mutations between frequencies f 1 and f 2 , γ obs , will underestimate the true value, γ. For cancers with low mutational burden, such as CLL, we set a relatively low f 1 (1%) to have sufficient resolution to infer mutation rate. Consequently, some mutations with frequency above f 1 will likely be filtered out, and we account for this by correcting for the expected number of such subclonal mutations present at cancer cell frequencies (CCFs) between f 1 and 2L/(pR) (see Materials and methods): Before applying our methodology to patient sequencing data, we estimated the validity of the above correction applied to observed simulated mutation counts. When we simulate sequencing reads from simulated mutation frequencies (see Materials and methods) and postprocess by removing mutations with L = 2 or fewer variant reads, the adjustment we derived for mutation count γ (7) is critical, even for average sequencing coverage of 200x ( Fig 3D). Without any correction, the observed γ has MPE of -53.3% compared to true γ, but with the correction, the computed γ has MPE of -1.4%. When average coverage is 100x, this correction becomes even more important, as many of the low-frequency mutations are discarded (S3(B) Fig). Without any correction, the observed γ has MPE of -79.7%. With the correction the computed γ has MPE of -3.4%. The accuracy of the γ measurement affects our estimate of the mutation rate (4).

Estimating parameters for individual patients with CLL
We use our formulas to infer the patient-specific parameters of cancer evolution for four patients with CLL whose growth patterns and clonal dynamics were analyzed in [27]. These CLLs had peripheral WBC counts measured and whole exome sequencing (WES) performed at least twice before treatment. We consider patients whose WBC counts were classified as having an exponential-like growth pattern, with average γ obs > 2, and with 3 or fewer macroscopic subclones (i.e. subclones with cancer cell fractions of 20% or greater for at least one pretreatment time point). Our framework is designed specifically to study naturally evolving cancer dynamics, unperturbed by treatment, which will drastically alter the cancer's dynamics and size. For calculation of the γ obs mutations between frequencies f 1 and f 2 , we set f 1 = 1% due to the difficulty of detecting low frequency variants <1% [48,49]. We set f 2 to 20% to minimize overlap with potential driver mutations of the macroscopic subclones. The average γ obs for the four analyzed patients ranges from 2.5 to 19.3, with a median of 5.2. As in Ref. [27], we perform subclonal reconstruction for each patient using PhylogicNDT [43]. To obtain confidence intervals for our parameter estimates, we utilize a sampling procedure to account for model and measurement uncertainties, including uncertainties in subclone frequencies, fitted growth curves, and the Poisson process for mutation accumulation (see Materials and methods). For each patient's tumor, we compute estimates of the growth rate of each clone, exome mutation rate, the times that each subclone arose, and how long each subclone expanded before the tumor was detected (Tables 1 and 2). We also estimate what time the cancer was clinically detectable, by sampling from the distribution of fitted growth parameters and solving the resulting root-finding problem for time to reach detectable size under our growth model (see Materials and methods). For CLL specifically, we compute time of leukocytosis-an abnormally high WBC count. We reconstruct these histories for tumors with various clonal structures.  [10.8, 24.0] years old, and grew more quickly than Clone 0, with a selective growth advantage of �90% over Clone 0). Clone 1 contained a FGFR1 mutation, which might have been acting as a driver of the increased net proliferation. Clone 1 then grew for �15 years before the patient was diagnosed at age 35.
Patients 6 and 9 present more complex clonal structures (Fig 4). Clone 0, the parental clone of the CLL of Patient 9, arose when the patient was 4  [50], we found that sometimes the estimated growth rate during the period of observation, such as the negative growth rate of Clone 1, is smaller than the minimal possible growth rate necessary to reach the observed clone size. In that case, for calculating mutation rate, time of the driver(s), time of detectability, and time between driver(s) and diagnosis we use the minimal growth rate. Clone 2, containing a KRAS mutation, had the largest net growth rate of the three clones (0. 67  years old. Clone 3 harbored a driver mutation in ASXL1 and had selective growth advantage of 60.8% over Clone 0. The patient was diagnosed at age 58, eventually needing treatment 12.0 years after diagnosis.
The average mutation rate in the four CLL patients we analyze is 0.30 mutations/year. This rate is over the exome, which accounts for �1% of the human genome. Our average estimated mutation rate in CLL exomes is similar to the measured rate of accumulation of mutations in human tissues of 40 mutations per year over the entire genome [51]. Other recent work has estimated a mutation rate of 17 mutations per year in human haematopoietic stem cell/multipotent progenitors [52]. Our estimated mutation rates during CLL progression are on par or higher than the recent estimates in healthy hematopoietic cells [52], in line with the expectation that mutation rates may be increased in cancer. The estimated times of appearance of CLL subclones are very long, on the order of 10 years or more. This finding is in agreement with results from Gruber et al. [27], who find few new CLL subclones over years to a decade of evolution. We observe that CLL initiation occurred early in most patients, within the first fifteen years of their lives, consistent with recent work in other cancer types [19,36]. We find that CLL patients reach leukocytosis an average of 1.5 years before the first timepoint at which cancer genome sequencing was performed. For three of the patients, our estimated time of leukocytosis was before diagnosis, on average 1.3 years prior to diagnosis. Reconstructing the timeline of CLL evolution in patients. We applied our methodology to estimate subclonal growth rates, mutation rates and evolutionary timelines in CLL tumors from Ref. [27]. Vertical height of a clone represents its log 10 -scaled size. Mutations were clustered into clones and phylogenetic trees were inferred using PhylogicNDT [43]. Tree edges are colored by clone number and are labeled with driver mutations, if any. For each patient, we show estimates for patient age at CLL initiation and times of appearance of CLL subclones. Dashed white line indicates when the patient was diagnosed. Solid black arrows indicate times of bulk sequencing measurements.

Discussion
We use a stochastic branching process model to reconstruct the timing of driver events and quantify the evolutionary dynamics of different subclonal populations of cancer cells. We estimate growth rates of tumor subclones, selective growth advantage of individual driver mutations, mutation rate in the tumor, time between tumor initiation and appearance of a subclonal driver mutation, and time between driver mutation and tumor observation. Together, this allows us to estimate the age of the patient at tumor initiation, as well as the age at appearance of a subclonal driver.
Previous work has computed relative order of driver events [18,21,53], while other studies have given estimates for scaled mutation rates and time of events [24,32]. However, we present estimates for absolute, unscaled mutation rates and times, which are easily interpretable and don't implicitly depend on unknown parameters. We assume that mutations accrue with time, which simplifies derivations and is supported by recent experimental data that shows that non-dividing cells may accrue mutations at a similar rate as dividing cells [54]. Other potential assumptions regarding mutation accumulation include mutations occurring at cell division [55] or assuming mutation rate is proportional to the copy number state [56]. For example, recent work reported that some mutational signatures in human cancers are generated during mitosis [55]. Other work has shown that the rate of accumulation of somatic single nucleotide variants is proportional to copy number [56]. We further assume that all cancer subpopulations have the same passenger mutation rate. In the case that mutations occur predominantly at cell division, assuming that the rate of cell division is comparable across all tumor subclones, our estimates would still be valid. In the case of a subclone that has an elevated mutation rate (e.g. due to a chromosomal amplification, mutation in a DNA repair pathway gene or an increased cell division rate), we would underestimate the mutation rate and overestimate the time of driver mutation(s) in that subclone. In the other subclones, the opposite would be true.
For individual CLLs that underwent bulk sequencing at two time points [27], we infer growth rates of individual subclones, mutation rate in the tumor, the times when cancer subclones began growing, the time between driver mutations and the patient's diagnosis, and time when the cancer is clinically observable. Our inferences are limited by the relatively low number of mutations present in CLL, as well as sequencing coverage [27], so we set a minimum passenger mutation count when selecting specific cases to analyze. The accuracy of estimates presented here is expected to be higher with whole genome sequencing available, with higher sequencing coverage, or in cancer types with more mutations, with some important limitations. Exponential growth-the mean behavior of our branching process model-has been well documented in vivo [27,[57][58][59], but tumors can also often exhibit sigmoidal growth (e.g. logistic, Gompertz models), where initial exponential growth is followed by a deceleration in growth [58,[60][61][62][63]. Our estimators should only be used for cancers exhibiting exponential growth; for other modes of growth, such as the logistic-growing class of CLL patients in Ref. [27], the parameter estimates would have to be derived specifically for the particular mode of growth observed. Exponential growth is the simplest common cancer growth pattern, and yet, estimating the exponential growth rates requires at least two longitudinal timepoints. To fit all parameters for patients with more complex growth dynamics, additional longitudinal samples will be needed; this type of analysis would be further limited due to the scarcity of longitudinal pre-treatment samples in many cancer types. In the case of solid tumors, the number of cells can be estimated from measurements of tumor volume [64], however multiple biopsies would potentially be needed to fully account for the existing genetic heterogeneity. Furthermore, a solid tumor's spatial structure, mode of evolution, and biopsy collection influence how well selection and mutation spectra can be observed [30,31,65]. Recent modeling and computational work, in combination with careful multi-region sequencing and single cell sequencing, have begun to disentangle these confounding factors [26,29,30].
Our model and derivations assume a fixed mutation rate u after transformation and fixed growth rates of cancer subclones, similar to previous approaches [24,30,35]. Some individual cancer subclones (such as Clone 1 from Pt. 9) not only do not grow exponentially, they actually decline in absolute cell numbers, even if the overall tumor is undergoing expansion. This phenomenon has been previously observed [27,66], and could be caused by the declining subclone getting outcompeted by more fit subclones. Sudden genomic instability events, or a change in cancer mutation and/or growth rate over time could also introduce errors into our parameter inferences. Recent sequencing data points to mutational processes that change over time during cancer evolution [20,67]; incorporating possible changes in the mutation and/or growth rate into the model would require much higher density of sequencing and clinical data [37], as would employing a more complex growth model (e.g. boundary-driven or sigmoidal growth).

Branching process model of tumor evolution
We employ a continuous, multi-type branching process model of cancer evolution. For the case of a single driver subclone, there are two cell types, type-0 and type-1. Tumor expansion is initiated by a single type-0, or initiated tumor cell. Type-0 cells divide with rate b and die with rate d, yielding a net growth rate of r = b − d. At time t 1 , a single driver mutation is introduced into a randomly selected cell in the type-0 population, founding a new type-1 population of cells. This type-1 population undergoes its own independent branching process. They divide with rate b 1 , die with rate d 1 , and have net growth rate r 1 = b 1 − d 1 . If the driver mutation gives type-1 cells a selective growth advantage over the type-0 population, then r 1 > r. With the ratios of the growth rates denoted as s = r 1 /r, the growth advantage can be quantified as g = (s − 1) � 100%. In the case of neutral evolution, g = 0. If there is a selective advantage, g > 0. Neutral mutations, or passengers, have no effect on the cell's fitness, and accrue according to a Poisson process with rate u. We assume an infinite alleles model such that there is no back mutation and an infinite sites model such that every new passenger mutation is unique. Only surviving populations are considered. All derivations below will condition on survival. The type-0 and type-1 populations at time t will be denoted as X 0 (t) and X 1 (t), respectively.

Measurements sufficient to determine evolutionary history
Here we derive estimates for parameters describing the carcinogenic process for a single driver subclone, using measurements taken from two time points late in the tumor's development. We require sequencing of the tumor at the two time points, when the tumor is first observed at the unknown time t 1 + t and a specified Δ later, at t 1 + t + Δ. From these two bulk sequencing measurements, we obtain measurements of α 1 and α 2 , the fraction of cells carrying the driver mutation at t 1 + t and t 1 + t + Δ, respectively. In addition, from the bulk sequencing at t 1 + t + Δ, we obtain measurements of m, the number of mutations present in the founder type-1 cell, as well as γ, the number of mutations with frequency between the specified f 1 and f 2 . The total population size at these times, M 1 and M 2 , is also measured.

Expected value of γ, number subclonal mutations
For a population consisting of a single clone with birth and death rates b and d, the expected number of subclonal mutations present at a frequency larger than f is shown to be [47] � uð1 À f Þ ð1 À dÞf ð8Þ where δ = d/b and � u is the probability that a daughter cell gains a new passenger mutation at cell division. In this paper, we allow mutations to occur at any point in time and consider the absolute mutation rate per cell, u, which is equal to � ub. Then the expected number of subclonal mutations between f 1 and f 2 , Eg, is where r = b − d > 0. Now we derive Eg in the case of clones 0 through k, each clone with growth rate r i > 0 and fraction a c i . Each clone i has a c i u r i ð1=f 1 À 1=f 2 Þ expected subclonal passengers between frequencies f 1 and f 2 . Thus, the total expected number of passengers with frequencies between f 1 and f 2 is For the simplest case we consider, a tumor with a single driver mutation occurring in the initiated tumor population, there is a type-0 population with growth rate r and a type-1 population with growth rate r 1 . Eq (11) reduces to where α is the fraction of cells having the driver mutation.

Derivation of estimates of evolutionary parameters for single driver subclone
With the cancer bulk sequenced at the two time points t 1 + t and t 1 + t + Δ, we are able to derive estimates for t 1 , t, r, r 1 , and u. First we solve for r and r 1 , based on the estimated cell counts at t 1 + t and t 1 + t + Δ. The observed type-i cell count is equated to the expected value of the type-i population size, conditioned on survival. For a birth-death process started with a single type-i cell at time 0, we have E½X i ðtÞ� ¼ e r i t . That process has extinction probability d i /b i [38]. Then, where I X i ðtÞ>0 is a random variable and indicator function defined as (15), for large enough time t, It then follows that for the type-0 population, Proceeding similarly for the type-1 population, we obtain The expected value of the first time a population of type-1 cells in a branching process reaches the observed size α 1 M 1 is [38] The last approximation is justified because for realistic cell counts, the first term in (23) dominates the other two, which is also evident in simulation studies (S5 Fig). For example, if r 1 ¼ 1 2 b 1 , then the second term log(r 1 /b 1 ) = −0.69, compared to the first term log(α 1 M 1 ) = 19.11. Even if r 1 is as low as 0.1b 1 , the second term is -2.30. In this case, the percent error of the approximation (24) is 7.3%. In general, the accuracy increases with increased tumor size.
With the measurement of γ, the number of subclonal passengers with frequency between f 1 and f 2 , we can estimate the mutation rate u. In the previous section we derive the expected value of γ as Using the estimates of r and r 1 from (19) and (20), and the measured value of γ from the second bulk sequencing, Eq (25) can be solved for the mutation rate u, When estimating mutation rate for the CLL patients from Ref. [27], for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (26) is applied for each time point with the respective CCFs and observed γ values for each time point.
To derive the maximum likelihood estimates of t 1 , we consider the likelihood function P(m| t 1 ). The number of passenger mutations present in the founder type-1 cell that appeared at time t 1 is a Poisson process with rate u. Thus, Maximizing the logarithm of the likelihood function with respect to t 1 yields a MLE for t 1 in terms of estimated or measured quantities:

Estimating number of unobserved subclonal mutations from sequencing data
When sequencing data is post-processed by filtering out any mutations with L or fewer variant reads, the number of mutations between f 1 and f 2 will likely be underestimated if 2L/(Rp) > f 1 , where R is average sequencing coverage and p is tumor purity. Define γ obs as the observed number of mutations between frequencies f 1 and f 2 , after post-processing has been performed that filtered out any mutations with L or fewer variant reads. The expected number of subclonal mutations between frequencies f 1 and x is given by where c is a constant that will vary depending on the patient and sample. It can be fit on the sequencing data by noting Therefore, c can be estimated from the sequencing data as Then, we can estimate γ as

PLOS COMPUTATIONAL BIOLOGY
Inferring parameters of cancer evolution in chronic lymphocytic leukemia

Number of passengers reaching fixation after t 1
We estimate the number of passengers that occurred after t 1 and reached fixation in the type-1 population in order to adjust the m obs mutation count. From [47], when mutations occur at cell division, the expected number of clonal passengers is d� u=ð1 À dÞ. � u is the probability that a daughter cell gains a new passenger mutation at cell division, so the mutation rate is u ¼ � ub 1 . For the type-1 population, δ = d 1 /b 1 < 1. When mutations accrue over time, and not only at divisions, the expected number of clonal passengers is thus Similarly, for a clone i, the expected number of passengers that occur after time t i and reach fixation is

Simulation of tumor evolution and sequencing data
To assess the accuracy of the analytic results, we perform a continuous time Monte Carlo simulation to model tumor evolution and collection of sequencing data with an implementation of the Gillespie algorithm [68]. Simulations are written in C/C++. The type-j population has division rate b j , death rate d j , and mutation rate u. Mutations can occur at any point of the cell cycle, not just during division. z n is the number of type-j cells with passenger n as their most recent passenger mutation. The type-0 population is initiated with a single cell at time 0, and the type-j population for k � j > 0 is initiated with a single cell at time t j . Let a be the vector recording the ancestor of new mutations. Element a i is the subclonal ancestor of the ith passenger mutation. For each j 2 0, 1, . . ., k, repeat 1-4 while time is less than t k + t + Δ.
1. Set Γ = N j (b j + d j + u). Time increment to next event time is randomly sampled from Exp [Γ].
• If j < k, if time is greater than or equal to t j+1 for first time, randomly select type-j subclone i to have driver mutation, remove one cell from type-j population count, and set N j+1 = 1. Record the true value of m j+1 , the number of passenger mutations present in the founder type-(j + 1) cell.
2. Randomly select cell, with most recent passenger mutation i, to have the event.
3. Determine which type of event and update population and mutation frequencies. Sample Y from Uniform[0, Γ] to determine event type: Suppose it's the pth passenger, z i À ¼ 1, z p = 1. Update ancestor: a p = i.

4.
For j = 0, if time is less than t 1 and population goes extinct, restart simulation. For j � 1, if time is greater than t j and population goes extinct, restart type-j simulation at t j with a single cell.
5. Reindex to remove extinct passenger mutations, and traverse back through ancestor vector a to sum total number of cells with each passenger.
Measurements are taken at bulk sequencing times t k + t and t k + t + Δ. If time is greater than or equal to t k + t, we measure M 1 ¼ P k j¼0 N j and CCF of clone j as N j /M 1 . Then an additional bulk sequencing measurement is taken at the final time t k + t + Δ, where we measure M 2 ¼ P k j¼0 N j and the CCF of clone j as N j /M 2 . At t k + t + Δ, we measure γ, the number of mutations with frequency between f 1 and f 2 .
To measure m j,obs , the observed number of passengers in the founder type-j cell, we count the number of passengers present in all type-j cells. We also save the true value of m j .
For when we calculate a percent error of corrected and observed γ values in Fig 3D and S3  (B) Fig, we simulate sequencing data by sampling from the mutation frequencies obtained in the Monte Carlo simulation, outlined above, using the approach of [35].

Generate number of cells carrying mutation
3. Post-processing. If there are L = 2 or fewer variant reads, discard mutation. 4. Measure γ obs , the observed number of subclonal mutations between frequencies f 1 and f 2 : 5. Calculate the truth, γ true , from the true mutation frequencies:

Parameter values for simulations
For the simulations we consider three parameter sets corresponding to three modes of tumor evolution: a fast-growing tumor, slow-growing tumor, and tumor with no cell death, each with multiple mutation rates. We simulate three clonal structures: single driver subclone, two nested driver subclones, and two sibling driver subclones. All parameter values are listed in S1 Table. Mutation rate parameter values lie within observed genome wide point mutation rates per day [69]. For simulation of parental clone and subclone, the fast-growing tumor dynamics are from [34]. The slower growing tumor parameter regime has a reduced net growth of r = 0.025, compared to the fast-growing tumor's net growth rate of r = 0.07.

Subclonal reconstruction of CLL sequencing data
The sequencing data from all CLLs analyzed is from Ref. [27], Supplementary Tables 2-4. As in that publication, we use PhylogicNDT [43] to perform subclonal reconstruction. We run the Cluster and BuildTree modules of PhylogicNDT on the longitudinal mutation data from Supplementary Table 3 of [27], using mutation alternate/reference counts, copy number, and tumor purity at all pre-treatment time points. Then for each patient, PhylogicNDT outputs a clonal reconstruction, which includes a phylogenetic tree of the subclones and posterior distributions of subclone CCFs. Additionally, it clusters mutations and assigns them to clones. We directly use subclone assignments and posteriors generated from PhylogicNDT. In our analysis we focus on estimating timing and growth rates of macroscopic subclones whose CCFs are greater than 20% for at least one pre-treatment time point.

Accounting for uncertainties in subclone frequencies and growth rates
Our estimates for parameters of cancer evolution require as input the information on the number of subclonal populations in the tumor, their CCFs and their phylogenetic relationships. In order to obtain this information, we use PhylogicNDT [43], which performs subclonal reconstruction of longitudinal cancer sequencing data. The uncertainty in subclone CCFs reported by PhylogicNDT affects our estimates for subclone growth rates, which in turn affect the estimates of mutation rate and time t between driver(s) and diagnosis. We account for this uncertainty by drawing from the CCF posterior distributions that are output by PhylogicNDT. Using these sampled CCF values, we then calculate growth rates, mutation rate u, and time t between driver(s) and diagnosis, thereby generating confidence intervals for these parameters due to CCF uncertainty.
To estimate subclonal growth rates, we fit an exponential growth curve to subclonal sizes measured at two or more time points. This regression yields fitted values for each clone's growth rate and age. To account for uncertainty in the curve fit (in the case of more than two longitudinal samples), we sample the growth rates and age of clone from a bivariate normal distribution with mean equal to the fitted parameters and variance equal to the covariance matrix of the fitted parameters. When the estimated growth rate during the period of observation-including negative growth rates-is smaller than the minimal possible growth rate necessary to reach the observed clone size, we use the minimal growth rate for calculating mutation rate, time of the driver(s), time between driver(s) and diagnosis, and time of detectability.

Estimating time of cancer detectability
The time a cancer is detectable is the time at which the cancer exceeds the minimum observable size. For the CLL data, we estimate the time that the patients first exhibited an abnormally high WBC count, or leukocytosis, characterized by a WBC count of 11,500/μL [70], or approximately 5.75 x 10 10 total WBCs, assuming a total blood volume of 5 L. In the previous section, we describe how we fit the growth dynamics for the CLL data and obtain a distribution of the fitted growth parameters. Here, we sample from the distribution of the fitted parameters 10,000 times (using the minimal growth rate in the case of a growth rate too low to give rise to the observed WBC count), and numerically solve for the time at which the total WBC count was equal to 5.75 x 10 10 . i.e., we numerically find the root with respect to t i of f ðŷ i ; t i Þ À 5:75 x 10 10 ¼ 0 ð36Þ where t i is the ith estimated time out of 10,000 estimates, f(�) is the exponential function describing the mean cancer growth, andŷ i is the ith random sample from the fitted growth parameters (intercept and growth rate).

Accounting for model uncertainty
The largest source of model uncertainty is the Poisson process for how mutations accumulate, which is used to estimate the time t 1 of the driver mutation. In the fast-growing tumor simulation experiments, the time t 1 had the largest error and variation (Fig 2). The estimate for t 1 depends on the m mutations present in all cells in the driver subclone. The observed m is a single random sample from a Poisson distribution. To account for the uncertainty in t 1 arising from m in the CLLs analyzed, we sample t 1 from the posterior distributions P(t 1 |m). This source of model uncertainty due to the Poisson process will be most significant for cancers like CLL with a smaller number of mutations.
The time t between driver mutation and diagnosis is a random variable due to the stochasticity of cancer cell growth, and will naturally have a certain amount of variation. Time between driver event and diagnosis in a branching process follows a Gumbel distribution [38] and will have a constant variance. The mean, however, will increase with the logarithm of the cancer cell counts, which for the CLLs analyzed are � 10 11 . The simulations of cancer evolution grow to smaller tumor sizes (� 10 5 ) and, as a result, the estimate for t has a significant amount of uncertainty (Fig 2). However, for time scales necessary to generate a tumor, the estimate for t will be quite accurate. For commonly observed tumor sizes, the stochastic fluctuations in the time for the cancer to reach that size will be smaller relative to the magnitude of the time. For a cancer with cell count � 10 11 , the standard deviation of the time t will be less than 5% of its expected value.

Tumor with two nested driver subclones
Here we consider the case where there are two nested driver subclones (S4(A) Fig). "Nested" means that all cells carrying the second driver mutation also carry the first. Type-0, or initiated tumor, cells have birth rate b 0 , death rate d 0 , and net growth rate r 0 = b 0 − d 0 . Type-1 cells, which only have the first driver, have birth rate b 1 , death rate d 1 , and net growth rate r 1 = b 1 − d 1 . Type-2 cells, which carry both drivers, have birth rate b 2 , death rate d 2 , and net growth rate r 2 = b 2 − d 2 . The first driver occurred in a type-0 cell at time t 1 . The second driver occurred in a type-1 cell at t 2 ¼ t 1 þ t 0 2 . The mutation rate u is the same for all subclones. At times t 1 þ t 0 2 þ t and t 1 þ t 0 2 þ t þ D, the tumor is bulk sequenced. The bulk sequencing allows the measurement of the fraction of cells with driver 1 at time t 1 þ t 0 2 þ t, α 1 ; the fraction of cells with driver 2 at t 1 þ t 0 2 þ t, α 2 ; fraction of cells with driver 1 at time t 1 þ t 0 2 þ t þ D, β 1 ; the fraction of cells with driver 2 at t 1 þ t 0 2 þ t þ D, β 2 ; and the observed number of subclonal passenger mutations between frequencies f 1 and f 2 , γ obs . Note that the fraction of the population that is a type-1 cell at the two times is α 1 − α 2 and β 1 − β 2 . The fraction of type-0 cells at the two bulk sequencing time points are 1 − α 1 and 1 − β 1 . The total number of cells at bulk sequencing time points are M 1 and M 2 . We then equate the estimated cell counts to the expected value of the type-i population size X i , conditioned on survival.
Solving the above equations for r i , we obtain the growth rate estimates: The expected value of the first time a population of type-2 cells in a branching process reaches the observed size α 2 M 1 [38], where the approximation in (46) is justified as for (24). By (11), Using the estimates for r 0 , r 1 , and r 2 from (41)- (43), and setting (47) equal to the value of γ obtained from (33) and the second bulk sequencing, u can be estimated: When estimating mutation rate for the CLL patients from Ref. [27], for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (48) is applied for each time point with the respective CCFs and observed γ values for each time point.
Every type-1 cell carries the m 1 passenger mutations that were present in the original type-1 cell when the first driver mutation occurred at t 1 . Similarly, every type-2 cell carries the m 2 passengers that were present in the founder type-2 cell when the second driver mutation occurred at t 2 . Note, none of the m 1 mutations are counted towards m 2 . Now we consider the likelihood function Pðm 1 ; m 2 jt 1 ; t 0 2 Þ: ð49Þ Now, maximizing the logarithm of (51) with respect to t 1 and t 0 2 , The number of passengers present in the founder type-i cell cannot be directly observed, but we can measure m i obs , the number of passengers present in all type-i cells. An expected u/r 1 passengers occurring after t 1 in type-1 cells and reaching fixation in the type-1 subclone will be incorrectly included in m 1 obs , rather than in m 2 obs (see Methods). Similarly, an expected u/r 2 passengers occurring after t 2 in type-2 cells and reaching fixation in the type-2 subclone will be incorrectly included in m 2 obs . Thus, Tumor with two sibling driver subclones Here we consider a tumor with two "sibling" driver mutations (S4(B) Fig). Sibling driver mutations are drivers that occur in separate subclones. In this case, cells are either initiated tumor cell (type-0), carry driver 1 (type-1), or carry driver 2 (type-2). No cells contain both drivers. Driver 1 occurred in a type-0 cell at time t 1 . Driver 2 occurred in a type-0 cell at t 2 .
Type-0 cells have birth rate b 0 , death rate d 0 , and net growth rate r 0 = b 0 − d 0 . Type-1 cells, which carry driver 1, have birth rate b 1 , death rate d 1 , and net growth rate r 1 = b 1 − d 1 . Type-2 cells, which carry driver 2, have birth rate b 2 , death rate d 2 , and net growth rate The mutation rate u is the same for all subclones. Suppose time τ i elapses between driver mutation i and tumor observation. Bulk sequencing of the tumor is performed at t 1 + τ 1 (or equivalently t 2 + τ 2 ), and a known Δ later. Sequencing the tumor allows the measurement of the fraction of cells with driver 1 at the first sequencing, α 1 ; the fraction of cells with driver 2 at the first sequencing, α 2 ; fraction of cells with driver 1 at the second sequencing, β 1 ; the fraction of cells with driver 2 at the second sequencing, β 2 ; and the number of subclonal passenger mutations between frequencies f 1 and f 2 , γ. The fraction of type-0 cells at the two bulk sequencing time points are 1 − α 1 − α 2 and 1 − β 1 − β 2 . The total number of cells at the two sequencing time points are M 1 and M 2 .
We then equate the estimated cell counts to the expected value of the type-i population size X i , conditioned on survival.
Solving the above equations for r i , we obtain The expected value of the first time a population of type-i cells in a branching process reaches the observed size α i M 1 is [38] where the approximation in (64) is justified as for (24). By (11), Using the estimates for r 0 , r 1 , and r 2 from (60) and (61), and setting (65) equal to the value of γ obtained from (33) and the second bulk sequencing, u can be estimated.
When estimating mutation rate for the CLL patients from Ref. [27], for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (66) is applied for each time point with the respective CCFs and observed γ values for each time point. Every type-1 cell carries the m 1 passenger mutations that were present in the original type-1 cell when the first driver mutation occurred at t 1 . Similarly, every type-2 cell carries the m 2 passengers that were present in the founder type-2 cell when the second driver mutation occurred at t 2 . We assume that m 1 and m 2 don't contain any shared mutations. In the CLL dataset we use, this is true. We consider the likelihood function P(m 1 , m 2 |t 1 , t 2 ) Maximizing the logarithm of (68) with respect to t 1 and t 2 yields the maximum likelihood estimates: Using the same approach as in the case of a single driver, we obtain the corrections for the observed number of mutations present in all cells of each subclone:

Fully generalized estimates for any phylogeny of k drivers
Here we derive estimates for a completely general tumor phylogeny. Suppose a tumor has k driver mutations. In this general case, define a type-i cell as a cell where its most recent driver mutation was driver i. Note that a type-i cell can have between 0 and k − 1 other driver mutations. A phylogenetic reconstruction of the k driver mutations is necessary for the completely general case. From this phylogenetic tree, the ancestor of each subclone can be obtained. Define the function a(i) as the ancestor of the type-i population. That is, if all driver mutations contained in the type-i population are ordered, a(i) gives the driver mutation that occurred prior to i. Define t i as the time between when driver i occurred and when the type-i cells' previous driver mutation occurred. At time of observation, assume the type-i population has κ i total driver mutations, where 1 � κ i � k for all 1 � i � k. Denote the time between the type-i's κ i , or last, driver mutation and when the tumor is observed as τ i . This is the time between the founder type-i cell's birth and tumor observation. Then the tumor is first observed and bulk sequenced at T 1 � ð P k i À 1 j¼0 t a j ðiÞ Þ þ t i (equivalently τ 0 for i = 0), where we denote a j as the jth iterate of the function a: a j ðiÞ � aða jÀ 1 ðiÞÞ 8j � 1: The tumor is also bulk sequenced at T 2 � ð P k i À 1 j¼0 t a j ðiÞ Þ þ t i þ D (equivalently τ 0 + Δ for i = 0). These assumptions allow for any subclone phylogeny, including combinations of the previously discussed sibling and nested subclone types.
The bulk sequencing allows the measurement of the fraction of cells with driver i at T 1 , α i ; the fraction of cells with driver i at time T 2 , β i ; and the number of subclonal passenger mutations between frequencies f 1 and f 2 , γ. Again, the total number of cells at measurement times T 1 and T 2 are M 1 and M 2 . To write the type-i frequencies, a c i and b c i , in terms of the driver frequencies, we subtract the fraction of cells descending from type-i cells but gaining additional driver mutation(s) after i, from the fraction of cells containing driver i: where δ i,a(j) is the Kronecker delta, defined as We equate the estimated cell counts at the first bulk sequencing time point to the expected value of the type-i population size X i , conditioned on survival.
And similarly, at the second bulk sequencing time point, Solving the above equations for r i , we obtain Now, using the growth rate estimates r i and the subclone sizes, we can estimate each τ i . The expected value of the first time a population of type-i cells in a branching process reaches the observed size a c i M 1 is [38] where the approximation in (84) is justified as for (24). Using the (k + 1) r i estimates from (80), and setting (81) equal to the value of γ obtained at the second bulk sequencing from (33), u can be estimated: When estimating mutation rate for the CLL patients from Ref. [27], for which there is bulk sequencing at two or more time points, we average the mutation rate calculated at each of these time points. (85) is applied for each time point with the respective CCFs and observed γ values for each time point. The number of passengers present in the original type i founder cell cannot be directly observed, but we can measure m i , the number of clonal passengers present in the type i population, only including passengers not present in other clones. We will assume that the m i don't contain any shared mutations, which is true for the CLL dataset we consider. The likelihood function P(m 1 , . . ., m k |t 1 , . . ., t k ) is proportional to Then, maximizing the logarithm of (86) with respect to t 1 , t 2 , . . ., t k , The observed clonal passengers in the founder type-i cell will incorrectly include passengers that reached fixation in the type-i population after driver mutation i occurred, instead of correctly being counted toward the descendant of clone i. As a result, we again correct for the expected number of these passengers, u/r i . That is, m i ¼ m i; obs À u=r i þ u=r aðiÞ 8i ¼ 1; . . . ; k: ð88Þ Supporting information S1 Fig. Percent errors (PEs) for case with no death. Accuracy of parameter inferences for Monte Carlo simulation of tumor with no cell death for (a) single driver subclone with mutation rate u = 1, (b) single driver subclone with u = 10, (c) two nested subclones with u = 1, and (d) two sibling subclones with u = 1. Mean percent error (MPEs) are the black numbers above the plots, and mean absolute percent errors (MAPEs) are the grey numbers below the MPEs. Boxes contain 25th-75th quartiles, with median indicated by thick horizontal black line. Whiskers of boxplots indicate 2.5 and 97.5 percentiles. Violins are smoothed density estimates of the percent error datapoints. Complete parameter values and number of runs are included in S1 Table. (PDF)  Table. (PDF)

S3 Fig. Corrections for observed mutation counts. (a)
We compare percent errors of parameter estimates for time from tumor initiating until appearance of a driver subclone, t 1 , with and without the correction for passengers that occur after the driver and reach fixation in the driver population (Eq (6), main text). Errors for estimate with correction are shown in blue, and for estimate without correction (Eq (5), main text) in orange. Errors are plotted as a kernel density estimate for Monte Carlo simulations of fast-growing tumor with mutation rate u = 1. Mean percent errors (MPEs) and mean absolute percent errors (MAPEs) are listed. (b) The percent errors for the observed (orange) and corrected (blue) number of subclonal mutations between frequencies f 1 and f 2 , γ, (Eq (7), main text) are plotted as kernel density estimates. Observed mutations are those that passed post-processing, i.e. those that have more than L = 2 mutant reads. True mutation frequencies were generated from 135 surviving runs of a Monte Carlo simulation of a fast-growing tumor with mutation rate u = 1, from which sequencing reads were simulated with 100x average coverage (see Materials and methods). Percent errors are calculated relative to the true γ measured from the true mutation frequencies. (PDF)

S4 Fig. Model for tumor expansion with two driver mutations.
(a) Two nested driver subclones. Initiated tumor (type-0) cells in blue, cells with driver 1 (type-1) in red, and cells with both drivers (type-2) in orange. A driver mutation occurs in a type-0 cell at t 1 . A second driver mutation occurs in a type-1 cell at t 1 þ t 0 2 . Tumor is bulk sequenced at t 1 þ t 0 2 þ t and t 1 þ t 0 2 þ t þ D. (b) Two sibling driver subclones. Type-0 cells (in blue). A driver mutation occurs in a type-0 cell at t 1 . A second driver mutation occurs in a different type-0 cell at t 2 . Tumor is bulk sequenced at t 1 + τ 1 (or, equivalently t 2 + τ 2 ) and t 1 + τ 1 + Δ (equivalently t 2 + τ 2 + Δ). (PDF)

S5 Fig. Accuracy for t estimate increases with tumor size.
A Monte Carlo simulation of a birth-death process was performed for (a) fast-growing, (b) slow-growing, and (c) no cell death parameter regimes. For each of the 100 surviving simulated tumors, the percent error of the t estimate (Eq (3)) was calculated when the tumor first reached the specified tumor sizes. Means are indicated by red points and lines, ± one standard deviation is shown by the red region, and individual data points for each simulation run are shown as the grey points (with horizontal jitter for visibility). (PDF) S1