Genetic Progression and the Waiting Time to Cancer

Cancer results from genetic alterations that disturb the normal cooperative behavior of cells. Recent high-throughput genomic studies of cancer cells have shown that the mutational landscape of cancer is complex and that individual cancers may evolve through mutations in as many as 20 different cancer-associated genes. We use data published by Sjöblom et al. (2006) to develop a new mathematical model for the somatic evolution of colorectal cancers. We employ the Wright-Fisher process for exploring the basic parameters of this evolutionary process and derive an analytical approximation for the expected waiting time to the cancer phenotype. Our results highlight the relative importance of selection over both the size of the cell population at risk and the mutation rate. The model predicts that the observed genetic diversity of cancer genomes can arise under a normal mutation rate if the average selective advantage per mutation is on the order of 1%. Increased mutation rates due to genetic instability would allow even smaller selective advantages during tumorigenesis. The complexity of cancer progression can be understood as the result of multiple sequential mutations, each of which has a relatively small but positive effect on net cell growth.


Introduction
The current view of cancer is that tumorigenesis is due to the accumulation of mutations in oncogenes, tumor suppressor genes, and genetic instability genes [1]. Sequential mutations in these genes lead to most of the hallmarks of cancer [2]. Cancer research has benefited immensely from studies of uncommon inherited cancer syndromes that served to highlight the importance of individual genes in tumorigenesis [3]. Theoretical considerations have suggested that a handful of mutations, perhaps as few as three, may be sufficient for developing colorectal cancer [4,5]. This relatively small number is consistent with the standard model for colorectal tumorigenesis based on the identification of mutations in well-known cancer genes [6]. However, Sjö blom et al. [7] have recently determined the sequence of 13,000 genes in colorectal cancers and found that individual tumors contained an average of 62 nonsynonymous mutations. Extrapolating to the entire genome, it was estimated that individual colorectal cancers contain about 100 nonsynonymous mutations and that as many as 20 of the mutated genes in individual cancers might play a causal role in the neoplastic process [7].
Tumors arise from a process of replication, mutation, and selection through which a single cell acquires driver mutations which provide a fitness advantage by virtue of enhanced replication or resistance to apoptosis [8]. Each driver mutation thereby allows the mutant cell to go through a wave of clonal expansion. Along with drivers, passenger mutations, which do not confer any fitness advantage, are frequently observed. Passenger mutations arise in advantageous clones and become frequent by hitchhiking. The accumulation of ;100 mutations per cell is therefore the result of sequential waves of clonal expansion; the observed mutations mark the history of the cancer cell, including both drivers and passengers.
Genetic mutations can arise either due to errors during DNA replication or from exposure to genotoxic agents. The normal mutation rate due to replication errors is in the range of 10 À10 to 10 À9 per nucleotide per cell per division [9]. It is likely that the initial steps leading to cancer arise in cells with a normal mutation rate [10]. A normal mutation rate might also be sufficient to generate the large numbers of mutations in cancer given the many generations that the dominant cancer cell clone has gone through both before and after its initiating mutation [11][12][13]. However, it has also been argued that tumor cells have mutator phenotypes that accelerate the acquisition of mutations [14].
Mathematical modeling of carcinogenesis has had a rich history since its introduction more than 50 years ago [15][16][17]. The initial two-hit theory has evolved into more elaborate models incorporating multiple hits, rate-limiting events, and genomic instability [4,[18][19][20][21][22][23]. Most models consider the stem cell at the base of the colonic crypt as the initial target for mutation, with the daughter cells giving rise to the adenoma and progressively increasing the risk of malignant development [4,22].
The tumor data collected by Sjö blom et al. [7] show that the mutational patterns among colorectal cancers from different patients are diverse. This observation indicates that there may be many different mutational pathways that can lead to the same cancer phenotype. In the model described below, we assume that there are 100 potential driver genes and ask for the expected waiting time until one cell has acquired mutations in a given number, up to 20, of these genes. We assume that one or two initial mutations, perhaps together with losses or gains of large chromosomal regions [15,16], give rise to a benign tumor (adenoma) of ;1 milligram or 10 6 cells ( Figure 1). We model the progression of this adenoma to fullblown cancer over a period of five to 20 years [16], in which the adenoma grows to ;1 gram, or 10 9 cells. Whether the whole population of cells is at risk for clonal expansion or whether a fraction of cells akin to stem cells drives growth of the adenoma is currently a subject of debate. This is important as cancer stem cells, as well as other factors such as geometric constraints on the architecture of the adenoma, may significantly reduce the effective population size and thereby impact the waiting time to cancer [24,25]. Note that it is not size that distinguishes a cancer from an adenoma; rather it is the ability of the cancer cells to invade through the underlying basement membrane and escape from its normal anatomical position.
We use the Wright-Fisher process [26] to model the somatic evolution of cancer in a colonic adenoma. We assume a cell turnover of one per day [27] and analyze the time to cancer as a function of the population size N, the per-gene mutation rate u, and the average selective advantage s per mutation. We present extensive simulation results as well as analytical approximations to the expected waiting time. The model offers a basic understanding of how the different evolutionary forces contribute to the progression of cancer.

Results
The mutation data are represented in a binary matrix of size 35 3 78, whose rows correspond to 35 tumor samples and whose columns correspond to the 78 candidate cancer genes identified by Sjö blom et al. [7] (Figure 2). A non-zero entry in cell (i, j) of this matrix indicates the presence of a mutation in gene j of tumor i. Tumors harbor between 1 and 20 mutated genes (mean ¼ 6.5). Most of these genes (66/78 ¼ 85%) are mutated in at most three different tumors, resulting in highly diverse mutational patterns among the tumors. The notable exception are the three well-known cancer genes APC, p53, and K-ras, which were found mutated in 24, 17, and 16 tumors, respectively. We have analyzed partial correlations between genes, taking into account the small number of observations and multiple comparisons. Several pairs of genes were significantly correlated, most of them positively, but all correlations were weak and below 0.07 ( Figure S1). From this data analysis, we conclude that in colon cancer, a very small number of genes are mutated in a large fraction of tumors. However, many other genes are involved in tumor progression, although each single gene is mutated only in a small subset of tumors without a clear pattern emerging.
For the purpose of mathematical modeling of tumorigenesis, we consider the presence of an adenoma. Adenoma formation probably requires the appearance of mutations in one or a few genes (in particular, APC) that are common to most tumors. We assume the occurrence of all subsequent mutations to be independent events. When any k out of d ¼ 100 susceptible genes are mutated in a single cell, the cancer phenotype is considered to be attained. The first cells of this type mark the onset of an invasive tumor. The Wright-Fisher process is used to describe these evolutionary dynamics. Despite the large population size of up to N ¼ 10 9 cells, we can efficiently compute estimates of the time to the first appearance of any k-fold mutant by simulation, because it The adenoma grows from a population of 10 6 to 10 9 cells which accumulate mutations that drive phenotypic changes seen in cancer cells. Blue circles symbolize adenoma cells prior to accumulating the additional mutations that are the subject of modeling, green indicates cells that have acquired additional, but an insufficient number of mutations for malignancy, and red indicates cells with the number of mutations required for the cancer phenotype. doi: 10

Author Summary
Cancer is a disease of multicellular organisms that is characterized by a breakdown of cooperation between individual cells. The progression of cancer proceeds from a single genetically altered cell to billions of invasive cells through a series of clonal expansions. During tumorigenesis the cancer cells undergo replication and mutation, thereby increasing the size and invasiveness of the tumor. Recent sequencing projects of cancer cells suggest that mutations in up to 20 different genes might be responsible for driving an individual tumor's development. This insight contrasts with most mathematical models of cancer progression, which assume that the cancer phenotype is driven by mutations in only a few genes. We present a new mathematical model in which tumorigenesis is driven by mutations in many genes, most of which confer only a small selective advantage. Specifically, the progression of a benign tumor of the colon (adenoma) to a malignant tumor (carcinoma) is described by a Wright-Fisher process with growing population size. We explore the basic parameters of the model that are consistent with observed data. We also derive an analytical formula for the expected waiting time for the progression from benign to maligant tumor in terms of the population size, the mutation rate, the selective advantage, and the number of susceptible genes.
suffices to trace the distribution of the k þ 1 mutant error classes in each generation. We assume a constant average selective advantage, s, for each mutation and a per-gene mutation rate, u. Figure 3 displays the typical behavior of this process in a single simulation. After a short initial phase in which the homogeneous wild-type population produces the first low-order mutants, a traveling wave is observed ( Figure  3). Apparently, this distribution of error classes has constant variance and travels with constant velocity toward higherorder mutants. Thus, we expect the time until the first k-fold mutant appears to be linear in k. This conjecture is substantiated by simulations for a wide range of parameters ( Figure S2) provided that mutations are advantageous (s . 0).
Within our model, the probability of developing cancer is equated with the probability of generating at least one k-fold mutant cell in the adenoma. For k ¼ 20, this probability as a function of time is depicted in Figure 4. The expected time to the development of cancer increases with decreasing cell population size (hence the low risk of cancer associated with very small adenomas), with decreasing selective advantage, and with decreasing mutation rate. Thus, if the population at risk is a small subset composed of actively replicating stem cells, tumor progression will be slow. In contrast, an increased mutation rate due to genetic instability speeds up this process.
The simulations suggest that in a time frame of 5 to 15 years, cancer might develop in an adenoma of size 10 7 to 10 9 cells with a normal mutation rate of 10 À7 per gene per cell division and a 1% selective advantage per mutation ( Figure  4A). Alternatively, a higher mutation rate of 10 À5 per gene per cell division would enable a smaller population of at-risk cells (10 5 to 10 7 ) and a smaller selective advantage (0.1%) to reach the required number of mutations in the same time interval ( Figure 4B). However, for reasonable mutation rates, a completely neutral process (s ¼ 0) predicts waiting times that are not consistent with the observed incidence of colon cancer, as would be expected (Figure 4, Figure S2). Figure 5 generalizes these findings to different values of k by partitioning the parameter space of the model into regions of identical evolutionary outcomes. Each curve defines an instance of the Wright-Fisher process that results in a 10% chance of developing a k-fold mutant after 3,000 generations (or 8.2 years). These level curves define the parameter combinations that produce similar dynamics. For example, a small at-risk population is unlikely to generate a cancer requiring more than ten driver gene mutations unless the selective advantage for these mutations is large (see Discussion).
Based on the simulation results, we have derived an analytical approximation for the expected time to cancer.
The key observation is that the distribution of error types follows a Gaussian ( Figure 3). This approach leads to the expression for the expected waiting time, where k is the number of cancer-defining genes, d is the number of susceptible genes, u is the mutation rate, s the average selective advantage, and N init and N fin are the initial and final population sizes of the polyp, respectively (see Materials and Methods). The approximation is linear in k ( Figure S2) and matches closely the observed behavior of the Wright-Fisher process, as long as s . 0 (Figures 3-5). The fit is analyzed quantitatively in Protocol S1. The expression for t k highlights the strong effect of the selective advantage on tumorigenesis, and gives an explicit tradeoff between the evolutionary forces.

Discussion
Research over the past three decades has shown that cancer is an acquired genetic disorder [1]. The process of replication, mutation, and selection eventually leads to the appearance of tumors in multicellular organisms if they live long enough. Tumor cells accumulate many mutations in their evolutionary path [7,8,28], but not all mutations play a causal role in the evolution of the clone. If a gene is mutated in tumors derived from different patients, it is less likely to be a passenger and more likely to provide the cell with a selective advantage, permitting it to expand and eventually dominate the population. Based on this reasoning, the data in Sjö blom et al. [7] suggest that as many as ;20 driver genes are mutated per tumor. The diverse mutational landscapes observed in tumor cells of the same tissue origin suggest that different mutations can have the same phenotypic effect. One plausible explanation for this observation is that genes are organized into intracellular pathways (signaling, metabolic, checkpoint, etc.), and the disturbance of these pathways drives tumorigenesis. Within each cell, every information transfer cascade requires functional proteins that are the products of distinct genes. Mutations in any one of the genes that code for proteins in a given pathway can complement each other and their genetic alterations can have similar phenotypic effects [1]. This view is supported by the observation that multiple hits in different genes of the same pathway in individual tumors are less frequent than expected [1].
In our model, we assume that each subsequent mutation has the same incremental effect on the fitness of the cell. In general, however, the impact of a specific mutation on the phenotype of the cell will depend on the genetic background. Gene interactions, or epistasis, can be positive or negative, and they can impose constraints on the order in which mutations accumulate [1]. In this case, the model parameter s may be regarded as the average fitness increase per mutation.
We have seen that a fitness increase of 1% per mutation may be enough for the Wright-Fisher model to generate dynamics that are consistent with the observed time scale of evolution from adenoma to carcinoma. However, genetic alterations associated with the initiation of colon cancer (such as those in APC) may have larger fitness advantages than those associated with tumor progression. The value or distribution of the fitness parameter s, which is unknown at present, will ultimately clarify the role of selection in this evolutionary process. We emphasize, however, that our estimates of s are based on the assumptions of the Wright-Fisher model. Thus, additional uncertainty in determining the role of selection is associated with violations of these assumptions by the biological system. Other models may lead to different estimates of the fitness associated with mutations, depending among other factors on the number of mutations necessary to reach the malignant phenotype and on the timescale [29].
In another simplifying abstraction, we have defined the tumor cell by the accumulation of k ¼ 20 mutations in different driver genes. In reality, it is unlikely that any combination of 20 genes will induce the cancer phenotype. Our assumption is based on the observed cancer genotypes which fail to reveal a striking genetic signature of cancer cells. In this respect, our model provides lower bounds on the expected waiting time to cancer, as reaching a specific 20-fold mutant may take significantly longer.
These abstractions are important because all lesions begin with a small number of neoplastic cells. The simulations in Figure 5 show that cancers would never result from such small numbers of cells if 20 driver mutations were required and each mutation conferred only a small fitness advantage. It is likely that some of the early mutations (such as those in K-ras) increase fitness more than the average, allowing a small, initiating lesion to grow into an intermediate size lesion. Once a growth reaches this size, mutations with small fitness advantages can accumulate and eventually convert the tumor into a cancer.
The large population size of 10 9 cells would suggest that a purely deterministic approximation to the Wright-Fisher process is reasonable. It turns out, however, that the stochasticity associated with generating mutants of each new type has a strong impact on the evolutionary dynamics (see Protocol S1). Therefore, a deterministic model of evolutionary dynamics will significantly underestimate the time to cancer. The closer approximation presented here exploits the regular behavior of the system of propagating a Gaussian distribution of error types and takes into account stochastic effects in determining the speed of this traveling wave. Thus, stochastic effects can play an important role even in very large populations. Tumors derived from the same tissue exhibit considerable variability in their spectrum of mutations (Figure 2 and [7]). The number and type of mutations observed is the result of the size of the population at risk, the mutation rate, and the microenvironment of the evolving clone. The individual mutation rate can vary significantly due to genetic [27,30] and environmental effects (e.g., dietary fat intake, colonic bacterial flora, prior genotoxic therapy) [31,32]. These factors, expected to be different for every tumor, also contribute to the diversity of the mutational landscapes observed in tumors. It is also worth noting that the number of potential driver genes is likely to be an underestimate because the power of the Sjö blom et al. study to detect infrequent mutations was limited [7]. The study of larger numbers of tumors is likely to show that a few hundred different genes may function as drivers. This increase in potential drivers, however, will not have a substantial effect on the conclusions of the models derived here (Equation 1).
Most tissues in metazoans undergo turnover and are maintained by a population of tissue-specific stem cells that generally replicate at a slow rate and exhibit properties such as asymmetric division and immortal DNA strand cosegregation [33], perhaps to minimize the acquisition and retention of mutations. Although many tumors have cancer stem cells at their root [34] and colon cancer stem cells have been reported [24,25], it is an open question whether such cells arise solely due to the progressive accumulation of mutations in normal stem cells or because cells can re-acquire stem cell-like properties by mutation. The former scenario would suggest a much smaller effective population size, an important variable for modeling the evolution of cancer [4,22,[35][36][37]. The colon has approximately 10 7 crypts, each one maintained by a small number of stem cells [27]. Initially, these stem cells constitute the overall population at risk, but the vast majority of patients with colon cancer develop tumors as the natural progression of mucosal adenomas [38]. Thus, adenoma formation can be regarded as a mechanism by which the population of cells at risk is increased and hence the probability of cancer in patients with multiple adenomas is dramatically increased. This is observed in familial adenomatous polyposis patients, who have inherited mutations of the APC gene.
Our model permits investigation of the impact of the relevant parameters of tumor evolution on a global scale. These parameters include the size of the population at risk, the mutation rate, and the fitness advantage conferred by specific mutations (Equation 1). The model suggests that the average waiting time for the appearance of the tumor is strongly affected by the fitness, s, conferred by the mutations, with the average waiting time decreasing roughly as 1/s ( Figure S2). The mutation rate and the size of the population at risk contribute only logarithmically to the waiting time and hence have a weaker impact. Thus, the model of cancer progression presented here might add to the debate whether selection [10,11] or mutation [39] is the dominant force in tumor development.
Finally, this model helps answer several questions about colorectal tumorigenesis that have long perplexed researchers and clinicians. Why is there so much heterogeneity in the times required for tumor progression among different patients? Why is there so much heterogeneity in the sizes and development times of tumors even within individual patients, such as those with familial adenomatous polyposis, if they all have the same initiating APC mutation? Why do cancers behave so differently with respect to their response to chemotherapeutic agents or radiation or their propensity to metastasize? Our model is compatible with the view that a few major mutational pathways, such as those involving APC, Kras, and p53, endow relatively large increases in fitness that can allow tumors to grow to sizes compatible with further progression ( Figure 5). However, the final course to malignancy will be determined by multiple mutations, each with a small and distinct fitness advantage, and these mutations occur stochastically. Every cancer will thereby be dependent on a unique complement of mutations that will determine its propensity to invade, its ability to metastasize, and its resistance to therapies. If this model is correct, then biological heterogeneity is a direct consequence of the tumorigenic process itself.
In our view, there is no reason to think that this model, or the data on which it was based, will be applicable only to colorectal cancers. Indeed, Sjö blom et al. [7] have identified similar mutational patterns in breast cancers, even though these tumors have completely different embryologic origins and are associated with distinct biological properties and predisposing factors. We therefore predict that the basic features of our model, i.e., a large number of potential drivers each of which contributes only a small fitness advantage, will apply to the progression of most common solid tumors. These tumors include those of the stomach, pancreas, bladder, lung, prostate, and kidney. It is unlikely that the model will apply to tumors that appear to have shorter waiting times, such as leukemias and lymphomas.
After completion and submission of the manuscript, we have learnt about related work recently published or being published [40][41][42]. These independent papers, which build on previous work published in [43], discuss a closely related mathematical model. In contrast to our work, these excellent contributions do not consider applications to the somatic evolution of cancer. Furthermore, we arrive at similar conclusions regarding the expected waiting time with a much more concise method than used in the other papers. We are grateful to Eric Brunet for bringing these references to our attention.

Methods
Data. The collection of tumor data has been described in [7]. Briefly, ;13,000 genes were sequenced from cancers of 11 patients with advanced colorectal cancers. Any mutant gene detected in this study was analyzed in an additional 24 patients with advanced cancers. Tumors with mismatch repair (MMR) deficiency were not included in this cohort, as MMR is known to increase the mutation rate by orders of magnitude and would complicate the analysis of mutations. Mutations were found in 519 genes, and, of these, 105 genes were found to be mutated in at least two independent tumors.
Statistical analysis. To test for dependencies between mutated genes, we calculated all 3,003 pairwise partial correlations between the 78 genes that were considered candidate drivers. Because the number of observed tumors is much smaller than the number of genes, we used the shrinkage method introduced in [44] for estimation.
Wright-Fisher process. We initially consider a colonic adenoma composed of 10 6 cells (;1 mm 3 ) that is growing exponentially to reach a size of 10 9 cells (;1 cm 3 ). Serial radiological observations show that the growth of unresected colonic adenomas is wellapproximated by an exponential function [45]. The average growth rate determined in [45] implies that it takes ;11 years for an adenoma to grow from 10 6 to 10 9 cells. We consider an evolving cell population of size N(t) in generation t. Population growth is modeled by assuming that growth is proportional to the average fitness hwi of the population, N(t þ 1) -N(t) ¼ a hwi N(t), where a is a constant ensuring the experimentally observed growth dynamics, and N(0) ¼ 10 6 . Although hwi changes slightly over time, the growth kinetics is still approximately exponential.
Each cell is represented by its genotype, which is a binary string of length d ¼ 100 corresponding to the 100 potential driver genes. The population is initially homogeneous and composed of wild-type cells which are represented by the all-zeros string. In each generation, N(t) genotypes are sampled with replacement from the previous generation. For large population sizes of 10 9 cells, it is not feasible to track the fate of each of the possible 2 100 mutants in computer simulations. However, we are interested in the first appearance of any k-fold mutant in the system (k ¼ 20). Thus, it suffices to trace the k þ 1 mutant error classes, i.e., the number of j-fold mutants N j (t) for each j ¼ 0, . . ., k, in each generation. With every additional mutation, we associate a selective advantage s. Thus, the relative fitness of a j-fold x j w j . Ignoring back mutation, the probability of sampling a j-fold mutant is where u is the mutation rate per gene. In each generation, the population is updated by sampling from the multinomial distribution where N(t) follows the above growth kinetics.
We use the discrete Wright-Fisher process rather than the continuous Moran process [26], which might seem more natural for cancer progression, because the Wright-Fisher process allows for efficient computer simulations even for very large population sizes. Both models behave similarly for large population sizes [26].
Analytical approximation. The large cell population size might suggest that one could consider a replicator equation in the limit as N ! '. However, this approach yields a Poisson distribution for the timedependent relative frequencies x j (t) with parameter k ¼ ud(e st À 1) / s, implying that the variance of x increases over time, which contrasts with the simulation results ( Figure 3). The reason for this discrepancy is that, in the replicator equation, higher-order mutants with high fitness are instantaneously generated. Thus, the time for their expansion is underestimated compared to the waiting time in the stochastic system. See Protocol S1 for further discussion of this phenomenon.
To account for the stochastic fluctuations in the accumulation of k mutations, we model this process by decoupling mutation and selection (see Protocol S1 for mathematical details). Briefly, we assume that j-fold mutants are generated at a constant rate with increasing j. The Gaussian describing the distribution of mutant error classes has mean vt, variance r 2 , and travels with velocity v ¼ sr 2 (Figure 3). To determine v, we consider an (initially) exponentially growing subpopulation of j-fold mutants and calculate the expected time until one ( j þ 1)-mutant is produced. This leads to v ¼ 2slogN= log s ud À Á 2 , and for constant population size N, we obtain the approximation t k ' k log s ud À Á 2 =2slogN for the expected time to the first appearance of any k-fold mutant. The same waiting time in a population growing exponentially from initial size N init ¼ N(0) to final size N fin ¼ N(t k ) is equal to that in a constant population with effective population size N ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi N init N fin p . Thus the speed of the mutant wave in the growing population can be approximated by the average of the values corresponding to the initial and final population sizes. This leads to t k ' k log s ud À Á 2 =slogðN init N fin Þ for the waiting time in a population growing from N init to N fin . We will often restrict our attention to constant population sizes because of the equivalent waiting time in a constant population with effective size equal to the geometric mean of the initial and final population sizes.
Protocol S1. PDF Document Entitled ''Analytical Approximation for the Expected Waiting Time'' Which Contains the Mathematical Details of the Model Found at doi:10.1371/journal.pcbi.0030225.sd001 (152 KB PDF).