Mathematical modeling the order of driver gene mutations in colorectal cancer

Tumor heterogeneity is a large obstacle for cancer study and treatment. Different cancer patients may involve different combinations of gene mutations or the distinct regulatory pathways for inducing the progression of tumor. Investigating the pathways of gene mutations which can cause the formation of tumor can provide a basis for the personalized treatment of cancer. Studies suggested that KRAS, APC and TP53 are the most significant driver genes for colorectal cancer. However, it is still an open issue regarding the detailed mutation order of these genes in the development of colorectal cancer. For this purpose, we analyze the mathematical model considering all orders of mutations in oncogene, KRAS and tumor suppressor genes, APC and TP53, and fit it on data describing the incidence rates of colorectal cancer at different age from the Surveillance Epidemiology and End Results registry in the United States for the year 1973–2013. The specific orders that can induce the development of colorectal cancer are identified by the model fitting. The fitting results indicate that the mutation orders with KRAS → APC → TP53, APC → TP53 → KRAS and APC → KRAS → TP53 explain the age–specific risk of colorectal cancer with very well. Furthermore, eleven pathways of gene mutations can be accepted for the mutation order of genes with KRAS → APC → TP53, APC → TP53 → KRAS and APC → KRAS → TP53, and the alternation of APC acts as the initiating or promoting event in the colorectal cancer. The estimated mutation rates of cells in the different pathways demonstrate that genetic instability must exist in colorectal cancer with alterations of genes, KRAS, APC and TP53.


Introduction
Colorectal cancer is one of the most common malignant tumors of digestive tract, which is the cancer with the third highest incidence and the second leading cause of mortality in the world [1]. Cancer is widely believed to have been caused by the accumulation of genetic and epigenetic alterations that result in the transformation of normal colonic epithelium to malignant cells. Recent studies showed that the progression from normal cells to the first malignant cell was supported by the alternations in three driver genes involving APC, KRAS and TP53, all of which were the most significant driver mutations in the colorectal cancer [2,3]. However, not all colorectal cancers harbor the alterations of these three genes. Statistical analysis suggested that approximately 15% of colorectal cancers contained all mutations in APC, KRAS and TP53, and approximately 20% of tumor carried the mutations in both APC and KRAS [4,5]. Among them, APC and TP53 are tumor suppressor genes (TSGs), and KRAS is an oncogene. It is well known that activation of oncogene, KRAS, only needs one hit, and inactivation of the TSGs, APC and TP53, requires two hits [6]. Hence, it takes five hits to turn the normal cells into the malignant cells for the primary colorectal cancer involving three driver genes, APC, KRAS and TP53.
Studying the mechanism of cancer formation by using the mathematical model is of essential importance in the prediction of tumor risk. The mathematical models considering the number of gene mutation were presented to explore the mechanism of cancerigenesis by a large number of scholars. The most revealing cancer models were the multistage model proposed by Armitage and Doll as early as 1950s [7,8]. Their models were used to match age-specific mortality rate of various cancers, and the results indicated that the logarithm of mortality rate was linear with the logarithm of age. Subsequently, they considered the clonal expansion of cells and presented the two-stage model with clonal expansion of cells, widely applied to study the risk of various cancer [9]. Notably, Knudson utilized this model to fit incidence rate data of retinoblastoma and discovered the RB gene, the earliest TSG [10]. In addition, Moolgavkar et al. developed the method of solving the hazard function and the probability of tumor for the stochastic two-stage model involving selective growth advantage of cells and explained the detailed biological meaning of the model [11,12]. However, two mutations are not sufficient to pose all cancers. Therefore, the two-stage model with clonal expansion of cells was extended to the model with more than two events where mutations accumulate in a specific order, allowing the intermediate stages to provide selective growth advantage to mutated cells [13][14][15][16]. It is showed that the model with more than two mutations better fits the incidence data of colorectal cancer compared to the two-stage model. Besides, Lang et al. used the twotype branching process model to study the dynamic behaviour of adenoma growth and transition into carcinomas in colorectal [17]. Nevertheless, these models did not consider the detailed mechanism of gene mutations.
The detailed mutation mechanism of driver gene is extremely urgent for colorectal cancer patients at different ages, which can provide insights into the strategies for detection and treatment of tumor with a more effective way. Recently, the mathematical model involving the specific driver genes of colorectal cancer was developed to analyze the procession of malignant transformation in the colorectal [18,19]. The stochastic model of malignant transformation with the losses of APC and TP53 and gain of KRAS presented by Paterson et al. was used to study the specific order of these genes and indicated that inactivation of APC initiated the progression of tumor in more than half of colorectal cancer cases [18]. Subsequently, Zhang et al. studied the sizes of cells with different gene mutation and the waiting time distribution of driver gene mutations in colorectal cancer by using the five-hit branching process model based on the result of Paterson et al [20]. These work proposed the approximate analytic solution with only net growth of cells and analyzed the procession of malignant transformation in the colorectal cancer by using the solution and fixing the values of some parameters from references and some mutational data. However, they did not consider the effect of the death rate of cells on the risk of colorectal cancer. In addition, the rates of mutation and loss of heterozygosity (LOH) are not easy decided, since the rates of mutation and LOH may vary as the type of genetic instability and the mutation order of gene mutation in the colorectal cancer [6,21]. Therefore, our work considers the effect of the growth rate and death rate of cells on the risk of colorectal cancer and does not employ the value of gene mutation rate gave by reference [18].
We use the mathematical model with five hits considering the selective advantage provided by the KRAS, APC and TP53 genes. In the model, any order of mutations in genes KRAS, APC and TP53 is allowed, leading to multiple evolutionary pathways that induce the development of colorectal cancer. Moreover, all possible mutation pathways of each order are considered. We first compare the approximate analytic solution and the exact numerical solution we used. Our result shows that approximate solution will get worse in estimating the risk of cancer at young ages (before 25 ages) when the growth rate or death rate of cells is enough big (e.g. 50 per year). Thus, the exact numerical solution of the model is chosen to explore the mechanism of mutations in genes KRAS, APC and TP53 that are most likely to occur in the development of colorectal cancer. In our fitting, approximate Bayesian computation schemes with the simulated likelihood density is used to estimate the mutation rates of model [22]. Finally, we analyze the models with specific mutation order that can fit well the specific-age incidence data of colorectal cancer and apply them to estimate the expected number of mutated cells with alternations of different genes in the early stages of colorectal carcinogenesis.

Materials and methods
single mutation. Hence, the normal cells require five mutations to become tumor for colorectal cancer involving TP53, APC and KRAS. The detailed model is displayed in Fig 1. In Fig 1, the net growth rate of cells, l p i , is equal to the difference between growth rate of cells (a p i ) and death rate of cells (b p i ). We define the following variables, N-the number of normal cells in all crypts; In the model, m p 4 represents the effective transformation rate from premalignant cell to persistent malignant one that does not die. Let τ be the time of the first persistent malignant cell. The persistent malignant cell is produced at rate m p 4 P 4 ðsÞ at time s, the probability that persistent malignant cell doesn't show up by time t yields [24], Then the probability that a persistent malignant cell occurs by time t is given by Here, we are mainly concerned with the progression from normal cells to a persistent malignant cell, and the progression from a persistent malignant cell to the tumor detected clinically is assumed to be a fixed time, T lag . Thus, the probability that tumor is clinically detected at time t can be written as The hazard function, that is, the incidence rate of primary malignant tumors at time t, h(t), follows that The expected number of premalignant cells can be decided by the following system of equations The detailed derivation of the above equation is depicted in S1 Text. By solving Eqs (4) and (5), we get The above approximate solution is supported when the probability of persistent malignant cell, P(M(t) > 0), is close to zero [11]. However, it is not applied to cancers with high incidence rate [11]. Moreover, this approximate solution depends on the net growth rates and the mutation rates of cells, which doesn't reflect the effects of the growth or death rate of cells on the risk of cancer. Therefore, we analyze another solution of hazard function. Here we consider the growth rate and death rate of cells in the development of cancer. Let a p i and b p i denote the growth rate and death rate of premalignant cells with i mutation(s), respectively.
To solve the hazard function, we define the following probability generating functions It is easy to obtain the following survival function, that is, probability of no tumor at time t, and then SðtÞ ¼ À C 0 ð1; 1; 1; 1; 0; 0; t À T lag Þ Cð1; 1; 1; 1; 0; 0; t À T lag Þ ð9Þ The probability generating functions in formulas (7) satisfy the following Kolmogorov backward equations [25,26] The detailed derivation of above equation can be seen in S1 Text. By the definitions of the probability generating functions, the initial values of equations (10) are as follows Cð1; 1; 1; 1; 0; 0; 0Þ ¼ 1; Then, hazard function can be obtained by solving equations (10) with the initial conditions (11).

Comparison of two solutions
To analyze the effect of the growth rate of cells on the risk of cancer, we fix the following parameter values and let the growth rate of cells, a p i , takes different values, for example, a p i ¼ 0:5, a p i ¼ 1, a p i ¼ 10 and a p i ¼ 50. The logarithm of hazard functions from formula (6) and equations (10)  that the approximate solution is far greater than the exact numerical solution when the division rate or death rate of cells is quite large. In addition, the approximate solution only reflects the effect of net growth rate on hazard function. Nevertheless, the exact numerical solution involves the growth rate and death rate of cells in addition to net growth rate of cells. In fact, the approximate solution is given assuming that the probability of malignant cells is zero [11].
The approximate solution will get worse with an increase in the growth rate or death rate of cells. Therefore, the approximate solution may not be an excellent choice to simulate the risk of cancer, especially for a cancer with a high incidence rate. For stem cell in colon, it divides on average every five days [27]. It is shown that a stem cell produces two identical daughter cells by symmetry division and one mutated cell and one

PLOS COMPUTATIONAL BIOLOGY
The mechanism of gene mutations in colorectal cancer equivalent daughter cell by asymmetric division. In the model, the mutation rate, m p i , signifies an asymmetric division rate of per cell per year. That is, the sum of the mutation rate (m p i ) and the growth rate (a p i ) of the model is equal to the division rate of per cell per year. However, the mutation rate, m p i , is far less than the growth rate, a p i . Therefore, it is easy to estimate the growth rate of stem cells, approximately 73 per cell per year. By the Fig 2, using the approximate solution may produce a large difference in estimating the risk of colorectal cancer compare to the exact numerical solution. As a consequence, we simulate the incidence rate of colorectal cancer at specific age by using the numerical solution obtained by equations (9), (10) and (11).

The order of gene alterations
Gene mutation is the leading cause of drug resistance whose emergence is a great obstacle for tumor treatment [28,29]. For example, the mutation of KRAS leads to the resistant to gefitinib, erlotinib and cetuximab which induce the apoptosis of cancer cells in the therapy of cancer [30,31]. Hence, knowing the order of gene alterations in a tumor is extremely crucial for the early treatment of cancer, which provides the guidance to the selection of medicines. Here, we investigate all possible mutation orders of three driver gene in the colorectal cancer by the mathematical modelling. Tumor, including all alterations of the three drive genes, involves six cases with the order of gene alterations. The detailed descriptions are seen in Fig 3. All possible pathways of alterations in the genes, KRAS, APC and TP53, are displayed in Figs A-F in S1 Text.
TSG can inhibit tumor formation by inducing the death of abnormal cell, whose function is blocked only when both alleles are inactivated (TSG −/− ). In other words, the cells do not grow abnormally when only one allele is inactivated (TSG +/− ). However, the activation of oncogene just takes one hit (oncogene + ), which promotes the proliferation of cells. Thus, by the clonal 1. l p 2 ¼ l p 1 and l p 4 ¼ l p 3 , which involves the following sequences of mutations which involves the following sequences of mutations which involves the following sequences of mutations , which involves the following sequences of mutations There are five cases for the model with five hits by the above analyses, involving 30 pathways of gene mutations. To determine the specific order of three drive genes alterations, we choose the numerical solution of the model to fit the incidence rate data of colorectal cancer from the Surveillance Epidemiology and End Results (SEER) registry during the period 1973-2013. However, not all parameters of the model can be estimated by the data alone. Here, we assumed the growth rate of normal cells to be 73 per year, and the lag time from a persistent malignant cell to tumor detected clinically to be 5 years [27,32]. In addition, all mutation rates of cells (m p i ) are limited in the range (0, 10 −2 ), since the probability of a gene mutation (m p i t) is far less than one.
The activation of oncogene promotes the proliferation of cells, and the differentiation or apoptosis mechanism of cells will disarrange if the inactivation of TSG happens [6]. Studies showed that crypts carrying one of APC and KRAS alterations (that is the inactivation of APC and activation of KRAS) confer a selective growth advantage to cells [33,34]. It has been shown that the net growth rate of the cells including only APC −/− is 0.2 per year, and 0.07 per year for the cells with only KRAS + [35,36]. Furthermore, the inactivation of TP53 does not provide any growth advantage to cells in normal conditions [37]. We assume that mutations between genes have no interaction in cell growth. Therefore, the growth rate and death rate of premalignant cells with different gene alterations in the model are set to be as follows: In our simulations, the fourth-order Runge-Kutta is utilized to solve Equations (10) with the initial values (formulas (11)), and the model parameters, Y ¼ ðv; m p 1 ; m p 2 ; m p 3 ; m p 4 Þ where v = Nμ N , are estimated by using approximate Bayesian computation schemes involving the simulated likelihood density [22].
We do twenty fits of model with the different proliferation rate of cells (λ pi ), since mutation in first allele of TP53 and APC (TP53 +/− and APC +/− ) does not bring about the change in proliferation rate of cell. These fits correspond to the pathways in Figs A-F in S1 Text. Our fitting results suggest that only the orders with KRAS−APC−TP53, APC−TP53−KRAS and APC −KRAS−TP53 can accepted to explain the incidence rate of colorectal cancer at different ages, involving eleven pathways (see Fig 4). There are four pathway of mutations for the orders with  Table 1 shows the rate of mutation on first APC allele is larger than that of mutation on second APC allele, approximately 2 times, which infers that inactivation of APC is caused by mutation in both alleles, or by LOH in first allele and mutation in second allele. In addition, the inactivation rates of APC and TP53 are much greater than the activation rate of KRAS and the point mutation rate [38,39]. The total number of driver positions in APC is 604 from reference [18], and then the mutation rate of APC is 7.55 × 10 −6 if the base pair mutation rate takes 1.25 × 10 −8 from [39]. Our estimated value of alteration in APC (4.70 × 10 −3 ) is much larger than 7.55 × 10 −6 . These imply that the inactivations of APC or TP53 may be accompanied by genetic instability (microsatellite instability or chromosomal instability) that increases the mutation rate of gene. Evidences manifested that chromosomal instability might result from mutations in APC or TP53 [40,41].  1 10 . Evidences suggested that the point mutation rate of genes with microsatellite instability is about ten times as much as that without microsatellite instability for colorectal cancer [42,43]. It can be used as a mark in identifying the detailed mutation pathway for the orders with APC − TP53 − KRAS and APC − KRAS − TP53.

PLOS COMPUTATIONAL BIOLOGY
The mechanism of gene mutations in colorectal cancer

Discussion
The order of driver gene mutations is very vital for the clinical treatment and even prevention in the course of cancer. In this article, we construct the mathematical model considering all possible orders of mutations in APC, TP53 and KRAS that are the most significant driver genes in colorectal cancer. There are 30 pathways involving the alterations of all three driver genes, which need five hits from a normal cell to a malignant cell with all mutations of these driver genes. These pathways are classified as twenty cases based on the different proliferation rates of mutated cells, which corresponds with the five-stage model with different net growth

PLOS COMPUTATIONAL BIOLOGY
The mechanism of gene mutations in colorectal cancer rate of mutated cells (see Figs A-F in S1 Text). Studies showed that approximately 15% colorectal cancer harbored alterations in all of the three genes [4,5]. Therefore, we use the constructed model based on the net proliferation rates of mutated cells to match 15% incidence rate of colorectal cancer for male and female in different ages instead of those of a fixed age [18]. Our fitting results indicate that three mutation orders with KRAS ! APC ! TP53, APC ! TP53 ! KRAS, APC ! KRAS ! TP53, are supported to lead to tumor, involving eleven pathways. Among them, there are seven pathways that inactivation of APC acts as the first event in the colorectal cancer, and the inactivation of TP53 is the last event when the first event is activation of KRAS in the development of colorectal cancer. This is line with the results that the alternations of APC is an early driver event that accounts for 80% of colorectal cancers [44,45]. Inactivation of TP53 is critical in the transformation from early adenoma to advanced tumor, which regulates G1 cycle and apoptosis of cells. As a result, the inactivation of TP53 is usually a late event in the development of colorectal cancer. drugs that do not develop the resistance caused by the mutation of KRAS and APC should be adopted preferentially in the therapy of colorectal cancer.
The multi-type branching process is a useful tool to solve the risk function of the multistage model. The approximate solution of hazard function is used to predict the risk of cancer by some work [18,46,47]. Here, we consider the effect of the growth rate and death rate of cells on risk cancer and make a comparison for the approximate solution without the death rate of cells and the exact numerical solution including the growth rate and death rate of cells. It is shown that the approximate solution will overestimate the risk of cancer compare to the exact

PLOS COMPUTATIONAL BIOLOGY
The mechanism of gene mutations in colorectal cancer  The estimated values are obtained by calculating quartile of Fig 5(a)-5(e), and μ KRAS+ is derived by assuming N = 10 8 [20].
https://doi.org/10.1371/journal.pcbi.1011225.t002 Table 3. Estimate values of parameters in the pathway with The estimated values are obtained by calculating quartile of  The estimated values are obtained by calculating quartile of Fig 9(a)-9(e), and m APC þ=À or m TP53 þ=À is derived by assuming N = 10 8 [20].
https://doi.org/10.1371/journal.pcbi.1011225.t005 The estimated values are obtained by calculating quartile of Fig 10(a)-10(e), and m APC þ=À is derived by assuming N = 10 8 [20]. numerical solution when all parameters of model are fixed, especially for high growth rate or death rate of cells. As a consequence, we choose the exact numerical solution of hazard function to fit the data with incidence rate of colorectal cancer from SEER register during the year 1973-2013. In our work, we do not fix all parameters of model like references [18], and only growth rate and death rate of cells are decided from published references. The mutation rates of gene are not accurately inferred only based on the mutational data due to genetic instability and the type of gene mutation. For this purpose, the mutation rates of model are estimated by fitting incidence rate data of colorectal cancer based on approximate Bayesian computation The estimated values are obtained by calculating quartile of Fig 11(a)-11(e), and m APC þ=À is derived by assuming N = 10 8 [20].
https://doi.org/10.1371/journal.pcbi.1011225.t007 Table 8.   schemes with simulated likelihood density. We consider the simulation of incidence rate data of colorectal cancer from 0 to 84 years old instead of that at 80 years old in reference [18]. The estimated mutation rates are extremely valuable to predict the risk of colorectal cancer. Moreover, approximate Bayesian computation schemes is a good method to estimate the biological parameters. Among them, approximate Bayesian computation with Markov chain Monte Carlo and sequential Monte Carlo was applied to biological systems with success [48,49].

Mutation rate
Here, we use approximate Bayesian computation schemes with simulated likelihood density to estimate our model parameters, which is effective for inferring parameters in high-dimensional stochastic model, and can reduce the probability of getting stuck the low probability regions compared to approximate Bayesian computation with Markov chain Monte Carlo [22]. To further identify the feature of different gene mutation orders, the sizes of premalignant cells in the evolutionary pathways with KRAS ! APC ! TP53, APC ! TP53 ! KRAS, APC ! KRAS ! TP53 are analyzed. We find that the sizes of cells with single-mutant and doublemutant are significant difference in different gene mutation orders. The number of KRASmutated cells is higher than that of cells with both activated KRAS and inactivated APC at the beginning, and the number of cells with both KRAS-activated and APC-inactivated will surpass that of KRAS-mutated cells over time for the order with KRAS ! APC ! TP53. For the orders with APC ! TP53 ! KRAS and APC ! KRAS ! TP53, the cells with APC −/− far outnumber those with APC −/− TP53 −/− and APC −/− KRAS + . These results can be used as the mark of colorectal cancer diagnosis and inferring the sequence of gene mutation.
In this paper, we not only give the mutation order of three driver genes, but also provide the detailed pathway of gene mutations. However, there are still some limits in our work. Our results obtained do not consider the interaction effect between mutations in two genes, since no evidence suggests that epistatic interactions occur in mutations of two genes [4]. As is well known, inactivation of each allele of TSG has two ways involving point mutation and the LOH. We do not discuss the detailed pattern of alteration in TSGs, APC and TP53 in our model. The reason is that the model involving the pattern of inactivation in TSGs includes too many parameters, which results in non-identifiability issue of model parameters. Moreover, the type of genomic instability is unclear because of heterogeneity of tumor. However, we discuss the mutation rate of gene in different pathway obtained by fitting the incidence rate data of colorectal cancer at different ages. These estimated values of mutation rates can provide some information for inferring the pattern of alteration in TSGs and the existence of genetic instability. With the development of DNA sequencing, sequenced cancer genomes and gene expression data are important data source for inferring the evolutionary pathway in cancer progression, and some methods are developed to predict tumor evolution [50][51][52]. Although these data and methods do not give the rate of gene mutation, they can provide new ideas for constructing specific mathematical model of cancer. Thus, more diverse data deserve to be considered in further studying evolutionary pathways of cancer, which will be the direction for future research.