Number and Size Distribution of Colorectal Adenomas under the Multistage Clonal Expansion Model of Cancer

Colorectal cancer (CRC) is believed to arise from mutant stem cells in colonic crypts that undergo a well-characterized progression involving benign adenoma, the precursor to invasive carcinoma. Although a number of (epi)genetic events have been identified as drivers of this process, little is known about the dynamics involved in the stage-wise progression from the first appearance of an adenoma to its ultimate conversion to malignant cancer. By the time adenomas become endoscopically detectable (i.e., are in the range of 1–2 mm in diameter), adenomas are already comprised of hundreds of thousands of cells and may have been in existence for several years if not decades. Thus, a large fraction of adenomas may actually remain undetected during endoscopic screening and, at least in principle, could give rise to cancer before they are detected. It is therefore of importance to establish what fraction of adenomas is detectable, both as a function of when the colon is screened for neoplasia and as a function of the achievable detection limit. To this end, we have derived mathematical expressions for the detectable adenoma number and size distributions based on a recently developed stochastic model of CRC. Our results and illustrations using these expressions suggest (1) that screening efficacy is critically dependent on the detection threshold and implicit knowledge of the relevant stem cell fraction in adenomas, (2) that a large fraction of non-extinct adenomas remains likely undetected assuming plausible detection thresholds and cell division rates, and (3), under a realistic description of adenoma initiation, growth and progression to CRC, the empirical prevalence of adenomas is likely inflated with lesions that are not on the pathway to cancer.


Introduction
Adenomatous polyps (or adenomas) in the large intestine are considered benign precursors of colorectal cancer (CRC) and both clinical and molecular evidence suggest that they may sojourn for many years before turning into cancer [1,2]. For this reason, adenomas are considered a primary intervention target if detected and removed before they become malignant. However, questions remain regarding the significance of their histopathology, molecular signatures, as well as their number and sizes in average risk individuals. Since endoscopic screening for neoplastic lesions is generally limited by macroscopic detection thresholds (of the order of a few mm in caliper size), a large fraction of adenomas may actually be missed, especially if the bulk of adenomas is too small for detection. Potentially, such "occult" adenomas could give rise to cancer before they are detected by endoscopy. Here we use a biologically based model of colorectal carcinogenesis, which has previously been fitted to the age-specific incidence of CRC, to compute the number and size distributions of adenomas. Of particular interest is the fraction of detectable adenomas, as a function of age, detection threshold and the underlying cell kinetics in the adenomas.
The underlying multistage clonal expansion (MSCE) model for CRC upon which our results are based explicitly considers the initiation, promotion and malignant conversion of adenomas [3–8]. According to this model, adenomas arise from normal colonic stem cells that suffer at least two rare rate-limiting events. We interpret these events as the biallelic inactivation of a tumor suppressor gene, in particular the APC tumor suppressor gene, which is the gene responsible for familial adenomatous polyposis (FAP), and which is frequently mutated in colorectal neoplasia [9]. The inactivation of APC is understood to occur in colonic crypts (the fundamental proliferative unit in the colon) whose stem cells have previously acquired a mutation at one of the two APC alleles. Because the process of adenoma formation may involve additional genes (such as KRAS), we extend the model framework to accommodate additional rate-limiting mutations for the initiation of an adenoma and generalize the mathematical derivation of their number and size distribution accordingly. However, there is both clinical and experimental evidence that the number of requisite rate-limiting events or mutations for adenoma initiation is small. Once a stem cell is initiated in this model, it is free to proliferate. The basic version of our CRC model assumes that adenoma initiation occurs when the remaining wild-type copy of the APC tumor suppressor gene is deleted or mutated in a stem cell of a (pre-initiated) APC+/− colonic crypt. In a more realistic model, which is supported by recent experimental findings in the murine system [10], we also model the transient amplification of APC−/− stem cells prior to their clonal expansion, effectively adding a stage to the initiation process [10–12].
The theoretical results derived here are complemented by model predictions for the adenoma size distributions and their (age-specific) prevalence based on parameter estimates obtained previously from fitting cancer incidence data. Since not all biological model parameters can be directly estimated from incidence data alone (non-identifiability issue), we explore the sensitivity of our findings by varying unknown parameters, such as the cell division rate of initiated stem cells, within their plausible ranges. In spite of the model uncertainties and the lack of precise clinical data on adenoma number and sizes, a biologically based approach that is broadly consistent with the pathogenesis of CRC makes it possible to explore more rationally the impact of risk factors and interventions on adenoma development and cancer progression.

The Multistage Clonal Expansion (MSCE) Model
First we briefly review the MSCE model for CRC and then introduce the notation for the relevant stochastic processes involved in the formation of adenomas. We have previously derived expressions for the number and size distribution of nonextinct pre-malignant clones in the context of the two-stage clonal expansion (TSCE) model [13]. This model assumes that the clones develop from a (deterministic) source of progenitor cells via a nonhomogeneous Poisson process. An important extension of this work was put forward by Dewanji et al. [14] for the size distribution of a random sum of Poisson-generated (pre-malignant) clones, which corresponds to a generalized Luria-Delbrück (GLD) distribution for mutant colonies. A hallmark of this distribution is a long tail reflecting large fluctuations of the total (mutant) population size. A further extension derived expressions for the number and size distributions of pre-malignant clones conditioning on observations from individuals who have not previously been diagnosed with CRC [6].
Adenoma development. For colorectal adenomas, the MSCE model assumes that adenomas arise within colonic crypts maintained by immortal stem cells that have accumulated $K$ ($\geq 1$) requisite (epi)genetic pre-initiation events. In this model, adenomas are allowed to be multi-focal since the initiation process, starting from the time when the $K$th pre-initiation event has occurred in a stem cell, is a point process representing the continuous generation of initiated cells from the $K$-stage cell, the progeny of which is free to undergo clonal expansion. In this picture, pre-initiated stem cells are blocked from clonal expansion; however, they are allowed to divide asymmetrically to generate daughter cells that will (ultimately) undergo terminal differentiation (see Figure 1). This is described in more detail below. To give an example, for a 3-step initiation model ($K = 2$), adenomas arise from crypts whose stem cell population sustains two consecutive hits (e.g., the inactivation of both alleles of the APC gene). In this case initiating events occur when the crypt stem cells suffer a third event.
Pre-initiation events. There are two distinct biological ways for pre-initiation events to occur at the cellular level. A pre-initiation event may occur in a stage $k$ ($k = 1, \ldots, K$) stem cell via an asymmetric stem cell division in which a mutation occurs in one of the two daughter stem cells. In other words, a stem cell in the $k$th pre-initiation stage may divide asymmetrically to yield two daughter stem cells, one in stage $k$ and the other in stage $(k+1)$, which acquires a new mutation (Figure 1A). This process is mathematically modeled as a Poisson process (PP) with intensity rate $\mu_k(\cdot)$. In this case, both daughters are retained in the stem cell compartment. The other possibility is represented by another kind of asymmetric stem cell division, which causes one daughter stem cell to acquire a mutation leading to a transition of this cell to stage $(k+1)$, while the other daughter is committed to differentiation (Figure 1B). For historical reasons, we refer to this event as an Armitage-Doll (AD) type transition [15].
For both cases we assume that a stage $k$ ($\leq K$) stem cell is not yet initiated and therefore lacks the potential for clonal expansion via symmetric cell divisions. However, once a stem cell enters stage $(K+1)$, it is considered initiated and free to undergo clonal expansion. Furthermore, all stem cells in the pre-initiation stages are assumed immortal.
Multi-focal nature of an adenoma. In the context of the MSCE model for CRC, an adenoma consists of the collection of all initiated (stage $K+1$) cells that derive from a single stage $K$ progenitor cell. This definition, while not unique, is consistent with the assumption that the stage $K$ progenitor may be subject to transient amplification, which is represented by frequent Poisson 'emissions' of initiated (stage $K+1$) cells that are free to undergo independent clonal expansions resulting in the formation of multiple sub-clones (Figure 1C).
We call this ensemble of sub-clones an adenoma or adenomatous polyp. Since information on adenoma number and sizes is typically obtained via screening of individuals who have not previously been diagnosed with CRC, we will also derive the results of the model conditioning on no prior clinical detection of CRC. For this purpose, we assume that the last rate-limiting event (with rate $\mu_{K+1}(\cdot)$) in the MSCE model, which is usually associated with the malignant conversion of an initiated cell, represents detection of a clinical cancer as in Jeon et al. [6].
Basic notation. Suppose there are $K$ ($\geq 1$) pre-initiation stages. We refer to the $k$th pre-initiation event as a $P_k$-mutation and the cells which have gone through this $P_k$-mutation as $P_k$-cells ($k = 1, \ldots, K$). For completeness, we refer to the normal stem cells as $P_0$-cells. The generation of $P_k$-cells from one $P_{k-1}$-cell can be modeled through a non-homogeneous Poisson process (PP) with rate $\mu_{k-1}(\cdot)$ per cell. Alternatively, it can be modeled as a direct transition of a $P_{k-1}$-cell into a $P_k$-cell, the AD-type transition referred to above, with hazard rate $\mu_{k-1}(\cdot)$ and density $f_{k-1}(\cdot)$ for the waiting time of a $P_{k-1}$-cell before the $P_k$-mutation takes place. For the AD-type transition, it is assumed that the functions $\mu_{k-1}(\cdot)$ and $f_{k-1}(\cdot)$ depend only on the time since the $P_{k-1}$-mutation occurred. If $\mu_{k-1}(\cdot)$ is a constant, then $f_{k-1}(\cdot)$ is the density of an exponential distribution. Once a $P_K$-cell is formed (by means of a $P_K$-mutation), it generates initiated cells (i.e., $P_{K+1}$-cells) according to a non-homogeneous PP with rate $\mu_K(\cdot)$, and the initiated cells grow according to a linear birth and death process with rates $\alpha(\cdot)$ and $\beta(\cdot)$, respectively. Note that the initiation of $P_{K+1}$-cells and their clonal expansion recapitulates the two-stage clonal expansion (TSCE) model for which explicit solutions for the number and size distributions have been derived [13]. Figure 1C illustrates the MSCE model for carcinogenesis. The pre-initiation events can be either PP-type or AD-type transitions. However, the first step in the MSCE model represents the successive (random) generation of $P_1$ mutant stem cells over time and throughout the normal tissue, which is assumed to be very large in size, about $10^8$ normal stem cells in colon and rectum combined. Hence, without loss of generality, the MSCE model assumes that the arrivals of the first mutations are of the PP-type.

Author Summary
The adenomatous polyp (or adenoma) is considered the common precursor lesion for colorectal cancer (CRC). Although the natural history of adenomas is well-characterized in terms of their histopathology and (epi)genomic changes, little is known about their dynamics in the stage-wise progression from the first appearance of an adenoma to its conversion to malignant cancer. By the time adenomas become endoscopically detectable (i.e., are in the range of 1–2 mm in diameter), adenomas are already comprised of hundreds of thousands of cells. A large fraction of adenomas may therefore remain undetected during screening and, in spite of their small (subthreshold) size, could give rise to cancer prior to being detected. It is therefore of importance to establish what fraction of adenomas is detectable, both as a function of the age at screening for colorectal neoplasia and the size (threshold) above which adenomas can be detected reliably. Here we derive mathematical expressions for the distribution of adenoma number and sizes based on a recently developed stochastic model for CRC, which has previously been calibrated and validated against age-specific CRC incidence data.
Size and detection of an adenoma. Suppose we observe, for an individual at a particular time $t$, the number of detectable adenomas, $N(t)$, and their sizes (in terms of the number of initiated or $P_{K+1}$-cells in each adenoma), $\{Y_i(t), i = 1, \ldots, N(t)\}$. We assume that an adenoma is detectable with probability one if its size is greater than a fixed threshold $y_0$. In the following sections, we derive the joint distribution of $\{N(t), (Y_i(t), i = 1, \ldots, N(t))\}$ for different values of $K$ and different assumptions regarding the type of pre-initiation process (i.e., PP or AD). For $K = 2$, we essentially consider the model recommended by Luebeck & Moolgavkar [12] for CRC and used by Jeon et al. [6] for evaluating screening strategies for adenomas in colon and rectum. The latter study provided an efficient approach to simulate the natural history of CRC by recognizing that the size of the adenomas, given the arrival time of a $P_K$-cell, follows a GLD distribution as derived by Dewanji et al. [14].
Let $s_K$ be the time of a particular $P_K$-mutation, that is, the arrival time for a progenitor cell (a $P_K$-cell) for an adenoma. Then this progenitor cell will generate initiated cells ($P_{K+1}$-cells) by a Poisson process, and each initiated cell forms a sub-clone of $P_{K+1}$-cells via a clonal expansion. Note that $Y(s_K,t)$ is the (random) sum of all sub-clones of initiated cells ($P_{K+1}$-cells) that were generated from the $P_K$ progenitor stem cell which was born at time $s_K < t$.
In particular, when the initiation rate $\mu_K(\cdot)$, the cell division rate $\alpha(\cdot)$, and the cell death rate $\beta(\cdot)$ are constants, the GLD distribution reduces to a Negative Binomial distribution (Eq. (3.34) in [6]). We refer to this distribution as (1) and write $p_n(s_K,t)$ for the probability that $Y(s_K,t) = n$, where $Y(s_K,t)$ denotes the size of the adenoma at time $t$, given the time $s_K$ of the $P_K$-mutation. Under these assumptions, the sub-clones that lead to this (multifocal) distribution arise from initiated $P_{K+1}$-cells that expand clonally by following a linear birth-death process. For this reason, the results derived here are readily applied to the situation when the sub-clones are also identified clinically. Conditioning this distribution on adenomas that occur in cancer-free individuals, i.e., individuals who have not had a prior occurrence of CRC, gives the probabilities $p^*_n(s_K,t) = P[Y(s_K,t) = n \mid Z(s_K,t) = 0]$ of expression (2) (see Eq. (3.30) in [6]), which involve the quantities $p$ and $q$ defined in (4) and (5). Here $Z(s_K,t)$ is the indicator variable for clinical detection of cancer at time $t$ with $P_K$-cells born at time $s_K$.
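To make the constant-parameter case concrete, the following minimal sketch evaluates the size distribution and the probability of exceeding a threshold $y_0$. It assumes the classical negative binomial law for a linear birth-death process fed by a Poisson source and started empty; the parametrization may differ from the one used in expression (1) and in [6], and the function names and numerical rates are hypothetical.

```python
import math

def adenoma_size_pmf(n, tau, mu_K, alpha, beta):
    """P[Y = n] a time tau = t - s_K after the P_K-mutation, constant rates.

    Assumption: negative binomial law of a linear birth-death process with
    Poisson immigration, with shape r = mu_K / alpha and success probability pi.
    """
    if tau <= 0.0:
        return 1.0 if n == 0 else 0.0        # no initiated cells yet
    growth = alpha - beta                     # net proliferation rate
    if abs(growth) < 1e-12:
        pi = 1.0 / (1.0 + alpha * tau)        # critical limit alpha == beta
    else:
        pi = growth / (alpha * math.exp(growth * tau) - beta)
    r = mu_K / alpha
    log_p = (math.lgamma(r + n) - math.lgamma(r) - math.lgamma(n + 1)
             + r * math.log(pi) + n * math.log1p(-pi))
    return math.exp(log_p)

def detection_probability(y0, tau, mu_K, alpha, beta):
    """P[Y > y0] = 1 - sum_{n=0}^{y0} P[Y = n]."""
    cdf = sum(adenoma_size_pmf(n, tau, mu_K, alpha, beta) for n in range(y0 + 1))
    return max(0.0, 1.0 - cdf)

# Hypothetical rates (per year), for illustration only.
print(detection_probability(y0=10, tau=5.0, mu_K=0.5, alpha=1.0, beta=0.8))
```

Under this assumption the mean size is $\mu_K (e^{(\alpha-\beta)\tau} - 1)/(\alpha-\beta)$, which offers a quick sanity check on any implementation.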

Number and Size Distribution for $K = 1$
For $K = 1$ we have only one pre-initiation event, defined by a $P_1$-mutation, and $P_2$-cells are the initiated cells. As mentioned in the previous section, the $P_1$-mutation follows a PP formulation. Let $Y(s_1,t)$ denote the size of the adenoma at time $t$ with the first pre-initiation ($P_1$-mutation) time $s_1$. The distribution of $Y(s_1,t)$ is given by the GLD distribution previously derived by Dewanji et al. [14], for the process originating at time $s_1$ and involving the initiation rate $\mu_K(\cdot) = \mu_1(\cdot)$ and the birth and death rates of the initiated cells given by $\alpha(\cdot)$ and $\beta(\cdot)$, respectively.
As mentioned before, we assume that the generation of $P_1$-cells follows a non-homogeneous PP with rate $\mu_0(\cdot)X(\cdot)$, where $X(\cdot)$ gives the deterministic growth curve for the normal stem cells in the tissue. Let $M(t)$ be the number of first pre-initiations ($P_1$-mutations) by time $t$ and let $s_{1j}$, $j = 1, \ldots, M(t)$, be the occurrence times of these $P_1$-mutations. Also, write $N(s_{1j},t) = I(Y(s_{1j},t) > y_0)$, where $I(\cdot)$ is the indicator function. That is, $N(s_{1j},t) = 1$ if the corresponding adenoma is detectable at time $t$, and 0 otherwise. Then, the number of detectable adenomas $N(t)$ can be written as a filtered Poisson process
$$N(t) = \sum_{j=1}^{M(t)} N(s_{1j},t).$$
The probability generating function (PGF) of $N(t)$ can be written as
$$E\left[u^{N(t)}\right] = \exp\left\{\int_0^t \mu_0(s_1)X(s_1)\left[\psi(u; s_1,t) - 1\right]ds_1\right\}, \quad (6)$$
where $\psi(u; s_1,t)$ is the PGF of the binary variate $N(s_1,t)$ with success probability
$$p^{(1)}_1(s_1,t) = P[N(s_1,t) = 1] = P[Y(s_1,t) > y_0 \mid Y(s_1,s_1) = 0].$$
Note that $p^{(1)}_1(s_1,t)$ is the probability that a $P_1$-mutation taking place at time $s_1$ results in a detectable adenoma at time $t$. This probability can be obtained from the distribution of $Y(s_1,t)$. For constant parameters, using (1) with $K = 1$, this reduces to $p^{(1)}_1(s_1,t) = 1 - \sum_{n=0}^{y_0} p_n(s_1,t)$.

Adenoma prevalence. It follows that the number of $P_1$-cells which lead to detectable adenomas at time $t$ follows a non-homogeneous PP with rate $\mu_0(s_1)X(s_1)p^{(1)}_1(s_1,t)$, with $s_1 \leq t$, and $N(t)$ has a Poisson distribution with mean $\int_0^t \mu_0(s_1)X(s_1)p^{(1)}_1(s_1,t)\,ds_1$. Since the adenoma prevalence is defined as the probability of at least one detectable adenoma at age $t$, it is given by
$$P[N(t) \geq 1] = 1 - \exp\left\{-\int_0^t \mu_0(s_1)X(s_1)p^{(1)}_1(s_1,t)\,ds_1\right\}.$$

Detection probability and size distribution of adenomas. The probability of detecting an adenoma of size $Y(t) > y_0$ at age $t$ is given by
$$\frac{\int_0^t \mu_0(s_1)X(s_1)p^{(1)}_1(s_1,t)\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)\,ds_1}. \quad (8)$$
For constant parameters, this reduces to $\frac{1}{t}\int_0^t p^{(1)}_1(s_1,t)\,ds_1$. Similarly, the size distribution of a detectable adenoma at age $t$ is given by
$$\frac{\int_0^t \mu_0(s_1)X(s_1)P[Y(s_1,t) = y \mid Y(s_1,s_1) = 0]\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)p^{(1)}_1(s_1,t)\,ds_1}, \quad y > y_0. \quad (9)$$

Likelihood for the number and size of detectable adenomas. Using the properties described above, it is straightforward to show that the joint probability (or likelihood $L_1$) of having $n$ detectable adenomas with sizes $y_i$, $i = 1, \ldots, n$, at age $t$, i.e., $\{N(t) = n, (Y_i(t) = y_i, i = 1, \ldots, n)\}$, is
$$L_1 = \frac{1}{n!}\exp\left\{-\int_0^t \mu_0(s_1)X(s_1)p^{(1)}_1(s_1,t)\,ds_1\right\}\prod_{i=1}^{n}\int_0^t \mu_0(s_1)X(s_1)P[Y(s_1,t) = y_i \mid Y(s_1,s_1) = 0]\,ds_1. \quad (10)$$

Extension to observations in individuals with no prior CRC. Here we derive analogous results for cancer-free individuals, i.e., individuals who have not developed CRC by time $t$. To this end, we require the conditional probability $p^{(1*)}_1(s_1,t)$ that a $P_1$-mutation taking place at time $s_1$ results in a detectable adenoma at time $t$ prior to developing CRC. Hence, we need to compute the conditional probability $p^{(1*)}_1(s_1,t) = P[Y(s_1,t) > y_0 \mid Z(s_1,t) = 0, Y(s_1,s_1) = 0]$. For constant parameters, using (2), $p^{(1*)}_1(s_1,t)$ can be calculated as $1 - \sum_{n=0}^{y_0} p^*_n(s_1,t)$.
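The unconditional prevalence for $K = 1$ lends itself to direct numerical evaluation. The sketch below is illustrative only: it assumes constant rates, a constant product $\mu_0 X$, and the classical negative binomial size law for a birth-death process with Poisson immigration (which may be parametrized differently from expression (1)). All function names and parameter values are hypothetical.

```python
import math

def detect_prob(tau, y0, mu1, alpha, beta):
    """P[Y > y0] a time tau after the P_1-mutation (assumed NB size law)."""
    if tau <= 0.0:
        return 0.0
    growth = alpha - beta
    if abs(growth) < 1e-12:
        pi = 1.0 / (1.0 + alpha * tau)
    else:
        pi = growth / (alpha * math.exp(growth * tau) - beta)
    r = mu1 / alpha
    term = pi ** r                 # P[Y = 0]
    cdf = term
    for n in range(y0):            # adds P[Y = 1], ..., P[Y = y0] by recurrence
        term *= (r + n) / (n + 1) * (1.0 - pi)
        cdf += term
    return max(0.0, 1.0 - cdf)

def prevalence_k1(t, y0, mu0_X, mu1, alpha, beta, steps=200):
    """Return (P[N(t) >= 1], E[N(t)]) for constant rates and constant mu0 * X."""
    h = t / steps
    p = [detect_prob(t - i * h, y0, mu1, alpha, beta) for i in range(steps + 1)]
    integral = h * (sum(p) - 0.5 * (p[0] + p[-1]))   # trapezoid rule over s_1
    mean_count = mu0_X * integral                     # Poisson mean of N(t)
    return 1.0 - math.exp(-mean_count), mean_count

# Hypothetical values (rates per year); not estimates from the paper.
prev, mean_count = prevalence_k1(t=70.0, y0=1000, mu0_X=0.02, mu1=1.0,
                                 alpha=9.0, beta=8.85)
print(prev, mean_count)
```

Replacing the trapezoid rule with an adaptive quadrature would be a natural refinement when the detection probability rises sharply with the age of the adenoma.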
Adenoma prevalence among individuals with no prior CRC. As before, the number of detectable adenomas at time $t$ in a cancer-free individual, $N^*(t)$, follows a Poisson distribution with mean given by $\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)p^{(1*)}_1(s_1,t)\,ds_1$, where the two-stage survival function $S_2(s_1,t)$ represents the probability that a $P_1$-cell born at time $s_1$ does not lead to CRC by time $t$. Thus, the adenoma prevalence conditioned on being cancer-free is given by
$$P[N^*(t) \geq 1] = 1 - \exp\left\{-\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)p^{(1*)}_1(s_1,t)\,ds_1\right\}. \quad (11)$$
For constant parameters, this two-stage survival function has been derived previously [6,12].
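For orientation, the closed form usually quoted for the two-stage survival function with constant rates is sketched below. It is written from the standard two-stage clonal expansion result rather than taken from this paper's own display, so the definition of $p$ and $q$ (here built from $\alpha$, $\beta$ and the malignant conversion rate $\mu_2$ for $K = 1$) is an assumption about the parametrization used in [6,12].

```latex
S_2(s_1,t) = \left[\frac{q-p}{q\,e^{-p(t-s_1)} - p\,e^{-q(t-s_1)}}\right]^{\mu_1/\alpha},
\qquad
p,\,q = \tfrac{1}{2}\left[-(\alpha-\beta-\mu_2) \mp \sqrt{(\alpha-\beta-\mu_2)^2 + 4\alpha\mu_2}\,\right].
% Sanity checks: S_2(s_1,s_1) = 1, and for small \tau = t - s_1,
% -\ln S_2(s_1,t) \approx \mu_1 \mu_2 \tau^2 / 2 (one initiation followed by one conversion).
```

With this convention $p < 0 < q$, $p + q = -(\alpha-\beta-\mu_2)$ and $pq = -\alpha\mu_2$, so $-p$ is close to the net cell proliferation rate when $\mu_2$ is small.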
Detection probability, size distribution, and likelihood function for adenomas in individuals with no prior CRC. Here we provide analogous expressions for the detection probability (8) and size distribution (9), but properly conditioned on no occurrence of prior CRC. The probability of detecting an adenoma at age $t$ with size $y > y_0$, conditioned on no prior CRC, can be written as
$$\frac{\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)p^{(1*)}_1(s_1,t)\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)\,ds_1}.$$
Similarly, for the size distribution of detectable adenomas (i.e., their sizes exceeding the threshold $y_0$) at age $t$, conditioned on no prior CRC, we find
$$\frac{\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)P[Y(s_1,t) = y \mid Z(s_1,t) = 0, Y(s_1,s_1) = 0]\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)p^{(1*)}_1(s_1,t)\,ds_1}.$$
Finally, following the derivation of (10), the joint distribution of the number and sizes of detectable adenomas, in a cancer-free individual, can be written as
$$\frac{1}{n!}\exp\left\{-\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)p^{(1*)}_1(s_1,t)\,ds_1\right\}\prod_{i=1}^{n}\int_0^t \mu_0(s_1)X(s_1)S_2(s_1,t)P[Y(s_1,t) = y_i \mid Z(s_1,t) = 0, Y(s_1,s_1) = 0]\,ds_1.$$

Number and Size Distribution for $K = 2$
Here the two pre-initiation events ($P_1$- and $P_2$-mutations) precede initiation and growth of initiated $P_3$-cells into sub-clones. Let $Y(s_1,s_2,t)$ denote the size of the adenoma at time $t$ with the corresponding $P_1$- and $P_2$-mutations taking place at times $s_1$ and $s_2$, respectively. Note that the distribution of $Y(s_1,s_2,t)$ is given by the GLD distribution originating at time $s_2$ and involving the initiation rate $\mu_K(\cdot) = \mu_2(\cdot)$ and the birth and death rates of the initiated or $P_3$-cells, given by $\alpha(\cdot)$ and $\beta(\cdot)$, respectively. We derive explicit expressions for the number and size distributions for the case when both $P_1$- and $P_2$-mutations are of PP-type. The case when the $P_2$-mutations are of AD-type is described in the online supplement (Text S1).
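As before, the number of detectable adenomas at time $t$ can be written as
$$N(t) = \sum_{j=1}^{M(t)} N^{(1)}(s_{1j},t), \quad (16)$$
where $N^{(1)}(s_{1j},t)$ is the number of detectable adenomas that emerged from a $P_1$-cell born at time $s_{1j} \leq t$. Then, as in (6), the PGF of $N(t)$ can be expressed by
$$E\left[u^{N(t)}\right] = \exp\left\{\int_0^t \mu_0(s_1)X(s_1)\left[\psi^{(1)}(u; s_1,t) - 1\right]ds_1\right\}, \quad (17)$$
where $\psi^{(1)}(u; s_1,t)$ is the PGF of $N^{(1)}(s_1,t)$. Using the Lemma and eq. (10) of Dewanji et al. [14], we have further
$$P[N(t) = 0] = \exp\left\{-\sum_{l \geq 1} q^{(1)}_l(t)\right\}$$
and for $n \geq 1$,
$$P[N(t) = n] = \frac{1}{n}\sum_{l=1}^{n} l\,q^{(1)}_l(t)\,P[N(t) = n-l], \quad (18)$$
where, for $l \geq 1$, $q^{(1)}_l(t) = \int_0^t \mu_0(s_1)X(s_1)P[N^{(1)}(s_1,t) = l]\,ds_1$. Again,
$$N^{(1)}(s_1,t) = \sum_{j} N(s_1,s_{2j},t),$$
where this sum is over all the $P_2$-mutations by time $t$ that occurred at times $s_{2j} \leq t$ and which emerged from a $P_1$-cell that was born at time $s_1$. Here, $N(s_1,s_2,t) = 1$ if the adenoma originating from the $P_1$-cell born at time $s_1$ and the $P_2$-cell born at time $s_2$ is detectable at time $t$, and 0 otherwise; that is, $N(s_1,s_2,t) = I(Y(s_1,s_2,t) > y_0)$. Note that $Y(s_1,s_2,t) = Y(s_2,t)$, since $Y(s_1,s_2,t)$ does not depend on $s_1$. Therefore, as in the previous section, $N^{(1)}(s_1,t)$ is a filtered PP with the PGF $\psi^{(1)}(u; s_1,t)$, similar to that in (6). Also, $N^{(1)}(s_1,t)$ is a non-homogeneous PP with rate $\mu_1(s_2)p^{(1)}_2(s_2,t)$, for $s_1 \leq s_2 \leq t$, and for fixed $t$, is a Poisson variate with mean $\int_{s_1}^t \mu_1(s_2)p^{(1)}_2(s_2,t)\,ds_2$, where
$$p^{(1)}_2(s_2,t) = P[N(s_1,s_2,t) = 1] = P[Y(s_2,t) > y_0 \mid Y(s_2,s_2) = 0]. \quad (20)$$
This probability can be obtained from the distribution of $Y(s_2,t)$ given in (1).

Adenoma prevalence. $p^{(1)}_2(s_2,t)$ has the same form as that of $p^{(1)}_1(s_1,t)$ but with $K = 2$. Thus, using (17), the adenoma prevalence is calculated by
$$P[N(t) \geq 1] = 1 - \exp\left\{-\int_0^t \mu_0(s_1)X(s_1)\left[1 - \exp\left(-\int_{s_1}^t \mu_1(s_2)p^{(1)}_2(s_2,t)\,ds_2\right)\right]ds_1\right\}. \quad (21)$$
The distribution of $N(t)$ can now be obtained by using (18), and the expected number of detectable adenomas can be readily obtained using (16), i.e.,
$$E[N(t)] = \int_0^t \mu_0(s_1)X(s_1)\int_{s_1}^t \mu_1(s_2)p^{(1)}_2(s_2,t)\,ds_2\,ds_1. \quad (22)$$

Detection probability and size distribution of adenomas. The probability of detecting an adenoma at age $t$ is the ratio of (22) to the expected number of all adenomas, whether detectable or not,
$$\frac{\int_0^t \mu_0(s_1)X(s_1)\int_{s_1}^t \mu_1(s_2)p^{(1)}_2(s_2,t)\,ds_2\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)\int_{s_1}^t \mu_1(s_2)\,ds_2\,ds_1}. \quad (23)$$
For constant pre-initiation rates $\mu_0$ and $\mu_1$, and a constant normal stem cell number $X$, this reduces to $2\int_0^t s_2\,p^{(1)}_2(s_2,t)\,ds_2 / t^2$, because exchanging the order of integration turns the numerator into $\mu_0 X \mu_1 \int_0^t s_2\,p^{(1)}_2(s_2,t)\,ds_2$ and the denominator into $\mu_0 X \mu_1 t^2/2$. Similarly, the size distribution of a detectable adenoma at age $t$ is simply given by
$$\frac{\int_0^t \mu_0(s_1)X(s_1)\int_{s_1}^t \mu_1(s_2)P[Y(s_2,t) = y \mid Y(s_2,s_2) = 0]\,ds_2\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)\int_{s_1}^t \mu_1(s_2)p^{(1)}_2(s_2,t)\,ds_2\,ds_1}, \quad y > y_0. \quad (24)$$

Likelihood for the number and size of detectable adenomas. Let $M_s(t)$ be the number of 'special' $P_1$-mutations that lead to at least one detectable adenoma at time $t$.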
Because of the filtered-Poisson-process nature of the generation of the adenomas, the occurrence of such special $P_1$-mutations follows a PP with rate $\mu_0(s_1)X(s_1)p^{(2)}_1(s_1,t)$, for $s_1 \leq t$, where $p^{(2)}_1(s_1,t)$ is the probability that a $P_1$-mutation that occurred at time $s_1$ leads to at least one detectable adenoma by time $t$. Using the distribution of $N^{(1)}(s_1,t)$, we have
$$p^{(2)}_1(s_1,t) = P[N^{(1)}(s_1,t) \geq 1] = 1 - \exp\left\{-\int_{s_1}^t \mu_1(s_2)p^{(1)}_2(s_2,t)\,ds_2\right\}.$$
We now turn to the joint distribution of having $n$ detectable adenomas with sizes $y_i$, $i = 1, \ldots, n$, at age $t$, i.e., $\{N(t) = n, (Y_i(t) = y_i, i = 1, \ldots, n)\}$, when $n > 0$. First, let $N_i(t)$ denote the number of detectable adenomas arising out of the $i$th 'special' $P_1$-mutation with sizes $Y_{ij}(t)$, $j = 1, \ldots, N_i(t)$. Clearly,
$$N(t) = \sum_{i=1}^{M_s(t)} N_i(t).$$
Then, given $M_s(t) = m > 0$, the events $\{N_i(t), Y_{ij}(t), j = 1, \ldots, N_i(t)\}$, for $i = 1, \ldots, m$, are independent and identically distributed with [...] Therefore, the joint probability of $\{M_s(t) = m\}$ and $\{N_i(t) = n_i, Y_{ij}(t) = y_{ij}, j = 1, \ldots, n_i\}$ for $i = 1, \ldots, m$, is given by [...]

[...] and $P[Y(s_2,t) = y_{ij} \mid Y(s_2,s_2) = 0]$ is replaced by $P[Y(s_2,t) = y_{ij} \mid Z(s_2,t) = 0, Y(s_2,s_2) = 0]$. Note that $S_3(s_1,t)$ and $S_2(s_2,t)$ represent the respective probabilities that a $P_1$-cell born at time $s_1$ and a $P_2$-cell born at time $s_2$ do not give rise to CRC by time $t$. In other words, they are the tumor survival functions of a 3-stage and 2-stage carcinogenesis model, respectively. For constant parameters, these survival functions are available in closed form (see [6,12] for the derivation), with $p$ and $q$ as defined in (4) and (5) with $K = 2$ (see [6,12] for details). Thus, the conditional expression for adenoma prevalence is given by
$$1 - \exp\left\{-\int_0^t \mu_0(s_1)X(s_1)S_3(s_1,t)\left[1 - \exp\left(-\int_{s_1}^t \mu_1(s_2)S_2(s_2,t)p^{(1*)}_2(s_2,t)\,ds_2\right)\right]ds_1\right\}, \quad (29)$$
where $p^{(1*)}_2(s_2,t) = P[Y(s_2,t) > y_0 \mid Z(s_2,t) = 0, Y(s_2,s_2) = 0]$. The expected number of detectable adenomas conditioned on no prior CRC can be obtained through analogous replacements in (22).
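The unconditional $K = 2$ quantities, namely the expected number (22), the prevalence (21) and the constant-rate detection probability $2\int_0^t s_2\,p^{(1)}_2(s_2,t)\,ds_2/t^2$, can be evaluated together on one grid. The sketch below is a minimal illustration under constant rates; it assumes the classical negative binomial size law for a birth-death process with Poisson immigration, and every name and numerical value in it is hypothetical.

```python
import math

def detect_prob(tau, y0, mu_init, alpha, beta):
    """P[Y > y0] a time tau after the P_2-mutation (assumed NB size law)."""
    if tau <= 0.0:
        return 0.0
    growth = alpha - beta
    if abs(growth) < 1e-12:
        pi = 1.0 / (1.0 + alpha * tau)
    else:
        pi = growth / (alpha * math.exp(growth * tau) - beta)
    r = mu_init / alpha
    term = pi ** r                 # P[Y = 0]
    cdf = term
    for n in range(y0):            # adds P[Y = 1], ..., P[Y = y0]
        term *= (r + n) / (n + 1) * (1.0 - pi)
        cdf += term
    return max(0.0, 1.0 - cdf)

def k2_summaries(t, y0, mu0_X, mu1, mu2, alpha, beta, steps=200):
    """Return (E[N(t)], prevalence, detection probability) for constant rates."""
    h = t / steps
    s = [i * h for i in range(steps + 1)]
    p = [detect_prob(t - si, y0, mu2, alpha, beta) for si in s]
    # tail[i] approximates the inner integral of p over s_2 from s[i] to t
    tail = [0.0] * (steps + 1)
    for i in range(steps - 1, -1, -1):
        tail[i] = tail[i + 1] + 0.5 * h * (p[i] + p[i + 1])

    def trapz(v):
        return h * (sum(v) - 0.5 * (v[0] + v[-1]))

    expected = mu0_X * mu1 * trapz(tail)                       # expression (22)
    special = mu0_X * trapz([1.0 - math.exp(-mu1 * x) for x in tail])
    prevalence = 1.0 - math.exp(-special)                      # expression (21)
    detect = 2.0 * trapz([si * pi for si, pi in zip(s, p)]) / t ** 2
    return expected, prevalence, detect

# Hypothetical values (rates per year); not estimates from the paper.
print(k2_summaries(t=70.0, y0=1000, mu0_X=0.5, mu1=0.005, mu2=1.0,
                   alpha=9.0, beta=8.85))
```

Because $1 - e^{-x} \leq x$, the prevalence computed this way can never exceed $1 - e^{-E[N(t)]}$, which is a convenient consistency check.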
Detection probability and size distribution for adenomas in individuals with no prior CRC. As before for the case with $K = 1$, we also provide analogous expressions for the detection probability (23) and size distribution (24), but properly conditioned on no previous occurrence of CRC. The probability of detecting an adenoma at age $t$ with size $y > y_0$ becomes
$$\frac{\int_0^t \mu_0(s_1)X(s_1)S_3(s_1,t)\int_{s_1}^t \mu_1(s_2)S_2(s_2,t)p^{(1*)}_2(s_2,t)\,ds_2\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)S_3(s_1,t)\int_{s_1}^t \mu_1(s_2)S_2(s_2,t)\,ds_2\,ds_1}, \quad (30)$$
where $p^{(1*)}_2(s_2,t) = P[Y(s_2,t) > y_0 \mid Z(s_2,t) = 0, Y(s_2,s_2) = 0]$, and for the size distribution of detectable adenomas (i.e., their sizes exceeding the threshold $y_0$) at age $t$, conditioned on no prior CRC, we have
$$\frac{\int_0^t \mu_0(s_1)X(s_1)S_3(s_1,t)\int_{s_1}^t \mu_1(s_2)S_2(s_2,t)P[Y(s_2,t) = y \mid Z(s_2,t) = 0, Y(s_2,s_2) = 0]\,ds_2\,ds_1}{\int_0^t \mu_0(s_1)X(s_1)S_3(s_1,t)\int_{s_1}^t \mu_1(s_2)S_2(s_2,t)p^{(1*)}_2(s_2,t)\,ds_2\,ds_1}. \quad (31)$$

Number and Size Distribution for General $K$
The notation introduced in the previous section is easily generalized to $K > 2$. The random variable $Y(s_1, \ldots, s_K,t)$ denotes the size of the adenoma at time $t$ with the corresponding $P_1$-, $\ldots$, $P_K$-mutation times $s_1, \ldots, s_K$, respectively. The distribution of $Y(s_1, \ldots, s_K,t)$ is, as before, given by the GLD distribution with time origin at $s_K$ and with initiation rate $\mu_K(\cdot)$. Initiated $P_{K+1}$-cells divide or die with rates $\alpha(\cdot)$ and $\beta(\cdot)$, respectively. Importantly, $Y(s_1, \ldots, s_K,t)$ depends on $s_K$ alone (i.e., $Y(s_1, \ldots, s_K,t) = Y(s_K,t)$), and the distribution is given by (1) for constant parameters.
Various combinations of AD-type and PP-type generations for the K pre-initiations are possible, but the formulation of the likelihood becomes more complicated. The special cases when all pre-initiations are of PP-type or AD-type are covered in the online supplement (Text S1).

Results
The derived expressions allow us to readily predict both observable and unobservable numbers of pre-malignant tumors in a tissue. Such predictions are helpful in validating cancer models using intermediate endpoints on precursor lesions, in particular their number and sizes. Furthermore, being able to predict the unobserved portion of precursor lesions is of clinical relevance for early detection and cancer prevention. Here, we illustrate the utility of the derived expressions using the example of colonic adenomas. Specifically, we present the predicted size distribution of adenomas and the age-specific adenoma prevalence, i.e., the probability of finding at least one observable adenoma in an individual as a function of age. Since population-level screening is typically performed on asymptomatic individuals, we also condition on individuals having not developed cancer in the tissue of interest at the time of observation.
The predictions presented here are for $K = 2$ as described above. The underlying CRC model for cancer incidence is the 4-stage model previously derived by Luebeck & Moolgavkar [12] and updated by Meza et al. [7]. The alternative model (PP for the $P_1$- and AD for the $P_2$-mutation for $K = 2$; see the online supplement, Text S1) yields very similar results (not shown). Importantly, not all biological parameters of the MSCE/CRC model are estimable from incidence data alone. For example, for the 4-stage model used here, only the parameters $p$, $q$, the product (slope parameter) $\mu_0 X \mu_1$, and the ratio $\mu_2/\alpha$ are identifiable. However, if the cell division rate of initiated cells, $\alpha$, is known, all parameters of the model can be determined (assuming that the number of normal tissue stem cells, $X$, is known and that $\mu_0 = \mu_1$). For our illustrations, we choose plausible values for the cell division rate $\alpha$, but keep the values of $p$, $q$, $\mu_0 X \mu_1$, and $\mu_2/\alpha$ as estimated by Meza et al. [7]. This affords explicit computation of the adenoma number and size distribution without altering the fits of the model to the observed CRC incidence.
Figure 2 (left panel) shows the predicted size distribution of nonextinct adenomas without an imposed detection threshold (i.e., $y_0 = 0$) for the model with $K = 2$. With constant parameters, the unconditional and conditional (on no prior CRC development) size distributions of detectable adenomas are given by expressions (24) and (31), respectively.
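The identifiability argument above can be illustrated with a short round-trip calculation: fix a value of $\alpha$, then recover the remaining rates from the identifiable quantities. The sketch assumes the standard two-stage definition of $p$ and $q$ as the roots satisfying $p + q = -(\alpha - \beta - \mu_3)$ and $pq = -\alpha\mu_3$, with $\mu_3$ the malignant conversion rate of the 4-stage model; the paper's own definitions (4) and (5) are not reproduced in the text, so this convention, the function names and all numbers are assumptions for illustration.

```python
import math

def roots_pq(alpha, beta, mu_conv):
    """Assumed standard definition of p (< 0) and q (> 0)."""
    a = alpha - beta - mu_conv
    d = math.sqrt(a * a + 4.0 * alpha * mu_conv)
    return 0.5 * (-a - d), 0.5 * (-a + d)

def recover_rates(p, q, slope, init_ratio, alpha, X):
    """Recover all rates from identifiable quantities once alpha is fixed.

    slope      : mu_0 * X * mu_1 (identifiable product)
    init_ratio : mu_2 / alpha    (identifiable ratio)
    """
    mu2 = init_ratio * alpha          # initiation rate
    mu3 = -p * q / alpha              # conversion rate, from p * q = -alpha * mu_3
    beta = alpha + p + q - mu3        # from p + q = -(alpha - beta - mu_3)
    mu0 = math.sqrt(slope / X)        # uses the assumption mu_0 = mu_1
    return {"mu0": mu0, "mu1": mu0, "mu2": mu2, "mu3": mu3, "beta": beta}

# Hypothetical identifiable values generated from alpha = 9 per year.
p, q = roots_pq(alpha=9.0, beta=8.85, mu_conv=1e-6)
for alpha in (9.0, 90.0):             # two candidate cell division rates
    print(alpha, recover_rates(p, q, slope=1e-4, init_ratio=1.0,
                               alpha=alpha, X=1e8))
```

Both choices of $\alpha$ reproduce the same $p$, $q$, slope and ratio, which is why the incidence fit is unchanged while the implied adenoma sizes differ.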
For sizes sufficiently large, the unconditional adenoma size distribution is roughly log-log-linear, while the conditional size distribution shows departures from this behavior for sizes above approximately 100,000 cells, i.e., when the risk of an adenoma-to-carcinoma transition increases more rapidly. This phenomenon is more noticeable when the cell division rate $\alpha$ is lower.
Figure 2 (right panel) shows the probability of detecting an adenoma at age 70 as a function of the detection threshold $y_0$ for both unconditional (solid) and conditional (dashed) adenoma size distributions. Higher cell division rates ($\alpha$) give rise to larger adenoma sizes and hence lead to higher detection probabilities even though the net cell proliferation rate ($\approx -p$) is approximately the same. For constant parameters, the unconditional and conditional detection probabilities are given by (23) and (30), respectively. This figure reveals that even for relatively small (i.e., sensitive) thresholds of a few thousand cells, many adenomas may go undetected. However, the precise proportion of detectable adenomas depends on the cell division rate $\alpha$, with higher values of $\alpha$ making detection more likely.
Figure 3 shows the predicted age-specific adenoma prevalence in asymptomatic individuals for both males and females and for the models with $K = 1$ and $K = 2$, as described by Meza et al. [7], including their dependence on the observation threshold $y_0$. The prevalence is defined as the probability of at least one detectable adenoma at age $t$, and is given by (11) for $K = 1$ and (29) for $K = 2$. In comparison with careful observations from autopsy studies [16], the model underestimates these empirical data (represented by filled circles) unless one accepts a very small number of initiated stem cells to be observable. There are several explanations why the model-generated expected prevalence of adenoma might be lower than the clinical data would indicate (see the end of Discussion).

Discussion
We have previously derived number and size distributions of pre-malignant clones in the context of the two-stage clonal expansion (TSCE) model of carcinogenesis [13,17] and more recently established a formal connection of these results with fluctuation analyses based on the Luria-Delbrück distribution [14]. The mathematical tools derived in these papers were subsequently applied to the problem of screening for colorectal adenoma allowing for interventions resulting from their complete or incomplete removal [6]. These explorations, however, required time-consuming computer simulations. In contrast, here we derive mathematical expressions that allow us to readily compute adenoma number and size distributions without simulation. These expressions can form the basis for computing the likelihood of adenoma data from screening studies involving sigmoidoscopies, colonoscopies and computed-tomographic colonographies, and thus are of significant practical importance. Moreover, the analytical form of the likelihood function allows for parameter estimation and likelihood-based hypothesis testing. Analyses of such data will be forthcoming and are the subject of a separate paper.
Our previous analyses of CRC incidence data suggest that K, the number of requisite pre-initiation mutations, is indeed small [7,12]. K = 1 corresponds to a 'two-hit' model for initiation, in essence representing the biallelic inactivation of a tumor suppressor gene (Knudson's recessive oncogenesis) [7]. A model with K = 2 may describe both the inactivation of a tumor suppressor gene as well as the activation of an oncogene [7,12]. Here, we also treat the case of general K, which can be viewed as a model for clonal evolution due to the tree-structure where the nodes represent immortal mutant stem cells that will give rise to specific sub-clones which may or may not be identified as such. We distinguish between two types of rate-limiting events, one that generates (potentially multiple) mutations via asymmetric cell division while preserving the progenitor stem cell from which the mutations arose (PP-type), and one that leads to a transition of a progenitor cell into one cell that acquires a new mutation (AD-type).
Figure 2. Size distribution of a detectable adenoma and probability of detection of an adenoma. Left panel: predicted unconditional (solid) and conditional (dashed) size distribution of a detectable adenoma at age 70 using the parameter estimates obtained by [7] for females in SEER with K = 2 and y_0 = 0. The cell division rate of initiated cells, a, is assumed as either 9 or 90/year. Right panel: the probability of detection of an adenoma at age 70 as a function of the detection threshold y_0. Otherwise same as left panel. doi:10.1371/journal.pcbi.1002213.g002
Figure 3. Adenoma prevalence. Predicted adenoma prevalence for both males (Left panel) and females (Right panel) as a function of age and various observation thresholds y_0 using the models with K = 2 (solid lines) and K = 1 (dashed lines). Empirical data from Clark et al. [16] in filled circles. doi:10.1371/journal.pcbi.1002213.g003
The MSCE model used here assumes that all events that lead to initiated cells are PP-type. This is only a mild restriction since for rare events the PP-type emission is equivalent to an AD-type transition (see Figure 1). For frequent (high rate) events, the AD-type transition loses its rate-limiting nature and can be ignored, while the high rate (PP-type) process leads to the accumulation of multiple clones and thus has the potential to capture non-mutational events, such as the transient amplification of proliferative cells from resident stem cells in the colonic crypts. Once a stem cell is considered initiated, i.e., is of type P_{K+1}, we assume that it undergoes a stochastic birth-death process. This leads to the GLD distribution introduced in [14] for the adenoma size Y(s_1, ..., s_K, t), which reduces to a Negative Binomial distribution for constant parameters. Note, however, that our formalism is more general and can accommodate other growth models that do not result in a GLD size distribution for the initiated cell population emerging from a P_K progenitor cell [17,18].
Finally, we wish to comment on the predictions of the model for the age-specific adenoma prevalence (Figure 3). In comparison with the empirical data, our predictions appear too low. However, the discrepancy depends on what is assumed for the initiated stem cell division rate a and the detection threshold y_0. While the range of plausible values for a is limited by how fast initiated stem cells can cycle in the adenoma (unlikely more than 2-3 times a week), it is not clear what fraction of cells in an adenoma is truly at risk for malignant transformation [10]. Assuming that a 1 mm adenoma, the caliper size detection limit cited by Clark et al. [16], contains about 500,000 cells [19] and that only 1-10% of cells in an adenoma are tumor stem cells [10,20], y_0 may be as low as 5000 cells and therefore the discrepancy may be less dramatic. Alternatively, one might include pre-initiated cells in the adenoma size count. However, our assumption is that pre-initiated cells do not expand clonally, although they may increase in number as a result of multiple births of the same type of mutation from a single stem cell over time (via Poisson process emissions). Thus, since locus-specific mutations are rare (of the order of 10^-6 to 10^-7 per year), the contribution of pre-initiated cells to the overall number of cells in an adenoma is likely very small.
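The arithmetic behind these two statements can be checked with a short sketch. It uses only the figures quoted above (about 500,000 cells in a 1 mm adenoma, a stem cell fraction of 1-10%, and mutation rates of 10^-6 to 10^-7 per year); the age of 70 years and the function names are assumptions introduced for illustration.

```python
def stem_cell_threshold(total_cells, stem_fraction):
    """Detection threshold y_0 expressed in tumor stem cells."""
    return round(total_cells * stem_fraction)

def expected_emissions(rate_per_year, years):
    """Mean number of Poisson emissions from one stem cell over `years`."""
    return rate_per_year * years

if __name__ == "__main__":
    cells_at_1mm = 500_000           # cells in a 1 mm adenoma, as quoted from [19]
    for fraction in (0.01, 0.10):    # stem cell fraction of 1-10% [10,20]
        y0 = stem_cell_threshold(cells_at_1mm, fraction)
        print(f"stem fraction {fraction:.0%}: y_0 = {y0} cells")

    age = 70                         # assumed age in years, for illustration
    for rate in (1e-6, 1e-7):        # locus-specific mutation rate per year
        mean = expected_emissions(rate, age)
        print(f"rate {rate:.0e}/yr: expected emissions per stem cell = {mean:.1e}")
```

The lower end of the stem cell fraction gives the threshold of 5000 cells quoted in the text, and the expected number of emissions per pre-initiated stem cell remains far below one under these assumptions.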
It is well recognized that adenomas can be genetically diverse and differ widely in their neoplastic potential. Indeed, adenomas have been suggested to regress, implying that some adenomas are not on the pathway to cancer [21], although regression may simply reflect the stochastic nature of tumor growth. A more intriguing possibility for resolving the discrepancy is that adenomas go through a growth bottleneck (i.e., stagnancy) before they can become cancerous. In this scenario, adenomas might sojourn in a reservoir until an activating mutation or change in tumor microenvironment releases them from arrest [22,23]. Although incorporating this scenario into the MSCE model may be challenging, the framework presented here is independent of the particular dynamics of the initiated cells and the number of clonal expansions assumed.

Supporting Information
Text S1 Supplementary methods and results. A tabular glossary which summarizes our notation and succinctly defines the model parameters and terminology in use. (PDF)