Parameter estimation for multistage clonal expansion models from cancer incidence data: A practical identifiability analysis

Many cancers are understood to be the product of multiple somatic mutations or other rate-limiting events. Multistage clonal expansion (MSCE) models are a class of continuous-time Markov chain models that capture the multi-hit initiation–promotion–malignant-conversion hypothesis of carcinogenesis. These models have been used broadly to investigate the epidemiology of many cancers, assess the impact of carcinogen exposures on cancer risk, and evaluate the potential impact of cancer prevention and control strategies on cancer rates. Structural identifiability (the analysis of the maximum parametric information available for a model given perfectly measured data) of certain MSCE models has been previously investigated. However, structural identifiability is a theoretical property and does not address the limitations of real data. In this study, we use pancreatic cancer as a case study to examine the practical identifiability of the two-, three-, and four-stage clonal expansion models given age-specific cancer incidence data using a numerical profile-likelihood approach. We demonstrate that, in the case of the three- and four-stage models, several parameters that are theoretically structurally identifiable, are, in practice, unidentifiable. This result means that key parameters such as the intermediate cell mutation rates are not individually identifiable from the data and that estimation of those parameters, even if structurally identifiable, will not be stable. We also show that products of these practically unidentifiable parameters are practically identifiable, and, based on this, we propose new reparameterizations of the model hazards that resolve the parameter estimation problems. Our results highlight the importance of identifiability to the interpretation of model parameter estimates.

The n-stage clonal expansion model is a continuous-time Markov chain with the following states: X(t), the number of normal cells at age t; Y 1 (t), . . . , Y n−2 , the number of cells in subsequent preintiation states; Y n−1 (t), the number of initiated cells; and Z(t), the number of malignant cells. Let ν be the initial mutation rate (with µ 0 := νX), µ 1 , . . . , µ n−3 the following preinitiation mutation rates, µ n−2 the initiation mutation rate, µ n−1 the malignant transformation rate, α the clonal expansion rate, and β the cell death rate. Mutations are modeled as an asymmetric division process resulting in one cell of the same class as the parent and one of the subsequent class. For notational simplicity, we suppress the possible dependence of the parameters on time. Let Then, the probability generating function for the numbers of preinitiated, initiated, and malignant cells may be written as Ψ(y 1 , . . . , y n−1 , z, t) = E y = (j 1 ,...,j n−1 ,k) P (j 1 ,...,j n−1 ,k) (t)y j 1 1 · · · y j n−1 n−1 z k .

S2
Because this equation depends on y 1 , we must further consider the system of equations in Eqs. (S10).
Along this characteristic, t = τ − τ 0 , z = 0, and y n−1 satisfies a Ricatti equation (Heidenreich et al., 1997), which, in the case of constant parameters, has the solution (with τ * = τ 0 + t * ) where p n , q n : Moreover, each y i (τ ), for 1 ≤ i ≤ n − 2, can be found recursively, noting y i (τ * ) = 1, Although the backward formulation (see, for example, Brouwer et al. (2016)) is more computationally tractable, this forward formulation allows us to find formulas for the hazards of the first several models in the constant parameter case.

Two-stage model
For the two-stage model, we can analytically derive the solution for Ψ in the constant parameter case.

Three-stage model
For the three-stage model, we have and Although this survival function cannot be written in closed form, the hazard can be, = µ 0 1 − q 3 − p 3 q 3 e −p 3 t − p 3 e −q 3 t µ 1 α . (S28)