Parameter Identifiability and Redundancy in a General Class of Stochastic Carcinogenesis Models

Background Heidenreich et al. (Risk Anal 1997 17 391–399) considered parameter identifiability in the context of the two-mutation cancer model and demonstrated that combinations of all but two of the model parameters are identifiable. We consider the problem of identifiability in the recently developed carcinogenesis models of Little and Wright (Math Biosci 2003 183 111–134) and Little et al. (J Theoret Biol 2008 254 229–238). These models, which incorporate genomic instability, generalize a large number of other quasi-biological cancer models, in particular those of Armitage and Doll (Br J Cancer 1954 8 1–12), the two-mutation model (Moolgavkar et al. Math Biosci 1979 47 55–77), the generalized multistage model of Little (Biometrics 1995 51 1278–1291), and a recently developed cancer model of Nowak et al. (PNAS 2002 99 16226–16231). Methodology/Principal Findings We show that in the simpler model proposed by Little and Wright (Math Biosci 2003 183 111–134) the number of identifiable combinations of parameters is at most two less than the number of biological parameters, thereby generalizing previous results of Heidenreich et al. (Risk Anal 1997 17 391–399) for the two-mutation model. For the more general model of Little et al. (J Theoret Biol 2008 254 229–238) the number of identifiable combinations of parameters is at most less than the number of biological parameters, where is the number of destabilization types, thereby also generalizing all these results. Numerical evaluations suggest that these bounds are sharp. We also identify particular combinations of identifiable parameters. Conclusions/Significance We have shown that the previous results on parameter identifiability can be generalized to much larger classes of quasi-biological carcinogenesis model, and also identify particular combinations of identifiable parameters. These results are of theoretical interest, but also of practical significance to anyone attempting to estimate parameters for this large class of cancer models.


Introduction
Models for complex biological systems may involve a large number of parameters. In principle it may well be that some of these parameters may not be observed, or be possible to be derived from observed data via regression techniques. Such parameters are said to be unidentifiable or non-identifiable, the remaining parameters being identifiable.
There is a substantial literature on identifiability in stochastic models in various contexts [1,2,3]. Catchpole and Morgan [3] considered identifiability and parameter redundancy and the relations between them in a general class of (exponential family) models. Catchpole and Morgan [3] defined a set of model parameters in an exponential family model to be redundant if the likelihood can be written using a strictly smaller parameter vector; otherwise they are irredundant. Rothenberg [1], Jacquez and Perry [4] and Catchpole and Morgan [3] also defined a notion of local identifiability, to mean that within a neighbourhood of each set of parameter values the likelihood differs for at least some data points. This notion has been extended by Little et al. [5] to gradient weak local identifiability and weak local identifiability. Little et al. [5] defined a set of parameters to be weakly locally identifiable if the maxima of the likelihood are isolated; they defined parameters to be gradient weakly locally identifiable if the turning points (those for which the likelihood derivative with respect to the parameters is zero) are isolated. The results obtained by Little et al. [5] (Corollary 2 (ii) and the subsequent Remark (ii)), show that, subject to some regulatory conditions, the number of locally identifiable or (gradient) weakly locally identifiable parameter combinations is equal to the rank of the Hessian matrix, or equivalently the rank of the Fisher information matrix. The notions of identifiability in stochastic models [1,2,3,5], within which framework this paper is set, should be contrasted with the consideration of identifiablity in non-stochastic settings considered by some [4,6,7].
Heidenreich [8] and Heidenreich et al. [9] considered parameter identifiability in the context of the two-mutation cancer model [10] and demonstrated that of the five biological parameters in the model, on the basis of the cancer hazard function only three could be identified. [It should be noted that given extra information, for example on numbers and sizes of intermediate cell compartment clones, there is information on an additional parameter.] In this paper we consider the problem of identifiability in recently developed carcinogenesis models of Little and Wright [11] and Little et al. [12]. These models generalize a large number of other quasi-biological cancer models, in particular those of Armitage and Doll [13], the two-mutation model [10], the generalized multistage model of Little [14], and a recently developed cancer model of Nowak et al. [15] that incorporates genomic instability. We shall show that via a specific reparameterization, in the simpler model proposed by Little and Wright [11] in principle combinations of all but two of the model parameters are identifiable, thereby generalizing previous results of Heidenreich [8] and Heidenreich et al. [9] for the two-mutation cancer model. For the more general model of Little et al. [12] combinations of all but rz1 of the model parameters are identifiable, where r is the number of destabilization types, thereby also generalizing all these results. We also identify particular forms of identifiable parameters.

Parameter Identifiability in the Context of a Stochastic Cancer Model with Genomic Instability
We consider the problem of parameter identifiability in a particular class of stochastic cancer models, those of Little and Wright [11] and Little et al. [12]. The ideas used are similar to those employed by Heidenreich et al. [9], in particular the use of Cauchy's method of characteristics. We shall assume throughout this section that this model is embedded in a member of the exponential family so that the log-likelihood is given by where the natural parameters z l~zl ½(h i ) p i~1 ,z l are functions of the model parameters (h i ) p i~1 and some auxiliary data (z l ) n l~1 , but that the scaling parameter w is not. We shall assume that the m l~b ,y l is the cancer hazard function, and that the (z l ) n l~1 are all non-zero. This is generally the case, in particular when cohort data are analysed using Poisson regression models, e.g., as in Little and Wright [11] or Little and Li [16]. By the remarks following Corollary 2 of Little et al. [5], proving weak local identifiability of a subset of cardinality k of the biological parameters (h i ) p i~1 is equivalent to showing that for this subset of parameters The model of Little et al. [12], generalizing that of Little and Wright [11], which in turn generalizes the model of Little [14], assumes that cells can acquire up to k successive cancer-stage mutations, and any of r (mutually exclusive) types of destabilization mutation(s). Cells become malignant when k cancer-stage mutations have occurred, no matter how many destabilizing mutations there have been. Once a cell has acquired a destabilizing mutation of type d (1ƒdƒr), it and its daughter cells can acquire up to m d {1 further destabilizing mutations of the same type. We define r to be the multiplicity of destabilization mutation types. It is to be expected that the more destabilizing mutations cells acquire of each type, the higher the cancer stage mutation rate is, but this is not intrinsic to the model. We write  Table 1 lists the biological parameters that are used in the model, and their multiplicity.
Cells at different stages of the process are labelled by I (a,b,d) , where the first subscript, a, represents the number of cancer stage mutations that the cell has accumulated, the second subscript, b, represents the number of destabilizing mutations acquired, their type being given by the third subscript, d. At all stages other than I (0,0,0) , cells are allowed to divide symmetrically or differentiate (or undergo apoptosis) at rates G(a,b,d) and D(a,b,d), respectively.
Each cell can divide into an equivalent daughter cell and another cell with an extra cancer stage mutation at rate M(a,b,d).
Likewise, cells can also divide into an equivalent daughter cell and another cell with an additional destabilizing mutation of type d at rate A(a,b,d). The model assumes that there are X (t) susceptible stem cells at age t. Further details on derivation of the hazard function are given in the paper of Little et al. [12].

Results
In Text S1 Section B we derive the hazard function and show that it can be written in terms of certain combinations of the biological parameters given in Table 1. From equations (B12)-(B16) in Text S1 Section B it is seen that the characteristics and y are governed by certain parameter combinations. Table 2 summarizes the maximum number of identifiable parameter combinations and their forms associated with each cell compartment. The maximum number of identifiable parameters associated with each destabilization zone, I a,b,d , are 4 when avk{1 and 0vbvm d ; 4 when a~k{1 and 0vbvm d ; 3 when avk{1 and b~m d and 2 when a~k{1 and b~m d . The function y is governed by at most rz1 parameter combinations. Therefore, we have shown that the hazard function h(h) can be written as h(G 1 (h),G 2 (h),:::,G N (h)) for some scalar functions G 1 (:),G 2 (:),:: Table 2). Assuming that the cancer model is embedded in a member of the exponential family (in the sense outlined in Text S1 Section C) the same will be true of the total log-likelihood L(xjh)~L(xjG 1 (h),G 2 (h),:::,G N (h)). By means of the Chain Rule we obtain LL(xjG 1 ,:::,G N ) LG l L 2 G l Lh i Lh j , so that the Fisher information matrix is given by LG l Lh i E L 2 L(xjG 1 ,:::,G N ) which therefore has rank at most N. A similar argument shows that if one were to reparameterise (via some invertible able, indicating a minimum of (rz1) parameter redundancies in the model. Also, from the results obtained by Little et al. [5] (Corollary 2 (ii) and the subsequent Remark (ii)), subject to some regulatory conditions, the number of locally identifiable or (gradient) weakly locally identifiable parameter combinations is equal to the rank of the Fisher    0,0)), and a single X , giving a total of five biological parameters. It is known from the results of Heidenreich et al. [8,9] that for the two-mutation model only three combinations of these are estimable, i.e., that there are two redundancies, precisely in agreement with the result given here for r~1. This result therefore precisely generalizes the results and approach of Heidenreich et al. [8,9]. Unfortunately, analytical methods for proving that precisely this number of parameters are estimable, including some recently outlined [17], cannot be used for the model considered here. Nevertheless, we conjecture that in fact precisely this number of parameters are estimable, so that the upper bound on the number of estimable parameter combinations that we have proved above is in fact sharp. This is supported by numerical evaluation of the Hessian in a couple of example cases, which we now outline.

Numerical Evaluation of Hessian and Determination of Its Rank
That there are likely to be exactly this number of estimable parameters is supported by numerical evaluation of the Hessian matrix of the hazard function. We make use of the solution of the system of ordinary differential equations defining the Hessian, outlined in Text S1 Section D. We will show in two cases that the Hessian has rank two less than the number of biological parameters, w. By the abovementioned results of Catchpole and Morgan [3] and Little et al [5] this suggests that precisely w{2 parameters are (gradient) weakly locally identifiable. In order to show that the Hessians are of rank two less than the number of biological parameters, w, we evaluate the eigenvalues of the Hessian matrix, and establish that the smallest eigenvalue among the w{2 largest eigenvalues in absolute value exceeds the likely magnitude of the error by at least an order of magnitude. We know the likely size of the error in numerical evaluations of each element, h ij , of the Hessian from the Boerlisch-Stoer integrator that is employed, namely max (10 {10 ,10 {10: jh ij j : 1ƒi,jƒw) (bsstep routine, Press et al. [18], p.722). It is known that if two symmetric matrices H andH H have eigenvalues l 1 ƒl 2 ƒ:::l w{1 ƒl w and l l 1 ƒl l 2 ƒ:::l l w{1 ƒl l w then jl i {l l i jƒjjH{H Hjj 2 , 1ƒiƒw, where jjHjj 2~s up½jjHxjj 2 =jjxjj 2 : x=0 [19](p.396). Since the approximate Hessian that we calculate,H H, differs from the true Hessian, H, by an amount jjH{H Hjj 2 ƒ ffiffiffi ffi w p : max½jh ij {h h ij j : 1ƒi, jƒw, we know that: There is also the issue of numerical roundoff error in the QR algorithm where c(w) is a modestly increasing function of the dimension, w, of the approximate HessianH H and e is the machine precision [19](Chapter 8).
Since the machine precision (in double precision) is of the order 10 {15 this expression (3) will be dominated by the error associated with the approximation to the Hessian, given by expression (2).
r destabilization zones (0ƒaƒk{1, 1ƒbƒm d , 1ƒdƒrÞ We evaluated the Hessian matrix for a model with three cancerstage mutations and one destabilizing mutation, and a model with two cancer-stage mutations and one destabilizing mutation; lognormal perturbations of all parameters were performed, assuming a geometric standard deviation (GSD) of 4, centred on models with cancer-stage mutation rates of 4.0610 23  For each of 1000 random sets of parameters we evaluated the Hessian by numerical integration, as outlined in Text S1 Section D. We calculated the eigenvalues of the Hessian using the QR algorithm, specifically the NAG FORTRAN subroutine F02FAF [20]. For each model we selected the set of random parameters for which the ratio of minimum to maximum among the w{2 largest eigenvalues (w being the number of biological parameters) in absolute value was greatest. These are given in Tables 3 and 4, for the three-stage and two-stage models, respectively. The associated eigenvalues are given in Table 5. The absolute value of the w{2th smallest eigenvalue associated with each set exceeds the error bound (2) by at least an order of magnitude in each case. This strongly suggests that the Hessians calculated for these two examples really are of rank w{2 for each model.

Discussion
We have shown that in the class of stochastic cancer models incorporating genomic instability developed by Little and Wright [11] the number of identifiable combinations of parameters is at most two less than the number of biological parameters, thereby generalizing previous results of Heidenreich et al. [8,9] and Hanin et al. [21,22] for the two-mutation model, a special case of this model. For the more general genomic-instability cancer model of Little et al. [12] the number of identifiable combinations of parameters is at most rz1 less than the number of biological parameters, where r is the number of destabilization types, thereby   Table 5. Eigenvalues in ascending order of Hessian matrix associated with a model with three cancer stage mutations and one destabilizing mutation (as in Table 3), and with a model with two cancer stage mutations and one destabilizing mutation (as in Table 4).
Number Eigenvalues (Table 3) Eigenvalues ( also generalizing all these results. Numerical evaluations in two special cases (with r~1) suggest that this bound is tight: a combination of parameters with cardinality two less than the number of biological parameters is of full rank, and so is not redundant. A weakness of the paper is that one cannot be absolutely sure (because of the uncertainty implicit in any numerical evaluation) that the bound demonstrated by the mathematics of section 3 and Text S1 Section B is sharp. Nevertheless, we have clearly established a maximum number of identifiable parameter combinations. We have also specified particular combinations of identifiable parameters, and these should be used in model fitting to avoid obvious numerical problems, of lack of convergence and absence of a unique set of parameters maximizing the likelihood.
These results have obvious implications for the large number of other quasi-biological cancer models that are special cases of these models, in particular those of Armitage and Doll [13], the twomutation model [10], the generalized multistage model of Little [14], and a recently developed cancer model of Nowak et al. [15] that incorporates genomic instability. It should be noted that the results given here are for the fully stochastic solution of the model, and would not be applicable, for example, to the deterministic approximation of the multistage model of Armitage and Doll [13] that is often employed in applications.
Our results imply that for the general class of cancer models considered here, only certain specific parameter combinations should be estimated in principle, and this is the case whatever the size of the dataset being considered. Whether for complex models for even this theoretically available number of parameters there is useful information is of course uncertain, and may well depend on the particular dataset and on the likely size of the parameters to be estimated. However, fits to a large population-based registry of colon cancer, as recently analysed by Little and Li [16], suggests that, for example, the model with two cancer-stage and one destabilizing mutations can be fitted to the dataset and yields stable parameter estimates for certain combinations of 11 parameters, in accordance with the results of this paper.