Skip to main content
Advertisement
  • Loading metrics

Parameter estimation for multistage clonal expansion models from cancer incidence data: A practical identifiability analysis

Correction

17 Jul 2017: Brouwer AF, Meza R, Eisenberg MC (2017) Correction: Parameter estimation for multistage clonal expansion models from cancer incidence data: A practical identifiability analysis. PLOS Computational Biology 13(7): e1005660. https://doi.org/10.1371/journal.pcbi.1005660 View correction

Abstract

Many cancers are understood to be the product of multiple somatic mutations or other rate-limiting events. Multistage clonal expansion (MSCE) models are a class of continuous-time Markov chain models that capture the multi-hit initiation–promotion–malignant-conversion hypothesis of carcinogenesis. These models have been used broadly to investigate the epidemiology of many cancers, assess the impact of carcinogen exposures on cancer risk, and evaluate the potential impact of cancer prevention and control strategies on cancer rates. Structural identifiability (the analysis of the maximum parametric information available for a model given perfectly measured data) of certain MSCE models has been previously investigated. However, structural identifiability is a theoretical property and does not address the limitations of real data. In this study, we use pancreatic cancer as a case study to examine the practical identifiability of the two-, three-, and four-stage clonal expansion models given age-specific cancer incidence data using a numerical profile-likelihood approach. We demonstrate that, in the case of the three- and four-stage models, several parameters that are theoretically structurally identifiable, are, in practice, unidentifiable. This result means that key parameters such as the intermediate cell mutation rates are not individually identifiable from the data and that estimation of those parameters, even if structurally identifiable, will not be stable. We also show that products of these practically unidentifiable parameters are practically identifiable, and, based on this, we propose new reparameterizations of the model hazards that resolve the parameter estimation problems. Our results highlight the importance of identifiability to the interpretation of model parameter estimates.

Author summary

Parameter estimation from data is an important part of mathematical modeling, and structural identifiability is the study of what parametric information exists, for a given model, in ideal data. Unfortunately, for a variety of reasons, there is often less information available in our real data sets. The study of these problems is called practical identifiability. In this study, we consider a family of models of cancer biology that are commonly used to explain cancer incidence in terms of underlying biological parameters. Using profile likelihoods, a widely applicable numerical tool, we demonstrate that even though the more complex models we consider have theoretically more identifiable parameters, the data contains only three pieces of practically identifiable information for each model: the product of the initiating mutation rates, the net cell proliferation rate, and the scaled malignant conversion rate. This result can be interpreted biologically: we can determine only the product of cell mutation rates not the intermediate rates themselves. Our result limits the interpretability of previous work, but we propose a novel parameterization to resolve the identifiability issue. Ultimately, our analysis demonstrates the importance of verifying the practical identifiability of parameters before assigning too much weight to the interpretation of their estimated values.

Introduction

Parameter estimation is an important aspect of computational modeling in the life sciences because parameter estimates can shed light on underlying biological mechanisms and processes and provide a way to link dynamic models to real-world data. However, the dynamics of many living systems have evolved to be robust to changes in underlying parameters, which necessitates an understanding of which parameters or combinations of parameters can even be estimated from data, known as identifiability. Here, we leverage computational identifiability tools to determine what cancer incidence data can tell us about the biology of carcinogenesis.

Cancers arise from the accumulation of genetic (and epigenetic) abnormalities and mutations. Although a single change is thought to be sufficient for certain cancers (certain leukemias, lymphomas, and sarcomas in particular), many cancers are thought to require two or more hits [1]. For example, retinoblastoma is a two-hit cancer—indeed, a two-hit model of retinoblastoma predicted the existence of the tumor suppressing gene pRb before it was discovered [2]—and colorectal cancer can be described by three or more hits to the APC, RAS, and P53 genes [1]. Similarly to the development of precancerous polyps for colorectal cancer, many esophageal cancers begin with a transition to a condition called Barrett’s esophagous [3] before accumulating additional abnormalities. These genetic (or epigenetic) hits are often described as starting different phases of carcinogenesis: initiation, the first destabilizing mutation(s); promotion, the unchecked growth of a tumor; and malignant conversion, the spread into other tissues. This classification is useful because different exposures may act on different stages of carcinogenesis.

Multistage clonal expansion (MSCE) models are a class of continuous-time Markov chain models that capture this initiation–promotion–malignant-conversion hypothesis of carcinogenesis. Originally posed as a two-stage model [4, 5] using birth–death–mutation branching-process theory, this class of models has been expanded to three or more stages, multiple pathways, and other variations. These models have been successfully used to analyze epidemiological population-level cancer incidence data [611], to assess the impact of time-varying exposures on cancer risk using individual-level data [1216], and to project the impact of prevention and control strategies on population cancer rates [10, 1719]. Although models that use multiple clonal expansion steps have been considered, models with multiple initiation stages but only a single, final clonal expansion stage are more common in the literature and appear to capture the incidence patterns of many cancers (e.g. [68, 20]. We are concerned here with parameter estimation for MSCE models because it can lead to better understanding of the rates of biological processes like tumor growth or adverse mutations. Indeed, knowing the approximate speed at which an abnormality arises may help to classify the underlying abnormal event (e.g. single nucleotide mutation, chromosomal translocation, or epigenetic change).

Identifiability is the study of the parametric information available in a data set when viewed through the lens of a model, and identifiability analysis is an important precursor to accurate parameter estimation. A model is said to be identifiable if all model parameters may be uniquely determined from observed data [2123]. There are two kinds of identifiability analyses: structural—which analyzes the model in the context of perfectly measured and noise-free data in order to uncover the inherent limitations of the model structure in the context of parameter estimation—and practical—which considers obstacles to parameter estimation that arise from noise, sampling frequency, bias, and other issues in real-world data sets [24]. Identifiability analysis can identify parameter combinations that embody the parametric information available in the data and lead to useful reparameterizations of the model [23].

That MSCE models are not fully identifiable is well established [6, 2527]. In particular, finding the closed-form solution of a model’s hazard function—the model output corresponding to age-specific incidence data—gives an upper bound on the number of identifiable parameter combinations available for that model from the age-specific incidence data and constrains the forms of those combinations. We previously computed the exact structural identifiability for the class of MSCE models with constant parameters and one clonal-expansion step [28]. However, this is not the last word on the identifiability of MSCE models. In particular, it is known that there is a practical identifiability problem with the clonal expansion models with three or more stages: the information contained in the asymptote of the corresponding hazard function is not available in the usual age-specific cancer incidence data because the asymptote is not reached within human lifespans [8].

In this analysis, we examine this practical identifiability problem with a profile-likelihood approach. We consider pancreatic cancer, which has linear age-specific incidence at older ages [8] and can be fit by an MSCE model with two or more stages. We demonstrate that the two-, three-, and four-stage models have only three practically identifiable parameter combinations and that, for the three- and four-stage models, several parameters that are theoretically structurally identifiable individually, are, practically, identifiable only in their product. This practical unidentifiability means the incidence data contains information about the overall rate of progression from normal to cancer-initiated cells but not the expected information on the rates of the individual steps leading to initiation.

Methods

Multistage clonal expansion models

The mathematics of multistage clonal expansion models have been detailed elsewhere [4, 5, 8, 25, 2936], so we only give a basic description here. The n-stage clonal expansion model (Fig 1) is a continuous-time Markov chain with the following states: X(t), the number of normal cells at age t; Y1(t), …, Yn−2, the number of cells in subsequent preintiation states; Yn−1(t), the number of initiated cells; and Z(t), the number of malignant cells. Let ν be the initial mutation rate, μ1, …, μn−3 the following preinitiation mutation rates, μn−2 the initiation mutation rate, μn−1 the malignant transformation rate, α the clonal expansion rate, and β the cell death rate. If the parameters and X(t) are constant, then we may denote (1) and write hazard functions [6, 8] of the two-, three-, and four-stage models (a derivation is provided in S1 Text): (2) (3) (4)

thumbnail
Fig 1. Schematic of a general multistage clonal expansion model.

Multistage clonal expansion models are continuous-time Markov chain models in which normal cells undergo a series of genetic changes that lead to a state of clonal expansion followed by progression to malignancy.

https://doi.org/10.1371/journal.pcbi.1005431.g001

From the hazard functions, we can see that the two-, three-, and four-stage models have at most three (νX/α, p2, q2), four (νX, μ1/α, p3, q3), and five (νX, μ1, μ2/α, p4, q4) structurally identifiable parameter combinations. In this case, these parameter combinations are structurally identifiable [28].

Multistage clonal expansion model hazards share similar characteristics, including an exponential region, a linear region, and an asymptote (Fig 2). The transition from the linear phase to the asymptote occurs on different time scales for the different models, and, for biologically reasonable ranges of the parameters, only h2 can achieve this asymptote within human lifespans. The other hazards achieve their asymptotes on the order of 1,000 to 100,000 years, depending on the parameters. For example, the asymptote of the three-stage model occurs on the order of (μ1(1 − β/α))−1 [8], and mutation rate estimates are typically on the order of 10−7–10−5 [48] (note that 0 < (1 − β/α) < 1, so that this term can only exacerbate the time span). Thus, because real data cannot access the information contained in the asymptote and other late appearing features, one may expect inherent practical identification issues for MSCE models with more stages.

thumbnail
Fig 2. Salient features of a multistage clonal expansion model hazard.

Multistage clonal expansion models have exponential and linear phases that may be observed on the time scale of a human lifespan. Depending on the number of stages in the model, the asymptote may or may not occur within a human lifespan.

https://doi.org/10.1371/journal.pcbi.1005431.g002

Data

We consider cancers reported to the Surveillance, Epidemiology, and End Results (SEER) cancer registries, using SEER 9 data 1973–2012 (data available in S1 and S2 Data). We use the International Classification of Diseases (ICD-10) codes to identify incidence of pancreatic cancer (C25).

Identifiability framework

More thorough treatments of identifiability of dynamical systems are presented elsewhere [23, 24, 37, 38], and we previously described a framework to apply dynamical systems identifiability techniques to stochastic time-to-event models, including multistage clonal expansion models [28]. Nevertheless, we provide the basic identifiability framework and definitions here for reference.

Consider a vector of states x(t) (unobserved), vector of parameters to be estimated ρ, and observed (known) input u(t) and output v(t) in the dynamical systems model, (5)

Definition 1 Parameter ρi in the model given in Eq (5) is (globally) structurally identifiable if, for almost all values and initial conditions, the observation of an output trajectory (v(t) = v*(t)) uniquely identifies ρi (), i.e. if only one value of ρi could have resulted in the observed output.

Definition 2 The model given in Eq (5) is (globally) structurally identifiable if each ρi is structurally identifiable.

The definition of structural identifiability concerns perfectly measured input and output. However, because real data may not capture all of the parameteric information available in a theoretic trajectory, parameters that are structurally identifiable in a model for a kind of theoretical data may be practically unidentifiable given a corresponding real-world dataset. Practical non-identifiability can arise from poor data quality (uncertainty, infrequent sampling, etc), but it can also be inherent to the type of data measured. For example, the saturation constant of a Michaelis-Menten equation may not be identifiable from low-dose data [39], and the amplitude of a circadian rhythm will not be identifiable if a value is measured once a day at the same time [40]. Thus, even if there are a large number of data points (e.g. as is often the case for cancer registry data), practical identifiability may still be an issue. It is this kind of inherent limitation of the data that we explore for the multistage clonal expansion models.

Practical identifiability is difficult to define in a rigorous way without choosing a threshold (e.g. width of a confidence interval) and thus has a “I know it when I see it” quality. Nevertheless, descriptions of practical identifiability are possible and typically consider the confidence bounds for the estimated parameters, found by Fisher Information Matrix (FIM) [22, 23, 41, 42] or likelihood-based methods [24]. In this analysis, we use likelihood-based confidence intervals, which are defined as follows. Let be the likelihood for the model given the data set as a function of the parameters ρ, and let the maximum-likelihood estimator.

Definition 3 Let denote the maximum likelihood when the ith parameter is fixed to value ρi, and call it the profile likelihood of ρi. Then, the likelihood-based confidence interval for ρi at level of significance α is the set of values of ρi for which the relative negative log-likelihood at ρi is less than a threshold determined by α, that is, (6) where (7) is the chi-squared distribution with a number of degrees of freedom (df) equal to the number of parameters (for simultaneous confidence intervals) or equal to 1 (for pointwise confidence intervals). [24, 43].

We would like to say that parameter ρi in the model given in Eq (5) is practically identifiable if the likelihood-based confidence interval for ρi has finite length. However, this definition is neither well-defined (the confidence interval may be finite for one level of significance but infinite at another) nor practically verifiable. Ultimately, parameters with confidence intervals that are sufficiently large—typically orders of magnitude—as to cause uncertainty and parameter estimation problems at the desired parameter scale and level of significance can be said to be practically unidentifiable.

Computation methods

We use profile likelihood [24] and subset profiling [42] methods to investigate the practical identifiability of the two-, three-, and four-stage models. We assume that cancer incidence is Poisson distributed (details in S1 Text). Profile likelihoods were computed by fixing the value of one parameter at each of a series of values within an interval and numerically optimizing the negative log-likelihood as a function of the remaining parameters. Numerical optimization was done in R (v.3.0.1) using the Bhat package [44].

Results

We plot incidence rates of pancreatic cancer reported to SEER 9 (1973–2012) in men by decade (Fig 3). The data exhibit the classic pattern of linear incidence at older ages. There are no apparent temporal trends, so we fit the two-, three-, and four-stage clonal expansion model hazards to the entire data set by minimizing the negative log-likelihood. The Akaike Information Criterion (AIC) for each model (relative to the best-fitting model) is 177.7, 72.3, and 0, respectively, which preferences the four-stage model.

thumbnail
Fig 3. Pancreatic cancer incidence and best-fit MSCE models.

Incidence of pancreatic cancer in men per 100,000 (SEER 9, 1973–2012), and best-fit two-, three-, and four-stage clonal expansion model hazards. Note that the three model hazards largely overlap.

https://doi.org/10.1371/journal.pcbi.1005431.g003

Two-stage model

We profile the relative negative log-likelihood of the maximum-likelihood two-stage hazard as a function of each of the parameter combinations p3, q3, and νX/α (Fig 4). All three parameters combinations are practically identifiable because of the trough-shape of the negative log-likelihood, giving finite confidence intervals. The parameter estimates are given in Table 1.

thumbnail
Fig 4. Two-stage model profile likelihoods.

Profiles of the relative negative log-likelihood (NLL) of the two-stage clonal expansion model as each of the parameter combinations p2, q2, and νX/α are varied while the remaining parameters are fit. The gray dotted line gives the α = 0.01 threshold for simultaneous confidence intervals based on the relative negative log-likelihood. All three parameters are identifiable.

https://doi.org/10.1371/journal.pcbi.1005431.g004

thumbnail
Table 1. Best-fit parameters and likelihood-based 99% confidence intervals.

Best-fit parameters and likelihood-based 99% confidence intervals for the fits of the two-, three-, and four-stage clonal expansion models (with parameterizations using only practically identifiable parameter combinations and given in Eqs (2), (9) and (11) respectively) to age-specific incidence of pancreatic cancer.

https://doi.org/10.1371/journal.pcbi.1005431.t001

Three-stage model

We profile the relative negative log-likelihood of the maximum-likelihood three-stage hazard as a function of each of the parameter combinations p3, q3, νX, and μ1/α (Fig 5). Parameter combinations p3 and q3 are practically identifiable as above, but parameter combinations νX and μ1/α are not practically identifiable because their likelihoods flatten out, resulting in infinite confidence intervals.

thumbnail
Fig 5. Three-stage model profile likelihoods.

Profiles of the relative negative log-likelihood (NLL) of the three-stage clonal expansion model as each of parameter combinations p3, q3, νX, and μ1/α are varied while the remaining parameters are fit. The gray dotted line gives the α = 0.01 threshold for simultaneous confidence intervals based on the relative negative log-likelihood. Parameter combinations p3 and p4 are identifiable, while νX and μ1/α are practically unidentifiable.

https://doi.org/10.1371/journal.pcbi.1005431.g005

To identify the form of the practically-identifiable parameter combination of νX and μ1/α, we plot the fitted value of μ1/α as we vary the value of νX (Fig 6). Because the relationship is linear on the log–log scale, νX and μ1/α exist in a practically identifiable product. From the biological perspective, this means that we can only know the net rate of transition from normal to initiated cells but not the rates of the individual intermediate steps.

thumbnail
Fig 6. Three-stage model parameter dependency.

Fitted value of μ1/α as νX is varied for the three-stage clonal expansion model. The linear relationship on a log–log scale indicates an identifiable product.

https://doi.org/10.1371/journal.pcbi.1005431.g006

Our analysis thus demonstrates that there are three parameter combinations that are practically identifiable for the three-stage model from age-specific cancer-incidence data. Since there are three pieces of information in the data and four degrees of freedom in the full model (Eq 3), one might assume that one additional constraint on the model is sufficient to reduce the number of parameters estimated to three and simultaneously resolve the non-identifiability problem. However, the most reasonable simplifying assumption, namely that the first two mutation rates are the same (ν = μ1), such as for biallelic gene inactivation [1], does not do this; the three-stage model with ν = μ1 still has four structurally identifiable parameter combinations, namely p3, q3, νX, and ν/α, but only three pieces of practically identifiable information, so another constraint would be needed for a fully identifiable model. In this case, the constraint would need to designate the relative values of νX and μ1/α, which assuming ν = μ1 does not do. The ν = μ1 assumption does, however, suggest a new reparameterization of Eq 3. Denote (8) and fix X and α at reasonable values, i.e. at values where the likelihood profiles are flat (see Fig 5). Then, assuming ν = μ1, we parameterize and , and write (9)

As long as X and α are chosen so that νX and μ1/α are within a the range of values for which the likelihood is flat, their exact values do not affect the model fit and can be fixed. Caution is advisable here, however: although the exact values of these parameters do not affect the fit in this context, it is important to not take these values into other contexts where the exact values may be relevant, e.g. prediction in context of time-varying exposures. Nevertheless, this parameterization has several advantages. In particular, multiplicative effects on r3, such as relative period or cohort effects, can be thought of as affecting both ν and μ1 equally: under the assumption μν = μ1, r3 simplifies to , and, more generally, we can write, for some scalar ξ, .

We see that the profile relative NLL of has a finite confidence interval (Fig 7), as p3 and q3 did in Fig 5. The best-fit parameters for the three-stage model—parameterized as in Eq (9) and fit to the age-specific pancreatic cancer incidence data—are given in Table 1.

thumbnail
Fig 7. Profile likelihood for the reparameterized combination of the three-stage model.

Profile of the relative negative log-likelihood (NLL) as the parameter is varied while the remaining parameters are fit in the three-stage clonal expansion model. The gray dotted line gives the α = 0.01 threshold for simultaneous confidence intervals based on the relative negative log-likelihood. Parameter combination r3 is identifiable.

https://doi.org/10.1371/journal.pcbi.1005431.g007

Four-stage model

We similarly profile the relative negative log-likelihood of the maximum-likelihood four-stage hazard as a function of each of the parameter combinations p4, q4, νX, μ1, and μ2/α (Fig 8). As before, parameters combinations p4 and q4 have finite confidence intervals and are practically identifiable, while combinations νX, μ1 and μ2/α have infinite confidence intervals and are not practically identifiable.

thumbnail
Fig 8. Four-stage model profile likelihoods.

Profiles of the relative negative log-likelihood (NLL) of the four-stage clonal expansion model as each of parameter combinations p4, q4, νX, μ1, and μ2/α are varied while the remaining parameters are fit. The gray dotted line gives the α = 0.01 threshold for simultaneous confidence intervals based on the relative negative log-likelihood. Parameter combinations p4 and q4 are identifiable, while νX, μ1, and μ2/α are practically unidentifiable.

https://doi.org/10.1371/journal.pcbi.1005431.g008

To determine the combination structure, we use subset profiling [42]. However, rather than using FIM to determine the profiled parameter subsets, we note that the analysis of the three stage model leads us to suspect that the three parameter combinations νX, μ1, and μ2/α are in a practical product. We use this structure to propose our nearly-full rank subsets. To verify this proposal, we plot the fitted value of one parameter combination while another is fixed and the third is varied (Fig 9). The three selected plots presented are sufficient to verify that the three parameter combinations indeed exist in a practically identifiable product. As for the three-stage case, that νX, μ1, and μ2/α can only be identified up to their product means that we can only know the net rate of transition from normal to initiated cells but not the rates of the individual intermediate steps.

thumbnail
Fig 9. Four-stage model parameter dependencies.

Fitted values of one of νX, μ1, or μ2/α as another is fixed and the third is varied for the four-stage model. The linear relationships on a log–log scale indicate that νXμ1μ2/α is an identifiable product.

https://doi.org/10.1371/journal.pcbi.1005431.g009

We can define a quantity analogous to r3 in the three stage case. Here, (10) and, for some reasonable fixed values of X and α, (11)

We see that the profile relative NLL of r4 = (νXμ1μ2/α)1/3 has the expected trough shape (Fig 10), as seen in Fig 8 for p4 and q4. The best-fit parameters for the four-stage model—parameterized as in Eq (11) and fit to the age-specific pancreatic cancer incidence data—are given in Table 1.

thumbnail
Fig 10. Profile likelihood for the reparameterized combination of the four-stage model.

Profile of the relative negative log-likelihood (NLL) as the parameter r4 = (νXμ1μ2/α)1/3 is varied while the remaining parameters are fit in the four stage clonal expansion model. The gray dotted line gives the α = 0.01 threshold for simultaneous confidence intervals based on the relative negative log-likelihood. Parameter combination r4 is identifiable.

https://doi.org/10.1371/journal.pcbi.1005431.g010

Discussion

Practical unidentifiability is a significant barrier to parameter estimation. Indeed, because it—unlike structural identifiability—can be so dependent on the quality of the data, it can be a moving target. From this perspective, ironically, it is perhaps fortunate that the practical identifiability issue described herein is inherent to any age-specific cancer incidence data that is linear at older ages. This way, such problems can be anticipated and handled systematically, e.g. by reparameterizing the model appropriately. In theory, we could gain additional information if people were to live long enough to see the incidence plateau, but, as previously discussed, the expected timing of the plateau in the three- and four-stage clonal expansion models is well beyond conceivable human life spans. While the observation of a plateau might suggest that either the underlying mechanism is the two-stage model or the presence of heterogeneities or temporal effects, the absence of a plateau leaves room for various interpretations. Indeed, the two-, three-, and four-stage models were all able to reasonably fit the pancreatic cancer incidence data (Fig 3).

For each of the two-, three-, and four-stage models, only three parameter combinations were practically identifiable. In each case, these combinations are most easily interpreted in the following forms: (12) the net cell proliferation rate of initiated cells, (13) the scaled malignant conversion rate, and (14) the product of all preinitation rates scaled by the number of normal cells and the cell growth rate. Note that the first two combinations are together equivalent to pn and qn. Because the last combination is a product of individually structurally identifiable combinations, we know that information about mutation rates at the intermediate steps is only available in later features of the MSCE hazards, i.e. the asymptote and the transition from the linear phase to the asymptote.

Because there are only three practically identifiable combinations, successful parameter estimation can only be achieved if the models are reparameterized in terms of these combinations. For example, with the three-stage model parameterized as in Eq (3), parameter estimates for νX and μ1/α are not stable. Here, we proposed one possible solution with the reparameterizations in Eqs (9) and (11) and show that it does indeed resolve the practical unidentifiabilities, though an infinite number of reparameterizations will give equivalent fits as long as the parameter combinations are preserved. Each reparameterization represents a different assumption about the relative sizes of its constituent parameter combinations. Our reparameterizations are inspired by the assumption that the preinitation mutation rates are equal (ν = μ1 = …) but do not actually codify this assumption in the models. Nevertheless, it is consistent with a scenario in which multiple copies of a tumor suppressor gene must be “knocked out” [1].

Traditional approaches to parameter estimation that use asymptotic confidence intervals do not always reveal practical identifiability issues. Because asymptotic confidence intervals are based on the local curvature of the likelihood around the parameter estimate, they may give finite confidence bounds when the likelihood is curved on one side of the estimate but flat on the other. Numerical optimization algorithms may provide results that give the appearance of practical identifiability but have in fact simply pushed the estimate to the point where the likelihood begins to curve. Hence, the fact that our group and others have previously reported values of μ2/α with finite confidence intervals in four-stage models [6, 8] is not inconsistent with our results. Some of these previous works have interpreted the larger-than-expected values for μ2 (fixing α) as being too fast to represent a genetic mutation, suggesting that the four-stage model may represent two, slow genetic mutations followed by a fast epigenetic change, a transient event, or other transformation. Our results suggest that a large range of values μ2/α would have resulted in equivalent fits, and we note that the values presented in these previous works are of the same order of magnitude where we see curvature in our likelihood function. In particular, a previous fit of pancreatic cancer incidence in SEER (1973–2004) using the four-stage model [8] estimated νXμ1μ2/α to be 1.88E-5—the same value that we find here with the new parametrization (for pancreatic cancer in SEER 1973–2012; Table 1)—but also separately estimated μ2/α to be 4.0E-1, which falls exactly where the profile likelihood begins to curve up (Fig 8). Hence, such parameter estimates may be an artifact of the algorithm numerically optimizing the likelihood, and one should then be careful when giving a biological interpretation to those results.

This analysis also speaks to the question of model selection and model reduction. Although the four-stage model gives the best statistical fit to the data in Fig 3, its hazard nearly entirely overlaps with the that of the other models. Hence, we must question whether or not the larger model is actually capturing some nuance in the data. Given the practical identifiability issues we have presented, does the two-stage already capture all of the information? Possibly so. Are the results of both models equivalent? Unfortunately not: although each model is estimating the same biological parameters (i.e. the product of initiation rates, the final promotion rate, and the malignant conversion rate), a perusal of Table 1 reveals that the parameter estimates are not particularly consistent across the three models (although are generally within an order of magnitude). Moreover, the different dynamics of each model will become important as we move away from simply analyzing incidence and consider prediction or individual time-varying exposures. Nevertheless, in this situation, one might be inclined to take an ensemble approach and to consider uncertainty quantification not only within a model but across the models, perhaps weighting in some way by statistical fit. Additional empirical science, by better elucidating carcinogenesis mechanisms common to cancer at given site, could aid modelers in model selection.

The guidance we have presented in this study is important as three- and four-stage clonal expansion models are commonly used to model certain cancers at the population level, and successful parameter estimation is dependent on the model being identifiable with respect to the available data. Ultimately, our analysis demonstrates the need for future studies to verify the practical identifiability of model parameters whenever feasible, which should strengthen the validity of the analyses and aid in the interpretation of estimated parameter values and modeling results.

Supporting information

S1 Data. Incidence.

Cases of pancreatic cancer in men reported to SEER 9, 1973–2012.

https://doi.org/10.1371/journal.pcbi.1005431.s001

(CSV)

S2 Data. Population.

Population of men in SEER 9 catchment, 1973–2012.

https://doi.org/10.1371/journal.pcbi.1005431.s002

(CSV)

S1 Text. Mathematical and statistical details.

Derivation of multistage clonal expansion model hazards and statistical formulation of the likelihood.

https://doi.org/10.1371/journal.pcbi.1005431.s003

(PDF)

Author Contributions

  1. Conceptualization: AFB MCE RM.
  2. Funding acquisition: MCE RM.
  3. Investigation: AFB.
  4. Methodology: AFB MCE RM.
  5. Software: AFB.
  6. Validation: AFB MCE RM.
  7. Writing – original draft: AFB.
  8. Writing – review & editing: AFB MCE RM.

References

  1. 1. Knudson AG. Two genetic hits (more or less) to cancer. Nature reviews Cancer. 2001;1(2):157–162. pmid:11905807
  2. 2. Knudson AG. Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences. 1971;68(4):820–3.
  3. 3. Wild C, Hardie L. Reflux, Barrett’s oesophagus and adenocarcinoma: burning questions Nature Reviews. Cancer. 2003;3(9):676–84. pmid:12951586
  4. 4. Moolgavkar SH, Venzon DJ. Two-event models for carcinogenesis: incidence curves for childhood and adult tumors. Mathematical Biosciences. 1979;47(1–2):55–77.
  5. 5. Moolgavkar SH, Knudson AG. Mutation and cancer: a model for human carcinogenesis. Journal of the National Cancer Institute. 1981;66(6):1037–52. pmid:6941039
  6. 6. Luebeck EG, Moolgavkar SH. Multistage carcinogenesis and the incidence of colorectal cancer. Proceedings of the National Academy of Sciences. 2002;99(23):15095–15100.
  7. 7. Jeon J, Luebeck EG, Moolgavkar SH. Age effects and temporal trends in adenocarcinoma of the esophagus and gastric cardia (United States). Cancer Causes & Control. 2006;17(7):971–81.
  8. 8. Meza R, Jeon J, Moolgavkar SH, Luebeck EG. Age-specific incidence of cancer: Phases, transitions, and biological implications. Proceedings of the National Academy of Sciences. 2008;105(42):16284–9.
  9. 9. Hazelton WD, Curtius K, Inadomi JM, Vaughan TL, Meza R, Rubenstein JH, et al. The role of gastroesophageal reflux and other factors during progression to esophageal adenocarcinoma. Cancer Epidemiology, Biomarkers & Prevention 2015;24(7):1–6.
  10. 10. Curtius K, Hazelton WD, Jeon J, Luebeck EG. A Multiscale Model Evaluates Screening for Neoplasia in Barrett’s Esophagus. PLOS Computational Biology. 2015;11(5):e1004272. pmid:26001209
  11. 11. Brouwer AF, Meza R, Eisenberg MC. Age Effects and Temporal Trends in HPV-Related and HPV-Unrelated Oral Cancer in the United States: A Multistage Carcinogenesis Modeling Analysis. PLOS One. 2016;11(3): e0151098. pmid:26963717
  12. 12. Luebeck EG, Heidenreich WF, Hazelton WD, Paretzke HG, Moolgavkar SH. Biologically based analysis of the data for the Colorado uranium miners cohort: age, dose and dose-rate effects. Radiation Research. 1999;152(4):339–51. pmid:10477911
  13. 13. Hazelton WD, Luebeck EG, Heidenreich WF, Moolgavkar SH. Analysis of a historical cohort of Chinese tin miners with arsenic, radon, cigarette smoke, and pipe smoke exposures using the biologically based two-stage clonal expansion model. Radiation Research. 2001;156(1):78–94. pmid:11418076
  14. 14. Meza R, Hazelton WD, Colditz GA, Moolgavkar SH. Analysis of lung cancer incidence in the Nurses’ Health and the Health Professionals’ Follow-Up Studies using a multistage carcinogenesis model. Cancer Causes & Control. 2008;19(3):317–28.
  15. 15. Schöllnberger H, Beerenwinkel N, Hoogenveen R, Vineis P. Cell selection as driving force in lung and colon carcinogenesis. Cancer Research. 2010;70(17):6797–803. pmid:20656803
  16. 16. Richardson DB. Multistage modeling of leukemia in benzene workers: a simple approach to fitting the 2-stage clonal expansion model. American Journal of Epidemiology. 2009;169(1):78–85. pmid:18996834
  17. 17. Moolgavkar SH, Holford TR, Levy DT, Kong CY, Foy M, Clarke L, et al. Impact of reduced tobacco smoking on lung cancer mortality in the united states during 1975–2000. Journal of the National Cancer Institute. 2012;104(7):541–548. pmid:22423009
  18. 18. de Koning HJ, Meza R, Plevritis SK, Ten Haaf K, Munshi VN, Jeon J, et al. Benefits and harms of computed tomography lung cancer screening strategies: A comparative modeling study for the U.S. Preventive services task force. Annals of Internal Medicine. 2014;160(5):311–320. pmid:24379002
  19. 19. Kong CY, Kroep S, Curtius K, Hazelton WD, Jeon J, Meza R, et al. Exploring the Recent Trend in Esophageal Adenocarcinoma Incidence and Mortality Using Comparative Simulation Modeling. Cancer Epidemiology, Biomarkers & Prevention. 2014;23(6):997–1006.
  20. 20. Meza R, Jeon J, Renehan AG, Luebeck EG. Colorectal cancer incidence trends in the United States and United Kingdom: evidence of right- to left-sided biological gradients with implications for screening. Cancer Research. 2010; 70(13):5419–29. pmid:20530677
  21. 21. Bellman R, Åström KJ. On structural identifiability. Mathematical Biosciences. 1970;7(3):329–339.
  22. 22. Rothenberg TJ. Identification in Parametric Models. Econometrica. 1971;39(3):577–591.
  23. 23. Cobelli C, DiStefano JJ. Parameter and structural identifiability concepts and ambiguities: a critical review and analysis. American Journal of Physiology. 1980;239(1):R7–R24. pmid:7396041
  24. 24. Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, et al. Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics. 2009;25(15):1923–1929. pmid:19505944
  25. 25. Heidenreich WF, Luebeck EG, Moolgavkar SH. Some properties of the hazard function of the two-mutation clonal expansion model. Risk Analysis. 1997;17(3):391–9. pmid:9232020
  26. 26. Cox LA, Huber WA. Symmetry, identifiability, and prediction uncertainties in multistage clonal expansion (MSCE) models of carcinogenesis. Risk Analysis. 2007;27(6):1441–1453. pmid:18093045
  27. 27. Little MP, Heidenreich WF, Li G. Parameter identifiability and redundancy in a general class of stochastic carcinogenesis models. PLOS One. 2009;4(12):1–6.
  28. 28. Brouwer AF, Meza R, Eisenberg MC. A Systematic Approach to Determining the Identifiability of Multistage Carcinogenesis Models. Risk Analysis. 2016.
  29. 29. Dewanji A, Venzon DJ, Moolgavkar SH. A stochastic two-stage model for cancer risk assessment. II. The number and size of premalignant clones. Risk analysis: an official publication of the Society for Risk Analysis. 1989;9(2):179–187.
  30. 30. Moolgavkar S, Luebeck G. Two-event model for carcinogenesis: Biological, mathematical, and statistical considerations. Risk Analysis. 1990;10(2):323–341. pmid:2195604
  31. 31. Tan WY. Stochastic Models of Carcinogenesis. New York: Marcel Dekker; 1991.
  32. 32. Little MP. Are two mutations sufficient to cause cancer? Some generalizations of the two-mutation model of carcinogenesis of Moolgavkar, Venzon, and Knudson, and of the Multistage Model of Armitage and Doll. Biometrics. 1995;51(4):1278–1291. pmid:8589222
  33. 33. Heidenreich WF. On the parameters of the clonal expansion model. Radiation and Environmental Biophysics. 1996;35(2):127–129. pmid:8792461
  34. 34. Crump KS, Subramaniam RP, Van Landingham CB. A numerical solution to the nonhomogeneous two-stage MVK model of cancer. Risk Analysis. 2005;25(4):921–6. pmid:16268939
  35. 35. Meza R. Some Extensions and Applications of Multistage Carcinogenesis Models. University of Washington; 2006.
  36. 36. Brouwer AF. Models of HPV as an Infectious Disease and as an Etiological Agent of Cancer. University of Michigan; 2015.
  37. 37. Audoly S, Bellu G, D’Angiò L, Saccomani MP, Cobelli C. Global identifiability of nonlinear models of biological systems. IEEE Transactions on Biomedical Engineering. 2001;48(1):55–65. pmid:11235592
  38. 38. Saccomani MP, Audoly S, Bellu G, D’Angio L. A new differential algebra algorithm to test identifiability of nonlinear systems with given initial conditions. Proceedings of the 40th IEEE Conference on Decision and Control. 2001;4:3108–3113.
  39. 39. Holmberg A. On the practical identifiability of microbial growth models incorporating Michaelis–Menten type nonlinearities. Mathematical Biosciences. 1982;62(1):23–43.
  40. 40. Eisenberg M, Samuels M, DiStefano JJ. Extensions, Validation, and Clinical Applications of a Feedback Control System Simulator of the Hypothalamo-Pituitary-Thyroid Axis. Thyroid. 2008;18(10):1071–1085. pmid:18844475
  41. 41. Cintrón-Arias A, Banks HT, Capaldi A, Lloyd AL. A Sensitivity Matrix Based Methodology for Inverse Problem Formulation. Journal of Inverse and Ill-posed Problems. 2009;17(6):545–565.
  42. 42. Eisenberg MC, Hayashi MAL. Determining identifiable parameter combinations using subset profiling. Mathematical Biosciences. 2014;256:116–126. pmid:25173434
  43. 43. Keener RW. Theoretical Statistics. Springer Texts in Statistics. New York, NY: Springer New York; 2010.
  44. 44. Luebeck G, Meza R. Bhat: General likelihood exploration; 2013. R package version 0.9–10. Available from: http://CRAN.R-project.org/package=Bhat.