Modeling the Natural History and Detection of Lung Cancer Based on Smoking Behavior

In this study, we developed a method for modeling the progression and detection of lung cancer based on the smoking behavior at an individual level. The model allows obtaining the characteristics of lung cancer in a population at the time of diagnosis. Lung cancer data from Surveillance, Epidemiology and End Results (SEER) database collected between 2004 and 2008 were used to fit the lung cancer progression and detection model. The fitted model combined with a smoking based carcinogenesis model was used to predict the distribution of age, gender, tumor size, disease stage and smoking status at diagnosis and the results were validated against independent data from the SEER database collected from 1988 to 1999. The model accurately predicted the gender distribution and median age of LC patients of diagnosis, and reasonably predicted the joint tumor size and disease stage distribution.


Introduction
Lung cancer is one of the most deadly diseases worldwide, largely because most patients present with advanced-stage disease at the time of diagnosis [1,2]. Most patients have clinical stage III or IV disease when they first notice symptoms and seek medical attention, which results in a poor prognosis. Age, gender, smoking status, tumor size, and disease stage at the time of diagnosis are highly related to the prognosis of patients with lung cancer [3,4,5]. In this article, we use mathematical methods to disentangle the tumorigenesis and detection processes. The goal of this model is to trace the timeline of an individual from his/her birth to the time of lung cancer initiation, progression, detection, and death. Thus, we combined models of carcinogenesis (cancer development until the first malignant cell), tumor progression (growth and metastasis), and detection, to construct a framework for modeling lung cancer at an individual level. Using this framework, we could infer characteristics that cannot be observed in clinical practice, including age of the patient when the primary tumor and nodal and distant metastases are formed. We were also able to evaluate characteristics that can only be partially observed, such as the tumor growth rate, and closely reconstruct characteristics that can be observed clinically, notably, tumor size and disease stage at the time of diagnosis. This procedure may be useful for better understanding the formation of the current lung cancer patient population and characterization of the future lung cancer trends with changes in smoking behavior and detection methods.

Lung cancer patients identified in the Surveillance, Epidemiology and End Results (SEER) database
Age, sex, disease stage, and tumor size for lung cancer patients who were diagnosed between 1973 and 2008 are available in the SEER database [6]. For patients diagnosed from 1988 to 1999, we used information on tumor size, with staging determined according to SEER extent of disease codes, which categorize tumors as localized, regional, and distant. For patients diagnosed from 2004 to 2008, the staging system developed by the American Joint Committee on Cancer was used to obtain tumor size and tumor node metastasis (TNM) disease stage information. Tumor size was measured as the maximum diameter, and we calculated the volume according to an assumption that a tumor grows as a sphere. We re-categorized the tumors into groups from 0 to 20 cm (with 1-cm increments) according to their maximum diameter.

Carcinogenesis modeling
For the carcinogenesis model, we used the two-stage clonal expansion (TSCE) model developed by Moolgavkar and Venzon [7] to calculate the age of the patient at tumor initiation. This model leads to an explicit formula for the distribution of the total duration, T, of the first two stages in the carcinogenesis process (the transitions from normal to initiated cell and initiated to malignant cell), which encompasses the time from the birth of an individual to the onset of malignancy [8]. A smoking-based modification of the TSCE model relating smoking intensity measured in packs per day (ppd) to the parameters of the TSCE model through response functions (with the parameters n 0 , a 0 , c 0 , a 1 , and a 2 , as listed in File S1) was chosen. Smoking duration was incorporated to produce a more specific Survival function of the age at tumor initiation for individual never, current and former smokers. [9,10,11].
The smoking history generator (SHG, version 5.2.1) [12]from the Cancer Intervention and Surveillance Modeling Network (CISNET), which produced smoking duration and intensity data for individuals, was incorporated into the smoking-based TSCE model by using the parameters listed in Table S1. Parameters for males and females were estimated using the data from Cancer Prevention Study I (CPS-I) and from Nurses' Health Study (NHS), respectively. The SHG was also used to generate an age of death due to causes other than lung cancer.

Tumor growth and metastasis modeling
For the tumor growth and metastasis models, we assumed that the hazard of tumor progression is based on the activity of the tumor cells, and tumor cells detach from the primary tumor and transfer to another part of the body, leading to metastases [13] 3.1. Assumptions. The following assumptions were made in modeling tumor growth and metastasis: 1) The primary tumor grows from a single cell, with an assumed volume of 1610 29 cm 3 [14]. The growth rate l, which is related to the tumor doubling time by the expression DT~Ln2=l, is determined at the time of tumor initiation and is assumed to remain the same over time.
2) The growth rate follows a gamma distribution, with shape and scale parameters h and K. 3) All metastases are derived from the primary tumor, which means cells detach from the primary tumor at the rate j and are transferred and deposited at the rate m at a new metastasis site [13]. We don't consider secondary metastasis from existing metastasis. 4) The activity of the tumor cell is related to how fast the tumor grows and how easily the cells detach. Specifically, the faster the primary tumor grows, the easier it is for the cells to detach. We define the tumor cell's activity a, to which the growth rate l is proportional, l = e16a (where e1 is a constant). The celldetachment rate, b, is also proportional to a, b = e26a (where e2 is another constant). Thus, b~e 2 e1 l, where e2 e1~j is a parameter representing the relationship between b and l. If the tumor with volume S grows exponentially, S~e lt , the total number of detached cells before time t 0 is e jlt0~Sj 0 ; we assume 0vjv1, the interpretation of which is that cells always detach from the primary tumor but not all tumor cells will detach. 5) The detached cells will be transferred and deposited at new locations. The aggregate rate of transfer and deposition is m. m could be a constant parameter or a functional parameter determined by a biological process, such as the rate of synthesis of proteins that help transfer the tumor cells across the blood vessel wall [13]. 6) Metastases are defined as either nodal or distant. We assume a different rate m (m n and m m ) for each type of metastasis, m n for nodal and m m for distant metastasis. We also assume that the detached cells can move to nodal sites at least as easily as to distant sites (m n $m m ). 7) The hazard function for metastasis (nodal or distant) is related to the number of tumor cells that have detached from the primary tumor and have been successfully transferred and deposited at nodal or distant locations. Assuming exponential growth, the hazard functions for nodal and distant metastases are h n~mn |S j and h m~mm |S j , respectively. The cumulative distribution functions (c.d.f.) are defined below: with the tail functions (or survival functions) F F n , F F m . If j~0, we assume no cells detach from the primary tumor, m n~0 and m m~0 . 1) Assumption m n $m m implies that F n s ð Þ §F m s ð Þ, where F n (s) and F m (s) are corresponding cumulative distribution functions defined above. Primary tumor sizes at the time of initiation for nodal and distant metastases, respectively, are denoted s ni and s mi . 2) We assume that the cell's activity changes after detachment from the tumor, transfer and deposition at the metastatic site. Cell at the nodal and distant metastatic sites grow three (3l) or four (4l) times faster than the primary tumor [13,15,16], correspondingly.
The primary tumor size is calculated using the tumor growth model by giving the growing time t, with a constant growth rate l.
Thus, we rewrite F n s ð Þ,F m s ð Þ to F n t,l ð Þ~Ð where f n and f m are the p.d.f. of time that nodal and distant metastases occurred in patients with tumor growth rate having a Gamma distribution, and c is the Gamma distribution function with parameters k and h. Then, where f n 1 m 0 t ð Þ is the p.d.f. of time for patients who had only nodal metastases (without distant metastases) in the whole time period.

Cancer detection modeling
To model cancer detection, we introduced a competing process of detecting the disease through the primary tumor or nodal or distant metastases, adapting the framework developed by Kimmel and Flehinger. [17] 4.1. Assumptions. The following assumptions were made in modeling cancer detection: 1) The detection of cancer is based on the detection method used. The hazard of detection has a linear relationship with tumor size. It also depends on the reasons (e.g., symptoms or results of a screening test) that prompted the patient to seek medical attention. For details, see equations (*) for h Dp , h Dn , and h Dm further on. The relationship between assumptions (2) and (3) for the detection model is shown in the following functions.
To give the explicit expressions for Z nm , the detailed expressions of D p ,D n and D m are required. According to the assumptions of metastases model, we know that: Where 0,S 0 #S, is the size of the primary tumor that was not observed and D is considered the assumption that F n s ð Þ §F m s ð Þ. P r fs 0 [SjN,Mg represented the probability of a primary tumor with (1) or without (0) nodal (N) or distant (M) metastasis.
The joint density/ probability functions p s,n,m ð Þ of random variable S, N and M are presented below.

Estimation
Methods provided by Kimmel and Flehinger [17] could be used to estimate W n , W m , and Z nm non-parametrically. We can also estimate D p ,D n ,D m ,F n , and F m parametrically once the parametric tumor-growth and detection models are determined. Below we provide an example.
Assuming that a tumor grows exponentially with a growth rate l, and the metastases model described above, we have Assuming that the hazard of tumor detection depends linearly on the size of the tumor, denoting the efficiency of the detection by tumor size as a and stage-dependent offset parameters as w 0 , w 1 and w 2 , we obtain ÞÞ, where g is the simulated joint distribution of tumor size and stage andŶ Y is the observed joint distribution, based on these parameters, and apply the Nelder-Mead method [18,19,20] to achieve the best fitted parameters in the model. We used the second approach because we did not have the multiple tumor-size measurements for individuals to estimate the models separately.
is the least square function where i is the stage status defined as local (i = 0, no nodal or distant metastasis N0M0), nodal (i = 1, nodal metastases but no distant metastases, N1M0), or distant (i = 2, M1), j is the number of tumor size group; g ij () is the simulated percentage of lung cancer with tumor in the size range of j group and i stage among all detected lung cancer;ŷ y ij is the observed percentage of lung cancer with tumor in the size range of j group and i stage among all detected lung cancer. Detailed simulation procedure was in File S1. We estimated the nine parameters (j, m n , m m , l K,h ð Þ, g, W 0 , W 1 , W 2 ) using the TNM staging data in the SEER database from 2004 to 2008 for model fitting. Since SHG is using year 2000 as a cut-off point for the vital status observation, the joint distributions of tumor size and disease stage from 1995 to 1999 were chosen as the output of the simulation. The results were validated against independent data from the SEER database collected from 1988 to 1999. These years were chosen as closest possible to 2004-2008 periods.
To simulate the LC population, we firstly used the smoking history generator (SHG) to generate the underlying population. We assumed that the number of persons before year 1890 was zero and at the year of 1890 there were 2877000 new born babies (the number of live births in each year was shown in the Figure S1). We provided the year of birth (say 1890) and gender (half and half) to SHG as inputs and repeated the SHG for 287700 times. Then we got these persons' basic information, including the year of death (converted from the age of death, A d , generated by SHG) and their smoking history information. We then applied our simulation strategy (described in the File S1 at section 2.1 simulation process) to get LC candidates and the information of their tumor progression. For the next year (say 1891), we added new born babies to the underlying population and removed the persons that were dead in the previous year (say 1890) from the population, whenever she or he was LCs or ''normal'' persons. Thus, we had underlying population, which would be approaching the real U.S. population (figure S2), and the LC candidate population, which were considered as an unperturbed (existing before detection) LC population. Assuming that no LC-related death occurred before detection, the yearly LC population would be achieved by applying the detection model to the unperturbed LC population. Figure 1 shows the probabilities of nodal metastases and distant metastases by the time from the tumor onset. The estimated parameters j, m n , m m , k and h in Table 1, which gives the estimates of the model parameters, were used to draw f n t ð Þ, f m t ð Þ and f n1m0 t ð Þ. These probability density functions showed that the probability of nodal and distant metastasis began to fast increase at 2.5 years (about 900 days) and 3 years (about 1100 days) from the time of tumor onset, respectively. It reached the highest at 6.4 years (about 2350 days) and 6.8 years (about 2500 days) from the time of tumor onset. Figure 2 compares the characteristics of the population for the years 1995-1999 generated by the fitted model to the SEER data (2004)(2005)(2006)(2007)(2008). For tumors smaller than 10 cm in diameter, the proportion of N0M0-stage disease (no nodal or distant metastases) more closely reproduces the SEER data (2004)(2005)(2006)(2007)(2008). The proportions of NxM0-and M1-stage disease are not reproduced as accurately as the proportions of N0M0-stage disease, especially when the tumors are larger than 5 cm in diameter. For tumors smaller than 1 cm, the model predicted that about 50% and 35% would be staged as N0M0 and M1, respectively, whereas the actual percentages were 42% and 42% respectively.

Predicting Clinically Observable Characteristics
The fitted model was also validated by predicting the characteristics of lung cancer patient population in United States between 1988 and 1999. The model predicts both the gender distribution among LC patients and median age that are quite close to the 1988-1999 SEER data (Table 2).
Comparing the model prediction and the data, a smaller proportion of patients was diagnosed with localized disease than it was predicted (Figure 2c). One of the reasons may be the different staging definitions used. The predicted tumor size distributions were closer to the 2004-2008 SEER than to the 1988-1999 SEER data (Figure 3 (a-c)).  Table 3 summarizes the predicted but not directly clinically observable characteristics of the detected tumors. The mean time from the tumor onset (when the first malignant cell appears) to nodal and distant metastases (when they are just formed and not yet observed) and diagnosis is about 4.77, 5.05, and 6.27 years, respectively. The average size of the primary tumors when nodal and distant metastases form was 0.09 cm 3 and 0.24 cm 3 , respectively. The median age at the time of tumor onset is 63 and the average growth rate corresponds to the tumor-volume doubling time of about 60 days. Table S2 shows the distribution of doubling time by tumor size and stage. In clinical practice, a primary tumor with distant metastases is more likely to be found with a smaller size than a primary with no or only nodal metastases. This leads to an observation that a primary tumor with distant metastases grows slower and remains smaller. This Table also demonstrates that faster growing tumors tend to be detected at larger sizes.

Discussion
The parameters estimated from the joint distribution of tumor size and stage in the SEER database from 2004 to 2008 (Table 1) were applied to generate a lung cancer patient population from 1988 to 1999 and validated by comparison of the results to the data from the SEER database from 1988 to 1999. The model accurately predicts the gender distribution and the median age of lung cancer patients, and approximates the joint tumor size and disease stage distribution. The accurate prediction of gender distribution and age at diagnosis for 1988-1999 is largely owing to the accuracy of the smoking-based TSCE model and SHG. Because smoking behavior has changed significantly over the recent decades, the output is sensitive to the year at detection, which is why we are not able to reconstruct gender and age at diagnosis in the SEER data from 2004 to 2008 as accurately. This model overestimates the proportion of patients with tumors larger than 10 cm in diameter and underestimates the proportion of patients with tumors between 4 and 9 cm. These discrepancies are more obvious for the distributions of the primary tumor size at stages NxM0 and M1 than at N0M0. The reason might be that the detection interval is fixed to 1 year in our model, whereas patients may visit a doctor more frequently when symptoms appear. For tumors smaller than 1 cm, the model underestimated the proportion of patients with distant metastases.
We also used the fitted model to predict disease characteristics that are difficult or impossible to observe in clinical practice. According to the estimates of k and h in Table 1, the average tumor growth rate, l, is about 4.4, which corresponds to a tumorvolume doubling time of approximately 55 to 65 days given the   exponential tumor growth model. This growth rate is higher than what has been reported from screening studies [21,22,23], and thus the difference is not entirely unexpected [24]. Besides, introducing a time-dependent or size-dependent growth rate to the tumor growth model may improve the fit of the model to the data. The hazards for detection once nodal or distant metastases are present are much larger than the hazards for detection when only the primary tumor is present (W 1 , W 2 ..W 0 ), which reflects the reality of disease detection in clinical practice. The mean duration from tumor onset to detection was about 6 years in our model, which is consistent with other disease progression models [25]. Among the detected tumors, the average primary tumor size at the time of metastasis (nodal or distant) was less than 1 cm in diameter, which is considerably smaller than that implied by other predictions originating from screening data [17,26,27,28]. Assumptions regarding detection used in the models led to this difference. In our model, the chances of detecting lung cancer increase after nodal and distant metastases occur, and the competitive detection model allows for the detection of the metastasized tumor. This detection model does not require either of the two extreme assumptions used in the previous studies [8,9,17]: (1) that the probability to detect cancer is unchanged when metastases are present, or (2) cancers are detected immediately when metastatic spread occurs. To reduce the complexity of the disease stage progression model, we did not consider the possibility of a secondary spread of the disease from nodal metastases. This may be the reason that the model did not as accurately reproduce the proportion of nodal and distant metastases as it did the proportion of localized tumors for tumor sizes larger than 5 cm.
Our framework combines a carcinogenesis model with a model of the natural history of tumor growth and progression, and a detection model, to predict features of a lung cancer patient population. This modular structure allows testing of different detection strategies. One limitation of this model is that we were not able to construct the overall likelihood function for the model in the analytical form, and the Nelder-Mead estimation procedure used to optimize the least square fit is time consuming. Another limitation is that this framework largely depends on the smoking information generated by the SHG, which has to be updated before it can be used to accurately predict properties of future lung cancer patient populations. We did not perform simulation by histology, which is another limitation. Moreover, the model did not consider the difference between the lung cancer risks in CPSI/NHS and SEER, while the previously estimation of parameters in carcinogenesis model was directly used. We cannot recalibrate carcinogenesis model since no smoking information was recorded in SEER.

Conclusion
We proposed a model for predicting the natural disease progression and detection of lung cancer that relies on the following biologically and clinically reasonable assumptions: the hazard function of tumor progression is based on the activity of the tumor cells, which detach from the primary tumor and transfer to another part of the body, leading to metastases [13]. Thus, the metastasis process is related to the size of the primary tumor and the tumor growth rate (which is also related to the activity of the tumor cells). The detection of lung cancer in patients occurs as a result of competing detection of the primary tumor or nodal or distant metastasis. We used a TSCE model combined with the smoking history generator to reproduce the population with incipient tumors according to the yearly live birth number in the United States ( Figure S1). We then applied our models of the tumor natural progression and its detection to re-create the lung cancer patient population at the time of diagnosis. Lung cancer data from SEER database collected between 2004 and 2008 were used to fit the lung cancer progression and detection model. The fitted model combined with a carcinogenesis model was used to reconstruct the distribution of age, gender, tumor size, and disease stage at diagnosis, and the results showed that the model accurately predicted gender and median age, and reasonably predicted the tumor size and disease stage distribution against independent data from the SEER database collected from 1988 to 1999. This model framework provides a platform for estimating the outcome of a strategy for the secondary prevention of lung cancer before it is applied in clinic.