Development and Validation of a Prediction Model to Estimate Individual Risk of Pancreatic Cancer

Introduction There is no reliable screening tool to identify people with high risk of developing pancreatic cancer even though pancreatic cancer represents the fifth-leading cause of cancer-related death in Korea. The goal of this study was to develop an individualized risk prediction model that can be used to screen for asymptomatic pancreatic cancer in Korean men and women. Materials and Methods Gender-specific risk prediction models for pancreatic cancer were developed using the Cox proportional hazards model based on an 8-year follow-up of a cohort study of 1,289,933 men and 557,701 women in Korea who had biennial examinations in 1996–1997. The performance of the models was evaluated with respect to their discrimination and calibration ability based on the C-statistic and Hosmer-Lemeshow type χ² statistic. Results A total of 1,634 (0.13%) men and 561 (0.10%) women were newly diagnosed with pancreatic cancer. Age, height, BMI, fasting glucose, urine glucose, smoking, and age at smoking initiation were included in the risk prediction model for men. Height, BMI, fasting glucose, urine glucose, smoking, and drinking habit were included in the risk prediction model for women. Smoking was the most significant risk factor for developing pancreatic cancer in both men and women. The risk prediction model exhibited good discrimination and calibration ability, and in external validation it had excellent prediction ability. Conclusion Gender-specific risk prediction models for pancreatic cancer were developed and validated for the first time. The prediction models will be a useful tool for detecting high-risk individuals who may benefit from increased surveillance for pancreatic cancer.


Introduction
Pancreatic cancer represents the fifth-leading cause of cancer-related death in Korea and the seventh worldwide. It has a dismal 5-year survival rate of 7.6% in Korea [1], mainly due to unresectable disease in 80-90% of patients at the time of diagnosis [2]. Pancreatic cancer patients seldom exhibit disease-specific symptoms until late in the course of disease progression, and the impact of standard therapy is limited. Despite advances in the screening and early detection of other cancers, such as gastric cancer and breast cancer, no reliable screening tool exists for pancreatic cancer. Because of the relatively low incidence of the disease, current efforts focus only on early detection and screening in patients at high risk of developing the disease.
A screening strategy has not been established for sporadic pancreatic cancer. Because pancreatic cancer-specific symptoms occur late in the course of the disease, early detection will require screening asymptomatic individuals. Invasive pancreatic cancer develops from precancerous non-invasive precursor lesions called pancreatic intraepithelial neoplasia (PanIN), which progresses from PanIN1 to PanIN3 (carcinoma in situ) [3]. However, the timeline of the progression of pancreatic cancer is not well established. In a case series, Brat and colleagues reported the presence of PanIN 17 months to 10 years before the clinical diagnosis of cancer [4]. Two major obstacles restrict our ability to screen for pancreatic cancer: an absence of a high-risk group of patients and the absence of sensitive and specific marker(s) for detecting early stages of pancreatic cancer. However, even if a biomarker with very high sensitivity and specificity is identified, screening the general population for asymptomatic pancreatic cancer would not be cost effective or practical. Thus, screening for asymptomatic pancreatic cancer will likely require filtering the population into at least two sequential groups in order to enrich the population and allow cost-effective screening [5]. The first filter could be the selection of a high-risk group (i.e., a population of individuals at a higher than average risk of pancreatic cancer [6]), and the second filter could be to identify individuals with a unique clinical phenotype using one or more biomarker(s) of early stage pancreatic cancer or non-invasive imaging [5,7]. Currently, individuals with genetic syndromes that are associated with a high incidence of pancreatic cancer and those who have at least two first-degree relatives with pancreatic cancer are screened by endoscopic ultrasonography [8]. However, these patients account for less than 5% of all pancreatic cancer cases. 
An entirely different approach should be developed for screening sporadic pancreatic cancer.
The present study is the largest population-based cohort study of pancreatic cancer to date in Korea and provides a unique opportunity to develop a model for predicting the individual risk of developing pancreatic cancer.

Study population
The study population consisted of men and women who participated in a biennial health examination conducted by the Korean Health Insurance Corporation, including government employees, schoolteachers, company employees, and their dependents, between 1996 and 1997. A total of 1,289,933 men and 557,701 women aged 30 to 80 years who had no history of any cancer at baseline or during the first two years of follow-up, and who had no missing values for the primary risk factors (age, height, body mass index (BMI), fasting glucose, urine glucose, cholesterol, smoking, age at smoking initiation, meal preference, frequency of meat consumption, eating habits), were included in the model development.
To assess the performance of the models, we used an independent population medically evaluated by the National Health Insurance Corporation between 1998 and 1999 that was free of any cancer at baseline as a validation cohort. A total of 500,046 men and 627,629 women were included in the validation data set.
The participants were followed from the date of the health examination until December 31, 2007, and the event was defined as the first diagnosis of pancreatic cancer (median follow-up time: men, 11.49 years; women, 10.72 years in the development set and men, 8.50 years; women, 8.46 years in the validation set). Individuals who did not develop cancer until the end of the follow-up were censored.
This study was approved by the Institutional Review Board of the National Cancer Center, Korea (IRB no. NCCNCS09-305). The need for participant consent was waived by the ethics committee because this study involved routinely collected medical data that were anonymously managed at all stages, including data cleaning and statistical analysis.

Data collection
The incidence of pancreatic cancer among participants up to December 31, 2007, was identified through the Korean Central Cancer Registry (KCCR) database. The incidence of pancreatic cancer was classified according to ICD-10 codes (C25) [9]. Deaths, including cancer deaths, were identified from the death records of the National Statistical Office and National Health Insurance Corporation. During the health examination, participants responded to a questionnaire about previous disease history, eating habits, meal preferences, frequency of meat intake, drinking habits, amount of alcohol consumed at a time, duration of smoking, amount of smoking per day, year of smoking cessation, and number of times per week they participated in physical activity. Height, weight, systolic and diastolic blood pressure, total cholesterol, and fasting blood and urine glucose levels were measured directly. BMI was calculated as the weight in kilograms divided by the square of the height in meters. The age at smoking initiation was calculated from the duration of smoking and age at baseline.
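The two derived variables described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the subtraction used for age at smoking initiation assumes a current smoker whose smoking duration runs up to baseline (the text does not spell out how past smokers were handled).

```python
def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """BMI = weight in kilograms divided by the square of height in meters."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def age_at_smoking_initiation(age_at_baseline: float, smoking_duration_years: float) -> float:
    """Assumed derivation: age at baseline minus the number of years smoked.

    This holds for current smokers; for past smokers the year of cessation
    would also be needed, which this sketch does not model.
    """
    return age_at_baseline - smoking_duration_years


# Hypothetical participant: 70 kg, 175 cm, aged 50, smoked for 25 years.
example_bmi = body_mass_index(70.0, 175.0)
example_start_age = age_at_smoking_initiation(50, 25)
print(f"BMI = {example_bmi:.2f}, started smoking at {example_start_age}")
```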

Prediction model development
The Cox proportional hazards model was used to develop gender-specific risk prediction models. Significant risk factors for pancreatic cancer were identified by crude and age-adjusted Cox regression. The time to event was defined as the time between the date of the health examination and the date of the first diagnosis of pancreatic cancer. Potential risk factors considered in the analyses included previous disease history (hepatitis, diabetes, and any other cancer), eating habits (bland, moderate, spicy, or salty), meal preference (meat vs. vegetables), frequency of meat intake (≤1 time/week, 2-3 times/week, or ≥4 times/week), drinking habit (≤2-3 times/month or ≥1-2 times/week), amount of alcohol consumed at a time, duration of smoking, amount of smoking per day (never, ever, current and <0.5 pack/day, current and ≥0.5-1 pack/day, or current and ≥1 pack/day), year of smoking cessation, physical activity (none, light, moderate, or heavy), height (grouped by quartiles), BMI (<18.5, 18.5-22.9, 23.0-24.9, or ≥25), systolic and diastolic blood pressure, total cholesterol, and fasting blood and urine glucose levels. More in-depth descriptions of the rationale of the categorization of these variables were published previously [10,11]. Three different model selection processes (forward, backward, stepwise) were employed in the multivariable analysis using α = 0.10. Graphical checks for the proportional hazards assumption were done. Age and its quadratic term were also included in the model as risk factors to improve the model fit.
The probability of developing pancreatic cancer within 8 years (t = 8) for an individual with K risk factors is estimated as follows:

$$\hat{P} = 1 - S_0(t)^{\exp\left(\sum_{i=1}^{K} \beta_i x_i - \sum_{i=1}^{K} \beta_i M_i\right)}$$

where $\beta_1, \ldots, \beta_K$ are the estimated Cox regression coefficients, $x_1, \ldots, x_K$ are the values of the K risk factors at baseline, and $M_1, \ldots, M_K$ are the average values of the corresponding risk factors. The baseline survival probability $S_0(t)$ is the survival probability at time t = 8 for an individual whose covariate values are equal to the mean value of each risk factor. The detailed procedures can be found in the S1 Appendix.
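As a rough illustration of how such an absolute risk could be computed, the sketch below assumes the usual Cox-based form P = 1 - S0(t)^exp(Σ β_i (x_i - M_i)). The coefficients, means, and baseline survival used here are made-up placeholders, not the fitted values of this study (those are in the S1 Appendix), and the function name is hypothetical.

```python
import math


def eight_year_risk(x, means, betas, baseline_survival):
    """Absolute risk under the assumed form 1 - S0(t) ** exp(sum(beta * (x - M)))."""
    linear_predictor = sum(b * (xi - mi) for b, xi, mi in zip(betas, x, means))
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


# Placeholder inputs (NOT the study's estimates): two risk factors.
betas = [0.5, 0.3]            # hypothetical regression coefficients
means = [1.0, 2.0]            # hypothetical cohort means M_1, M_2
baseline_survival = 0.999     # hypothetical S0(8)

risk_at_mean = eight_year_risk(means, means, betas, baseline_survival)
risk_elevated = eight_year_risk([2.0, 3.0], means, betas, baseline_survival)
print(f"risk at mean profile: {risk_at_mean:.4%}")
print(f"risk at elevated profile: {risk_elevated:.4%}")
```

At the mean covariate profile the exponent is zero, so the predicted risk reduces to 1 - S0(t); any profile with a larger linear predictor yields a higher risk.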

Model validation
An independent population was used for external validation of the developed models, which allowed us to evaluate the performance of the models with respect to discrimination and calibration. Discrimination is a model's ability to distinguish between non-events and events. This can be quantified by calculating the C-statistic for the survival model developed by Nam [12]. The C-statistic is a concordance measure analogous to the receiver operating characteristic (ROC) curve area for the logistic model [13]. The value indicates the probability that a model produces higher risks for those who develop pancreatic cancer within 8 years of follow-up compared to those who do not develop pancreatic cancer [13]. An SAS macro was used to calculate the C-statistic with 95% confidence intervals (CIs).
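To make the idea of concordance concrete, the sketch below computes a simple pairwise concordance index for censored data. It is a Harrell-type estimator written for illustration, not the SAS macro or the exact estimator of Nam used in this study, and it does not compute confidence intervals. All names and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell-type concordance for right-censored data (O(n^2) sketch).

    A pair (i, j) is usable when subject i had an observed event strictly
    before subject j's follow-up time. The pair is concordant when the model
    assigned subject i the higher predicted risk; ties in risk count as 0.5.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs: at least one observed event is required")
    return concordant / usable


# Toy data: follow-up time in years, event indicator (1 = cancer), predicted risk.
toy_times = [2, 5, 7, 8]
toy_events = [1, 0, 1, 0]
perfect = concordance_index(toy_times, toy_events, [0.9, 0.4, 0.6, 0.1])
reversed_order = concordance_index(toy_times, toy_events, [0.1, 0.6, 0.4, 0.9])
print(perfect, reversed_order)
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect ranking of earlier events above later or censored subjects.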
Calibration measures how closely the predicted probabilities agree numerically with the actual outcomes. We used a Hosmer-Lemeshow (H-L) type χ² statistic developed by Nam [12]. This χ² statistic was calculated by dividing the data into 10 groups (deciles) based on the predicted probabilities produced by the model in ascending order. Then, for each decile, the average predicted probabilities were compared to the actual risk probabilities estimated by the Kaplan-Meier approach [14].
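The decile-based comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the statistic is taken to have the form Σ n_g (KM_g - p̄_g)² / (p̄_g (1 - p̄_g)), which is our reading of the H-L type statistic for survival data and may differ in detail from the macro used in the study. Function names and the toy data are hypothetical.

```python
def kaplan_meier_failure(times, events, horizon):
    """Return 1 - Kaplan-Meier survival at `horizon` (observed risk)."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failures = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - failures / at_risk
    return 1.0 - survival


def hl_type_chi_square(predicted, times, events, horizon=8.0, groups=10):
    """Sum over risk groups of n * (observed - mean predicted)^2 / (p * (1 - p)).

    Groups are formed by sorting on predicted probability in ascending order.
    Assumes every group's mean predicted probability lies strictly in (0, 1).
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    chi_square = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        observed = kaplan_meier_failure(
            [times[i] for i in idx], [events[i] for i in idx], horizon
        )
        chi_square += len(idx) * (observed - mean_pred) ** 2 / (mean_pred * (1.0 - mean_pred))
    return chi_square


# Toy check with a single group of four subjects.
toy_times = [1, 2, 3, 4]
toy_events = [1, 0, 1, 0]
observed_risk = kaplan_meier_failure(toy_times, toy_events, 8.0)
calibrated = hl_type_chi_square([0.625] * 4, toy_times, toy_events, groups=1)
miscalibrated = hl_type_chi_square([0.5] * 4, toy_times, toy_events, groups=1)
print(observed_risk, calibrated, miscalibrated)
```

Smaller values indicate closer agreement between predicted and Kaplan-Meier observed risks across the groups.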
All statistical analyses were performed using SAS, version 9.1 (SAS institute, Cary, NC) and Stata version 10 (StataCorp LP, College Station, TX).

Cancer incidence and baseline characteristics
Among 1,289,933 men and 557,701 women, 1,634 (0.13%) men and 561 (0.10%) women were newly diagnosed with pancreatic cancer during 8 years of follow-up, and the incidence rates in men and women were 11.56 and 9.39 cases per 100,000 person-years, respectively (Table 1). Across age groups, the incidence proportions for men were highest in their 50s, followed by their 60s, 40s, and 70s. In women, the incidence proportions were highest in their 60s, followed by their 50s and 70s. Tables 2 and 3 provide the baseline characteristics and age-adjusted results of univariate analyses for each risk factor in men and women. The mean (standard deviation) ages of the men and women were 44.6 (10.33) years and 48.6 (11.27) years, respectively.
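The incidence proportions quoted above follow directly from the case and cohort counts, as the short check below shows. The rate function is included only to show the definition of a rate per 100,000 person-years; the person-years totals are not reported in this section, so the example call uses a hypothetical round number.

```python
def incidence_proportion_percent(cases: int, population: int) -> float:
    """Cases as a percentage of the cohort at baseline."""
    return 100.0 * cases / population


def incidence_rate_per_100k(cases: int, person_years: float) -> float:
    """Cases per 100,000 person-years of follow-up."""
    return 1e5 * cases / person_years


men_percent = incidence_proportion_percent(1634, 1289933)
women_percent = incidence_proportion_percent(561, 557701)
print(f"men: {men_percent:.2f}%, women: {women_percent:.2f}%")

# Hypothetical person-years, for illustration of the definition only.
example_rate = incidence_rate_per_100k(10, 100000)
```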

Risk factors and relative risk
The age-adjusted univariate analyses in men showed that height, BMI, blood glucose, urine glucose, smoking, meal preference (vegetables/meat), and eating habits (spicy or salty) are significant risk factors for pancreatic cancer. In the risk prediction model for men, age, height, BMI, blood glucose, urine glucose, smoking, and age at smoking initiation were finally included based on stepwise selection. To improve the model's fit, we included a quadratic term of age (age²) and combined smoking status and the average value for the amount smoked per day into one variable termed "smoking" that comprised five categories: never, past, current <0.5 pack/day, current 0.5-0.99 pack/day, and current ≥1 pack/day. The pancreatic cancer risk increased as height and BMI increased. Higher blood glucose (≥140 mg/dL) was associated with an almost 30% increased risk compared to lower blood glucose (<140 mg/dL). Current smokers consuming more than 1 pack of cigarettes per day had double the risk compared to never smokers (Table 2).
For women, BMI, urine glucose, and current smoking were significant risk factors in the age-adjusted univariate analyses. In the multivariable Cox regression model for women, age and its quadratic term (age²), height, BMI, blood glucose, urine glucose, and drinking habit were finally selected as factors. BMI greater than 25.0 was associated with a 1.4-fold increased risk compared to the normal range (18.5-22.9). Current smokers also had a 1.8-fold higher risk than never smokers (Table 3).

Validation of the risk prediction models
The risk prediction models were validated in an independent cohort by evaluating their discrimination and calibration abilities with respect to the C-statistic and the H-L type χ² statistic. As shown in Table 4, baseline characteristics for the validation cohort were similar to those in the derivation cohort. The incidence rates for men aged more than 60 years in the validation set slightly increased compared to the derivation set because of the higher proportion of older participants (S1 Table).

Illustration of individual absolute risk estimation for pancreatic cancer
The absolute risk estimates for pancreatic cancer within 8 years are illustrated in S2 and S3 Tables for men and women, respectively. In S2 Table, no. 3 is a 50-year-old man with a height of 165 cm, BMI of 18.5-22.9, negative urine glucose, <140 mg/dL blood glucose, and classified as a non-smoker. His risk of pancreatic cancer within 8 years was only 0.0557%. On the other hand, no. 1 is a 50-year-old man who is between 168 and 172 cm in height, with a BMI of 23.0-24.9, positive urine glucose, and ≥140 mg/dL blood glucose, and who is a current smoker consuming ≥1 pack/day who started smoking at 25 years of age. This individual had a 0.2579% risk, which was 4.6 times greater than that of no. 3. In the same manner, no. 16, who had the same risk profile as no. 1 but was 25 years older, has an approximately 6-times greater risk than no. 1 (1.5117% vs. 0.2579%). For women, 8-year risk estimates are described in S3 Table for each risk profile. If a woman is 75 years old, >158 cm in height, with a BMI of <18.5, no urine glucose, is a current smoker, drinks ≥1-2 times per week, and has ≥140 mg/dL blood glucose, then she has a 1.1952% absolute risk of pancreatic cancer over 8 years, which is roughly 6.6 times greater than the risk of a woman of the same age who is between 155 and 158 cm in height with healthier physical conditions and smoking and drinking habits, such as no. 18 (1.1952% vs. 0.1819%).
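The fold differences quoted for these profiles are simple ratios of the 8-year absolute risks. The sketch below recomputes them from the percentages given in the text; the dictionary keys are hypothetical labels introduced here for readability.

```python
# Eight-year absolute risks (%) as quoted in the text for the example profiles.
risk_percent = {
    "men_no1": 0.2579,
    "men_no3": 0.0557,
    "men_no16": 1.5117,
    "women_high_risk": 1.1952,
    "women_no18": 0.1819,
}


def fold_difference(numerator_key: str, denominator_key: str) -> float:
    """Ratio of two absolute risks, i.e. how many times greater the first is."""
    return risk_percent[numerator_key] / risk_percent[denominator_key]


no1_vs_no3 = fold_difference("men_no1", "men_no3")
no16_vs_no1 = fold_difference("men_no16", "men_no1")
women_ratio = fold_difference("women_high_risk", "women_no18")
print(f"{no1_vs_no3:.1f}, {no16_vs_no1:.1f}, {women_ratio:.1f}")
```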

Cumulative incidence probabilities of five risk groups
We divided the derivation sets of men and women into five risk groups based on quintiles of the estimated probability of developing pancreatic cancer. The cumulative incidence probability and hazard ratio for each risk group are provided in Fig 3. At 10 years, the cumulative incidence probabilities of the highest risk group were 0.359% (95% CI: 0.335-0.384) and 0.292% (95% CI: 0.261-0.327) in men and women, respectively, and those of the lowest risk group were 0.009% (95% CI: 0.006-0.013) and 0.006% (95% CI: 0.003-0.013) in men and women, respectively. Medium-low, medium, medium-high, and highest risk groups had significantly higher hazard ratios than the corresponding lowest risk group in both men and women.
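Splitting a cohort into quintile-based risk groups can be sketched as below. This is an illustrative rank-based assignment, assuming ties are broken by input order; the study's exact cut-point handling is not described, and the function name and toy probabilities are hypothetical.

```python
GROUP_LABELS = ["lowest", "medium-low", "medium", "medium-high", "highest"]


def assign_risk_groups(predicted, n_groups=5):
    """Return a group number (1 = lowest risk) for each predicted probability.

    Subjects are ranked by predicted probability in ascending order and the
    ranks are cut into `n_groups` blocks of (nearly) equal size.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    groups = [0] * len(predicted)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(predicted) + 1
    return groups


# Toy predicted probabilities for ten hypothetical participants.
toy_predicted = [0.05, 0.01, 0.03, 0.02, 0.04, 0.10, 0.07, 0.06, 0.09, 0.08]
toy_groups = assign_risk_groups(toy_predicted)
for probability, group in zip(toy_predicted, toy_groups):
    print(f"{probability:.2f} -> {GROUP_LABELS[group - 1]}")
```

Cumulative incidence and hazard ratios would then be estimated within each group, with the lowest group as the reference.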

Discussion
Despite improvements in the clinical management of pancreatic cancer, limited advances have been made in the early detection of this highly lethal malignancy [15]. Within the population of resectable pancreatic cancer patients, 5-year survival exceeds 75% in the subset with well-differentiated stage I cancers < 1 cm [5]. Thus, early detection of pancreatic cancer, or even precursor lesions, is the most intuitive approach for improving the overall prognosis of this lethal and frustrating cancer. Because of the overall low prevalence of pancreatic cancer in the general population, current screening efforts are mainly directed at populations at high risk of developing pancreatic cancer. The present risk prediction models were developed to guide healthcare professionals and individuals in their decision making regarding further screening efforts and lifestyle changes in order to inform individuals about their risks of having pancreatic cancer. This approach is meant to supplement the reasoning and decision making of healthcare professionals by providing more objectively estimated probabilities [16]. However, very few studies in the literature have involved a prediction model for pancreatic cancer. One previous study proposed a method that combines PubMed knowledge and electronic health records to develop a weighted Bayesian Network Inference model for pancreatic cancer prediction [17]. Though pancreatic cancer was used as a sample disease in the initial study, clinical implementation remains a challenge. The etiology of pancreatic cancer remains to be established, but several known genetic and environmental factors have been associated with its development. Thus far, risk factors accounting for up to 30% of the disease have been determined [18]. Among the few risk factors identified to date, cigarette smoking is the most consistent [19]. 
However, inconsistencies in the patterns of cigarette smoking and incidence between different countries, as well as the low relative risk, suggest that the disease is only partially attributable (~20%) to smoking [20]. Diabetes mellitus [21] and chronic pancreatitis [22] are additional predisposing factors for pancreatic cancer, but diabetes as a result of pancreatic cancer is not infrequent and chronic pancreatitis explains less than 3% of pancreatic cancer cases. In contrast, obesity has been reported to be associated with an approximate 20% increase in pancreatic cancer risk compared to normal weight [23]. In the present study, pancreatic cancer risk increased with increasing height or BMI. In the model for men, higher blood glucose levels (≥140 mg/dL) were associated with an almost 30% increased risk of pancreatic cancer compared to lower blood glucose levels (<140 mg/dL). Unfortunately, the association between chronic pancreatitis and pancreatic cancer could not be investigated in these cohorts. Alcohol consumption is a well-known risk factor for type II diabetes mellitus and chronic pancreatitis, both of which are associated with an increased risk of pancreatic cancer. However, reported associations between alcohol intake and pancreatic cancer risk have been too inconsistent to support a conclusion. A pooled analysis of the primary data from 14 prospective cohort studies [24] revealed a positive association of pancreatic cancer risk with alcohol intake, but it was only significant among women. In the present study, drinking habit was selected for the model for women only. Thus far, no early detection method has had sufficient sensitivity and specificity to serve as a tool for pancreatic cancer screening. In addition, the feasibility of pancreatic cancer screening among the general population is questionable owing to the overall low prevalence.
Thus, existing research on screening has been restricted to high-risk individuals, such as those with Peutz-Jeghers syndrome, familial breast-ovarian cancer patients, and relatives of patients with familial pancreatic cancer with at least one affected first-degree relative [25]. The largest prospective study showed that screening asymptomatic high-risk individuals can detect pancreatic lesions, including curable, non-invasive high-grade neoplasms. However, such individuals with genetic syndromes account for less than 5% of all pancreatic cancer cases. The present study provides a unique opportunity to filter the population into high-risk individuals using a predictive model of the individual risk of sporadic pancreatic cancer.
Despite the large sample size, a central limitation of this cohort study is its retrospective nature: it relied on existing data that were recorded for other purposes. Such data allow longer follow-up times, but usually at the expense of poorer, less systematically obtained information [26]. We used an independent population for model validation, but it was drawn from the same National Health Insurance data, so these results should be generalized with caution. Nevertheless, these risk prediction models for pancreatic cancer exhibited very good performance and may therefore be a valuable tool for identifying individuals at high risk of pancreatic cancer.
Another limitation of this study is that death from causes other than pancreatic cancer was treated as censored rather than as a competing risk for pancreatic cancer occurrence. However, other published risk prediction models for pancreatic cancer have used a similar censoring scheme for death [27-29], and we have used the same approach in several previous publications [11,30,31].
Because this model was developed in a Korean population, it is most applicable to Asian populations; further studies will be needed for other races.
Supporting Information
S1 Appendix. Risk prediction of developing pancreatic cancer within 8 years. (DOCX)
S1 Table. Pancreatic cancer incidence rates in the validation set. (DOCX)
S2 Table. Eight-year absolute risk estimates of pancreatic cancer in men with different factor profiles. (DOCX)
S3 Table. Eight-year absolute risk estimates of pancreatic cancer in women with different factor profiles.