Prediction Model for Gastric Cancer Incidence in Korean Population

Background: Predicting high-risk groups for gastric cancer and motivating these groups to receive regular checkups is required for the early detection of gastric cancer. The aim of this study was to develop a prediction model for gastric cancer incidence based on a large population-based cohort in Korea.
Method: Based on the National Health Insurance Corporation data, we analyzed 10 major risk factors for gastric cancer. The Cox proportional hazards model was used to develop gender-specific prediction models for gastric cancer development, and the performance of the developed models in terms of discrimination and calibration was validated using an independent cohort. Discrimination ability was evaluated using Harrell's C-statistic, and calibration was evaluated using a calibration plot and slope.
Results: During a median of 11.4 years of follow-up, 19,465 (1.4%) and 5,579 (0.7%) newly developed gastric cancer cases were observed among 1,372,424 men and 804,077 women, respectively. The prediction models included age, BMI, family history, meal regularity, salt preference, alcohol consumption, smoking and physical activity for men, and age, BMI, family history, salt preference, alcohol consumption, and smoking for women. The prediction models showed good accuracy and predictability in both the developing and validation cohorts (C-statistics: 0.764 for men, 0.706 for women).
Conclusions: In this study, a prediction model for gastric cancer incidence was developed that displayed good performance.


Introduction
Gastric cancer is the fourth most common cancer in the world, and approximately 1 million new cases are diagnosed annually worldwide [1]. Although the incidence has decreased substantially in most parts of the world, gastric cancer remains the most common cancer and the third most common cause of death from cancer in Korea [2,3].
The prognosis of patients with gastric cancer differs greatly according to pathological stage. The 5-year survival rate of patients with stage IA gastric cancer is 95.1-98.9% in Korea; however, this figure declines to 26.1-32.2% in patients with stage IIIC [4][5][6]. For patients who undergo palliative chemotherapy for stage IV disease, an overall survival of approximately 1 year is expected worldwide [7,8]. This great difference in survival according to stage suggests that early detection before tumor progression is important for a good prognosis. For early detection, regular screening is essential, and regular screening was reported to be associated with lower mortality from gastric cancer in previous population-based cohort studies [9][10][11].
In Korea, the national gastric cancer screening program has been running since 1999 as a part of the National Cancer Screening Program [12]. The target population of the National Cancer Screening Program was less than 50% of National Health Insurance beneficiaries in 2005 and was extended to the entirety of the National Health Insurance beneficiaries in 2010. Therefore, all beneficiaries older than 40 were advised to undergo a gastroscopy or upper gastrointestinal series examination every 2 years.
The screening rate was 34.4% in 2004 and increased to 64.6% in 2011. Nevertheless, a significant proportion of eligible patients still do not undergo gastric cancer screening. Public indifference to mass screening and the unawareness of the risk factors for developing gastric cancer might be related to the low screening rate. Therefore, the identification of high risk populations and the notification of such populations may have a significant effect on improving the survival rate.
A risk prediction model is a simple and effective method used to evaluate individualized risk by quantifying cancer risk. However, few studies have established risk prediction models for gastric cancer incidence using epidemiological risk factors [13]. In this study, we conducted a systematic investigation of the potential risk factors of gastric cancer using a large population-based cohort in Korea, with the aim of developing a risk prediction model for gastric cancer incidence.

Study population
The study cohort consisted of Korean government employees, teachers, company employees and their dependents who underwent a biennial medical examination provided by the National Health Insurance Corporation (NHIC) between the years 1996 and 1997. After excluding the recipients who were under 30 or over 80, who had a previous cancer history, or who were diagnosed with gastric cancer between the years 1996 and 1997, we identified 2,291,132 individuals (1,436,958 men and 854,174 women) with data on baseline characteristics. Ten risk factors were considered for modeling: age, body mass index (BMI), family history of any type of cancer, meal regularity, salt preference, frequency of meat consumption, dietary preference, alcohol consumption, smoking and physical activity. However, the NHIC data have a large proportion of missing data because most lifestyle information was obtained by self-report questionnaires. After the recipients who had missing data on any one of the risk factors were excluded, only 823,741 (57.3%) men and 369,554 (43.3%) women remained.
The proportion of excluded recipients because of missing data was considerably high; thus, we complemented the data using the imputation method. We were able to do this because the NHIC examination was provided every two years, and we were able to retrieve some information from the NHIC examination data performed in the years other than 1996 and 1997. When a participant received multiple examinations, the nearest time point was used to impute the missing values. Finally, 1,372,424 (95.5%) men and 804,077 (94.1%) women were available for model development after imputation. The difference in prediction models developed based on the complete data and imputed data with the nearest observations was minor (S1 File); therefore, the model development and validation were based on the imputed, larger data set.
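The nearest-time-point rule can be illustrated with a minimal sketch. The function name `impute_nearest`, the year-keyed record layout, and the tie-breaking rule (the earlier examination wins) are assumptions made for illustration only; the text does not describe how the imputation was actually implemented.

```python
from typing import Dict, Optional


def impute_nearest(records: Dict[int, Optional[str]], baseline_year: int) -> Optional[str]:
    """Return the baseline answer, or the answer from the closest other exam year."""
    baseline_value = records.get(baseline_year)
    if baseline_value is not None:
        return baseline_value
    # Collect (distance in years, exam year, answer) for every non-missing exam.
    observed = [
        (abs(year - baseline_year), year, value)
        for year, value in records.items()
        if value is not None
    ]
    if not observed:
        return None  # nothing to borrow; the participant stays missing
    # Smallest distance wins; on a tie the earlier exam is used (an assumption).
    _, _, nearest_value = min(observed)
    return nearest_value


# Hypothetical participant: salt preference missing at the 1996 baseline exam.
salt_by_year = {1996: None, 1998: "Salty", 2002: "Not salty"}
print(impute_nearest(salt_by_year, 1996))  # prints "Salty"
```

Participants with no usable examination in any year remain missing under this rule, which is consistent with the cohort not reaching 100% coverage after imputation.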
To evaluate the performance of the developed model, an independent population who underwent the National Health Insurance Corporation medical evaluation between the years 1998 and 1999 was used as a validation cohort. Among all eligible recipients, we excluded recipients who were included in the model development cohort in addition to recipients who met the same exclusion criteria. Similar missing data imputation was applied, and finally a total of 484,335 men (4.3% missing) and 466,013 women (3.5% missing) were included in the validation cohort.
This study was approved by the Institutional Review Board of the National Cancer Center, Korea (IRB no. NCCNCS 09-305).

Data collection and risk factor assessment
During the health examination, the weight, height and blood pressure of each participant were measured as part of the routine physical examination. Additionally, the participants completed a questionnaire about family history of any type of cancer, previous disease history, dietary habits, alcohol consumption, smoking and physical activity. Each question had simple choices because it was self-recorded, and the categories of dietary habits and physical activity were subjective ones such as 'Regular', 'Intermediate', and 'Irregular' for meal regularity, and 'Not salty', 'Intermediate', and 'Salty' for salt preference. Based on these simple questionnaires, the risk factors for gastric cancer development were analyzed.

Cancer ascertainment and identification of death
Data for gastric cancer incidence were obtained from the Korean Central Cancer Registry database through December 31, 2007. Based on the International Classification of Diseases, 10th revision, code C16 was used to identify incident gastric cancer. Deaths and causes of death were identified from the death records of the National Statistical Office, which is a nationwide registration of deaths, and the National Health Insurance Corporation.

Statistical analysis
A Cox proportional hazards regression model was used to estimate the relative risks (and corresponding 95% confidence intervals (CIs)) of gastric cancer incidence for each of the potential risk factors. The proportionality in hazards was examined via log-log survival plots. We noticed that the demographic characteristics and environmental exposures were different between men and women, and both crude and age (at baseline) adjusted analyses were performed separately for men and women.
The potential risk factors considered in the analysis were BMI (<18.5, 18.5-22.9, 23.0-24.9, and ≥25), family history of any type of cancer (yes or no), meal regularity (regular, intermediate, irregular), salt preference (not salty, intermediate, salty), frequency of meat consumption (≤1, 2-3, and ≥4 times per week), dietary preference (vegetables preferred, mixed, and meat preferred), alcohol consumption (none, i.e., 0 g; light, i.e., 1-14.9 g; moderate, i.e., 15.0-24.9 g; and heavy, i.e., ≥25 g of ethanol per day), smoking (never, former, current <10, current 10-19, and current ≥20 cigarettes per day), and physical activity (none, light, moderate, and heavy). For women, because of the small number of incidences, several categories of the alcohol consumption and smoking variables were combined. For alcohol consumption, those with more than 15 g of ethanol were combined, and only two categories were used for smoking (never, smoker). Further descriptions of the rationale of the categorizations of these variables can be found elsewhere [14,15].
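As an illustration of how continuous measurements map onto these categories, the following sketch encodes the BMI and alcohol cut-offs for men. It reads the open-ended top categories as "at or above" the stated threshold; the function names and the returned labels are hypothetical.

```python
def bmi_category(bmi: float) -> str:
    """Map BMI (kg/m^2) to the four categories used for modeling."""
    if bmi < 18.5:
        return "<18.5"
    if bmi < 23.0:
        return "18.5-22.9"
    if bmi < 25.0:
        return "23.0-24.9"
    return ">=25"


def alcohol_category(ethanol_g_per_day: float) -> str:
    """Map daily ethanol intake (g) to none / light / moderate / heavy."""
    if ethanol_g_per_day <= 0:
        return "none"
    if ethanol_g_per_day < 15.0:
        return "light"
    if ethanol_g_per_day < 25.0:
        return "moderate"
    return "heavy"


print(bmi_category(22.9), alcohol_category(30.0))  # prints "18.5-22.9 heavy"
```

For women, the text combines the moderate and heavy alcohol groups, so a corresponding function would return a single label for intakes of 15 g or more.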
A backward variable selection method with a type I error criterion of 0.1 based on likelihood ratio tests was considered in the multivariable model. The probability of developing gastric cancer within t years (t = 8) for an individual with covariate values $x = (x_1, \ldots, x_K)$ for K risk factors can be estimated using the following equation:
$$P(t \mid x) = 1 - S_0(t)^{\exp\left(\sum_{i=1}^{K} \beta_i x_i\right)}$$
Here, $S_0(t)$ is the mean survival probability at time t for an individual whose covariate values are all 0, and the $\beta_i$ are the estimated coefficients from the Cox proportional hazards model. Once the $\beta_i$ and $S_0(t)$ are obtained, the probability of developing gastric cancer for any set of covariate values can be estimated.
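A minimal sketch of this calculation, assuming the standard Cox form P = 1 - S0(t)^exp(Σ βi xi), is given below. The baseline survival, coefficients and covariate values in the example are hypothetical and are not estimates from this study.

```python
import math
from typing import Sequence


def predicted_risk(baseline_survival: float,
                   betas: Sequence[float],
                   x: Sequence[float]) -> float:
    """Probability of the event by time t under a Cox model.

    baseline_survival is S0(t), the survival probability at time t when
    every covariate equals 0; betas and x are aligned coefficient and
    covariate vectors.
    """
    linear_predictor = sum(b * xi for b, xi in zip(betas, x))
    return 1.0 - baseline_survival ** math.exp(linear_predictor)


# Hypothetical numbers for illustration only.
s0 = 0.99
betas = [0.3, 0.2]
print(predicted_risk(s0, betas, [0, 0]))  # reference profile: 1 - S0(t)
print(predicted_risk(s0, betas, [1, 0]))  # one risk factor present
```

With all covariates at 0 the exponent is 1, so the predicted risk reduces to 1 - S0(t); any positive coefficient applied to a present risk factor raises the exponent and therefore the risk.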
The developed models were validated in an independent cohort population by evaluating the performance of the models with respect to their discrimination ability using C-statistics, and the calibration ability was evaluated using a calibration plot and calibration slope [16][17][18][19][20][21].
Harrell's C-statistic for survival data was used in this study [18][19][20]. This value represents the probability that the predicted probability of developing gastric cancer is higher for those who actually develop gastric cancer in 8 years than for those who do not develop gastric cancer. Calibration is related to the accuracy of the prediction. To generate a calibration plot, the data were first divided into 10 disjoint subgroups according to the predicted probabilities of developing gastric cancer based on the developed model. The expected (the average predicted probabilities) and observed (the actual event rate measured by the Kaplan-Meier estimate) values were then plotted. Additionally, to obtain the calibration slope, the prognostic index (PI) from the Cox regression, which is the weighted linear combination of the variables selected for the prediction model, was obtained for individuals in the validation data set, and the regression coefficient on the PI was obtained. A slope close to 1 indicates good calibration, and a likelihood-ratio test of whether this slope is 1 is then performed [22].
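For readers unfamiliar with the concordance measure, the sketch below computes Harrell's C by direct pairwise comparison. It follows the usual textbook definition (a pair is usable when the subject with the shorter time had an observed event; tied risk scores count one half) and is quadratic in the number of subjects, so it is meant as an illustration and not as the routine used in SAS or Stata for a cohort of this size. The function name and toy data are hypothetical.

```python
from typing import Sequence


def harrell_c(times: Sequence[float],
              events: Sequence[int],
              risk_scores: Sequence[float]) -> float:
    """Harrell's concordance index for right-censored data (O(n^2) version)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a usable pair
        for j in range(n):
            # Usable pair: subject i had the event strictly before time j.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: higher predicted risk always goes with the earlier event.
print(harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1]))  # prints 1.0
```

A value of 0.5 corresponds to predictions no better than chance, and 1.0 to perfect ranking of subjects by risk.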
All the analyses were performed using SAS (version 9.1.3; SAS Institute, Cary, NC) and STATA (version 13) software.

Ethics statement
This study was performed with the approval of the institutional review board of the National Cancer Center, Korea (No. NCCNCS 09-305). The requirement for participants' informed consent was waived by the institutional review board because this study involved routinely collected medical data that were anonymously managed in all stages, including the stages of data cleaning and statistical analyses.

Cancer incidence and baseline characteristics
The total number of person-years of follow-up was 14,815,612 for men and 8,471,357 for women for a median of 11.3 years of follow-up. The mean (SD) ages of the men and women were 45.1 (10.5) and 48.7 years (11.0), respectively. During follow-up, 19,465 (1.4%) and 5,579 (0.7%) cases of gastric cancer were observed among 1,372,424 men and 804,077 women, respectively, resulting in incidence rates of 131.38/100,000 and 65.89/100,000 person years for each sex. In the validation cohort, a total of 6,628 and 2,920 gastric cancer cases were observed out of 484,335 men and 466,013 women, respectively. The incidence rates in the validation cohort were 164.54/100,000 for men and 75.84/100,000 for women; these rates were higher than those observed in the model developing cohort.
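The incidence rates quoted above follow from dividing the number of cases by the person-years of follow-up. The short check below uses the counts reported in this paragraph; the function name is hypothetical.

```python
def incidence_rate_per_100k(cases: int, person_years: float) -> float:
    """Crude incidence rate per 100,000 person-years."""
    return cases / person_years * 100_000


men = incidence_rate_per_100k(19_465, 14_815_612)
women = incidence_rate_per_100k(5_579, 8_471_357)
print(f"men: {men:.2f}, women: {women:.2f}")
```

With the rounded person-years given in the text, this yields 131.38 per 100,000 for men, matching the reported value, and about 65.86 for women, close to but not exactly the reported 65.89; the small difference presumably reflects person-time figures not shown at full precision.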

Risk factors
To evaluate the significant risk factors of gastric cancer incidence, a multivariable analysis was performed based on the variable selection criteria. Tables 1 and 2 show the incidences of gastric cancer and the estimated hazard ratio for each of the potential risk factors for men and women, respectively. For men, the significant risk factors of gastric cancer incidence were age, low weight, having a family member who had previously had any type of cancer, irregular meals, salt preference, alcohol consumption, and smoking. Among these risk factors, a clear trend of increased risk was observed for alcohol consumption and smoking (linear trend test P <0.0001 for both variables). Heavy drinkers (ethanol ≥25 g/day) had a more than 20.4% increased risk, and heavy smokers (currently 1 pack per day) had a more than 43.1% increased risk of gastric cancer incidence. Additionally, those who had a family member with any type of cancer had a 30.2% increased risk, and irregular meal consumption and a preference for salty food also conferred an increased risk. Conversely, a BMI ≥23 kg/m² and moderate to high physical activity were protective factors. For women, the significant risk factors were age, BMI, having a family member who had any type of cancer, and former smoking. Salt preference and alcohol consumption were marginally significant (P < 0.1; these variables were thus included in the model), and meal regularity and physical activity had no effect on gastric cancer incidence in women.
Table 1. Risk factor distributions between gastric cancer patients and gastric cancer-free patients (men), and age-adjusted univariable and multivariable model in the developing cohort.

Prediction model
Based on the multivariable analysis results, we developed gender-specific prediction models as follows (A for men, B for women).
A. Risk prediction model for men.
Step 1: Form a prognostic index (PI) using the β-coefficient estimates.
Step 2: Calculate the probability $P = 1 - S(t \mid t = 8)^{\exp(\mathrm{PI})}$, in which S(t|t = 8) is the survival probability estimate for the mean values of the risk factors in the model. Here, S(t|t = 8) = 0.9939406.
B. Risk prediction model for women.
Step 1: Form a prognostic index (PI) using the β-coefficient estimates.
Step 2: Calculate the probability P.
The receiver operating characteristic curve analysis was performed to evaluate the discrimination ability of the developed model, and the C-statistics were 0.764 (95% CI, 0.760-0.768) for men and 0.706 (95% CI, 0.698-0.715) for women. The calibration ability was also evaluated by the calibration plot, and the predicted and actual probability of gastric cancer development appeared to be almost identical in each risk group (Fig 1(A) for men and 1(B) for women).
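Step 2 of the men's model can be sketched numerically using the reported value S(t|t = 8) = 0.9939406. Because the β-coefficients of the PI are not reproduced in this text, the sketch takes the PI as an input and also shows the inverse calculation; the function names are hypothetical, and the formula is read as P = 1 - S^exp(PI).

```python
import math

S8_MEN = 0.9939406  # survival estimate S(t|t = 8) reported for the men's model


def risk_from_pi(pi: float, survival: float = S8_MEN) -> float:
    """8-year probability of gastric cancer for a given prognostic index."""
    return 1.0 - survival ** math.exp(pi)


def pi_from_risk(risk: float, survival: float = S8_MEN) -> float:
    """Prognostic index implied by a given 8-year probability (0 < risk < 1)."""
    return math.log(math.log(1.0 - risk) / math.log(survival))


print(f"risk at PI = 0: {risk_from_pi(0.0):.4%}")
print(f"PI implied by a 2.87% risk: {pi_from_risk(0.0287):.3f}")
```

Under this reading, a PI of 0 corresponds to an 8-year risk of about 0.61% (that is, 1 - 0.9939406), and each unit increase in the PI multiplies the exponent by e.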

Model validation
In the validation cohort, the mean ages (SD) of men and women were 46.8 (12.8) and 51.1 years (12.1), respectively. The age-adjusted hazard ratios of the risk factors in the validation cohort are presented in Tables 3 and 4. Similar results were found for men, and only meal regularity had marginal significance. For women, a family history of cancer, salt preference, and vegetable preference were found to be significant risk factors, and a BMI ≥25 was a marginally protective factor. Unlike the model developing cohort, smoking was not a significant risk factor in the validation cohort. These results were possibly derived from the small event sizes and shorter follow-up period of the validation cohort compared to that of the model developing cohort.
For the model validation, the 8-year survival rates of the patients in the validation cohort were estimated using the coefficients of the risk factors estimated from the original model developing cohort. Based on the estimated survival rates, the discrimination and calibration abilities of the model in the validation cohort were then obtained (Fig 2(A) for men and 2(B) for women). The C-statistics were 0.782 (95% CI, 0.777-0.787) and 0.705 (95% CI, 0.696-0.714) for men and women, respectively, and the discrimination ability of the prediction models was as good as that in the model developing cohort. Fig 2(A) and 2(B) show the calibration plots for each gender, which indicate good calibration for gastric cancer development. The calibration slope, which is the regression coefficient of the PI using the validation data set, was 0.980 for men (P = 0.15) and 0.953 for women (P = 0.07), which indicates good calibration.
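The points of a calibration plot of the kind described in the statistical analysis (ten risk groups, mean predicted probability against the Kaplan-Meier event rate) can be sketched as follows. This is a plain-Python illustration under simplifying assumptions, namely equal-sized groups formed by sorting on predicted risk; the function names and any data passed to them are hypothetical.

```python
from typing import List, Sequence, Tuple


def km_event_probability(times: Sequence[float],
                         events: Sequence[int],
                         horizon: float) -> float:
    """1 minus the Kaplan-Meier survival estimate at the given horizon."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        failed = sum(1 for ti, e in zip(times, events) if e and ti == t)
        survival *= 1.0 - failed / at_risk
    return 1.0 - survival


def calibration_points(predicted: Sequence[float],
                       times: Sequence[float],
                       events: Sequence[int],
                       horizon: float = 8.0,
                       groups: int = 10) -> List[Tuple[float, float]]:
    """Return (expected, observed) pairs, one per risk group."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    points = []
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue  # fewer subjects than groups
        expected = sum(predicted[i] for i in idx) / len(idx)
        observed = km_event_probability([times[i] for i in idx],
                                        [events[i] for i in idx], horizon)
        points.append((expected, observed))
    return points
```

Plotting the observed value against the expected value for each group, points lying close to the diagonal indicate good calibration.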

Illustration of predicted risk probability based on various risk profiles
In Figs 3 and 4, the estimated probabilities of developing gastric cancer within 8 years are presented for men (Fig 3) and women (Fig 4) for ages 40 (top row), 50 (middle row) and 60 (bottom row). The left and right panels present these estimates for subjects who have family members with and without any cancer, respectively. For men, the leftmost plot represents the risk probability for a person with the worst risk combination, that is, a thin person (BMI <18.5 kg/m²) who is a heavy smoker and drinker, has irregular meals, prefers salty food and does not exercise. The risk probabilities of a man with the same risk combination except alcohol consumption or except smoking are presented in the next two plots. Then, the plot for a man with the same risk combination but without alcohol consumption or smoking is presented. Finally, the plot for a man without any risk factors is presented.
For example, we can consider a man who is 50 years old with a family member with any type of cancer. The probability of developing gastric cancer within 8 years can be as high as 2.87% under the worst risk combination. If a man who has the same risk combination does not drink alcohol, the risk is 2.39%; if both smoking and alcohol consumption are removed from the risk combination, the risk decreases to 1.68%. Finally, the lowest possible risk value of 1.08% is found in a man with no risk factors, which is less than half the value for the worst combination.
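The example for a 50-year-old man with a family history of cancer can be restated as relative reductions from the worst-case risk. The sketch below uses only the four percentages quoted in the text; the dictionary labels and function name are hypothetical.

```python
# 8-year risks (%) quoted in the text for a 50-year-old man with a family history.
risks = {
    "worst combination": 2.87,
    "no alcohol": 2.39,
    "no alcohol or smoking": 1.68,
    "no risk factors": 1.08,
}


def relative_reduction(reference: float, new: float) -> float:
    """Fractional reduction in risk relative to the reference profile."""
    return (reference - new) / reference


worst = risks["worst combination"]
for profile, risk in risks.items():
    print(f"{profile}: {risk:.2f}% ({relative_reduction(worst, risk):.1%} lower than worst)")
```

On these figures, removing alcohol lowers the risk by roughly 17%, removing both alcohol and smoking by roughly 41%, and removing all modifiable risk factors by roughly 62%, consistent with the statement that the best profile carries less than half the worst-case risk.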

Discussion
Many epidemiological studies have evaluated the risk factors of gastric cancer incidence. However, there have been only a few studies that have developed prediction models for gastric cancer incidence. In the present study, gender-specific predictive models for gastric cancer incidence were developed and validated based on a large population-based cohort. Low weight, a family history of cancer, irregular meals, preference for salty food, alcohol consumption, smoking, and a lack of physical activity were associated with developing gastric cancer in men; low weight, a family history of cancer, preference for salty food, alcohol consumption, and smoking were associated with developing gastric cancer in women. Risk factors for gastric cancer incidence have been revealed in previous studies. The most typical risk factor is H. pylori infection; this has been classified as carcinogenic to humans since 1994 [22]. Smoking has also been acknowledged as one of the causes of gastric cancer by the International Agency for Research on Cancer since 2004 [23]. Probable risk factors include preferences for salt, salty and smoked foods, and heavy alcohol consumption [24,25]. Conversely, green-yellow vegetables, allium vegetables and fruits, and citrus fruits are probable protective factors [24,26]. Moreover, red and processed meats, haem iron, and obesity (for cardia) are possible risk factors, whereas estrogen is a possible protective factor [27][28][29][30]. Family history is also associated with gastric cancer incidence, with an odds ratio ranging from 2 to 10 [31,32]. Among these known risk factors, we included 10 risk factors that could easily be collected by a simple physical examination or questionnaire. Data for H. pylori infection and specific foods such as allium vegetables could not be collected because an invasive procedure for H. pylori and a trained interviewer for specific foods were not included in this study.
Table 3. Risk factor distributions between gastric cancer patients and gastric cancer-free patients (men), and age-adjusted univariable model in the validation cohort.
In this study, we observed that BMI was a protective factor in the male population. Previously, some studies showed an increased risk of gastroesophageal cancer incidence in overweight subjects, and other studies reported no significant relationship between being overweight and overall gastric cancer incidence [33][34][35]. However, a recent meta-analysis revealed that being overweight is a protective factor for non-cardia cancer, and a similar pattern of hazard ratios was observed in a large-scale cohort study [36,37]. Because the majority of gastric cancer cases were located in the distal part of the stomach in Korea and we had a huge sample size, BMI likely had a statistical significance as a protective factor in this study.
Previously, a Korean prediction model for gastric cancer was reported in 2009 [13]. In that study, only three hospitals participated, and fewer than 200 subjects were included in each of the case and control groups. In contrast, our prediction models were derived from a nationwide database of more than two million participants comprising government employees, teachers, company employees and their dependents, which can represent the entire Korean population because these occupational groups comprise a large proportion of it. The other advantage of this study is that an external validation using a large population was performed, whereas the previous model was not validated in independent data. Moreover, 16 factors were included in the previous model, which makes it somewhat complicated to apply nationwide; in contrast, we included only 8 factors for men and 6 factors for women to predict gastric cancer. Using this model, we can simply predict the risk of gastric cancer development, and high-risk groups can easily be identified.
This model can be used when a primary physician counsels healthy individuals after a routine check-up. A primary physician can warn about risk factors for gastric cancer after taking a simple history, and examinees may take the warning more seriously when it is accompanied by an explicit probability of gastric cancer. Moreover, if this prediction model becomes known to the general Korean population, people with high risk factors could be motivated to undergo routine checkups. Additionally, more frequent and intensive screening programs can be implemented for high-risk populations. Such active screening could allow gastric cancer to be detected at an earlier stage and might ultimately lower gastric cancer related mortality.
This study had a few limitations. First, H. pylori information was not available because all the data were collected through routine physical examinations. In many countries, H. pylori examination is not included in gastric cancer screening because it requires invasive procedures such as blood sampling or endoscopic examinations and is costly. However, because this model does not require the invasive procedure of H. pylori examination, it can predict the risk of gastric cancer development and can be applied to a larger population.
Second, socioeconomic status, educational attainment, and specific foods such as fish, soybeans, allium vegetables, and tea were not considered in this study [38][39][40][41][42]. For these data, a trained interviewer and effort on the part of the interviewee, such as keeping a diet diary, are required. These complicated data can be helpful in developing a more refined model; however, such a model can also be difficult to generalize.
Third, this study is not free from recall bias because dietary patterns were assessed by questionnaire. Additionally, the categories of meal regularity and salt or meal preference were very simple and subjective. However, we suggest that these simplified subjective categories of dietary patterns can provide a widely available model for the general population.
Fourth, this study did not assess the risk probability of developing gastric cancer according to tumor location. In some previous studies, smoking and a high BMI tended to increase the risk of cardia cancer, and salty food was positively associated with noncardia (distal) gastric cancer [43][44][45]. Possibly because of the small number of incidences of cardia cancer, no meaningful distinction between risk factors for cardia and distal cancers was observed in the current study (data not shown). Further study would be worth pursuing.
Fifth, the data provided by the NHIC contained a large amount of missing data, and we imputed these missing values based on the data of the nearest time point. This method may not be optimal; however, we suspected that most of the variables did not change within a short period of time. When we developed a prediction model with complete data as a comparison, only meal regularity for men and BMI and salt preference for women were eliminated in the model with complete data because of the reduced statistical significance resulting from the reduced sample size. Therefore, we concluded that the effect of the missing data on the model development would be minor.
Fig 3. Predicted probability of developing gastric cancer within 8 years for men at ages 40 (top), 50 (middle) and 60 (bottom). The left and right panels present these estimates of subjects with and without family history of any cancer, respectively. The risk combinations for each category of the X-axis are as follows. Worst corresponds to a BMI <18.5 kg/m²; Meal regularity, Irregular; Salt preference, Salty; Alcohol consumption, ≥25 g/day; Smoking, 1 pack currently; and Physical activity, None. -Drinking is the same except that Alcohol consumption is 0, and -Smoking is the same except that the Smoking amount is None. -Both is the same except that the Smoking amount is None and the Alcohol consumption is 0. Best corresponds to a BMI ≥25; Meal regularity, Regular; Salt preference, Not salty; Alcohol consumption, 0; Smoking amount, Never; and Physical activity, Moderate to high. doi:10.1371/journal.pone.0132613.g003

Sixth, this prediction model was validated in a similar Korean population, so its applicability may be limited to the Korean population.
In conclusion, we can assess the risk of gastric cancer incidence using age, BMI, a family history of any cancer, meal regularity, salt preference, alcohol consumption, smoking, and physical activity for men, and using age, BMI, a family history of any cancer, salt preference, alcohol consumption, and smoking for women. This simple tool for the general public may be helpful to educate and motivate individuals to participate in screening programs.
Supporting Information
S1 File. (DOCX)
Table A. Risk factor distributions between gastric cancer patients and gastric cancer-free patients (men), and age-adjusted univariable and multivariable model in the complete developing cohort.
Table B. Risk factor distributions between gastric cancer patients and gastric cancer-free patients (women), and age-adjusted univariable and multivariable model in the complete developing cohort.
Table C. Risk factor distributions between gastric cancer patients and gastric cancer-free patients (men), and age-adjusted univariable model in the complete validation cohort.
Table D. Risk factor distributions between gastric cancer patients and gastric cancer-free patients (women), and age-adjusted univariable model in the complete validation cohort.