Risk Prediction Model for Colorectal Cancer: National Health Insurance Corporation Study, Korea

Purpose Incidence and mortality rates of colorectal cancer have been rapidly increasing in Korea during the last few decades. Development of risk prediction models for colorectal cancer in Korean men and women is urgently needed to enhance its prevention and early detection. Methods Gender-specific five-year risk prediction models were developed for overall colorectal cancer, proximal colon cancer, distal colon cancer, colon cancer and rectal cancer. The models were developed using data from a population of 846,559 men and 479,499 women who participated in health examinations by the National Health Insurance Corporation. Examinees were 30–80 years old and free of cancer in the baseline years of 1996 and 1997. An independent population of 547,874 men and 415,875 women who participated in 1998 and 1999 examinations was used to validate the models. Model validation was done by evaluating performance in terms of discrimination and calibration ability using the C-statistic and Hosmer-Lemeshow-type chi-square statistics. Results Age, body mass index, serum cholesterol, family history of cancer, and alcohol consumption were included in all models for men, whereas age, height, and meat intake frequency were included in all models for women. Models showed moderately good discrimination ability with C-statistics between 0.69 and 0.78. The C-statistics were generally higher in the models for men, whereas the calibration abilities were generally better in the models for women. Conclusions Colorectal cancer risk prediction models were developed from large-scale, population-based data. Those models can be used for identifying high-risk groups and developing preventive intervention strategies for colorectal cancer.


Introduction
Colorectal cancer is one of the most rapidly increasing cancers in the Korean population, with annual percent changes of 6.2% in men and 6.8% in women between 1999 and 2009 [1]. Although the mortality rate from colorectal cancer has started to decline in younger generations and in women [2], colorectal cancer is still ranked the fourth most common cause of cancer death [3].
Several risk prediction models for colorectal cancer have been developed and validated in different populations [4][5][6][7][8][9][10][11]. The major roles of risk prediction models are: 1) to identify individuals at high risk of developing the disease who can then be offered individually tailored clinical management, targeted screening and interventions to reduce the burden of disease and 2) to identify new risk factors for the disease through research [11].
Recent literature suggests that the distribution of molecular subtypes of colorectal cancer differs by subsite [12,13]. In our previous study, we reported that risk factor profiles differed by sex and by the anatomical location of the colorectal cancer [14]. Therefore, the focus of the present study was to develop colorectal cancer risk prediction models for overall colorectal cancer, proximal colon cancer, distal colon cancer, and rectal cancer for the Korean population, using a large set of health examination data.

Study population
This study was approved by the Institutional Review Board of the National Cancer Center, Korea (IRB no. NCCNCS 09-305). The need for participants' consent was waived by the ethics committee because this study involved routinely collected medical data that were anonymously managed in all stages, including data cleaning and statistical analyses.
Two independent populations were used in this study. The first data set was used for model development and consisted of men and women who participated in medical examinations provided by the National Health Insurance Corporation (NHIC) between 1996 and 1997. Details of the study design have been described elsewhere [14]. Participants were asked to fill out self-administered questionnaires on alcohol consumption, cigarette smoking habits, regular exercise, family history of cancer, dietary preferences, and frequency of meat consumption. Additional information about female reproductive factors (i.e., age at menarche, age at first childbirth, menopausal status, and age at menopause) was also collected. Height and weight were measured directly, and body mass index (BMI) was calculated as the weight in kilograms divided by the height in meters squared.
The second data set was used for model validation and consisted of those who participated in medical examinations in 1998 and 1999. Those who were included in the final analysis were between 30 and 80 years old, without a previous history of cancer, and with no missing information for any of the major risk factor variables (i.e., height, weight, fasting serum glucose, total serum cholesterol, family history of cancer, cigarette smoking status (current/ex-/non-smokers), and alcohol consumption frequency). The number of study subjects included was 1,326,058 (846,559 men and 479,499 women) for the development set, and 963,749 (547,874 men and 415,875 women) for the validation set.

Cancer Ascertainment
The incidence of cancer was ascertained from the Korean Central Cancer Registry (KCCR) database, and death information from the Korean National Statistical Office up to December 2007. The subsites of colorectal cancer were categorized by the International Classification of Diseases, 10th edition (ICD-10) code as follows: proximal colon (C180-C185), distal colon (C186-C187), and rectum (C19-C20). Cancers with an overlapping lesion of the colon (C188) and those that were not otherwise specified (C189) were excluded from the subsite analysis.
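The subsite grouping above can be sketched in a few lines of Python. This is an illustrative helper only: the function name `classify_subsite` and the handling of dotted codes such as "C18.2" are assumptions, not part of the registry's or the authors' processing.

```python
from typing import Optional


def classify_subsite(icd10: str) -> Optional[str]:
    """Map an ICD-10 code to a colorectal subsite.

    Returns None for codes that the subsite analysis excludes
    (C188 overlapping, C189 unspecified) and for non-colorectal codes.
    """
    # Assumption: accept both "C182" and "C18.2" spellings.
    code = icd10.strip().upper().replace(".", "")
    if code[:3] in ("C19", "C20"):
        return "rectum"
    if code[:3] == "C18" and len(code) >= 4 and code[3].isdigit():
        digit = int(code[3])
        if digit <= 5:
            return "proximal colon"  # C180-C185
        if digit in (6, 7):
            return "distal colon"    # C186-C187
    return None
```

Cases that return None for a C18 code would still count toward overall colorectal cancer and colon cancer; they are only left out of the subsite-specific models.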

Statistical analysis
Five models were developed for overall colorectal cancer, colon cancer, right colon cancer, left colon cancer, and rectal cancer, separately for men and women. Cox proportional-hazards regression models were used to develop the prediction equations in the development set. Colorectal cancer occurrences were counted as an event on the date of hospital admission recorded in the Cancer Registration data. Subjects were censored at the date of death ascertained from the death certificate database, or on the end date after eight years of follow-up. Crude and age-adjusted analyses were performed for each risk factor. Age and the quadratic term of age were centered by subtracting the mean age of the study participants. The risk factors considered for the models were age, age-squared, height, BMI, family history of cancer, fasting glucose, serum cholesterol, cigarette smoking habit, alcohol intake, and meat consumption frequency. All of the risk factors except age were included as categorical variables in the model. BMI was categorized according to the WHO criteria for the Asian population (<25.0 vs. ≥25.0 kg/m²). Height was divided into quartiles and the first quartile was used as the reference. Variable selection methods (forward, backward and stepwise) with selection and exclusion criteria of type I error 0.15 were considered in the multivariate model to build the risk prediction model.
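The covariate coding described above (centered age, its quadratic term, a BMI cut-off at 25.0 kg/m², and height quartiles) might look like the following minimal sketch. The function and key names are hypothetical, the quadratic term is assumed to be the square of the centered age, and the quartile cut points are taken from whatever sample is passed in (in practice this would presumably be done separately for men and women).

```python
from statistics import mean, quantiles


def prepare_covariates(ages, heights_cm, bmis):
    """Build per-subject covariates (illustrative, not the authors' code)."""
    mean_age = mean(ages)
    # Three cut points that split height into quartiles.
    q1, q2, q3 = quantiles(heights_cm, n=4)
    rows = []
    for age, height, bmi in zip(ages, heights_cm, bmis):
        age_c = age - mean_age
        # Quartile 1 (shortest) serves as the reference category.
        height_quartile = 1 + (height > q1) + (height > q2) + (height > q3)
        rows.append({
            "age_c": age_c,
            "age_c_sq": age_c ** 2,  # assumption: square of the centered age
            "bmi_ge_25": int(bmi >= 25.0),
            "height_quartile": height_quartile,
        })
    return rows
```

Centering age before squaring keeps the linear and quadratic terms less correlated, which is a common reason for this step in regression models.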
The baseline survival estimate for the mean values of the risk factors for time t (t = 5 years) was estimated by an equation in which β1, …, βk are the regression coefficient estimates.
Discrimination was quantified by calculating the C-statistic for the survival model [15]. The C-statistic is a concordance measure analogous to the area under the Receiver Operating Characteristic (ROC) curve for the logistic model [16]. The value indicates the probability that a model produces a higher risk for those who develop colorectal cancer within five years of follow-up than for those who do not develop colorectal cancer [16].
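As a rough illustration of the two quantities discussed above, the sketch below computes a five-year absolute risk from a Cox model and a concordance statistic. Both parts rest on assumptions: the risk formula is the commonly used form 1 − S0(t)^exp(Σ βi(xi − mean_i)), which may not match the authors' exact equation, and the concordance function is Harrell's C, which may differ in detail from the survival C-statistic of reference [15]. All names are hypothetical.

```python
import math


def five_year_risk(x, means, betas, s0_5yr):
    """Absolute risk at 5 years: 1 - S0(5) ** exp(sum(beta_i * (x_i - mean_i))).

    s0_5yr is the baseline survival at 5 years for a subject whose
    risk factors all equal their mean values (assumed form).
    """
    lp = sum(b * (xi - mi) for b, xi, mi in zip(betas, x, means))
    return 1.0 - s0_5yr ** math.exp(lp)


def harrell_c(times, events, risks):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable when subject i has an observed event and
    subject j is followed for longer than i. The pair is concordant
    when i, who failed earlier, was given the higher predicted risk.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[j] > times[i]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / usable if usable else float("nan")
```

The pairwise loop is quadratic in the number of subjects, so a cohort of this size would need a more efficient implementation; the sketch is meant only to make the definition concrete.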
A Hosmer-Lemeshow (H-L) type χ² statistic was used for calibration [15]. The χ² statistic was calculated by first dividing the data into 10 groups (deciles) by ascending order of predicted probabilities produced by the model. Then, in each decile, the average predicted probabilities were compared to the actual event rate estimated by the Kaplan-Meier approach. Values exceeding 20 can be considered a significant lack of calibration [17]. In addition, the expected (E) and the observed (O) numbers of cancer cases were compared for overall colorectal cancer and each subsite. All statistical analyses were performed using SAS version 9.1 (SAS Institute, Cary, NC).
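The calibration procedure described above can be sketched as follows. This is a minimal illustration, not the authors' SAS code: the per-group term n·(KM − p̄)² / (p̄(1 − p̄)) is an assumed form of the H-L type statistic for survival data, and the function names are hypothetical.

```python
def km_event_prob(times, events, horizon):
    """Kaplan-Meier estimate of the probability of an event by `horizon`."""
    surv = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        failed = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - failed / at_risk
    return 1.0 - surv


def hl_type_chi2(pred, times, events, horizon=5.0, groups=10):
    """Return (chi2, observed, expected) across groups of predicted risk.

    Assumes every group's mean predicted risk lies strictly between 0 and 1.
    """
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    n = len(order)
    chi2 = observed = expected = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        p_bar = sum(pred[i] for i in idx) / len(idx)
        km = km_event_prob([times[i] for i in idx],
                           [events[i] for i in idx], horizon)
        chi2 += len(idx) * (km - p_bar) ** 2 / (p_bar * (1.0 - p_bar))
        observed += len(idx) * km      # Kaplan-Meier based case count
        expected += len(idx) * p_bar   # model-based case count
    return chi2, observed, expected
```

With the default of 10 groups, the returned χ² value can be read against the threshold of 20 mentioned above, and the observed and expected counts give the O and E totals for the same horizon.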

Results
During the follow-up period, 6,492 men and 2,655 women developed colorectal cancer in the development set. Among the men, there were 1,143 proximal colon cancers, 1,725 distal colon cancers, and 3,146 rectal cancers. Among the women, there were 604 proximal colon cancers, 606 distal colon cancers, and 1,252 rectal cancers. Cases with overlapping lesions of the colon or with cancers not otherwise specified were excluded from the subsite analysis (478 men and 193 women).
In the validation set, 3,555 men and 1,969 women were diagnosed with colorectal cancer. Among the men, there were 605 proximal colon cancers, 909 distal colon cancers, and 1,764 rectal cancers. Among the women, there were 433 proximal colon cancers, 448 distal colon cancers, and 958 rectal cancers.

The risk factors included in the models
The risk factors included in the risk prediction models are listed in Table 1 (men) and Table 2 (women). Age, height, family history of cancer, and amount of alcohol consumed were included in all models for men. Body mass index was included in all models except the one for right colon cancer.
In women, age and height were included in all models. Fasting glucose and family history of cancer were included in all models except that for rectal cancer, and meat consumption frequency was included in all models except that for right colon cancer. BMI was included in the model for right colon cancer only, and frequency of alcohol consumption was included in the model for rectal cancer only.

Model performance
Discrimination. The discriminatory ability of the models was measured using the C-statistic in both the development and validation sets (Table 3). The C-statistics for the models for men ranged from 0.762 to 0.786, and those for the models for women ranged from 0.678 to 0.763. Models for colorectum (0.762 for development set and 0.779 for validation set), left colon (0.786 for development set and 0.779 for validation set), as well as rectum (0.753 for development set and 0.779 for validation set) showed the highest C-statistics in men, whereas models for right colon showed the highest values in women (0.745 for development set and 0.763 for validation set).
Calibration. Figures 1-A, 2 and Table 3 present the Hosmer-Lemeshow-type chi-square values. In general, the event rates predicted by the models were very close to the actual event rates in the models for men. Only the models for left colon cancer in men did not show significant prediction power. In women, however, none of the models showed significant prediction ability.

Discussion
Recent epidemiological and clinical information suggests that colon cancer and rectal cancer are distinct diseases [12,13,18]. In addition, the proximal and distal colon differ in embryologic origin, morphologic appearance of the mucosa, physiological function, and bile acid composition [19,20]. Among several colorectal cancer risk prediction models developed and validated [11], only one study provides separate models for proximal and distal colon, and rectum [6]. One study provided separate models for colon cancer and rectal cancer [7]. Previously, we published an article on the risk factor profiles for different colorectal cancer subsites [14]. The prediction models were developed using the same dataset for the model development with a longer follow-up period. In addition, an independent population was used for model validation.
The models showed moderately good discrimination ability. The model for overall colorectal cancer showed the best calibration ability. Among the models for women, that for right colon cancer showed the highest discrimination ability and that for left colon cancer showed the lowest C-statistic. Unfortunately, none of the models showed any meaningful calibration ability. Still, our models showed C-statistics that were comparable with, or even higher than, those of other colorectal cancer risk prediction models [11]. The C-statistics for three previous models were 0.67-0.71 for the Harvard Cancer Risk Index, 0.61 for the US study, and 0.62-0.66 for the Japanese study, whereas those for our models were 0.68-0.78 [11]. However, the model for left colon cancer in women did not reach a C-statistic of 0.7. Two studies provided calibration statistics as the ratio of observed vs. expected colorectal cancer events (O/E) [7,8]. The O/E ratios varied depending on the risk factor profile [7,8]. In a Japanese model for men, the Hosmer-Lemeshow chi-square p-value was 0.08 [7].
The incidence rate for colorectal cancer in women is two thirds that in men for the Korean population [21]. Relatively low cancer incidence rates for women, compared to men, may restrict the statistical power of models for women. Lack of detailed information about female-specific risk factors such as reproductive and hormonal factors may be another reason for the limited power of calibration [22].
The current risk prediction models aim to assess the probability of sporadic colorectal cancer risk. Hereditary colorectal cancer syndromes such as hereditary nonpolyposis colorectal cancer and familial adenomatous polyposis are known to account for up to 2% of overall colorectal cancers [23,24]. Mixing hereditary cancer cases into our study cohort may have diluted the relative risks due to environmental factors.
The strengths of the current study include a large sample size and completeness of cancer follow-up by data linkage to cancer registration and death certificates. Limitations include limited information on dietary risk or protective factors such as calcium and fiber intake [25], or non-dietary factors such as nonsteroidal anti-inflammatory drugs [26]. Previous colonoscopy, which may reduce the incidence of cancer, was not considered in the model.
In conclusion, risk prediction models for colorectal cancer developed using large insurance-based data sets from the Korean population show reasonable discrimination ability. These models can help define groups at high risk for colorectal cancer and help guide them to change risk behaviors as well as to undergo cancer screening.