Incidence and mortality rates of colorectal cancer have been rapidly increasing in Korea during the last few decades. Development of risk prediction models for colorectal cancer in Korean men and women is urgently needed to enhance its prevention and early detection.
Gender-specific five-year risk prediction models were developed for overall colorectal cancer, proximal colon cancer, distal colon cancer, colon cancer and rectal cancer. The models were developed using data from a population of 846,559 men and 479,499 women who participated in health examinations by the National Health Insurance Corporation. Examinees were 30–80 years old and free of cancer in the baseline years of 1996 and 1997. An independent population of 547,874 men and 415,875 women who participated in 1998 and 1999 examinations was used to validate the models. Model validation was done by evaluating performance in terms of discrimination and calibration ability using the C-statistic and Hosmer-Lemeshow-type chi-square statistics.
Age, body mass index, serum cholesterol, family history of cancer, and alcohol consumption were included in all models for men, whereas age, height, and meat intake frequency were included in all models for women. Models showed moderately good discrimination ability with C-statistics between 0.69 and 0.78. The C-statistics were generally higher in the models for men, whereas the calibration abilities were generally better in the models for women.
Citation: Shin A, Joo J, Yang H-R, Bak J, Park Y, Kim J, et al. (2014) Risk Prediction Model for Colorectal Cancer: National Health Insurance Corporation Study, Korea. PLoS ONE 9(2): e88079. https://doi.org/10.1371/journal.pone.0088079
Editor: Zhengdong Zhang, Nanjing Medical University, China
Received: July 14, 2013; Accepted: January 5, 2014; Published: February 12, 2014
Copyright: © 2014 Shin et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors have no support or funding to report.
Competing interests: The authors have declared that no competing interests exist.
Colorectal cancer is one of the most rapidly increasing cancers in the Korean population, with annual percent changes of 6.2% in men and 6.8% in women between 1999 and 2009. Although the mortality rate from colorectal cancer has started to decline in younger generations and in women, colorectal cancer still ranks as the fourth most common cause of cancer death.
Several risk prediction models for colorectal cancer have been developed and validated in different populations. The major roles of risk prediction models are: 1) to identify individuals at high risk of developing the disease, who can then be offered individually tailored clinical management, targeted screening, and interventions to reduce the burden of disease; and 2) to identify new risk factors for the disease through research.
Recent literature suggests that the distribution of molecular subtypes of colorectal cancer differs by subsite. In our previous study, we reported that risk factor profiles differed by sex and by the anatomical location of the colorectal cancer. Therefore, the focus of the present study was to develop risk prediction models for overall colorectal cancer, proximal colon cancer, distal colon cancer, and rectal cancer for the Korean population, utilizing a large set of health examination data.
This study was approved by the Institutional Review Board of the National Cancer Center, Korea (IRB no. NCCNCS 09-305). The need for participants' consent was waived by the ethics committee because this study involved routinely collected medical data that were anonymously managed in all stages, including data cleaning and statistical analyses.
Two independent populations were used in this study. The first data set, used for model development, consisted of men and women who participated in medical examinations provided by the National Health Insurance Corporation (NHIC) in 1996 and 1997. Details of the study design have been described elsewhere. Participants were asked to fill out self-administered questionnaires on alcohol consumption, cigarette smoking habits, regular exercise, family history of cancer, dietary preferences, and frequency of meat consumption. Additional information about female reproductive factors (i.e., age at menarche, age at first childbirth, menopausal status, and age at menopause) was also collected. Height and weight were measured directly, and body mass index (BMI) was calculated as the weight in kilograms divided by the height in meters squared.
The second data set, used for model validation, consisted of those who participated in medical examinations in 1998 and 1999. Those included in the final analysis were between 30 and 80 years old, had no previous history of cancer, and had no missing information for any of the major risk factor variables (i.e., height, weight, fasting serum glucose, total serum cholesterol, family history of cancer, cigarette smoking status (current/ex-/non-smokers), and alcohol consumption frequency). The number of study subjects included was 1,326,058 (846,559 men and 479,499 women) for the development set and 963,749 (547,874 men and 415,875 women) for the validation set.
The incidence of cancer was ascertained from the Korean Central Cancer Registry (KCCR) database, and death information was obtained from the Korean National Statistical Office, with follow-up to December 2007. The subsites of colorectal cancer were categorized by International Classification of Diseases, 10th edition (ICD-10) code as follows: proximal colon (C180–C185), distal colon (C186–C187), and rectum (C19–C20). Cancers with an overlapping lesion of the colon (C188) and those that were not otherwise specified (C189) were excluded from the subsite analysis.
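The subsite categorization above can be sketched as a small lookup. This is only an illustration: the function name `classify_subsite`, the "excluded" label, and the handling of a bare `C18` code without a fourth character are assumptions, not part of the study's SAS code.

```python
from typing import Optional


def classify_subsite(icd10_code: str) -> Optional[str]:
    """Map an ICD-10 code such as 'C182' or 'C18.2' to a colorectal subsite.

    Returns 'proximal colon', 'distal colon', 'rectum', 'excluded'
    (C188/C189, or a C18 code with no usable fourth character; the latter
    is an assumption), or None for codes outside C18-C20.
    """
    code = icd10_code.strip().upper().replace(".", "")
    if code.startswith("C19") or code.startswith("C20"):
        return "rectum"
    if code.startswith("C18"):
        if len(code) >= 4 and code[3].isdigit():
            digit = int(code[3])
            if digit <= 5:          # C180-C185
                return "proximal colon"
            if digit in (6, 7):     # C186-C187
                return "distal colon"
        # C188 (overlapping lesion), C189 (not otherwise specified)
        return "excluded"
    return None
```

For example, `classify_subsite("C18.7")` yields the distal colon category, while `classify_subsite("C189")` is flagged as excluded from the subsite analysis.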
Five models were developed for overall colorectal cancer, colon cancer, right (proximal) colon cancer, left (distal) colon cancer, and rectal cancer, separately for men and women. Cox proportional hazards regression models were used to develop the prediction equations in the development set. A colorectal cancer occurrence was counted as an event on the date of hospital admission recorded in the cancer registry data. Subjects were censored at the date of death ascertained from the death certificate database, or at the end of eight years of follow-up.
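The event and censoring rules described here can be illustrated with a short sketch that derives a follow-up time and an event indicator for one subject. The function name, the use of 365.25 days per year, and the tie-breaking rule (a diagnosis on the same day as death counts as an event) are assumptions made for the example only.

```python
from datetime import date
from typing import Optional, Tuple

MAX_FOLLOW_UP_YEARS = 8      # administrative censoring, as stated in the text
DAYS_PER_YEAR = 365.25       # assumption: average year length


def follow_up(baseline: date,
              diagnosis: Optional[date] = None,
              death: Optional[date] = None) -> Tuple[float, bool]:
    """Return (years of follow-up, event flag) for one subject.

    The earliest of diagnosis, death, and the 8-year limit ends follow-up;
    only a diagnosis counts as an event, everything else is censoring.
    """
    limit_days = MAX_FOLLOW_UP_YEARS * DAYS_PER_YEAR
    candidates = []
    if diagnosis is not None:
        candidates.append(((diagnosis - baseline).days, True))
    if death is not None:
        candidates.append(((death - baseline).days, False))
    candidates.append((limit_days, False))
    # min() keeps the first of equal keys, so a same-day diagnosis wins ties
    days, event = min(candidates, key=lambda c: c[0])
    return days / DAYS_PER_YEAR, event
```

A subject examined on 1 January 1996 and admitted with colorectal cancer on 1 January 1998 would contribute about two years of follow-up with an event, whereas a subject with neither diagnosis nor death would be censored at eight years.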
Crude and age-adjusted analyses were performed for each risk factor. Age and its quadratic term were centered by subtracting the mean age of the study participants. The risk factors considered for the models were age, age-squared, height, BMI, family history of cancer, fasting glucose, serum cholesterol, cigarette smoking habit, alcohol intake, and meat consumption frequency. All of the risk factors except age were included as categorical variables in the model. BMI was categorized according to the WHO criteria for the Asian population (<25.0 vs. ≥25.0 kg/m²). Height was divided into quartiles, and the first quartile was used as the reference. Forward, backward, and stepwise variable selection methods, with entry and exclusion criteria set at a type I error of 0.15, were considered in building the multivariate risk prediction model.
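As a rough illustration of the covariate preparation, the sketch below centers age and derives the BMI category. It assumes that the quadratic term is the square of the centered age, which is one common reading of the description; the function names and category labels are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def center_age(ages: List[float]) -> List[Tuple[float, float]]:
    """Return (centered age, squared centered age) for each participant.

    Assumption: the quadratic term is (age - mean age) squared.
    """
    mean_age = mean(ages)
    return [(a - mean_age, (a - mean_age) ** 2) for a in ages]


def bmi_category(weight_kg: float, height_m: float) -> str:
    """Classify BMI using the 25.0 kg/m^2 cut-off given in the text."""
    bmi = weight_kg / height_m ** 2
    return ">=25.0" if bmi >= 25.0 else "<25.0"
```

For three participants aged 40, 50, and 60, the centered ages are -10, 0, and 10, and the squared terms are 100, 0, and 100.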
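The five-year probability $P$ of developing colorectal cancer for an individual was estimated from the fitted Cox model as

$$P = 1 - S(t)^{\exp\left(\sum_{i=1}^{p} \beta_i \left(x_i - \bar{x}_i\right)\right)}$$

Here, $\beta_1, \ldots, \beta_p$ are the regression coefficient estimates, $x_1, \ldots, x_p$ are the risk factors for each individual, and $\bar{x}_1, \ldots, \bar{x}_p$ are the mean values for each risk factor in the study population. $S(t)$ is the baseline survival estimate at time $t$ ($t$ = 5 years) when all the risk factors are at their mean values.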
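A minimal sketch of how such a risk could be computed is given below, assuming the usual Cox-based form in which the baseline survival is raised to the exponential of the mean-centered linear predictor. The function name and all numerical values in the usage note are hypothetical and are not coefficients from this study.

```python
import math
from typing import Sequence


def five_year_risk(betas: Sequence[float],
                   x: Sequence[float],
                   means: Sequence[float],
                   baseline_survival: float) -> float:
    """Return 1 - S(t) ** exp(sum(beta_i * (x_i - mean_i))).

    betas: regression coefficient estimates
    x: the individual's risk factor values
    means: population mean of each risk factor
    baseline_survival: S(t) at t = 5 years with all factors at their means
    """
    linear_predictor = sum(b * (xi - mi) for b, xi, mi in zip(betas, x, means))
    return 1.0 - baseline_survival ** math.exp(linear_predictor)
```

With a hypothetical baseline survival of 0.99, an individual whose risk factors all equal the population means has a predicted risk of 1%; any positive coefficient combined with an above-average value raises the risk above that level.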
Discrimination was quantified by calculating the C-statistic for the survival model. The C-statistic is a concordance measure analogous to the area under the Receiver Operating Characteristic (ROC) curve for the logistic model. Its value indicates the probability that a model assigns a higher risk to those who develop colorectal cancer within five years of follow-up than to those who do not.
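To make the concordance idea concrete, the following sketch computes a Harrell-type C-statistic for censored data by direct pair counting. It is an illustrative stand-in and may differ in detail from the survival C-statistic used in the study; the quadratic-time loop is also only suitable for small examples, not for cohorts of this size.

```python
from typing import Sequence


def c_statistic(times: Sequence[float],
                events: Sequence[bool],
                risks: Sequence[float]) -> float:
    """Harrell-type concordance for right-censored data.

    A pair is usable when the subject with the shorter follow-up time had
    an event. It is concordant when that subject also has the higher
    predicted risk; tied risks count as half.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a usable pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

A value of 0.5 corresponds to chance-level ordering and 1.0 to perfect ordering of predicted risks relative to observed event times.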
A Hosmer-Lemeshow (H-L) type statistic was used for calibration. The statistic was calculated by first dividing the data into 10 groups (deciles) in ascending order of the predicted probabilities produced by the model. Then, in each decile, the average predicted probability was compared to the actual event rate estimated by the Kaplan-Meier approach. Values exceeding 20 can be considered to indicate a significant lack of calibration.
In addition, the expected (E) and the observed (O) numbers of cancer cases were compared for overall colorectal cancer and for each subsite. All statistical analyses were performed using SAS version 9.1 (SAS Institute, Cary, NC).
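The decile-based procedure can be sketched as follows. The text does not give the exact formula, so this example assumes a Nam-D'Agostino-style chi-square, in which each group contributes n(KM - p)^2 / (p(1 - p)), where KM is the Kaplan-Meier event probability and p is the mean predicted probability; the function names are hypothetical.

```python
from typing import Sequence


def km_event_probability(times: Sequence[float],
                         events: Sequence[bool],
                         horizon: float) -> float:
    """Kaplan-Meier estimate of the event probability by `horizon`."""
    ordered = sorted(zip(times, events))
    survival = 1.0
    at_risk = len(ordered)
    i = 0
    while i < len(ordered) and ordered[i][0] <= horizon:
        t = ordered[i][0]
        deaths = 0
        leaving = 0
        while i < len(ordered) and ordered[i][0] == t:
            deaths += 1 if ordered[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
        at_risk -= leaving
    return 1.0 - survival


def hl_type_chi_square(times: Sequence[float],
                       events: Sequence[bool],
                       predicted: Sequence[float],
                       horizon: float = 5.0,
                       groups: int = 10) -> float:
    """Sum over risk groups of n * (KM - mean predicted)^2 / (p * (1 - p))."""
    order = sorted(range(len(predicted)), key=lambda k: predicted[k])
    n = len(order)
    chi2 = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        p_bar = sum(predicted[k] for k in idx) / len(idx)
        observed = km_event_probability([times[k] for k in idx],
                                        [events[k] for k in idx], horizon)
        chi2 += len(idx) * (observed - p_bar) ** 2 / (p_bar * (1.0 - p_bar))
    return chi2
```

Under this form, a group whose mean predicted probability equals its Kaplan-Meier event rate contributes nothing to the statistic, and larger totals indicate poorer calibration.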
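The E/O comparison reduces to simple arithmetic, sketched below under the assumption that the expected count is the sum of individual predicted five-year risks and that the observed count is supplied as a plain number of cases. The study may have derived the observed number differently, for example from Kaplan-Meier estimates.

```python
from typing import Sequence


def expected_to_observed_ratio(predicted_risks: Sequence[float],
                               observed_cases: float) -> float:
    """Return E/O, where E is the sum of predicted risks in the group."""
    if observed_cases <= 0:
        raise ValueError("observed_cases must be positive")
    expected = sum(predicted_risks)
    return expected / observed_cases
```

A ratio near 1 indicates that the model predicts about as many cases as were observed; values above 1 indicate overprediction and values below 1 underprediction.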
During the follow-up period, 6,492 men and 2,655 women developed colorectal cancer in the development set. Among the men, there were 1,143 proximal colon cancers, 1,725 distal colon cancers, and 3,146 rectal cancers. Among the women, there were 604 proximal colon cancers, 606 distal colon cancers, and 1,252 rectal cancers. Cases with overlapping lesions of the colon or with cancers not otherwise specified (478 men and 193 women) were excluded from the subsite analyses.
In the validation set, 3,555 men and 1,969 women were diagnosed with colorectal cancer. Among the men, there were 605 proximal colon cancers, 909 distal colon cancers, and 1,764 rectal cancers. Among the women, there were 433 proximal colon cancers, 448 distal colon cancers, and 958 rectal cancers.
The risk factors included in the models
The risk factors included in the risk prediction models are listed in Table 1 (men) and Table 2 (women). Age, height, family history of cancer, and amount of alcohol consumed were included in all models for men. Body mass index was included in all models except the one for right colon cancer.
In women, age and height were included in all models. Fasting glucose and family history of cancer were included in all models except that for rectal cancer, and meat consumption frequency was included in all models except that for the right colon. BMI was included in the model for right colon only, and frequency of alcohol consumption was included in the model for rectal cancer only.
The discriminatory ability of the models was measured using the C-statistic in both the development and validation sets (Table 3). The C-statistics for the models for men ranged from 0.762 to 0.786, and those for the models for women ranged from 0.678 to 0.763. Models for colorectum (0.762 for development set and 0.779 for validation set), left colon (0.786 for development set and 0.779 for validation set), and rectum (0.753 for development set and 0.779 for validation set) showed the highest C-statistics in men, whereas the models for right colon showed the highest values in women (0.745 for development set and 0.763 for validation set).
Figures 1-A, 2-A, 3-A, 4-A, and 5-A show the calibration plots for the overall colorectal cancer model as well as E/O ratios of validation sets for male colorectal, right colon, left colon, rectal, and colon cancers, respectively, and Figures 1-B, 2-B, 3-B, 4-B, and 5-B show those for females. Table 3 presents the Hosmer-Lemeshow-type chi-square values. In general, the event rates predicted by the models for men were very close to the actual event rates. Among the models for men, only the model for left colon cancer did not show significant prediction power. In women, however, none of the models showed significant prediction ability.
Recent epidemiological and clinical information suggests that colon cancer and rectal cancer are distinct diseases. In addition, the proximal and distal colon differ in embryologic origin, morphologic appearance of the mucosa, physiological function, and bile acid composition. Among the several colorectal cancer risk prediction models that have been developed and validated, only one study provides separate models for the proximal colon, distal colon, and rectum. One study provided separate models for colon cancer and rectal cancer. Previously, we published an article on the risk factor profiles for different colorectal cancer subsites. The present prediction models were developed using the same dataset, with a longer follow-up period. In addition, an independent population was used for model validation.
The models showed moderately good discrimination ability. The model for overall colorectal cancer showed the best calibration ability. Among the models for women, that for right colon cancer showed the highest discrimination ability and that for left colon cancer showed the lowest C-statistic. Unfortunately, none of the models for women showed meaningful calibration ability. Still, our models showed C-statistics that were comparable with, or even higher than, those of other colorectal cancer risk prediction models. The C-statistics for three previous models were 0.67–0.71 for the Harvard Cancer Risk Index, 0.61 for the US study, and 0.62–0.66 for the Japanese study, whereas those for our models were 0.68–0.78. However, the model for left colon cancer in women did not reach a C-statistic of 0.7. Two studies provided calibration statistics as the ratio of observed to expected colorectal cancer events (O/E). The O/E ratios varied depending on the risk factor profile. In a Japanese model for men, the Hosmer-Lemeshow chi-square p-value was 0.08.
The incidence rate for colorectal cancer in women is two thirds that in men for the Korean population. Relatively low cancer incidence rates for women, compared to men, may restrict the statistical power of the models for women. Lack of detailed information about female-specific risk factors, such as reproductive and hormonal factors, may be another reason for the limited calibration ability.
The current risk prediction models aim to assess the probability of sporadic colorectal cancer. Hereditary colorectal cancer syndromes such as hereditary nonpolyposis colorectal cancer and familial adenomatous polyposis are known to account for up to 2% of overall colorectal cancers. The inclusion of hereditary cancer cases in our study cohort may have diluted the relative risks due to environmental factors.
The strengths of the current study include a large sample size and completeness of cancer follow-up by data linkage to cancer registration and death certificates. Limitations include limited information on dietary risk or protective factors such as calcium and fiber intake, and on non-dietary factors such as use of nonsteroidal anti-inflammatory drugs. Previous colonoscopy, which may reduce the incidence of cancer, was not considered in the models.
In conclusion, risk prediction models for colorectal cancer developed from large insurance-based data sets from the Korean population show reasonable discrimination ability. These models may help define groups at high risk for colorectal cancer and guide them to change risk behaviors as well as to undergo cancer screening.
Conceived and designed the experiments: AS BN JJ JK JHO. Analyzed the data: HY JB YP AS BN. Contributed reagents/materials/analysis tools: BN. Wrote the paper: AS BN.
- 1. Shin A, Kim KZ, Jung KW, Park S, Won YJ, et al. (2012) Increasing trend of colorectal cancer incidence in Korea, 1999–2009. Cancer Res Treat 44: 219–226.
- 2. Shin A, Jung KW, Won YJ (2013) Colorectal cancer mortality in Hong Kong of China, Japan, South Korea, and Singapore. World J Gastroenterol 19: 979–983.
- 3. Jung KW, Park S, Won YJ, Kong HJ, Lee JY, et al. (2012) Prediction of cancer incidence and mortality in Korea, 2012. Cancer Res Treat 44: 25–31.
- 4. Colditz GA, Atwood KA, Emmons K, Monson RR, Willett WC, et al. (2000) Harvard report on cancer prevention volume 4: Harvard Cancer Risk Index. Risk Index Working Group, Harvard Center for Cancer Prevention. Cancer Causes Control 11: 477–488.
- 5. de la Torre I, Diaz FJ, Anton M, Barragan E, Rodrigues J, et al. (2012) A Telematic Tool to Predict the Risk of Colorectal Cancer in White Men and Women: ColoRectal Cancer Alert (CRCA). J Med Syst 36: 2557–2564.
- 6. Freedman AN, Slattery ML, Ballard-Barbash R, Willis G, Cann BJ, et al. (2009) Colorectal cancer risk prediction tool for white men and women without known susceptibility. J Clin Oncol 27: 686–693.
- 7. Ma E, Sasazuki S, Iwasaki M, Sawada N, Inoue M (2010) 10-Year risk of colorectal cancer: development and validation of a prediction model in middle-aged Japanese men. Cancer Epidemiol 34: 534–541.
- 8. Park Y, Freedman AN, Gail MH, Pee D, Hollenbeck A, et al. (2009) Validation of a colorectal cancer risk prediction model among white patients age 50 years and older. J Clin Oncol 27: 694–698.
- 9. Selvachandran SN, Hodder RJ, Ballal MS, Jones P, Cade D (2002) Prediction of colorectal cancer by a patient consultation questionnaire and scoring system: a prospective study. Lancet 360: 278–283.
- 10. Wei EK, Colditz GA, Giovannucci EL, Fuchs CS, Rosner BA (2009) Cumulative risk of colon cancer up to age 70 years by risk factor status using data from the Nurses' Health Study. Am J Epidemiol 170: 863–872.
- 11. Win AK, Macinnis RJ, Hopper JL, Jenkins MA (2012) Risk prediction models for colorectal cancer: a review. Cancer Epidemiol Biomarkers Prev 21: 398–410.
- 12. Barault L, Charon-Barra C, Jooste V, de la Vega MF, Martin L, et al. (2008) Hypermethylator phenotype in sporadic colon cancer: study on a population-based series of 582 cases. Cancer Res 68: 8541–8546.
- 13. Jass JR (2007) Classification of colorectal cancer based on correlation of clinical, morphological and molecular features. Histopathology 50: 113–130.
- 14. Shin A, Joo J, Bak J, Yang HR, Kim J, et al. (2011) Site-specific risk factors for colorectal cancer in a Korean population. PLoS One 6: e23196.
- 15. D'Agostino RB, Nam BH (2003) Evaluation of the performance of survival analysis models: Discrimination and calibration measures. Handbook of Statistics, vol 23. pp. 1–25.
- 16. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: 29–36.
- 17. D'Agostino RB Sr, Grundy S, Sullivan LM, Wilson P, CHD Risk Prediction Group (2001) Validation of the Framingham coronary heart disease prediction scores: results of a multiple ethnic groups investigation. JAMA 286: 180–187.
- 18. Hong TS, Clark JW, Haigis KM (2012) Cancers of the colon and rectum: identical or fraternal twins? Cancer Discov 2: 117–121.
- 19. Bufill JA (1990) Colorectal cancer: evidence for distinct genetic categories based on proximal or distal tumor location. Ann Intern Med 113: 779–788.
- 20. McMichael AJ, Potter JD (1985) Host factors in carcinogenesis: certain bile-acid metabolic profiles that selectively increase the risk of proximal colon cancer. J Natl Cancer Inst 75: 185–191.
- 21. Jung KW, Park S, Kong HJ, Won YJ, Lee JY, et al. (2012) Cancer statistics in Korea: incidence, mortality, survival, and prevalence in 2009. Cancer Res Treat 44: 11–24.
- 22. Shin A, Song YM, Yoo KY, Sung J (2011) Menstrual factors and cancer risk among Korean women. Int J Epidemiol 40: 1261–1268.
- 23. Aaltonen LA, Salovaara R, Kristo P, Canzian F, Hemminki A, et al. (1998) Incidence of hereditary nonpolyposis colorectal cancer and the feasibility of molecular screening for the disease. N Engl J Med 338: 1481–1487.
- 24. Evans DG, Walsh S, Jeacock J, Robinson C, Hadfield L, et al. (1997) Incidence of hereditary non-polyposis colorectal cancer in a population-based study of 1137 consecutive cases of colorectal cancer. Br J Surg 84: 1281–1285.
- 25. Vargas AJ, Thompson PA (2012) Diet and nutrient factors in colorectal cancer risk. Nutr Clin Pract 27: 613–623.
- 26. Thun MJ, Jacobs EJ, Patrono C (2012) The role of aspirin in cancer prevention. Nat Rev Clin Oncol 9: 259–267.