Development and validation of a multi-dimensional diagnosis-based comorbidity index that improves prediction of death in men with prostate cancer: Nationwide, population-based register study

Assessment of comorbidity is crucial for confounding adjustment and prediction of mortality in register-based studies, but the commonly used Charlson comorbidity index is not sufficiently predictive. We aimed to develop a multidimensional diagnosis-based comorbidity index (MDCI) that captures comorbidity better than the Charlson Comorbidity index. The index was developed based on 286,688 men free of prostate cancer randomly selected from the Swedish general population, and validated in 54,539 men without and 68,357 men with prostate cancer. All ICD-10 codes from inpatient and outpatient discharges during 10 years prior to the index date were used to define variables indicating frequency of code occurrence, recency, and total duration of related hospital admissions. Penalized Cox regression was used to predict 10-year all-cause mortality. The MDCI predicted risk of death better than the Charlson comorbidity index, with a c-index of 0.756 (95% confidence interval [CI] = 0.751, 0.762) vs 0.688 (95% CI = 0.683, 0.693) in the validation cohort of men without prostate cancer. Men in the lowest vs highest MDCI quartile had distinctively different survival in the validation cohort of men with prostate cancer, with an overall hazard ratio [HR] of 5.08 (95% CI = 4.90, 5.26). This was also consistent within strata of age and Charlson comorbidity index, e.g. HR = 5.90 (95% CI = 4.65, 7.50) in men younger than 60 years with CCI 0. These results indicate that comorbidity assessment in register-based studies can be improved by use of all ICD-10 codes and taking related frequency, recency, and duration of hospital admissions into account.


Introduction
Accurate measures of comorbidity are needed to predict mortality, control confounding and/ or define relevant population strata in register-based observational studies.This is highly relevant for studies of men with prostate cancer, who often are above 65 years of age and have a high risk of death from other causes.The widely used Charlson comorbidity index and modifications thereof are based on the occurrence (Yes/No) of ICD-codes for a limited number of pre-specified diagnoses [1][2][3] and a large portion of men in prostate cancer studies have Charlson comorbidity index 0 [4][5][6][7], which hampers the prediction of mortality risk [5,8,9].The restriction to a limited number of ICD-codes means that potentially relevant information from other codes that may be predictive is lost.A recently created drug comorbidity index avoided such selection by including all filled prescriptions up to a year before index date and clearly outperformed the Charlson comorbidity index [10].Furthermore, there is more information in administrative health care registers than the mere occurrence of diagnostic codes.Other aspects such as frequency of occurrence of a code, time since latest appearance of a code (recency), and total duration of hospital admission associated with a code may also be predictive [5].

Aim of the study
The aim of this study was to create a comorbidity index to be used in observational registerbased studies.We hypothesized that predictive ability could be improved by extracting multiple quantitative aspects of all available diagnostic codes in a national patient register using a data-driven approach to identify the most important codes and quantitative aspects of code occurence.We compared the ability of this novel index to predict long-term mortality with the Charlson comorbidity index in men with and without prostate cancer.

Materials
We used data from the Prostate Cancer data Base Sweden version 5 (PCBaSe 5) [11].In PCBaSe 5, the National Prostate Cancer Register of Sweden has been linked by use of the unique Swedish Personal Identity Number to other healthcare registers held by the Swedish National Board of Health and Welfare, such as the Swedish Cancer Register, the Cause of Death Register [12], the National Prescribed Drug Register and the National Patient Register [13].The capture rate of the National Prostate Cancer Register is above 96% compared to the Swedish Cancer register to which reporting is mandated by law [14].The primary aim of NPCR is to provide data for quality assurance for cancer carer and adherence to National guidelines [15,16].There are only modest differences in demographics, cancer treatment, comorbidity, and mortality between the men in NPCR and men only registered in the Cancer Register, indicating that information in NPCR can be generalized to all men with prostate cancer in Sweden.
PCBaSe 5 includes men diagnosed with prostate cancer between 1998 and 2020.For each prostate cancer case PCBaSe also includes five randomly selected comparison men, free from prostate cancer at the date of diagnosis of the corresponding case (index date), matched on birth year and county of residence.
Data was collected between January 1, 1998, and December 31, 2021, and accessed for research purposes June 1, 2022.The study population included all men in PCBaSe 5 with an index date between January 1, 2008, and December 31, 2014.The development cohort included all comparison men in PCBaSe 5 with an index date between January 1, 2008 and December 31, 2013.Two validation cohorts were used.The first cohort included all comparison men in PCBaSe 5 with an index date in 2014, and the second cohort all prostate cancer cases diagnosed between January 1, 2008, and December 31, 2014.Temporal rather than random splitting of the data is expected to reduce the risk for overly optimistic estimates of model performance.
Cancer Database Sweden (PCBaSe), which is based on the National Prostate Cancer Register (NPCR) of Sweden and linkage to several national health-data registers.The data cannot be shared publicly because the individual-level data contain potentially identifying and sensitive patient information and cannot be published due to legislation and ethical approval (https:// etikprovningsmyndigheten.se).Use of the data from national health-data registers is further restricted by the Swedish Board of Health and Welfare (https://www.socialstyrelsen.se/en/)and Statistics Sweden (https://www.scb.se/en/) which are Government Agencies providing access to the linked healthcare registers.The data will be shared on reasonable request in an application made to any of the steering groups of NPCR and PCBaSe (contact npcr@npcr.se).For detailed information, please see www.npcr.se/in-english,where registration forms, manuals, and annual reports from NPCR are available alongside a full list of publications from PCBaSe.The code used for the analyses can be accessed via the following DOI (0.5281/zenodo.8210683;https://zenodo.org/doi/10.5281/zenodo.8210683).
All codes according to the 10 th revision of International Statistical Classification of Diseases and Related Health Problems (ICD-10) registered as related to hospitalizations or specialist outpatient visits up to 10 years prior to the index date were extracted from the inpatient and specialist outpatient sub-registers of the National Patient Register.The National Patient Register comprises information on all in-hospital care and out-patient specialist care in Sweden.It has nation-wide coverage of in-patient care since 1987 and specialized outpatient care since 2001.During the study period, diagnoses were recorded according to the Swedish clinical modification of ICD-10 which has few modifications compared to the original ICD-10 version.
The Swedish Prescribed Drug Register, used in this study to calculate a drug comorbidity index, contains details of all prescriptions dispensed in Sweden since July 1, 2005.Drugs are identified by a unique identifier for each specific combination of brand name, substance, formulation and package.The register only includes filled prescriptions, and not medicines sold over the counter or medicines administered directly to the patient during in-patient care, outpatient care or primary care.
Dates of death until 31 December 2020 were extracted from the Cause of Death Register for the follow-up of mortality.The Swedish cause of death register contains information on all deaths of Swedish residents since 1952.

Methods
We developed a multidimensional diagnosis-based comorbidity index (MDCI) primarily to predict the risk of death of all causes within 10 years after each individuals index date, based on ICD-10 codes from inpatient stays and specialist outpatient visits.In complementary analyses the predictive performance with 1-and 5-year follow-up of mortality after the index date was also evaluated, using the same approach.The study adhered to the Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines (S1 Appendix).
The Swedish Ethical Review Authority approved of the study [220-03437 and the need for consent was waiwed.Data was pseudonymized by the National Board of Health and Welfare prior to being delivered to the researchers.

MDCI development strategy
The overall strategy was to derive predictors describing occurrence, frequency, recency, and duration of hospital admission of all ICD-10 codes identified in the National Patient Register, include all potential predictors in a Cox regression model, and use a variable selection strategy to minimise overfitting.To compute a patient's MDCI, one sums the coefficients from this model related to the selected predictors based on the ICD-10 codes observed for each patient during the 10 years preceding the index date.In an exploratory analysis using the development dataset we also evaluated the impact of length of lookback period for both the Charlson comorbidity index and the MDCI, alone and in combination with the drug comorbidity index.The process is described below and in greater detail in S2 Appendix.

ICD-10 code granularity
The character positions in the ICD-10 code represent subcategories of a disease or condition.In practice, this allows registering of diagnoses with codes with fewer characters than the most detailed level specified by the ICD-10 coding system, or using a trailing "9" to indicate "unspecified".The use of such truncated codes may also vary between settings and over time.To address this the registered ICD-10 codes were first processed in a data management step involving code truncation and elongation, pruning of unnecessary codes, and filtering codes present in at least 0.01% of the development cohort (S1 Fig and S2 Appendix).

Definition of predictors
Each ICD-10 code could contribute with information on four levels depending on the number of characters used.For example, the code I731 (thromboangiitis obliterans) could contribute information based on two (I7), three (I73), four (I731), and five characters (I7319).For each of the four variants, code-specific predictors of mortality were generated and used in the regression models, describing occurrence, frequency, recency, and total duration of hospital admission in the 10 years preceding the index date.One predictor definition reflected simple occurrence of a code as primary diagnosis, and another indicated occurrence as either primary or secondary diagnosis.The predictors for frequency, recency, and duration were only based on codes registered as primary diagnosis.Frequency was categorized as occurrence on �2, �3, or �4 unique dates.Recency was categorized as occurence within 90, 180 and 365 days prior to the index date.Duration of hospital admission was categorized by dummy variables indicating if the total number of days in hospital exceeded 7 or 14 days.These cutoffs were prespecified based on clinical reasoning with the intention to separate patients with short observational and mainly diagnostic stays from patients with severe conditions and/or reduced functional status.

Variable selection and model fitting
We used a Cox proportional hazards regression model with all the predictors as covariates [8,20] and applied regularization by use of the elastic net, [21,22], to simultaneously perform both variable selection and handle collinearity.Ten-fold cross-validation over a grid of 100 values of the hyper-parameter that controls the amount of penalization was used to identify the model with the highest concordance index (c-index) [23,24].The analysis was performed using R version 4.1.3and glmnet version 4.1-2 [25].

Validation
The performance of the MDCI was compared to the Charlson comorbidity index and the drug comorbidity index in the two validation cohorts separately, using the c-index and calibration curves [26].We computed the c-index for mortality at 1, 5 and 10 years of follow-up after the index date, and computed 95% confidence intervals (CIs) for the c-indices using a bootstrap procedure with 1000 replications.The Charlson comorbidity index was calculated based on ICD-10 codes registered as a primary or secondary diagnosis in the National Patient Register during the 10-year period preceding the index date [27].When the Charlson comorbidity index was calculated in the validation cohort with prostate cancer cases we excluded ICD-10 codes for prostate cancer (C61), and metastases (C77-80) if they were registered in conjunction with C61.This is in line with previous adaptations of the Charlson comorbidity index to cancer populations [28].The drug comorbidity index was calculated based on all drug prescriptions registered in the National Prescribed Drug Register during the 365-day period preceding the index date [8].

Survival analysis
Survival was assessed with Kaplan-Meier curves and 1-year hazard ratios (HR) estimated using a Cox proportional hazards model [29].Analyses were stratified by age, Charlson comorbidity index, the drug comorbidity index, and the MDCI.The MDCI was categorized by using the 25%, 50% and 75% quantiles of the distribution of the MDCI in the development cohort.The same procedure was applied to categorize the drug comorbidity index.

Study participants
The study population consisted of 409,584 men with an age range of 32-102 years (Table 1).The development and validation cohorts were similar in terms of age, Charlson comorbidity index, drug comorbidity index, and ICD-10 code chapters registered during the 10-year period preceding the index date.Median length of follow-up for mortality was 8 years in the development cohort and 30% (85,468 men) died within 10 years (Table 2).There were 10,325 unique ICD-10 codes registered.Of these, 45% (5612) were observed in at least 0.01% of the development cohort and these codes generated 56,120 potential predictors of mortality (S1 Fig).

Selected codes and code dimensions
The MDCI derived after the variable selection process included information from 978 unique ICD-10 codes, out of which 58% (564) had 4 or 5 characters.These codes generated 1543 corresponding predictors of mortality, out of which 56% (870) reflected occurrence, 17% (261) frequency, 17% (264) recency, and 10% (148) duration of hospital admission (S2 Fig) .As examples, the predictors for mortality derived from ICD-10 codes for cardiovascular diseases are demonstrated (Fig 1).The example illustrates how predictors reflecting occurrence, frequency, recency, and duration of hospital admission of ICD-10 codes independently contributed with predictive information to a varying degree depending on the specific diagnosis.For cardiovascular diseases additional predictive ability was sometimes also added from codes with 5 characters.For diseases such as arrythmia, cardiomyopathy, cardiac arrest, and heart failure, additional information was often obtained using 5 characters and from all four dimensions of predictors.Similar distributions of predictors related to all other diseases are shown in (S3 Fig).Some categories that were not expected to contribute, such as carcinoma in situ (D0) for malignancies, in fact added predictive information.
In complementary analyses we derived the MDCI also using 1 and 5 years of follow-up for mortality after the index date.The number of informative diagnosis codes decreased with shorter follow-up.For the 1-year MDCI, 369 unique ICD-10 codes were selected, compared to 885 unique ICD-10 codes for the 5-year MDCI (S2 Fig) .Based on these ICD-10 codes there were 547 predictors selected for the 1-year MDCI and 1378 for the 5-year MDCI.The distributions between factors reflecting occurrence, frequency, recency, and duration were comparable to that seen for the MDCI developed from 10-year follow-up for mortality (S2 and S3 Figs).
The c-indices from the cross-validation procedure can be found in S1 Table .All included predictors and their coefficients are listed in S2 Table.
The MDCI based on 1-year of follow-up for mortality correlated well with the MDCI based on a 10-year follow-up (correlation coefficient 0.90-0.93)(S4 Table ).The discrimination was overall slightly lower, but still consistently higher compared to the Charlson comorbidity index and the drug comorbidity index (Table 2).The pattern was also consistent with shorter follow-up resulting in higher discrimination (Table 2).Calibration curves comparing predicted and observed 10-year probabilities of death indicated good calibration of the MDCI, and also illustrates the increased separation of mortality risk provided by the MDCI compared to the Charlson comorbidity index (Fig 2).Calibration was similar for predicted 5-year mortality risk but less optimal for prediction of 1-year mortality, where the MDCI tended to underestimate mortality risk in strata with low mortality and overestimate the risk in strata with higher mortality risk (S4 Fig).
In complementary analyses, decreasing the length of the lookback period to 5 years or 1 year for the Charlson comorbidity index or for the MDCI clearly decreased the predictive ability, e.g. using a lookback of one year the C-index for the Charlson comorbidity index and MDCI were 0.599 (95% CI: 0.597-0.601)and 0.637 (0.635-0.639), respectively (S3 Table ).
When the comorbidity indices were combined with the drug comorbidity index the overall predictive ability increased and the impact of lookback was less pronounced, although a lookback of 10 years produced the largest C-indices both for Charlson comorbidity index and the MDCI, 0.724 (95% CI: 0.722-0.725)and 0.751 (95% CI: 0.749-0.752),respectively (S3 Table ).

Added value for prediction of mortality in comparison with other comorbidity indices
The MDCI identified groups of comparison men with distinctively different survival probability in the validation cohort, also within strata according to the Charlson comorbidity index or the drug comorbidity index ( The correlation between the MDCI and the drug comorbidity index was low (correlation coefficient �0.64) (S3 Table ).
The MDCI was particularly good at discriminating risk of death at the extremes of the scale, i.e., for men in the top and bottom quartile of the MDCI.In the validation cohort consisting of men without prostate cancer, there was an increased risk of death in the highest vs In each cohort, the calibration plot was obtained by computing the predicted 10-year survival probability for each individual using a Cox proportional hazards model including the MDCI or Charlson comorbidity index, respectively as predictor, and a corresponding estimate of the baseline hazard function.The observed survival probability was computed as the average 10-year survival estimated by the Kaplan-Meier curve within the intervals of the predicted survival probabilities using the cutoffs 0%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% and 100%.The predicted survival probability was similarly computed as the average of all individual survival probabilities among individuals within each interval.Probability intervals containing zero individuals are not shown.

Summary of main results
Our new multidimensional diagnosis-based comorbidity index (MDCI) based on occurrence, recency, frequency, and duration of hospital admissions for all ICD-10 codes predicted risk of death better than the commonly used Charlson comorbidity index.This was demonstrated consistently across strata of both age and other comorbidity indices.Notably, the MDCI was also able to stratify men with Charlson comorbidity index = 0 into strata with clearly different risk of death.This supports the hypothesis that relevant quantitative information related to disease severity can be extracted from diagnosis codes in longitudinal register data.The prediction model developed using 10 years of follow-up performed better than models based on shorter follow-up.

Comparison with other comorbidity indices
The commonly used versions of the Charlson comorbidity index [27,30] are based on the occurrence of a limited number of pre-specified diagnostic codes in an administrative health care register.In contrast, we assessed the predictive value of all ICD-10 codes available in the Swedish National Patient Registry without considering prior knowledge of the potential predictive value of the codes.In our analysis, we also included information on the number of occurrences during a specified time interval, the duration of hospitalization with a code as a main diagnosis, and the time interval from the latest hospital discharge to index date, as well as if the code was registered as a main or secondary diagnosis.
The concept of such a multi-dimensional comorbidity index has recently been demonstrated to be valid for intensive care based on 36 predefined comorbidity categories and predictors indicating the number of hospital admissions during the past 5 years, duration of hospitalization and time since latest admission [5].This comorbidity index identified groups with clear differences in survival also within strata of the Charlson comorbidity index and age.Our study provides further support for the concept of a multi-dimensional diagnosis-based comorbidity index.Other attempts in this direction have been made.A recent study using Medicare claims linked with Health and Retirement Study data developed seven dichotomous markers of disease severity, including markers based on the number of disease-associated outpatient visits, emergency department visits, and hospitalizations made by an individual over a defined interval [31].These markers did not, however, meaningfully predict outcomes such as decline in activities of daily living or mortality.
The strata of individuals with Charlson comorbidity index 0 is often large [4][5][6] but heterogeneous in terms of risk of death [5,8], which we confirmed in our study.The improved ability of the MDCI to discriminate risk of death in this group, and in subgroups defined by age, suggests that the MDCI should be preferred over the Charlson index when adjusting for comorbidity or predicting life expectancy in register-based studies of prostate cancer.Since diagnosis codes reflect information from hospitalizations or specialist out-patient care, even better prediction can possibly be achieved when combining the MDCI with the newly created drug comorbidity index [8] that summarizes information also from primary care.Our results suggest that the MDCI and the drug comorbidity index reflect different aspects of a patient's medical history and therefore should be used in combination, preferably with a lookback of 5-10 years for the MDCI.

Strengths and limitations
Strengths of our study include that the development cohort comprised data from more than 280,000 men with a complete 10-year history of ICD-10 codes registered in a National Patient Register that has high validity [13].Furthermore, we performed validation in two large and independent cohorts, one comprising men without prostate cancer alive at a later calendar period and the other comprising men with a prostate cancer diagnosis.Developing the MDCI in men without prostate cancer allowed for estimation of risk of death from all causes in absence of prostate cancer diagnosis without introducing potential bias due to misclassification of the cause of death [32].
There are some limitations to our study.Linkages of rich sources at an individual level can be done in Sweden but not everywhere else so this is a limitation for generalization of this method to other countries and settings.The prediction models were developed in men selected as comparison men to men with prostate cancer, meaning that e.g.extrapolation of model parameters to women or specific other patient groups should be cautious.Associations between ICD-codes and mortality may also be context-sensitive because disease patterns, coding practices, and treatment strategies likely vary between countries and over time [10].External validation is therefore needed in other health care settings.The principle for extraction of comprehensive information from the ICD-10 codes may therefore be more generalizable than the generated actual parameter estimates.The selection of ICD-10 codes and associated parameter estimates should be used cautiously in other settings and may also need to be revisited over time.If the purpose is to control selection bias and/or confounding in a study, it is likely as effective, or even more effective, to include specific ICD-10 codes and associated predictors and derive their weights internally for the specific study.This, however, requires a large sample size.The MDCI was compared to a recently published version of the Charlson index, specifically adapted to the Swedish Patient Register [27].Comparisons to other comorbidity measures, including different adaptations of the Charlson comorbidity index and the Elixhauser comorbidity index, may yield different results.
The MDCI was developed and validated for use in register-based non-interventional studies and not for application on individual patients in clinical practice.While the study population was large, the results may be influenced by the restriction to men and the distribution of comorbidities in this particular population.The study population was therefore not well suited to elucidate the relative contribution of predictors based on occurrence, frequency, duration, and recency to the overall performance of the MDCI.The main purpose of this study was to provide proof of principle for a multidimensional diagnosis-based comorbidity index.Future research should assess the discrimination and calibration of the MDCI in a general population, including both men and women.In such a general population sample, it is of interest to try to identify what specific features of this new index that drives the improved predictive ability, so that the principle also can be adapted to other settings with less comprehensive data sources.

Conclusions
A multidimensional diagnosis-based comorbidity index based on all ICD-10 diagnostic codes in a National Patient Register, including information of occurrence, recency, frequency, and duration of hospital admission of each code, clearly outperformed the commonly used Charlson comorbidity index in prediction of death in men with and without prostate cancer.S9 Fig. 1-year hazard ratios for men in the validation cohorts of comparison men and men with prostate cancer.Comparing men with a multi-dimensional diagnosis-based comorbidity index (MDCI) below Q1 vs men with MDCI above Q3.The analysis was also stratified by age, Charlson comorbidity index (CCI) and drug comorbidity index (DCI).NA indicates that the hazard ratio could not be estimated due to small sample size and/or few events in each stratum.(PDF)

Fig 3 ).
This observation was consistent throughout all strata of age and Charlson comorbidity index (S5 Fig) and in men with prostate cancer (S6 Fig).A similar pattern was seen within strata of age and the drug comorbidity index (S7 and S8 Figs).

Fig 1 .Fig 2 .
Fig 1. Example of predictors selected in the prediction model.ICD-10 codes for cardiovascular diseases are shown in the figure, specifically focusing on the ICD-10 code category I5 (heart failure and some heart disorders and diseases).The innermost circle represents ICD-10 codes with two characters and the outmost circle codes with five characters and grouped predictors.Each predictor variable (occurrence, frequency, recency, duration) derived for each ICD-10 code corresponds to a color-coded circle segment.A segment is colored if any coefficient within that group of predictors had a non-zero coefficient for that code, and grey if non-informative.I50 = heart failure, I51 = complications and ill-defined descriptions of heart disease, I500 = right ventricular failure, I501 = left ventricular failure, I509 = heart failure, unspecified, I510 = cardiac septal defect, acquired, I513 = intracardiac thrombosis, not elsewhere classified, I514 = myocarditis, unspecified, I516 = cardiovascular disease, unspecified, I517 = cardiomegaly, I519 = heart disease, unspecified, I5019 = left ventricular failure, unspecified, I5099 = heart failure, unspecified.https://doi.org/10.1371/journal.pone.0296804.g001

Fig 3 .
Fig 3. Survival of men in the validation cohort consisting of men without prostate cancer.The validation cohort has been split in subgroups based on the Charlson comorbidity index (0, 1, 2, 3+) and the drug comorbidity index (quartiles of the DCI).The survival within each subgroup is then shown stratified in quartiles of the MDCI developed using 10 years of follow-up for mortality.https://doi.org/10.1371/journal.pone.0296804.g003

Fig 4 . 1 -
Fig 4. 1-year hazard ratios for men in the validation cohort consisting of men without prostate cancer.Stratified by age, Charlson comorbidity index (CCI) and drug comorbidity index (DCI) for the multi-dimensional diagnosisbased comorbidity index (MDCI) developed using 10 years of follow-up for mortality.Not available (NA) indicates that the hazard ratio could not be estimated due to small sample size and/or few events in each stratum.https://doi.org/10.1371/journal.pone.0296804.g004

Table 2 . Duration of follow-up for mortality after the index date, number of deaths, and model discrimination in development and validation cohorts.
*Follow-up restricted to a maximum of 7 years.https://doi.org/10.1371/journal.pone.0296804.t002