Reproducibility, reliability and validity of population-based administrative health data for the assessment of cancer non-related comorbidities

Background Patients with comorbidities do not receive optimal treatment for their cancer, leading to lower cancer survival. Information on individual comorbidities is not straightforward to derive from population-based administrative health datasets. We described the development of a reproducible algorithm to extract the individual Charlson index comorbidities from such data. We illustrated the algorithm with 1,789 laryngeal cancer patients diagnosed in England in 2013. We aimed to clearly set out and advocate the time-related assumptions specified in the algorithm by providing empirical evidence for them. Methods Comorbidities were assessed from hospital records in the ten years preceding cancer diagnosis and internal reliability of the hospital records was checked. Data were right-truncated 6 or 12 months prior to cancer diagnosis to avoid inclusion of potentially cancer-related comorbidities. We tested for collider bias using Cox regression. Results Our administrative data showed weak to moderate internal reliability to identify comorbidities (ICC ranging between 0.1 and 0.6) but a notably high external validity (86.3%). We showed a reverse protective effect of non-cancer related Chronic Obstructive Pulmonary Disease (COPD) when the effect is split into cancer and non-cancer related COPD (Age-adjusted HR: 0.95, 95% CI:0.7–1.28 for non-cancer related comorbidities). Furthermore, we showed that a window of 6 years before diagnosis is an optimal period for the assessment of comorbidities. Conclusion To formulate a robust approach for assessing common comorbidities, it is important that assumptions made are explicitly stated and empirically proven. We provide a transparent and consistent approach useful to researchers looking to assess comorbidities for cancer patients using administrative health data.



Background
When modelling cancer survival in population-based research, it is relevant to account for potential confounders and effect modifiers, such as comorbid conditions, frequently linked to clinically relevant outcomes. [1,2] Most studies found that cancer patients with comorbidity had poorer survival than those without comorbidity. [1] The presence of comorbidities may delay or favour a timely cancer diagnosis. [3][4][5] In addition, it has been hypothesised that patients with comorbidities do not receive standard cancer treatments such as surgery, chemotherapy, and radiation therapy as often as patients without comorbidities. [1] Thus, the use of individual comorbidities or comorbidity scores such as the Charlson index [6] will enrich our understanding of differences in cancer survival outcomes in observational population-based studies.
Comorbidities are defined as the coexistence of disorders, in addition to a primary disease of interest, which are causally unrelated to the primary disease (e.g. cancer). [7,8] A myriad of comorbidity indices have been developed, some more specifically for cancer patients (simple condition, counts of simple conditions, weighted indices and organ-based system). [9] The Charlson comorbidity index (CCI) is the most extensively studied and most widely used comorbidity index in the medical literature. [10] The widespread use of this index could be explained by the fact that it is not designed for patients with a particular disease and is recommended when overall mortality is the outcome of interest. [10] It does not require extensive information, which makes it appealing to researchers who access administrative data rather than individual clinical notes. [9] However, no gold standard approach to measure comorbidity in the context of cancer exists, and the source of data to ascertain comorbidities varies. [9] Two main sources of data are commonly used to ascertain comorbidities: clinical records and administrative data. In population-based research, administrative data has been suggested as the best available option to ascertain comorbidities and predict in-hospital mortality or 6-month mortality for the CCI. [9,10] A report regarding the administrative sources of data used for deriving comorbidities showed a lack of consistency, validity, and replicability of a broad majority of the studies deriving comorbidities. [9] Furthermore, the studies describing the comorbidity index did not offer a clear description of the underlying assumptions made to obtain the algorithm nor provide the code used, thereby limiting the opportunity to assess and replicate the work. Consequently, researchers can make differing assumptions in their evaluation of comorbidities, leading to conflicting findings.
We aimed to construct a robust algorithm that is both transparent and replicable to assess comorbidities using population-based hospital administrative data. First, we described and evaluated the assumptions underlying the development of an algorithm, using the hospital episode statistics (HES) in England for the period 2003-2013. We then evaluated the internal and external validity and quality of these data to extract and use comorbidity information.

Study design, data and linkage strategy
We developed a retrospective longitudinal assessment of comorbidities for cancer patients diagnosed in England during 2013. Information on cancer patients with a malignant invasive primary tumour was obtained from cancer registrations in England. This contains patient and tumour variables including relevant dates (birth, diagnosis, last vital status), sex, age at diagnosis, deprivation, cancer site and morphology. We used population-based administrative hospital discharge data for the assessment of comorbid conditions. Namely, we analysed Hospital Episode Statistics (HES) data, [11] including accident and emergency (A&E), inpatient and outpatient data streams in England for the period 2003 to 2013. HES contains clinical, administrative, and demographic information about individual patients. The diagnostic information uses the International Classification of Diseases (10th revision) (ICD-10) [12] and operations are coded using the Office of Population Censuses and Surveys Surgical Operations and Procedures (4th edition) (OPCS-4). [13] HES data had been linked to the cancer registrations from Public Health England using a deterministic linkage strategy based on an individual ID (NHS number), date of birth, sex and postcode.

Data management
Overall assumptions. Overall, we assumed that HES is a valid source of data for the assessment of comorbidities at a population-based level. However, the evaluation of comorbidities depends heavily on both age and probability of attending the hospital (outpatient/inpatient) in the years preceding the cancer diagnosis. Given the chronic aspect of comorbidities, we also assumed that once a comorbidity is recorded in HES, the patient suffers from that comorbidity up until the time of cancer diagnosis. To explain and evaluate our algorithm, we focussed on patients diagnosed with laryngeal cancer.
Algorithm. From HES, we selected all 14 diagnostic variables containing ICD-10 diagnosis codes [12] (version 4) for each episode registered. The time scale refers to time pre-cancer diagnosis, which we split into six-monthly intervals (Fig 1A). We compared the hospital episode start date to the cancer diagnosis date to confirm its inclusion in the different time intervals. Each interval was examined independently. Each diagnosis field was scanned for the 17 comorbid conditions that compose the CCI (listed in S1 Table) and morbid obesity. [14] If a comorbidity (i) was recorded in a given six-month interval (j), we updated the corresponding binary indicator variable (x_ij = 1). The assessment of comorbidities for periods longer than six months was simply the aggregation of the information contained in all binary variables derived for each six-monthly interval, with each comorbidity counted only once however many times it was recorded. We also retained the episode date at which a comorbidity was first recorded. We considered the patient as the unit of analysis.
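The interval logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code released with the paper (S1 Code): the ICD-10 prefixes are an abbreviated, hypothetical subset rather than the S1 Table definitions, a six-month interval is approximated as 365.25/2 days, and all function and variable names are invented for the example.

```python
from datetime import date

# Assumed length of one six-monthly interval, in days.
HALF_YEAR_DAYS = 365.25 / 2

# Hypothetical, abbreviated ICD-10 prefixes for illustration only
# (not the full Charlson definitions listed in S1 Table).
ILLUSTRATIVE_CODES = {
    "copd": ("J40", "J41", "J42", "J43", "J44"),
    "dementia": ("F00", "F01", "F02", "F03", "G30"),
}


def interval_index(episode_start, diagnosis_date):
    """Six-monthly interval j (0 = the six months just before diagnosis).

    Returns None when the episode starts after the cancer diagnosis.
    """
    days_before = (diagnosis_date - episode_start).days
    if days_before < 0:
        return None
    return int(days_before // HALF_YEAR_DAYS)


def comorbidity_indicators(episodes, diagnosis_date, n_intervals=20):
    """Build the binary indicators x[i][j] and the first-recorded dates.

    episodes: list of (episode_start_date, list_of_icd10_codes).
    """
    x = {cond: [0] * n_intervals for cond in ILLUSTRATIVE_CODES}
    first_seen = {}
    for start, codes in episodes:
        j = interval_index(start, diagnosis_date)
        if j is None or j >= n_intervals:
            continue  # outside the assessment period
        for cond, prefixes in ILLUSTRATIVE_CODES.items():
            if any(code.startswith(prefixes) for code in codes):
                x[cond][j] = 1
                if cond not in first_seen or start < first_seen[cond]:
                    first_seen[cond] = start
    return x, first_seen


def ever_recorded(x_row, first_interval, last_interval):
    """Aggregate several intervals; a comorbidity is counted only once."""
    return int(any(x_row[first_interval:last_interval + 1]))


# Toy patient diagnosed on 1 March 2013 with two hospital episodes.
toy_episodes = [
    (date(2012, 11, 1), ["J449"]),
    (date(2008, 3, 15), ["J440", "F03"]),
]
x, first_seen = comorbidity_indicators(toy_episodes, date(2013, 3, 1))
```

Keeping one indicator per condition and interval means that any longer assessment period is obtained by aggregating existing indicators, without rescanning the diagnostic fields.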
Time-related assumptions. Minimising the potential for selection bias when assessing comorbidities requires the development of an algorithm that will evaluate the same amount of person-time at risk for any given patient included in the assessment. It allows each patient to have the same probability of being diagnosed with comorbidities in relation to the time under assessment. Table 1 shows the minimum number of years, for each cohort of patients, for which we can assess comorbidities. We used the 2013 cohort as a reference to which we compared the comorbidity information derived from shorter time windows. Given this data constraint, we had to consider carefully the optimal time window for the assessment of comorbidities based on the trade-off between long HES history and the number of cancer patient cohorts to evaluate.
Furthermore, considering the definition of comorbidity as the occurrence of disorders which are causally unrelated to cancer, we defined comorbidities identified shortly before diagnosis as cancer-related (Fig 1B). Thus, there is a risk of a collider bias given that cancer-related comorbidities may be a common effect of exposure and outcome, and contradictory associations may arise between non-cancer related comorbidities and cancer survival. [15] To mitigate the possibility of selection bias we created restriction windows of 6, 12 or 24 months before the cancer diagnosis, during which comorbidities first registered were excluded. However, cancer-related comorbidities may be of interest in studies aiming to evaluate factors associated with the cancer treatment decision. In this particular case, the restriction mentioned above will not apply.
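The restriction-window rule can be illustrated with a short Python sketch. It assumes that a comorbidity is labelled by the date it was first recorded, that time is counted in whole calendar months, and that a comorbidity first recorded exactly at the window boundary is treated as non-cancer-related; the boundary convention and all names are assumptions made for the example.

```python
from datetime import date


def whole_months_between(earlier, later):
    """Number of completed calendar months from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the last month is not yet complete
    return months


def classify_comorbidity(first_recorded, diagnosis_date, restriction_months=6):
    """Label a comorbidity relative to the restriction window.

    Returns "none" when the comorbidity was never recorded before diagnosis,
    "cancer-related" when first recorded inside the restriction window, and
    "non-cancer-related" otherwise.
    """
    if first_recorded is None or first_recorded > diagnosis_date:
        return "none"
    if whole_months_between(first_recorded, diagnosis_date) < restriction_months:
        return "cancer-related"
    return "non-cancer-related"


# COPD first recorded four months before a diagnosis on 1 March 2013.
label = classify_comorbidity(date(2012, 11, 1), date(2013, 3, 1))
```

For studies of treatment decisions, where cancer-related comorbidities are of interest, the two labels could simply be merged rather than one being excluded.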

Validation and statistical analyses
First, to evaluate the optimal time window for the retrospective assessment of comorbidities, we compared the cumulative incidences of comorbidities for the 2013 cohort of laryngeal cancer patients using consecutive time restrictions, and showed the corresponding percentages of comorbidities lost.
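As a rough sketch of this comparison, the share of comorbid patients captured by a shorter window can be computed from the six-monthly indicators. The example assumes one indicator list per patient (index 0 being the six months closest to diagnosis), a ten-year reference period of 20 intervals, and windows and restrictions expressed in numbers of intervals counted back from diagnosis; the function names and toy data are hypothetical.

```python
def n_detected(x_rows, window, restriction=0):
    """Patients with a record in intervals [restriction, window)."""
    return sum(1 for row in x_rows if any(row[restriction:window]))


def captured_fraction(x_rows, window, restriction=0, full_window=20):
    """Share of patients detected with the full history who are also
    detected with the shorter window (None if nobody is detected)."""
    reference = n_detected(x_rows, full_window, restriction)
    if reference == 0:
        return None
    return n_detected(x_rows, window, restriction) / reference


def make_row(recorded_intervals, n_intervals=20):
    """Toy helper: indicator list with 1 in the given intervals."""
    row = [0] * n_intervals
    for j in recorded_intervals:
        row[j] = 1
    return row


# Four toy patients: records in interval 0, 3, 15, and no record at all.
toy_rows = [make_row([0]), make_row([3]), make_row([15]), make_row([])]

# Six-year window (12 intervals), without and with a six-month restriction.
share_6y = captured_fraction(toy_rows, window=12)
share_6y_restricted = captured_fraction(toy_rows, window=12, restriction=1)
```

The percentage of comorbidities lost with a given window is then one minus the captured fraction.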
Second, we used two semi-parametric Cox proportional hazard models to estimate the age-adjusted effect of non-cancer related comorbidities on cancer survival. The first model did not differentiate cancer- and non-cancer-related comorbidities. Then, the effect of comorbidities on cancer survival was compared with a second model where cancer and non-cancer related comorbidities were modelled as independent variables. For both models, we fitted three different versions relating to various lengths of the restriction window (6, 12 and 24 months).
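One plausible way to write the two specifications is given below. The notation is an assumption introduced here for illustration: h_0(t) is the baseline hazard, C indicates the presence of the comorbidity regardless of timing, and C^cr and C^ncr indicate cancer-related and non-cancer-related comorbidity for a given restriction window.

```latex
% Model 1: comorbidity not split by timing (age-adjusted)
h(t \mid \text{age}, C) = h_0(t)\,\exp\!\left(\beta_1\,\text{age} + \beta_2\,C\right)

% Model 2: cancer-related and non-cancer-related comorbidity as separate covariates
h(t \mid \text{age}, C^{\mathrm{cr}}, C^{\mathrm{ncr}}) =
  h_0(t)\,\exp\!\left(\gamma_1\,\text{age} + \gamma_2\,C^{\mathrm{cr}} + \gamma_3\,C^{\mathrm{ncr}}\right)

% Hazard ratio for non-cancer-related comorbidity in Model 2
\mathrm{HR}_{\mathrm{ncr}} = \exp(\gamma_3)
```

Under this formulation, each length of the restriction window changes only how C^cr and C^ncr are defined, not the form of the model.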
Finally, to measure the reliability and consistency of HES to assess comorbidities we computed the intraclass correlation coefficient (ICC) [16] and calculated the percentage of agreement between two independent sources for cancer diagnosis information, namely HES and the cancer registrations. [17] We defined internal reliability as the extent to which two or more successive HES episodes for any given patient report identical or additional comorbidities (Fig 1C). [18,19] We used non-linear generalised random effects models to derive the ICC for each of the 17 CCI conditions and their respective 95% CI. The external validity of HES was defined following the Centre for Disease Control (CDC) surveillance strategy for the assessment of reliability between two different sources of data. [20,21] Cancer registrations were considered as the gold standard for cancer diagnosis. The HES diagnostic fields were screened for a laryngeal cancer diagnostic code from 1 month before the cancer registry diagnosis date (Fig 1D). Then we estimated the percentage of agreement for cancer diagnosis between the two sources and derived 95% CI based on the exact test. [17] All data management and statistical analyses were performed using STATA version 14.
Fig 2B and 2C illustrate the impact of a six-month and 12-month restriction window, respectively. The cumulative proportions of any comorbidities reach 80% and above soon after six years before the diagnosis: six years with no restriction window, six and a half years with a six-month restriction and seven years with a 12-month restriction. An overall window of six years therefore already captures the vast majority of the comorbidities there are to report in the ten years preceding the diagnosis.
The proportion of patients with COPD (Chronic Obstructive Pulmonary Disease) is high immediately before diagnosis and increases only slightly with additional years (by approximately 15%), reflecting that more than 85% of COPD is identified in the six months before the diagnosis. In Fig 2B and 2C, that proportion is restricted to around 30%. Furthermore, using a restriction window makes all cumulative incidence curves follow the same pattern: between 0 and 40% of the final proportions of comorbidities are detected 6 or 12 months before diagnosis. The cumulative proportions of comorbidities then increase at approximately similar rates, reaching 100% at ten years.

Results
We provide absolute and relative measures of the impact of applying restriction windows on the assessment of comorbidities. Among the 1,789 laryngeal cancer patients diagnosed in 2013, 51% present with at least one comorbid condition. That proportion drops to 34% and 32% if a six- or 12-month restriction window is applied, respectively. This highlights that 17% of these comorbidities are first reported in the six months preceding the diagnosis (Table 2).
Overall, the reliability of the recording of comorbidities in HES is moderate for 7 comorbidities with an ICC ranging between 0.3 for rheumatic disease and 0.62 for dementia. All other comorbidities showed ICCs less than 0.3 for the 2013 laryngeal cancer cohort, indicating weak internal reliability (Table 3). The ICC for the 2013 laryngeal cancer cohort and all the available data (cohorts from 2005 to 2013, N = 16,112) were similar, which indicated the absence of secular trends. Dementia, COPD, diabetes without chronic complication and renal disease showed a moderate ICC (≥0.5) while peptic ulcer and myocardial infarction consistently showed a lower ICC. In evaluating external reliability, we found that the proportion of agreement between ONS and HES was notably high (86.3%).
Table 4 shows the effect of cancer-related and non-cancer-related COPD on cancer mortality. COPD was defined as related to the cancer if first diagnosed within either 6, 12 or 24 months before the cancer diagnosis. For all three intervals, age-adjusted non-cancer related comorbidities were associated with a higher hazard of cancer mortality: hazard ratios were 1.26 (CI: 0.98-1.63), 1.26 (CI: 0.97-1.63) and 1.16 (CI: 0.86-1.56) for comorbidities assessed 6, 12 or 24 months away from the cancer diagnosis, respectively. Likewise, cancer-related comorbidities were consistently associated with a higher cancer mortality risk with all hazard ratios over 1.5. However, in multivariate adjusted Cox models where we included age, cancer- and non-cancer-related comorbidities as independent predictors, the point estimate for the effect of non-cancer related comorbidities was reversed (HR: 0.98, CI: 0.72-1.32; 0.95, CI: 0.70-1.28; 0.86, CI: 0.62-1.19 for all three intervals assessed). Despite the lack of statistical significance

Discussion
This study highlights the importance of explicitly stating and empirically proving the assumptions made in the assessment of cancer comorbidities using administrative health data. We recommend considering time as an important confounder in the assessment of comorbidities by defining an optimal window and a restriction window. Furthermore, consistency of the optimal window ensures there is no selection bias associated with time, as all patients included in the incident cancer data have the same follow-up period to be assessed for comorbidities. We demonstrate that a 6-year window is an optimal period for identifying comorbidities in our setting. The purpose of the restriction window is to prevent paradoxical effects when assessing the impact of comorbidities on cancer survival. With empirical evidence we highlight the need for a restriction window of at least six months prior to laryngeal cancer diagnosis, when we consider the effect of COPD. Such an exercise would need to be repeated for different combinations of cancer sites and comorbidities. Additionally, the code for the computing algorithm is available as proof of reproducible research (S1 Code).
The number of studies in cancer epidemiology using derived information of comorbidities from administrative or clinical data has grown in the last five years. [10,22][23][24][25][26] There is a wealth of literature on comorbidity scores [23,27,28] and on adapting them to different data sets [29][30][31], varying numbers of comorbidities included for consideration, and varying subsets of the population [32][33][34] or diseases of interest. [35,36] The literature mostly focusses on how administrative data compares to medical records [22] in terms of identifying relevant comorbidities, and whether a particular score or modified score is a good predictor of mortality. [37,38] However, there is no clear consensus on how to assess and estimate comorbidities using administrative or other types of health data.
Furthermore, the majority of recent research analysing comorbidity data does not state the major assumptions made to derive information on comorbidity; criteria such as validity and reliability are not routinely assessed. [20,22][23][24][25][26][27][28][29][30][31][32][33][34] In epidemiological studies any assumption made during the data generation process and analysis must be stated. [39] Therefore, the first step to develop a uniform approach to assess and use comorbidities in future studies is to state the assumptions made to generate the data. We explicitly document and empirically prove the set of assumptions needed to derive comorbidity information from secondary care health administrative data.
Time is one of the most important confounders in epidemiology. Given that patients with a longer follow-up period might have a higher probability of having comorbidities identified, we set an optimal window so that the assessment of comorbidities is independent of time (i.e., securing the same follow-up time for comorbidities for all cancer patients). Furthermore, the optimal window helps to maximise the equal number of years that all cancer patients included in the analysis were followed up. [40] Given the weak ICC for hospital administrative data presented here for many of the comorbidities assessed, there is a rationale for using the longest possible assessment window in order to maximise the detection of existing comorbidities. An audit of HES codes showed 90.5% accuracy for identifying 8 major comorbidities, indicating that HES diagnostic fields can confidently predict the actual presence of the comorbidities. Improving the protocol for documenting comorbidities with clinicians and providing further training to administrative clerks could enhance the assessment of comorbidities using HES. [41] We also compared the prevalence of comorbid conditions in our laryngeal cancer population with that of the general population: comorbidities sharing the same risk factors as laryngeal cancer were much more prevalent in our data, while all other comorbidities were comparable to published prevalence for the general population (data not shown). These results are in agreement with the HES clinical audit. [41] Some studies have used a restriction window to assess the effect of comorbidities on cancer survival, although the assumptions made to set this window have not been explicitly justified or documented. [40,42] To our knowledge, we are the first to empirically show the impact of neglecting this principle.
A paradoxical association between cancer-related comorbidities and mortality, such as obtaining a protective effect from a risk factor (COPD) known to predict the outcome (mortality due to laryngeal cancer), occurs if a restriction window is not set. This paradoxical association occurs when the probability of the exposure is associated with the outcome being studied. [15] Likewise, a collider stratification bias may occur when we condition on a common effect of exposure and outcome, i.e. non-cancer related comorbidities conditioned on cancer-related comorbidities and cancer. [43] Therefore, when the interest of researchers is to explain the effect of comorbidities on cancer survival, we advise epidemiologists to think carefully about the particular effect of individual comorbidities on specific cancer sites to avoid reporting spurious or paradoxical protective effects of comorbidities. [44][45][46] We show that there are fewer differences between a 6- and 12-month restriction window than between no window and a 6-month window, mostly related to cancer-related comorbidities recorded for the first time in the 6 or 12 months before the cancer diagnosis. This finding highlights the potential for earlier cancer diagnosis. In particular, the high proportion of COPD could reflect a misdiagnosis of laryngeal cancer.
Despite documented differences between administrative data and medical records [47], both types of data produce comorbidity scores that have similar predictive power. [48,49] We found over 86% agreement between the HES data and Cancer Registrations for the registration of laryngeal cancer. Other limitations include the necessary computing resources for handling big data, the availability of data for the assessment of comorbidities (2003 to 2015), the relatively small number of comorbidities we focused on, and the external validity of our findings limited to hospital records in England. However, our approach is general, and it could be valid in other settings. We recognise we are missing some lifestyle risk factors such as tobacco smoking, alcoholism, drug abuse and other conditions such as asthma, eating disorders and epilepsy, which would undoubtedly impact outcomes.
We encourage researchers to consider our recommendations for the assessment and use of comorbidities. We have clarified the set of assumptions used to identify cancer patients' comorbidities using hospital data. Moreover, we have demonstrated our assumptions through empirical analyses based on current epidemiologic knowledge. Our algorithm for the assessment of comorbidities could be considered as a state-of-the-art method for the evaluation of comorbidities using administrative health data in population-based cancer research epidemiology. Furthermore, we have shown that administrative hospital data is a valuable and consistent source of information allowing population-based cancer researchers to update comorbidities information for patients.