Influence of Using Different Databases and ‘Look Back’ Intervals to Define Comorbidity Profiles for Patients with Newly Diagnosed Hypertension: Implications for Health Services Researchers

Objective: To determine which data sources and ‘look back’ intervals should be used to define comorbidities.
Data Sources: Hospital discharge abstracts database (DAD), physician claims, population registry and death registry from April 1, 1994 to March 31, 2010 in Alberta, Canada.
Study Design: Newly diagnosed hypertension cases from the 1997 to 2008 fiscal years were identified and followed for up to 12 years. We defined comorbidities using different data sources and durations of retrospective observation (6 months, 1 year, 2 years, and 3 years). The C-statistics for logistic regression and the concordance index (CI) for Cox models of mortality and cardiovascular disease hospitalization were used to evaluate the discrimination performance of each approach to defining comorbidities.
Principal Findings: The prevalence of comorbidities increased with a longer duration of observation. Using DAD alone underestimated the prevalence by about 75%, compared to using both DAD and physician claims. The C-statistic and CI were highest when both DAD and physician claims were used, and model performance improved when the observation duration increased from 6 months to one year or longer.
Conclusion: The prevalence of comorbidities is greatly affected by the data source and the duration of retrospective observation. A combination of DAD and physician claims with an observation duration of at least one year improves model performance for predicting cardiovascular disease hospitalization and one-year mortality.


Introduction
Rigorous outcomes research requires adjustment for comorbidities, as failing to adjust for them may cast doubt on results and lead to erroneous conclusions [1]. To measure comorbidities, previous studies have used various data sources, including hospitalization discharge abstract databases (DAD) [2][3][4][5][6][7], physician claims [8][9][10][11][12][13], and drug dispensation databases [14,15], as well as different durations of retrospective observation. DAD is often used to measure Charlson comorbidities and evaluate their association with mortality, length of stay, and health care costs [16][17][18][19]. DAD, however, records comorbidities only for hospitalized patients, which is problematic since many patients with chronic conditions are managed in outpatient settings. As such, comorbidities that are defined using only one database are likely to be underestimated.
Researchers have tried various approaches to defining comorbidities accurately. Wang et al. [12] developed strategies for defining comorbidities by using Medicare and Medicaid claims data. Researchers in the United States [20] and Australia [21] explored the length of "look back" required for defining comorbidities and their associations with clinical outcomes, as previous studies have indicated that the prevalence of comorbidities varies depending on the data source and the length of the observation period used, which in turn affects adjusted clinical outcomes.
In our study of the occurrence, management, and outcomes related to hypertension, we have found that the majority of patients with hypertension are identified through physician claims data, while patients with severe hypertension are mainly identified from DAD data [22]. Considering the long interval between hypertension diagnosis and the manifestation of poor clinical outcomes, along with the number of patients who are managed in outpatient settings, we aimed to maximize the length of follow-up for outcomes and minimize the duration of observation for defining comorbidities by fully using available health information. To the best of our knowledge, no existing studies have compared different data sources and durations for estimating the burden of comorbidities, or the impact these approaches have on the performance of risk-adjusted outcome models among patients with hypertension. Therefore, we conducted this study to define Charlson comorbidities using DAD and physician claims data for four durations of retrospective observation (i.e., 6 months, 1 year, 2 years, and 3 years) and to explore the impact of these different approaches on models of mortality and cardiovascular disease outcomes among patients with newly diagnosed hypertension.
unique personal health numbers. The DAD includes all inpatients in Alberta and contains up to 16 diagnoses coded according to the International Classification of Diseases, 9th revision, Clinical Modification (ICD-9-CM) prior to April 1, 2002, and up to 25 diagnoses coded according to the ICD-10 Canadian Modification (ICD-10-CA) since April 1, 2002. In Alberta, physicians submit billing claims for services to the provincial government insurance program, regardless of their service location. When submitting these claims, at least one and up to three diagnoses, coded in ICD-9, must be provided. Physician claims data capture clinical information from patients at emergency departments, hospitals, and outpatient clinics who are covered by the Alberta provincial insurance program. Because this insurance program is universal, the program registry (also called the population registry) covers nearly all Alberta residents and contains information such as personal health number, age, sex and postal code. The death registry is updated regularly and includes an individual's date and location of death.

Study Population and Outcomes
We extracted patients with hypertension from our linked administrative data sources using the following ICD algorithm, which has previously been validated: "two claims within 2 years or 1 hospitalization" (sensitivity 75%, specificity 94%, positive predictive value 81%, and negative predictive value 92%) [23]. Patients with pregnancy-induced hypertension were excluded [23].
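The case definition quoted above can be illustrated with a minimal sketch. This is not the authors' SAS implementation: the function name, the 730-day approximation of "2 years", and the choice to count only claims on distinct dates are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "within 2 years" is approximated as 730 days.
TWO_YEARS = timedelta(days=730)


def meets_case_definition(claim_dates, hospitalization_dates):
    """Return True for ">= 2 claims within 2 years or >= 1 hospitalization".

    Both arguments are lists of dates on which a hypertension diagnosis
    code was recorded (physician claims and DAD, respectively).
    """
    if hospitalization_dates:
        return True
    # Assumption: several claims on the same day count as one claim.
    claims = sorted(set(claim_dates))
    # If any two claims fall within the window, some consecutive pair does too.
    return any(later - earlier <= TWO_YEARS
               for earlier, later in zip(claims, claims[1:]))
```

For example, a patient with claims on 2000-01-01 and 2001-06-01 meets the definition, whereas a patient with claims three years apart and no hospitalization does not.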
To determine newly diagnosed (incident) cases of hypertension, we employed a 3-year washout period so as not to misclassify prevalent cases as incident cases. We assigned the index date for hypertension diagnosis as the first date of a physician visit or hospitalization with a hypertension diagnosis code. To ensure at least a one-year follow-up period for the outcomes among patients with hypertension, we included incident cases for the fiscal years 1997 to 2008, resulting in up to a 12-year follow-up period for the study population. We excluded patients with hypertension who were not residents of Alberta or who were less than 20 years of age at the time of diagnosis.
Outcomes included all-cause mortality, determined from death registry data, and cardiovascular disease (CVD), defined as myocardial infarction, heart failure, or stroke. We linked the study population with data from DAD and used validated coding algorithms to define myocardial infarction (ICD-9: 410.
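The washout logic can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that a case counts as incident when the first recorded hypertension code falls at least three years after the start of data coverage (April 1, 1994 in this study), so that the preceding three years are observable and free of hypertension codes.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def incident_index_date(hypertension_dates, coverage_start, washout_years=3):
    """Return the index date for an incident case, or None otherwise.

    hypertension_dates: dates of all visits or hospitalizations carrying a
    hypertension code. coverage_start: first date on which the patient's
    records are observable (an assumption of this sketch).
    """
    if not hypertension_dates:
        return None
    index_date = min(hypertension_dates)  # first recorded hypertension code
    # Without a full washout window before the index date, the case may be
    # prevalent rather than incident, so it is not counted.
    if add_years(coverage_start, washout_years) > index_date:
        return None
    return index_date
```

With coverage starting on April 1, 1994, the earliest possible index date under this sketch is April 1, 1997, which matches the first fiscal year of the study cohort.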

Comorbidity Definitions
Charlson comorbidities were defined using validated ICD-9 and ICD-10 coding algorithms [26]. We applied these coding algorithms to three data source options (i.e., DAD, physician claims, and both) across four retrospective periods of observation (i.e., 6 months, 1 year, 2 years, and 3 years from the date of hypertension diagnosis). Thus, we evaluated 12 approaches to defining Charlson comorbidities. We did not use Elixhauser comorbidities, which contain more conditions and are better predictors of long-term mortality than Charlson comorbidities [27], because the majority of patients with hypertension are captured from physician claims data. Diagnoses in this database are coded using ICD-9, up to 4 digits, whereas defining Elixhauser comorbidities requires ICD-9-CM diagnosis codes, up to 5 digits (a more precise coding system than ICD-9).
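The 12 approaches (three data source options by four look-back windows) can be enumerated with a short sketch. The record layout, the day counts used for each window, and the inclusion of the index date itself in the window are assumptions for illustration; mapping ICD codes to Charlson conditions is assumed to have been done beforehand.

```python
from datetime import date, timedelta
from itertools import product

# Assumption: window lengths approximated in days.
LOOK_BACK_DAYS = {"6 months": 183, "1 year": 365, "2 years": 730, "3 years": 1095}
# Each source option maps to the set of databases it draws on.
SOURCE_OPTIONS = {"DAD": {"DAD"}, "claims": {"claims"}, "both": {"DAD", "claims"}}


def comorbidity_profiles(records, index_date):
    """Return {(source option, window): set of conditions} for one patient.

    records: iterable of (database, diagnosis_date, condition) tuples, where
    database is "DAD" or "claims" and condition is a Charlson category label.
    """
    profiles = {}
    for (source, databases), (window, days) in product(
            SOURCE_OPTIONS.items(), LOOK_BACK_DAYS.items()):
        window_start = index_date - timedelta(days=days)
        profiles[(source, window)] = {
            condition
            for database, diagnosis_date, condition in records
            # Assumption: the index date itself is part of the window.
            if database in databases and window_start <= diagnosis_date <= index_date
        }
    return profiles
```

Because each longer window contains the shorter ones, and the "both" option contains each single source, the set of detected conditions can only grow with longer windows or combined sources, which is why prevalence is expected to be highest for the 3-year combined approach.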

Statistical Methods
The prevalence of Charlson comorbidities was calculated for each of our 12 approaches. We also fitted logistic regression models for one-year all-cause mortality and CVD hospitalization for each of these 12 approaches. Age- and sex-adjusted odds ratios (ORs) were estimated for each comorbidity. We then used Cox proportional hazards regression models for all-cause mortality and CVD hospitalization and estimated the hazard ratio (HR) for each comorbidity after adjusting for age and sex.
We assessed model performance using C-statistics for logistic regression and the concordance index (CI) for Cox proportional hazards regression [28]. We used 10-fold cross-validation to evaluate model performance. The C-statistics, CI, and their 95% confidence intervals were estimated using the bootstrap method with 500 resamples. All analyses were conducted using SAS version 9.4 (SAS Institute Inc., USA).
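The discrimination measures can be illustrated with a minimal sketch. This is not the authors' SAS code: the function names are hypothetical, the concordance index is assumed to be Harrell's pairwise formulation, and the percentile method is assumed for the bootstrap interval, since the text states only that 500 resamples were used.

```python
import random


def c_statistic(outcomes, scores):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted score; ties count as one half."""
    events = [s for y, s in zip(outcomes, scores) if y == 1]
    non_events = [s for y, s in zip(outcomes, scores) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = sum((e > n) + 0.5 * (e == n) for e in events for n in non_events)
    return concordant / (len(events) * len(non_events))


def harrell_c(times, events, risks):
    """Harrell's concordance index for right-censored follow-up data."""
    concordant = usable = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            # A pair is usable when subject i has an observed event before
            # subject j's event or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                concordant += (risks[i] > risks[j]) + 0.5 * (risks[i] == risks[j])
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


def bootstrap_ci(outcomes, scores, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the C-statistic."""
    rng = random.Random(seed)
    n = len(outcomes)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples lacking events or non-events
            estimates.append(c_statistic(ys, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(alpha / 2 * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

The pairwise loops are quadratic in the number of patients, so a cohort of the size analyzed here would in practice call for a rank-based or library implementation. The sketch is meant only to make the definitions concrete.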
This study was approved by the Conjoint Health Research Ethics Board (CHREB), University of Calgary (approval number: REB13-0051). A waiver of consent was also approved by CHREB because this study analyzed health administrative data, and patient records/information were anonymized and de-identified in these databases prior to analysis.

Results
Of the 759,040 patients identified with hypertension between the 1994 and 2009 fiscal years, we included 456,263 patients with newly diagnosed hypertension. As shown in Table 1, 9.9% of these were identified using data only from DAD, 86.8% using data only from physician claims data, and 3.4% using both DAD and claims between 1997 and 2008. The follow-up period ranged from 0 to 12 years (mean: 5.7 years, median: 5.5 years) with a mortality rate of 2.8 per 1000 person-years.
The prevalence of each comorbidity was higher when both DAD and physician claims data were used, compared to when either of these sources was used alone (Table 2). For the 1-year 'look back' period, the prevalence of having at least one Charlson comorbidity was almost twice as high in claims data as in DAD data, and was even higher when both DAD and claims data sources were used together (DAD: 15.5%, claims: 30.0%, and both: 32.6%). The prevalence also increased with the length of retrospective observation, although the increase from 2 to 3 years was smaller than the increase from 1 year to 2 years. Risk-adjusted ORs and HRs for the Charlson comorbidities varied slightly across data sources and retrospective periods in models that used mortality (Table 3) and CVD hospitalization as the outcome. The approach that used both DAD and physician claims data had the highest C-statistics, followed by DAD data only, and physician claims data only (Table 4). For each data source, the C-statistics and CI improved for CVD hospitalization and one-year mortality when the retrospective period was increased from 6 months to one year or more. The 3-year DAD and physician claims approach had the highest C-statistics and CI among the 12 approaches. The C-statistics and CI were lower for models of CVD hospitalization than for models of mortality (Table 4).

Discussion
We found that the use of DAD data alone underestimated the prevalence of comorbidities, while use of both physician claims and DAD data with a 3-year retrospective observation period yielded the highest prevalence. The model performance for one-year mortality and cardiovascular disease hospitalization was statistically significantly improved for the approach that used DAD and physician claims data when compared to the approach of using only one of these sources for one year or longer. Preen et al. [21] found that fewer than 50% of comorbidities that were recorded in the preceding five years were captured in the index hospital record. Another study in the United States reported that the prevalence of comorbidities increased from 10% when using only inpatient data to 25% when using both inpatient and physician claims data [8]. Our study supports the findings from this literature. We found that 75% of comorbidities were missed when only DAD data were used compared to when both physician claims and DAD data were used, with a retrospective observation period of three years. The prevalence of having at least one Charlson comorbidity was 9.2% when only the index hospital record in DAD data was used. This increased to 18.7% when we applied a 3-year retrospective observation period to DAD data. Using physician claims data further improved the identification of comorbidities, as the prevalence reached 43.2%. These studies clearly suggest that DAD and physician claims data with a long duration of observation should be used to capture the comorbidity profile.
We found that the choice of data source had a greater impact on risk adjustment model performance, as measured by C-statistics and CI, than the duration of the retrospective observation period for mortality and CVD outcomes. One study in the United States reported that its C-statistics remained the same between a 1-year and a 2-year observation period for combined inpatient and outpatient data [20]. Using inpatient data, Preen et al. [21] in Australia reported that their C-statistics showed little to no improvement from a 1-year to a 5-year observation period. In Canada, Lee et al. [29] found that increasing the duration of the retrospective observation period increased the detection of comorbidities, but only marginally improved their predictive model performance for 30-day mortality. There are several possible explanations for this. First, inpatients with hypertension are likely to be sicker than outpatients with hypertension. As such, patients with multiple conditions and poorer outcomes are captured in DAD data. Second, regardless of the service location, physician claims record conditions not only from outpatients but also from inpatients and emergency department visitors. There is therefore a large overlap between conditions that are recorded in DAD and claims data. Third, outcomes in hypertension are determined by many factors, such as socio-demographic and clinical characteristics. However, administrative data do not capture many important factors, such as the severity of a disease. As an index of case-mix, Charlson comorbidities that are defined using administrative data may have reached their maximum capacity for predicting clinical outcomes, such as CVD, even where prevalence increases with duration. Regardless of how the data may be enhanced through a longer duration, there is not much margin left to improve risk adjustment model performance.
Fourth, patients with severe conditions visit physicians frequently, and their comorbidities could be captured within a short duration. We found that the ORs and HRs for comorbidities slightly decreased with the duration of observation for both mortality and CVD hospitalizations. This decrease may in part be due to false-positive comorbidities, which dilute the effect of comorbidities on poor outcomes. Patients with mild comorbidities, however, are less likely to visit their physicians and more likely to have a longer survival than patients with severe comorbidities.
Table 4 notes: The C-statistics were estimated using a logistic regression model while adjusting for age group, sex, and 17 Charlson conditions. The CI was estimated using a Cox proportional hazards regression model while adjusting for age group, sex, and 17 Charlson conditions. The confidence intervals were estimated using the bootstrap method. doi:10.1371/journal.pone.0162074.t004
Limitations to this study are as follows. First and foremost, we did not validate comorbidities that were identified in physician claims data. Previous studies have indicated that the validity of chronic condition coding varies across conditions and data sources. As the observation period increased, more false-positive chronic conditions may have been included due to ICD coding errors. These false-positive conditions might influence the discriminatory ability for poor outcomes. Secondly, comorbidities were defined prior to hypertension diagnosis. Some comorbid conditions may have occurred after the diagnosis of hypertension and contributed to poor outcomes. We did not account for time-dependent variables. Thirdly, we followed patients with incident hypertension for up to 12 years. The HRs might have changed with a longer follow-up period. Lastly, hypertension and comorbidities in this study were identified using Canadian administrative health data from a universal health insurance program. Thus, the findings from our study may not be generalizable to countries with different healthcare systems.
In conclusion, using a combination of DAD and physician claims data substantially improved the capture of chronic comorbidities. Prevalence increased significantly as the retrospective observation period lengthened. A combination of DAD and physician claims data with an observation duration of one year or longer improves predictive model performance for cardiovascular disease hospitalization and one-year mortality outcomes, because many chronic conditions are managed in outpatient clinical settings.