The Impact of the Choice of Data Source in Record Linkage Studies Estimating Mortality in Venous Thromboembolism

Linked electronic healthcare databases are increasingly being used in observational research. The objective of this study was to investigate the impact of the choice of data source in estimating mortality following VTE, with a secondary aim to investigate the influence of the denominator definition. We used the UK Clinical Practice Research Datalink (CPRD) to identify patients aged 18+ with venous thromboembolism (VTE). Multiple cohorts were identified in order to assess how mortality rates differed with a range of data sources. For each of the cohorts, incidence rates per 1,000 person years (/1000py) and relative rates (RRs) of all-cause mortality were calculated. The lowest mortality rate was found when only primary care data were used for both the exposure (VTE) and the outcome (death) (108.4/1000py). The highest mortality rate was found for patients diagnosed in secondary care (237.2/1000py). When linked primary and secondary care data were included for eligible patients and for the overlapping period of data collection, a mortality rate of 173.2/1000py was found. Sensitivity analyses varying the denominator definition provided a range of results (140.6–164.3/1000py). The relative rates of mortality by gender and age were comparable across all cohorts. Depending on the choice of data source, the population studied may be different. This may have substantial impact on the main findings, in particular on incidence rates of mortality following VTE.


Introduction
Electronic healthcare databases are increasingly being used in observational research. For conditions diagnosed, treated and managed solely in primary care, electronic healthcare records (EHR) from general practice may be a good source of data for pharmacoepidemiology studies. However, some conditions have significant periods of management in secondary care or specialist centres, and electronic data from one source may not capture all the events happening in another setting. In order to answer a specific research question, researchers must choose which data source(s) are most appropriate in order to ensure the findings are generalisable and that case misclassification can be minimised.
The impact of using different populations in EHR studies has not been evaluated. A previous publication investigated mortality following venous thromboembolism (VTE) using the combination of primary care, secondary care hospital episode statistics (HES) and mortality data from the Office for National Statistics (ONS) from the Clinical Practice Research Datalink (CPRD) [1]. Many of the symptoms of VTE present in primary care and a high level of validity of VTE recording in primary care has been reported previously [2,3]. More serious cases may first appear in secondary care or as a cause of death in the death certificate, without any record in the primary or secondary care records. General practitioners may not obtain the full information on the cause of death; feedback from secondary care can be poor, therefore death data can be non-coded or missing. Considering the range of recording options, this condition is an ideal candidate to use as an example of when there is a need to use multiple data sources to examine the patient journey in full.
Previous research has highlighted that case validity is improved by using more than one source of data [4], minimising misclassification of either the exposure or outcome, or both. The objective of this study was to investigate the impact of the choice of data source in estimating mortality following VTE, with a secondary aim to investigate the influence of the denominator definition. In order to do this we recreated an analysis of mortality following VTE using CPRD primary care data and linked HES and ONS mortality data, using the same version of the data sources and same study period [5]. We experimented with using the primary care data only and including available linked data.

Materials and Methods
The primary care data from CPRD is a database of computerised medical records from across UK General Practice. Data are available from 1987 on over 13.6 million patients [6] in a dynamic cohort where patients can join/leave a GP practice over the course of follow up. A systematic review of validation studies found that medical data in CPRD were generally of high validity [7]. The national Hospital Episodes Statistics (HES) data contain details of all admissions to National Health Service (NHS) hospitals in England from April 1997 [8]. The Office for National Statistics (ONS) mortality data contains the date and coded cause of death for the population of England and Wales from January 1998 [9]. HES and ONS mortality data are deterministically linked to the primary care data using a combination of identifiers including the patient's unique NHS number, gender, date of birth and postcode. For this study, each individual data source had a different period of coverage, the CPRD linkage was limited to consenting GP practices from England, and not all of the patients registered at a participating practice were eligible for each linkage (Fig 1).
We recreated an analysis of mortality following VTE using the same study period and data sources as in a previous publication [5]. An older version of the linked data was used covering the period up to the 30th October 2009, at which point 40% of CPRD practices had consented to take part in the linkage. VTE was defined as a composite of distal and proximal deep venous thrombosis (DVT) and pulmonary embolism (PE) from either primary or secondary care. In order to imitate making different choices, multiple cohorts were identified with the data sources available (Fig 2). In all cohorts, patients were aged 18 years or older with a diagnosis of VTE (index date) on or after 01/01/1995.
Four cohorts were identified in order to assess the impact of the choice of data source (Fig 2): patients diagnosed in primary care (cohort 1: primary care cohort); patients diagnosed in either primary or secondary care who were eligible for HES linkage, restricted to the overlapping coverage period (cohort 2: linked exposure cohort); the same patients further restricted to those eligible for ONS mortality linkage and to its coverage period (cohort 3: linked exposure and outcome cohort); and patients diagnosed in secondary care (cohort 4: secondary care cohort). Patients were categorised by their prior VTE risk based on modifiable (fractures, surgery, trauma) or unmodifiable (cancer, congestive heart failure, varicose veins) factors recorded before the index date. Patients were followed from the index date until the earliest of the end of data collection, the patient's death, or transfer out of the practice. The date of death was identified from the primary care data or linked ONS mortality data, where available. The mortality incidence rate for each of the cohorts was calculated per 1,000 person years (/1000py). Cox proportional hazards regression was used to estimate the relative hazard rates (RRs) for all-cause mortality over time. RRs were calculated by gender and by age, using those aged 18-39 as the reference. The adjusted model included age, gender, lifestyle information (BMI, smoking status, alcohol use) and VTE risk category.
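The follow-up rule and the rate calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's analysis code: the record layout, the function names and the 365.25-day year are our own choices, and the two example patients are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional, Tuple

DAYS_PER_YEAR = 365.25  # assumed convention for converting days to years

# Hypothetical record layout:
# (index_date, death_date or None, transfer_out_date or None)
FollowUp = Tuple[date, Optional[date], Optional[date]]


def follow_up_end(end_of_data: date, death: Optional[date],
                  transfer_out: Optional[date]) -> date:
    """Earliest of end of data collection, death and transfer out."""
    return min(d for d in (end_of_data, death, transfer_out) if d is not None)


def mortality_rate_per_1000py(patients: Iterable[FollowUp],
                              end_of_data: date) -> float:
    """All-cause mortality rate per 1,000 person years."""
    deaths = 0
    person_years = 0.0
    for index_date, death, transfer_out in patients:
        end = follow_up_end(end_of_data, death, transfer_out)
        if end < index_date:
            continue  # no follow-up after the index date
        person_years += (end - index_date).days / DAYS_PER_YEAR
        # A death counts only if it is the event that ends follow-up.
        if death is not None and death == end:
            deaths += 1
    if person_years == 0:
        raise ValueError("no person time at risk")
    return 1000.0 * deaths / person_years


# Two hypothetical patients, each followed for 1,461 days (4 years):
# one dies and one transfers out, giving 1 death in 8 person years.
example = [
    (date(2000, 1, 1), date(2004, 1, 1), None),
    (date(2000, 1, 1), None, date(2004, 1, 1)),
]
rate = mortality_rate_per_1000py(example, end_of_data=date(2009, 10, 30))
# rate is 125.0 per 1,000 person years
```

A death recorded after the patient has transferred out is treated as censored here, which is one reading of the follow-up rule; a study with linked mortality data might handle such deaths differently.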
Sensitivity analyses evaluated the influence of the denominator definition by considering individual eligibility and data coverage periods in two further cohorts; patients with VTE from either primary or secondary care over the whole study period (cohort A: all data cohort), and patients from consenting practices whether or not they were individually eligible (cohort B: consenting practices).
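The denominator definitions differ only in which patients pass an eligibility and coverage filter, which the following sketch makes explicit. The field names and the exact start days are assumptions (the text gives only "April 1997" and "January 1998"), and the requirement that cohort 3 builds on cohort 2 is our reading of the cohort descriptions rather than a documented rule.

```python
from dataclasses import dataclass
from datetime import date

# Coverage boundaries taken from the text; the day of month is assumed.
HES_START = date(1997, 4, 1)
ONS_START = date(1998, 1, 1)
LINKAGE_END = date(2009, 10, 30)


@dataclass
class Patient:
    """Hypothetical patient record; all field names are illustrative."""
    index_date: date
    practice_consented: bool
    hes_eligible: bool
    ons_eligible: bool


def in_linked_exposure_cohort(p: Patient) -> bool:
    """Cohort 2: HES-eligible, index date within HES coverage."""
    return (p.practice_consented and p.hes_eligible
            and HES_START <= p.index_date <= LINKAGE_END)


def in_linked_exposure_outcome_cohort(p: Patient) -> bool:
    """Cohort 3: cohort 2 plus ONS eligibility and ONS coverage."""
    return (in_linked_exposure_cohort(p) and p.ons_eligible
            and p.index_date >= ONS_START)


def in_consenting_practice_cohort(p: Patient) -> bool:
    """Cohort B: practice-level consent only, individual eligibility ignored."""
    return p.practice_consented


def in_all_data_cohort(p: Patient) -> bool:
    """Cohort A: every patient over the whole study period."""
    return True


early = Patient(date(1996, 6, 1), True, True, True)       # before HES coverage
ineligible = Patient(date(2003, 6, 1), True, False, False)
linked = Patient(date(2003, 6, 1), True, True, True)
```

Written this way, the hypothetical patient `early` falls only in cohorts A and B, `ineligible` likewise, and `linked` falls in all four, which shows how each relaxation of the filter adds patients and person time to the denominator.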
The CPRD Group has obtained ethical approval from a National Research Ethics Service Committee (NRES) for all purely observational research using anonymised CPRD data; namely, studies which do not include patient involvement (which is the vast majority of CPRD studies). Individual patient consent is not required for observational studies using anonymised CPRD data. Individual studies must be granted approval by the Independent Scientific Advisory Committee for MHRA database research (ISAC). This study was approved under protocol number 14/024.

Results
A total of 46,332 patients were identified with a record of VTE from either primary or secondary care between 1995 and 2009. The largest cohort was of those diagnosed in primary care (cohort 1: N = 36,216, 71%), of whom 3,320 (9%) were diagnosed before April 1997 when linked HES data became available and 6,539 (18%) were registered at practices in Wales, Scotland or Northern Ireland. Over 70% of patients had no obvious risk factor for VTE. In the HES data, 13,404 patients were identified with a diagnosis of VTE between 1997 and 2009 (cohort 4), of whom 3,288 (25%) had a matching record in primary care on the same date. After pooling the primary care and HES patients and restricting to those eligible for HES linkage and to the overlapping coverage period (1997-2009), the linked exposure cohort included 23,720 cases (cohort 2). The average follow up period for cohort 2 was shorter than for cohort 1 by 0.5 years (3.54 vs. 4.08 years) and patients were more likely to have a modifiable risk factor for VTE (28.7% vs. 21.3%). The linked exposure and outcome cohort was created by restricting to those eligible for ONS linkage and to the overlapping coverage period (1998-2009), including 22,639 cases (cohort 3). The average follow up period for cohort 3 was shorter than for both cohorts 1 and 2 at 3.39 years. The broadest cohort included all patients regardless of eligibility for linkage or whether the practice had consented (cohort A: all data cohort). This cohort included 46,296 patients, of whom 5,365 (11.6%) were diagnosed before mortality statistics were available from the ONS. The cohort of patients from consenting practices with a diagnosis in either primary care or HES included 26,320 patients (cohort B: consenting practices). A larger proportion of these were diagnosed in primary care (49% vs 43% in the other linked cohorts).
Table 1 compares patient characteristics between all cohorts and shows that the age, gender, smoking and BMI profile was similar for all cohorts. The mean age was 55-56 years and 55% of patients were female. Over half of the patients were overweight or obese at the time of diagnosis and one fifth were smokers. Table 2 shows that the mortality rates over the 2 years following a VTE diagnosis were lowest in the primary care cohort (108.4/1000py) and highest for patients diagnosed in secondary care (237.2/1000py). Including eligible patients diagnosed in either setting (linked exposure cohort) produced a mortality rate between the two (169.9/1000py); adding mortality outcome data from ONS (linked exposure and outcome cohort) resulted in an increase in the mortality rate (173.2/1000py). Restricting to practices participating in the linkage, but ignoring individual eligibility, resulted in a lower rate than in the linked exposure and outcome cohort (164.3 vs 173.2/1000py). The largest number of deaths (9,701) was attributed to the all data cohort; however, the mortality rate was the lowest of all the cohorts that included linked data (140.6/1000py). Fig 3 compares all cohorts over the two years. Table 3 shows that 85% of all fatal cases were in patients aged 60 or over. Older patients, especially the elderly, had an increased risk of mortality after adjusting for gender, lifestyle information (BMI, smoking status, alcohol use) and VTE risk category. A larger number of deaths were identified in women; however, after adjustment, the mortality risk was found to be higher in men. The relative rates of mortality by gender and age were comparable across the cohorts, although there was a lower mortality rate in the reference group for age (18-39) in the primary care cohort than in all of the others.
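As a reading aid for the adjusted relative rates, the Cox proportional hazards model can be written in its standard form. The symbols below are our own notation and do not appear in the original analysis: $h_0(t)$ is the unspecified baseline hazard, $x$ the covariate vector (age group, gender, BMI, smoking status, alcohol use, VTE risk category) and $\beta$ the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\mathrm{RR}_j = \exp(\beta_j)
```

Under this form, the RR for an age group is the hazard of death in that group relative to the reference group (18-39), holding the other covariates fixed. Because the RR is a ratio, it can be similar across cohorts even when the absolute rate in the reference group differs.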

Discussion
We looked at the impact of the choice of data source in estimating mortality following VTE and found considerable differences in the observed mortality rate across cohorts built using different sources, despite comparable relative rates by gender and age. All of the cohorts we examined were similar in terms of age, gender and lifestyle information; however, the mortality rate was two times higher for patients diagnosed in secondary care than for those diagnosed in primary care. Using linked data sources provided estimates between the two; however, the choice of denominator in linked populations also affected the estimates reported, with lower rates seen when individual eligibility was not considered and the whole study period was used. Fig 3 shows the variation in rates across all of the cohorts over two years, highlighting that the difference can be seen immediately following diagnosis. This suggests that the cohorts represented different populations.
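The size of these differences can be checked directly from the rates reported in the Results. The short calculation below uses only those published figures; the dictionary keys and helper names are our own.

```python
# Mortality rates per 1,000 person years as reported in the Results.
rates = {
    "primary care": 108.4,
    "secondary care": 237.2,
    "linked exposure": 169.9,
    "linked exposure and outcome": 173.2,
    "consenting practices": 164.3,
    "all data": 140.6,
}


def rate_ratio(cohort: str, reference: str) -> float:
    """Crude ratio of two cohort mortality rates."""
    return rates[cohort] / rates[reference]


def percent_lower(cohort: str, reference: str) -> float:
    """How much lower a cohort's rate is than the reference, in percent."""
    return 100.0 * (1.0 - rate_ratio(cohort, reference))


secondary_vs_primary = rate_ratio("secondary care", "primary care")
all_data_gap = percent_lower("all data", "linked exposure and outcome")
consenting_gap = percent_lower("consenting practices",
                               "linked exposure and outcome")
```

On these figures the secondary care rate is about 2.2 times the primary care rate, and the all data and consenting practice cohorts sit roughly 19% and 5% below the linked exposure and outcome cohort. These are crude ratios of reported rates, not adjusted estimates.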
We repeated a previously published study, modifying the study design to identify four cohorts that could feasibly be used to analyse mortality following VTE. We estimated the incidence rate for each cohort and the relative risk of mortality comparing those aged 18-39 to other age categories and comparing men to women. The primary care cohort represented the patient set and analyses that would be performed if only general practice records were available. Both the exposure (VTE) and outcome (death) were identified from primary care records and data were available for the full study period 1995-2009. The mortality rate (108.4/1000py) was comparable to a previously published study (97.8/1000py) [10], however there are potential limitations on the suitability of studying VTE in a primary care cohort and the availability and accuracy of the mortality information.
A pulmonary embolism is a potentially life-threatening condition with common symptoms including chest pains and a shortness of breath. Patients with such symptoms are more likely to attend the hospital and one third of patients with symptomatic VTE manifest pulmonary embolism [11]. In analysing only patients diagnosed in secondary care, we found a mortality rate of more than double that in the primary care cohort (237.2 vs. 108.4/1000py). The overlap between cohorts was relatively small (7%), suggesting they were very different patient populations. This study is in line with previous research highlighting how data sources do not always concur [12,13,14]. Movig et al (2002) compared clinical coded diagnoses from hospital with laboratory data and found a large number of patients (approximately 87%) identified via laboratory data did not have an associated coded diagnosis [14]. A study based on coded data may return a differing result to one based on laboratory results, much like we have found when comparing patients diagnosed in primary versus secondary care. Previous research has highlighted that case validity is improved by using a combination of linked data sources. Herrett et al (2013) used four linked data sources to analyse acute myocardial infarction and found that each individual data source missed a substantial proportion (25-50%) of MI events, suggesting that failure to use linked data is likely to lead to biased estimates [4]. A more representative cohort of VTE cases may therefore include exposures from both primary and secondary care. The linked exposure cohort included diagnoses from both data sources for the overlapping coverage period (1997-2009). The mortality rate (169.9/1000py) was roughly midway between the rates seen in the primary care (108.4/1000py) and secondary care (237.2/1000py) cohorts.
This may have addressed potential misclassification of the exposure, however misclassification of the outcome may persist through the use of primary care data for the outcome.
Although GPs are the gatekeepers to healthcare access in the UK and hold the longitudinal record for patients, they may not always be informed of a patient's death in good time, and in some cases not at all. Within the GP record, some patients may have multiple records relating to death on different dates without clarity on which is correct. The ONS mortality dataset includes information on the date and cause of death from civil registration records. All deaths in England and Wales have to be registered, usually within five days. With just one record per person and up-to-date recording, this provides the gold-standard data source for mortality information, which could reduce the level of misclassification of the outcome. The addition of ONS mortality data with the relevant coverage period (1998-2009) resulted in a smaller cohort (N = 22,639) with reduced person time (32,149) and an increased rate of 173.2/1000py. This cohort may have minimised misclassification; however, it is not clear whether combining primary and secondary care data sources evaluates two different populations, where the one comprising VTE cases presenting at hospital may be more severe.
Studies analysing linked data sources are frequently used in regulatory decision making, however previous publications have applied differing approaches in the definition of the denominator. With this in mind, we looked at the influence of not considering individual eligibility and data coverage periods. Linked data are not available for all patients in CPRD. Consent is given at the practice level and identifiers are sent to a trusted third party for linkage. Some registered patients may not be eligible for linkage if they have incomplete identifier information, withdraw consent or live outside of England. The consenting practice cohort included all patients from these practices regardless of eligibility. The cohort was necessarily larger than the linked cohorts, with more patients diagnosed in primary care (49%) and longer follow-up, presenting a lower mortality rate. The all data cohort explored eligibility further by including all patients and incorporating linked data where it was available; resembling the patient set and analyses reported in some previous publications [15,16,17]. This cohort primarily included patients from primary care (71%), and included longer follow-up, presenting a mortality rate 20% lower than the linked exposure and outcome cohort. This lower rate is not unexpected since 11.6% of the patients in this cohort were diagnosed before the start of coverage of the ONS data creating an immortal time bias. These sensitivity analyses highlight the importance of considering the eligibility in the methodology using linked data alongside the influence of the two different populations.
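The direction of this effect follows from simple arithmetic: person time during which a death cannot be captured adds to the denominator but never to the numerator. The sketch below uses purely hypothetical numbers, not study data, and the function names are our own.

```python
def rate_per_1000py(deaths: int, person_years: float) -> float:
    """Mortality rate per 1,000 person years."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * deaths / person_years


def diluted_rate(deaths: int, observed_py: float,
                 unobservable_py: float) -> float:
    """Rate after adding person time in which no death can be recorded."""
    return rate_per_1000py(deaths, observed_py + unobservable_py)


# Hypothetical cohort: 100 deaths over 1,000 person years of follow-up
# within the coverage period of the mortality data.
within_coverage = rate_per_1000py(100, 1000.0)

# Adding 200 person years from before coverage leaves the death count
# unchanged but inflates the denominator.
with_pre_coverage = diluted_rate(100, 1000.0, 200.0)
```

Here the rate falls from 100.0 to about 83.3 per 1,000 person years with no change in the underlying risk. The real cohorts differ in other ways as well (case mix, follow-up length), so this shows only the mechanism and not the size of the bias.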
There were certain limitations to this study. We did not confirm the VTE record or validate the death information; further analyses of the cause of death could highlight that some of the deaths were unrelated to VTE. In addition, cases fatal on the index date and those where the VTE was diagnosed as part of a post-mortem may not have been included in the study. We did not have access to data from outpatients or emergency visits. The secondary care cohort is limited to those linked to the primary care record rather than all patients presenting to hospital with VTE. It is important to note that the data sources are collected for use in routine clinical practice, for audit purposes or for the purpose of reporting national statistics rather than for research. The linked population was limited to England only; however, a previous comparison of those included in the linkage scheme to the whole CPRD population found no significant differences in patient characteristics [1], and similarly we found no substantial differences between the cohorts in this study (Table 1).
The size of the available cohort from CPRD makes the primary care EHR a viable choice for studying VTE; however, given the likelihood that more severe cases will present in secondary care, a linked population including patients presenting in secondary care may be a more appropriate source for analysis. Consideration of the influence of the denominator definition is important in studies of linked data, as restricting the study population to those eligible for linkage reduces the follow-up time, which is likely to have a large influence on small cohorts or those with rare outcomes. Future studies should consider the impact of the data source on case validity alongside the benefits of using linked data, together with the denominator definition. The choice of data source in this study demonstrated a substantial impact on the absolute rates of mortality following VTE. We used mortality following VTE as a learning case and expect that these findings may also apply to other disease/outcome pairs.