Assessing Hepatitis C Burden and Treatment Effectiveness through the British Columbia Hepatitis Testers Cohort (BC-HTC): Design and Characteristics of Linked and Unlinked Participants

Background: The British Columbia (BC) Hepatitis Testers Cohort (BC-HTC) was established to assess and monitor hepatitis C (HCV) epidemiology, cost of illness and treatment effectiveness in BC, Canada. In this paper, we describe the cohort construction, data linkage process, linkage yields, and comparison of the characteristics of linked and unlinked individuals.
Methods: The BC-HTC includes all individuals tested for HCV and/or HIV or reported as a case of HCV, hepatitis B (HBV), HIV or active tuberculosis (TB) in BC linked with the provincial health insurance client roster, medical visits, hospitalizations, drug prescriptions, the cancer registry and mortality data using unique personal health numbers. The cohort includes data since inception (1990/1992) of each database until 2012/2013 with plans for annual updates. We computed linkage rates by year and compared the characteristics of linked and unlinked individuals.
Results: Of 2,656,323 unique individuals available in the laboratory and surveillance data, 1,427,917 (54%) were included in the final linked cohort, including about 1.15 million tested for HCV and about 1.02 million tested for HIV. The linkage rate was 86% for HCV tests, 89% for HCV cases, 95% for active TB cases, 48% for HIV tests and 36% for HIV cases. Linkage rates increased from 40% for HCV negatives and 70% for HCV positives in 1992 to ~90% after 2005. Linkage rates were lower for males, younger age at testing, and those with unknown residence location. Linkage rates for HCV testers co-infected with HIV, HBV or TB were very high (90–100%).
Conclusion: Linkage rates increased over time related to improvements in completeness of identifiers in laboratory, surveillance, and registry databases. Linkage rates were higher for HCV than HIV testers, those testing positive, older individuals, and females. Data from the cohort provide essential information to support the development of prevention, care and treatment initiatives for those infected with HCV.


Introduction
Hepatitis C virus (HCV) is a major global public health problem with ~184 million people infected worldwide. In Canada, between 230,000 and 450,000 (0.66%-1.3%) people are infected with HCV [1]. Most were infected decades ago and are now increasingly presenting with HCV-related sequelae (e.g., cirrhosis, end-stage liver disease and liver cancer).
Although potentially curative HCV treatments have been available since 2000, uptake and effectiveness have been low. Newer, short-course, well-tolerated direct-acting antiviral therapies are highly effective (95%) in curing HCV but are very expensive [2][3][4][5][6]. Accurate and up-to-date knowledge about the current and future burden of disease, co-infections with HIV and HBV, health disparities, and cost of HCV-related illness relative to the cost and effectiveness of treatment is needed to inform public funding decisions for the newer antiviral drug therapies; the need for population level screening; and to prioritize resources for engaging infected individuals into care and treatment. Furthermore, there is a need for systems to monitor effectiveness and the overall impact of newly approved treatments on long-term outcomes.
The British Columbia (BC) Ministry of Health (MoH) approved a comprehensive linkage of public health surveillance and laboratory data with administrative healthcare data to create the BC Hepatitis Testers Cohort (BC-HTC). The purpose of the cohort is to assess and monitor HCV disease burden; HCV/HIV/HBV/TB co-infections; syndemics; disparities in testing and care; health care utilization; treatment uptake and completion; effectiveness of interferon-based and newer non-interferon-based treatments; cost of HCV-related illness and impact of treatment on illness-related costs and outcomes. This paper describes the cohort construction, data linkage process and yield, and comparison of the characteristics of linked and unlinked individuals. The linkage of multiple public health surveillance and laboratory databases included in the cohort is based on both deterministic and probabilistic linkage algorithms. Individuals in the cohort were deterministically linked with administrative health care databases using a unique personal health number (PHN). At each stage, some records with missing identifiers did not get linked and, hence, did not become part of the final dataset. Persons at the highest risk of HCV and/or HIV may be more likely to have missing identifying information needed for linkage, which may lead to selection bias and may affect the generalizability of results from the analysis of the cohort. Thus, it is important to understand the differences between those who met the cohort inclusion criteria and those who finally remained in the linked dataset, in order to understand biases affecting future analyses. This paper describes the linkage process and characteristics of linked and unlinked individuals, providing essential background for interpreting findings from this cohort.

BC Hepatitis Testers Cohort
The BC Hepatitis Testers Cohort (BC-HTC) includes all individuals who have been tested for HCV or HIV at the BC Public Health Laboratory (BCPHL) or who have been reported to public health as a case of HCV, hepatitis B (HBV), HIV or active tuberculosis (TB) in BC, Canada. The laboratory/public health surveillance data are linked with the BC MoH Client Roster, medical visits, hospitalizations, prescription drug data, cancer registry and mortality (Fig 1). The cohort includes a longitudinal linkage of records from inception (1990) of each dataset to 2012/13, with plans for annual updates thereafter. The resulting linked cohort includes data on more than 1.5 million individuals, about a third of BC's 4.6 million population.

Data access and ethics approval
Data linkages were approved by the data stewards of the BC Centre for Disease Control (BCCDC), the BCPHL, the BC MoH, the BC Vital Statistics Agency and the BC Cancer Registry. Approval for data linkage with the MoH datasets was granted under the BCCDC public health mandate to conduct surveillance and program evaluation. The study was reviewed and approved by the University of British Columbia Behavioral Research Ethics Board (No: H14-01649).

Case definitions and identification
Because an HCV positive case could be identified by multiple laboratory tests and/or public health case reports, algorithms were created to define HCV cases and assign a single diagnosis date. An individual testing anti-HCV positive, HCV RNA positive, genotype positive or who was reported as an HCV case in the Integrated Public Health Information System (iPHIS) was considered an HCV case in this cohort [7]. An individual included in the HIV/AIDS Information System (HAISYS), or who had positive HIV lab test results, was considered an HIV case based on provincial HIV laboratory test interpretation guidelines [8]. To capture additional HIV cases who may have been tested without nominal information, and hence may not have been captured from HAISYS or laboratory data, we also applied a previously validated algorithm: an individual with at least three Medical Services Plan (MSP) visits with an HIV diagnosis code or a hospital admission with an HIV diagnosis code was also considered an HIV case (Table 1 and S2 Fig) [9]. HBV and active TB diagnoses were based on provincial guidelines [10,11].
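The case definitions above can be read as simple boolean rules. The following is a minimal sketch under stated assumptions, not the cohort's actual implementation: the record layout and all field and function names (`PersonRecord`, `is_hcv_case`, `is_hiv_case`) are hypothetical, and the assignment of a single diagnosis date is not shown.

```python
from dataclasses import dataclass


@dataclass
class PersonRecord:
    """Hypothetical per-person summary of the evidence named in the text."""
    anti_hcv_positive: bool = False
    hcv_rna_positive: bool = False
    hcv_genotype_positive: bool = False
    in_iphis_hcv: bool = False          # reported as an HCV case in iPHIS
    in_haisys: bool = False             # included in HAISYS
    hiv_lab_positive: bool = False      # positive per lab interpretation guidelines
    msp_hiv_visits: int = 0             # MSP visits with an HIV diagnosis code
    hospital_hiv_admissions: int = 0    # admissions with an HIV diagnosis code


def is_hcv_case(p: PersonRecord) -> bool:
    # Any one of the four sources of evidence is sufficient.
    return (p.anti_hcv_positive or p.hcv_rna_positive
            or p.hcv_genotype_positive or p.in_iphis_hcv)


def is_hiv_case(p: PersonRecord) -> bool:
    # Surveillance or laboratory evidence, or the administrative-data rule:
    # at least three MSP visits OR at least one hospital admission with an HIV code.
    administrative = p.msp_hiv_visits >= 3 or p.hospital_hiv_admissions >= 1
    return p.in_haisys or p.hiv_lab_positive or administrative
```

For example, a person with two HIV-coded MSP visits and no other evidence would not be counted as an HIV case under this sketch, whereas a third visit or a single HIV-coded hospital admission would qualify them.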

Data sources
The following data sources are included in the cohort (Fig 1): Laboratory and public health surveillance databases. BCPHL Laboratory Information System: Greater than 95% of all anti-HCV tests and all HCV RNA testing and genotyping conducted in BC are performed at the BCPHL. This dataset includes all laboratory testing records of persons tested for HCV since 1992 and HIV tests performed since 1988. HCV data include antibody screening and confirmatory tests, genotype results and HCV RNA data (both qualitative and quantitative viral load testing).
iPHIS stores information on all cases of HCV, HBV and TB reported to public health in BC, including a small number of cases not tested at BCPHL.
The BCCDC Enhanced Hepatitis Strain Surveillance System (EHSSS) includes detailed risk factor data on a subset of acute HBV and acute HCV cases identified between 2000 and 2013.
HAISYS contains data on new diagnoses of HIV/AIDS reported to public health together with enhanced risk factor data. Of note, in BC, there is the option for non-nominal reporting of HIV in which a positive HIV test result is reported to public health with the client's initials only and without an address. About 1% of unique clients in HAISYS have chosen non-nominal reporting at least once.
Administrative healthcare databases. The Client Roster contains information on all BC residents enrolled in the publicly funded health care plan. Each person is assigned a unique PHN which serves as a unique identifier in all healthcare databases. In addition to demographic information, extracted data include the number of days registered with the plan and the residential six digit postal code for every year of registration.
MSP: MSP is BC's universal health insurance plan that reimburses medically required services provided to individuals by fee-for-service practitioners. MSP data includes all encounters and associated International Classification of Disease (ICD) 9 diagnostic codes. MSP records on linked cohort members are available since January 1, 1990 [12].
Discharge Abstract Database (DAD): DAD compiles data on all hospitalizations in BC and records on linked cohort members are available from April 1, 1985 [13].
PharmaCare/PharmaNet: PharmaNet is the province-wide network managed by BC MoH and the College of Pharmacists of BC that links pharmacies to a central data system in which prescription drugs dispensed in BC are recorded. PharmaCare is the BC public prescription drug insurance plan that assists BC residents in paying for eligible prescription drugs. PharmaCare data range from January 1, 1985 to December 31, 1995. PharmaNet data start from January 1, 1996 and include the types of data previously recorded in PharmaCare [14].

Linkage of public health surveillance, laboratory and administrative datasets
The BC-HTC was created within the Public Health Reporting Data Warehouse (PHRDW) at BCCDC. PHRDW is a public health data warehouse for linking BCPHL and BCCDC datasets to provide role-based access to data and summary reports. The PHRDW data linkage algorithm was used for patient-matching to identify unique individuals within and across datasets, including HCV and HIV laboratory data, HIV cases from HAISYS, and HCV, HBV and TB cases from iPHIS. There are three levels of matching within PHRDW, corresponding to different levels of linkage certainty. Level 1 corresponds to a perfect match (PHN, First Name + Last Name + Date of Birth + Gender), level 2 to a strong match (PHN, First Name + Last Name + Date of Birth checked), and level 3 to a moderate match (PHN + Date of Birth checked). For a level 3 match, the date of birth check means that the birth year is the same for all records in a matched set, although differences in month or day are allowed. When PHN is not available for a record, matching is based on First Name + Last Name + Date of Birth + Gender. For BC-HTC generation, level 3 matching was used. At this stage, a unique BCCDC ID was generated and assigned to each linkable individual in the cohort. The validity of this approach was assessed when the warehouse was developed [17]. A cohort linkage file was generated from the linked dataset. Records with missing or invalid PHNs and missing demographic information were deemed unlinkable and were not sent for matching with the Client Roster at MoH.
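The three matching levels can be illustrated with a pairwise comparison of two records. This is a sketch under stated assumptions rather than the PHRDW algorithm itself: the `Rec` type and `match_level` function are hypothetical, names are compared case-insensitively (an assumption), and levels 1 and 2 are assumed to require an exact date of birth, since the text specifies the relaxed birth-year check only for level 3.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Rec:
    phn: Optional[str]   # personal health number; None when not recorded
    first: str
    last: str
    dob: date
    gender: str


def _same_name(a: Rec, b: Rec) -> bool:
    # Assumption: names are compared after trimming and lower-casing.
    norm = lambda s: s.strip().lower()
    return norm(a.first) == norm(b.first) and norm(a.last) == norm(b.last)


def match_level(a: Rec, b: Rec) -> Optional[str]:
    """Return the strongest match level for a pair of records, or None."""
    names = _same_name(a, b)
    if a.phn is None or b.phn is None:
        # Without a PHN, matching falls back to name + date of birth + gender.
        ok = names and a.dob == b.dob and a.gender == b.gender
        return "demographic" if ok else None
    if a.phn != b.phn:
        return None
    if names and a.dob == b.dob and a.gender == b.gender:
        return "level1"   # perfect
    if names and a.dob == b.dob:
        return "level2"   # strong
    if a.dob.year == b.dob.year:
        return "level3"   # moderate: same PHN and birth year
    return None
```

Under this reading, accepting level 3 matches (as was done for BC-HTC generation) tolerates misspelled names and errors in the day or month of birth, provided the PHN and birth year agree.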
The cohort linkage file, including the BCCDC ID, PHN and demographic information, was sent to the MoH. At the MoH, records were matched with the Client Roster using PHN, date of birth and gender. A MoH study ID was then appended to all matched records and a crosswalk file linking the BCCDC ID and the MoH study ID was created. Each MoH data source and the BC Cancer Registry (BCCR) then extracted content data on the matched individuals, leaving only the MoH study ID as a unique ID. The crosswalk file was sent to the BCCDC to be used to append the MoH study ID to BCCDC content data (HCV, HIV, HBV, and TB disease status). At this stage, the BCCDC ID and other identifiers were removed from the BCCDC content data. The MoH and BCCR data with only the MoH study ID were sent to BCCDC. The only unique identifier shared across all datasets is the MoH study ID; the linked data in the BC-HTC do not include any personal identifiers. The MoH study ID will be used for further linkages and annual updates (S1 Fig).
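The crosswalk step described above can be sketched as two small functions: one standing in for the roster match at the MoH, and one that re-keys content data to the study ID and drops identifiers. All names, field layouts and the study ID format are hypothetical, and the set of fields treated as identifiers is an assumption made for illustration.

```python
import secrets


def match_with_roster(linkage_file, roster):
    """Build a crosswalk {bccdc_id: study_id} for records found in the roster.

    A record matches when its PHN exists in the roster and the date of birth
    and gender agree (the three fields named in the text).
    """
    crosswalk = {}
    for rec in linkage_file:
        entry = roster.get(rec["phn"])
        if entry and entry["dob"] == rec["dob"] and entry["gender"] == rec["gender"]:
            # Random study ID with no relationship to the PHN or BCCDC ID.
            crosswalk[rec["bccdc_id"]] = "S" + secrets.token_hex(8)
    return crosswalk


def deidentify(content_rows, crosswalk,
               identifier_fields=("bccdc_id", "phn", "first_name", "last_name")):
    """Re-key content rows to the study ID and drop identifying fields."""
    out = []
    for row in content_rows:
        study_id = crosswalk.get(row["bccdc_id"])
        if study_id is None:
            continue  # unmatched individuals do not enter the linked cohort
        clean = {k: v for k, v in row.items() if k not in identifier_fields}
        clean["study_id"] = study_id
        out.append(clean)
    return out


linkage_file = [
    {"bccdc_id": "B1", "phn": "9000000001", "dob": "1970-05-01", "gender": "F"},
    {"bccdc_id": "B2", "phn": "9000000002", "dob": "1980-01-15", "gender": "M"},
]
roster = {"9000000001": {"dob": "1970-05-01", "gender": "F"}}
content = [
    {"bccdc_id": "B1", "phn": "9000000001", "hcv_status": "positive"},
    {"bccdc_id": "B2", "phn": "9000000002", "hcv_status": "negative"},
]

crosswalk = match_with_roster(linkage_file, roster)
linked = deidentify(content, crosswalk)
```

In this toy run only the first individual is found in the roster, so `linked` holds a single row carrying the disease status and the study ID, with the PHN and BCCDC ID removed.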

Follow-up
The current BC-HTC dataset provides testing, co-infection, comorbidity and outcome data for about 20 years (1992-2012/13) on individuals tested in 1992 and a shorter follow-up period for those tested thereafter. Annual updates of the linked dataset through the same linkage process will allow inclusion of new testers and follow-up of individuals already included in the cohort.

Data management and storage
The linkage produced a large dataset with various subsets exceeding 100 gigabytes. We selected an SQL server-based relational database where the various datasets are stored in separate tables and joined together through a unique key (Study ID). This system allows creation of data views which present a subset of data for analysis, thus enhancing analysis speed and efficiency. It also facilitates construction of a robust security system where analysts' access to the dataset is tailored to their needs. The dataset is accessed for analysis by connecting analysis software such as SAS or R through an Open Database Connectivity (ODBC) connection.
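The idea of separate tables joined on the study ID and exposed through a view can be shown with a small, self-contained sketch. Python's built-in `sqlite3` with an in-memory database is used here purely as a stand-in for the SQL server and ODBC setup described above; the table, column and view names are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the cohort's relational store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hcv_status (study_id TEXT PRIMARY KEY, hcv_positive INTEGER);
CREATE TABLE hospitalizations (study_id TEXT, admit_year INTEGER);

INSERT INTO hcv_status VALUES ('S1', 1), ('S2', 0);
INSERT INTO hospitalizations VALUES ('S1', 2005), ('S1', 2010), ('S2', 2008);

-- A view presents only the subset an analyst needs: admissions of HCV positives.
CREATE VIEW hcv_positive_admissions AS
    SELECT h.study_id, h.admit_year
    FROM hospitalizations AS h
    JOIN hcv_status AS s ON s.study_id = h.study_id
    WHERE s.hcv_positive = 1;
""")

rows = conn.execute(
    "SELECT study_id, admit_year FROM hcv_positive_admissions ORDER BY admit_year"
).fetchall()
```

Here `rows` contains the two admissions for the HCV positive individual. Granting an analyst access to the view rather than to the underlying tables is one way that access could be tailored, although the source does not describe how its security model is implemented.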

Linkage yield
There were 2,656,323 individuals available from all data sources contributing to the BC-HTC. Of these, 1,427,917 (54%) were sent to the MoH for linkage with the Client Roster. The remaining 1,228,204 individuals either did not have valid PHN or tests were submitted non-nominally for HIV testing. Of those sent to the MoH for linkage a very small percentage (n = 19,166; 1.3%) did not match with the Client Roster (Fig 2) due to mismatch between PHN recorded in the laboratory or surveillance data and the Client Roster. In the linked cohort about 1.15 million individuals were tested for HCV and 1.02 million for HIV.
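The headline percentages in this paragraph follow directly from the reported counts. The short check below uses only figures stated in the text; the helper name `pct` and the variable names are illustrative.

```python
def pct(part: int, whole: int, digits: int = 0) -> float:
    """Percentage of `part` in `whole`, rounded to `digits` decimal places."""
    return round(100 * part / whole, digits)


total_individuals = 2_656_323   # all individuals across contributing data sources
sent_for_linkage = 1_427_917    # sent to the MoH for linkage with the Client Roster
unmatched_at_moh = 19_166       # sent but not matched with the Client Roster

share_sent = pct(sent_for_linkage, total_individuals)          # 54.0
share_unmatched = pct(unmatched_at_moh, sent_for_linkage, 1)   # 1.3
```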
The percentages of individuals within each dataset with valid data available for linkage and sent to the MoH for linkage are presented in Table 2. The percentage sent for linkage was >85% for HCV testing and case data, while it was ~50% for HIV testing and 35% for HIV case data. The linkage yield was 86% for HCV test data, 89% for hepatitis case data, 95% for active TB cases, 48% for HIV test data and 36% for HIV case data (Table 2). The lower linkage of HIV data relates mainly to the lack of PHN and other identifiers needed for linkage during the earlier years of the HIV epidemic. PHNs were not recorded regularly in the laboratory information system prior to 2006. Availability of PHN and other demographic data improved after HIV became reportable in BC in 2003.
In defining and identifying HIV positive cases, 6,955 cases were identified from HAISYS, 7,427 from lab data, and 9,983 using MSP and DAD, for a total of 11,025 HIV cases (S2 Fig). Most of the cases (5,795, 53%) were present in all 3 sources; 932 (8%) were present in lab data and HAISYS; 666 (6%) in lab data and MSP/DAD; and 197 (2%) in HAISYS and MSP/DAD, while 3,325 (30%) were recorded only in MSP/DAD.

Loss/completion rate by year
In the earlier years, linkage rates were low for both HCV and HIV (Fig 3 and Fig 4). For HCV negatives, the linkage rate increased from 40% in 1992 to 84% in 1996 and stayed in this range until 2005, after which the rate was ~90%. For HCV positives, the linkage rate gradually increased from 72% in 1992 to 90% in 1999 and remained at ~90%, with the exception of 2000-2005 when it dropped to 70% (Fig 3). The HIV linkage rate followed a similar trend, except that the linkage rate for HIV negatives was 10% in 1992, gradually increased to 50% in 2005 and thereafter stayed at ~80%. For HIV positives, the linkage rate was more variable, starting at 23% in 1992, increasing to 64% in 2000 and then ranging between 69% and 83% until 2013, with the exception of 2004 when it dropped to 49% (Fig 4). When HCV cases were defined based on anti-HCV testing

Characteristics of cohort and comparison with those not linked
Hepatitis C Testers. There was a difference in the distribution of linked and unlinked HCV cases by year of first test ( Table 3). The highest proportion of linked HCV cases (34%) was tested during 1996-1999 while the highest proportion of unlinked cases (50.5%) was tested from 2000-2005. Among HCV negative testers, the highest proportions of linked and not linked testers were from recent years (2006-2013: 50.5% vs. 34.2%) and the lowest were from the earliest years (1992-1995: 2.8% vs. 14.7%). Most of the RNA tests (70%) in the linked group were performed in 2006-2013 while in the unlinked groups the majority (65%) were performed in 2000-2005 ( Table 3).
In HCV positives, year of birth was unknown for 8.7% of unlinked vs. <1% of those linked, while in HCV negatives the proportions were 4% and <1%. The linkage rate among those with unknown birth year was 1% in positives and 0.3% in negatives. For HCV positives, unlinked individuals were slightly older at the time of their first HCV test (median age: 42.6 vs. 41.8 years), while HCV negative unlinked individuals were younger than linked individuals (median age: 33.9 vs. 37.2 years). Age at first HCV positive test was similar between linked and unlinked. Among HCV negatives, the linkage rate increased with age, while among HCV positives, the linkage rate was highest among the youngest (<25 years) and the oldest groups (≥55 years) and lowest in those aged 45-54 years (78.7%). The proportion with unknown gender was higher among unlinked than linked individuals for both positives (4.5% vs. 0%) and negatives (5% vs. 0%). In HCV negatives, the proportion of males in unlinked was higher than in linked (53% vs. 45%), while in HCV positives it was slightly lower in unlinked (63.9% vs. 65.2%). Those with unknown health region of residence at the first positive test constituted a larger proportion of unlinked compared to linked (30% vs. 0%), with a similar trend in HCV negatives (13.6% vs. 0%). The largest proportion of unlinked HCV cases was from the Vancouver Coastal health region (VCH) (35%), while in the linked group the highest proportion of cases was from Fraser health region (30%). Similar trends were observed in HCV negatives.
Linkage rates for those HCV positive and co-infected with HIV (92%), HBV (93%) and TB (99.6%) were higher than for those not co-infected. Similarly, for HCV negatives, linkage rates were higher for those with any other infection than those without another infection (  The linkage rate was highest among those born after 1975 for both HIV positives (67%) and negatives (66%). Most of the unlinked HIV positives (81%) were born in 1945-1974 and a similar proportion (85.2%) of linked HIV positives were also born during this time. In unlinked HIV negative testers, ~73% were born in 1945-1974 compared to 52.3% among linked HIV negatives. In HIV positives, at the time of the first HIV test, unlinked individuals were slightly younger than linked individuals (median age 34.6 vs. 35.6 years), while for HIV negatives, the age distribution was similar between linked and unlinked individuals (31.1 vs. 30.7 years). Age at the first HIV positive test, and age at the last HIV negative test among HIV negatives, were higher for linked compared to unlinked individuals (HIV positives: 37.6 vs. 35.1; HIV negatives: 34.6 vs. 31.1 years). The majority of HIV positives in the linked (79%) and unlinked groups (84.4%) were male, with an additional 4.3% with unknown gender in the unlinked group. In HIV negatives, there were more females (60.4% vs. 52.7%) and fewer individuals with unknown gender (<0.1% vs. 6.6%) in the linked compared to unlinked groups. In HIV positives, health region was not known for 8.6% of the linked and 80.8% of the unlinked groups. However, in the linked group, residence location for further analyses is available from the Client Roster. Among both HIV positives and negatives who also had another infection, the linkage rate was very high (Table 4). Since HIV linkage rates were higher after 2005, related to reportability and better recording of PHNs, we present characteristics of linked and unlinked individuals who were tested after 2005 in S1 Table.
As expected, younger birth cohorts (born 1965-74 and 1975) represented most of the positives and negatives. However, their linkage rates were lower compared to older birth cohorts tested during this time frame. In HIV positives, the age group 25-34 years had a lower linkage rate at first test and HIV diagnosis compared to other age groups (69% vs. >80%). Gender distribution was similar to the entire cohort. Among those with a known health region, the linkage rate by region was >90% in positives and improved considerably in negatives to >80% for all regions except the VCH, where the linkage rate was 72%.

Discussion
The BC-HTC is likely one of the most comprehensive datasets of HCV and HIV cases and testers linked with medical visits, prescription drugs, hospitalizations, cancers and deaths. The data will facilitate an array of population level analyses related to burden of disease, natural history of infection, co-infections and disease syndemics, health care utilization, costs of illness, and impact of HCV treatment on illness costs, morbidity and mortality. It will allow distinguishing the effects of risk activities and other confounders from the effects of HCV infection to inform policy and programs related to HCV and other diseases included in the dataset.  This paper summarises linkage rates over time and by characteristics of linked and unlinked individuals which will assist in interpretation of future analyses. We found that linkage rates increased over time and were much higher for HCV than HIV testers and for those testing positive, older individuals, females and those not from Vancouver.
The linkage rate for HIV test and case data was low and the linkage yield was much lower in earlier (before 2005) than in recent years (after 2005). During earlier years, many tests were submitted for testing without a PHN or non-nominally and it was not possible to link these data even using probabilistic linkage due to missing identifiers. Linkage rates improved when HIV was made reportable in BC in 2003 and when PHN was recorded for patients who tested nominally on a routine basis since 2006 [18]. The linkage rate for HCV was much higher than  Table 3). This is in agreement with a smaller drop in the linkage rate when using anti-HCV tests or HCV case definition alone, rather than the combination of anti-HCV and RNA (S2 Fig).
Other characteristics that were related to lower linkage rates for both HCV and HIV were younger age, male sex and residence in Vancouver. For HCV, this may reflect that the population in Vancouver's Downtown Eastside (DTES), which includes a large proportion of people who inject drugs (PWID) and are homeless, may not have been captured in the linked dataset due to missing identifiers. For HIV, this may reflect men who have sex with men (MSM) living in Vancouver, who may opt for non-nominal testing due to HIV related stigma [19,20]. Various studies have reported that MSM and youth are more likely to opt for non-nominal HIV testing where available [21][22][23][24]. People with high risk behaviors are typically tested repeatedly over time and an individual may have both nominal and non-nominal tests in the dataset. As the identifying information used for linkage became more complete in recent years, testing records for some of these individuals who were not linked may have been included in the linked group, but we are unable to verify that. Of note, we identified additional HIV cases who were not captured through laboratory/surveillance databases (30% of all cases) by using medical visit and hospitalization data (S2 Fig). Other studies have also reported lower linkage rates among socially disadvantaged groups due to incomplete records [25,26]. A study in England and Wales identified poor reporting of HCV data, especially from people at higher HIV risk presenting to genitourinary medicine clinics in London [26]. In this study, 68% of HCV cases were matched with HIV cases reported to the surveillance system between 1996 and 2003. Similar to our datasets, HCV cases which were matched increased from 50% in 1996 to 78% in 2003 due to an increasing completion rate of identifying information. Other studies linking various datasets, especially those for HCV and HIV have reported similar or lower linkage rates. 
In Scotland, 10% of HCV cases between 1995 and 2005 could not be linked to mortality records due to missing identifiers [27]. In a recent linkage of HCV diagnosis with a clinical database to assess engagement with care, the linkage rate based on identifiers was 89%, similar to our findings [28]. The linkage rate in a study between a blood recipient file and the Nova Scotia Health Card registration file using a probabilistic linkage was 65% [29]. In another study in New Zealand, 11% of records of deceased prisoners could not be linked with the national death registry, while in Australia the linkage rate for HIV deaths with the national death index with optimal sensitivity and specificity was 82% [30,31]. The linkage rate of various administrative healthcare datasets included in the BC Linked Health Datasets (now called Population Data BC), not including laboratory and surveillance data, was >95% [32]. In the BC-HTC, the linkage rate with the MoH Client Roster was >99% (Table 2). Linkage studies from other subject areas have also reported linkage rates in the range of 50-60%, while others have reported very high rates of 80-90% [33][34][35][36]. Put in perspective, our linkage using a highly specific criterion of the PHN aided by date of birth and names yielded a rate of more than 80% for HCV, which was higher in recent years (90%), although linkage rates for HIV were low.
Non-linkage of records has implications for results of future analyses, with the likely impact depending on whether individuals with certain characteristics are more or less likely to be linked and whether these characteristics are associated with a higher risk of HCV or HIV acquisition or outcomes. Our results show that in the unlinked HCV population, those of younger age (most of whom tested negative), males and those from VCH were over-represented. This suggests that males with high risk behaviours, likely to be PWID from the DTES of Vancouver, were not linked at the same rate as others and will likely be under-represented in the linked cohort; where they were represented, some of their laboratory tests were not captured. As the population in the DTES is marginalized and the rate of HCV infection is high, their non-inclusion may lead to an underestimation of disease burden. This could also impact assessment of disparities in testing, treatment uptake, treatment completion and response rates. As noted above, people with high risk behaviors are tested repeatedly and as a result some records of unlinked individuals may have been included in the linked dataset. This will counterbalance the exclusion and reduce any underestimation of burden; however, we are unable to quantify the magnitude of inclusion of the same individuals in both the unlinked and linked groups. Geographic analysis of the linked dataset with overlaying socioeconomic and marginalization indices will provide further insight into differences in testing and infection rates.
Similar to HCV, the unlinked HIV population (both positive and negative) was characterized by residence in Vancouver and male or unknown gender. Since large numbers of HIV cases from earlier years could not be linked due to missing identifiers, the overall burden of HIV and co-infection rates will be underestimated in earlier years. However, recovery of additional cases identified from medical visits and hospitalization databases may partly offset the lower linkage rate. Non-linkage still affects HIV negative testers, for whom we did not have additional data sources for recovery, thus under-representing testing in marginalized populations.

Limitations and gaps
Currently, we are not able to identify immigrants and Aboriginal peoples in the cohort. During the past few decades there has been increasing immigration from HBV and HCV endemic countries to BC and Canada, and the prevalence of HBV and HCV is likely higher in these groups compared to the rest of the Canadian population. In future, linkages with immigration records and other techniques for characterizing ethnicity, such as name recognition, would improve data on immigrants. Similarly, linkage with databases of Aboriginal peoples, such as the First Nations client registry, would improve characterization of HBV and HCV disease burden in this population.
In summary, the BC-HTC links together multiple datasets on a large number of people in BC to enable assessment of HCV, HIV and other co-infections and outcome trends; HCV treatment uptake, completion, and effectiveness; cost of illness and effects of treatment on the cost of illness; and disparities in disease risk, outcomes and treatments to inform policy and programming.