The Danish National Lymphoma Registry: Coverage and Data Quality

Background. The Danish National Lymphoma Register (LYFO) prospectively includes information on all lymphoma patients newly diagnosed at hematology departments in Denmark. The validity of the clinical information in the LYFO has never been systematically assessed.
Aim. To test the coverage and data quality of the LYFO.
Methods. The coverage was tested by merging data of the LYFO with the Danish Cancer Register and the Danish National Patient Register, respectively. The validity of the LYFO was assessed by crosschecking with information from medical records in subgroups of patients. A random sample of 3% (N = 364) was made from all patients in the LYFO. In addition, four subtypes of lymphomas were validated: CNS lymphomas, diffuse large B-cell lymphomas, peripheral T-cell lymphomas, and Hodgkin lymphomas. A total of 1,706 patients from the period 2000–2012 were included. The positive predictive values (PPVs) and completeness of selected variables were calculated for each subgroup and for the entire cohort of patients.
Results. The comparison of data from the LYFO with the Danish Cancer Register and the Danish National Patient Register revealed a high coverage. In addition, the data quality was good with high PPVs (87% to 100%), and high completeness (92% to 100%).
Conclusion. The LYFO is a unique, nationwide clinical database characterized by high validity, good coverage and prospective data entry. It represents a valuable resource for future lymphoma research.


Introduction
Each year approximately 1,300 patients are diagnosed with malignant lymphoma in Denmark. The incidence has increased for decades, from 15/100,000 in 2000 to 23/100,000 in 2013, corresponding to an average annual increase of 4.1% [1]. Lymphomas are heterogeneous diseases with more than 40 histologies identified in the current WHO classification. The diversity in disease behaviour, from highly aggressive subtypes to very indolent forms, has led to the development of several prognostic indices to enable optimal prognostication of the different subgroups [2][3][4][5]. While randomized trials remain the gold standard for assessing the effect of an intervention such as a new treatment, these studies are often characterized by strict inclusion criteria and do not necessarily reflect the outcomes of all patients.
Since 1942, the Danish Cancer Register (DCR) has collected information on diagnosis, disease stage and mortality from all new cancer patients in Denmark. Although Danish medical registries are generally known to be complete and accurate [6], they lack information on clinical and para-clinical characteristics as well as information on the specific treatment and outcome. The Danish National Lymphoma Registry (LYFO) was established in 1982 to monitor the quality of lymphoma treatment in Denmark. It contains information on clinical and paraclinical data at diagnosis as well as treatment details and outcome. In 2000 the database was implemented nationwide and now contains data from more than 23,000 lymphoma patients as a result of its high coverage [1,7]. The LYFO has been utilized for clinical research in several studies [8][9][10][11]; however, a systematic evaluation of the validity of the data has never been undertaken.
Population-based databases like the LYFO are powerful research tools for several reasons: i) data are readily available, ii) a large number of patients is included, iii) they enable studies of rare lymphomas, and iv) their cost is low. In contrast to randomized trials, the data are easily obtainable and can be utilized at minimum cost. The risk of bias is limited [12]; however, the data collection and quality are not necessarily controlled by the investigator, and registry data should be validated before they are used for research purposes [13].
The main objective of this study was to evaluate the coverage of the LYFO and the quality of the entered data by crosschecking with the information from medical records.

Materials and Methods
The Danish hematology services are publicly funded and provide equal access to healthcare for all citizens regardless of social status or income. Patients diagnosed with lymphoma are referred to the nearest university hospital or local community hospital with a hematology department.
report forms. Table 1 shows the variables registered at different time points. To safeguard against missing information, requests are sent to the local departments when patients with newly diagnosed lymphoma according to the Danish National Patient Register (DNPR) have not been reported to the LYFO, or when critical data, such as treatment and/or relapse, are missing. Some of the critical data are validated against other Danish registries.

Central Registries
The Danish National Patient Register (DNPR). The DNPR was established in 1977 and is currently run by the Danish National Board of Health. It has captured data on all hospital admissions since it was founded, and since 1995 data on outpatient visits have also been included. The variables in the DNPR cover administrative and clinical information, such as the unique civil registration number assigned to all Danish residents (CPR number), identification of hospital ward and the diagnosis at discharge [14,15]. A linkage to The Civil Registration System tracks changes in vital status and residential area daily. The CPR number allows unique linkage of records between all medical registries in Denmark [16].
The Danish Cancer Register (DCR). The DCR is a population-based register containing data on the incidence of cancer in Denmark since 1943. Reporting to the register was made mandatory in 1987. The register contains information on the date of diagnosis, topography, histological characteristics, and disease stage [17]. Since 2004, the data are captured electronically from the DNPR and from the Danish Pathology Register.
The Danish Pathology Register (DPR). The DPR, established in 1997, has nationwide coverage and includes detailed information on all pathological specimens. To ensure correctness, a pathologist approves all diagnostic descriptions. Topography, morphology and specimen type are coded according to a modified version of the SNOMED classification [18]. The Danish Pathology Data Bank is accessible at most hospitals and offers instantly updated nationwide information on pathological investigations and diagnoses [19].
Researchers can apply for data extraction when the request is approved by the Danish Data Protection Agency and The Danish Patient Safety Authority or the National Committee on Health Research Ethics. The Danish Health Authorities receive applications regarding the DNPR and the DCR [20]. Data extraction from the DPR is performed by the DPR [21]. Applications for data extraction from the LYFO are sent to the Danish Clinical Registries [22].

Methods
In order to evaluate the coverage of the LYFO, all patients recorded with lymphoma in the period 2000-2011 in the LYFO, the DCR and the DNPR were extracted. The coverage of the LYFO was tested against the DCR and the DNPR, respectively, using the capture-recapture method [23]. Patients diagnosed in 2012 were excluded from the coverage calculation since a delay in the data delivery for the last year of diagnosis was observed when comparing the LYFO to the DCR and the DNPR. All pathology reports for patients not registered in the LYFO were reviewed in the Danish Pathology Data Bank to ensure the accuracy of the diagnosis.
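The linkage step described above can be illustrated with a minimal Python sketch. It cross-tabulates two registries by a shared patient identifier and computes a two-source capture-recapture estimate. The source does not state which estimator was used; the Lincoln-Petersen form shown here is an assumption, and it presumes that the two sources capture patients independently, which may not hold for registries that feed each other. The function names and the toy identifiers are hypothetical.

```python
from typing import Dict, Set


def compare_registries(a: Set[str], b: Set[str]) -> Dict[str, int]:
    """Cross-tabulate two registries by a shared patient identifier."""
    return {
        "both": len(a & b),    # patients found in both registries
        "only_a": len(a - b),  # patients found only in registry A
        "only_b": len(b - a),  # patients found only in registry B
        "union": len(a | b),   # unique patients across both registries
    }


def lincoln_petersen(n_a: int, n_b: int, n_both: int) -> float:
    """Two-source capture-recapture estimate of the total population size.

    Assumption: the estimator is not named in the source; this is the
    textbook Lincoln-Petersen form, N = n_a * n_b / n_both.
    """
    if n_both == 0:
        raise ValueError("no overlap between sources; estimate is undefined")
    return n_a * n_b / n_both


# Toy identifiers standing in for CPR numbers (entirely made up).
lyfo_ids = {"p01", "p02", "p03", "p04", "p05", "p06"}
dcr_ids = {"p03", "p04", "p05", "p06", "p07", "p08"}

counts = compare_registries(lyfo_ids, dcr_ids)
estimate = lincoln_petersen(len(lyfo_ids), len(dcr_ids), counts["both"])
print(counts, estimate)
```

In practice the estimate is only a starting point: patients found in one source alone still need individual review, for example of their pathology reports, before they are counted as true cases.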
To validate the quality of the data in the LYFO, a 3% random sample was drawn from the patients diagnosed with lymphoma in the period 2000-2012 and recorded in the LYFO. Patients with small lymphocytic lymphoma were excluded due to the overlap with the diagnosis of chronic lymphocytic leukaemia, and patients with the sole diagnosis of cutaneous lymphomas were excluded since these entities are usually treated by dermatologists in Denmark. Eleven variables of prognostic importance were selected for the validation of the LYFO. In addition to the random sample, a further 1,371 patients from ongoing studies of CNS lymphomas (N = 371), diffuse large B-cell lymphomas (DLBCL) (N = 164), peripheral T-cell lymphomas (PTCL) (N = 141) [24], and Hodgkin lymphomas (N = 695) were included for validation purposes. For these four subgroups, the date of diagnosis, histological subtype, ECOG performance status and Ann-Arbor stage were validated. Additional subgroup variables were included, e.g. treatment, relapse and other test results (Table 2).
These variables were validated by local review of the individual medical files of all 1,760 patients (e.g. paper records, electronic records, DPR and clinical laboratory information system). Information obtained from these documents was considered the "gold standard" against which data entered in the LYFO were compared. Variables selected for validation were labelled 'consistent' (if the local medical files and the LYFO had similar values), 'inconsistent' (if the local medical files and the LYFO did not have similar values), or 'missing value' (if data were missing in the LYFO or in the local medical file).
If a patient had missing data on one of the variables to be validated (information missing from the medical record or information missing in the LYFO), the patient was excluded from the validation analysis of that specific variable. Patients could only be validated once: if a patient was chosen for the random sample validation, the same patient was not validated in the subtype validation. Positive predictive values (PPVs) were estimated for variables in each subgroup and in the combined cohort. PPVs were calculated as the number of patients with correct registration divided by the number of patients registered [12,25]. Completeness was likewise estimated for variables in each subgroup and in the combined cohort, and was calculated as the number of patients registered divided by the number of patients with information regarding the variable in the medical record. SAS 9.3 statistical software was used for the statistical analyses. Patient information was deidentified prior to analysis according to standards of the Danish Data Protection Agency.
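The two measures defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SAS code used in the study: each patient contributes a pair of values for one variable (LYFO value, medical record value), `None` marks a missing value, and patients without a medical record value are left out of both denominators. The function name and the example stage values are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (value in the LYFO, value in the medical record); None means missing.
Pair = Tuple[Optional[str], Optional[str]]


def ppv_and_completeness(pairs: Sequence[Pair]) -> Tuple[float, float]:
    """Return (PPV, completeness) for one validated variable.

    PPV          = correct registrations / registrations in the LYFO
    completeness = registrations in the LYFO / patients with the
                   information available in the medical record
    Assumption: patients lacking the gold-standard value are excluded.
    """
    in_record = [(lyfo, rec) for lyfo, rec in pairs if rec is not None]
    registered = [(lyfo, rec) for lyfo, rec in in_record if lyfo is not None]
    if not in_record or not registered:
        raise ValueError("not enough data to estimate PPV and completeness")
    correct = sum(1 for lyfo, rec in registered if lyfo == rec)
    return correct / len(registered), len(registered) / len(in_record)


# Made-up Ann-Arbor stage values for six hypothetical patients.
stage_pairs = [
    ("II", "II"),    # consistent
    ("IV", "IIE"),   # inconsistent
    ("I", "I"),      # consistent
    (None, "III"),   # missing in the LYFO
    ("II", None),    # missing in the medical record (excluded)
    ("III", "III"),  # consistent
]
ppv, completeness = ppv_and_completeness(stage_pairs)
print(ppv, completeness)
```

With these made-up pairs, three of the four registered values are correct and four of the five patients with a record value are registered.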

Ethics
Registration in the LYFO is compliant with Danish regulations and approved by the National Board of Health and the Danish Data Protection Agency. Establishment of an additional database for the present validation study was approved by the Danish Data Protection Agency.

Coverage analysis
The LYFO contained information on 11,362 patients, whereas the DCR included information on 11,473 patients. Altogether, the two registries contained information on 12,234 unique patients. Seven percent of the patients (n = 872) were only registered in the DCR, and six percent of the patients (n = 761) were only registered in the LYFO. The remaining 10,601 patients were included in both registries. Among the 872 patients registered in the DCR but not in the LYFO, 258 had no histologically proven lymphoma and 614 had a biopsy-proven diagnosis of lymphoma (Fig 1). Thus, the total lymphoma population in the surveyed time period reached 11,976, and the coverage of the LYFO is therefore 11,362/11,976 (94.9%). A total of 436 patients were never referred to hematology departments, including 91 patients who were diagnosed post-mortem (autopsy). Therefore, the total number of lymphoma patients seen at Danish hematology departments, but not entered in the LYFO, was 178 (1.6%). Analysis of the 761 patients only registered in the LYFO revealed that the majority of patients had indolent lymphoma (Table 3). The patients were distributed equally across the period; however, 16% (n = 120) were diagnosed in the last year of the study period.
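The arithmetic behind these figures can be retraced with a short Python sketch. All counts are taken from the paragraph above; the variable names are illustrative only.

```python
# Counts reported for the LYFO versus DCR comparison (2000-2011).
lyfo_total = 11_362
dcr_total = 11_473
in_both = 10_601

only_dcr = dcr_total - in_both     # 872 patients registered only in the DCR
only_lyfo = lyfo_total - in_both   # 761 patients registered only in the LYFO
unique_patients = in_both + only_dcr + only_lyfo  # 12,234

# Pathology review of the DCR-only patients.
dcr_only_biopsy_proven = 614
dcr_only_not_proven = 258
assert dcr_only_biopsy_proven + dcr_only_not_proven == only_dcr

# Total lymphoma population and LYFO coverage.
lymphoma_population = lyfo_total + dcr_only_biopsy_proven  # 11,976
coverage = lyfo_total / lymphoma_population

# Biopsy-proven patients who were seen at a hematology department
# but never entered in the LYFO.
never_referred = 436
missed_by_lyfo = dcr_only_biopsy_proven - never_referred  # 178

print(f"coverage: {coverage:.1%}")  # prints "coverage: 94.9%"
```

The 1.6% quoted for the 178 missed patients appears to use the 11,362 LYFO patients as the denominator; relative to the 11,976 lymphoma population the same count is about 1.5%, which is the figure given in the Discussion.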

Data validation analyses
There was a concordance within 6 months in the date of diagnosis in 10,445 of the 10,601 patients between the LYFO and the DCR (98.5%), while the remaining patients were equally distributed with 0.7% of patients having an earlier date of diagnosis according to the LYFO and the same number for the DCR.
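A comparison of this kind can be sketched as follows. The source does not say how "within 6 months" was operationalised; the 183-day tolerance below is an assumption, and the function name and example dates are hypothetical.

```python
from datetime import date

# Assumption: "6 months" is approximated as 183 days; the source gives
# no exact day count.
TOLERANCE_DAYS = 183


def classify_dates(lyfo_date: date, dcr_date: date,
                   tolerance_days: int = TOLERANCE_DAYS) -> str:
    """Classify a pair of diagnosis dates from the two registries."""
    delta = (lyfo_date - dcr_date).days
    if abs(delta) <= tolerance_days:
        return "concordant"
    # A negative difference means the LYFO date precedes the DCR date.
    return "lyfo_earlier" if delta < 0 else "dcr_earlier"


# Made-up example dates.
print(classify_dates(date(2005, 1, 10), date(2005, 3, 1)))
print(classify_dates(date(2005, 1, 10), date(2006, 1, 10)))

# Share of concordant dates reported above.
concordance = 10_445 / 10_601
print(f"{concordance:.1%}")  # prints "98.5%"
```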
Taken together, the DNPR and LYFO included a total of 12,940 patients with a lymphoma diagnosis in the study period, while the DNPR alone noted 12,680 patients and the LYFO 11,362 patients. Twelve percent of the patients (n = 1,578) were only registered in the DNPR, two percent of the patients (n = 260) were only registered in the LYFO, while 11,102 patients occurred in both registries. Of the 1,578 patients in the DNPR who were not registered in the LYFO, 737 were registered with a lymphoma in the DCR and therefore included in the previous description. Furthermore, 323 patients had other malignant haematological diagnoses leaving 518 patients with an erroneous lymphoma diagnosis in the DNPR.
Of the entire cohort registered in the LYFO in the period 2000-2012, a subgroup of 1,760 patients was selected for validation. A total of 54 medical records could not be found. These patients were excluded from the study. Ann-Arbor staging was not correct in 6.6% of the patients, and the proportion was even higher for PTCL (12.1%). The discrepancies were often seen in patients with limited disease, and in particular in patients with limited stage disease involving extranodal sites where disagreements existed between the E designation and stage IV [26,27]. The date of diagnosis was correct in 94.5% of the cases. Patients with discrepancies often had multiple histological samples available, and in some cases, the date of the histology report was used as the date of diagnosis. The lymphoma subtype was correct in 98.5% of the patients, with 94.8% correct in the DLBCL subtype. The most frequent subtype error was follicular lymphoma grade III registered as DLBCL. Chemotherapy and immunotherapy data were registered incorrectly in only a few patients. Radiotherapy was not registered in 2.9% of the cases; among DLBCL patients, however, the figure was 7%. Relapse was registered correctly in 96.6%, and the patients for whom data were missing were often not treated for the relapse (data not shown).

Discussion
The current study presents an overview of registration practices, comparability, completeness and validity of the LYFO data. The results of this evaluation show that the LYFO data have a high degree of both completeness and validity when compared to the DCR and medical records. The completeness of the LYFO when compared to the DCR was 94.9%. Although 100% completeness is the ultimate goal of any register, 436 patients were never referred to a hematology department; of these, 91 patients were diagnosed at autopsy. In general, all lymphoma patients are referred to a hematology department, but there are various reasons for not referring some patients; 175 patients died within 30 days after the diagnosis (Fig 1). Among the remaining 170 patients, 107 patients had aggressive lymphomas, including HIV-related lymphomas and post-transplantation lymphomas, and were treated locally. Therefore, only 178 patients (1.5%) were classified as truly missing in the LYFO, since they were referred to a hematology department. This corresponds to 1-2 patients every year per hematology department.
As a result of the capture-recapture procedure, we found 761 (6.35%) patients in the LYFO who were not registered in the DCR. This underreporting of incident cases was primarily observed in patients with indolent lymphomas. A substantial part of these patients were followed without treatment, which may hide the patients from the capture algorithm of the DCR. The capture-recapture procedure also identified 258 patients in the DCR (2.2%) without a lymphoma diagnosis. In the DNPR, the figure was even higher at 518 (4.1%).
In a former comparison of the DNPR and the DCR, some underestimation and misclassifications were identified regarding hematological malignancies. In that analysis, the DCR was used as a reference standard, and the DNPR had a completeness of 91.5%. However, the pathological review revealed misclassifications in both registries [28]. This emphasizes the importance of a clinical registry. In the present study, 518 patients without lymphoma were registered in the DNPR as diagnosed with lymphoma. A major advantage of a database like the LYFO is the clinical quality control of the data. To ensure that the patients fulfil the inclusion criteria for the registry, a clinician evaluates and validates all patients included and identifies those patients where lymphoma is not present. Most of these patients present with enlarged glands, and biopsies have subsequently disproved the lymphoma diagnosis. The disadvantage of automated capture is the lack of data checking by clinicians; as a result, a proportion of patients without malignant lymphomas are registered in the DCR and the DNPR. The effort needed to find and subsequently remove a patient with an erroneous malignant diagnosis from the DNPR or the DCR is often substantial, leaving the diagnosis code unchanged.
Data from the LYFO are highly valid and have a high degree of variable completeness. Although the PPV of albumin in Hodgkin lymphoma is lower at 87.1%, the other PPVs, which range between 93.4% and 100%, and the completeness, which ranges between 98.1% and 100%, are high. A reason for the lower PPV for albumin could be that more effort is allocated to the clinically most important variables. Recently, the Danish National Acute Leukemia Register has been validated, and similar results were seen. Registration completeness compared to the DNPR was very high with a value of 99.6%. With variable completeness above 90% for 23 of 30 selected variables, and PPVs above 90% for 29 of 30, this register is also highly valid [29]. Furthermore, validation is ongoing for other Danish hematological registries.
The registration of a correct Ann-Arbor stage was found to vary between several of the subgroups. The discrepancy was particularly obvious when extranodal disease was present. Even for experienced physicians, assigning the correct Ann-Arbor stage is a matter of debate; e.g. a patient classified as stage IIBE by one physician can be interpreted as stage IV by another. Therefore, a PPV of 93.4% is not surprising.
The strengths of this study are the extensive review of medical records covering a sample of four subtypes and a 3% random sample. In total, our validation represents about 15% (1,706/11,362) of the patients in the database in the study period. We validated 11 different variables, selected due to their prognostic importance. We have no indication that other required variables in the database should be reported differently.
This study demonstrates the necessity of multiple sources to yield 100% coverage of all lymphoma patients. While frequent linkage can identify missing patients, there will still be a need for optimising the PPV. Transcription mistakes in the data entry process are among the most frequent errors in data registration [30,31]. In the future, data are planned to be captured from electronic patient records. In the present study, clinicians identified patients without lymphoma, and the high validity of the data content was due to correct identification of the data. Direct data transfer will eliminate errors due to incorrect data transcription, but at the possible expense of incorrect data selection; data captured from electronic records will therefore still need to be reviewed by clinicians.
This study also found patients in both the DNPR and the DCR who were registered with an erroneous diagnosis of lymphoma. With automated procedures, this cannot be avoided, which emphasises the need for procedures that remove these patients once they are entered in the DCR/DNPR. Before using data from clinical databases for epidemiological studies, it is important to assure the quality of the data [13,32]. The LYFO has high coverage and contains high quality data with a large amount of detailed information on clinical and para-clinical data, treatment and outcome. Therefore, the LYFO is a valuable data source for epidemiological and clinical lymphoma research in the future.