Comparing record linkage software programs and algorithms using real-world data

Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. 
Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms.

Funding: Author TDK is employed by Bristol-Myers-Squibb (BMS), which as funder, provided support in the form of salaries and travel expenses for the primary research participants and authors [DBH, TG, AFK, MT, SLW]. BMS researcher TDK participated in collaborative research discussions, but did not have any formal role in the study design, selection of datasets to be linked, selection of data linkage algorithms, data analysis (there was no data collection), decision to publish, or preparation of the manuscript. At no time did BMS attempt to alter consensus decisions. The manuscript was reviewed by both RTI International (employer of AFK and, at the time, SLW) and BMS (employer of TDK) for adherence to corporate quality standards. Neither organization requested any changes. The specific roles of the authors are articulated in the 'author contributions' section.
Competing interests: As noted in the funding statement and submission, TDK is employed by BMS, AFK and (at the time; she has subsequently retired) SLW by RTI International, and DBH, TG and MT by Rutgers University. None of these organizations is pursuing patents or marketable products as a result of this research. This employment does not alter our adherence to PLOS ONE policies on sharing data and materials. As noted in material submitted regarding data availability, the analysis file containing the results of the linkage experiments (weights and derived variables) will be made available in adherence to PLOS ONE policies. The inpatient and outpatient files that were linked cannot be made available. They contain only identifiers (names and addresses) and demographic information and are of no analytical value in the context of the results of the experiments. No author holds competing financial or non-financial interests.

Introduction
common surnames are more likely to match by chance (resulting in higher U-probabilities) than unusual surnames. M- and U-probabilities may be specified exogenously, reflecting past experience or expert opinion (e.g., the Fellegi-Sunter approach [9]), or calculated endogenously (e.g., using the expectation-maximization [EM] algorithm [10]).
Numerous record linkage programs exist, which differ with respect to cost and methodologic transparency (open-source as compared with proprietary), operating system/hardware requirements, and scalability. Conceptually, all linkage programs perform string comparison, weight determination, and match determination. Data preprocessing is a key step in record linkage, including purging of duplicate records, harmonization of linkage variables (which is necessary, for instance, if the common values of gender are "F" and "M" in one dataset, but "1" and "2" in the other), and common representation of missing values. Blocking is a common strategy to reduce computational burden, where only pairs of records that agree on one or more blocking variables are compared. If the blocking variable has n values and records are spread roughly evenly across them, both time and memory requirements are reduced by a factor of approximately n.
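As a minimal sketch of the blocking idea, the following Python snippet indexes one dataset by the blocking key and compares only records that share a key value. The record layout, field names, and toy values are hypothetical and are not drawn from the study datasets or from any of the linkage packages.

```python
from collections import defaultdict


def blocked_pairs(left, right, key):
    """Yield (left_record, right_record) pairs that agree on the blocking key."""
    index = defaultdict(list)
    for rec in right:
        index[rec[key]].append(rec)
    for rec in left:
        # .get avoids creating empty entries for keys absent from `right`
        for other in index.get(rec[key], []):
            yield rec, other


# Toy records; field names and values are illustrative only.
inpatient = [
    {"id": "I1", "last": "SMITH", "dob": "1980-01-02"},
    {"id": "I2", "last": "JONES", "dob": "1975-06-30"},
]
outpatient = [
    {"id": "O1", "last": "SMITH", "dob": "1980-01-02"},
    {"id": "O2", "last": "SMYTHE", "dob": "1980-01-02"},
    {"id": "O3", "last": "JONES", "dob": "1990-12-12"},
]

# Without blocking, every inpatient record is compared with every outpatient record.
all_pairs = len(inpatient) * len(outpatient)

# With blocking on DOB, only records sharing a DOB become candidate pairs.
candidates = [(a["id"], b["id"]) for a, b in blocked_pairs(inpatient, outpatient, "dob")]
print(all_pairs, candidates)
```

In this toy example the number of comparisons falls from 6 to 2. The trade-off is that a true match with an error in the blocking variable is never compared at all.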
Most studies on linkage performance use only one software package to link synthetic or real-world databases [11-15]. Although these studies provide valuable information on linkage challenges, accuracy, and biases, they do not account for all the complexities of the linkage process or the variability across different linkage packages and approaches. Little has been published on the comparative behavior and output of software programs that can be used to link healthcare databases. One study used actual identifiers to evaluate probabilistic approaches from two software packages (Link Plus and Link King) without studying how specific variables affected weights and matches [16]. Another study created synthetic datasets to compare the quality and performance of 10 different linkage packages but did not examine the impact of different matching thresholds [17]. Neither study examined the performance of algorithms within and across software packages. Consequently, the present study aimed to compare 17 linkage methods within 4 nonproprietary available linkage programs to determine how the quality of dataset linkage is affected by linkage algorithm, blocking variable selection, methods for string matching and weight determination, and decision rules for pair matching.

Materials and methods
We compared the performance of 4 linkage software packages applied to real patient data from university-affiliated institutions. We focused on variables typically available in real healthcare data (e.g., name, gender, date of birth (DOB)) that contain actual errors but with very low levels of missingness (see also S1 File). The Rutgers University Institutional Review Board deemed this project not to be human subjects research as defined by 45 CFR 46. Nonetheless, we implemented strict security measures to preserve patient privacy and confidentiality in accordance with institutional regulatory and legal requirements. Results were deidentified before being shared with investigators outside of Rutgers University (see Analysis files). No health-related information was used for this study, and no patient identifiers were viewed by investigators or others not employed by Rutgers University.

Datasets
We used data for the three years 2013-2015 contained in noninteroperable EHRs from two neighboring, clinically affiliated but administratively separate institutions. The inpatient dataset (IPD) came from the Robert Wood Johnson University Hospital, a 965-bed urban teaching hospital with approximately 30,000 admissions per year. As received, the IPD included demographic data on all patients admitted overnight to the hospital during the study period. Each hospital admission resulted in a distinct entry; consequently, individuals with repeated hospitalizations had multiple entries. The outpatient dataset (OPD) came from the Rutgers Robert Wood Johnson Medical School, which has a multispecialty outpatient medical practice of over 500 affiliated physicians. The OPD contained information about all patients seen at least once during the study period, with only one record per person, based on a unique outpatient medical record number (MRN), representing the most recent set of demographic data. Both the IPD and OPD included first name, last name, DOB, gender, race, street address, city, state, and ZIP Code. Because only the OPD contained information on ethnicity, we excluded this variable from linkage experiments. Because the datasets were from clinically affiliated institutions, administrators used a proprietary linkage package to assign common MRNs. An inpatient MRN accessible within the OPD was used as a gold standard to evaluate linkage accuracy.
We preprocessed both datasets to harmonize the variable names and values. Preprocessing entailed dropping variables not used in the linkage runs or other analyses (e.g., street address, date of visit), reclassifying race (e.g., Asian, black, white, other, or missing), and extracting year of birth (YOB) from DOB. We converted implausible values (such as ZIP Codes containing letters) to missing values, but we did not standardize names.
After data preprocessing, we proceeded with deduplication. The OPD contained 176,154 records of purportedly unique individuals, making deduplication unnecessary. We deduplicated the original 104,289 IPD records by removing entries that matched exactly on the following variables: MRN, last name, first name, gender, YOB, race, and ZIP Code. Records containing a missing ZIP Code were retained only if no other record matching on all other identifiers had a valid ZIP Code. The final IPD and OPD datasets contained the following variables: last name, first name, gender, DOB, YOB, age, race, ZIP Code, and MRN. We also assigned a unique study identifier to each record.
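The deduplication rule described above can be sketched in Python as follows. This is one plausible reading of the rule, not the authors' actual procedure; the field names, the use of None for a missing ZIP Code, and the toy records are assumptions made for illustration.

```python
def deduplicate(records, id_fields, zip_field="zip"):
    """Drop exact duplicates; drop a missing-ZIP record when a sibling has a valid ZIP."""
    # Group records by every identifier other than ZIP Code.
    groups = {}
    for rec in records:
        key = tuple(rec[f] for f in id_fields)
        groups.setdefault(key, []).append(rec)

    kept = []
    for recs in groups.values():
        has_valid_zip = any(r[zip_field] is not None for r in recs)
        seen_zips = set()
        for r in recs:
            z = r[zip_field]
            if z is None and has_valid_zip:
                continue  # a sibling record carries a valid ZIP Code
            if z in seen_zips:
                continue  # exact duplicate on all compared fields
            seen_zips.add(z)
            kept.append(r)
    return kept


# Hypothetical field names; None stands for a missing ZIP Code.
ID_FIELDS = ["mrn", "last", "first", "gender", "yob", "race"]
base = {"mrn": "1", "last": "DOE", "first": "JANE", "gender": "F", "yob": 1980, "race": "white"}
other = dict(base, mrn="2")
records = [
    dict(base, row="A", zip="08901"),
    dict(base, row="B", zip="08901"),   # exact duplicate of A
    dict(base, row="C", zip=None),      # dropped: A has a valid ZIP Code
    dict(base, row="F", zip="08902"),   # kept: differs from A on ZIP Code
    dict(other, row="D", zip=None),     # kept: no sibling has a valid ZIP Code
    dict(other, row="E", zip=None),     # exact duplicate of D
]
kept_rows = [r["row"] for r in deduplicate(records, ID_FIELDS)]
print(kept_rows)
```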

Software packages
We selected software packages based on multiple criteria: (1) available for a Windows-based computer, (2) nonproprietary, (3) described in prior publications, (4) containing reasonable documentation with some transparency regarding default settings, (5) capable of operating in scripting/batch mode, and (6) capable of saving weights for compared pairs. Based on these criteria, we chose 4 software packages: R (Version 3.4.0, RecordLinkage package), Merge ToolBox (MTB, Version 0.75), Curtin University Probabilistic Linkage Engine (CUPLE, shortened in figures and tables to CU), and Link Plus (LP, Version 2.0) (Table A in S1 File).

Experiment design
Because both DOB and YOB are highly reliable variables in healthcare, we conducted 2 linkage experiments, one using DOB as the blocking variable (presented as primary analyses) and the other using YOB as the blocking variable (presented as secondary analyses). First name, last name, and gender comprised matching variables for all linkage runs. Aside from the software package, we varied linkage runs by the string matching method (exact or inexact); for inexact string matching, we applied the most common method, Jaro-Winkler. We also varied the weight determination method, using 3 probabilistic approaches (Fellegi-Sunter, expectation-maximization, EpiLink [18]), as well as deterministic linkage. Most of the software packages implement more than one weight determination method. For probabilistic linkage approaches other than expectation-maximization, we used default values of M- and U-probabilities, which for each linkage variable were typically 0.95 and the reciprocal of the number of unique values, respectively. Some packages required manual entry of these values.
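To illustrate how such default M- and U-probabilities translate into pair weights, the sketch below uses the textbook Fellegi-Sunter form, in which agreement on a field contributes log2(m/u) and disagreement contributes log2((1-m)/(1-u)). The base-2 logarithm and the counts of unique values are assumptions for illustration; the 4 packages may use different formulas, logarithm bases, or handling of missing values.

```python
import math
from itertools import product

M_DEFAULT = 0.95  # default M-probability described in the text


def field_weights(n_unique, m=M_DEFAULT):
    """Return (agreement weight, disagreement weight) for one linkage variable."""
    u = 1.0 / n_unique  # default U-probability: reciprocal of the number of unique values
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


# Hypothetical counts of unique values per field (not taken from the study data).
n_unique = {"first_name": 5000, "last_name": 20000, "gender": 2}
weights = {field: field_weights(n) for field, n in n_unique.items()}


def pair_weight(pattern):
    """Sum field weights; `pattern` maps field -> True (agree) or False (disagree)."""
    return sum(weights[f][0] if agrees else weights[f][1] for f, agrees in pattern.items())


# With exact matching, 3 fields give 2**3 = 8 agreement patterns.
fields = list(n_unique)
patterns = [dict(zip(fields, combo)) for combo in product([True, False], repeat=len(fields))]
totals = sorted({round(pair_weight(p), 6) for p in patterns}, reverse=True)
print(len(totals), totals[0], totals[-1])
```

Under these assumptions, a field with more unique values has a smaller U-probability and therefore a larger agreement weight, so agreement on last name counts for more than agreement on first name whenever last names are the more diverse field.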

Analysis files
We prepared 2 de-identified analysis files, one for each linkage experiment, with each file containing one row of information for each compared record pair. Analysis files included columns for IPD and OPD record identifiers, gender, age (upper limit 90), and race; a variable indicating whether the record pair matched on inpatient MRN; and 17 sets of weights corresponding to each linkage run. We assembled the analysis files using R (Version 3.4.0) and SAS (Version 9.4).
To compare results across runs, we scaled the 17 sets of weights to range from 0 to 1, corresponding to the lowest and highest weights respectively. The scaling was linear and was done using the following equation:

$$\text{ScaledWeight} = \frac{\text{OriginalWeight} - \min(\text{OriginalWeight})}{\max(\text{OriginalWeight}) - \min(\text{OriginalWeight})}$$

Additionally, we ranked the weights within runs from highest (ranked as 1) to lowest. Declaring matches based on weight rank, such as rank 1 or rank 2, also allowed for comparability across algorithms.
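The scaling and ranking steps can be sketched in Python as below. The ranking here is a dense ranking over distinct weights (all pairs sharing the highest weight receive rank 1, the next distinct weight receives rank 2), which is an assumption about how ties were treated; the raw weights are toy values.

```python
def scale_weights(weights):
    """Linearly rescale weights so the lowest maps to 0 and the highest to 1."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        raise ValueError("cannot scale a run in which all weights are equal")
    return [(w - lo) / (hi - lo) for w in weights]


def rank_weights(weights):
    """Dense rank: the highest distinct weight gets rank 1; tied weights share a rank."""
    distinct = sorted(set(weights), reverse=True)
    rank_of = {w: i + 1 for i, w in enumerate(distinct)}
    return [rank_of[w] for w in weights]


# Toy raw weights from a single hypothetical linkage run.
raw = [12.5, -3.0, 12.5, 4.75, -8.0]
scaled = scale_weights(raw)
ranks = rank_weights(raw)
print(scaled, ranks)
```

Because the scaling is linear within a run, it preserves the ordering of pairs and therefore the ranks.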
Using this analysis file, we investigated the 17 sets of weights and scaled weights from multiple perspectives. We conducted descriptive analyses of the weights, including display of their empirical cumulative distribution functions. We also evaluated relationships among the weights, including their correlation, principal components analysis, and accuracy with respect to the gold standard, inpatient MRN. We also compared the performance of using matching weights as decision rules, including the area under the receiver operating characteristic (ROC) curve (AUC).
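For readers unfamiliar with AUC in this setting, it can be read as the probability that a randomly chosen true match (by the gold standard) receives a higher weight than a randomly chosen non-match, with ties counted as one half. The sketch below computes it with the rank-sum (Mann-Whitney) identity on toy data; it is an illustration of the quantity, not the routine used in the R or SAS analyses.

```python
def auc(scores, labels):
    """AUC via midranks: P(match outscores non-match) + 0.5 * P(tie)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, is_match in pairs if is_match)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one match and one non-match")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


# Toy scaled weights and gold-standard labels (True = same MRN).
toy_scores = [1.0, 1.0, 0.6, 0.6, 0.2, 0.0]
toy_labels = [True, True, True, False, False, False]
toy_auc = auc(toy_scores, toy_labels)
print(toy_auc)
```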

Results
The deduplicated datasets contained 69,523 inpatient records and 176,154 outpatient records, respectively. The total number of possible record pair comparisons, without blocking, was 12,199,192,962 pairs. Blocking on DOB reduced the number of record pair comparisons to 400,490. Datasets were similar based on gender distribution but distinctly different based on age and race (Table 1). Table 2 summarizes the statistics for the weights arising from the 17 linkage runs, displaying the number of unique weights produced, the maximum and minimum weights, the number of pairs that received the highest and second highest weights, and the number of pairs that received the lowest and second lowest weights. As expected, exact string matching approaches generally produced fewer distinct weights (range 4-9) than inexact string matching (range 8-64,273).

Characteristics of the weights
The empirical cumulative distribution functions of scaled weights varied considerably across the 17 linkage runs, confirming that these methods behaved differently (Fig A in S1 File).

Relationships among weights
Although there was substantial agreement among the 9 algorithms that use exact string matching, they did not produce identical rankings (Table 3; Table B in S1 File). All runs with exact string matching assigned the highest weight to the same 30,536 pairs that matched on first name, last name, gender, and DOB. Probabilistic string-matching algorithms besides expectation-maximization (i.e., Fellegi-Sunter and EpiLink) assigned higher weight to pairs that matched on last name and gender than pairs matching on first name and gender (Table 3). Software packages differed subtly in how they handled missing data (here, gender).
All runs using exact methods were highly correlated (Fig 1). Among runs using inexact methods, only those run in CUPLE and R were highly correlated with the exact methods. Runs using inexact string matching in the other two software packages (LP and MTB) were correlated with other runs using the same software but much less so with runs in other packages. The principal components analysis indicated only four predominant dimensions to the 17 sets of weights, whereby the first four principal components explained 98.97% of the variation among the weights (Table C in S1 File).

Comparison with medical record number check
Among 30,536 record pairs matching on DOB, first name, last name, and gender across most runs, 809 pairs (representing 1.16% of IPD, 0.45% of OPD) did not match on MRN (Table 4). Manual review of a random sample of 86 of these 809 pairs suggested that these likely were the same individual with either different MRNs (77%), a missing MRN in OPD (22%), or a misspecified MRN in OPD (1%). Consequently, the gold standard itself had a small error rate of approximately 0.5% to 1%. We also noted a small number of record pairs with very low weights despite matching MRNs, representing either different people with the same MRN or errors in the matching variables (Table 5). Record pairs that agreed only on DOB and MRN but not on first name, last name, or gender occurred only for 1 record pair in 1 software package using inexact string matching. Approximately 400 to 500 record pairs (about 1.2% of pairs with the second lowest weights across multiple runs) matched only on DOB and gender but not on first or last name. Manual review of a random sample (n = 100) of these records suggested that 99% were likely the same individuals who did not match appropriately. In most pairs (80%), this occurred with newborns that had a first name of "Male" or "Female" only in IPD and differed on last names, presumably the mothers' in IPD and fathers' or compound last names in OPD. Other discrepancies occurred because of various issues with names, most often misspellings in first names and inconsistent representation of compound last names.

Comparative performance of the methods
We compared the performance of the packages and algorithms using scaled weights and the gold standard, inpatient MRN, to identify declared matches, false positive matches, and false negative matches as the weight threshold varied. Across a range of scaled weights, the number of declared matches of record pairs varied among different packages and algorithms (Fig B and C in S1 File). When declared matches were the record pairs with the highest weights, most linkage algorithms performed similarly well, with sensitivity > 90%, specificity > 99%, positive predictive value (PPV) > 97%, and negative predictive value (NPV) > 99% (Table D in S1 File). Expansion of declared matches to include those with the second highest weights did not substantively change the test characteristics for most runs (Table E in S1 File). The ROC curves reflected these high levels of accuracy, with AUC greater than 0.99 for most linkage runs and minor differences among them (Fig 2; Fig D and Table F in S1 File).
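The test characteristics reported above follow from a 2-by-2 comparison of declared matches against the gold standard at a given weight threshold. A minimal Python sketch, using toy scaled weights and labels rather than study data, is:

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def match_metrics(scaled_weights, is_true_match, threshold):
    """Declare pairs with weight >= threshold as matches and score them against the gold standard."""
    tp = fp = fn = tn = 0
    for weight, truth in zip(scaled_weights, is_true_match):
        declared = weight >= threshold
        if declared and truth:
            tp += 1
        elif declared:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Toy data: scaled weights for 8 compared pairs and whether each pair shares an MRN.
toy_weights = [1.0, 1.0, 1.0, 0.7, 0.4, 0.1, 0.0, 0.0]
toy_truth = [True, True, False, True, False, False, False, False]

top_only = match_metrics(toy_weights, toy_truth, threshold=1.0)
top_two = match_metrics(toy_weights, toy_truth, threshold=0.7)
print(top_only)
print(top_two)
```

Lowering the threshold in the toy example raises sensitivity and NPV, which mirrors the kind of trade-off examined when declared matches were expanded to the second highest weights.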

Ensemble methods
We also explored whether the 17 linkage runs could be combined into an "ensemble" method that outperformed all runs individually. The motivation came from ensemble methods in machine learning, such as Super Learner [19], in which multiple models of the data are constructed and decisions are made by combining the results from these models. We explored two plausible approaches to ensemble methods based on: (1) averaging scaled weights and (2) "vote counting" using Rank 1 or 2 weights (i.e., pairs assigned the highest or second highest weights).

Average scaled weight
As the name connotes, the average scaled weight is the average of the 17 individual scaled weights:

$$\text{AverageScaledWeight} = \frac{1}{17} \sum_{i=1}^{17} \text{ScaledWeight}_i$$

As measured by AUC, the average scaled weight ensemble method outperformed all individual linkage runs (0.9962 vs. ≤ 0.9948) (Fig E in S1 File).
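A minimal Python sketch of this averaging step is shown below, assuming each run's scaled weights are stored as a list aligned by record pair. Three toy runs stand in for the 17 runs of the study.

```python
def average_scaled_weight(runs):
    """Average scaled weights across runs; `runs` holds one aligned list per linkage run."""
    n_pairs = len(runs[0])
    if any(len(run) != n_pairs for run in runs):
        raise ValueError("all runs must score the same record pairs")
    return [sum(column) / len(runs) for column in zip(*runs)]


# Three toy runs scoring the same 4 record pairs (values already scaled to [0, 1]).
run_a = [1.0, 0.5, 0.25, 0.0]
run_b = [1.0, 0.75, 0.0, 0.25]
run_c = [0.5, 1.0, 0.0, 0.0]

ensemble = average_scaled_weight([run_a, run_b, run_c])
print(ensemble)
```

Averaging only makes sense after scaling, because raw weights from different packages are not on a common scale.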

Rank 1 or 2 voting
An alternative ensemble method was based on the number of linkage runs that assigned the highest or second highest weight to a record pair (see S1 File: Ensemble method: Rank 1 or 2 voting). The AUC for this ensemble method was 0.9807, lower than with average scaled weight and lower than some individual linkage runs (Fig E in S1 File).
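The vote-counting idea can be sketched in Python as follows. The exact procedure is described in the S1 File and is not reproduced here; this sketch assumes dense ranking within each run and counts one vote per run that places a pair at rank 1 or 2. The runs and weights are toy values.

```python
def dense_ranks(weights):
    """Rank distinct weights from highest (rank 1) downward; tied weights share a rank."""
    distinct = sorted(set(weights), reverse=True)
    rank_of = {w: i + 1 for i, w in enumerate(distinct)}
    return [rank_of[w] for w in weights]


def rank_votes(runs, top_ranks=(1, 2)):
    """For each record pair, count the runs whose rank for that pair is in `top_ranks`."""
    votes = [0] * len(runs[0])
    for run in runs:
        for i, rank in enumerate(dense_ranks(run)):
            if rank in top_ranks:
                votes[i] += 1
    return votes


# Three toy runs scoring the same 4 record pairs; raw weights need not share a scale.
run_a = [9.0, 7.0, 7.0, 1.0]
run_b = [0.8, 0.9, 0.1, 0.3]
run_c = [5.0, 5.0, 2.0, 4.0]

votes = rank_votes([run_a, run_b, run_c])
print(votes)
```

Because the vote count takes only a handful of distinct values, it discriminates among pairs more coarsely than an averaged weight does.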

Year of birth experiment
In recognition that DOB is not available in all databases or accurate in all circumstances, we performed additional experiments blocking on YOB (S1 File: Year of birth experiment). All 9 exact string matching linkage runs assigned the highest weight to the same 30,805 pairs, as compared with 30,536 pairs when blocking on DOB (Table G in S1 File). These additional 269 pairs matched on first name, last name, gender, and YOB, but not day of birth, month of birth, or both. Using YOB instead of DOB as the blocking variable, we gained 43 matches that shared a common MRN but also added 226 matches that did not match on MRN (Table H in S1 File). Correspondingly, blocking on YOB led to marginal improvements in sensitivity and NPV at the expense of specificity and PPV (Table I in S1 File). Additionally, blocking on YOB imposed major computational challenges because of the 329-fold increase in the number of compared pairs (131,906,591). Compared to the efficient DOB linkage runs that took less than one minute (Table J in S1 File), some YOB linkage runs took more than one hour (Table K in S1 File). Additionally, we were unable to complete any runs using R with inexact string matching due to computation limitations. Thus, compared to the experiment blocking on DOB, blocking on YOB seemed ineffective, yielding 5 times more additional errors than correct matches while substantially increasing computational time and precluding certain analyses.

Discussion
Using real data from noninteroperable EHRs, we performed a comprehensive assessment of the behavior and usability of nonproprietary available linkage software, evaluating the decision-making capability of specific linkage methods such as type of string comparison and weight determination and output from the linkage runs. Across multiple runs, we found relatively few perceptible differences in matching results, specifically with respect to ranks of the highest weights. Performance among software packages using exact string matching varied much less than that for methods using inexact string matching. From other perspectives, such as declaring matches to be pairs assigned the highest weight, linkage runs with inexact string matching were notably less efficient. As seen in Table 2, the linkage runs with exact string matching identified the same number of highest weight records, whereas some linkage runs with inexact string matching produced either too many matches or, in the case of LP, dramatically too few, possibly because only positive weights could be included in output. An ensemble matching approach had incrementally superior accuracy to individual algorithms. The performance of most software packages and algorithms was similar, although not identical, with respect to matching accuracy as compared with our gold standard. In our linkage runs, exact matching using EM algorithms for weight determination appeared to be slightly less reliable than other exact matching algorithms: EM algorithms prioritized matching on first name over matching on last name, a more diverse and specific matching variable. Why some linkage runs prioritized first name over last name or vice versa is unclear. Compared with exact matching, linkage runs of inexact matching algorithms led to more variability in both the number of discrete weights assigned (8 to 64,273) and the number of record pairs receiving the highest weight (15 to 31,619).
Our linkage runs also revealed more variability in assigning low weights than high weights when using both exact string matching and inexact string matching. This diversity among low-weight record pairs is unlikely to affect declared matches at common matching thresholds; however, it does underscore the differences in weight determination among approaches.
We focused on the weights associated with the record pairs evaluated for each linkage program, evaluating linkage programs and algorithms as decision tools rather than the actual matching decisions. The findings suggest that the selection of weight thresholds for declaring matches can have a substantial impact on both operational and inferential uses of the linked data [20,21]. Choosing a threshold also depends on the study objectives. If false-positive linkages are costly, whether monetarily, scientifically, or in terms of human health, then higher matching thresholds may be preferred at the cost of lower sensitivity. For other questions, such as estimating the prevalence of rare diseases, even a few false negatives may cripple an analysis and the resulting loss of statistical power may be prohibitive [5, 11-15, 22, 23].
A prior study similarly compared the performance and accuracy of different linkage approaches at different matching thresholds [16]. However, that study was limited to only three approaches (2 probabilistic and 1 deterministic) without mention of specific probabilistic matching algorithms or the impact of individual matching variables on weight determination. Another more extensive comparison of the performance of both open-source and proprietary linkage packages used synthetic datasets to develop a method for evaluating data linkage software [17]. After comparing both computational speed and linkage quality of various software packages, that study identified a couple of packages, which were not named, that outperformed others. However, unlike our current findings, the prior work did not directly evaluate the performance of specific algorithms or the impact of specific variables or weight thresholds on decision-making.
Even small numeric differences in weighting could be more important in some settings, such as work with low-quality, incorrect data, or high levels of missingness. Prior work has shown that probabilistic approaches, which are often more time consuming, generally perform better in settings with low-quality data or high levels of missingness [8]. Although the matching variables in our datasets contained few missing values, treatment of missing values varied among the 4 software packages used in the present study. Furthermore, some packages did not allow the user to change default settings regarding missing data; other packages even lacked documentation about the subject. These differences in the handling of missing data may have important implications for the performance and reliability of linkage approaches, a hypothesis that bears further investigation.
Unlike prior work [17], we did not formally assess the computational performance of the packages. For the main experiment blocking on DOB, all ran within 1 minute (Table J in S1 File). For the experiment that blocked on YOB, only MTB and CUPLE ran smoothly for choices of string matching and weight determination (Table K in S1 File). LP ran relatively efficiently, in part because only positive weights were calculated for inexact string matching. R did not run at all for inexact string matching or for exact string matching with EpiLink weight determination, presumably because R is single-threaded and holds all objects in (real or virtual) memory. This experience highlights the importance of suitable blocking variables.
The extent to which our findings can be generalized to other datasets is uncertain. Like all real datasets, the two datasets we used had some data quality problems. As one such indicator, nearly 3% (Table 4) of the highest ranked matches did not share the same MRN, our gold standard. Other demographic variables in the original datasets that were not used as linking variables had high levels of missingness, such as ethnicity and address. Nonetheless, in terms of the linking variables tested, both datasets seemed to be rather good quality, in part perhaps because of human health and financial incentives [24]. Furthermore, evaluating linkage error using a gold standard, even an alloyed gold standard, provides important information on linkage quality [7].
Another factor that may limit the generalizability of the findings is that we chose not to do extensive data cleaning beyond deduplication [21,25,26]. Specifically, we did not perform name standardization (such as removing name suffixes like "Jr," conducting nickname lookups, and dealing with compound surnames and name transpositions) because we felt it was beyond the scope of the current project. Based on our manual review of low-weight record pairs with the same MRN, name standardization would likely have improved results for many of the linkage packages. Nonetheless, such processes may come at a cost, as heavy cleaning may decrease overall linkage quality [25]. However, we found one other group with low matching weights despite agreements in the gold standard: infants whose first name and last name changed between the inpatient and outpatient settings. This specific group underscores the importance of understanding real-world healthcare practices (such as the naming of newborn infants in inpatient settings as "Male" or "Female" with mothers' surnames) when interpreting EHR data [27]. Other analogous circumstances such as name changes with marriage or divorce may also compromise matching accuracy based on names. It is important to note that these findings may not be generalizable to other populations, such as the elderly.
We also did not explore the effect of using additional linking variables such as address because of the challenges in standardizing address text and concerns about the reliability of address variables, which can and do change over time. Identifying and accounting for identifier errors when linking data, especially on ZIP Code, helps to reduce bias caused by linkage errors [28]. Further, we did not examine the effects of varying numerical parameters (such as values of M-probabilities and U-probabilities) or thoroughly investigate how different choices of linking variables would affect linkage quality. Some of the results suggest that, in the same way as weight thresholds matter less than weight rank, some algorithms may be relatively insensitive to the choice, for instance, of M-probabilities and U-probabilities. Conducting weight-focused experiments similar to the current study could help resolve these kinds of questions.

Conclusion
We assessed the behavior and performance of various linkage algorithms using nonproprietary available linkage software and real data from two EHR systems. In settings in which levels of missing data are low and data quality is high, exact string matching approaches vary little across software packages, although approaches using exogenous weight determination, such as Fellegi-Sunter, may outperform those with endogenous methods, such as EM algorithms. With few exceptions, most linkage runs with either exact string matching or inexact string matching yielded similar groups of higher-weighted record pairs with high accuracy. Where possible, blocking on DOB seems preferable to blocking on YOB, given its greater computational efficiency and greater accuracy. Certain ensemble methods appear to improve overall performance of the algorithms.