
A natural language processing approach to support biomedical data harmonization: Leveraging large language models

  • Zexu Li,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliation Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

  • Suraj P. Prabhu,

    Roles Resources, Software, Validation, Writing – review & editing

    Affiliation Department of Bioinformatics, Boston University Faculty of Computing & Data Sciences, Boston, Massachusetts, United States of America

  • Zachary T. Popp,

    Roles Resources, Validation, Writing – review & editing

    Affiliation Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

  • Shubhi S. Jain,

    Roles Resources, Validation, Writing – review & editing

    Affiliation Slone Epidemiology Center, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

  • Vijetha Balakundi,

    Roles Resources, Software, Validation, Writing – review & editing

    Affiliation Department of Medicine/Section of Preventive Medicine and Epidemiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

  • Ting Fang Alvin Ang,

    Roles Funding acquisition, Supervision, Validation, Writing – review & editing

    Affiliations Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Slone Epidemiology Center, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

  • Rhoda Au,

    Roles Funding acquisition, Resources, Writing – review & editing

    Affiliations Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Slone Epidemiology Center, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Framingham Heart Study, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Department of Neurology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Department of Epidemiology, Boston University School of Public Health, Boston, Massachusetts, United States of America, Department of Medicine/Section of Genetics, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

  • Jinying Chen

    Roles Conceptualization, Methodology, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing

    jinychen@bu.edu

    Affiliations Department of Medicine/Section of Preventive Medicine and Epidemiology, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America, Data Science Core, Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts, United States of America

Abstract

Background

Biomedical research requires large, diverse samples to produce unbiased results. Retrospective data harmonization is often used to integrate existing datasets to create these samples, but the process is labor-intensive. Automated methods for matching variables across datasets can accelerate this process, particularly when harmonizing datasets with numerous variables and varied naming conventions. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching.

Methods

This study utilized data from two GERAS cohort studies (European [EU] and Japan [JP]) obtained through the Alzheimer’s Disease (AD) Data Initiative’s AD workbench. We first manually created a dataset by matching 347 EU variables with 1322 candidate JP variables and treated matched variable pairs as positive instances and unmatched pairs as negative instances. We then developed four natural language processing (NLP) methods using state-of-the-art LLMs (E5, MPNet, MiniLM, and BioLORD-2023) to estimate variable similarity based on variable labels and derivation rules. A lexical matching method using fuzzy matching was included as a baseline model. In addition, we developed an ensemble-learning method, using the Random Forest (RF) model, to integrate individual NLP methods. RF was trained and evaluated on 50 trials. Each trial had a random split (4:1) of training and test sets, with the model’s hyperparameters optimized through cross-validation on the training set. For each EU variable, 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF’s probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR).

Results

E5 performed best among individual methods, achieving 0.898 HR-30 and 0.700 MRR. RF performed better than E5 on all metrics over 50 trials (P < 0.001) and achieved an average HR-30 of 0.986 and MRR of 0.744. LLM-derived features contributed most to RF’s performance. One major cause of errors in automatic variable matching was ambiguous variable definitions.

Conclusion

NLP techniques (especially LLMs), combined with ensemble learning, hold great potential in automating variable matching and accelerating biomedical data harmonization.

Introduction

Epidemiology and machine learning studies in the biomedical domain often require large, diverse samples to produce unbiased results and improve the generalizability of findings [1,2]. However, such comprehensive data are rarely found in a single study. Instead, many datasets are generated by individual studies and shared via public platforms or data repositories [3]. Data sharing has become widely adopted in research communities and is now often mandated by funding agencies [4,5]. To effectively utilize the shared datasets, data harmonization is typically employed or required [6–8].

Data harmonization refers to the process of combining data from multiple resources to achieve maximum compatibility [6]. Strategies for data harmonization can be broadly categorized into two types: the stringent approach and the flexible approach [9]. The stringent approach uses the same measurements and data collection protocols across studies. Datasets from these studies share the same variables and can be harmonized through data merging directly [10]. In practice, implementation of this strategy is often challenging and typically confined to specific projects, due to the absence of widely accepted common data elements (i.e., standardized data definitions) and low adoption rates across individual studies [11]. In addition, the availability of numerous historical datasets necessitates effective harmonization methods that are both flexible and robust to maximize their usability [8]. The flexible approach does not require studies to use identical variables; instead, it uses analytical methods to transform matched variables into a common data model [12]. For example, the flexible prospective harmonization method requires researchers to agree on compatible data collection tools and protocols before the studies begin [13], while the flexible retrospective harmonization method can be applied to data from existing similar studies without prior collaboration or agreement among these studies [14–17].

All flexible retrospective harmonization methods share a common early step: identifying variables that can be merged or mapped across studies (which we call variable matching) [8,18,19]. Current data harmonization practices in the biomedical domain tackle this problem in manual and labor-intensive ways, where cohort experts identify relevant variables, assess their compatibility (based on variable definitions and data format), and estimate the level of difficulty of harmonization [8,18,19]. Even the initial step of identifying relevant variables across studies can be time-consuming, especially when datasets include numerous variables and use diverse naming conventions [16,17,19]. Approaches that automatically match variables across studies can reduce manual efforts and accelerate the data harmonization process. Research in this area has been limited, with methods primarily focusing on lexical matching, ontology-based semantic expansion and matching (e.g., using concepts and relations defined in ontologies to expand the variables to be matched) [20–23], and keyword-based matching [24].

This study aimed to develop and evaluate a new variable matching approach that leverages large language models (LLMs) to reduce human effort and accelerate the data harmonization process. Using variables from two large cohort studies on Alzheimer’s Disease (AD) [25,26], we showed that the LLM-based methods substantially outperformed the fuzzy matching method in variable matching. In addition, we demonstrated that a tree-based ensemble classifier, which combined features from individual variable matching methods (including fuzzy matching and LLMs), significantly outperformed the best individual LLM method across all evaluation metrics over 50 trials. To our knowledge, this is the first study to demonstrate the potential of applying LLMs and ensemble learning to support biomedical data harmonization.

Materials and methods

Approach overview

We treated the variable matching task as a ranking problem (Fig 1). For each source variable (i.e., a variable from GERAS-EU), we ranked candidate target variables (i.e., variables from GERAS-JP) based on their similarities to the source variable. The similarity between two variables was estimated by individual natural language processing (NLP) methods and by ensemble learning, using information extracted from variable labels and other related sources (e.g., definitions or derivation rules of the variables, descriptions of the data sheets containing the variables). Fig 1 provides an overview of our approach. This study was conducted as a secondary analysis of deidentified data made available through the AD Data Initiative’s AD Workbench and was exempt from Institutional Review Board (IRB) review.

Fig 1. Overview of the automated variable matching approach.

a Each large language model (LLM) was applied to rank EU-JP variable pairs separately. b The features used by the Random Forest classifier included cosine similarity scores generated by LLMs, edit distance scores generated by fuzzy matching, and other features derived from the data dictionary (detailed in section Class label and machine learning features).

https://doi.org/10.1371/journal.pone.0328262.g001

Study setting

We mapped variables between the GERAS-EU [25] and GERAS-JP [26] studies, which were accessed through the AD Data Initiative’s AD Workbench, a secure, cloud-based data sharing and analytics environment that facilitates open data access and collaboration for AD-related research globally. The GERAS-EU study is an observational study examining the societal costs associated with AD in three European countries: France, Germany, and the United Kingdom [25]. This study recruited 1,532 participants with probable AD between October 1, 2010 and September 30, 2011 and collected data during the baseline visit and follow-up visits every 6 months over 18 to 36 months. A variety of data elements were collected, such as demographics, medical history, cognitive function (e.g., the Mini-Mental State Examination [MMSE] [27] and the Alzheimer’s Disease Assessment Scale–Cognitive Subscale [28]), daily activities (the Activities of Daily Living Inventory [29]), resource utilization (e.g., the Resource Utilization in Dementia survey [30]), and quality of life (e.g., the EuroQol-5 Dimension surveys [31]). The second study, GERAS-JP, utilized a similar study protocol to investigate the societal costs associated with AD in Japan [26,32]. It enrolled 553 participants with probable AD between November 2016 and December 2017. Although GERAS-JP and GERAS-EU collected similar data, they recorded and formatted their data in different ways. For example, as shown in Table 1 (see Table A in S1 Text for more examples), the variable recording the participant’s body mass index was named “BMIB” (labeled “Body Mass Index (BMI) at Baseline”) in GERAS-JP and “VSBLVTR_BMI” (labeled “Vital Sign Result Numeric BMI baseline”) in GERAS-EU.

Table 1. Examples of variable names, labels, and data sheet descriptions.

https://doi.org/10.1371/journal.pone.0328262.t001

Data preprocessing and information used for variable matching

Before matching variables, we automatically transformed data from long format to wide format when a variable was represented in different formats in the two studies. For example, MMSE scores in GERAS-EU were stored in long format, where the variable “MMSEQSNUM” recorded the MMSE question number and the variable “MMSERN” recorded the corresponding scores. In contrast, MMSE scores in GERAS-JP were stored in wide format, with variables such as “MMSE_Q1” and “MMSE_Q2” recording scores for each MMSE question separately. We transformed “MMSEQSNUM” and “MMSERN” in the GERAS-EU data jointly into wide format and assigned variable labels to the transformed variables accordingly. The same process was applied to all questionnaire-related variables originally represented in long format. Both studies stored longitudinally measured variables in the same long format, using a variable “VISITNUM” to index each visit. Because the time intervals between visits were identical (every six months) in both studies, time-dependent matching could be handled by aligning visit numbers. Therefore, we did not perform the long-to-wide transformation for these variables.
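The long-to-wide transformation described above can be sketched with pandas; the subject IDs and scores below are made up for illustration:

```python
import pandas as pd

# Hypothetical GERAS-EU long-format MMSE records: one row per (subject, question).
# Column names follow the variables described above; subject IDs are invented.
long_df = pd.DataFrame({
    "SUBJID":    [1, 1, 2, 2],
    "MMSEQSNUM": [1, 2, 1, 2],   # MMSE question number
    "MMSERN":    [5, 3, 4, 2],   # score for that question
})

# Pivot to wide format: one row per subject, one column per question, then
# rename the columns to the GERAS-JP style ("MMSE_Q1", "MMSE_Q2", ...).
wide_df = long_df.pivot(index="SUBJID", columns="MMSEQSNUM", values="MMSERN")
wide_df.columns = [f"MMSE_Q{q}" for q in wide_df.columns]
wide_df = wide_df.reset_index()

print(wide_df)
```

After this step, each transformed GERAS-EU variable (e.g., `MMSE_Q1`) can be compared directly against its wide-format GERAS-JP counterpart.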

Variable labels and data sheet description.

Our approach matched variables based on information from data dictionaries, such as variable labels and data sheet descriptions. Table 1 provides examples of matched variables from the GERAS-EU and GERAS-JP studies (see Table A in S1 Text for more examples). Variable labels provide concise descriptions of the corresponding variable names. Both studies organized their variables in separate data sheets, with each data sheet representing a specific type of variables (e.g., demographic variables) or variables derived from a specific survey. Each data sheet was accompanied by a short description in the data dictionaries. For example, the GERAS-JP study kept variables associated with the MMSE test in a data sheet with the description “Mini-Mental State Examination (MMSE) per visit”. In this data sheet, the variable “MMSE_Q8” recorded the response to the eighth question in the MMSE [27].

Key word extraction from derivation rules.

The data dictionaries of both GERAS-EU and GERAS-JP studies also contain a field called variable definition or derivation rule (see Table B in S1 Text for examples). This field provides information about how the variables were defined or derived, such as the values of categorical variables and the derivation rules of variables. Derivation rules provide additional information beyond variable labels and data sheet descriptions, which can be valuable for variable matching. However, not all variables have derivation rules. In addition, there is significant variation in the length and content of derivation rules. Therefore, we utilized information from derivation rules exclusively for ensemble learning.

We incorporated information from derivation rules into variable matching as follows. First, if a derivation rule contained more than 20 words, KeyBERT [33] was employed to extract key words (up to 15 words in total) to represent the entire text. KeyBERT utilizes BERT (Bidirectional Encoder Representations from Transformers) [34] embeddings to identify sub-phrases in a text that are most semantically similar to the original text. If the derivation rule contained 20 or fewer words, the entire text was retained. Second, for each variable, we concatenated the variable label with the derivation rule (or the key words extracted from it) to create the input for the individual NLP methods. This treatment ensured that variables without derivation rules had non-empty input for NLP.
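The input-construction logic above can be sketched as follows; `extract_keywords` is a trivial stand-in for KeyBERT (which would require a model download), and the 20-word and 15-word thresholds follow the text:

```python
def extract_keywords(text, max_words=15):
    # Stand-in for KeyBERT: the real pipeline returns the sub-phrases whose
    # BERT embeddings are most similar to the full derivation rule. Here we
    # simply keep the first 15 words so the control flow is runnable.
    return " ".join(text.split()[:max_words])

def build_nlp_input(label, derivation_rule=None):
    """Concatenate a variable label with its (possibly summarized) derivation rule."""
    if not derivation_rule:
        return label  # variables without a rule still get non-empty NLP input
    if len(derivation_rule.split()) <= 20:
        rule_text = derivation_rule            # short rules are kept verbatim
    else:
        rule_text = extract_keywords(derivation_rule)
    return f"{label} {rule_text}"

print(build_nlp_input("Body Mass Index (BMI) at Baseline"))
```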

Natural language processing methods for variable matching

We evaluated two types of NLP methods: LLM-based and fuzzy matching.

Large language model-based methods.

We evaluated four LLMs in variable matching: E5, MPNet, MiniLM, and BioLORD-2023.

E5 (EmbEddings from bidirEctional Encoder rEpresentations) is a text embedding model that enhanced its training process through weakly supervised contrastive pre-training [35]. The key idea of E5’s contrastive pre-training is optimizing text embeddings so that relevant unlabeled text pairs are brought closer together and irrelevant text pairs are pushed further apart within the vector space of embeddings [36]. In E5, relevant text pairs were sourced from diverse platforms, including Reddit (posts and comments), Stack Exchange (questions and upvoted answers), English Wikipedia (entity names and passages), scientific papers (titles and abstracts), and Common Crawl web pages (titles and passages), and were selected via a consistency-based data filtering technique [35]. These relevant text pairs served as positive examples; text from different relevant pairs formed irrelevant text pairs (i.e., negative examples). During pre-training, E5 enhanced existing LLMs (e.g., the BERT models) by leveraging the large amount of newly collected text pairs and contrastive learning. The model was then fine-tuned using three labeled datasets: the NLI (Natural Language Inference), MS-MARCO passage ranking [37], and NQ (Natural Questions) [38] datasets. E5 outperformed existing embedding models on both the BEIR [39] and MTEB [40] benchmarks, which evaluate a variety of text embedding tasks. This study used the E5_large_v2 model, which is built on the large BERT model Bert-large-uncased-whole-word-masking. It was chosen due to its superior performance on benchmark datasets, particularly in semantic textual similarity (STS) tasks [35].

In addition, we evaluated two LLMs developed using the Sentence Transformers (also called SBERT) framework [41]. Built on Siamese and triplet networks and contrastive learning techniques, SBERT aims to generate semantically meaningful embeddings from sentences or short paragraphs while achieving higher computational efficiency than BERT [41]. SBERT incorporates pre-trained LLMs as its low-level building blocks; the two models we evaluated, MPNet and MiniLM, were both incorporated into SBERT [42].

The MPNet (masked and permuted language modeling) model unifies masked language modeling from BERT [34] and permuted language modeling from XLNet [43]. In addition, MPNet utilizes auxiliary position information (i.e., the tokens’ positions in the original, non-permuted input sentence) to improve the consistency of the model’s input representations between pre-training and fine-tuning [43]. The All_MPNet_base model was fine-tuned from the MPNet-base model using contrastive learning on over 1 billion text pairs sourced from diverse datasets (e.g., Stack Exchange, MS-MARCO, NQ) [42]. Among all SBERT models, the All_MPNet_base model has demonstrated superior performance on the Sentence Embedding task (14 datasets) and the Semantic Search task (6 datasets) [42].

The MiniLM model employs a specific knowledge distillation technique, deep self-attention distillation, to compress large transformer-based models [44,45]. Knowledge distillation compresses a large, complex model (the teacher model) into a smaller, simpler model (the student model) with fewer parameters by minimizing the difference between intermediate features of the two models (e.g., self-attention distributions in Transformer models). It has been shown that a student model obtained through this technique can maintain test accuracy similar to that of the teacher model across tasks such as image and speech recognition [46]. MiniLM applies distillation to the self-attention module of the final transformer layer of the BERT base model, resulting in a 12-layer student model [45,47]. This reduction in parameters enhances fine-tuning efficiency. In this study, we used the All_MiniLM_L12 model [41,42]. Initialized from the pre-trained MiniLM (microsoft/MiniLM-L12-H384-uncased) model, All_MiniLM_L12 was fine-tuned with a contrastive objective on a dataset of 1 billion text pairs, including NQ and SQuAD2.0 [48]. This model was selected because its performance on sentence embedding and semantic search tasks is comparable to that of the All_MPNet_base model [42].

We also evaluated BioLORD-2023, an LLM that leveraged the Unified Medical Language System (UMLS) knowledge graph to improve the generation of embeddings for text containing medical concepts [49]. The BioLORD-2023 model is based on the MPNet model (as described previously) [41] and leverages the LORD (Learning Ontological Representations from Definitions) strategy [50,51] to further refine model training to enhance semantic representations of biomedical text. The training process of BioLORD-2023 involved three phases: contrastive learning, self-distillation, and weight-averaging. The contrastive learning phase implemented the LORD strategy [51]. Specifically, the model was trained on paired medical concept names and definitions to learn embeddings, ensuring that each concept’s name was close to its definition(s) while remaining distant from definitions of other concepts in the embedding space. The self-distillation phase aimed to mitigate performance degradation in measuring general-purpose semantic similarities, which arose due to intensive contrastive learning on medical concept names and definitions—a phenomenon observed in the previous BioLORD-2022 model [51]. In this phase, the knowledge acquired by the contrastive model (i.e., the model derived from the contrastive learning phase) was distilled into the base model in a supervised manner. Specifically, the concept embeddings and definition embeddings learned by the contrastive model were first averaged and then reduced to 64-dimensional vectors through principal component analysis. These reduced embeddings served as the learning objective for the base model, which was trained to predict them through a randomly initialized linear projection layer. Due to the random initialization of this projection layer, the training process produced multiple slightly different fine-tuned self-distillation models. 
To enhance robustness, the weight-averaging phase applied a strategy called “model soup” [52] to average the parameters of the fine-tuned self-distillation models to derive a single, ensembled model. BioLORD-2023 leveraged two biomedical knowledge graphs (ontologies)—the UMLS [53] and the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) [54]—to enhance model training. Biomedical concepts and their definitions from the broad-coverage UMLS were used for contrastive learning. The training data were further augmented by: (1) extending concept definitions using template-based descriptions sampled from UMLS relations (e.g., is-a, used to treat, is the synonym of) and (2) incorporating 40,000 definitions from the Automatic Glossary of Clinical Terminology (AGCT) [55], which were automatically generated using the GPT-3.5 model and the SNOMED-CT ontology. BioLORD-2023 outperformed two BERT-based models (which incorporated knowledge from UMLS or biomedical literature via domain-specific pre-training or fine-tuning) and BioLORD-2022 (which incorporated knowledge from UMLS via contrastive learning) on both general-domain and biomedical-domain STS tasks [49].
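All four embedding models score a variable pair in the same way once embeddings are available: by cosine similarity between the two vectors. A minimal sketch, with toy four-dimensional vectors standing in for real model outputs (obtaining real vectors would require downloading one of the models):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings"; real E5/MPNet/MiniLM/BioLORD vectors have hundreds of dimensions.
eu_label_emb = [0.9, 0.1, 0.3, 0.2]   # e.g., "Vital Sign Result Numeric BMI baseline"
jp_label_emb = [0.8, 0.2, 0.4, 0.1]   # e.g., "Body Mass Index (BMI) at Baseline"

score = cosine_similarity(eu_label_emb, jp_label_emb)
```

For each EU variable, scoring every JP candidate this way and sorting by `score` yields the ranked candidate list used in the evaluation.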

Fuzzy matching.

Fuzzy matching methods utilize dynamic programming and edit distance [56] (e.g., Levenshtein distance) to assess lexical similarities between text strings. Edit distance measures the number of operations needed to transform one text string into another [56]. We computed the edit distance between the label of a GERAS-EU variable and the label of each GERAS-JP variable; variables with a smaller edit distance were considered more similar. We implemented the fuzzy matching method using the Python package RapidFuzz [57], which incorporates a variety of edit distance scoring functions originally developed in the Fuzzywuzzy package [58]. Prior to applying fuzzy matching, we preprocessed the variable labels by removing punctuation and stop words, converting all letters to lowercase, and stemming the words. We compared the performance of various fuzzy matching scoring functions implemented in the RapidFuzz package in a preliminary experiment and selected the top-performing function, the token-set-ratio function, for this study. The token-set-ratio function is an extension of the token-sort-ratio function. The token-sort-ratio function tokenizes the preprocessed strings, sorts the tokens alphabetically, and computes the Levenshtein distance between the sorted strings. The token-set-ratio function additionally eliminates duplicate tokens within each sorted string before comparison.
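A simplified pure-Python sketch of the token-sort-ratio and token-set-ratio idea (the study used RapidFuzz’s optimized implementations, whose normalization and set handling differ in the details):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits transforming a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a, b):
    """Normalized similarity in [0, 1] derived from edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def token_sort_ratio(a, b):
    """Sort tokens alphabetically before comparing, so word order is ignored."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def token_set_ratio(a, b):
    """Like token_sort_ratio, but duplicate tokens are removed first."""
    return ratio(" ".join(sorted(set(a.split()))), " ".join(sorted(set(b.split()))))
```

For example, `token_set_ratio("bmi baseline", "baseline bmi bmi")` returns 1.0 because word order and the repeated token are ignored.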

Ensemble learning for variable matching

To further improve variable matching performance, we employed the Random Forest classification algorithm to integrate the outputs from both LLM-based methods and the fuzzy matching method.

Random forest classifier.

The Random Forest classifier is a supervised machine learning model that utilizes an ensemble of decision tree classifiers to generate predictions. The classifier takes features associated with a pair of EU-JP variables as its input and outputs a class label (1: matched variables, 0: unmatched variables) and the associated probabilities. Training a Random Forest classifier involves creating an ensemble of decision trees that classify the input from the training data. Each tree in the random forest is built on a bootstrap sample of instances randomly drawn from the complete training set, as well as a subset of features randomly selected from the entire feature set [59]. Introducing randomness enhances model robustness and helps mitigate overfitting. The construction of a decision tree involves splitting nodes iteratively from top to bottom. Each node is split based on a specific machine learning feature, with the feature and the corresponding splitting rule determined by criteria such as Gini impurity and information gain. For continuous features, the split rule may check whether the feature value is above or below a threshold; for categorical features, it may check whether the feature value equals a specific category. The goal is to achieve the greatest reduction in impurity or the largest increase in information gain with each split. The node-splitting process continues until certain termination criteria are met, such as reaching the maximum tree depth, the minimum number of samples required for a split, or the minimum decrease in impurity. When a trained Random Forest classifier is applied to a new data instance, each decision tree that makes up the classifier is applied to the data instance separately. At each node of a decision tree, the corresponding split rule directs the data instance to a specific branch (i.e., a specific child node) under that node.
After traversing several nodes in the tree, the data instance will reach a leaf node and receive its classification label (which is the label of the majority training instances that reach the same leaf node during model training). A final classification label is determined based on the aggregated voting results from all decision trees. In addition, the classifier will output a probability indicating how likely the data instance (in this case, a pair of variables) is positive, i.e., represents a match. These probabilities were then used to rank the candidate JP variables for each EU variable. Fig 2 provides an overview of the training and test procedures of the Random Forest model.

Fig 2. Training and evaluation of the Random Forest (RF) classifier in a single trial a.

a The Random Forest classifier was trained and evaluated in 50 trials, with each trial having a different random split of the training and test datasets. Each test set had slightly varying ratios of positive (i.e., matched EU-JP variable pairs) to negative (i.e., unmatched EU-JP variable pairs) cases because some EU variables were manually aligned to multiple JP variables (see details in sub-sections Training and test datasets and Model comparison).

https://doi.org/10.1371/journal.pone.0328262.g002
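The train-then-rank step can be sketched with scikit-learn’s RandomForestClassifier on synthetic data; the feature values and the toy labeling rule below are invented, with 21 columns standing in for the study’s features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features of EU-JP variable pairs.
n_train, n_features = 400, 21
X_train = rng.random((n_train, n_features))
# Toy labeling rule (invented): pairs whose mean similarity is high "match".
y_train = (X_train[:, :15].mean(axis=1) > 0.55).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Rank 5 candidate JP variables for one EU variable by P(match).
X_candidates = rng.random((5, n_features))
p_match = clf.predict_proba(X_candidates)[:, 1]   # column 1 = class label 1
ranking = np.argsort(-p_match)                    # candidate indices, best first
```

In the study, the same probability scores were computed for all 1322 JP candidates per EU variable and used as the ranking key.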

Training and test datasets.

The datasets utilized in this study were created in the following way.

First, we manually matched a subset of variables from the GERAS-EU and GERAS-JP studies, which served as the ground-truth variable pairs. Multiple steps were taken to ensure accurate matching. In the first step, EU variables were assigned to five co-authors, who had training and work experience in epidemiology, statistics, or bioinformatics, for preliminary matching with corresponding JP variables. The variables were matched based on both variable definitions (i.e., derivation rules) and variable values. Challenging cases were discussed within the research team, which included a senior co-author with expertise in epidemiology and health informatics. We categorized the matching results into three groups: no match, single match, and multiple matches. For example, the variable “LVLOCLNM” (which records the living area, urban or rural, of the participant) in the GERAS-EU study was matched with multiple variables such as “C_LIVLOCCD” and “LIVLOC” in the GERAS-JP study, because it was included in multiple data sheets under different variable names in GERAS-JP. In the second step, the five co-authors validated and corrected matching results from the first step, with each assigned a different subset of variables. The validation results and corrections (along with a justification for any correction) were documented. In the last step, one co-author reviewed all corrections to ensure their validity. Through this three-step procedure, 438 pairs of EU-JP variables were manually matched, covering 347 unique EU variables (68 of which had multiple true alignments).

Next, we created the training and test datasets in two steps (Fig 2). First, we treated the 347 unique EU variables as source variables to be aligned with target variable(s) in the GERAS-JP study and randomly divided these variables into training and test sets (4:1). Each EU source variable had 1322 candidate JP variables to match, which constituted 1322 variable pairs. These variable pairs contained only one or a few positive instances (matched EU-JP pairs) but many negative instances (unmatched EU-JP pairs). Second, we down-sampled the negative instances to mitigate the adverse effects of data imbalance on model training. Specifically, for the EU source variable in each positive instance, we randomly selected 200 JP variables that did not match this EU variable to form 200 EU-JP variable pairs as negative training instances. In the test set, however, we included all the negative instances. This resulted in about 351 positive instances (associated with 277 unique EU variables) and 70,200 negative instances in the training set, and 87 positive instances (associated with 70 unique EU variables) and 92,453 negative instances in the test set, in each trial of the machine learning experiment (detailed in the section Experimental settings).
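The negative down-sampling step can be sketched as follows; the variable names and the toy gold mapping are hypothetical (in the study, positives came from the 438 manually matched pairs):

```python
import random

random.seed(42)

# Hypothetical identifiers standing in for real variable names.
eu_vars = [f"EU_{i}" for i in range(10)]
jp_vars = [f"JP_{j}" for j in range(1322)]
# Toy ground truth: EU_i matches JP_i (real matches were curated manually).
gold = {f"EU_{i}": {f"JP_{i}"} for i in range(10)}

train_pairs = []
for eu in eu_vars:
    # Positive instances: all manually matched JP variables for this EU variable.
    for jp in gold[eu]:
        train_pairs.append((eu, jp, 1))
    # Negative instances: 200 randomly sampled non-matching JP variables.
    non_matches = [jp for jp in jp_vars if jp not in gold[eu]]
    for jp in random.sample(non_matches, 200):
        train_pairs.append((eu, jp, 0))

n_pos = sum(label for _, _, label in train_pairs)
print(n_pos, len(train_pairs) - n_pos)
```

The test set, by contrast, keeps every negative pair so that ranking is evaluated over the full candidate list.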

Class label and machine learning features.

The class label is binary-valued, with 1 indicating manually matched EU-JP variable pairs and 0 indicating unmatched EU-JP variable pairs. Features utilized by the Random Forest model came from two sources: similarity scores generated by five individual NLP methods (E5, MPNet, MiniLM, BioLORD-2023, and fuzzy matching) and other information extracted from data dictionaries. Each NLP method generated three similarity scores for an EU-JP variable pair by using (1) data labels; (2) data sheet descriptions; and (3) data labels and key words extracted from derivation rules as the respective input. For example, the EU variable “DIAGDT” (representing “Disease diagnosis date”) and the JP variable “ADDIADT” received three similarity scores from the E5 model calculated based on variable labels (0.98), data sheet descriptions (0.76), and data labels plus key words from derivation rules (0.90), respectively. Other features include, for both the EU and JP variables, the length of the variable label, the length of the derivation rule, and whether the derivation rule is absent. For example, for the EU variable “DIAGDT”, we had three features: 3 for the length of the variable label measured in words, 30 for the length of the derivation rule in words, and “False” for the absence of the derivation rule (see Table B in S1 Text). In total, 15 similarity scores (three generated by each NLP method) and 6 additional features (3 for each variable in the variable pair) were used as machine learning features.
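The resulting 21-dimensional feature vector for one variable pair could be assembled as follows; this is a minimal sketch, and the dictionary layout, method names, and input-type keys are assumptions for illustration:

```python
def make_features(sim_scores, eu_meta, jp_meta):
    """Assemble the 21 machine learning features for one EU-JP variable pair.

    sim_scores: dict mapping (method, input_type) -> similarity score,
      for 5 methods x 3 input types = 15 scores.
    eu_meta / jp_meta: dicts holding label length, derivation-rule length,
      and a rule-absence indicator (3 features per variable).
    """
    methods = ["E5", "MPNet", "MiniLM", "BioLORD", "fuzzy"]
    inputs = ["label", "sheet", "label_key"]  # labels / sheet descriptions / labels+keywords
    features = [sim_scores[(m, i)] for m in methods for i in inputs]  # 15 scores
    for meta in (eu_meta, jp_meta):  # 3 dictionary-derived features per variable
        features += [meta["label_len"], meta["rule_len"], meta["rule_absent"]]
    return features
```

The fixed ordering (methods × input types, then EU metadata, then JP metadata) keeps the feature columns consistent across all pairs fed to the classifier.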

Model development.

In this study, we utilized the Random Forest classifier from the Scikit-learn package [60]. We developed the model using the training set and tuned the following hyperparameters through 5-fold cross validation and grid search: (1) the number of trees in the forest; (2) the maximum depth of each tree; (3) the criterion used to assess the quality of a tree node split; (4) the minimum number of samples required for each split; and (5) the number of features considered at each split. Other hyperparameters were set to their default values. We used the hit ratio (HR) at top 30 and mean reciprocal rank (MRR; see section Evaluation metrics) to select the best hyperparameters.
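A hedged sketch of this tuning setup with Scikit-learn's `GridSearchCV` is shown below on toy data. Note that the study selected hyperparameters using HR-30 and MRR, which would require custom scoring functions; the built-in `roc_auc` scorer and the grid values here are placeholders, not the study's actual settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the 21-dimensional feature matrix.
rng = np.random.default_rng(0)
X = rng.random((200, 21))
y = (X[:, 0] > 0.5).astype(int)

param_grid = {
    "n_estimators": [25, 50],            # (1) number of trees in the forest
    "max_depth": [5, None],              # (2) maximum depth of each tree
    "criterion": ["gini", "entropy"],    # (3) split-quality criterion
    "min_samples_split": [2, 5],         # (4) minimum samples per split
    "max_features": ["sqrt", "log2"],    # (5) features considered per split
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross validation, as in the study
    scoring="roc_auc",   # placeholder; the study used HR-30 and MRR
)
search.fit(X, y)
```

After fitting, `search.best_params_` holds the selected hyperparameter combination and `search.best_estimator_` the refitted model.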

Evaluation metrics

For each EU variable, each matching method generates a ranked list of all candidate JP variables. A high-performance method is expected to place the manually matched JP variable(s) at the top of this list.

We evaluated the performance of the matching methods by two metrics: HR and MRR. Both HR and MRR are commonly used to evaluate recommendation or ranking algorithms [61,62]. As shown in equation 1, HR is defined as the total number of hits appearing in the top-n ranked items ($Hits@n$) divided by the total number of search queries ($|Q|$).

$$HR@n = \frac{Hits@n}{|Q|} \tag{1}$$

A “hit” denotes that the user-selected item is among the top-n ranked items. In our case, a “hit” means that, for an EU variable, the manually matched JP variable is ranked within the top-n list of JP variables identified by the matching method. $Hits@n$ denotes the total number of hits in the top-n lists for all EU variables in a dataset. $|Q|$ denotes the total number of EU variables in the dataset.

For each user query v, the reciprocal rank RR(v) is the inverse of the rank of the first relevant item (equation 2). The MRR is the averaged reciprocal rank across all queries (equation 3).

$$RR(v) = \frac{1}{rank(v)} \tag{2}$$

$$MRR = \frac{1}{|Q|} \sum_{v \in Q} RR(v) \tag{3}$$

In our case, $rank(v)$ in equation 2 denotes the highest (smallest) rank among the JP variable(s) that were manually matched to an EU variable v. MRR is the averaged reciprocal rank across all EU variables in a dataset.

Ties in ranking were resolved by using the median rank when calculating HR and MRR. For example, if three JP variables have identical similarity scores when matched to an EU variable and occupy positions 4–6 in the ranked list, they will all be assigned a rank of 5.
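The two metrics, including the median-rank treatment of ties described above, can be implemented compactly; the sketch below is illustrative, not the study's code, and assumes each query carries a list of candidate scores plus the indices of its true matches:

```python
def ranks_with_ties(scores):
    """Rank candidates by descending score; tied scores share their median
    (average) rank, e.g., ties at positions 4-6 all receive rank 5."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2  # median of the tied positions
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def hit_ratio(queries, n=30):
    """queries: list of (scores, true_indices). HR = hits in top-n / #queries."""
    hits = sum(
        any(ranks_with_ties(scores)[i] <= n for i in true_idx)
        for scores, true_idx in queries
    )
    return hits / len(queries)

def mean_reciprocal_rank(queries):
    """MRR = mean over queries of 1 / (best rank among the true matches)."""
    rr = [
        1.0 / min(ranks_with_ties(scores)[i] for i in true_idx)
        for scores, true_idx in queries
    ]
    return sum(rr) / len(rr)
```

For a query where the true match is tied with two others at positions 2-4, all three receive rank 3, so it counts as a hit for n >= 3 and contributes 1/3 to MRR.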

Experimental settings

Model comparison.

We first compared the performance of the five individual NLP methods on the whole dataset. We then compared the performance of the Random Forest classifier and the best NLP method on the test sets from 50 trials. The 50 trials were conducted by randomly splitting the data into training and test sets 50 times (see section Training and test datasets). We used paired t-tests to compare the performance of the two models across the 50 trials, with the null hypothesis that there is no difference in performance metrics between the two methods. Five metrics were used to evaluate model performance: the top 30, 20, 10, and 5 HR (HR-30, HR-20, HR-10, HR-5) and MRR.
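As an illustration of this paired-test procedure, the snippet below applies `scipy.stats.ttest_rel` to synthetic per-trial MRR values whose means and standard deviations merely mimic the reported summary statistics; the numbers are not the study's actual trial results:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
# Hypothetical per-trial MRR values over 50 train/test splits.
mrr_rf = 0.744 + rng.normal(0, 0.036, 50)  # Random Forest (synthetic)
mrr_e5 = 0.700 + rng.normal(0, 0.036, 50)  # E5 (synthetic)

# Paired t-test: the same 50 test sets are scored by both methods,
# so the per-trial differences are what carry the signal.
t_stat, p_value = ttest_rel(mrr_rf, mrr_e5)
```

The same call would be repeated for each of the five metrics (HR-30, HR-20, HR-10, HR-5, MRR).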

Feature importance analysis.

To understand which features contributed most to the performance of the Random Forest model, we estimated feature importance using data from the 50 trials and the permutation importance method [63]. For each feature in a trained model, permutation importance estimates its contribution by randomly shuffling the feature’s values in a held-out dataset and measuring the subsequent decline in model performance on this dataset. In this study, we estimated feature importance by averaging the model’s performance decline over the test sets from the 50 trials, measured by HR-5, HR-10, and MRR. We selected these three metrics because they were more sensitive to permutation on individual features in the Random Forest classifier compared to HR-20 and HR-30.
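The permutation-importance idea can be sketched as follows. Scikit-learn also provides `sklearn.inspection.permutation_importance`; this manual version is shown only to make the mechanics explicit, with an assumed `predict_proba`-based metric and toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def permutation_importance_manual(model, X_test, y_test, metric, n_repeats=5, seed=0):
    """Importance of a feature = mean drop in a held-out metric after
    randomly shuffling that feature's column in the test set."""
    rng = np.random.default_rng(seed)
    baseline = metric(y_test, model.predict_proba(X_test)[:, 1])
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])  # break the feature-target association
            drops.append(baseline - metric(y_test, model.predict_proba(X_perm)[:, 1]))
        importances[j] = np.mean(drops)
    return importances

# Toy demonstration: only feature 0 carries signal, so shuffling it
# should produce by far the largest performance drop.
rng = np.random.default_rng(1)
X = rng.random((300, 3))
y = (X[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(random_state=0).fit(X[:200], y[:200])
importances = permutation_importance_manual(model, X[200:], y[200:], roc_auc_score)
```

In the study, the metric would instead be HR-5, HR-10, or MRR computed on the held-out test set, averaged over the 50 trials.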

We compared feature importance using two methods. In the first method, we ranked features using their mean importance scores over 50 trials. In the second method, we first ranked the features within each trial based on their importance scores and then compared their average ranks across the 50 trials.

In addition, we conducted feature ablation analysis to assess whether removing a specific type (i.e., subset) of features from the Random Forest model would affect the model performance. We categorized the features into three types: LLM-derived features, fuzzy matching-derived features, and other features. We measured the average decline in model performance after removing each type of feature over 50 trials.
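A hedged sketch of the ablation procedure on toy data follows; the column indices assigned to each feature group are assumptions for illustration, not the study's actual feature layout, and `roc_auc` stands in for the study's HR and MRR metrics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((300, 21))
# Toy target: driven mostly by column 0, weakly by column 15.
y = (X[:, 0] + 0.2 * X[:, 15] > 0.6).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

feature_groups = {  # assumed column layout, for illustration only
    "llm": range(0, 12),     # 4 LLM methods x 3 input types
    "fuzzy": range(12, 15),  # fuzzy matching x 3 input types
    "other": range(15, 21),  # label/rule lengths and rule-absence flags
}

def score_without(group):
    """Retrain without one feature group and score on the held-out set."""
    keep = [j for j in range(X.shape[1]) if j not in set(group)]
    model = RandomForestClassifier(random_state=0).fit(X_tr[:, keep], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[:, keep])[:, 1])

full = score_without([])  # no ablation: all 21 features
declines = {name: full - score_without(g) for name, g in feature_groups.items()}
```

In the study this comparison was averaged over the 50 trials, with paired tests on the per-trial declines.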

Error analysis

To understand the limitations of our variable matching approach, we manually reviewed the ground-truth EU-JP variable pairs that were ranked low (below the top 30) by the best NLP method and the Random Forest model and summarized the error patterns.

Results

Descriptive statistics

The dataset used in this study included 347 GERAS-EU variables and 1322 GERAS-JP variables. As shown in Table 2, compared with the JP variables, the EU variables have longer labels (11.3 vs. 8.5 words), shorter data sheet descriptions (3.2 vs. 5.5 words), and shorter derivation rules (14.9 vs. 26.4 words).

Table 2. Characteristics of variable labels, data sheet descriptions, and derivation rules of EU and JP variables.

https://doi.org/10.1371/journal.pone.0328262.t002

Table A in S2 Text provides descriptive statistics for the features used in trial 1 of the machine learning experiment. Other trials showed similar patterns in feature value distributions. The NLP-derived features (i.e., similarity scores estimated by the individual NLP methods) showed similar distributions in the training and test sets. A higher score or feature value indicates a greater similarity between the EU and JP variables.

Other features used in the Random Forest model are summarized in Table B in S2 Text. The GERAS-EU study exhibited shorter keyword lengths in non-empty derivation rules (median: 9 words, mean: 8.8 words) than the GERAS-JP study (median: 15 words, mean: 11.4 words). The EU variables had longer labels (median: 9 words, mean: 11.3 words) than the JP variables (median: 7 words, mean: 8.5 words). About 72.6% of EU variables lacked derivation rules, compared with only 11.5% of JP variables.

Comparison of individual NLP methods

As shown in Table 3, the E5 model achieved the highest performance across all evaluation metrics, including an HR-30 of 0.898 and an MRR of 0.700. The BioLORD-2023 model achieved the second-best performance across all metrics, followed by the MiniLM and MPNet models. The fuzzy matching method exhibited the lowest performance in most metrics but outperformed the MPNet model in HR-5 and MRR.

Table 3. Performance of individual NLP methods in variable matchinga, b.

https://doi.org/10.1371/journal.pone.0328262.t003

Random Forest model versus E5

As shown in Table 4, the Random Forest model, optimized based on the HR-30 metric, outperformed the E5 model on all the evaluation metrics and achieved an HR-30 of 0.986 (std: 0.012) and an MRR of 0.744 (std: 0.036). Paired t-tests on the 50 trials indicated significant performance gains (P < 0.001 for all metrics). A separate Random Forest model, optimized based on MRR, exhibited similar performance across all metrics (see Table C in S2 Text).

Table 4. Random Forest and E5 model performance comparison.

https://doi.org/10.1371/journal.pone.0328262.t004

Feature analysis

As shown in Table 5, the features derived by applying the E5 model and the BioLORD-2023 model on variable labels (i.e., E5_on_label and BioLORD_on_label) contributed most to the performance of the Random Forest model, as indicated by all three evaluation metrics. Other features differed in their contributions measured by HR and MRR. Their contributions to HR-5 were all substantially lower compared to the two most important features. Features derived from applying MiniLM to variable labels (i.e., MiniLM_on_label), E5 to both variable labels and keywords extracted from the derivation rules (i.e., E5_on_label_key), and E5 to data sheet descriptions (i.e., E5_on_sheet) were among the top-ranked features for HR-10, with an average importance score of 0.010 or higher. In contrast, features derived from applying BioLORD-2023 to both variable labels and keywords extracted from the derivation rules (i.e., BioLORD_on_label_key), as well as applying MPNet and MiniLM to data sheet descriptions (i.e., MPNet_on_sheet and MiniLM_on_sheet) were important to MRR.

Table 5. Feature importance for the Random Forest modela.

https://doi.org/10.1371/journal.pone.0328262.t005

Feature ablation analysis (Table 6) showed that removing LLM-derived features or other types of features decreased the model’s performance significantly (P < 0.001 for most cases) while removing fuzzy matching features did not affect model performance except for HR-10 (P = 0.001).

Discussion

Variable matching is an early key step in flexible data harmonization. We evaluated the performance of NLP methods, including LLMs and fuzzy matching, on this task. In addition, we developed and evaluated a Random Forest-based ensemble learning method that leveraged the strengths of individual NLP methods. We found that the E5 LLM outperformed other individual NLP methods on variable matching. The Random Forest model showed significantly better performance than E5 on all metrics and achieved an HR-30 of 0.986 and an MRR of 0.744. These results suggest that NLP techniques (including LLMs), combined with ensemble learning, have great potential in automating variable matching, thus accelerating the data harmonization process. Below, we discuss our main findings in greater detail.

Data harmonization has been discussed within the context of utilizing data from multiple sources, which aims to combine datasets for effective use by resolving data heterogeneity at three levels: syntax (i.e., data format), structure (i.e., conceptual schema), and semantics (i.e., how the variables were measured, derived, and encoded) [6]. The advantages of data harmonization include increasing the statistical power of a study and enabling big data analytics [6,12,64], verifying findings across studies [1], and evaluating and reducing the bias of analyses using individual data sources [2,7]. Two strategies, merging and mapping, apply to the data harmonization process [6]. Merging involves the creation of a single global taxonomy or ontology for multiple datasets and then linking or mapping variables to the taxonomy [14,65]. Mapping, on the other hand, involves the creation of a set of rules to match variables across studies [6]. For example, Adhikari et al. harmonized data from two pregnancy cohort studies by using the mapping approach: they created a set of rules, considering the construct measured, question asked, response options, measurement scale, time and frequency of measurement, and coding schema of variables, to classify variables as completely matching, partially matching (e.g., two variables measured the same construct but were measured or encoded in different ways), or not matching [12]. The mapping was conducted manually by reviewing documentation, consulting the research team of the original data source, and exploring variables in the dataset. Their team harmonized 20 variables from both cohorts and pointed out that this is a repetitive and iterative process [12]. In both mapping and merging approaches, matching variables between studies and the common data ontology, or across studies, can be time-consuming but remains a crucial step in data harmonization.
Automated methods that facilitate variable matching, including the methods developed in this study, can potentially improve efficiency and, therefore, enable large-scale harmonization (e.g., harmonizing large datasets or many datasets).

In this study, we developed and tested new automated methods, leveraging LLMs and ensemble learning, to match variables across studies based on information provided in data dictionaries. Our approach relies on text comparison (e.g., comparing variable labels, data sheet descriptions, and derivation rules) and focuses on construct-level variable matching. Specifically, we considered both lexical similarity and semantic similarity of the text describing the variables [66]. The assessment of lexical similarity was motivated by the observation that some matched variables shared common technical terms in their definitions, such as variable labels or derivation rules. Fuzzy matching (approximate string matching [67]) focuses on measuring lexical similarity. The assessment of semantic similarity was motivated by the observation that matched variables could use different but semantically related words in their definitions. Text embedding is a widely used NLP technique for distributed text representation which can be used to measure semantic similarities between text [68]. Recent studies have shown the success of using sentence embeddings generated from BERT-based models to measure text semantic similarity [36,41]. LLMs, like BERT, KeyBERT, and RoBERTa, also showed high performance in measuring short-text semantic similarities [69]. Furthermore, incorporating information from biomedical ontologies (e.g., UMLS) or knowledge sources (e.g., the website of Mayo Clinic) into LLMs has been shown to enhance performance in downstream tasks such as question answering [70], generating synthetic electronic health records [71], and measuring semantic similarity between biomedical texts [49]. 
In this study, we treated variable matching as a short-text (a blend of medical concepts and ordinary text) similarity assessment task and utilized multiple pre-trained LLMs (which have been fine-tuned for either general or biomedical semantic textual similarity (STS) tasks) to measure the semantic similarity between variable definitions. The evaluation of individual NLP methods showed that LLMs outperformed fuzzy matching, suggesting that measuring semantic similarities was beneficial for variable matching. We also found that BioLORD-2023 outperformed its base model, MPNet, achieving absolute gains of 0.09 to 0.16 points across all evaluation metrics. This result demonstrates the benefit of incorporating knowledge graphs such as UMLS and SNOMED-CT in variable matching. It is worth noting that incorporating domain-specific knowledge into general-domain LLMs is a nontrivial task. Previous research has shown that fine-tuning an LLM originally designed for general-domain STS tasks through contrastive learning from biomedical text pairs may negatively impact the model’s ability to retain general knowledge [49]. BioLORD-2023 employed a self-distillation strategy to mitigate this negative effect. However, it still performed worse than the general-domain E5 model in variable matching. Future research in developing and comparing methods for integrating domain-specific knowledge into E5 may further advance the state of the art in variable matching by individual LLMs. In addition, methods inspired by retrieval-augmented generation (RAG) [72] could potentially enhance LLMs for variable matching. For example, these methods could be used to retrieve text pairs (e.g., query-response pairs) containing medical terms from web posts to enrich the training data with more diverse and contextually relevant biomedical text pairs. 
Additionally, augmenting LLM inputs (e.g., variable labels and derivation rules in our case) with synonyms or concept relations extracted from biomedical knowledge graphs may further improve biomedical variable matching.
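To illustrate the lexical-versus-semantic distinction discussed above, the sketch below uses Python's standard-library `difflib` as a stand-in for the fuzzy matching method and a plain cosine function over hypothetical embedding vectors; neither is the study's exact implementation:

```python
from difflib import SequenceMatcher

def fuzzy_similarity(a, b):
    """Lexical similarity via approximate string matching (0 to 1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cosine(u, v):
    """Semantic similarity between two (hypothetical) embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)

# Lexically close labels score high under fuzzy matching...
lexical = fuzzy_similarity("Disease diagnosis date", "Diagnosis date of disease")
# ...but paraphrases with little word overlap (like the MMSES34/MMSE_Q34
# example in the error analysis) need embedding-based semantic similarity.
paraphrase = fuzzy_similarity("Correct response to writing", "Please write a sentence")
```

In the actual pipeline, the embedding vectors would come from a sentence-embedding LLM such as E5, and the fuzzy score from a dedicated string-matching library.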

In addition, ensemble learning, which combines the strengths of multiple individual methods, is an effective approach for enhancing task performance. Our evaluation results showed that the Random Forest classifier consistently outperformed the best individual method (i.e., the E5 model). This aligns with prior studies demonstrating that ensemble-based systems are more effective than single-expert systems [73]. In our case, the single-expert system, such as the fuzzy matching method or the LLMs, provides similarity scores for two variables. Due to variations in algorithm design and model training methods, these single-expert systems may offer diverse perspectives on the nuances of similarities and differences between variable definitions. Furthermore, ensemble learning provides a flexible framework for incorporating different information sources (e.g., variable labels, data sheet descriptions, and variable derivation rules in our case), which is also called data fusion [73]. In this study, we used the Random Forest classifier to incorporate similarity scores measured by using different single-expert systems and information sources, as well as several additional features (e.g., lengths of variable labels), to classify each candidate GERAS-JP variable as matching or not matching a GERAS-EU variable. Results from feature importance analysis and feature ablation analysis showed that features derived by using LLMs contributed the most to the Random Forest model’s good performance. In addition, the feature importance analysis showed that the important features for a high hit ratio (which is sensitive to the quality of top-ranked variable pairs) and a high MRR (which measures the global or overall ranking quality) differed, although the E5-derived feature E5_on_label and BioLORD-derived feature BioLORD_on_label stood out as the most important for both metrics. 
The contribution of the fuzzy matching features appeared negligible in the feature ablation analysis, except for the top-10 hit ratio. A possible reason is that the LLM-derived features already capture sufficient information about both lexical and semantic similarities for matching variables.

Our error analysis revealed two major error patterns. First, the variable descriptions are sometimes ambiguous or lack specific details. For example, the E5 model and the Random Forest model failed to match the variable “MMSEBLVALTR” (labeled as “Baseline Value - TR Phase”) from the GERAS-EU study with the variable “MMSEB” (labeled as “Mini-Mental State Examination (MMSE) at Baseline”) from the GERAS-JP study. Both variables represent the MMSE performance at the baseline visit, but the label for the “MMSEBLVALTR” variable (which does not specify that this variable represents the baseline value of the MMSE test) is too ambiguous or vague to be useful for variable matching. Second, the variable labels from the two studies sometimes emphasized different perspectives. For example, the variable “MMSES34” was labeled as “Item result numeric: Correct response to writing” in the GERAS-EU study, and its counterpart in the GERAS-JP study, i.e., MMSE_Q34, was labeled as “Please write a sentence”. Both variables represent the evaluation result of item 34 in the MMSE questionnaire, but the variable label from the GERAS-EU study focuses on the evaluation result, whereas the label from the GERAS-JP study focuses on the evaluation item. In general, most error cases involved variable labels that lacked semantic and lexical similarities, making them challenging for the NLP methods to analyze effectively.

This study has limitations. We validated our approach on two studies that followed similar protocols to collect data and used data dictionaries with comparable formats. For example, both data dictionaries include data fields for variable name, variable label, data sheet description, and variable derivation rule. Additional work in data preprocessing is required when applying our methods to cases where data dictionaries differ substantially in format and structure across studies. However, the methodology underlying our approach is generic. By utilizing pre-trained LLMs and the LLM-derived text similarity scores as machine learning features, our approach can be applied to other settings (e.g., ontology-based retrospective data harmonization, searching variables of interest in existing datasets) where text descriptions of variables are available. Another limitation, common to all computational approaches, is the additional effort required to prepare input data when “computer-readable” documentation is unavailable for some studies included in a data harmonization project [19].

Conclusion

Our NLP methods, which leveraged LLMs and ensemble learning, achieved promising results on the task of variable matching. Variable matching is an early key step in data harmonization and often requires substantial human effort. Our methods have great potential to reduce manual effort when text descriptions of variables are available for studies. In the future, we aim to refine and extend our methods to address scenarios with greater variation in data collection protocols across studies.

Supporting information

S1 Text.

Data fields used by natural language processing methods to match variables. Table A. Examples for variable names, labels, and data sheet descriptions. Table B. Examples of KeyBERT extraction from derivation rules.

https://doi.org/10.1371/journal.pone.0328262.s001

(PDF)

S2 Text.

Additional experimental results. Table A. NLP-derived features used by the Random Forest model. Table B. Other features used by the Random Forest model. Table C. Random Forest and E5 model performance comparison.

https://doi.org/10.1371/journal.pone.0328262.s002

(PDF)

Acknowledgments

Data discovery services and computational resources contributing to this work were provided in kind by the AD Data Initiative [https://www.alzheimersdata.org/]. The authors would like to acknowledge the sponsor of the GERAS-EU study for making this data available upon request through AD Data Initiative’s AD Workbench for analysis.

References

  1. Tiwari P, Colborn KL, Smith DE, Xing F, Ghosh D, Rosenberg MA. Assessment of a Machine Learning Model Applied to Harmonized Electronic Health Record Data for the Prediction of Incident Atrial Fibrillation. JAMA Netw Open. 2020;3(1):e1919396. pmid:31951272
  2. Wachinger C, Rieckmann A, Pölsterl S, Alzheimer’s Disease Neuroimaging Initiative and the Australian Imaging Biomarkers and Lifestyle Flagship Study of Ageing. Detect and correct bias in multi-site neuroimaging datasets. Med Image Anal. 2021;67:101879. pmid:33152602
  3. Navale V, von Kaeppler D, McAuliffe M. An overview of biomedical platforms for managing research data. Journal of Data, Information and Management. 2021;3(1):21–7.
  4. National Institute on Drug Abuse. NIDA Guidance on NIH Data Management & Sharing Policy. 2023. Available from: https://nida.nih.gov/research/nih-policies-guidance/guidance-nih-data-management-sharing-policy
  5. Fouad K, Vavrek R, Surles-Zeigler MC, Huie JR, Radabaugh HL, Gurkoff GG, et al. A practical guide to data management and sharing for biomedical laboratory researchers. Exp Neurol. 2024;378:114815. pmid:38762093
  6. Cheng C, Messerschmidt L, Bravo I, Waldbauer M, Bhavikatti R, Schenk C, et al. A General Primer for Data Harmonization. Sci Data. 2024;11(1):152. pmid:38297013
  7. Hu F, Chen AA, Horng H, Bashyam V, Davatzikos C, Alexander-Bloch A, et al. Image harmonization: A review of statistical and deep learning methods for removing batch effects and evaluation metrics for effective harmonization. Neuroimage. 2023;274:120125. pmid:37084926
  8. Fortier I, Raina P, Van den Heuvel ER, Griffith LE, Craig C, Saliba M, et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2016;46(1):103–5. pmid:27272186
  9. Fortier I, Doiron D, Burton P, Raina P. Invited commentary: Consolidating data harmonization--how to obtain quality and applicability? Am J Epidemiol. 2011;174(3):261–4.
  10. Cheng C, Messerschmidt L, Bravo I, Waldbauer M, Bhavikatti R, Schenk C, et al. Harmonizing government responses to the COVID-19 pandemic. Sci Data. 2024;11(1):204. pmid:38355867
  11. Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, et al. FAIR data sharing: The roles of common data elements and harmonization. J Biomed Inform. 2020;107:103421. pmid:32407878
  12. Adhikari K, Patten SB, Patel AB, Premji S, Tough S, Letourneau N, et al. Data harmonization and data pooling from cohort studies: a practical approach for data management. Int J Popul Data Sci. 2021;6(1):1680. pmid:34888420
  13. Borugian MJ, Robson P, Fortier I, Parker L, McLaughlin J, Knoppers BM, et al. The Canadian Partnership for Tomorrow Project: building a pan-Canadian research platform for disease prevention. CMAJ. 2010;182(11):1197–201. pmid:20421354
  14. Benet M, Albang R, Pinart M, Hohmann C, Tischer CG, Annesi-Maesano I, et al. Integrating Clinical and Epidemiologic Data on Allergic Diseases Across Birth Cohorts: A Harmonization Study in the Mechanisms of the Development of Allergy Project. Am J Epidemiol. 2018;188(2):408–17. pmid:30351340
  15. Wey TW, Doiron D, Wissa R, Fabre G, Motoc I, Noordzij JM, et al. Overview of retrospective data harmonisation in the MINDMAP project: process and results. J Epidemiol Community Health. 2021;75(5):433–41. pmid:33184054
  16. Hao X, Li X, Zhang G-Q, Tao C, Schulz PE, Alzheimer’s Disease Neuroimaging Initiative, et al. An ontology-based approach for harmonization and cross-cohort query of Alzheimer’s disease data resources. BMC Med Inform Decis Mak. 2023;23(Suppl 1):151. pmid:37542312
  17. Salimi Y, Domingo-Fernández D, Bobis-Álvarez C, Hofmann-Apitius M, Birkenbihl C. ADataViewer: exploring semantically harmonized Alzheimer’s disease cohort datasets. Alzheimers Res Ther. 2022;14(1):69.
  18. Fortier I, Wey TW, Bergeron J, Pinot de Moira A, Nybo-Andersen A-M, Bishop T, et al. Life course of retrospective harmonization initiatives: key elements to consider. J Dev Orig Health Dis. 2023;14(2):190–8. pmid:35957574
  19. Balise RR, Hu M-C, Calderon AR, Odom GJ, Brandt L, Luo SX, et al. Data cleaning and harmonization of clinical trial data: Medication-assisted treatment for opioid use disorder. PLoS One. 2024;19(11):e0312695. pmid:39570967
  20. Pang C, Hendriksen D, Dijkstra M, van der Velde KJ, Kuiper J, Hillege HL, et al. BiobankConnect: software to rapidly connect data elements for pooled analysis across biobanks using ontological and lexical indexing. J Am Med Inform Assoc. 2015;22(1):65–75.
  21. Pang C, Kelpin F, van Enckevort D, Eklund N, Silander K, Hendriksen D, et al. BiobankUniverse: automatic matchmaking between datasets for biobank data discovery and integration. Bioinformatics. 2017;33(22):3627–34. pmid:29036577
  22. Mate S, Kampf M, Rödle W, Kraus S, Proynova R, Silander K, et al. Pan-European Data Harmonization for Biobanks in ADOPT BBMRI-ERIC. Appl Clin Inform. 2019;10(4):679–92. pmid:31509880
  23. Pezoulas VC, Sakellarios A, Kleber M, Bosch JA, Laan SWVd, Lamers F, et al. A hybrid data harmonization workflow using word embeddings for the interlinking of heterogeneous cross-domain clinical data structures. 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI); 2021 July 27-30; p. 1-4.
  24. Bosch-Capblanch X. Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach. BMC Med Inform Decis Mak. 2011;11:33. pmid:21595905
  25. Wimo A, Reed CC, Dodel R, Belger M, Jones RW, Happich M, et al. The GERAS Study: a prospective observational study of costs and resource use in community dwellers with Alzheimer’s disease in three European countries--study design and baseline findings. J Alzheimers Dis. 2013;36(2):385–99. pmid:23629588
  26. Nakanishi M, Igarashi A, Ueda K, Brnabic AJM, Matsumura T, Meguro K, et al. Costs and resource use of community-dwelling patients with Alzheimer’s disease in Japan: 18-month results from the GERAS-J study. Curr Med Res Opin. 2021;37(8):1331–9. pmid:33904362
  27. Folstein MF, Folstein SE, McHugh PR. “Mini-mental state”. A practical method for grading the cognitive state of patients for the clinician. J Psychiatr Res. 1975;12(3):189–98. pmid:1202204
  28. Rosen WG, Mohs RC, Davis KL. A new rating scale for Alzheimer’s disease. Am J Psychiatry. 1984;141(11):1356–64. pmid:6496779
  29. Galasko D, Bennett D, Sano M, Ernesto C, Thomas R, Grundman M, et al. An inventory to assess activities of daily living for clinical trials in Alzheimer’s disease. Alzheimer Dis Assoc Disord. 1997;11(Suppl 2):S33–9.
  30. Wimo A, Jonsson L, Zbrozek A. The Resource Utilization in Dementia (RUD) instrument is valid for assessing informal care time in community-living patients with dementia. J Nutr Health Aging. 2010;14(8):685–90. pmid:20922346
  31. Johnson JA, Coons SJ. Comparison of the EQ-5D and SF-12 in an adult US sample. Qual Life Res. 1998;7(2):155–66. pmid:9523497
  32. Nakanishi M, Igarashi A, Ueda K, Brnabic AJM, Treuer T, Sato M, et al. Costs and resource use associated with community-dwelling patients with Alzheimer’s disease in Japan: baseline results from the prospective observational GERAS-J study. J Alzheimers Dis. 2020;74(1):127–38.
  33. Grootendorst M. KeyBERT. 2022. Available from: https://maartengr.github.io/KeyBERT/index.html
  34. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019;1(long and short papers):4171–86. doi: 10.18653/v1/N19-1423
  35. Wang L, Yang N, Huang X, Jiao B, Yang L, Jiang D, et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. 2022:arXiv:2212.03533. Available from: https://ui.adsabs.harvard.edu/abs/2022arXiv221203533W
  36. Gao T, Yao X, Chen D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov. Association for Computational Linguistics; 2021: p. 6894-6910.
  37. Bajaj P, Campos D, Craswell N, Deng L, Gao J, Liu X, et al. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. 2016:arXiv:1611.09268. Available from: https://ui.adsabs.harvard.edu/abs/2016arXiv161109268B
  38. Kwiatkowski T, Palomaki J, Redfield O, Collins M, Parikh A, Alberti C, et al. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics. 2019;7:453–66.
  39. Thakur N, Reimers N, Rücklé A, Srivastava A, Gurevych I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Datasets & Benchmarks Track; 2021 Oct 11; Online; 2021.
  40. Muennighoff N, Tazi N, Magne L, Reimers N. MTEB: Massive Text Embedding Benchmark. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL); 2023 May. Association for Computational Linguistics; 2023: p. 2014-37.
  41. Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019: 3982–92.
  42. Reimers N, Gurevych I. Pretrained Models. Sentence Transformers. 2019. Available from: https://sbert.net/docs/sentence_transformer/pretrained_models.html
  43. Song K, Tan X, Qin T, Lu J, Liu TY. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems. 2020;33:16857–67.
  44. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
  45. Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems. 2020;33:5776–88.
  46. Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network. 2015:arXiv:1503.02531. Available from: https://ui.adsabs.harvard.edu/abs/2015arXiv150302531H
  47. 47. Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. 2020:arXiv:2002.10957. Available from: https://ui.adsabs.harvard.edu/abs/2020arXiv200210957W
  48. 48. Li M, Sun B, Wang Q. Question Answering on SQuAD2.0. Stanford CS224N Natural Language Processing with Deep Learning. 2019.
  49. 49. Remy F, Demuynck K, Demeester T. BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights. J Am Med Inform Assoc. 2024;31(9):1844–55. pmid:38412333
  50. 50. Remy F, Scaboro S, Portelli B. Boosting Adverse Drug Event Normalization on Social Media: General-Purpose Model Initialization and Biomedical Semantic Text Similarity Benefit Zero-Shot Linking in Informal Contexts. Proceedings of the 11th International Workshop on Natural Language Processing for Social Media; 2023 Nov. Association for Computational Linguistics; 2023: p. 47-52.
  51. 51. Remy F, Demuynck K, Demeester T. BioLORD: Learning Ontological Representations from Definitions for Biomedical Concepts and their Textual Descriptions. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022 December. Association for Computational Linguistics; 2022: 1454-65.
52. Wortsman M, Ilharco G, Gadre SY, Roelofs R, Gontijo-Lopes R, Morcos AS, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: Chaudhuri K, Jegelka S, Song L, Szepesvari C, Niu G, Sabato S, editors. Proceedings of the 39th International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR; 2022. p. 23965–98.
53. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267–70. pmid:14681409
54. Côté RA, Robboy S. Progress in medical information management. Systematized nomenclature of medicine (SNOMED). JAMA. 1980;243(8):756–62.
55. Remy F, Demuynck K, Demeester T. Automatic Glossary of Clinical Terminology: a Large-Scale Dictionary of Biomedical Definitions Generated from Ontological Knowledge. 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks (BioNLP 2023); 2023 Jul. Association for Computational Linguistics; 2023: p. 265–72.
56. Wagner RA, Fischer MJ. The String-to-String Correction Problem. J ACM. 1974;21(1):168–73.
57. Bachmann M. Rapid fuzzy string matching in Python and C++ using the Levenshtein Distance. 2024. Available from: https://pypi.org/project/rapidfuzz/
58. Seatgeek. Fuzzy String Matching in Python. 2020. Available from: https://github.com/seatgeek/fuzzywuzzy
59. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
60. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–30.
61. Deshpande M, Karypis G. Item-based top-n recommendation algorithms. ACM Trans Inf Syst. 2004;22(1):143–77.
62. Shi Y, Karatzoglou A, Baltrunas L, Larson M, Oliver N, Hanjalic A. CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. Proceedings of the Sixth ACM Conference on Recommender Systems; Dublin, Ireland: Association for Computing Machinery; 2012. p. 139–46.
63. Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously. J Mach Learn Res. 2019;20.
64. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365–76. pmid:23571845
65. Bauermeister S, Bauermeister JR, Bridgman R, Felici C, Newbury M, North L, et al. Research-ready data: the C-Surv data model. Eur J Epidemiol. 2023;38(2):179–87. pmid:36609896
66. Gomaa WH, Fahmy AA. A survey of text similarity approaches. International Journal of Computer Applications. 2013;68(13):13–8.
67. Navarro G. A guided tour to approximate string matching. ACM Comput Surv. 2001;33(1):31–88.
68. Kenter T, de Rijke M. Short Text Similarity with Word Embeddings. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management; Melbourne, Australia: Association for Computing Machinery; 2015. p. 1411–20.
69. Amur ZH, Kwang Hooi Y, Bhanbhro H, Dahri K, Soomro GM. Short-Text Semantic Similarity (STSS): Techniques, Challenges and Future Perspectives. Applied Sciences. 2023;13(6):3911.
70. Feng Y, Zhou L, Ma C, Zheng Y, He R, Li Y. Knowledge graph-based thought: a knowledge graph-enhanced LLM framework for pan-cancer question answering. Gigascience. 2025;14:giae082. pmid:39775838
71. Hao Y, He H, Ho JC. LLMSYN: Generating Synthetic Electronic Health Records Without Patient-Level Data. In: Deshpande K, Fiterau M, Joshi S, Lipton Z, Ranganath R, Urteaga I, editors. Proceedings of the 9th Machine Learning for Healthcare Conference; Proceedings of Machine Learning Research: PMLR; 2024.
72. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems; Vancouver, BC, Canada: Curran Associates Inc.; 2020. Article 793.
73. Polikar R. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine. 2006;6(3):21–45.