Fig 1.
Overview of the automated variable matching approach.
a Each large language model (LLM) was applied separately to rank EU-JP variable pairs. b The features used by the Random Forest classifier included cosine similarity scores generated by the LLMs, edit distance scores generated by fuzzy matching, and other features derived from the data dictionary (detailed in the section "Class label and machine learning features").
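As a rough illustration of the two score types named in note b, the sketch below computes a cosine similarity between two embedding vectors and a normalized edit-distance similarity between two variable names. It is a minimal sketch under assumptions: the function names, the normalization by the longer string, and the toy vectors are hypothetical and are not taken from the study, which does not specify its fuzzy matching tool here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def levenshtein(s, t):
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def pair_features(eu_name, jp_name, eu_vec, jp_vec):
    """Hypothetical feature row for one EU-JP variable pair."""
    max_len = max(len(eu_name), len(jp_name)) or 1
    distance = levenshtein(eu_name.lower(), jp_name.lower())
    return {
        "cosine_similarity": cosine_similarity(eu_vec, jp_vec),
        # Assumed normalization: 1.0 means identical names (case-insensitive).
        "edit_similarity": 1.0 - distance / max_len,
    }
```

In practice the vectors would come from an embedding model applied to variable labels or descriptions; here they are plain lists so the example stays self-contained.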
Table 1.
Examples of variable names, labels, and data sheet descriptions.
Fig 2.
Training and evaluation of the Random Forest (RF) classifier in a single trial.
a The Random Forest classifier was trained and evaluated in 50 trials, each with a different random split into training and test datasets. The ratio of positive cases (i.e., matched EU-JP variable pairs) to negative cases (i.e., unmatched EU-JP variable pairs) varied slightly across test sets because some EU variables were manually aligned to multiple JP variables (see the sub-sections "Training and test datasets" and "Model comparison" for details).
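The effect described in note a can be reproduced on toy data. The sketch below repeats a random split 50 times and records the share of positive pairs in each test set. It is a hedged illustration only: splitting at the level of EU variables, the 20% test fraction, the toy variable names, and the helper names are all assumptions, and no classifier is trained.

```python
import random

def split_by_eu_variable(pairs, test_fraction, rng):
    """Split (eu_var, jp_var, label) pairs so each EU variable falls in one set only."""
    eu_vars = sorted({eu for eu, _, _ in pairs})
    rng.shuffle(eu_vars)
    n_test = max(1, round(len(eu_vars) * test_fraction))
    test_eu = set(eu_vars[:n_test])
    train = [p for p in pairs if p[0] not in test_eu]
    test = [p for p in pairs if p[0] in test_eu]
    return train, test

def positive_ratio(pairs):
    """Fraction of pairs labeled as matched (label == 1)."""
    return sum(1 for _, _, label in pairs if label == 1) / len(pairs)

# Toy data (hypothetical): 10 EU and 10 JP variables; eu_i matches jp_i,
# and two EU variables are additionally aligned to a second JP variable.
eu_names = [f"eu{i}" for i in range(10)]
jp_names = [f"jp{i}" for i in range(10)]
matches = {(f"eu{i}", f"jp{i}") for i in range(10)} | {("eu0", "jp5"), ("eu1", "jp6")}
pairs = [(eu, jp, int((eu, jp) in matches)) for eu in eu_names for jp in jp_names]

ratios = []
for trial in range(50):  # 50 trials, as in the note above
    rng = random.Random(trial)  # a different seed gives a different split
    train, test = split_by_eu_variable(pairs, 0.2, rng)
    ratios.append(positive_ratio(test))

print(min(ratios), max(ratios))
```

Whenever a test set happens to include an EU variable with more than one aligned JP variable, it contains more positive pairs than a test set without one, so the positive share differs between trials.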
Table 2.
Characteristics of variable labels, data sheet descriptions, and derivation rules of EU and JP variables.
Table 3.
Performance of individual NLP methods in variable matching.
Table 4.
Random Forest and E5 model performance comparison.
Table 5.
Feature importance for the Random Forest model.
Table 6.
Feature ablation analysis.