A natural language processing approach to support biomedical data harmonization: Leveraging large language models

doi:10.1371/journal.pone.0328262

Fig 1.

Overview of the automated variable matching approach.

^a Each large language model (LLM) was applied to rank EU-JP variable pairs separately. ^b The features used by the Random Forest classifier included cosine similarity scores generated by LLMs, edit distance scores generated by fuzzy matching, and other features derived from the data dictionary (detailed in the section "Class label and machine learning features").
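As a rough illustration of the two similarity features named in footnote b, the sketch below computes a cosine similarity between two embedding vectors and a normalized edit-distance score between two variable labels. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the normalization by the longer string, and the lowercasing step are all hypothetical choices, and the embedding vectors are assumed to have been produced elsewhere by an LLM.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings.

    Assumption: normalizing by the longer string is one common choice,
    not necessarily the scoring used in the paper.
    """
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


def pair_features(eu_label, jp_label, eu_vec, jp_vec):
    """Hypothetical feature row for one EU-JP variable pair."""
    return {
        "cosine_similarity": cosine_similarity(eu_vec, jp_vec),
        "edit_similarity": edit_similarity(eu_label.lower(), jp_label.lower()),
    }
```

A feature row built this way could be passed, together with the data-dictionary features, to any classifier; the classifier itself is left out of the sketch.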


Table 1.

Examples of variable names, labels, and data sheet descriptions^a.


Fig 2.

Training and evaluation of Random Forest (RF) classifier in a single trial^a.

^a The Random Forest classifier was trained and evaluated in 50 trials, each with a different random split into training and test datasets. The ratio of positive (i.e., matched EU-JP variable pairs) to negative (i.e., unmatched EU-JP variable pairs) cases varied slightly across test sets because some EU variables were manually aligned to multiple JP variables (see details in sub-sections Training and test datasets and Model comparison).
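The repeated-split procedure in footnote a can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes that each split is made at the level of EU variables (so all pairs of one EU variable land on the same side), and the `test_fraction` value, the use of the trial index as the seed, and all function names are hypothetical. Classifier training is omitted; the sketch only shows why the positive-to-negative ratio can differ between test sets when some EU variables have more than one matched JP variable.

```python
import random


def split_by_eu_variable(pairs, test_fraction, seed):
    """Split (eu_var, jp_var, label) pairs so each EU variable is on one side only."""
    eu_vars = sorted({eu for eu, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(eu_vars)
    n_test = max(1, round(len(eu_vars) * test_fraction))
    test_eu = set(eu_vars[:n_test])
    train = [p for p in pairs if p[0] not in test_eu]
    test = [p for p in pairs if p[0] in test_eu]
    return train, test


def positive_ratio(pairs):
    """Share of pairs labeled 1 (matched) among all pairs."""
    if not pairs:
        return 0.0
    return sum(label for _, _, label in pairs) / len(pairs)


def run_trials(pairs, n_trials=50, test_fraction=0.3):
    """Return the test-set positive share for each trial.

    Assumption: test_fraction=0.3 is a placeholder, not the paper's setting.
    """
    ratios = []
    for trial in range(n_trials):
        _, test = split_by_eu_variable(pairs, test_fraction, seed=trial)
        ratios.append(positive_ratio(test))
    return ratios
```

In a full pipeline, each trial would fit the classifier on the training pairs and score it on the test pairs, with the metrics then summarized across the 50 trials.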
