
Fig 1.

Overview of the automated variable matching approach.

ᵃ Each large language model (LLM) was applied separately to rank EU-JP variable pairs. ᵇ The features used by the Random Forest classifier included cosine similarity scores generated by LLMs, edit distance scores generated by fuzzy matching, and other features derived from the data dictionary (detailed in the section Class label and machine learning features).
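The two similarity features named in this caption can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation: the function names, the toy embedding vectors, and the normalisation of edit distance into a 0-1 similarity are all assumptions made for the example, and the embeddings would in practice come from an LLM rather than being typed in by hand.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def edit_distance(s, t):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def pair_features(eu_name, jp_name, eu_vec, jp_vec):
    """Hypothetical feature row for one EU-JP variable pair."""
    longest = max(len(eu_name), len(jp_name), 1)
    distance = edit_distance(eu_name.lower(), jp_name.lower())
    return {
        "cosine_similarity": cosine_similarity(eu_vec, jp_vec),
        # Assumption: edit distance rescaled so that 1.0 means identical names.
        "edit_similarity": 1 - distance / longest,
    }
```

A feature row built this way for each candidate pair could then be passed to a classifier; the additional data dictionary features mentioned in the caption are not reproduced here.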


Table 1.

Examples for variable names, labels, and data sheet descriptionsᵃ.


Fig 2.

Training and evaluation of Random Forest (RF) classifier in a single trialᵃ.

ᵃ The Random Forest classifier was trained and evaluated in 50 trials, with each trial having a different random split of the training and test datasets. Each test set had slightly varying ratios of positive (i.e., matched EU-JP variable pairs) to negative (i.e., unmatched EU-JP variable pairs) cases because some EU variables were manually aligned to multiple JP variables (see details in sub-sections Training and test datasets and Model comparison).
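The repeated-split procedure described in this footnote can be illustrated with a small sketch. It assumes, as a guess at the setup rather than a confirmed detail, that the split is made over EU variables so that all candidate pairs of a held-out EU variable land in the test set. The toy variable names, the 25% test fraction, and the helper names are hypothetical, and classifier training is omitted.

```python
import random

def split_by_eu_variable(pairs, test_fraction, seed):
    """Split labelled (eu_var, jp_var, label) pairs by EU variable.

    Assumption: every candidate pair of a held-out EU variable goes to the test set.
    """
    eu_vars = sorted({eu for eu, _, _ in pairs})
    rng = random.Random(seed)  # a different seed per trial gives a different split
    rng.shuffle(eu_vars)
    n_test = max(1, round(len(eu_vars) * test_fraction))
    test_eu = set(eu_vars[:n_test])
    train = [p for p in pairs if p[0] not in test_eu]
    test = [p for p in pairs if p[0] in test_eu]
    return train, test

def positive_ratio(pairs):
    """Share of matched (label == 1) pairs in a dataset."""
    return sum(label for _, _, label in pairs) / len(pairs)

# Toy data (hypothetical): eu1 is aligned to two JP variables, the others to one.
matches = {("eu1", "jp1"), ("eu1", "jp2"), ("eu2", "jp3"),
           ("eu3", "jp4"), ("eu4", "jp1")}
eu_names = ["eu1", "eu2", "eu3", "eu4"]
jp_names = ["jp1", "jp2", "jp3", "jp4"]
pairs = [(eu, jp, int((eu, jp) in matches)) for eu in eu_names for jp in jp_names]

ratios = []
for trial in range(50):
    train, test = split_by_eu_variable(pairs, test_fraction=0.25, seed=trial)
    ratios.append(positive_ratio(test))
```

In this toy setting the share of positives in the test set depends on whether the multiply-aligned variable `eu1` is held out, which mirrors the reason given above for the varying ratios.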


Table 2.

Characteristics of variable labels, data sheet descriptions, and derivation rules of EU and JP variables.


Table 3.

Performance of individual NLP methods in variable matchingᵃ, ᵇ.


Table 4.

Random Forest and E5 model performance comparison.


Table 5.

Feature importance for the Random Forest modelᵃ.


Table 6.

Feature ablation analysisᵃ.
