Table 1.
The criteria used to evaluate LLMs’ responses.
Fig 1.
App architecture.
Fig 2.
Platform features.
Table 2.
The difference between the large language models (GPT-4o vs. Llama 3.1-8B), based on mixed-effects models adjusted for evaluator (n = 816).
Table 3.
Example of five pairs of consecutive interactions for the large language models (GPT-4o vs. Llama 3.1-8B).