Development, system design, safety, and performance metrics of a conversational agent for reducing depressive and anxious symptoms based on a large language model: The MHAI study

The criteria used to evaluate LLMs’ responses.

App architecture.

Platform features.

Differences between the large language models (GPT-4o vs. Llama 3.1-8B), estimated with mixed-effects models adjusted for evaluator (n = 816).

Five example pairs of consecutive interactions for the large language models (GPT-4o vs. Llama 3.1-8B).