Figures
Abstract
Background
Conversational agents based on large language models (LLMs) have shown moderate efficacy in reducing depressive and anxiety symptoms. However, most existing evaluations lack methodological transparency, rely on closed-source models, and show limited standardization in performance and safety assessment.
Objective
We have two study objectives: (1) to develop an LLM-based conversational agent through system design analysis and initial functionality testing, and (2) to evaluate its safety and performance through standardized assessment in controlled simulated interactions focused on depression and anxiety of two LLMs (GPT-4o and Llama 3.1-8B).
Methods
We conducted a cross-sectional study in two phases. First, we developed a mental health platform integrating a conversational agent with functionalities including personalized context, pretrained therapeutic modules, self-assessment tools, and an emergency alert system. Second, we evaluated the agent’s responses in simulated interactions based on predefined user personas for each LLM. Four expert raters assessed 816 interaction pairs using a 5-point Likert scale evaluating tone, clarity, domain accuracy (correctness), robustness, completeness, boundaries, target language, and safety. In addition, we use performance metrics based on numerical criteria such as cost, response length, and number of tokens. Mixed-effects models were used to compare LLM performance and assess metric interrelations.
Results
First, we developed a web-based mental health platform using a user-centered design, structured into frontend, backend, and database layers. The system integrates therapeutic chat (GPT-4o and Llama 3.1-8B), psychological assessments (PHQ-9, GAD-7), CBT-based tasks, and an emergency alert system. The platform supports secure user authentication, data encryption, multilingual access, and session tracking. Second, GPT-4o outperformed Llama 3.1-8B in both performance metrics based on numerical criteria and Likert scale criteria, generating longer and more lexically diverse responses, using more tokens, and scoring higher in clarity, robustness, completeness, boundaries, and target language. However, it incurred higher costs, with no significant differences in tone, accuracy, or safety.
Citation: Villarreal-Zegarra D, Paredes-Gonzales Y, Dámaso-Román A, Quiñones-Inga J, Centeno-Terrazas G, Lozada YPA-M (2026) Development, system design, safety, and performance metrics of a conversational agent for reducing depressive and anxious symptoms based on a large language model: The MHAI study. PLoS One 21(3): e0344939. https://doi.org/10.1371/journal.pone.0344939
Editor: Issa Atoum, Philadelphia University, JORDAN
Received: November 5, 2025; Accepted: February 26, 2026; Published: March 18, 2026
Copyright: © 2026 Villarreal-Zegarra et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data is available at https://doi.org/10.6084/m9.figshare.29606618.v1.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors report no conflict of interest when conducting the study, analyzing the data, or writing the manuscript.
Introduction
Conversational agents based on large language models are demonstrating significant clinical efficacy in mental health applications. A meta-analysis of 15 randomized controlled trials showed that these systems achieve reductions in depressive symptoms and psychological distress, with moderate pooled effect sizes between 0.6 and 0.7 [1]. Complementarily, a recent randomized trial in adults demonstrated that a generative chatbot was effective in reducing symptoms of major depressive disorder, generalized anxiety, and eating disorders, with moderate effect sizes ranging from 0.627 to 0.903 [2]. Likewise, qualitative analyses have shown that these systems can sustain empathetic, non-judgmental, and clinically relevant dialogues, aligned with principles of face-to-face psychotherapy, offering anonymous, accessible, and acceptable interactions for individuals with depressive symptoms [3]. Nevertheless, questions remain regarding the methodological quality of available evidence and the robustness of evaluation frameworks used in these studies.
Despite emerging clinical evidence, critical methodological deficiencies persist that limit rigorous scientific validation of conversational agents in mental health. A systematic review of 137 studies revealed that 99.3% evaluated closed-source models without providing sufficient information to identify the specific model version, compromising reproducibility and comparability across research [4]. This is compounded by marked heterogeneity in measurement instruments, since an analysis of 203 tools identified that 73.9% were used solely in individual studies, and 52.7% lacked cited evidence of validity, being mostly created or adapted exclusively for the study in question [5]. These limitations also affect the evaluation of key dimensions such as empathy. A systematic review found that only 26% of 19 studies directly examined empathetic characteristics of conversational agents, thus restricting a comprehensive understanding of their therapeutic performance [6]. In an early effort to standardize these evaluations, a Delphi study reached expert consensus to establish a set of 24 metrics grouped into four main domains: global evaluation, response generation, comprehension, and aesthetics [7]. However, this framework has been scarcely adopted in subsequent empirical research, highlighting the current need for studies implementing systematic and comparative evaluation schemes based on standardized metrics.
To address these limitations, the present study developed an integrated conversational platform for mental health and systematically evaluated the comparative performance of GPT-4o and Llama 3.1–8B through system design analysis and standardized safety metrics. The selection of GPT-4o was based on recent evidence of its efficacy in cross-cultural interventions, where it obtained high scores in positivity (9.0/10) and empathy (8.7/10) [8]. Meanwhile, Llama 3.1–8B was selected for its competitive performance in healthcare support tasks, according to comparative evaluations, and for representing a validated open-source alternative [9].
The main scientific contribution of this study to the field of digital mental health is the proposal of a reproducible comparative evaluation framework that integrates standardized metrics of conversational quality, safety, and performance, applied systematically to an open-source language model and a commercial (closed-source) language model under controlled conditions. Unlike previous evaluations, the analysis is based on clinically plausible simulated interactions, adjusted for language, user persona, and evaluator, and simultaneously incorporates linguistic indicators, validated qualitative criteria, and economic metrics. In addition, the study presents a functional and integrated architecture of an LLM-based mental health platform, designed for early safety evaluations and future clinical studies, providing a replicable framework for subsequent research and for the responsible implementation of these systems in mental health care settings. Therefore, we have two study objectives: (1) to develop an LLM-based conversational agent through system design analysis and initial functionality testing, and (2) to evaluate its safety and performance through standardized assessment in controlled simulated interactions focused on depression and anxiety. Moreover, based on the second objective, we hypothesize that in the simulated interactions GPT-4o will obtain higher scores than Llama 3.1-8B on the performance metrics and safety.
Methods
Design
Our study was cross-sectional. Our study used the CHART checklist, the Reporting Guideline for Chatbot Health Advice Studies (see S1 File).
For the first objective, we developed a conversational agent based on LLM and conducted a descriptive study of its technical aspects, system design, and functionality testing. For the second objective, we evaluated the safety and performance metrics of the conversational agent’s responses, which were assessed by the research team in a controlled setting. The performance of two LLMs was evaluated: GPT-4o, which was selected for being the most widely used OpenAI version at the time of writing the study protocol (May 2025), and Llama 3.1-8B, which was chosen for being an open-access, easy-to-implement, and free LLM. It should be noted that the LLMs were not subjected to fine-tuning or Retrieval-Augmented Generation (RAG). Both LLMs used the same prompt, which was designed to simulate a cognitive-behavioral therapist with a supportive and validating attitude, providing short responses. The full prompt is provided in S2 File. The research protocol was written and planned in May 2025, was approved by the ethics committee on September 19, 2025, and the evaluation and recruitment took place between September 20 and 21, 2025.
Setting
The Digital Health Research Center team designed a mental health care platform that integrates a conversational agent, mental health assessments, behavioral activation-based activities and an alert/emergency system. The platform was designed to be interoperable with the OpenMRS electronic medical record system. However, for the purposes of this study, we do not address OpenMRS interoperability and instead focus on the internal functioning of the platform. The platform is intended as an additional resource for both in-person and telehealth care in a private mental health clinic. It has been developed in both English and Spanish. This research is part of the Mental Health platform supported by the Artificial Intelligence study (MHAI Study). The target population consists of users aged 18–30 who own a smartphone, are digital natives, and are simultaneously undergoing psychotherapy or pharmacological treatment for depression or anxiety.
Participants
The participants in this study were four members of the research team. For the first objective, which focuses on the development of the conversational agent, we describe the technical aspects of its development, system design, and functionality testing. We also present the iterative improvement process of the conversational agent, based on a user experience approach.
For the second objective, the conversational agent’s responses were evaluated by research team members with expertise in psychotherapy and digital mental health. The evaluation of safety and performance metrics were based on simulated conversations between the conversational agent and potential users experiencing anxiety and depression. To ensure that researchers simulated comparable user cases, three user personas were designed as a basis for the simulated interactions (see S3 File). These user personas specified baseline characteristics that researchers were instructed to simulate, with each conversation consisting of a minimum of 17 interaction pairs (17 responses from the LLM and 17 from the researcher). Each interaction pair served as a unit of evaluation.
The number of interaction pairs was determined based on an independent-means difference analysis to evaluate potential differences in the performance of GPT-4o and Llama 3.1-8B (two groups). A minimum of 788 interaction pairs was estimated to be required for evaluation by the four raters, assuming a small effect size (d = 0.2), an alpha probability of 0.05, 80% power, a two-tailed distribution, and an allocation ratio of 1. Each researcher generated responses for both conversational agents in English and Spanish (x2), using both LLMs (x2), across the three user personas per LLM and language (x3), and a minimum of 17 interaction pairs in each case (x17), resulting in a total of 204 interaction pairs evaluated by each researcher. All research team members were adults (over 18 years old), fluent in both English and Spanish, held undergraduate degrees in health-related fields, and had at least three years of experience in mental health.
Measurements and procedures
Development and system design.
A description was provided of the technologies used to develop the platform hosting the conversational agent, including how these technologies were interconnected across the frontend, backend, and database. Additionally, the implementation of the following functionalities was presented:
- Personalized context window: The conversational agent was designed to gather information from users regarding their preferences and needs to generate personalized activity recommendations. It also stored information from previous sessions and allowed tracking of user progress.
- Pretrained clinical intervention modules: The conversational agent offered intervention modules based on cognitive-behavioral therapy (CBT) techniques [10], mindfulness, and dialectical behavior therapy (DBT) [11], along with therapeutic components such as psychoeducation, relaxation techniques, behavioral activation, problem-solving strategies, and cognitive restructuring. These components were selected based on a systematic review of mental health conversational agents, which identified them as the most frequently used and effective strategies [12].
- Self-administered anxiety and depression scales: The platform allowed users to complete validated self-report assessments, including the Patient Health Questionnaire-9 (PHQ-9) [13] and the Generalized Anxiety Disorder-7 (GAD-7) [14], to assess depressive and anxious symptoms.
- Emergency alert for suicidal ideation: If a user reported suicidal ideation via the PHQ-9 questionnaire or if the conversational agent detected suicide-related content in conversations, an alert was triggered to notify the research team. Additionally, the user was provided with information about professional mental health resources. This emergency alert feature had been successfully tested in previous studies [3,15].
- Mood assessment: Each time a user logged into the platform, they were asked to assess their emotional state using two questions: 1) “How happy or sad do you feel right now?” answered on a 7-point scale from 7 = “very happy” to 1 = “very sad”. 2) “How nervous or calm do you feel right now?” answered on a 7-point scale from 1 = “very nervous” to 7 = “very relaxed and calm” [16]. This assessment was conducted at each login.
- Support for completing weekly tasks: The platform included a feature that allowed users to discuss their weekly therapeutic tasks with the conversational agent. The agent also provided personalized recommendations based on the user’s previously established preferences.
The conversational agent supported both text and voice input and output. In addition to the main functionalities, the platform included information on data protection policies, details about the research team, and emergency mental health helplines. All stages of the development of the conversational agent, including design modifications, technical implementations, and evaluation processes, were systematically documented by the research team and descriptively reported.
Development of user personas and clinical rationale.
The construction of user personas focused on the most frequent patient typologies observed in routine telehealth care at a specialized mental health center (Digital Health Centro Médico) and on the selection of clinically prioritized profiles commonly encountered in real-world practice at this center. The user personas were developed based on real cases that served as reference points and were intentionally selected to maximize clinical diversity across diagnoses and symptom severity levels. For each user persona, a set of clinical characteristics was specified, including diagnosis, severity, reason for seeking help, and patient context, as well as sociodemographic characteristics such as age, gender, occupation, and marital status.
To protect privacy and ensure anonymization, the personas were not direct reproductions of identifiable patients. Sociodemographic and contextual details were modified, combined, or generalized so that no persona corresponded to an identifiable individual. Names, dates, and unique events were not retained. This approach preserves clinical realism while minimizing the risk of reidentification. S3 File provides the complete specifications of the user personas.
Performance metrics based on quantitative response characteristics.
We evaluated a set of performance metrics for each of the LLMs:
- Response Length is the number of characters in a response (including spaces).
- Lexical Diversity is the variety of vocabulary used in response.
- Number of input tokens (prompt tokens) and number of output tokens (completion tokens).
- Cost per response, defined as the execution cost in USD for each model per response. We estimated token costs based on prior studies and computed the average cost per response using token counts. The estimated input and output token costs were for GPT-4o ($0.00000125 per input token and $0.000005 per output token) and Llama 3.1-8B ($0.00000003 per input token and $0.00000005 per output token). The estimate was calculated based on every 100,000 tokens.
Safety and conversational quality performance metrics.
A Likert scale from 1 (did not meet the criterion) to 5 (fully met the criterion) was used, and the evaluation criteria were defined in Table 1. The different criteria used had been applied in previous studies [7,17].
Analysis plan
For the main hypothesis (second objective), we evaluated which of the LLMs demonstrated better performance metrics and safety based on mixed-effects models adjusted for evaluator. Model adjusted by language (Spanish/English), user persona, and researcher. In addition, some of the responses were presented descriptively as examples of the style of responses for each model. We considered the hypothesis to be supported when the differences in performance metrics and safety were statistically significant (p < 0.05) and favored GPT-4o.
We conducted a one-way sensitivity analysis to contextualize the estimated token-based costs under realistic variations in session intensity. Using the mean prompt and completion token counts observed in our evaluation and the per-token prices. We computed the expected LLM execution cost for a standardized course of care comprising 10 sessions of approximately 45 minutes each. Assuming an interaction rate ranging from 1 to 3 model responses per minute (45–135 responses per session).
Ethical aspects
This study did not involve external human participants and therefore posed no ethical risk. The protocol was approved by the Institutional Research Ethics Committee of the Universidad Científica del Sur (N°1235-CIEI-CIENTÍFICA-2025). Because the study aimed to assess the safety of the conversational agent, no patients or potential end users were included, as exposure to potential hallucinations from the agent could pose an ethical risk to their safety or well-being. The researchers who evaluated the conversational agent’s responses completed and signed a written informed consent form and agreed to participate in the study. The informed consent form was sent by email, and the researchers returned it signed. The study did not include minors.
Results
Development and system design
A user-centered design approach was adopted, and the integration of different system components was analyzed. The system is structured into three main layers: frontend, backend, and database (see Fig 1 and S1 Table). Regarding deployment infrastructure, the application is hosted on Replit and configured to run on Cloud Run. The specific architecture detailing the connections between the various technologies used can be found in S1 Fig.
APP Features: The app includes six main functionalities to reduce depressive and anxiety symptoms (see Fig 2): (1) User authentication system, which enables secure registration and access to the platform, ensuring privacy through secure session handling, tracking each user’s last login, and activity logging. (2) Psychological assessments, integrating the PHQ-9 and GAD-7 questionnaires for detecting and monitoring symptoms of depression and anxiety, ensuring data storage in the database. (3) Therapeutic chat with AI, utilizing the OpenAI GPT-4o model with a prompt designed to generate responses in a therapeutic style based on cognitive behavioral therapy (CBT) and incorporating CBT-based clinical manuals within the context window. (4) Therapeutic tasks, including mindfulness breathing exercises and scheduled physical activities, with weekly monitoring to assess user adherence to therapeutic recommendations. (5) An emergency alert system triggers an email alert for the clinical team to make immediate contact in critical situations. (6) Accessibility, security, and privacy features, including an encryption mechanism to protect sensitive data, a role-based access control system, and multilingual support in Spanish and English to enhance platform accessibility.
The application’s workflow begins with user registration and login, ensuring secure authentication on the platform. Once logged in, users can access a variety of therapeutic features available on the dashboard, designed to address symptoms of anxiety and depression. S4 File provides a visual representation of the app’s user interface.
Safety and performance metrics
Our study evaluated the performance metrics based on quantitative response characteristics of two LLMs (GPT-4o vs. Llama 3.1-8B) and found that GPT-4o produced longer responses, with an average of 71.59 more characters than Llama 3.1-8B (95% CI: 51.23 to 91.94; p < 0.001), and exhibited greater lexical diversity, with a 7.77% increase compared to Llama 3.1-8B (95% CI: 6.86 to 8.69; p < 0.001). Additionally, GPT-4o used, on average, 3.16 more tokens for the input (95% CI: 1.37 to 4.95; p < 0.001) and 26.86 more tokens for the output (95% CI: 22.77 to 30.94; p < 0.001) than Llama 3.1-8B. This also translated into a higher cost per 100,000 tokens in USD for GPT-4o compared to Llama 3.1-8B (see Table 2).
In terms of conversational quality performance metrics, GPT-4o outperformed Llama 3.1-8B in several domains such like clarity of responses, the ability to handle ambiguous or incomplete questions while maintaining conversational coherence (robustness), comprehensive answers without including unsolicited information (completeness), the ability to restrict responses to mental health related content (boundaries), and the ability to respond in the same language in which the question was asked (target language) (see Table 2). No significant differences were observed in tone, domain accuracy, or response safety.
Consistent with our hypothesis, GPT-4o scored higher than Llama 3.1–8B on clarity, robustness, completeness, boundaries, and target language, while no differences were observed for tone, domain accuracy, or safety.
One-way sensitivity analysis of cost
The estimated total cost per patient by 10 sessions of 45 min each one, ranged from $0.22 to $0.66 for GPT-4o, compared with $0.0017 to $0.0053 for Llama 3.1-8B (see S2 Table). These results indicate that the relative cost difference is robust to plausible changes in message volume, although our estimates capture only variable token costs and exclude infrastructure and implementation overhead.
Examples of interaction pairs
We present examples of interaction pairs conducted by the authors while simulating the user personas (see Table 3). It should be noted that no interaction pairs that were clinically risky or unsafe received a score of 1, nor were there any particularly relevant cases that could be listed; therefore, no concrete examples of potentially unsafe interactions could be provided.
Discussion
This study addressed two objectives related to the development and evaluation of LLM-based conversational agents for depression and anxiety. For the first objective, we developed an integrated mental health platform that combines LLM-based conversational agents (GPT-4o and Llama 3.1-8B), validated psychological assessments (PHQ-9, GAD-7), evidence-based therapeutic modules (CBT, DBT), and an emergency alert system within a unified, multilingual interface. The integration of multiple therapeutic components for subsequent use in clinical contexts is a common strategy in the design of conversational agents for mental health [6].
For the second objective, we conducted a comparative evaluation of GPT-4o and Llama 3.1-8B in terms of safety and performance using standardized metrics in controlled simulated interactions. Our findings partially support our hypothesis, as GPT-4o outperformed Llama 3.1-8B only on selected conversational quality metrics. Specifically, GPT-4o demonstrated statistically superior performance in completeness, clarity, robustness, boundary adherence, and target language conformity. This difference may be attributed to the substantially smaller model size and lower parameter count of Llama 3.1-8B compared to GPT-4o’s higher capacity, which enables it to address complex benchmarks such as Humanity’s Last Exam, a level of performance that Llama 3.1-8B does not achieve.
Other studies have consistently found that GPT-4 and GPT-4o outperform various versions of LLaMA. For instance, one study compared GPT-4, Bard, and LLaMA-2 in the Taiwanese psychiatric licensing exam and found that GPT-4 exhibited superior performance compared to LLaMA-2, although the evaluation focused more on formal medical knowledge than on the quality of therapeutic conversation [18]. Another study reported that GPT-4o maintains greater consistency in treatment recommendations compared to LLaMA 3 variants in psychiatric diagnostic tasks [19]. Although these studies assessed different aspects of LLM performance in healthcare, GPT models have shown consistently superior performance over LLaMA, suggesting meaningful differences in their internal architectures. This is reflected in their performance on psychiatric tasks, supporting the generalization of our findings to more complex clinical applications.
At the safety level, one study evaluated GPT-4o alongside Claude, Gemini, and a variant of LLaMA 3 in psychiatric diagnostic tasks and found that GPT-4o maintained greater consistency in treatment recommendations compared to LLaMA 3 variants, although both models exhibited minimally biased diagnostic decisions [19]. However, another study found that GPT-4 and MentaLLaMA exhibited differences in gender- and sexual orientation-related biases in the context of eating disorders [20]. Both studies focused on differential diagnosis and the detection of specific biases, whereas our study assessed broader safety dimensions in therapeutic conversational interactions. This methodological difference may explain why we observed more marked equivalence in our safety metrics, suggesting that safety performance may vary depending on the specific clinical task and the evaluation framework used.
GPT-4o demonstrated superior natural language generation capabilities in structured therapeutic contexts, particularly in maintaining appropriate conversational boundaries and consistency across English and Spanish interactions. However, this improved performance comes with a significant economic cost, as GPT-4o incurred operational expenses more than 140 times higher than LLaMA 3.1-8B. While previous comparative studies have evaluated different LLM architectures in psychiatric applications [21], few have specifically quantified the economic implications of these performance differences for large-scale deployments, as we report here. This economic analysis is particularly relevant given the growing emphasis on sustainable and scalable digital health interventions [22], especially in low- and middle-income countries.
The evaluation of multiple assessment criteria in our study is valuable because most evaluations of LLM-based conversational agents in mental health focus on response accuracy, while few studies have examined conversational dialogue quality, as in our case [23]. Additionally, studies evaluating LLMs in mental health remain underrepresented in the scientific literature compared to other disciplines such as general healthcare or internal medicine [23].
Our study performs a sensitivity analysis based on the assumption of a potential mental health care service and considers token-based costs. Although this perspective is limited, as it does not account for implementation, supervision, data management, infrastructure, and other associated costs, it serves as a proxy that allows potential decision-makers and managers to identify a constant cost for packages of 10 sessions and for each individual session, depending on the model. It should be noted that the evidence suggests that, although the initial implementation process of digital mental health interventions is more costly than traditional care, these costs decrease when large volumes of care are considered [24,25].
We encourage researchers to conduct future studies focused on developing technological solutions in mental health that can comply with existing regulations, such as the standards of the Health Insurance Portability and Accountability Act (HIPAA), or similar legislation in different countries that ensures data privacy and security [26]. Although commercial models such as Gemini, Claude, or GPT are at the frontier of current knowledge, they are not HIPAA-compliant, which limits their use in healthcare. Therefore, it is advisable to use open-source models, such as GPT-OSS:120B or DeepSeek-R1, provided that national legislation allows their use. Running models on local servers enables control over data and facilitates compliance with privacy and health information security laws.
Strengths and limitations
This study has some limitations. First, we evaluated only two LLMs, GPT-4o and Llama 3.1-8B, without including other open-access models such as DeepSeek, larger Llama variants, or other state-of-the-art models such as Gemini 2.5-Pro, Grok-4, or GPT-o3. This restricts the generalizability of our findings and limits comparison with models that may present different performance–cost profiles. Although our results cannot be directly generalized to other LLMs, prior evidence indicates that state-of-the-art reasoning models generally outperform older or non-reasoning models [27]. Therefore, while our findings may not extend to earlier models, it is plausible that more recent state-of-the-art models would demonstrate similar or superior performance. Second, the evaluation was conducted using research team members simulating user personas rather than real patient interactions. While this approach limits ecological validity, we considered it appropriate for initial safety evaluation to avoid potential ethical risks associated with exposing vulnerable populations to unvalidated systems. Therefore, in later phases of the study, the platform should be validated in real-world settings. For example, this includes studies with real users (patients) in supervised environments, followed by randomized clinical trials. Third, although we employed a systematic method to evaluate the performance metrics based on Likert scale criteria, evaluations were conducted by expert evaluators from the research team, which may introduce subjective bias despite standardized training protocols. Fourth, because different evaluators rated different interactions, it was not possible to estimate inter-rater reliability coefficients (i.e., kappa or ICC). However, to mitigate evaluator-related bias, we used mixed-effects models adjusted for evaluator, thereby preventing the main analysis from being affected by rater severity or leniency, that is, by an evaluator being systematically more stringent or more lenient. Fifth, our study did not include fine-tuning or retrieval-augmented generation (RAG) at the model design stage. Although it has been shown that fine-tuned models exhibit performance comparable to general-purpose models for different health-related tasks [28], the use of RAG could represent an alternative to improve the model’s metric performance. Therefore, we estimate that subsequent stages of the study will include the use of RAG with transcripts of psychotherapy recordings, coding patient and therapist responses, and incorporating care manuals or clinical practice guidelines to improve the performance of the responses. Sixth, a cost-effectiveness analysis or clinical benchmarking were not possible because, although information on the cost per input and output token was available, there was no information on the clinical effect of the intervention. Therefore, future studies, such as randomized clinical trials, should include cost-effectiveness evaluations of this type of digital mental health intervention. Seventh, it was not possible to perform a sensitivity analysis of the findings for each LLM and user persona, as the available sample size was insufficient to replicate the main analysis stratified by these different conditions.
Conversely, this study demonstrates several important strengths that enhance the validity and applicability of our findings. First, we implemented a standardized evaluation protocol using validated metrics, enabling systematic and reproducible comparisons between models while maintaining consistency with established evaluation standards in health chatbot research [7]. Second, our study evaluates the potential of multilingual LLMs (English and Spanish), which is particularly relevant given the growing need for culturally and linguistically aligned mental health interventions [29]. Third, we make the dataset publicly available to support future analyses, in line with principles of transparency and open access.
Implications for future studies
Our study outlines a pathway for future research that addresses the limitations identified and ensures a substantial contribution to the field of digital mental health. First, we require evaluations with real users in controlled settings, that is, a preliminary usability assessment with patients and therapists to ensure that the platform is acceptable, usable, and can be adopted as a complement to psychotherapeutic processes. In addition, we should assess additional requirements that may be valuable to users and were not previously considered. Second, to ensure the ecological validity of the findings in real-world contexts, we must conduct a preliminary effectiveness evaluation in small groups, such as pilot clinical trials, or by using non-experimental methods, such as case series without a control group. This would allow us to determine whether the observed results are replicable in routine care settings and to assess preliminary clinical efficacy. Moreover, it would help identify the optimal platform prescription, such as the frequency of use and the number of sessions. Third, we should conduct robust clinical studies with control groups and adequate sample sizes. This would allow us to establish clinical efficacy under controlled conditions. Fourth, we should perform implementation studies in primary care and hospital settings to evaluate the real-world impact of MHAI in clinical contexts. At the same time, registration with the relevant regulatory authorities, such as the Food and Drug Administration (FDA) in the United States or the Dirección General de Medicamentos, Insumos y Drogas (DIGEMID) in Peru, will be required if the platform is to be considered a medical device for therapeutic use. Prior to this stage, the platform should be regarded as a tool that may improve overall emotional well-being, rather than as a formal therapy.
At the architectural level of MHAI, the platform must implement additional servers to support high traffic volumes through horizontal scalability, while maintaining compliance with data security and privacy standards such as HIPAA. Moreover, MHAI should not operate as an isolated platform; instead, it should be integrated into a real clinical care process and interoperable with electronic health record systems such as OpenMRS, Epic, or Cerner.
Regarding new features and functions, several developments are required. First, the emergency button must be tested under real-world conditions with real users, as only patients can determine whether it is helpful in practice. Although the current implementation is functional and operates correctly, it has not yet been tested in real-world contexts due to ethical constraints. Second, functions that allow users to obtain information from a licensed health professional, schedule a new appointment, or be redirected to a real clinical care pathway should be implemented. Third, additional large language models should be implemented to review conversations and extract clinically relevant information that can be validated by a health professional and subsequently entered into the patient’s electronic health record. For example, one model could identify suicide risk in conversations, while another could detect signs and symptoms that can later be validated and documented in the medical record.
Conclusions
GPT-4o demonstrated significant superiority in five specific conversational metrics (clarity, robustness, completeness, boundaries, and target language adherence) while maintaining statistical equivalence with Llama 3.1-8B in the most important dimensions for clinical safety: safety, therapeutic tone, and domain accuracy. However, GPT-4o incurred substantially higher operational costs, representing a critical trade-off between specific conversational performance and economic sustainability for large-scale implementations that require careful consideration in deployment decisions.
Simultaneously, we developed a comprehensive mental health web platform using user-centered design principles, structured in frontend, backend, and database layers, integrating therapeutic chatbot capabilities, CBT-based interventions, emergency alert systems, secure authentication, data encryption, multilingual access, and comprehensive session tracking. This integrated architecture demonstrates the technical feasibility of comprehensive LLM-based mental health systems while providing a replicable framework for future clinical implementations.
Supporting information
S1 Table. Technical characteristics for the frontend, backend, and database.
https://doi.org/10.1371/journal.pone.0344939.s005
(DOCX)
S1 Fig. Architecture and connection between the different technologies.
https://doi.org/10.1371/journal.pone.0344939.s007
(DOCX)
References
- 1. Li H, Zhang R, Lee Y-C, Kraut RE, Mohr DC. Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digit Med. 2023;6(1):236. pmid:38114588
- 2. Heinz MV, Mackin DM, Trudeau BM, Bhattacharya S, Wang Y, Banta HA, et al. Randomized Trial of a Generative AI Chatbot for Mental Health Treatment. NEJM AI. 2025;2(4).
- 3. Martinengo L, Lum E, Car J. Evaluation of chatbot-delivered interventions for self-management of depression: content analysis. J Affect Disord. 2022;319:598–607. pmid:36150405
- 4. Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, et al. Large language models for chatbot health advice studies: a systematic review. JAMA Netw Open. 2025;8(2):e2457879. pmid:39903463
- 5. Jabir AI, Martinengo L, Lin X, Torous J, Subramaniam M, Tudor Car L. Evaluating conversational agents for mental health: scoping review of outcomes and outcome measurement instruments. J Med Internet Res. 2023;25:e44548. pmid:37074762
- 6. Sanjeewa R, Iyer R, Apputhurai P, Wickramasinghe N, Meyer D. Empathic conversational agent platform designs and their evaluation in the context of mental health: systematic review. JMIR Ment Health. 2024;11:e58974. pmid:39250799
- 7. Denecke K, Abd-Alrazaq A, Househ M, Warren J. Evaluation Metrics for Health Chatbots: A Delphi Study. Methods Inf Med. 2021;60(5–06):171–9. pmid:34719011
- 8. Kang B, Hong M. Development and evaluation of a mental health chatbot using ChatGPT 4.0: mixed methods user experience study with Korean users. JMIR Med Inform. 2025;13:e63538. pmid:39752663
- 9. McInerney S, Nash T, Lee R, Falis M, Gruber F, Casey A. AI Chatbot for Cancer Patient Support: Development and Evaluation Using Llama 3.1, Mistral 7B, and PHI 3B. Stud Health Technol Inform. 2025;327:890–1.
- 10.
Barlow DH. Clinical handbook of psychological disorders: a step-by-step treatment manual. 6th ed. New York, NY, US: The Guilford Press; 2021.
- 11.
Linehan MM. Skills training manual for treating borderline personality disorder. 2nd ed. New York: Guilford Press; 2014.
- 12. Lin X, Martinengo L, Jabir AI, Ho AHY, Car J, Atun R, et al. Scope, characteristics, behavior change techniques, and quality of conversational agents for mental health and well-being: systematic assessment of apps. J Med Internet Res. 2023;25:e45984.
- 13. Spitzer RL, Kroenke K, Williams JB. Validation and utility of a self-report version of PRIME-MD: the PHQ primary care study. JAMA. 1999;282:1737–44.
- 14. Spitzer RL, Kroenke K, Williams JBW, Löwe B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch Intern Med. 2006;166(10):1092–7. pmid:16717171
- 15. Anmella G, Sanabra M, Primé-Tous M, Segú X, Cavero M, Morilla I, et al. Vickybot, a chatbot for anxiety-depressive symptoms and work-related burnout in primary care and health care professionals: development, feasibility, and potential effectiveness studies. J Med Internet Res. 2023;25:e43293. pmid:36719325
- 16. Vilela-Estrada AL, Villarreal-Zegarra D, Mayo-Puchoc N, Hol N, Flórez-Varela Á, Fung C. Does daily happiness and nervousness differ in adolescents and young adults in three Latin American cities: an experience sampling study. JMIR Preprints. 2024.
- 17. Coen E, Del Fiol G, Kaphingst KA, Borsato E, Shannon J, Smith HS, et al. Chatbot for the return of positive genetic screening results for hereditary cancer syndromes: a prompt engineering study. Res Sq. 2024.
- 18. Li D-J, Kao Y-C, Tsai S-J, Bai Y-M, Yeh T-C, Chu C-S, et al. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan psychiatric licensing examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin Neurosci. 2024;78(6):347–52. pmid:38404249
- 19. Bouguettaya A, Stuart EM, Aboujaoude E. Racial bias in AI-mediated psychiatric diagnosis and treatment: a qualitative comparison of four large language models. NPJ Digit Med. 2025;8(1):332. pmid:40467886
- 20. Schnepper R, Roemmel N, Schaefert R, Lambrecht-Walzinger L, Meinlschmidt G. Exploring biases of large language models in the field of mental health: comparative questionnaire study of the effect of gender and sexual orientation in anorexia nervosa and bulimia nervosa case vignettes. JMIR Ment Health. 2025;12:e57986. pmid:40111287
- 21. Gargari OK, Habibi G, Nilchian N, Shafiee A. Comparative analysis of large language models in psychiatry and mental health: a focus on GPT, AYA, and Nemotron-3-8B. Asian J Psychiatr. 2024;99:104148. pmid:39047354
- 22. Rutten LJF, Ridgeway JL, Griffin JM. Advancing translation of clinical research into practice and population health impact through implementation science. Mayo Clinic Proceed. 2024;99:665–76.
- 23. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2025;333(4):319–28. pmid:39405325
- 24. Rohrbach PJ, Dingemans AE, Evers C, Van Furth EF, Spinhoven P, Aardoom JJ, et al. Cost-effectiveness of internet interventions compared with treatment as usual for people with mental disorders: systematic review and meta-analysis of randomized controlled trials. J Med Internet Res. 2023;25:e38204. pmid:36602854
- 25. Villarreal-Zegarra D, Alarcon-Ruiz CA, Melendez-Torres GJ, Torres-Puente R, Navarro-Flores A, Cavero V, et al. Development of a framework for the implementation of synchronous digital mental health: realist synthesis of systematic reviews. JMIR Ment Health. 2022;9(3):e34760. pmid:35348469
- 26. Rojas-Mezarina L, Villarreal-Zegarra D. Ethical and regulatory gaps in using generative AI for mental health support in low- and middle-income countries. Interacciones. 2026;12: e497.
- 27. Guo D, Yang D, Zhang H, Song J, Wang P, Zhu Q, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature. 2025;645(8081):633–8. pmid:40962978
- 28. Dorfner FJ, Dada A, Busch F, Makowski MR, Han T, Truhn D, et al. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J Am Med Inform Assoc. 2025;32(6):1015–24. pmid:40190132
- 29. Kang B, Hong M. Development and evaluation of a mental health chatbot using ChatGPT 4.0: mixed methods user experience study with Korean users. JMIR Med Inform. 2025;13:e63538. pmid:39752663