
User experience and safety of generative AI-based mental health chatbots: Scoping review protocol

  • Lotenna Olisaeloka,

    Roles: Conceptualization, Methodology, Writing – original draft, Writing – review & editing

    lotenna@student.ubc.ca

    Affiliations: Department of Psychiatry, Faculty of Medicine, University of British Columbia, Vancouver, Canada; School of Population and Public Health, Faculty of Medicine, University of British Columbia, Vancouver, Canada

  • Chris Richardson,

    Roles: Writing – original draft, Writing – review & editing

    Affiliations: Department of Psychiatry, Faculty of Medicine, University of British Columbia, Vancouver, Canada; School of Population and Public Health, Faculty of Medicine, University of British Columbia, Vancouver, Canada

  • Daniel Vigo

    Roles: Supervision, Writing – review & editing

    Affiliations: Department of Psychiatry, Faculty of Medicine, University of British Columbia, Vancouver, Canada; School of Population and Public Health, Faculty of Medicine, University of British Columbia, Vancouver, Canada

Abstract

Introduction

Mental health problems constitute a significant global health challenge due to their rising prevalence and substantial treatment gap. Digital Mental Health Interventions (DMHIs), including mental health chatbots, have emerged as promising solutions due to their effectiveness and scalability. Recent advances in Generative Artificial Intelligence (GenAI) have improved the conversational abilities of these chatbots, further amplifying their potential. However, despite instances of inadvertent harm stemming from the unpredictable nature of GenAI, little attention has been paid to the user experience and safety of these chatbots.

Objective

This proposed review will explore existing research on GenAI-based mental health chatbots. Specifically, it aims to identify and describe current chatbots, focusing on user experience, safety and risk mitigation strategies.

Methods

The review will follow the Joanna Briggs Institute (JBI) guidelines for conducting scoping reviews. It will also adhere to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR). A systematic database search of Medline (PubMed), Scopus, PsycINFO, ACM Digital Library, and IEEE Xplore will be conducted. The database search will be complemented by research-based search engines (Google Scholar and Consensus). Studies focusing on the development, evaluation, or implementation of GenAI-based mental health chatbots will be included, without limitation to specific disorders or population groups. Two independent reviewers will perform screening and data extraction. The analysis will include descriptive summary and thematic analysis, with results presented in tabular, graphical, and narrative formats.

Conclusion

This review will provide a comprehensive overview of GenAI-based mental health chatbots while identifying innovative practices and knowledge gaps relating to user experience and safety. Findings will inform the ethical development, evaluation and implementation of GenAI-based mental health interventions.

Background

Mental health and substance use disorders affect over one billion people globally, contributing substantially to disability, premature mortality, and economic burden worldwide [1–3]. Despite the availability of effective therapeutic approaches, a significant treatment gap persists, with the majority of affected individuals remaining untreated [3,4]. This gap is driven by persistent barriers to care, including treatment costs, shortages of trained personnel, geographical inaccessibility, and stigma [3].

Digital Mental Health Interventions (DMHIs) have emerged as a promising strategy to address some of these challenges and expand access to care. DMHIs encompass a range of technology-based tools, such as online platforms, mobile apps, chatbots, and virtual reality (VR), designed to deliver mental health services and support [5,6]. Asynchronous and self-guided DMHIs, like chatbots, are especially promising due to their accessibility and scalability [5,7].

Conversational Agents (CAs), commonly called chatbots, are software applications that mimic human conversation through text or voice interactions. These agents have been used to deliver mental health interventions for a variety of conditions, including depression, anxiety, eating disorders, and substance use [8–13]. However, user engagement with these tools is often limited, largely attributable to their lack of personalization and dynamic interaction [14–16].

Personalization—the tailoring of an intervention to a user’s unique context—has been shown to enhance engagement, user experience, and effectiveness [12]. However, traditional mental health chatbots are primarily rule-based, relying on pre-programmed conversational flows that limit their flexibility and ability to personalize responses. While retrieval-based chatbots offer more adaptability, they still depend on pre-scripted responses, which restrict their ability to meet complex user needs [17,18]. In contrast, GenAI mental health chatbots powered by large language models (LLMs) can produce more interactive and contextually relevant responses. This allows for natural and tailored empathetic conversations, which emerging evidence suggests may improve engagement and therapeutic outcomes [19–21].

The emergence of LLMs marked a turning point in the development of sophisticated mental health chatbots capable of human-like support [22,23]. However, the same flexibility and sophistication that enhance personalization and user engagement also introduce novel risks and safety challenges. GenAI chatbots may generate misinformation, produce inappropriate or harmful responses, and exhibit algorithmic bias. Further, their “black box” nature makes them unpredictable and less reliable, especially in crisis situations [24–26]. These concerns have triggered broader debates about the ethical and safe deployment of GenAI in mental healthcare [27,28].

Despite increasing discourse on the ethical application of GenAI for mental health, there remains limited research on how to design and deploy GenAI-based mental health chatbots in effective and safe ways. Existing reviews in this area largely focus on traditional rule- and retrieval-based models [8,10–12]. Nevertheless, a recent meta-analysis highlighted the superior efficacy of AI-based mental health chatbots compared to traditional ones, owing to their ability to simulate empathetic conversations and personalize interactions [21]. Still, there remains a lack of systematic synthesis examining their characteristics, user experience, and safety profiles.

In this review, User Experience (UX) is conceptualized according to the International Organization for Standardization (ISO 9241-210), which defines UX as an individual’s “perceptions and responses resulting from the use and/or anticipated use of a product, system or service.” The ISO notes that UX “includes all the users’ emotions, beliefs, preferences, perceptions, physical and psychological responses, behaviours and accomplishments that occur before, during and after use” [29]. In the context of DMHIs, this encompasses measures of acceptability, usability, perceived impact, and engagement [9,30]. Emerging research highlights both the appeal and pitfalls of GenAI mental health chatbots: users appreciate the engaging, on-demand, non-judgmental support, but also express concerns about unreliable or potentially harmful content, as well as the risk of overdependence [31,32].

User safety is another critical consideration for DMHIs and refers to how digital tools minimize harm, uphold data protection and privacy, and promote psychological well-being throughout intervention design and delivery [33]. The World Health Organization, in its guidelines for digital interventions, calls for safety considerations including assessing benefits and harms, ensuring data privacy, and using evidence to guide implementation [34]. The safety of traditional mental health chatbots previously received limited attention, mainly because rule- and retrieval-based systems were perceived as low-risk [35]. In contrast, GenAI’s unpredictability poses challenges in ensuring reliable and safe responses, which could have serious consequences [36]. For instance, there have been reports of GenAI chatbots offering harmful advice [37], promoting substance use, and soliciting explicit content from minors [31]. In some cases, persistent interactions with GenAI chatbots have been linked to tragic outcomes, including suicide [38,39]. These incidents underscore the urgent need for robust safety protocols. The American College of Physicians has called for transparency, rigorous testing, and focused research to understand and mitigate AI-related risks in healthcare [40].

This scoping review addresses a critical research gap relating to the user experience and safety of GenAI-based mental health interventions. While existing reviews have examined broader applications of AI and large language models (LLMs), none have systematically mapped Generative AI-based chatbots or explored how user experience and safety are conceptualized and operationalized within this emerging domain [12,24,41]. The proposed review therefore fills this gap by focusing specifically on LLM-powered chatbots and by integrating both user-centered and safety-oriented perspectives. A preliminary search of MEDLINE, the Cochrane Database of Systematic Reviews, and JBI Evidence Synthesis revealed no published or registered scoping or systematic reviews on this specific topic as of August 2024, when this review was registered.

Review questions

The proposed review seeks to:

  1. Identify and describe Generative AI-based chatbots developed specifically to deliver mental health interventions.
  2. Assess how user experience (e.g., acceptability, usability, engagement) is reported in studies of these chatbot interventions.
  3. Examine the safety mechanisms and risk mitigation strategies integrated during the development and deployment of these chatbot interventions.

Methods

The proposed scoping review will follow guidelines outlined in the Joanna Briggs Institute (JBI) manual for scoping reviews [42] and adhere to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [43]. The review has been registered in Open Science Framework Registries (doi.org/10.17605/OSF.IO/HSNXA).

Eligibility criteria

The PCC (population, concept, context) framework recommended by JBI was used to develop the scope and eligibility criteria of the review to ensure a clear and effective search strategy [42]. Table 1 presents the inclusion and exclusion criteria.

Table 1. Inclusion and exclusion criteria for the scoping review.

https://doi.org/10.1371/journal.pone.0341631.t001

Search strategy

A preliminary search of MEDLINE was conducted to identify articles on the topic. Keywords and index (MeSH) terms identified from relevant articles were used to develop the full search strategy for MEDLINE (OVID) (S1 Appendix). This search strategy will be adapted to the other selected databases: Scopus, PsycINFO, ACM Digital Library, and IEEE Xplore. These databases were chosen to capture a broad range of sources from different disciplines related to the review objectives. PubMed was selected for its extensive coverage of publications in medicine and health sciences, while Scopus was chosen to include studies in relevant multidisciplinary areas such as science and technology, medicine, and social sciences. PsycINFO specializes in psychiatry- and psychology-related articles, making it essential for this review. The ACM Digital Library and IEEE Xplore were included because they index publications relating to applications of AI and natural language processing (NLP) in mental health. The database search will be complemented by research-based search engines (Google Scholar and Consensus) to capture other relevant grey literature. The reference lists of all included sources of evidence will also be screened for additional studies.
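For illustration, the sketch below shows the general shape of the Ovid MEDLINE syntax such a strategy combines: subject headings and free-text terms covering the mental health population, the chatbot concept, and the generative AI context. The specific terms shown are illustrative assumptions only; the full registered strategy is provided in S1 Appendix.

```
1. exp Mental Disorders/ or (mental health or depressi* or anxiety).ti,ab.
2. (chatbot* or conversational agent* or dialogue system*).ti,ab.
3. exp Artificial Intelligence/ or (generative or large language model* or GPT*).ti,ab.
4. 1 and 2 and 3
```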

Selection of evidence sources

Following the search, all identified citations will be imported into Covidence, with duplicates removed automatically [44]. After a pilot test, titles and abstracts will be screened by two independent reviewers against the eligibility criteria. Potentially relevant sources will be retrieved in full, and the full text of selected citations will be assessed in detail against the eligibility criteria by the same independent reviewers. At this stage, reasons for excluding sources of evidence will be recorded and reported in the scoping review. Any disagreements that arise between the reviewers at each stage of the selection process will be resolved through discussion or with an additional reviewer. The results of the search and the study inclusion process will be reported in full in the final scoping review and presented in a PRISMA-ScR flow diagram [42].

Data extraction

Data will be extracted by two independent reviewers using a data extraction tool developed by the reviewers. The extracted data will include specific details about the participants, concept, context, study methods, and key findings relevant to the review questions. Table 2 lists the key data items that will be extracted from included studies. These will be used to develop a draft extraction form in Covidence, which will be modified and revised as necessary during the data extraction process. As recommended by JBI, any disagreements that arise between the reviewers will be resolved through discussion or with an additional reviewer to achieve consensus [42]. Where required, authors of included papers will be contacted to request missing or additional data.
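As a minimal sketch of how such a form might be structured, the R skeleton below enumerates candidate extraction fields. The field names are assumptions derived from the review questions and the planned synthesis, not the final Table 2 instrument.

```r
# Illustrative data extraction record; field names are assumptions
# based on the review questions, not the final Table 2 form.
extraction_template <- data.frame(
  author_year      = character(),
  country          = character(),
  study_design     = character(),
  population       = character(),
  chatbot_name     = character(),
  underlying_llm   = character(),  # e.g., proprietary or open-weight model
  deployment       = character(),  # e.g., mobile app, web platform
  ux_measures      = character(),  # acceptability, usability, engagement
  safety_predeploy = character(),  # development-phase safeguards
  safety_delivery  = character(),  # deployment-phase risk mitigation
  key_findings     = character()
)
```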

Data analysis and presentation

Data will be analyzed and presented using descriptive statistics and narrative synthesis, focusing on the review objectives. Data visualizations, including summary tables, graphs, and figures, will be used to present findings concisely. Data extraction will be conducted within Covidence, which will also facilitate version control and audit trails. Extracted datasets will be exported to R (v4.2.3) for descriptive analysis and to NVivo (v14) for qualitative synthesis, ensuring transparent data management and reproducibility [44,45]. Quantitative data will be summarized using descriptive statistics, including means, standard deviations, and frequency distributions, where applicable. No inferential or meta-analytic procedures will be performed, consistent with the scoping-review design. Qualitative data (e.g., user feedback and narrative findings) will undergo inductive thematic analysis following Braun and Clarke’s six-phase framework [46]. Coding will be performed independently by at least two reviewers, who will iteratively compare and refine themes through reflexive discussion until consensus is reached. An audit trail of coding decisions will be maintained to enhance transparency and trustworthiness [42].
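The following is a minimal sketch of the planned descriptive analysis in R, using a small hypothetical extracted dataset; the variable names and values are invented for illustration.

```r
library(dplyr)

# Hypothetical extracted dataset (values are illustrative only)
studies <- data.frame(
  design      = c("RCT", "pilot", "qualitative", "pilot"),
  sample_size = c(120, 35, 18, 42)
)

# Frequency distribution of study designs
count(studies, design)

# Mean and standard deviation of sample sizes, where reported
summarise(studies,
          mean_n = mean(sample_size, na.rm = TRUE),
          sd_n   = sd(sample_size, na.rm = TRUE))
```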

Study characteristics (e.g., author, country, design, population, and context) will be summarized in structured tables and figures, accompanied by a narrative overview. To map the GenAI-based chatbot interventions, a dedicated table will outline the identified chatbots, their key features (e.g., deployment platform, interaction mode), and the mental health problems they target.

User experience measures will be categorized and presented according to common themes such as acceptability, usability, engagement, and personalization, while safety mechanisms will be grouped into pre-deployment and delivery-focused strategies. To enhance conceptual clarity, pre-deployment (development-phase) safety mechanisms will be analyzed separately from deployment-phase risk mitigation strategies. The former includes model-training safeguards, bias and accuracy testing, and data-protection measures applied before user interaction. The latter captures real-world implementation safeguards such as human-in-the-loop oversight, user-support features, crisis-response protocols, and reporting of adverse events. This distinction will guide both data extraction and thematic synthesis. Overall, narrative synthesis will integrate findings across themes, highlighting innovative practices, safety considerations, and implications for future research, policy, and practice. The findings will inform the ethical design, evaluation, and regulation of GenAI tools for mental health care, offering timely guidance for developers, researchers, and policymakers seeking to ensure human-centered and safe deployment.
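One way to operationalize this two-phase distinction during extraction and coding is a simple controlled vocabulary, sketched below in R. The category labels are drawn from the examples named above and are illustrative rather than a final coding frame.

```r
# Illustrative two-phase safety coding scheme; labels are assumptions
# drawn from the examples given in the protocol text.
safety_codes <- list(
  pre_deployment = c("model-training safeguards",
                     "bias and accuracy testing",
                     "data-protection measures"),
  deployment     = c("human-in-the-loop oversight",
                     "user-support features",
                     "crisis-response protocols",
                     "adverse-event reporting")
)
```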

Ethical considerations

This review involves analysis of publicly available literature and does not require ethical approval. Nonetheless, the broader ethical dimensions of GenAI mental health tools are recognized. Particular attention will be paid to how included studies address informed consent, data privacy, transparency, and the mitigation of potential psychological harm arising from chatbot use. These aspects will be highlighted in the synthesis to inform ethical best practices for future AI-enabled mental health interventions.

Limitations

We acknowledge a few anticipated limitations of this scoping review. First, the exclusion of non-English-language publications may introduce language bias and limit the global comprehensiveness of the findings. Second, expected variability across included studies in study design, reporting quality, and the operationalization of key constructs such as user experience and safety may constrain the ability to directly compare results or synthesize findings quantitatively. Finally, given the rapidly evolving field of generative AI, relevant interventions may exist outside the academic literature, such as unpublished, proprietary, or inadequately described tools. This may affect the completeness of our review and highlights the need for ongoing updates as new evidence emerges. Despite these limitations, the review’s findings are expected to support the development of evidence-informed frameworks for the responsible and equitable integration of generative AI in mental health interventions.

Supporting information

S1 Appendix. Initial Search for OVID MEDLINE (3/07/2024).

https://doi.org/10.1371/journal.pone.0341631.s001

(DOCX)

S2 Appendix. PRISMA-P (Preferred Reporting Items for Systematic review and Meta-Analysis Protocols) 2015 checklist*.

https://doi.org/10.1371/journal.pone.0341631.s002

(PDF)

References

  1. Dattani S, Rodés-Guirao L, Ritchie H, Roser M. Mental Health. 2023. https://ourworldindata.org/mental-health
  2. Walker ER, McGee RE, Druss BG. Mortality in mental disorders and global disease burden implications: a systematic review and meta-analysis. JAMA Psychiatry. 2015;72(4):334–41. pmid:25671328
  3. WHO. World mental health report: Transforming mental health for all. World Health Organization. https://www.who.int/publications-detail-redirect/9789240049338
  4. Mental health atlas 2017. Geneva: World Health Organization. https://www.who.int/publications-detail-redirect/9789241514019
  5. Kuhn E, Saleem M, Klein T, Köhler C, Fuhr DC, Lahutina S, et al. Interdisciplinary perspectives on digital technologies for global mental health. PLOS Glob Public Health. 2024;4(2):e0002867. pmid:38315676
  6. Schueller SM, Torous J. Scaling evidence-based treatments through digital mental health. Am Psychol. 2020;75(8):1093–104. pmid:33252947
  7. Naslund JA, Aschbrenner KA, Araya R, Marsch LA, Unützer J, Patel V, et al. Digital technology for treating and preventing mental disorders in low-income and middle-income countries: a narrative review of the literature. Lancet Psychiatry. 2017;4(6):486–500. pmid:28433615
  8. Vaidyam AN, Wisniewski H, Halamka JD, Kashavan MS, Torous JB. Chatbots and Conversational Agents in Mental Health: A Review of the Psychiatric Landscape. Can J Psychiatry. 2019;64(7):456–64. pmid:30897957
  9. Jabir AI, Martinengo L, Lin X, Torous J, Subramaniam M, Tudor Car L. Evaluating conversational agents for mental health: scoping review of outcomes and outcome measurement instruments. J Med Internet Res. 2023;25:e44548. pmid:37074762
  10. Lim SM, Shiau CWC, Cheng LJ, Lau Y. Chatbot-delivered psychotherapy for adults with depressive and anxiety symptoms: a systematic review and meta-regression. Behav Ther. 2022;53(2):334–47. pmid:35227408
  11. Abd-Alrazaq AA, Rababeh A, Alajlani M, Bewick BM, Househ M. Effectiveness and safety of using chatbots to improve mental health: systematic review and meta-analysis. J Med Internet Res. 2020;22(7):e16021. pmid:32673216
  12. He Y, Yang L, Qian C, Li T, Su Z, Zhang Q, et al. Conversational Agent Interventions for Mental Health Problems: Systematic Review and Meta-analysis of Randomized Controlled Trials. J Med Internet Res. 2023;25:e43862. pmid:37115595
  13. Bendotti H, Lawler S, Chan GCK, Gartner C, Ireland D, Marshall HM. Conversational artificial intelligence interventions to support smoking cessation: A systematic review and meta-analysis. Digit Health. 2023;9:20552076231211634. pmid:37928336
  14. Borghouts J, Eikey E, Mark G, De Leon C, Schueller SM, Schneider M, et al. Barriers to and Facilitators of User Engagement With Digital Mental Health Interventions: Systematic Review. J Med Internet Res. 2021;23(3):e24387. pmid:33759801
  15. Opie JE, Vuong A, Welsh ET, Esler TB, Khan UR, Khalil H. Outcomes of best-practice guided digital mental health interventions for youth and young adults with emerging symptoms: part II. A systematic review of user experience outcomes. Clin Child Fam Psychol Rev. 2024;27(2):476–508. pmid:38634939
  16. Liverpool S, Mota CP, Sales CMD, Čuš A, Carletto S, Hancheva C, et al. Engaging children and young people in digital mental health interventions: systematic review of modes of delivery, facilitators, and barriers. J Med Internet Res. 2020;22(6):e16317. pmid:32442160
  17. Hornstein S, Zantvoort K, Lueken U, Funk B, Hilbert K. Personalization strategies in digital mental health interventions: a systematic review and conceptual framework for depressive symptoms. Front Digit Health. 2023;5:1170002. pmid:37283721
  18. Abd-Alrazaq AA, Alajlani M, Ali N, Denecke K, Bewick BM, Househ M. Perceptions and opinions of patients about mental health chatbots: scoping review. J Med Internet Res. 2021;23(1):e17828. pmid:33439133
  19. Darcy A, Daniels J, Salinger D, Wicks P, Robinson A. Evidence of human-level bonds established with a digital conversational agent: cross-sectional, retrospective observational study. JMIR Form Res. 2021;5(5):e27868. pmid:33973854
  20. Beatty C, Malik T, Meheli S, Sinha C. Evaluating the therapeutic alliance with a free-text CBT conversational agent (Wysa): a mixed-methods study. Front Digit Health. 2022;4:847991. pmid:35480848
  21. Li H, Zhang R, Lee Y-C, Kraut RE, Mohr DC. Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. npj Digit Med. 2023;6(1).
  22. Miner AS, Shah N, Bullock KD, Arnow BA, Bailenson J, Hancock J. Key Considerations for Incorporating Conversational AI in Psychotherapy. Front Psychiatry. 2019;10:746. pmid:31681047
  23. GPT-3 powers the next generation of apps. https://openai.com/blog/gpt-3-apps. Accessed 2023 October 30.
  24. Balcombe L. AI chatbots in digital mental health. Informatics. 2023;10(4):82.
  25. De Choudhury M, Pendse SR, Kumar N. Benefits and harms of large language models in digital mental health. 2023. https://doi.org/10.48550/arXiv.2311.14693
  26. Akinrinmade AO, Adebile TM, Ezuma-Ebong C, Bolaji K, Ajufo A, Adigun AO, et al. Artificial Intelligence in Healthcare: Perception and Reality. Cureus. 2023;15(9):e45594. pmid:37868407
  27. Denecke K, Gabarron E. The ethical aspects of integrating sentiment and emotion analysis in chatbots for depression intervention. Front Psychiatry. 2024;15:1462083. pmid:39611131
  28. Mörch C-M, Gupta A, Mishara BL. Canada protocol: An ethical checklist for the use of artificial intelligence in suicide prevention and mental health. Artif Intell Med. 2020;108:101934. pmid:32972663
  29. International Organization for Standardization. ISO 9241-210:2010(en), Ergonomics of human-system interaction — Part 210: Human-centred design for interactive systems. https://www.iso.org/obp/ui/#iso:std:iso:9241:-210:ed-1:v1:en
  30. Obikane E, Sasaki N, Imamura K, Nozawa K, Vedanthan R, Cuijpers P, et al. Usefulness of implementation outcome scales for digital mental health (iOSDMH): experiences from six randomized controlled trials. Int J Environ Res Public Health. 2022;19(23):15792. pmid:36497867
  31. Ma Z, Mei Y, Su Z. Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support. AMIA Annu Symp Proc. 2024;2023:1105–14. pmid:38222348
  32. Siddals S, Torous J, Coxon A. “It happened to be the perfect thing”: experiences of generative AI chatbots for mental health. Npj Ment Health Res. 2024;3(1):48. pmid:39465310
  33. Taher R, Hsu C-W, Hampshire C, Fialho C, Heaysman C, Stahl D, et al. The Safety of Digital Mental Health Interventions: Systematic Review and Recommendations. JMIR Ment Health. 2023;10:e47433. pmid:37812471
  34. WHO. Recommendations on digital interventions for health system strengthening. https://www.who.int/publications/i/item/9789241550505
  35. Laranjo L, Dunn AG, Tong HL, Kocaballi AB, Chen J, Bashir R, et al. Conversational agents in healthcare: a systematic review. J Am Med Inform Assoc. 2018;25(9):1248–58. pmid:30010941
  36. De Freitas J, Uğuralp AK, Oğuz-Uğuralp Z, Puntoni S. Chatbots and mental health: Insights into the safety of generative AI. J Consumer Psychol. 2023;jcpy.1393.
  37. Jargon J. How a Chatbot Went Rogue. Wall Street Journal. 2023. https://www.wsj.com/articles/how-a-chatbot-went-rogue-431ff9f9
  38. Xiang C. ‘He would still be here’: Man dies by suicide after talking with AI chatbot, widow says. Vice. 2023. https://www.vice.com/en/article/pkadgm/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says
  39. Montgomery B. Mother says AI chatbot led her son to kill himself in lawsuit against its maker. The Guardian. 2024. https://www.theguardian.com/technology/2024/oct/23/character-ai-chatbot-sewell-setzer-death
  40. Daneshvar N, Pandita D, Erickson S, Snyder Sulmasy L, DeCamp M, ACP Medical Informatics Committee and the Ethics, Professionalism and Human Rights Committee. Artificial intelligence in the provision of health care: an American College of Physicians policy position paper. Ann Intern Med. 2024;177(7):964–7. pmid:38830215
  41. Guo Z, Lai A, Thygesen JH, Farrington J, Keen T, Li K. Large Language Models for Mental Health Applications: Systematic Review. JMIR Ment Health. 2024;11:e57400. pmid:39423368
  42. Aromataris E, Lockwood C, Porritt K, Pilla B, Jordan Z. JBI Manual for Evidence Synthesis. JBI Global Wiki. https://jbi-global-wiki.refined.site/space/MANUAL. Accessed 2024 June 30.
  43. Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018;169(7):467–73. pmid:30178033
  44. Covidence. Covidence - Better systematic review management. https://www.covidence.org/. Accessed 2024 July 1.
  45. NVivo: Leading Qualitative Data Analysis Software. https://lumivero.com/products/nvivo/. Accessed 2025 April 19.
  46. Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol. 2006;3(2):77–101.