
Artificial intelligence as a predictive tool for mental health status: Insights from a systematic review and meta-analysis

  • Arsalan Humayun,

    Roles Conceptualization, Data curation, Formal analysis

Affiliations Community Dentistry Department BADC, SMBBMU, Larkana, Pakistan; School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

  • Ashwini M. Madawana,

    Roles Conceptualization, Data curation, Formal analysis

    Affiliation Paediatric Dentistry Unit, School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

  • Akram Hassan,

    Roles Validation, Writing – review & editing

    Affiliation Periodontology Unit, School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

  • Al Mahmud,

    Roles Validation, Writing – review & editing

Affiliations Department of Statistics, Shahjalal University of Science & Technology, Sylhet, Bangladesh; School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

  • Noorshaida Kamaruddin,

    Roles Validation, Writing – review & editing

Affiliation School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

  • Syed Husni Noor,

    Roles Conceptualization, Data curation, Formal analysis

    Affiliation Community Dentistry Department BADC, SMBBMU, Larkana, Pakistan

  • Syed Hatim Noor,

    Roles Conceptualization, Data curation, Formal analysis

Affiliation School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

  • Mohamad Arif Awang Nawi

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition

    mohamadarif@usm.my

Affiliation Biostatistics Unit, School of Dental Sciences, Universiti Sains Malaysia, Health Campus, Kubang Kerian, Kelantan, Malaysia

Abstract

This systematic review and meta-analysis evaluates the effectiveness of AI-driven tools, particularly conversational agents (CAs), in alleviating psychological distress and improving mental health outcomes. The focus is on their impact across diverse populations, including clinical, subclinical, and older adults. A comprehensive search was conducted in PubMed, Google Scholar, Elsevier, and Scopus using specific MeSH terms and keywords such as “Artificial Intelligence,” “Machine Learning,” “Natural Language Processing,” “Depression,” and “Anxiety.” The timeframe included studies published between January 2000 and July 2024. Inclusion criteria comprised peer-reviewed original research articles, cohort studies, and case reports focusing on AI tools for mental health. Systematic reviews, secondary sources, and non-English publications were excluded. Random-effects meta-analysis was conducted using standardized mean differences, with effect sizes synthesized in forest plots. Twenty studies were included in the qualitative synthesis and six in the quantitative meta-analysis. The analysis demonstrated that AI-based CAs significantly reduce anxiety (Cohen’s d = 0.62, p < 0.01) and depression (Cohen’s d = 0.74, p < 0.001), with higher effectiveness observed in multimodal CAs than in text-only systems. However, the long-term impact remains inconsistent due to variability in follow-up durations and methodological heterogeneity. Some studies lacked extended observation periods or reported diminished effects over time, highlighting a need for sustained intervention research. AI-based CAs, especially when integrated into mobile platforms and using multimodal interfaces, provide scalable and engaging support for mental health. While short-term benefits are evident, future studies should address long-term efficacy, methodological consistency, and ethical concerns such as privacy and algorithmic bias to strengthen the utility of, and trust in, AI interventions for mental health.

Introduction

Artificial Intelligence (AI) has rapidly emerged as a transformative force across various sectors, including health care, where it holds significant promise for enhancing diagnosis, treatment, and service delivery. In mental health, where millions are affected by conditions such as depression and anxiety, traditional methods like clinical interviews, questionnaires, and self-reports remain subjective, time-consuming, and resource-intensive [1]. This has created an urgent demand for more objective, scalable, and accessible tools—an area where AI has begun to demonstrate meaningful impact [2].

A prominent example of AI application in this domain is the use of Conversational Agents (CAs). These are digital programs that interact with users via text or voice and are designed to deliver therapeutic interventions or psychoeducation. Traditional rule-based CAs use pre-scripted decision trees, which limits their adaptability and contextual understanding [3]. As a result, their ability to evolve with a user’s mental state or deliver personalized interventions is constrained [4]. In contrast, AI-driven CAs—powered by machine learning (ML), natural language processing (NLP), and deep learning—can dynamically interpret complex user inputs and generate context-sensitive responses in real time [5]. This distinction is particularly important in mental health care, where personalization and emotional responsiveness are critical to building therapeutic rapport.

These advanced CAs can analyze diverse data sources, such as speech patterns, text inputs, and behavioral signals, to detect early signs of psychological distress and suggest interventions [6]. For instance, platforms like Woebot, Tess, and Wysa have been used to deliver structured cognitive behavioral therapy (CBT) programs via chat interfaces, showing preliminary success in reducing symptoms of anxiety and depression [15]. Additionally, AI algorithms have been employed to assess digital footprints—including social media activity and voice biomarkers—to predict the onset of mental health conditions with considerable accuracy [7].

Despite their promise, AI-based CAs present critical ethical and practical challenges. These systems often require access to highly sensitive personal data, raising concerns about data privacy, consent, and ownership [8–10]. Ensuring secure data handling is particularly vital in mental health, where users may be especially vulnerable. Algorithmic bias also poses a significant risk: if the training data are not representative of diverse populations, AI outputs may perpetuate inequality and produce inaccurate assessments [11,12]. Furthermore, safety concerns arise from the use of generative AI models, which, if not properly regulated, may deliver inappropriate or unsafe responses to individuals in crisis [13,14]. Unlike rule-based systems, which are easier to validate, generative models require ongoing monitoring and rigorous testing to ensure they provide safe and evidence-based care.

While early studies indicate that AI tools can reduce psychological distress and enhance user engagement, their long-term effectiveness and integration with human-led care remain underexplored. There is a growing need for longitudinal studies that assess not only short-term symptom reduction but also sustained mental well-being and therapeutic alliance over time. Balanced reporting that highlights both the potential and the limitations of AI in mental health is crucial for developing ethical, effective, and user-centered technologies.

In sum, AI-based CAs offer an innovative, accessible, and potentially transformative means of supporting mental health care. However, their deployment must be guided by careful evaluation, robust ethical frameworks, and a recognition of their limitations. As AI continues to evolve, ongoing research and policy efforts will be essential in ensuring that these tools are safe, effective, and equitable for all users.

Materials and methods

Data searching strategy

A systematic and comprehensive search was conducted across four major electronic databases: PubMed, Google Scholar, Scopus, and Elsevier, in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. The search used a combination of Medical Subject Headings (MeSH) and keywords related to artificial intelligence and mental health, including: “Artificial Intelligence,” “Machine Learning,” “Deep Learning,” “Natural Language Processing,” “Psychiatric Disorder,” “Depression,” “Anxiety,” and “Mental Health Support.” Boolean operators (AND/OR) were employed to refine the search. The date range was restricted to articles published between January 2000 and July 2024. Only peer-reviewed studies published in English were included. Additionally, reference lists of included studies and relevant reviews were manually scanned to capture any potentially missed publications. The keywords and MeSH phrases used are summarized in Table 1.
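
To illustrate the structure of such a search, the following is a hypothetical PubMed-style query combining the MeSH terms and keywords above with Boolean operators and the date restriction; it is an example of the approach described, not the authors' exact search string.

    ("Artificial Intelligence"[MeSH] OR "Machine Learning" OR "Deep Learning" OR "Natural Language Processing")
    AND ("Depression"[MeSH] OR "Anxiety" OR "Psychiatric Disorder" OR "Mental Health Support")
    AND ("2000/01/01"[dp] : "2024/07/31"[dp])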

Table 1. Keywords and MeSH phrases utilized in the search strategy.

https://doi.org/10.1371/journal.pone.0332207.t001

Studies selection

Studies were eligible for inclusion if they (i) were original research articles, cohort studies, or case reports; (ii) focused on AI-driven tools used for assessing or improving mental health outcomes; and (iii) were published between 2000 and 2024. Studies were excluded if they (i) did not directly assess AI tools for mental health applications; (ii) were systematic reviews, meta-analyses, or other secondary sources (excluded to avoid duplication and focus on original data); (iii) lacked full-text availability; or (iv) were not published in English.

Data extraction and synthesis

This review adhered to PRISMA guidelines and employed a systematic process for data extraction and synthesis. Two independent reviewers screened studies at the title, abstract, and full-text levels based on predefined inclusion and exclusion criteria. Discrepancies in study selection or data interpretation were resolved through consensus or, when necessary, adjudicated by a third reviewer. No automation tools were used for the review process; instead, Microsoft Excel was used to manage references, record decisions, and organize extracted data.

For each included study, the following variables were extracted: (i) study identifiers (author, year, journal, country); (ii) study design and sample size; (iii) population characteristics (e.g., age, clinical status); (iv) type of AI-driven intervention (e.g., chatbot, predictive model); (v) outcome domains (e.g., depression, anxiety, psychological distress); (vi) outcome measurement tools (e.g., PHQ-9, GAD-7, PANAS); (vii) effect sizes and confidence intervals; (viii) intervention duration and follow-up periods; and (ix) key findings related to mental health outcomes. This structured approach ensured consistent and transparent data synthesis.
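
To make this template concrete, the short Python sketch below writes a header row covering items (i)–(ix); the field names are illustrative assumptions, not the authors' actual extraction form.

    import csv

    # Illustrative per-study extraction fields, mirroring items (i)-(ix) above.
    FIELDS = [
        "author", "year", "journal", "country",    # (i) study identifiers
        "design", "sample_size",                   # (ii) design and sample size
        "population",                              # (iii) e.g., age, clinical status
        "intervention_type",                       # (iv) e.g., chatbot, predictive model
        "outcome_domain",                          # (v) e.g., depression, anxiety
        "measure",                                 # (vi) e.g., PHQ-9, GAD-7, PANAS
        "effect_size_d", "ci_lower", "ci_upper",   # (vii) effect size and 95% CI
        "duration_weeks", "followup_weeks",        # (viii) duration and follow-up
        "key_findings",                            # (ix) mental health outcomes
    ]

    with open("extraction_sheet.csv", "w", newline="") as f:
        csv.DictWriter(f, fieldnames=FIELDS).writeheader()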

Secondary sources such as systematic reviews and meta-analyses were excluded to avoid data duplication and maintain a focus on primary empirical research. Although these sources can offer valuable synthesized insights, they were excluded to ensure that only original data directly contributing to outcome estimates were analyzed.

Certainty of evidence

The certainty of evidence for each primary outcome was evaluated using the GRADE (Grading of Recommendations Assessment, Development, and Evaluation) framework. GRADE considers five domains: (i) risk of bias, (ii) inconsistency of results, (iii) indirectness of evidence, (iv) imprecision, and (v) publication bias. Each outcome was rated as having high, moderate, low, or very low certainty. Two reviewers independently conducted GRADE assessments, resolving disagreements through discussion to achieve consensus. Each rating was justified and incorporated into the interpretation of results.

  i. Depression and anxiety reduction: The evidence was rated as moderate certainty. Although short-term reductions were consistent across studies, diversity in study design and limited long-term follow-up weakened the overall confidence.
  ii. User satisfaction: The evidence for user satisfaction, particularly with multimodal AI-driven chatbots, was assessed as high certainty, with studies consistently reporting positive engagement and therapeutic experience.
  iii. Comparative effectiveness of chatbot vs. bibliotherapy: The evidence was rated moderate certainty. Chatbots often outperformed bibliotherapy; however, methodological inconsistencies and limited blinding introduced bias.
  iv. Woebot efficacy in substance use reduction (during COVID-19): Evidence was deemed moderate certainty, reflecting promising results but also noting limitations in study design and sample size.
  v. Therapeutic alliance (especially in student populations): Evidence was assessed as high certainty, with robust support for AI-facilitated therapeutic relationships enhancing mental health outcomes.

Risk assessment

To evaluate the methodological rigor and potential bias in the included studies, the Critical Appraisal Skills Programme (CASP) checklist was employed. The CASP tool is a widely accepted framework used to assess the internal validity, reliability, and applicability of primary research studies across health and social care. This evaluation was carried out independently by two reviewers. Discrepancies in the appraisal results were resolved through discussion and, if necessary, a third reviewer’s input.

Each study was assessed using the 11-item CASP checklist, covering key domains: (i) clarity of the research question; (ii) appropriateness of the study design; (iii) recruitment strategy; (iv) randomization procedures; (v) blinding of participants and outcome assessors; (vi) baseline comparability of groups; (vii) completeness of outcome data; (viii) reliability of outcome measurements; (ix) statistical validity of results; (x) consideration of confounders; and (xi) generalizability of findings. Responses for each item were recorded as “Yes,” “No,” or “Can’t tell” based on how clearly the methodology was reported.
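
As an illustration of how such appraisals can be recorded and tallied, the minimal Python sketch below uses abbreviated keys for items (i)–(xi); the sample responses are hypothetical except for the lack of participant blinding in Liu et al., which is reported in the Results.

    from collections import Counter

    # CASP items (i)-(xi), abbreviated; responses are "Yes", "No", or "Can't tell".
    casp_liu_2022 = {
        "clear_question": "Yes",
        "appropriate_design": "Yes",
        "recruitment": "Yes",
        "randomization": "Yes",
        "blinding": "No",  # lack of participant blinding reported for Liu et al.
        "baseline_comparability": "Yes",
        "complete_outcome_data": "Yes",
        "reliable_measurement": "Yes",
        "valid_statistics": "Yes",
        "confounders_considered": "Yes",
        "generalizability": "Yes",
    }

    print(Counter(casp_liu_2022.values()))  # Counter({'Yes': 10, 'No': 1})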

Bias assessments focused particularly on randomization methods, blinding procedures, handling of missing data (attrition bias), and completeness of outcome reporting. One study lacked participant blinding, which raised concerns about performance bias. Missing data were reviewed to determine whether appropriate imputation methods were applied or if cases were excluded, potentially introducing bias. Reporting bias was evaluated by checking for transparency in presenting effect estimates, p-values, and confidence intervals. Inconsistent or missing precision estimates were flagged.

The CASP findings informed the overall interpretation of results by identifying studies with methodological limitations and contributed directly to GRADE assessments of certainty of evidence. This dual-layered appraisal ensured that both the internal validity and practical applicability of the findings were systematically considered in the final analysis.

Statistical analysis

Meta-analysis was employed as the primary method for quantitatively synthesizing the results of the selected studies. This approach was chosen to estimate a pooled effect size for the impact of AI-based interventions on mental health outcomes, which is particularly suitable given the availability of comparable quantitative data across the included studies. A total of six eligible studies involving 901 participants were included in the meta-analysis. Effect sizes (Cohen’s d) were calculated based on pre- and post-intervention means and standard deviations, or extracted directly from reported results when available. Where necessary, conversion formulas were applied to standardize effect estimates. The meta-analysis was conducted using Review Manager (RevMan) 5.4, a software tool developed by The Cochrane Collaboration. Forest plots were generated to visually represent the individual and pooled effect sizes. Each study was displayed as a box proportional to the inverse of its variance, and horizontal lines indicated 95% confidence intervals.
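
As a concrete illustration of this effect size computation, the minimal Python sketch below derives Cohen's d from pre- and post-intervention summary statistics. It assumes the pooled pre/post SD as the standardizer and ignores the pre/post correlation in the variance approximation (both common simplifying conventions, since the exact formulas used in the review are not stated), and the example numbers are hypothetical.

    import math

    def cohens_d_pre_post(mean_pre, sd_pre, mean_post, sd_post):
        """Standardized mean difference from pre/post summary statistics,
        using the pooled pre/post SD as the standardizer (assumed convention)."""
        sd_pooled = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2.0)
        return (mean_pre - mean_post) / sd_pooled

    def d_variance(d, n):
        """Approximate sampling variance of d for a single group of size n,
        ignoring the pre/post correlation (a simplifying assumption)."""
        return 1.0 / n + d ** 2 / (2.0 * n)

    # Hypothetical example: PHQ-9 falls from 14.2 (SD 4.1) to 10.9 (SD 4.4), n = 50.
    d = cohens_d_pre_post(14.2, 4.1, 10.9, 4.4)  # roughly 0.78
    print(d, d_variance(d, 50))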

Heterogeneity among studies was assessed using the I² statistic and Cochran's Q test. I² values of 30–60% indicated moderate heterogeneity, and values >60% were considered substantial. A random-effects model was applied to account for anticipated clinical and methodological diversity among studies. Sensitivity analyses were planned in which studies rated as high risk of bias on the CASP evaluation would be systematically removed and assumptions regarding missing data altered (e.g., excluding studies with >20% attrition or imputed values without clear methodology), although, as noted in the Results, the small number of included studies ultimately precluded them. Although publication bias is a known concern in systematic reviews, formal assessment using funnel plots was not conducted due to the limited number of included studies (n < 10), which reduces the reliability of such diagnostics. However, steps were taken to minimize potential bias by conducting comprehensive database searches and manual reference checks. The choice of meta-analysis over narrative synthesis or qualitative review was justified by the goal of quantifying treatment effects across interventions with similar outcome constructs. This allowed for a more rigorous evaluation of the effectiveness of AI-based mental health tools and improved comparability across studies.
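
To make the pooling procedure concrete, the sketch below implements DerSimonian-Laird random-effects pooling (the estimator RevMan 5 uses for its random-effects models) together with Cochran's Q and the I² statistic. It is an illustration of the method described, not the review's actual analysis script, and the input effect sizes and variances are hypothetical.

    def random_effects_pool(effects, variances):
        """DerSimonian-Laird random-effects pooling with Cochran's Q and I^2."""
        k = len(effects)
        w = [1.0 / v for v in variances]  # inverse-variance weights
        fixed = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
        # Cochran's Q and I^2 quantify between-study heterogeneity.
        q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))
        i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
        # Between-study variance tau^2 (DerSimonian-Laird estimator).
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
        # Re-weight by total (within- plus between-study) variance and pool.
        w_star = [1.0 / (v + tau2) for v in variances]
        pooled = sum(wi * di for wi, di in zip(w_star, effects)) / sum(w_star)
        se = (1.0 / sum(w_star)) ** 0.5
        return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), q, i2

    # Hypothetical per-study Cohen's d values and variances for six studies:
    print(random_effects_pool([0.62, 0.74, 0.55, 0.80, 0.48, 0.70],
                              [0.04, 0.05, 0.03, 0.06, 0.05, 0.04]))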

Results

Search results

A total of 78 records were initially identified through database searches: 31 from PubMed, 21 from Elsevier, and 26 from Google Scholar. After removing duplicates, 54 records remained: 19 from PubMed, 17 from Elsevier, and 18 from Google Scholar. These records were screened by evaluating titles and abstracts for relevance to the review objectives. Based on this screening, 2 studies from PubMed, 4 from Elsevier, and 2 from Google Scholar were excluded due to irrelevance or insufficient information, resulting in 12 PubMed, 13 Elsevier, and 16 Google Scholar records proceeding to the full-text eligibility assessment.

During the eligibility phase, full-text articles were reviewed for alignment with the inclusion criteria. Articles were excluded if they contained missing or incomplete data, lacked AI-specific interventions, or failed to report on mental health outcomes. This led to the exclusion of 3 full-texts from PubMed, 2 from Elsevier, and 6 from Google Scholar. As a result, 20 studies were included in the synthesis: 7 from PubMed, 5 from Elsevier, and 8 from Google Scholar. Among these, 6 studies (3 from PubMed and 3 from Google Scholar) met the criteria for inclusion in the quantitative meta-analysis. The full selection process is illustrated in the PRISMA flow diagram (Fig 1).

Risk of bias assessment

Risk of bias was assessed using the Critical Appraisal Skills Programme (CASP) checklist, applied specifically to randomized controlled trials included in the synthesis. Two reviewers independently evaluated each study across eleven CASP domains spanning four core areas: study design validity, methodological rigor, reliability of reported results, and applicability to practice. Discrepancies in scoring were resolved through discussion to reach consensus. Table 2 presents a detailed breakdown of the CASP assessment for each study, including Prochaska et al., Klos et al., Ogawa et al., Romanovskyi et al., Drouin et al., and Liu et al. All studies clearly addressed focused research questions, used appropriate randomization techniques, and accounted for all enrolled participants by study conclusion. Most studies ensured methodological soundness by implementing blinding at multiple levels, maintaining baseline group similarities, and offering consistent care across groups. However, the Liu et al. study did not blind participants, raising concerns about performance bias and the potential for expectancy effects to influence participant behavior or reported outcomes.

While most studies reported outcomes comprehensively, the Drouin et al. study did not clearly state effect size estimates or confidence intervals, which could hinder the interpretability and reproducibility of its findings. Despite this, all studies concluded that the interventions’ benefits outweighed potential harms or costs. In terms of external validity, all trials were deemed applicable to their local populations, with findings generalizable to broader contexts. Based on the CASP results, five studies were rated as “Good” in overall quality, while Liu et al. received a “Fair” rating due to its lack of participant blinding. This highlights the importance of full methodological transparency and safeguards against bias to strengthen the credibility of findings.

Effectiveness of AI-based chatbot therapy and comparison to bibliotherapy

The reviewed studies consistently demonstrate the growing potential of AI-based chatbot therapy in improving mental health outcomes, including reductions in depression, anxiety, negative affect, and substance use. Chatbots such as Woebot, Tess, and Elomia were found to be particularly effective when compared to traditional therapeutic methods, including bibliotherapy.

A randomized controlled trial involving 83 university students compared the effectiveness of chatbot-based therapy to bibliotherapy—a structured, self-guided therapeutic approach using written materials. Participants were randomly assigned to either the chatbot group or the bibliotherapy group and underwent a four-week intervention. Depression and anxiety symptoms were assessed using standardized tools: the Generalized Anxiety Disorder scale (GAD-7), Patient Health Questionnaire (PHQ-9), and the Positive and Negative Affect Schedule (PANAS). The chatbot group exhibited statistically significant improvements in both depression (p < 0.05) and anxiety (p < 0.01) scores compared to the bibliotherapy group. Effect sizes ranged from moderate to large (Cohen’s d = 0.6–0.8), indicating clinically meaningful changes (see Fig 2).

Additionally, therapeutic alliance was evaluated using the Working Alliance Inventory (WAI), a validated measure of the emotional and collaborative bond between user and intervention. The chatbot group reported significantly higher WAI scores than the bibliotherapy group (p < 0.05), suggesting a stronger perceived connection and engagement with the AI-based intervention. This finding supports the idea that interactional dynamics unique to chatbot communication may contribute to superior psychological outcomes relative to static, content-driven approaches like bibliotherapy.

Prochaska et al. [16] also highlighted the effectiveness of Woebot-SUDs, a specialized digital therapeutic aimed at reducing substance use during the COVID-19 pandemic. In a large randomized trial involving 734 U.S. adults, Woebot-SUDs led to a significant decrease in substance use occasions (p < 0.001) from baseline to post-intervention. These reductions were strongly correlated with improvements in self-efficacy and decreases in cravings, substance-related problems, and comorbid anxiety and depression symptoms. The intervention also showed high user satisfaction and was associated with reduced COVID-19–related psychological distress.

While these results are promising, limitations must be acknowledged. Some studies, like the one with 83 students, involved small sample sizes, which can reduce statistical power and generalizability. Furthermore, most of the reviewed research focused on specific populations (university students and adults in the U.S.), limiting the applicability of findings to broader demographic groups, including older adults, individuals with severe mental illness, and those from non-Western cultural backgrounds. Future studies should aim to include more diverse samples and assess long-term efficacy to better evaluate the scalability and inclusiveness of AI-based mental health interventions.

The I² statistic and Cochran's Q test were used to assess the degree of heterogeneity across studies. I² values between 30% and 60% indicated moderate heterogeneity, values above 60% suggested substantial heterogeneity, and values below 30% were interpreted as low. While heterogeneity levels were generally moderate across the included studies, the small number of studies (fewer than ten) and variation in study designs, populations, and intervention types (such as differences between platforms like Woebot, Tess, and Elomia) limited the ability to perform meaningful subgroup analysis. These differences, if examined in more detail, might help clarify whether specific features of chatbot interventions (such as interactivity or personalization) or participant characteristics (such as being university students versus members of the general adult population) contributed to outcome variability. However, the available dataset did not contain enough detailed information to allow for such analyses. Similarly, sensitivity analyses were not conducted because the number of studies was insufficient and the included studies shared similar methodological features, making it difficult to assess the robustness of the synthesized findings.

Although formal evaluation of publication bias was not feasible because tests such as funnel plot asymmetry require at least ten studies, the possibility of publication bias should still be considered. The absence of unpublished or grey literature in the dataset raises the likelihood that the findings may reflect a bias toward positive outcomes. This potential bias represents an important limitation and should be taken into account when interpreting the overall effectiveness of AI-based chatbot interventions. Future reviews that include a larger number of studies will be better positioned to explore heterogeneity and publication bias more comprehensively (Table 3).

Table 3. Characteristics of studies included in the systematic review.

https://doi.org/10.1371/journal.pone.0332207.t003

Certainty of evidence

The certainty of evidence for the three primary outcomes (reduction in anxiety, reduction in depression, and improvement in psychological well-being) was assessed using the GRADE (Grading of Recommendations Assessment, Development, and Evaluation) methodology. This framework evaluates five domains: risk of bias, inconsistency, indirectness, imprecision, and publication bias. A summary of the findings is presented in Table 4.

Table 4. Summary of certainty of evidence for key outcomes using GRADE methodology.

https://doi.org/10.1371/journal.pone.0332207.t004

  i. Reduction in Anxiety:

Six randomized controlled trials contributed to the assessment of anxiety outcomes. The overall certainty of the evidence was rated as moderate. This rating was supported by a low risk of bias and the absence of major concerns related to indirectness or imprecision. However, moderate inconsistency was present across study results, which may reflect differences in intervention formats, treatment duration, or participant characteristics. Although sensitivity analyses could not be performed due to the limited number of studies, the general direction of the effect remained consistent, supporting moderate confidence in the results.

  ii. Reduction in Depression:

Five randomized controlled trials provided evidence for the outcome of depression reduction. The certainty of the evidence was rated as high, supported by low risk of bias, minimal inconsistency, and no identified concerns regarding indirectness. While a few studies showed slight imprecision in the effect estimates, the overall findings were robust and consistently favored AI-based interventions. Two reviewers independently applied the GRADE criteria, and any differences in scoring were resolved through discussion.

  iii. Improvement in Well-being:

This outcome was evaluated using four randomized controlled trials and was rated as having low certainty. Several factors contributed to this rating, including a moderate level of bias in some studies, high variation in the findings, and moderate concerns related to both indirectness and imprecision. These limitations suggest that while some positive effects were observed, additional well-designed studies are necessary to draw reliable conclusions about the impact of chatbot therapy on overall well-being.

The quality ratings derived from the GRADE process have practical implications for the adoption of AI-based interventions in mental health care. Moderate certainty for anxiety reduction implies that such tools are likely to be effective, although further research could influence this conclusion. High certainty for depression reduction supports strong confidence in the utility of chatbot therapies for managing depressive symptoms. Conversely, the low certainty related to psychological well-being suggests that findings should be interpreted cautiously and that more rigorous research is needed. Due to the small number of included studies and their methodological similarities, subgroup analyses and sensitivity tests were not conducted. Nevertheless, the possible influence of contextual variables, such as participants’ familiarity with digital tools, the accessibility of AI interventions in diverse communities, and cultural relevance, should be acknowledged. These factors may affect real-world outcomes and should be explored in future research to support equitable implementation of AI-based mental health solutions.

Discussion

This systematic review and meta-analysis evaluated the efficacy of AI-based conversational agents (CAs) in mental health care and identified several important findings. Most notably, the results confirm that AI-driven CAs significantly reduce psychological distress, particularly when integrated through multimodal platforms and generative artificial intelligence models. These technologies, often deployed via mobile applications or instant messaging platforms, offer real-time interaction and personalized support, thereby enhancing user engagement and treatment outcomes [22].

While the review includes widely used AI tools such as Woebot, Tess, and Elomia, their unique features and therapeutic mechanisms were not uniformly assessed across studies. For instance, Woebot utilizes cognitive behavioral techniques, Elomia emphasizes empathetic dialogue, and Tess combines behavioral health frameworks with adaptive personalization. Generative artificial intelligence models, in comparison to simpler rule-based systems, enable dynamic and context-sensitive responses. This enhances the system’s ability to respond appropriately to users’ emotional states, thereby strengthening the therapeutic alliance and improving intervention outcomes [16,19].

Conversational agents that incorporate both voice and text modalities demonstrated superior effectiveness compared to those relying solely on textual interaction, as supported by significant improvements in mental health metrics in several included studies. These tools appear to simulate human-like interaction more authentically, which may foster greater user trust and emotional connection. For example, one study found that multimodal delivery increased engagement and emotional rapport, particularly among older adults [14]. Nonetheless, more clarity is needed regarding the magnitude and statistical significance of these improvements, which future studies should address through detailed effect size reporting.

The efficacy of AI interventions varied across demographic and clinical groups. Positive results were especially observed in clinical and subclinical populations and among older adults, although the precise mental health conditions that benefit most (such as depression, anxiety, or stress) require further delineation. Older adults might respond better to AI tools due to higher adherence or unique communication preferences. Importantly, individuals with more severe symptoms often preferred some degree of human involvement, indicating that AI-based tools should complement, rather than replace, traditional therapeutic approaches [23,24].

A central limitation noted across studies was the presence of considerable variability in intervention formats, outcome measures, and participant populations. This heterogeneity complicates synthesis of findings and underscores the necessity for subgroup analyses. It remains unclear whether intervention effectiveness is influenced by specific variables such as symptom type, participant age, or delivery medium. Additionally, variability in outcome measurement tools, including GAD-7, PHQ-9, and customized scales, poses challenges in interpreting cross-study consistency and comparability [2].

Further methodological concerns include small sample sizes and lack of long-term follow-up in many studies, which reduce confidence in the reliability and sustainability of intervention outcomes. The absence of longitudinal data precludes conclusions regarding relapse prevention and ongoing therapeutic impact. Many studies disproportionately focused on university students or clinical cohorts in high-income countries, limiting the generalizability of findings to more diverse, global populations. Representation from varied cultural, economic, and geographical backgrounds was limited, raising concerns regarding the universal applicability of these interventions [6]. This review also faced several procedural limitations. Language bias may have been introduced by including only English-language studies. The exclusion of gray literature and unpublished research raises the possibility of publication bias, potentially overstating intervention efficacy. Manual screening and data extraction, although performed independently by multiple reviewers, could introduce subjective interpretation and procedural inefficiencies.

Future research should prioritize large-scale randomized controlled trials involving diverse and representative populations to ensure generalizability. Longitudinal designs are essential for evaluating the durability of psychological benefits. Studies should also investigate the optimal combination of human and AI interaction. Rather than replacing clinicians, AI interventions can act as scalable, accessible adjuncts that enhance traditional therapy delivery. Additional investigation is warranted into the specific therapeutic mechanisms (such as empathy, personalization, and timing of responses) that drive clinical improvement [3].

Ethical and regulatory issues demand deeper examination. Core concerns such as privacy, data security, informed consent, and transparency are especially salient in mental health contexts. Policymakers and regulatory bodies must develop comprehensive frameworks to ensure ethical deployment of AI-based mental health tools. Equitable access should be a priority, particularly for underserved and low-resource populations, necessitating cross-sector collaboration among developers, healthcare providers, and public health institutions [10].

Conclusions

AI-based conversational agents (CAs) hold considerable promise in alleviating psychological distress and promoting mental well-being, particularly when implemented with multimodal communication and personalized interaction. These tools have demonstrated effectiveness across clinical and subclinical populations, with especially positive outcomes observed among older adults. Multimodal systems that integrate text and voice have been shown to improve user engagement and therapeutic alliance, making them more effective than text-only formats. Despite these encouraging findings, inconsistencies in their impact on broader psychological well-being and the lack of long-term follow-up data underscore the need for further investigation. Future research should prioritize rigorous study designs that explore sustained effects, optimal delivery strategies, and the appropriate integration of AI tools alongside human support. Additionally, efforts must focus on addressing methodological limitations, ensuring inclusivity across diverse populations, and developing ethical frameworks to guide the equitable implementation of these technologies in global mental health care systems.

Supporting information

S1 Checklist. PRISMA 2020 checklist.

The completed PRISMA 2020 checklist indicating where each reporting item is addressed in the manuscript.

https://doi.org/10.1371/journal.pone.0332207.s001

(DOCX)

S1 Table. Extracted data from included primary studies.

Study-level data used in the meta-analysis (e.g., sample size, population/diagnosis, intervention, measures, effect sizes, follow-up, and risk-of-bias assessments). Column definitions are provided in the first rows; effect sizes are reported as Cohen’s d.

https://doi.org/10.1371/journal.pone.0332207.s002

(DOCX)

S2 Table. Comprehensive record of all studies identified in the literature search.

Screening log listing each record with design, sample size, intervention, outcome measures, inclusion/exclusion decision, and reason for exclusion (if excluded), plus source/access information. This supports the PRISMA flow.

https://doi.org/10.1371/journal.pone.0332207.s003

(DOCX)

References

  1. Dingler T, Kwasnicka D, Wei J, Gong E, Oldenburg B. The use and promise of conversational agents in digital health. Yearb Med Inform. 2021;30(1):191–9. pmid:34479391
  2. Lin Y-H, Chen C-Y, Wu S-I. Efficiency and quality of data collection among public mental health surveys conducted during the COVID-19 pandemic: systematic review. J Med Internet Res. 2021;23(2):e25118. pmid:33481754
  3. Graham S, Depp C, Lee EE, Nebeker C, Tu X, Kim H-C, et al. Artificial intelligence for mental health and mental illnesses: an overview. Curr Psychiatry Rep. 2019;21(11):116. pmid:31701320
  4. Tutun S, Johnson ME, Ahmed A, Albizri A, Irgil S, Yesilkaya I, et al. An AI-based decision support system for predicting mental health disorders. Inf Syst Front. 2023;25(3):1261–76. pmid:35669335
  5. Olawade DB, Wada OZ, Odetayo A, David-Olawade AC, Asaolu F, Eberhardt J. Enhancing mental health with artificial intelligence: current trends and future prospects. J Med Surg Public Health. 2024;3:100099.
  6. Mayo LM, Perini I, Gustafsson PA, Hamilton JP, Kämpe R, Heilig M, et al. Psychophysiological and neural support for enhanced emotional reactivity in female adolescents with nonsuicidal self-injury. Biol Psychiatry Cogn Neurosci Neuroimaging. 2021;6(7):682–91. pmid:33541848
  7. Nemesure MD, Heinz MV, Huang R, Jacobson NC. Predictive modeling of depression and anxiety using electronic health records and a novel machine learning approach with artificial intelligence. Sci Rep. 2021;11(1):1980. pmid:33479383
  8. Abd-Alrazaq AA, Alajlani M, Alalwan AA, Bewick BM, Gardner P, Househ M. An overview of the features of chatbots in mental health: a scoping review. Int J Med Inform. 2019;132:103978. pmid:31622850
  9. Ogunseye EO, Adenusi CA, Nwanakwaugwu AC, Ajagbe SA, Akinola SO. Predictive analysis of mental health conditions using AdaBoost algorithm. ParadigmPlus. 2022;3(2):11–26.
  10. May R, Denecke K. Security, privacy, and healthcare-related conversational agents: a scoping review. Inform Health Soc Care. 2022;47(2):194–210. pmid:34617857
  11. Scoglio AA, Reilly ED, Gorman JA, Drebing CE. Use of social robots in mental health and well-being research: systematic review. J Med Internet Res. 2019;21(7):e13322. pmid:31342908
  12. Robinson NL, Cottier TV, Kavanagh DJ. Psychosocial health interventions by social robots: systematic review of randomized controlled trials. J Med Internet Res. 2019;21(5):e13203. pmid:31094357
  13. Rathbone AL, Prescott J. The use of mobile apps and SMS messaging as physical and mental health interventions: systematic review. J Med Internet Res. 2017;19(8):e295. pmid:28838887
  14. Vaidyam AN, Linggonegoro D, Torous J. Changes to the psychiatric chatbot landscape: a systematic review of conversational agents in serious mental illness. Can J Psychiatry. 2021;66(4):339–48. pmid:33063526
  15. He Y, Yang L, Qian C, Li T, Su Z, Zhang Q, et al. Conversational agent interventions for mental health problems: systematic review and meta-analysis of randomized controlled trials. J Med Internet Res. 2023;25:e43862. pmid:37115595
  16. Prochaska JJ, Vogel EA, Chieng A, Baiocchi M, Maglalang DD, Pajarito S, et al. A randomized controlled trial of a therapeutic relational agent for reducing substance misuse during the COVID-19 pandemic. Drug Alcohol Depend. 2021;227:108986. pmid:34507061
  17. Klos MC, Escoredo M, Joerin A, Lemos VN, Rauws M, Bunge EL. Artificial intelligence-based chatbot for anxiety and depression in university students: pilot randomized controlled trial. JMIR Form Res. 2021;5(8):e20678. pmid:34092548
  18. Ogawa M, Oyama G, Morito K, Kobayashi M, Yamada Y, Shinkawa K, et al. Can AI make people happy? The effect of AI-based chatbot on smile and speech in Parkinson’s disease. Parkinsonism Relat Disord. 2022;99:43–6. pmid:35596975
  19. Romanovskyi O, Pidbutska N, Knysh A. Elomia chatbot: the effectiveness of artificial intelligence in the fight for mental health. CEUR Workshop Proc. 2021;2870:1215–24.
  20. Drouin M, Sprecher S, Nicola R, Perkins T. Is chatting with a sophisticated chatbot as good as chatting online or FTF with a stranger? Comput Human Behav. 2022;128:107100.
  21. Liu H, Peng H, Song X, Xu C, Zhang M. Using AI chatbots to provide self-help depression interventions for university students: a randomized trial of effectiveness. Internet Interv. 2022;27:100495. pmid:35059305
  22. Beyeler M, Légeret C, Kiwitz F, van der Horst K. Usability and overall perception of a health bot for nutrition-related questions for patients receiving bariatric care: mixed methods study. JMIR Hum Factors. 2023;10:e47913. pmid:37938894
  23. Cuijpers P, Karyotaki E, Eckshtain D, Ng MY, Corteselli KA, Noma H, et al. Psychotherapy for depression across different age groups: a systematic review and meta-analysis. JAMA Psychiatry. 2020;77(7):694–702. pmid:32186668
  24. Firth J, Torous J, Nicholas J, Carney R, Pratap A, Rosenbaum S, et al. The efficacy of smartphone-based mental health interventions for depressive symptoms: a meta-analysis of randomized controlled trials. World Psychiatry. 2017;16(3):287–98. pmid:28941113