
The efficacy, feasibility, and technical outcomes of a GPT-4o-based chatbot Amanda for relationship support: A randomized controlled trial

  • Laura M. Vowels ,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Writing – original draft

    laura.vowels@roehampton.ac.uk

    Affiliations Institute of Psychology, University of Lausanne, Lausanne, Switzerland, School of Psychology, University of Roehampton, London, United Kingdom

  • Matthew J. Vowels,

    Roles Conceptualization, Methodology, Software, Writing – review & editing

    Affiliations School of Psychology, University of Roehampton, London, United Kingdom, The Sense Innovation and Research Center, Lausanne and Sion, Switzerland, Radiology Department, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland, Centre for Vision Speech and Signal Processing, University of Surrey, Guildford, United Kingdom

  • Shannon K. Sweeney,

    Roles Formal analysis, Writing – original draft, Writing – review & editing

    Affiliation School of Psychology, University of Roehampton, London, United Kingdom

  • S. Gabe Hatch,

    Roles Writing – review & editing

    Affiliation Hatch Data and Mental Health, Payson, Utah, United States of America

  • Joëlle Darwiche

    Roles Conceptualization, Writing – review & editing

    Affiliation FAmily and DevelOpment research center (FADO), Institute of Psychology, University of Lausanne, Lausanne, Switzerland

Abstract

This randomized controlled trial evaluated the efficacy, feasibility, and technical outcomes of Amanda, a GPT-4o-based chatbot, in delivering single-session relationship interventions. A total of 258 participants were randomly assigned to engage with either Amanda (n = 130) or a writing task (n = 128) focused on conflict reappraisal. Findings demonstrated significant improvements across 13 of 14 outcome variables—including relationship satisfaction, communication patterns, dyadic coping, problem-specific confidence, and individual well-being—over time in both conditions. Improvements emerged immediately after the intervention and were sustained or increased further at the two-week follow-up. However, there were no significant group differences for most outcomes, suggesting that both interventions were comparably effective. One significant group-by-time interaction emerged: participants in the chatbot condition reported lower levels of the partner-demand/self-withdraw communication pattern immediately post-intervention. The writing condition was also associated with lower overall distress about the issue. Feasibility outcomes indicated strong participant engagement with Amanda. Usability was rated highly (M = 4.19/5), as were therapeutic skills (M = 3.99/5) and working alliance (M = 4.75/6). Technical evaluation of interaction transcripts supported these findings, with high coder agreement on Amanda’s empathy, therapeutic questioning, and coherence. However, limitations were noted: Amanda occasionally produced repetitive or generic responses and did not consistently identify potential safety concerns. Overall, results suggest that Amanda provides a feasible and effective single-session relationship intervention, comparable in impact to an evidence-based writing task. This study highlights the potential for large language model-based chatbots to deliver scalable, accessible relationship support.
Future research should assess Amanda’s use in multi-session interventions, explore performance in clinical populations, and enhance risk detection capabilities to ensure safe deployment in real-world settings.

Introduction

Relationship distress is a prevalent issue, with approximately one in three couples experiencing clinically significant problems that impact multiple aspects of life [1–4]. Traditional couple therapy is well-established as an effective intervention [5,6]. However, numerous barriers, including financial constraints and societal stigma, prevent many couples from accessing these services [3,7]. In response to these challenges, online interventions have emerged as a promising alternative, offering a more accessible and flexible means of support [7–10]. Research has shown that online interventions can be as effective as face-to-face interventions [10,11]. Despite their potential, digital interventions face significant challenges, most notably high dropout rates, which call into question their long-term feasibility [12,13].

Recent advancements in generative artificial intelligence (GenAI) and large language models (LLMs) have opened new avenues for developing interventions, including providing relationship support. GenAI and LLMs, which encompass machine learning algorithms that can generate novel human-like text outputs, form the backbone of modern chatbots. Chatbots have demonstrated potential in mental health interventions [14–20], fostering therapeutic alliances that support sustained engagement [18,21]. For instance, a content analysis of the Wysa chatbot revealed that users developed strong bonds with the bot, with therapeutic alliance scores comparable to those in human-therapist interactions [22]. A systematic review emphasized that personalization and empathetic responses are critical for improved mental health outcomes in chatbot interactions [16]. Nonetheless, misinformation in sensitive contexts like mental health treatment could lead to harmful outcomes. Therefore, research must ensure the consistency and reliability of these chatbots.

The release of OpenAI’s ChatGPT, particularly the GPT-4 model, represents a substantial advancement in this field, potentially overcoming these earlier challenges. GPT-4, which builds on the foundation laid by GPT-3.5, offers more nuanced and human-like conversational capabilities, making it a strong candidate for developing more effective and engaging online interventions. GPT-4 has demonstrated improved performance over previous models in medical examinations and genetic information delivery [23,24]. However, challenges persist, including the generation of inaccurate information and outdated content [23,25]. In mental health and substance use education, GPT-4’s outputs required human oversight and were substandard compared to expert-developed materials [25]. Other research has shown that LLMs like GPT-3.5 and GPT-4 may outperform physicians and psychotherapists in providing information that is perceived as empathic and helpful [26,27]. In a recent study with a small sample, seven psychiatric inpatients engaged in 3–6 sessions with ChatGPT; these participants reported higher quality of life than a control group of six patients and high satisfaction with the chatbot [28]. To our knowledge, no other studies have directly examined the efficacy of modern LLMs in providing mental health interventions, nor how LLMs perform relative to previously established relationship interventions. Thus, while LLMs like GPT-4 show potential in various interventions, including improving user engagement, their use requires careful consideration of limitations and ethical implications to ensure patient safety and information accuracy [25].

Initial studies evaluating the potential of GPT-4 for relationship interventions suggest that it may offer a viable alternative to traditional therapy. For example, responses generated by GPT-4 were rated as more empathetic and helpful than those produced by relationship experts in simulated sessions [27]. In a single-session relationship intervention using GPT-4, participants and researchers rated the chatbot highly across different therapeutic skills, with 85% of the 20 participants reporting a positive experience with the chatbot [29]. This reflects a broader trend where LLMs are increasingly being explored for their therapeutic potential across various fields, although their application to relationship therapy is still in its early stages.

Given the rapid advancement of AI technologies and the increasing interest in their application to therapeutic settings, it is crucial to systematically investigate the capabilities and limitations of LLM-powered chatbots in providing relationship interventions and compare their efficacy to previously established relationship interventions. This study aims to address this gap by exploring how GPT-4 can deliver single-session relationship interventions, how it compares to an established writing intervention of similar length, and examining users’ perceptions of these chatbot interactions. While LLM-powered chatbots have shown promise, there remains a need for more sophisticated models specifically tailored for relationship support. Amanda, a GPT-4o-based chatbot, is designed to fill this gap by integrating a specialized prompt that instructs the AI to act as a relationship therapist, reflecting on client statements, providing empathy, and asking pertinent follow-up questions [29]. She has been evaluated across a range of therapeutic skills, including empathy, realism, and collaborative approach, based on frameworks established in prior studies [14,29] and has been shown to provide a feasible solution for relationship support [29].

There are three main types of outcomes that can be used to evaluate chatbot performance: efficacy, feasibility, and technical outcomes [18]. A recent systematic review showed that while 61% of studies on chatbots reported clinical outcomes and 37% reported feasibility outcomes, only 1% reported technical outcomes [18]. This is a significant gap in the literature, as evaluating technical outcomes is imperative to establish the performance and safe usage of chatbots for clinical interventions. The present study employs a robust methodology evaluating all three types of outcomes.

First, we included a set of validated questionnaires to establish the efficacy of the intervention, covering a wide range of clinical outcomes such as relationship satisfaction, communication skills, dyadic coping, and individual well-being. We also included problem-specific outcomes such as hopefulness, distress, and confidence about the presenting problem. The clinical outcomes were compared across the chatbot condition and an established writing condition focused on conflict reappraisal, providing a comparison between a more traditional self-help intervention and a more interactive chatbot interaction [30]. Second, we evaluated the feasibility of the chatbot intervention, i.e., the usability of the chatbot and how users viewed their interactions with it, using previously validated questionnaires on chatbot usability and therapeutic alliance as well as a questionnaire on specific therapeutic skills from previous work on relationship support chatbots [27,29]. Finally, to assess technical outcomes, i.e., outcomes related to the performance of the chatbot such as coded empathy, human-likeness, and conversation depth, we coded the transcripts of the interactions between the participants and the chatbot using an evaluation framework based on prior work [14,29]. This evaluation framework helps to address the technical challenges that have historically limited the effectiveness of AI in therapeutic roles. By systematically assessing Amanda’s performance across these metrics using a variety of methodologies, this study seeks to advance our understanding of how AI can be effectively integrated into relationship interventions, potentially transforming how couples access and benefit from therapeutic support.

In this preregistered trial, participants were randomly assigned to engage with Amanda or complete a brief social psychological writing task focused on conflict reappraisal shown to improve relationship outcomes [30]. We preregistered the following two main hypotheses:

  1. H1: Participants in both treatment arms (chatbot and writing task) will report significant improvements over time (from pre-intervention to post-intervention and/or follow-up) in a) relationship satisfaction, b) relationship confidence, c) individual well-being, d) hopefulness about the issue, e) confidence in resolving the issue, f) communication about the issue, and g) distress related to the issue. These within-subject improvements are expected as both interventions are designed to promote cognitive and emotional reflection on a specific relationship problem. Given both conditions help the participants address and think differently about their relationship issue, we expect them to feel better about their relationship directly after the intervention. However, we might expect that there is a greater improvement at two-week follow-up given the participants will have had time to engage with their partner following the intervention.
  2. H2: The chatbot intervention (Amanda) will lead to significantly greater improvements over time compared to the writing task in the following outcomes: a) relationship satisfaction, b) relationship confidence, c) individual well-being, d) hopefulness about the issue, e) confidence in resolving the issue, f) communication about the issue, and g) distress related to the issue. This between-group hypothesis is based on the expectation that participants will engage more fully with Amanda due to the interactive and responsive nature of the chatbot, potentially enhancing therapeutic outcomes. This may only show up two weeks later as participants will have had an opportunity to interact with their partner.

Method

Ethics statement

The study was approved by the University of Lausanne’s Research Ethics Commission under project number C_SSP_052024_00015. Participants were provided with an informed consent form and were asked to tick an electronic checkbox to indicate their consent. They were told they could withdraw their participation at any time until the data collection was completed.

Transparency and openness

The study was preregistered, and the preregistration can be found here: https://osf.io/wcghs. The analyses related to attitudes are reported in another manuscript [31]. All the study materials and analyses as well as the data can be found on the Open Science Framework project page: https://osf.io/wycda/. Session transcripts are not published openly due to the potentially sensitive and identifying information contained in the transcripts. We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.

Participants

A total of 615 participants completed the initial eligibility survey. Of these participants, 191 were not eligible either because their potential risk level was too high (e.g., reported suicidal ideation or potential domestic violence) or because they did not report a specific relationship issue that they wanted to address. The remaining 424 participants were invited to complete the study, with 319 participants starting the survey. A further 61 participants were excluded because they either did not meet the inclusion criteria, did not complete the survey, did not interact or complete the interaction with the chatbot, or had an issue with the chatbot interaction (either they stopped the interaction or were not able to launch the chatbot). Thus, the final sample consisted of 258 participants (130 in the chatbot condition and 128 in the writing condition). Please see the CONSORT diagram in Fig 1 for a visual representation. Most participants were women (n = 169, 81%), heterosexual (n = 221, 86%), married (n = 132, 51%), white (n = 205, 79%), lived in the UK (n = 227, 88%), and had at least an undergraduate degree (n = 197, 76%). Most participants had also had at least some experience with chatbots (n = 186, 72%) and/or virtual assistants (n = 225, 87%). See Table 1 for the details of demographic variables for each group. At follow-up we had a total of 240 participants (122 in the chatbot condition and 118 in the writing condition), resulting in a dropout rate of 7.7%.

Table 1. Demographic and Background Characteristics of Participants Across Chatbot and Writing Task Conditions.

https://doi.org/10.1371/journal.pmen.0000411.t001

Procedure

Participants were recruited through Prolific, an online participant recruitment platform, in July 2024. Participants were eligible to take part in the study if they were at least 18 years old, currently in a romantic relationship, and reported a specific relationship issue they wished to address. Participants were excluded if they indicated recent thoughts of self-harm or suicidal ideation, as assessed by item 9 of the PHQ-9 [32], or if they reported any experiences of domestic violence, including physical harm or threats of harm from their partner [33]. Individuals were also excluded if they reported frequent emotional abuse, such as being regularly insulted, screamed at, or cursed at by their partner. Finally, participants who failed to report a specific relationship issue during the eligibility screening were excluded. These criteria were implemented to ensure participant safety and to confirm the appropriateness of a self-guided, low-intensity intervention. Participants who were deemed ineligible due to elevated risk were contacted via Prolific and provided with links to appropriate mental health support services. The participants who fulfilled our eligibility criteria were invited to participate in the main survey. Based on an a priori power calculation, 214 participants (107 per condition) were required to achieve 90% power to detect a small effect size (η² = 0.1) with p < .05. Based on our prior experience recruiting participants via Prolific for similar studies, we accounted for an attrition rate of up to 20% at follow-up; to ensure sufficient power, we aimed to recruit a total of 256 participants (128 per condition).
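For readers reproducing the power analysis: most ANOVA power calculators (e.g., G*Power) take Cohen's f rather than η², so the target effect size is converted first. A minimal sketch of that conversion (the function name is ours, not from the study):

```python
import math

def eta_squared_to_cohens_f(eta_sq: float) -> float:
    """Convert eta-squared to Cohen's f, the effect-size metric
    most ANOVA power calculators expect: f = sqrt(eta2 / (1 - eta2))."""
    return math.sqrt(eta_sq / (1.0 - eta_sq))

# The preregistered target effect size of eta-squared = 0.1
# corresponds to Cohen's f of about 0.33.
f = eta_squared_to_cohens_f(0.1)
```

The resulting f, together with alpha = .05, power = .90, and two groups, would then be entered into the power calculator to obtain the required sample size.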

In the main survey, the participants first completed questions about demographics, previous experience with chatbots or virtual assistants, attitudes toward chatbots and online interventions, and a range of clinical relationship and individual outcomes not reported in this manuscript. After the initial set of questionnaires, participants were randomly assigned to one of two conditions: an interaction with a chatbot or the writing task of a brief social psychological intervention [30]. Participants were randomly assigned using Qualtrics’ built-in randomization function, which automatically allocated participants after they completed the baseline measures. This ensured that allocation was concealed from both participants and researchers during the enrolment and data collection process. Each intervention lasted between 15–30 minutes. The details of each intervention are described below. After the intervention, participants completed the same questions as before the intervention, except for demographics, as well as additional questions about the usability of the chatbot or the writing task depending on which condition they were assigned to. At the end, participants were provided with details of online counseling services in case they wanted to seek further support. After two weeks, participants were invited to participate in a follow-up survey where they again completed the same set of questionnaires and some additional questions about whether they had applied with their partner what they had learnt from the intervention. Data were collected between July 11th, 2024 and August 1st, 2024.

Interventions

Chatbot Interaction. In the chatbot condition, participants were asked to engage with Amanda, a chatbot created for the purpose of this study, about their specific relationship issue for 20–30 minutes until the conversation naturally came to an end. Amanda is based on OpenAI’s GPT-4o LLM and is programmed to have a “memory” of the conversation: each turn was appended to the end of the chatbot prompt so that the model could follow the conversation. Like GPT-4, GPT-4o is not specifically trained to be a psychotherapist, but Amanda received the following base prompt, which allowed it to interact in a manner approaching that of a psychotherapist:

You are a trained psychotherapist called Amanda specialising in working with relationship difficulties. I would like you to respond as a relationship therapist: reflect what the client has said, provide validation and empathy, stay close to what the client says instead of overinterpreting them, and ask follow-up questions designed for you to better understand the situation. Do not provide answers that are too long, only ask one question at a time, and try to maintain a natural conversation like I would have with a therapist. The conversations should last at least 20 interactions and should eventually end up with some relevant suggestions for how to improve the issue, but this should only come towards the end of the conversation once the person has had enough time to explore their issue and you have a good understanding of the issue and feel you can offer personalised suggestions for help. You should not offer any advice until you have had at least 10 interactions with the client or patient. Avoid asking for any information that is identifying (e.g., do not ask addresses, names, or companies where they work). If they do not provide any context, assume you know nothing about their situation and ask them for more information.
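The conversation “memory” described above — resending the base prompt plus the accumulated turns on every request — could be implemented roughly as follows. This is an illustrative sketch under our own assumptions, not the study’s actual code; the API call is shown only as a comment, and the class and variable names are hypothetical:

```python
class ChatSession:
    """Minimal sketch of turn-by-turn memory: every user and assistant
    message is appended to one message list, so the full history
    (with the base prompt as the system message) is resent each turn."""

    def __init__(self, base_prompt: str):
        # The base prompt plays the role of the system message.
        self.messages = [{"role": "system", "content": base_prompt}]

    def add_user_turn(self, text: str) -> list:
        self.messages.append({"role": "user", "content": text})
        # In a real deployment, something like the following would run:
        # reply = client.chat.completions.create(
        #     model="gpt-4o", messages=self.messages
        # ).choices[0].message.content
        reply = "(assistant reply would be generated here)"
        self.messages.append({"role": "assistant", "content": reply})
        return self.messages
```

Because the entire history travels with each request, the model can refer back to earlier disclosures without any server-side state.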

Writing Task. To compare the chatbot intervention to another intervention, we selected the reappraisal writing intervention [30] because it allowed us to match the chatbot condition as closely as possible in both duration and cognitive demand. Both interventions required participants to actively reflect on their relationship issue for approximately 20–30 minutes, supporting a meaningful comparison of efficacy and feasibility across conditions. The participants completed a set of four writing tasks, each lasting at least 3 and up to 5 minutes. In Step 1, participants were asked to write a fact-based summary of the most significant disagreement they had experienced with their spouse over the preceding 2 weeks for 5 minutes, “focusing on behavior, not on thoughts or feelings.” If they could not think of a conflict in the past two weeks, they were asked to describe a significant conflict that was as recent as possible.

In Step 2, participants were asked to complete a writing task during which they reappraised the conflict they had just written about. During the reappraisal writing task, participants responded to three prompts (5 minutes for each prompt): (1) “Think about the specific disagreement that you just wrote about having with your partner. Think about this disagreement with your partner from the perspective of a neutral third party who wants the best for all involved; a person who sees things from a neutral point of view. How might this person think about the disagreement? How might he or she find the good that could come from it?” (2) “Some people find it helpful to take this third-party perspective during their interactions with their romantic partner. However, almost everybody finds it challenging to take this third-party perspective at all times. In your relationship with your partner, what obstacles do you face in trying to take this third-party perspective, especially when you’re having a disagreement with your partner?” (3) “Despite the obstacles to taking a third-party perspective, people can be successful in doing so. Over the next 2 weeks, please try your best to take this third-party perspective during interactions with your partner, especially during disagreements. How might you be most successful in taking this perspective in your interactions with your partner over the next 2 weeks? How might taking this perspective help you make the best of disagreements in your relationship?”

Measures

Demographic information. Participants completed a baseline questionnaire, which included questions on age, gender, ethnicity, sexual orientation, education level, employment, relationship status, children, country of residence, and relationship length.

Chatbot experience. Two single-item questions were used to assess participants’ previous experience with chatbots. These questions asked whether participants had used an AI-powered chatbot or virtual assistant (e.g., ChatGPT, Bing, Siri). Response options were “Yes” or “No,” with additional specification requested if “Yes” was selected.

Attitudes toward chatbots [34]. A 14-item scale, divided into four subscales, measured participants’ attitudes toward chatbots: Performance Expectancy, Effort Expectancy, Willingness to Accept AI Devices, and Objection to AI Devices. Performance Expectancy included four items (e.g., “AI devices are more accurate than human beings”), while Effort Expectancy contained three items (e.g., “It takes me too long to learn how to interact with AI devices”). The Willingness and Objection subscales included three and four items respectively (e.g., “I am likely to interact with AI devices”; “I prefer human contact in service transactions”). Each item was rated on a 5-point scale, from strongly disagree (1) to strongly agree (5).

Attitudes toward internet-based therapies [35]. Participants’ views on internet-based therapies were assessed with 17 single-item questions, such as “Internet-based therapies are an appropriate alternative to conventional face-to-face psychotherapy.” Responses were rated on a 5-point Likert scale from strongly disagree (1) to strongly agree (5).

Relationship satisfaction. The Couple Satisfaction Index (Funk & Rogge, 2007), a four-item measure, assessed relationship satisfaction. Items included “In general, how satisfied are you with your relationship?” Overall scores ranged from 0 to 23, with higher scores indicating greater satisfaction.

Relationship confidence. Confidence in relationship trajectory was measured using a two-item scale adapted from [36]. Participants responded to the items “I believe we can handle whatever conflicts will arise in the future” and “I feel good about our prospects to make this relationship work.” Response options ranged from strongly disagree (1) to strongly agree (7), with higher scores reflecting greater confidence in the relationship.

Dyadic coping. The Dyadic Coping Inventory [37] assessed stress communication and coping strategies with partners. Two subscales were used: common dyadic coping (5 items, e.g., “We help one another to put the problem in perspective and see it in a new light”) and evaluation of dyadic coping (2 items, e.g., “I am satisfied with the support I receive from my partner and the way we deal with stress together.”). For the first subscale, higher scores indicate better joint coping with stress; for the second subscale, higher scores indicate a more positive personal evaluation of coping as a couple.

Communication. The Communication Patterns Questionnaire—Short Form (CPQ-SF) [38] was used to assess participants’ perceived frequency of communication patterns during conflict. The questionnaire consists of three subscales: constructive communication, self-demand/partner withdraw, and partner demand/self-withdraw. The constructive subscale consisted of three positive items and one negative item; a positive example item is “Both my partner and I try to discuss the problem”, and the negative item is “Both my partner and I blame, accuse, and criticize one another”. The self-demand/partner withdraw subscale contained three items; an example item is “I try to start a discussion while my partner tries to avoid a discussion”. The partner demand/self-withdraw subscale also contained three items; an example item is “My partner tries to start a discussion while I try to avoid a discussion”. Response options ranged from very unlikely (1) to very likely (9). Each subscale was then calculated by adding or subtracting (for the negative item) the responses given for their respective items. Each subscale score ranges from 1 to 27, with a higher score corresponding to the perception of more frequent use of a specific communication pattern.
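The CPQ-SF scoring rule just described (summing items, subtracting the reverse-keyed negative item) can be expressed as two short helpers; a sketch with hypothetical function names, not taken from the study's materials:

```python
def cpq_constructive_score(positive_items, negative_item):
    """Constructive-communication subscale: sum the three positive
    items (each rated 1-9) and subtract the single negative item,
    per the add-or-subtract scoring rule in the text."""
    return sum(positive_items) - negative_item

def cpq_demand_withdraw_score(items):
    """Demand/withdraw subscales: a simple sum of the three items."""
    return sum(items)
```

Higher scores on each subscale then reflect a perception of more frequent use of that communication pattern.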

Relationship issue assessment. We asked participants five single-item questions regarding their current relationship issue. These questions assessed hopefulness (“How hopeful have you felt that you can resolve the relationship issue with your partner?”), confidence (“How confident have you felt that you and your partner will be able to overcome the relationship issue?”), and distress (“How distressed do you feel about the relationship issue?”). All items were rated from 0 to 6, with 0 being “not at all” and 6 being “extremely”.

Relationship problem severity and confidence. A single-item measure [39] evaluated the impact of the participants’ issue, rated on a 7-point Likert scale from 1 (not a problem) to 7 (extreme problem). Confidence in managing conflicts related to this issue was assessed similarly, with a 7-point scale from 1 (not confident) to 7 (extremely confident).

Communication regarding relationship issues. Two single-item questions asked participants to evaluate their communication and confidence in communicating with their partner about the relationship issue, both rated from 0 (not at all good) to 6 (perfect).

Individual well-being. The WHO-5 Well-Being Index [40] was used to measure participants’ current mental well-being. Participants rated how often they had felt as described by items such as “I have felt cheerful and in good spirits.” Responses range from “at no time” (0) to “all of the time” (5). Overall scores range from 0 to 25, with higher scores indicating better mental well-being.

Chatbot usability. The Chatbot Usability Questionnaire (CUQ) [41] was used to ask participants in the chatbot condition to rate their experience with the chatbot. The questionnaire included 16 items, such as “The chatbot was easy to navigate” and “Chatbot responses were not relevant.” Responses were rated on a scale from 1 (strongly disagree) to 5 (strongly agree).

Therapeutic skills. Additionally, participants in the chatbot condition rated their experience using 17 single-item questions based on therapist ratings and qualitative interviews [27]. We conducted an exploratory factor analysis to assess the factor structure of the scale. The results showed that the scale items loaded onto one factor with item loadings above .3, except for Item 15, which was removed from the final scale. Therefore, the final scale consisted of 16 items. For participants in the writing condition, 10 single-item questions assessed the writing task’s effectiveness based on the same therapist ratings and qualitative interviews (e.g., “The writing task enabled me to explore the issue in depth”). The latter was not used in the manuscript, as the purpose of the scale was only to make the two conditions more comparable.

Working alliance. The Working Alliance Questionnaire [42] was used to evaluate the perceived alliance with the chatbot as a therapist. Participants rated six items, such as “The chatbot and I were working towards mutually agreed upon goals,” on a scale from 1 (not at all) to 6 (completely).

Data analysis

Data were analyzed using a per-protocol approach, including only participants who completed the assigned intervention (i.e., interacted with the chatbot or completed the writing task) and provided post-intervention data. Participants who did not complete the intervention or encountered technical issues were excluded from analyses. The preregistered analyses were conducted using a 2 (group) × 3 (time) mixed ANOVA for each outcome separately. When there was a significant interaction between group and time, we used Bonferroni-corrected post-hoc tests to adjust for Type I error and to identify at which timepoints the groups differed significantly. All analyses were conducted in R; the ANOVAs were conducted using the rstatix package [43]. We used partial eta-squared (η²p) for the mixed ANOVA results, as it allowed us to isolate the effect sizes of individual factors while accounting for other variables in the model, and reported eta-squared (η²) for post-hoc group comparisons, as it provides a straightforward measure of the proportion of total variance explained. We interpreted effect sizes using the conventional cut-offs for eta-squared: small (0.01), medium (0.06), and large (0.14) [44]. For pairwise post-hoc comparisons, we applied the Bonferroni correction and used Cohen’s d as the measure of effect size, with the conventional cut-offs: small (0.2), medium (0.5), and large (0.8) [44].
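The two eta-squared variants differ only in their denominators; a brief illustration of how each would be computed from an ANOVA's sums of squares, together with the conventional cut-offs cited above (function names are ours, for illustration only):

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta-squared isolates one factor: effect variance
    relative to the effect plus its own error term only."""
    return ss_effect / (ss_effect + ss_error)

def eta_squared(ss_effect: float, ss_total: float) -> float:
    """Eta-squared: proportion of the *total* variance explained."""
    return ss_effect / ss_total

def interpret_eta_squared(es: float) -> str:
    """Conventional cut-offs: small 0.01, medium 0.06, large 0.14."""
    if es >= 0.14:
        return "large"
    if es >= 0.06:
        return "medium"
    if es >= 0.01:
        return "small"
    return "negligible"
```

Because partial η² excludes variance attributable to other factors from its denominator, it is never smaller than η² for the same effect in a multi-factor design.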

The technical outcomes of the session were analyzed using content analysis, with a total of 21 codes applied to the data. Eight codes were initially selected for relevance from Abd-Alrazaq et al. [14], a review of technical metrics for evaluating healthcare chatbots: dialogue efficiency, context awareness, error management, comprehensibility, realism, empathy, repetitiveness of response, and the chatbot’s understanding of responses. Twelve additional codes were developed by the first and third authors, based on face validity and an earlier study by Vowels [27], to evaluate the therapeutic approach: repetitiveness of response, alliance, reflection, focus, therapeutic questioning, explorative approach, collaborative solutions, response length, overall sense of flow and coherence, addressing safety concerns, cultural adaptation, and privacy. Codes were rated on a scale from 0 (absent), through 1 (sometimes present, less than 50% of the time) and 2 (mostly present, more than 50% of the time), to 3 (completely present). This scale was applied to all codes except dialogue efficiency, for which the number of interactions was counted instead, and the last four codes, which were coded yes or no. The first and third authors individually applied the 21 codes to the data; any discrepancies between coders were then discussed until 100% agreement was reached, with most differences being between scores of 2 and 3.

Results

Clinical outcomes

The results for the clinical outcomes are available in Table 2. Across all outcomes except relationship confidence, the results consistently showed that both interventions (chatbot and writing task) improved problem-specific outcomes, generic relationship outcomes, and individual well-being over time. We conducted post-hoc tests to better understand when these improvements occurred, and the pattern varied somewhat across outcome variables. The following variables improved from pre- to post-intervention and remained stable at follow-up: hopefulness about the problem, confidence about the problem, problem conflict confidence, relationship satisfaction, and the self-demand/partner-withdraw pattern. In other words, participants improved from pre-intervention to post-intervention, and this improvement held steady, without further gains, two weeks later. Other outcomes improved from pre- to post-intervention and continued to improve over the following two weeks: distress about the problem, perceived problem severity, communication about the problem, and constructive communication. Finally, some variables did not improve immediately after the intervention but had improved by the two-week follow-up: common dyadic coping, evaluation of dyadic coping, the partner-demand/self-withdraw pattern, and individual well-being. In summary, all outcomes except relationship confidence improved over time in both groups; improvements were always present by the two-week follow-up, emerged immediately after the intervention for some outcomes, and continued to grow between post-intervention and follow-up for others.

thumbnail
Table 2. Analysis of Variance (ANOVA) Results Comparing Chatbot and Writing Task Conditions Across Time Points for the Clinical Outcome Variables.

https://doi.org/10.1371/journal.pmen.0000411.t002

The results showed no significant group differences apart from two outcomes. First, participants in the writing task reported significantly lower distress about the relationship issue overall, including at pre-intervention. Second, there was a significant interaction between group and time on the CPQ partner-demand/self-withdraw subscale: participants in the chatbot condition reported significantly lower levels of the partner-demand/self-withdraw pattern immediately after the interaction with the chatbot (p = .046), but this difference was no longer significant at follow-up (p = .199). Thus, overall, the results suggest that both interventions (chatbot and writing task) produce improvements in relationship and individual functioning.

Feasibility outcomes

Table 3 reports the feasibility outcomes directly following the chatbot intervention. Overall, participants rated the chatbot highly on usability (4.19/5) and therapeutic skills (3.99/5), and they also rated their working alliance with the chatbot highly (4.75/6). The results therefore suggest that the chatbot is a feasible option for offering relationship support, at least as a single-session intervention. The items participants rated lower indicated that the chatbot did not clearly describe its purpose, was somewhat robotic and only moderately human-like, gave responses that were somewhat generic and repetitive, and provided new insights or explored the issue in depth only moderately. The remaining attributes were rated very highly overall (above 4/5, or around 1/5 for reverse-scored items).

thumbnail
Table 3. Post-Test Feasibility Outcomes for Chatbot Condition.

https://doi.org/10.1371/journal.pmen.0000411.t003

Technical outcomes

We coded all interaction transcripts between the participants and the chatbot for technical outcomes. To ascertain the reliability of the codes, a second coder double-coded one third of the transcripts; we report reliability as the percentage agreement between the two coders for each code, and the overall Cohen’s kappa across codes was .868. The results for the technical outcomes, including the per-code percentage agreement, can be found in Table 4. Overall, the coders rated the chatbot interactions highly, consistent with the ratings from the users. The main codes on which Amanda scored lower concerned her being repetitive and including herself in the process; all other items were rated at least 2.50/3 overall. The chatbot explored a potential safety issue on only one of the three occasions where the coder(s) identified a potential safety concern that warranted exploration. When one participant described feelings of severe anxiety and low self-esteem, and when another said they “feel like just giving up,” the chatbot did not address the potential safety issue. However, when a third participant talked about her sadness and crying a lot, Amanda explored coping mechanisms and suggested that the participant seek therapy for these feelings. When relevant, Amanda consistently addressed or accommodated participants’ culture in her responses. Finally, she never asked participants for identifying information, and there was consequently no identifying information in the transcripts.
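The two reliability indices reported above, percentage agreement and Cohen’s kappa, can be computed for two coders as follows. This is a minimal illustration with made-up ratings, not the coding data from this study, and the function names are invented for illustration.

```python
from collections import Counter

def percent_agreement(ratings1, ratings2):
    """Proportion of items on which the two coders assigned identical codes."""
    assert len(ratings1) == len(ratings2)
    return sum(a == b for a, b in zip(ratings1, ratings2)) / len(ratings1)

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Note: undefined (division by zero) if both coders always use the same
    single category, since expected agreement is then 1.
    """
    n = len(ratings1)
    p_observed = percent_agreement(ratings1, ratings2)
    counts1, counts2 = Counter(ratings1), Counter(ratings2)
    categories = set(ratings1) | set(ratings2)
    # Chance agreement: probability both coders pick the same category at random,
    # given each coder's marginal distribution of codes.
    p_expected = sum((counts1[c] / n) * (counts2[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)
```

Perfect agreement on the 0–3 rating scale yields κ = 1, while agreement at chance level yields κ = 0; a value of .868 therefore indicates strong agreement beyond chance.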

thumbnail
Table 4. Technical Outcomes of Interaction Transcripts Coded by Researchers.

https://doi.org/10.1371/journal.pmen.0000411.t004

Discussion

The results of this study were notably consistent, with 13 out of 14 outcomes showing significant improvements in relationship functioning, problem-specific measures, and individual well-being across both conditions, supporting H1. This indicates that both the chatbot and the writing task were effective in addressing key relational and personal outcomes. However, H2 was not supported, as no significant differences emerged between the two conditions for most clinical outcomes. These findings align with previous research suggesting that self-help interventions, including digital formats, can be just as effective as alternative options [10,11]. A possible explanation for the similar outcomes between the groups is that both interventions fostered emotional and cognitive reflection, which may be equally impactful in short-term, single-session formats. Although one of the anticipated advantages of chatbots is the potential to develop a therapeutic alliance that could enhance engagement [18,21], this study’s single-session design did not allow for the exploration of this dynamic. Therefore, further research is necessary to determine whether chatbots like Amanda can foster sustained engagement and long-term effectiveness in multi-session interventions.

Feasibility and technical outcomes have historically been underexplored in studies evaluating chatbot interventions, with only 1% of chatbot studies reporting technical outcomes and 37% reporting feasibility outcomes [18]. In the present study, participants rated the chatbot highly for usability and therapeutic skills, supporting the feasibility of using a GPT-4o-based chatbot like Amanda for relationship support. Specifically, users rated the chatbot highly on empathy and comprehensibility, though some reported that the responses were occasionally repetitive or somewhat robotic (i.e., the responses followed a specific formula of reflection, validation, question). Participants also rated their therapeutic alliance with the chatbot highly, mirroring previous studies which found that users can develop a therapeutic alliance with chatbots comparable to that with human therapists [16,22]. The researchers’ coding of technical outcomes corroborated these user ratings, with high levels of agreement between users and researchers. These findings are consistent with previous research [29], which highlighted the potential for chatbots to provide feasible relationship support but also pointed out similar limitations, such as an occasional lack of human-like nuance. While the chatbot performed well in technical aspects like error management and context awareness, the single-session design limited our ability to assess its long-term capabilities.

Strengths and limitations

This study has several strengths that contribute to the growing body of research on digital interventions for relationship support. One key strength is its randomized controlled trial (RCT) design, which ensures a high level of internal validity by minimizing selection bias and providing a robust comparison between the chatbot intervention and the writing task. We also included a follow-up two weeks later, which gave participants time to apply the solutions they had decided on with the chatbot. Another notable strength is the comprehensive evaluation of clinical, feasibility, and technical outcomes, which allowed for a holistic assessment of the chatbot’s efficacy, user experience, and overall performance. Finally, the use of a GPT-4o-based chatbot, Amanda, represents an innovative application of state-of-the-art AI technology, further advancing research on digital therapeutic tools.

However, this study also has several limitations that should be acknowledged. First, the single-session format limits the potential to assess one of the key advantages of chatbots over self-guided online interventions: their ability to build a therapeutic alliance over time. While chatbots like Amanda hold the promise of reducing dropout rates by fostering this alliance, we were unable to evaluate this potential in the current design. Another limitation is the non-clinical nature of the sample. Participants in this study scored well above the cutoff for clinically significant distress (13.5) and reported high levels of relationship confidence (nearly 6/7), which likely reduced the room for improvement and, consequently, the effect sizes observed. Furthermore, participants were paid for their time and were not actively seeking relationship interventions, limiting the generalizability of these findings to clinical populations. The sample’s willingness to engage with online interventions might also introduce selection bias, reducing the applicability of these findings to broader populations. A further limitation is that we used a per-protocol approach rather than an intention-to-treat analysis. While this allowed us to assess intervention effects among participants who completed the intervention as intended, it may overestimate efficacy by excluding those who disengaged. As such, the results may reflect ideal conditions rather than real-world effectiveness. Future studies should consider using intention-to-treat analyses to better capture the potential impact of dropout and incomplete engagement. Finally, Amanda is based on the LLM GPT-4o, and we did not compare it against other high-performing LLMs, such as Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5, or Meta’s Llama 3.0, which may perform differently from GPT-4o.

Future directions

Future research should explore the use of GPT-4o-based chatbots in multi-session interventions to assess their ability to build a sustained therapeutic alliance and improve long-term outcomes. Given the limitations of the single-session format in this study, future studies should examine whether extended interactions with a chatbot can enhance engagement and reduce dropout rates, a common issue in digital interventions. Additionally, future trials should focus on clinical samples, including participants experiencing clinically significant distress, to better understand the efficacy of chatbot interventions in populations that more closely resemble those seeking therapeutic support. This would provide insights into how well these tools perform in real-world, high-need contexts. Another avenue for research could involve comparing different large language models (LLMs), such as GPT-4o, Claude, and Llama, to identify which models are best suited for relationship interventions and mental health support. Finally, while this study focused on relationship support, future research could expand the scope to explore the feasibility and efficacy of chatbot interventions across other domains, such as individual therapy or family counseling.

Implications for theory, research, and practice

The findings of this study have several important implications for theory, research, practice, and policy. Theoretically, this study reinforces the growing body of literature suggesting that digital interventions such as chatbots are accessible, easy to disseminate, and have the potential to serve as an alternative to traditional face-to-face interventions, assuming effect sizes grow as the number and length of sessions increase. This aligns with emerging theories on digital mental health, which emphasize the scalability and accessibility of AI-driven interventions. From a practical perspective, GPT-4o-based chatbots like Amanda offer a promising tool for delivering relationship support at scale, particularly for individuals who face barriers to accessing in-person therapy. However, a critical concern highlighted by this study is the chatbot’s ability to identify and respond to safety risks. While Amanda explored potential safety concerns when a participant reported crying because of her sadness, she did not do so when one participant described severe anxiety or when another said they felt like “just giving up.” We had screened out all participants with potential safety concerns, so no participant was in danger; however, the chatbot was not told that participants had been screened, so it should still have checked in with them. This underscores the importance of refining chatbot systems to better handle safety concerns, especially when used in mental health interventions. Policymakers should consider the need for robust safety protocols when integrating AI tools into therapeutic settings to ensure that vulnerable individuals are protected. One possibility is to pair the chatbot therapist with an “AI supervisor”: a second model instructed to focus exclusively on potential risks, which can take over the conversation when it identifies one.
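The “AI supervisor” idea can be illustrated with a minimal routing sketch. This is a hypothetical design, not part of the system evaluated in this study: a production supervisor would be a dedicated risk-classification model monitoring each turn, whereas the keyword screen and all names below are invented stand-ins for illustration.

```python
# Hypothetical sketch: a supervisor layer that screens each user message for
# risk before it reaches the therapist chatbot. A real system would replace
# the keyword screen with a dedicated risk-classification model.

RISK_PHRASES = [
    "giving up",
    "end it all",
    "hurt myself",
    "no reason to live",
]

def supervisor_flags_risk(message: str) -> bool:
    """Return True if the supervisor should take over the conversation."""
    text = message.lower()
    return any(phrase in text for phrase in RISK_PHRASES)

def route_message(message: str) -> str:
    """Route user input to the safety protocol or the therapist chatbot."""
    if supervisor_flags_risk(message):
        return "safety_protocol"    # supervisor takes over and checks in on risk
    return "therapist_chatbot"      # normal therapeutic conversation continues
```

Under this design, a message like “I feel like just giving up” would be diverted to the safety protocol rather than receiving a standard therapeutic reply, addressing the missed check-ins observed in this study.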

Conclusion

In conclusion, the present study provided initial evidence of efficacy and feasibility of using an LLM-based chatbot for relationship support and showed that the chatbot displayed effective therapeutic skills. In the future, improving the chatbot’s ability to detect risk should be a priority, particularly when working with clinical populations who may be at greater risk of harm. Additionally, while this study demonstrated the efficacy and feasibility of single-session interventions, further exploration of how chatbots can support ongoing therapy and address safety in multi-session formats is crucial. By advancing chatbot technology and integrating more sophisticated risk detection mechanisms, AI-based interventions may become a safe and scalable option for both therapeutic support and crisis management.

Data transparency statement: The hypotheses related to the attitudes toward the chatbot and digital health interventions were analyzed and reported in a preprint which is currently under review: [blinded for peer-review].

References

  1. Proulx CM, Helms HM, Buehler C. Marital Quality and Personal Well‐Being: A Meta‐Analysis. J Marriage Fam. 2007;69(3):576–93.
  2. Robles TF, Slatcher RB, Trombello JM, McGinn MM. Marital quality and health: a meta-analytic review. Psychol Bull. 2014;140(1):140–87. pmid:23527470
  3. Shaffer KM, Mayberry LS, Salivar EG, Doss BD, Lewis AM, Canter K. Dyadic digital health interventions: Their rationale and implementation. Procedia Comput Sci. 2022;206:183–94. pmid:36397858
  4. Whisman MA, Beach SRH, Snyder DK. Is marital discord taxonic and can taxonic status be assessed reliably? Results from a national, representative sample of married couples. J Consult Clin Psychol. 2008;76(5):745–55.
  5. Doss BD, Roddy MK, Wiebe SA, Johnson SM. A review of the research during 2010-2019 on evidence-based treatments for couple relationship distress. J Marital Fam Ther. 2022;48(1):283–306. pmid:34866194
  6. Roddy MK, Walsh LM, Rothman K, Hatch SG, Doss BD. Meta-analysis of couple therapy: Effects across outcomes, designs, timeframes, and other moderators. J Consult Clin Psychol. 2020;88(7):583–96. pmid:32551734
  7. Cicila LN, Georgia EJ, Doss BD. Incorporating Internet-based Interventions into Couple Therapy: Available Resources and Recommended Uses. Aust N Z J Fam Ther. 2014;35(4):414–30. pmid:26405375
  8. Salivar EJ, Doss B. Web-based couple interventions: do they have a future? J Couple Relatsh Ther. 2013;12:168–85.
  9. Hatch SG, Knopp K, Le Y, Allen MOT, Rothman K, Rhoades GK, et al. Online relationship education for help-seeking low-income couples: A Bayesian replication and extension of the OurRelationship and ePREP programs. Fam Process. 2022;61(3):1045–61. pmid:34383314
  10. Megale A, Peterson E, Friedlander ML. How Effective is Online Couple Relationship Education? A Systematic Meta-Content Review. Contemp Fam Ther. 2022;44(3):294–304. pmid:34025019
  11. Rafieifar M, Hanbidge AS, Lorenzini SB, Macgowan MJ. Comparative efficacy of online vs. face-to-face group interventions: A systematic review. Res Soc Work Pract. 2024.
  12. Keller A, Babl A, Berger T, Schindler L. Efficacy of the web-based PaarBalance program on relationship satisfaction, depression and anxiety – a randomized controlled trial. Internet Interv. 2021;23:100360.
  13. Rothman K, Roddy MK, Doss BD. Completion of a stand-alone versus coach-supported trial of a web-based program for distressed relationships. Fam Relat. 2019;68(4):375–89.
  14. Abd-Alrazaq AA, Rababeh A, Alajlani M, Bewick BM, Househ M. Effectiveness and safety of using chatbots to improve mental health: systematic review and meta-analysis. J Med Internet Res. 2020;22(7):e16021.
  15. Boucher EM, Harake NR, Ward HE, Stoeckl SE, Vargas J, Minkel J, et al. Artificially intelligent chatbots in digital mental health interventions: a review. Expert Rev Med Devices. 2021;18(sup1):37–49.
  16. He Y, Yang L, Qian C, Li T, Su Z, Zhang Q. Conversational agent interventions for mental health problems: a systematic review and meta-analysis of randomized controlled trials [Internet]. 2022 [cited 2023 Feb 17]. Available from: http://preprints.jmir.org/preprint/43862
  17. Huq SM, Maskeliūnas R, Damaševičius R. Dialogue agents for artificial intelligence-based conversational systems for cognitively disabled: a systematic review. Disabil Rehabil Assist Technol. 2022:1–20.
  18. Jabir AI, Martinengo L, Lin X, Torous J, Subramaniam M, Tudor Car L. Evaluating conversational agents for mental health: A scoping review of outcome measurement instruments. J Med Internet Res. 2022.
  19. Laranjo L, Dunn AG, Tong HL, Kocaballi AB, Chen J, Bashir R, et al. Conversational agents in healthcare: a systematic review. J Am Med Inform Assoc. 2018;25(9):1248–58.
  20. Vaidyam AN, Wisniewski H, Halamka JD, Kashavan MS, Torous JB. Chatbots and Conversational Agents in Mental Health: A Review of the Psychiatric Landscape. Can J Psychiatry. 2019;64(7):456–64. pmid:30897957
  21. Martinengo L, Lum E, Car J. Evaluation of chatbot-delivered interventions for self-management of depression: content analysis. J Affect Disord. 2022;319:598–607.
  22. Beatty C, Malik T, Meheli S, Sinha C. Evaluating the Therapeutic Alliance With a Free-Text CBT Conversational Agent (Wysa): A Mixed-Methods Study. Front Digit Health. 2022;4:847991. pmid:35480848
  23. McGrath SP, Kozel BA, Gracefo S, Sutherland N, Danford CJ, Walton N. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions. J Am Med Inform Assoc. 2024.
  24. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems [Internet]. arXiv; 2023 [cited 2023 Dec 16]. Available from: http://arxiv.org/abs/2303.13375
  25. Spallek S, Birrell L, Kershaw S, Devine EK, Thornton L. Can we use ChatGPT for Mental Health and Substance Use Education? Examining Its Quality and Potential Harms. JMIR Med Educ. 2023;9:e51243. pmid:38032714
  26. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med. 2023;183(6):589–96. pmid:37115527
  27. Vowels LM. Are chatbots the new relationship experts? Insights from three studies. Comput Hum Behav Artif Hum. 2024;2(2):100077.
  28. Melo A, Silva I, Lopes J. ChatGPT: A Pilot Study on a Promising Tool for Mental Health Support in Psychiatric Inpatient Care. Int J Psychiatr Trainees [Internet]. 2024 Feb 9 [cited 2024 Sept 5]. Available from: https://ijpt.scholasticahq.com/article/92367-chatgpt-a-pilot-study-on-a-promising-tool-for-mental-health-support-in-psychiatric-inpatient-care
  29. Vowels LM, Francois-Walcott RRR, Darwiche J. AI in relationship counselling: Evaluating ChatGPT’s therapeutic capabilities in providing relationship advice. Comput Hum Behav Artif Hum. 2024;2(2):100078.
  30. Finkel EJ, Slotter EB, Luchies LB, Walton GM, Gross JJ. A brief intervention to promote conflict reappraisal preserves marital quality over time. Psychol Sci. 2013;24(8):1595–601.
  31. Vowels LM, Francois-Walcott RRR, Grandjean M, Darwiche J, Vowels MJ. Navigating relationships with GenAI chatbots: User attitudes, acceptability, and potential. Comput Hum Behav Artif Hum. 2025;5:100183.
  32. Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13. pmid:11556941
  33. Sherin KM, Sinacore JM, Li XQ, Zitter RE, Shakil A. HITS: a short domestic violence screening tool for use in a family practice setting. Fam Med. 1998;30(7):508–12. pmid:9669164
  34. Gursoy D, Chi OH, Lu L, Nunkoo R. Consumers acceptance of artificially intelligent (AI) device use in service delivery. Int J Inf Manag. 2019;49:157–69.
  35. Apolinário-Hagen J, Harrer M, Kählke F, Fritsche L, Salewski C, Ebert DD. Public Attitudes Toward Guided Internet-Based Therapies: Web-Based Survey Study. JMIR Ment Health. 2018;5(2):e10735. pmid:29764797
  36. Rhoades GK, Stanley SM, Markman HJ. The pre-engagement cohabitation effect: a replication and extension of previous findings. J Fam Psychol. 2009;23(1):107–11. pmid:19203165
  37. Bodenmann G. Dyadic Coping Inventory. In: Encyclopedia of Couple and Family Therapy. 2018.
  38. Futris TG, Campbell K, Nielsen RB, Burwell SR. The communication patterns questionnaire - short form: A review and assessment. Fam J. 2010;18(3):275–87.
  39. Roddy MK, Stamatis CA, Rothman K, Doss BD. Mechanisms of change in a brief, online relationship intervention. J Fam Psychol. 2020;34(1):57–67. pmid:31380690
  40. Bech P. Measuring the dimension of psychological general well-being by the WHO-5. Quality of Life Newsletter. 2004;32:15–6.
  41. Holmes S, Moorhead A, Bond R, Zheng H, Coates V, Mctear M. Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces? 2019. 207 p.
  42. Falkenström F, Hatcher RL, Skjulsvik T, Larsson MH, Holmqvist R. Development and validation of a 6-item working alliance questionnaire for repeated administrations during psychotherapy. Psychol Assess. 2015;27(1):169–83. pmid:25346997
  43. Kassambara A. rstatix: Pipe-Friendly Framework for Basic Statistical Tests [Internet]. 2023 [cited 2024 Aug 2]. Available from: https://cran.r-project.org/web/packages/rstatix/index.html
  44. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Routledge; 1988.