Assessment of ChatGPT-generated medical Arabic responses for patients with metabolic dysfunction–associated steatotic liver disease

  • Saleh A. Alqahtani,

    Roles Supervision, Writing – original draft, Writing – review & editing

    salalqahtani@kfshrc.edu.sa

    Affiliations Liver, Digestive, and Lifestyle Health Research Section, and Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia, Division of Gastroenterology and Hepatology, Weill Cornell Medicine, New York, New York, United States of America

  • Reem S. AlAhmed,

    Roles Formal analysis, Writing – original draft

    Affiliation Liver, Digestive, and Lifestyle Health Research Section, and Biostatistics, Epidemiology and Scientific Computing Department, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia

  • Waleed S. AlOmaim,

    Roles Formal analysis, Writing – review & editing

    Affiliation Department of Pathology and Laboratory Medicine, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia

  • Saad Alghamdi,

    Roles Data curation, Writing – review & editing

    Affiliation Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia

  • Waleed Al-Hamoudi,

    Roles Data curation, Writing – review & editing

    Affiliation Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia

  • Khalid Ibrahim Bzeizi,

    Roles Formal analysis, Validation, Writing – review & editing

    Affiliation Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia

  • Ali Albenmousa,

    Roles Formal analysis, Writing – review & editing

    Affiliation Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia

  • Alessio Aghemo,

    Roles Conceptualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Biomedical Sciences, Humanitas University, Pieve Emanuele (MI), Italy, Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, Rozzano (MI), Italy

  • Nicola Pugliese,

    Roles Conceptualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Biomedical Sciences, Humanitas University, Pieve Emanuele (MI), Italy, Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, Rozzano (MI), Italy

  • Cesare Hassan,

    Roles Conceptualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Biomedical Sciences, Humanitas University, Pieve Emanuele (MI), Italy, Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, Rozzano (MI), Italy

  • Faisal A. Abaalkhail

    Roles Formal analysis, Writing – review & editing

    Affiliations Gastroenterology Section, Department of Medicine, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia, College of Medicine, Alfaisal University, Riyadh, Saudi Arabia

Abstract

Background and aim

Artificial intelligence (AI)-powered chatbots, such as Chat Generative Pretrained Transformer (ChatGPT), have shown promising results in healthcare settings. These tools can help patients obtain real-time responses to queries, ensuring immediate access to relevant information. The study aimed to explore the potential use of ChatGPT-generated medical Arabic responses for patients with metabolic dysfunction–associated steatotic liver disease (MASLD).

Methods

An English patient questionnaire on MASLD was translated to Arabic. The Arabic questions were then entered into ChatGPT 3.5 on November 12, 2023. The responses were evaluated for accuracy, completeness, and comprehensibility by 10 Saudi MASLD experts who were native Arabic speakers. Likert scales were used to evaluate: 1) Accuracy, 2) Completeness, and 3) Comprehensibility. The questions were grouped into 3 domains: (1) Specialist referral, (2) Lifestyle, and (3) Physical activity.

Results

Accuracy mean score was 4.9 ± 0.94 on a 6-point Likert scale corresponding to “Nearly all correct.” Kendall’s coefficient of concordance (KCC) ranged from 0.025 to 0.649, with a mean of 0.28, indicating moderate agreement between all 10 experts. Mean completeness score was 2.4 ± 0.53 on a 3-point Likert scale corresponding to “Comprehensive” (KCC: 0.03–0.553; mean: 0.22). Comprehensibility mean score was 2.74 ± 0.52 on a 3-point Likert scale, which indicates the responses were “Easy to understand” (KCC: 0.00–0.447; mean: 0.25).

Conclusion

MASLD experts found that ChatGPT responses were accurate, complete, and comprehensible. The results support the increasing trend of leveraging the power of AI chatbots to revolutionize the dissemination of information for patients with MASLD. However, many AI-powered chatbots require further enhancement of scientific content to avoid the risks of circulating medical misinformation.

Introduction

Metabolic dysfunction–associated steatotic liver disease (MASLD), formerly known as non-alcoholic fatty liver disease (NAFLD), is a global health concern, closely linked to the obesity epidemic and sedentary lifestyles [1, 2]. MASLD encompasses a spectrum of conditions resulting from metabolic imbalances, including metabolic dysfunction-associated steatohepatitis (MASH), previously called non-alcoholic steatohepatitis (NASH) [1, 3]. MASLD and MASH pose enormous financial and health burdens across countries, including those in the Arabic-speaking world [4–6]. Early detection and treatment of MASLD are crucial to prevent progression to more severe stages such as cirrhosis and hepatocellular carcinoma [1, 7]. However, barriers to healthcare access and patient literacy may create challenges in managing this condition effectively [8, 9].

In the digital age, artificial intelligence (AI) applications in healthcare offer innovative solutions to such challenges. Chatbots powered by advanced AI models, like the Chat Generative Pretrained Transformer (ChatGPT), can supplement patient education and engagement outside the clinical setting [10]. With their ability to process and produce human-like text, these AI tools can deliver instant, reliable medical information and support, potentially transforming patient self-management practices [11, 12].

A previous study evaluated ChatGPT’s effectiveness in answering patient inquiries concerning MASLD and associated lifestyle factors. Its findings indicated that ChatGPT delivered responses that were accurate (mean score of 4.84 on a 6-point Likert scale), comprehensive (mean score of 2.08 on a 3-point scale), and easy to understand (mean score of 2.87 on a 3-point scale). Nonetheless, ChatGPT’s responses may vary with factors such as the training dataset, context, and language [13].

Despite the promise of AI-powered interventions, their effectiveness for Arabic-speaking patients with MASLD remains underexplored. We aimed to explore the potential use of ChatGPT in generating medical responses in Arabic for patients with MASLD, assessing its accuracy, reliability, and comprehensiveness as an informative resource.

Materials and methods

A cross-sectional study assessed the effectiveness of ChatGPT in providing medical responses to Arabic-speaking patients with MASLD. The process followed three main steps: 1) A validated English-language patient questionnaire on MASLD [13] was translated into Arabic by the MASLD experts and an independent researcher, ensuring linguistic and contextual accuracy from a patient perspective; 2) The translated questions were then entered separately into ChatGPT 3.5 on November 12, 2023, simulating a realistic scenario where a patient seeks information regarding MASLD; and 3) Ten MASLD experts from Saudi Arabia, who were native Arabic speakers and fluent in English, independently evaluated the AI-generated responses. The data were collected from 01/31/2024 through 02/10/2024. For the survey and questionnaire, we primarily used Classical Arabic, which is the standard for formal and business writing, ensuring a common linguistic framework across diverse Arabic-speaking populations.

Three domains were assessed using respective Likert scales: 1) Accuracy: Responses were rated on a 6-point Likert scale ranging from ’Completely incorrect’ to ’Correct’; 2) Completeness: A 3-point Likert scale was utilized, categorizing responses as ’Incomplete’, ’Adequate’, or ’Comprehensive’; and 3) Comprehensibility: The intelligibility of responses was determined using a 3-point Likert scale marked by ’Difficult’, ’Partly difficult’, and ’Easy to understand’.

An additional open-ended question was integrated into the Arabic questionnaire to gather detailed feedback and expert commentary on the AI-generated response quality. This structured evaluation method aimed to capture the nuanced perspectives of clinical experts regarding the application of ChatGPT in patient education and its potential role in improving MASLD patient care within Arabic-speaking populations.

Statistical analysis

The data were analyzed using the Statistical Package for Social Sciences (SPSS), version 28 (IBM Corp., N.Y., USA). To assess the potential usability of ChatGPT’s Arabic responses for patients with MASLD, the non-parametric Kendall’s tau correlation test was employed. It examined the association between experts’ ratings, using ordinal data from the Likert scale assessments for the three domains, to determine the direction and strength of relationships between the variables under study. Mean scores (measured on 6- and 3-point Likert scales), Kendall’s coefficient of concordance, and range values are reported.
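As an illustration of the agreement statistic named above, the following is a minimal sketch of how a tie-corrected Kendall’s coefficient of concordance (W) can be computed from a raters-by-questions matrix of Likert scores. It is not the authors’ SPSS procedure, whose exact settings are not described; the function names `average_ranks` and `kendalls_w` and the input layout are assumptions made for this example.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Tie-corrected Kendall's W; ratings[r][q] is the score rater r gave question q."""
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of rated questions
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[q] for row in rank_rows) for q in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of tied scores per rater.
    ties = sum(t**3 - t for row in ratings for t in Counter(row).values())
    denom = m**2 * (n**3 - n) - m * ties
    if denom == 0:
        return float("nan")  # every rater gave one constant score
    return 12 * s / denom
```

W ranges from 0 (no agreement in how raters order the questions) to 1 (identical ordering). Because Likert ratings contain many tied scores, the tie correction in the denominator matters more here than it would for continuous measurements.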

Ethical statement

Ethical approval for this study was obtained from the Research Ethics Committee (REC) of King Faisal Specialist Hospital & Research Center, Riyadh, Saudi Arabia (RAC #2241013) on 01/29/2024. The REC recommended the approval of the study with a waiver of signing and documentation of consent. The decision of participant MASLD experts to submit the survey was considered consent.

Results

Accuracy

The mean score for accuracy was 4.92 ± 0.94 on a 6-point Likert scale, corresponding to “Nearly all correct”. Kendall’s coefficient of concordance ranged from 0.025 to 0.649, with a mean of 0.28, indicating a moderate level of agreement among all 10 experts. The highest mean score was for question 5 (5.3), corresponding to “Correct”. The lowest was for question 13 (4.3), corresponding to “More correct than incorrect”. Among the three domains, Physical Activity had the highest mean accuracy score of 5.07 ± 0.83, while Specialist Referral had the lowest mean score of 4.70 ± 1.02 (Fig 1).

Fig 1. Accuracy score.

Box plot showing the distribution of accuracy scores for each question. Graph shows the interquartile range (box), median (horizontal line), mean (dot), and outliers (whiskers).

https://doi.org/10.1371/journal.pone.0317929.g001

Completeness

The mean score for completeness was 2.37 ± 0.53 on a 3-point Likert scale, corresponding to “Comprehensive”. Kendall’s coefficient ranged from 0.03 to 0.553, with a mean of 0.22, indicating a moderate level of agreement among all 10 experts. The highest question mean score was 2.6 for question 8, corresponding to “Comprehensive”. The lowest was 2.1 for question 1, corresponding to “Adequate”. Among the three domains, Physical Activity had the highest mean score of 2.43 ± 0.57, while Specialist Referral had the lowest mean score of 2.20 ± 0.48 (Fig 2).

Fig 2. Completeness score.

Box plot showing the distribution of completeness scores for each question. Graph shows the interquartile range (box), median (horizontal line), mean (dot), and outliers (whiskers).

https://doi.org/10.1371/journal.pone.0317929.g002

Comprehensibility

The average comprehensibility score was 2.74 ± 0.52, which indicates that the ChatGPT-generated responses were “Easy to understand”. Kendall’s coefficient ranged from 0.00 to 0.447, with a mean of 0.25, indicating a moderate level of agreement among all 10 experts. The highest question mean score, 2.9, was observed for questions 2, 3, 6, 8, and 10; the lowest, 2.4, was observed for questions 7 and 14. Among the three domains, Physical Activity had the highest mean score of 2.83 ± 0.38, while Specialist Referral had the lowest mean score of 2.50 ± 0.63 (Fig 3).

Fig 3. Comprehensibility score.

Box plot showing the distribution of comprehensibility scores for each question. Graph shows the interquartile range (box), median (horizontal line), mean (dot), and outliers (whiskers).

https://doi.org/10.1371/journal.pone.0317929.g003

Expert comments

When responses were compared by the frequency of expert comments, the responses to questions 8–10 and 12 received no comments from the experts. Questions 1, 5, and 14 drew the most comments, each receiving more than one expert comment. Grouped by theme, the most frequently repeated comments concerned: 1) the use of the term “NAFLD/NASH” instead of “MASLD/MASH” in the generated responses; 2) the translation of “Biopsy” in the Arabic-generated responses; 3) the translation of “MRI” in the Arabic-generated responses; and 4) the sentences on alcohol consumption in the Arabic-generated responses.

Discussion

AI is significantly impacting the medical field, including gastroenterology and hepatology [14, 15]. In recent years, AI has been successfully applied in liver pathology and radiology to improve diagnostic accuracy and reduce inter- and intra-observer variability [14–16]. Recently, significant attention has been paid to the clinical applications of AI-based chatbots, specifically ChatGPT, in various contexts, including its potential use as an immediate, free, and on-demand information dissemination tool for patients with MASLD [14]. Identifying effective information dissemination tools for patients with MASLD is a clinical priority for disease management, as MASLD management needs a multidisciplinary approach [17]. Patient education and information dissemination are essential components in helping patients achieve and maintain lifestyle changes [18, 19]. AI-based chatbots could be a valuable tool for patients by providing simplified explanations and guidance on first-line treatment options and disease management, such as weight loss and physical activity recommendations.

Pugliese et al. [13] recently conducted the first study on ChatGPT 3.5 as an information dissemination tool for patients with MASLD, demonstrating that ChatGPT 3.5 can provide understandable and complete answers from the patient’s perspective to 15 pre-defined MASLD-related questions in English. The AI-generated answers were evaluated by 10 experts and found to be relatively accurate [13]. In addition, preliminary data from another study by the same authors showed that using a language other than English did not seem to affect the effectiveness of ChatGPT as a resource tool for patients with MASLD [20]. To date, no study has assessed the effectiveness of AI-powered interventions for Arabic-speaking patients with MASLD.

In our study, 10 MASLD experts from Saudi Arabia who were native Arabic speakers evaluated responses to the same set of questions that were previously analyzed in English. We found that ChatGPT’s ability to advise patients with MASLD was not affected by language, as the Arabic answers were deemed to be complete (with a mean score of 2.4 on a 3-point scale) and comprehensible (with a mean score of 2.74 on a 3-point scale). However, consistent with other studies, the accuracy of ChatGPT still requires improvement, with a mean score of 4.9 on a 6-point Likert scale (Table 1). So, while the Arabic language does not influence the completeness and accuracy of ChatGPT-generated answers, it also does not improve the inaccuracies observed in clinically meaningful answers. As in the previous study conducted in English [13], the Physical Activity domain had the highest score for the Arabic questionnaire (Table 2).

Table 1. Comparing the mean score result between the Arabic and English responses [13].

https://doi.org/10.1371/journal.pone.0317929.t001

Table 2. Comparing domains mean score result between the Arabic and English responses [13].

https://doi.org/10.1371/journal.pone.0317929.t002

Limitations

Ten experts in the field of MASLD conducted the ratings using Likert scales. However, it is important to note that such scales have limitations as they allow for partial accuracy ratings. This is unacceptable in the medical field as it can lead to misunderstandings and dangerous consequences for patients. Another limitation is the availability of new and potentially better versions of ChatGPT (ChatGPT 4), as the study used version 3.5. However, it should be noted that ChatGPT 4 is not freely accessible to patients and thus it is unlikely to be used any time soon. While a variety of large language models are accessible, including free options, our decision to employ ChatGPT was primarily driven by methodological consistency. To ensure a reliable comparison between English and Arabic responses, it was crucial to maintain a standardized approach. By utilizing the same AI tool, we could isolate the impact of language differences on the generated content. We acknowledge the rapid advancements in AI technology and the potential benefits of exploring diverse models which may improve in accuracy and cultural relevance. Future research endeavors will undoubtedly involve a comparative analysis of various AI tools to assess their relative strengths and weaknesses in different language contexts.

In addition, it is crucial to consider the impact of socio-cultural factors on ChatGPT responses. The sociocultural background of the patient determines the tool’s capacity to offer culturally sensitive guidance, and patient preferences, health literacy levels, and cultural quirks may all affect how successful the responses are. Therefore, even if ChatGPT is a useful tool, it needs to be used with consideration for the patients’ cultural variety [20]. Chatbots also have other known limitations, including the risk of generating content that may not be grounded in evidence-based knowledge, a phenomenon known as ’hallucinations’ [21]. Retrieval-augmented generation (RAG) is a potential method to address this issue. RAG combines the response-generating ability of AI-based chatbots with the ability to pull in verified information from external sources, resulting in more accurate and complete answers. There is a growing trend not only in acquiring information from AI-based apps and services but also in decision-making based on such information. Hence, the professional community should use AI responsibly by following the principles and ethics associated with it.
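The RAG idea described above can be illustrated with a toy sketch: retrieve the most relevant vetted passages for a patient question, then place them in the prompt so that the chatbot answers from those sources. This is not part of the study. The passages (paraphrased from statements in this article), the word-overlap scoring, and the names `retrieve` and `build_prompt` are all assumptions made for illustration; a real system would typically use a curated guideline corpus and a stronger retrieval method, and would send the prompt to a language model.

```python
import re

# Hypothetical stand-in for a vetted knowledge base (assumption for this sketch).
PASSAGES = [
    "MASLD was formerly known as NAFLD.",
    "Weight loss and physical activity are first-line options in MASLD management.",
    "Early detection of MASLD helps prevent progression to cirrhosis.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, passages, k=2):
    """Return the k passages sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that restricts the answer to the retrieved sources."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the sources below.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}"
    )
```

Calling `build_prompt("Is physical activity recommended?", PASSAGES, k=1)` yields a prompt whose only source is the passage on weight loss and physical activity, which is the grounding step that distinguishes RAG from unconstrained generation.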

Conclusions

This study addresses the critical requirement for AI tools in the Arabic-speaking world, where the prevalence of MASLD is estimated to be higher than in Western countries [22]. Although our study confirms the promising results obtained by previous studies, the universal adoption of ChatGPT as a resource tool for MASLD patients is challenging [13, 20]. The identified limitations highlight the need for continued improvement of AI models in healthcare settings. Such improvement requires collaboration between AI experts and healthcare professionals. While the study results show that the AI-generated responses are accurate and consistent, patients should be informed not to replace conventional doctor visits with these technologies, as they serve specifically to provide patient education material and are not a means of obtaining a medical diagnosis or consultation.

Supporting information

S2 Table. Completeness Likert scale reference.

https://doi.org/10.1371/journal.pone.0317929.s002

(DOCX)

S3 Table. Comprehensiveness Likert scale reference.

https://doi.org/10.1371/journal.pone.0317929.s003

(DOCX)

S7 Table. Accuracy—Kendall’s tau analysis.

https://doi.org/10.1371/journal.pone.0317929.s007

(DOCX)

S9 Table. Comprehensiveness Kendall’s tau.

https://doi.org/10.1371/journal.pone.0317929.s009

(DOCX)

References

  1. Chan WK, Chuah KH, Rajaram RB, Lim LL, Ratnasingam J, Vethakkan SR. Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD): A State-of-the-Art Review. J Obes Metab Syndr. 2023;32(3):197–213. pmid:37700494
  2. Zelber-Sagi S, Ratziu V, Oren R. Nutrition and physical activity in NAFLD: An overview of the epidemiological evidence. World J Gastroenterol WJG. 2011 Aug 7;17(29):3377–89. pmid:21876630
  3. Staufer K, Stauber RE. Steatotic Liver Disease: Metabolic Dysfunction, Alcohol, or Both? Biomedicines. 2023 Jul 26;11(8):2108. pmid:37626604
  4. Alqahtani S. A., Broering D. C., Alghamdi S. A., Bzeizi K. I., Alhusseini N., Alabbad S. I., et al. (2021). Changing trends in liver transplantation indications in Saudi Arabia: from hepatitis C virus infection to nonalcoholic fatty liver disease. BMC gastroenterology, 21(1), 245. pmid:34074270
  5. Coker T., Saxton J., Retat L., Alswat K., Alghnam S., Al-Raddadi R. M., et al. (2022). The future health and economic burden of obesity-attributable type 2 diabetes and liver disease among the working-age population in Saudi Arabia. PloS one, 17(7), e0271108. pmid:35834577
  6. Golabi P., Paik J. M., AlQahtani S., Younossi Y., Tuncer G., & Younossi Z. M. (2021). Burden of non-alcoholic fatty liver disease in Asia, the Middle East and North Africa: Data from Global Burden of Disease 2009–2019. Journal of hepatology, 75(4), 795–809. pmid:34081959
  7. Yin X, Guo X, Liu Z, Wang J. Advances in the Diagnosis and Treatment of Non-Alcoholic Fatty Liver Disease. Int J Mol Sci. 2023 Feb 2;24(3):2844. pmid:36769165
  8. Lazarus JV, Colombo M, Cortez-Pinto H, Huang TTK, Miller V, Ninburg M, et al. NAFLD—sounding the alarm on a silent epidemic. Nat Rev Gastroenterol Hepatol. 2020 Jul;17(7):377–9. pmid:32514153
  9. Allen-Meares P, Lowry B, Estrella ML, Mansuri S. Health Literacy Barriers in the Health Care System: Barriers and Opportunities for the Profession. Health Soc Work. 2020 Jan 28;45(1):62–4. pmid:31993624
  10. Chakraborty C, Pal S, Bhattacharya M, Dash S, Lee SS. Overview of Chatbots with special emphasis on artificial intelligence-enabled ChatGPT in medical science. Front Artif Intell. 2023 Oct 31;6:1237704. pmid:38028668
  11. Javaid M, Haleem A, Singh RP. ChatGPT for healthcare services: An emerging stage for an innovative perspective. BenchCouncil Trans Benchmarks Stand Eval. 2023 Feb 1;3(1):100105.
  12. Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023 Sep 22;23(1):689. pmid:37740191
  13. Pugliese N, Wai-Sun Wong V, Schattenberg JM, Romero-Gomez M, Sebastiani G, NAFLD Expert Chatbot Working Group, et al. Accuracy, Reliability, and Comprehensibility of ChatGPT-Generated Medical Responses for Patients With Nonalcoholic Fatty Liver Disease. Clin Gastroenterol Hepatol Off Clin Pract J Am Gastroenterol Assoc. 2023 Sep 15;S1542-3565(23)00704–8. pmid:37716618
  14. Le Berre C, Sandborn WJ, Aridhi S, et al. Application of Artificial Intelligence to Gastroenterology and Hepatology. Gastroenterology. 2020;158(1):76–94.e2. pmid:31593701
  15. Schattenberg JM, Chalasani N, Alkhouri N. Artificial Intelligence Applications in Hepatology. Clin Gastroenterol Hepatol. 2023;21(8):2015–2025. pmid:37088460
  16. Nam D, Chapiro J, Paradis V, Seraphin TP, Kather JN. Artificial intelligence in liver diseases: Improving diagnostics, prognostics and response prediction. JHEP Rep. 2022;4(4):100443. Published 2022 Feb 2. pmid:35243281
  17. Rinella ME, Neuschwander-Tetri BA, Siddiqui MS, et al. AASLD Practice Guidance on the clinical assessment and management of nonalcoholic fatty liver disease. Hepatology. 2023;77(5):1797–1835. pmid:36727674
  18. Pugliese N, Plaz Torres MC, Petta S, Valenti L, Giannini EG, Aghemo A. Is there an ’ideal’ diet for patients with NAFLD?. Eur J Clin Invest. 2022;52(3):e13659. pmid:34309833
  19. Balakrishnan M, Liu K, Schmitt S, et al. Behavioral weight-loss interventions for patients with NAFLD: A systematic scoping review. Hepatol Commun. 2023;7(8):e0224. Published 2023 Aug 3. pmid:37534947
  20. Pugliese N., Polverini D., Lombardi R., Pennisi G., Ravaioli F., Armandi A., et al., & NAFLD Expert Chatbot Working Group. (2024). Evaluation of ChatGPT as a Counselling Tool for Italian-Speaking MASLD Patients: Assessment of Accuracy, Completeness and Comprehensibility. Journal of Personalized Medicine, 14(6), 568. pmid:38929789
  21. Goddard J. Hallucinations in ChatGPT: A Cautionary Tale for Biomedical Researchers. Am J Med. 2023;136(11):1059–1060. pmid:37369274
  22. Younossi ZM, Golabi P, Paik JM, Henry A, Van Dongen C, Henry L. The global epidemiology of nonalcoholic fatty liver disease (NAFLD) and nonalcoholic steatohepatitis (NASH): a systematic review. Hepatology. 2023;77(4):1335–1347. pmid:36626630