
Evaluating chatbots in psychiatry: Rasch-based insights into clinical knowledge and reasoning

  • Yu Chang,

    Roles Conceptualization, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Department of Psychiatry, Chung Shan Medical University Hospital, Taichung, Taiwan, School of Medicine, Chung Shan Medical University, Taichung, Taiwan

  • Si-Sheng Huang,

    Roles Conceptualization, Writing – original draft

    Affiliations Post Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, Department of Psychiatry, Changhua Christian Hospital, Changhua, Taiwan

  • Wen-Yu Hsu,

    Roles Methodology, Writing – original draft

    Affiliations School of Medicine, Chung Shan Medical University, Taichung, Taiwan, Post Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, Department of Psychiatry, Changhua Christian Hospital, Changhua, Taiwan

  • Yi-Chun Liu

    Roles Conceptualization, Formal analysis, Validation, Writing – original draft, Writing – review & editing

    183711@cch.org.tw

    Affiliations Post Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, Department of Psychiatry, Changhua Christian Hospital, Changhua, Taiwan, Department of Psychiatry, Changhua Christian Children’s Hospital, Changhua, Taiwan, Department of Health Business Administration, Hungkuang University, Taichung City, Taiwan

Abstract

Chatbots are increasingly being recognized as valuable tools for clinical support in psychiatry. This study systematically evaluated the clinical knowledge and reasoning of 27 leading chatbots in psychiatry. Using 160 multiple-choice questions from the Taiwan Psychiatry Licensing Examinations and Rasch analysis, we quantified performance and qualitatively assessed reasoning processes. OpenAI’s ChatGPT-o1-preview emerged as the top performer, achieving a Rasch ability score of 2.23, significantly surpassing the passing threshold (p < 0.001). While it excelled in diagnostic and therapeutic reasoning, it also demonstrated notable limitations in factual recall, niche topics, and occasional reasoning biases. Our findings indicate that while advanced chatbots hold significant potential as clinical decision-support tools, their current limitations underscore that rigorous human oversight is indispensable for patient safety. Continuous evaluation and domain-specific training are crucial for the safe integration of these technologies into clinical practice.

Introduction

Chatbots, powered by generative artificial intelligence and trained through deep learning algorithms, are designed to engage in natural conversations [1]. These systems, often referred to as large language models (LLMs), exhibit a remarkable ability to process and respond with contextually relevant information [1]. Recent advancements in large-scale data training and sophisticated reasoning mechanisms have expanded chatbots’ capabilities from general knowledge dissemination to specialized applications [2]. In the medical field, research has demonstrated that chatbots can achieve acceptable levels of performance in various professional examinations and assessments [3–8].

The growing interest in chatbots for clinical support highlights their potential to enhance healthcare delivery [9]. Previous studies have established a connection between clinical competence and foundational knowledge [10], suggesting that an evaluation of chatbots’ clinical knowledge could provide valuable insights into their practical utility. While previous research has used Rasch analysis to assess chatbots’ knowledge in psychiatry, gaps in understanding remain, particularly in certain specialized areas [4]. A comprehensive evaluation of chatbots’ strengths and limitations in psychiatric clinical knowledge is still needed.

Measuring psychiatric clinical knowledge poses unique challenges due to the complexity and context-dependent nature of psychiatric assessments. Past research has attempted to evaluate chatbots in this domain [6,11,12], with one study administering a set of ten multiple-choice questions focused on differential diagnosis to both chatbots and psychiatrists [6]. However, the study primarily relied on surface-level comparisons of scores and lacked sufficient depth to quantify or qualitatively understand the underlying knowledge structures of the chatbots. This limited scope has left aspects of chatbot capabilities underexplored, such as their reasoning processes.

To address these limitations, our study employs Rasch analysis [13], a statistical method commonly used in educational and psychological testing to quantify item difficulty and respondent ability on a unified scale. By applying Rasch analysis to a larger and more diverse set of psychiatry-related questions, we aim to provide a detailed assessment of the strengths and weaknesses of the most advanced chatbots. Our study focuses on exploring these foundational aspects to provide insights that can guide their effective integration into clinical practice.

Materials and methods

Study design and question selection

To evaluate the clinical knowledge of chatbots, this study utilized multiple-choice questions (MCQs) from the Taiwan Psychiatry Licensing Examination administered by the Taiwanese Society of Psychiatry. To comprehensively evaluate chatbot performance, we included questions related to any clinically relevant topic based on all levels outlined in Bloom’s Taxonomy [14]. Specifically, we selected MCQs from the Taiwan Psychiatry Licensing Examination from the past two years (2023 and 2024). These exams represent the first stage of obtaining board certification and consist entirely of MCQs. Questions from these exams are primarily derived from the 12th edition of Kaplan & Sadock’s Synopsis of Psychiatry, ensuring that the content reflects the latest psychiatric knowledge. The questions were crafted by experienced board-certified psychiatrists, and each exam consisted of 100 questions, each worth one point. A score of 60 was set as the passing threshold. To maintain relevance and consistency, questions involving Taiwan-specific laws and policies or those solely testing basic medical knowledge were excluded. This exclusion ensures the broader applicability of our findings, as legal and regulatory frameworks vary across regions; our study focuses on evaluating core psychiatric knowledge and reasoning rather than jurisdiction-specific legal aspects. The remaining questions were categorized into six domains: pathophysiology and epidemiology, diagnostic assessment and clinical examination, psychopharmacology and other therapeutic modalities, psychosocial and cultural influences, neuroscience and behavioral science, and forensic psychiatry and ethics.

Chatbot selection

We selected chatbots based on their rankings from the LMarena website [15], which provides comparative evaluations of multiple chatbots. This platform allows users to input queries and compare responses from two chatbots, with the final rankings determined by aggregated user choices. The top 10 ranked chatbots were initially screened in this study, excluding those with restricted availability (e.g., geofenced in mainland China). This process identified leading chatbot companies for inclusion. The final selection comprised chatbots from OpenAI, Google, Anthropic, Meta, xAI, Alibaba, and Mistral. We included all available models from each company to ensure diversity in capabilities and performance. The chatbots evaluated in this study included OpenAI’s GPT-4o, GPT-4o-mini, GPT-o1-preview, GPT-o1-mini, and GPT-4-Turbo; Google’s Gemini-1.5-Pro, Gemini-1.5-Flash, Gemini-1.5-Flash-8B, Gemini-Exp1121, and LearnLM1.5 Pro; Anthropic’s Claude-2, Claude-2.1, Claude-3-Haiku, Claude-3-Sonnet, Claude-3-Opus, Claude-3.5-Haiku, Claude-3.5-Sonnet, and Claude-3.5-Sonnet-June; Meta’s Llama-3.1-70B, Llama-3.1-405B, Llama-3.1-Nemotron, and Llama-3.2-90B; xAI’s Grok-beta; Alibaba’s Qwen-2 and Qwen-2.5; and Mistral’s Large-2.

Evaluation procedure

The evaluation was conducted from November 29 to December 1, 2024. In this study, a zero-shot testing approach was employed, meaning that the chatbots were presented with questions without any prior examples, demonstrations, or contextual information related to the test items. Each chatbot underwent testing with batches of 10 MCQs. The standardized prompt provided to the chatbots was: “Below are 10 multiple-choice questions with their options. Please provide the answers as numbers only.” Limiting each batch to 10 MCQs ensured that all questions fit within the context length constraints of the chatbots. Each chatbot’s responses were recorded, and the correctness of each answer was documented. For a deeper understanding of chatbot reasoning, we further prompted the chatbots to explain their answers for specific questions using the standardized query: “Please explain your reasoning for Question X in detail.” This use of uniform prompts was essential to ensure comparability across models by minimizing performance variations that can arise from different phrasing and batching strategies to which chatbots are sensitive.
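The batching and scoring procedure above can be sketched as follows. This is an illustrative sketch only: the sample question and answer key are hypothetical, and the API call that would deliver each prompt to a given chatbot is omitted.

```python
# Illustrative sketch of the batching/scoring workflow; the question shown
# and the answer key are hypothetical, and the vendor-specific API call
# that would send the prompt to each chatbot is intentionally omitted.

PROMPT_HEADER = ("Below are 10 multiple-choice questions with their options. "
                 "Please provide the answers as numbers only.")

def build_prompt(batch):
    """Assemble the standardized zero-shot prompt for one batch of MCQs."""
    lines = [PROMPT_HEADER]
    for i, q in enumerate(batch, start=1):
        lines.append(f"{i}. {q['stem']}")
        for j, opt in enumerate(q["options"], start=1):
            lines.append(f"   ({j}) {opt}")
    return "\n".join(lines)

def score_batch(replies, answer_key):
    """Count how many numeric replies match the key (one point per item)."""
    return sum(int(r == k) for r, k in zip(replies, answer_key))
```

The same two helpers cover both exam years, since every item is a single-best-answer MCQ worth one point.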

The primary outcomes of this study were twofold: (1) the raw score for each chatbot, which was subsequently converted into a logit ability estimate using the Rasch model, and (2) a qualitative assessment of the strengths and weaknesses of the best-performing chatbot. The latter focused on its explanations for individual questions, with particular attention to instances where it correctly answered difficult questions or incorrectly answered simple ones. We assessed the explanations in three aspects: (1) Factual Accuracy: Was the explanation based on correct clinical and pharmacological facts? (2) Logical Coherence: Did the reasoning follow a logical path from premises to conclusion? (3) Identification of Nuance and Bias: Did the model grasp the core clinical principle being tested and avoid common reasoning errors?

Statistical analysis

First, we performed descriptive analysis, including calculations of the mean, standard deviation, maximum and minimum scores, and Cronbach’s alpha to assess the internal consistency of the test. We then conducted further analysis using the Rasch model, a statistical method widely used in psychometrics to evaluate the relationship between item difficulty and the ability of respondents. Rasch analysis is based on modern test theory and provides estimates of item difficulty and respondent ability on the same scale, expressed in logits (log-odds units). Since there is no absolute measure of question difficulty, we employed the Rasch model to assess the relative difficulty levels of the selected questions. This approach also facilitates future comparisons with other medical licensing examinations beyond Taiwan, providing a broader reference for chatbot assessment. The analysis was performed using WINSTEPS (version 5.8.2), a leading commercial package for Rasch analysis, on a Windows 10 operating system.
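For dichotomous (correct/incorrect) item scores such as these, Cronbach’s alpha reduces to KR-20 and can be computed directly from the respondent-by-item score matrix. A minimal sketch, using made-up data rather than the study’s actual responses:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per respondent (chatbot),
    one column per item (question), entries 0/1. For dichotomous items this
    is equivalent to KR-20."""
    k = len(scores[0])                          # number of items
    cols = list(zip(*scores))                   # item-wise response vectors

    def var(xs):                                # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var = sum(var(c) for c in cols)        # sum of per-item variances
    total_var = var([sum(row) for row in scores])  # variance of raw scores
    return (k / (k - 1)) * (1 - item_var / total_var)
```

Applied to the full 27 × 160 response matrix, this is the quantity reported as 0.93 in the Results.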

In the Rasch model, the probability of a chatbot answering a question correctly is determined by the difference between the chatbot’s ability (β) and the item’s difficulty (δ) [16]. This is calculated using the formula P = exp(β − δ) / (1 + exp(β − δ)), where P represents the probability of a correct response. When the chatbot’s ability (β) matches the difficulty of the item (δ), the probability of answering correctly is 50%. The Rasch model iteratively adjusts these estimates to minimize error, producing reliable measures of both ability and difficulty.
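The response-probability formula is a one-liner in code; this is a minimal illustration, not the WINSTEPS implementation:

```python
import math

def rasch_p(beta, delta):
    """Rasch probability of a correct response, given respondent ability
    beta and item difficulty delta (both in logits)."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

# Equal ability and difficulty gives exactly a 50% chance:
#   rasch_p(1.0, 1.0) == 0.5
# Plugging in the values reported in the Results (ability 2.23 vs. the
# passing-threshold difficulty of 0.44) gives roughly 0.86.
```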

Before conducting the Rasch analysis, we first checked whether the dataset met the model’s basic assumptions. This involved using principal component analysis (PCA) on the residuals to ensure that the test mainly measured one underlying concept. In Rasch analysis, it’s important that secondary dimensions (other factors besides the main concept) explain no more than 20% of the residual variance [4,17]. We also calculated essential unidimensionality, which shows how much of the variance is explained by the main concept—in this case, psychiatric clinical knowledge. A value of 50% or higher was considered acceptable [18]. After confirming unidimensionality, we estimated chatbot ability and item difficulty using joint maximum likelihood estimation (JMLE). The passing threshold was set at 60% accuracy for the selected questions and was also estimated using JMLE. To evaluate whether the best-performing chatbot significantly surpassed this threshold, we performed Wald tests, considering a two-tailed p-value of < 0.05 as statistically significant.
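The Wald test described above compares the ability estimate with the passing threshold relative to its standard error. A minimal sketch follows; the standard-error value used in the comment is illustrative, as the text does not report it at this point:

```python
import math

def wald_test(beta_hat, threshold, se):
    """Two-tailed Wald test of H0: the estimated ability equals the
    passing threshold. Returns the z statistic and the p-value computed
    from the standard normal CDF."""
    z = (beta_hat - threshold) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# With an illustrative standard error of 0.2 logits, an ability of 2.23
# against the 0.44 threshold yields p < 0.001, i.e., a significant result
# at the two-tailed 0.05 level used in the study.
```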

To ensure the model’s validity, we analyzed fit statistics to determine if the data aligned with the expectations of the Rasch model. Infit mean square (MNSQ) was utilized to evaluate the consistency of responses to items matching the chatbot’s ability level, while outfit MNSQ detected unusual response patterns to extremely easy or difficult items. Acceptable values for both metrics ranged from 0.5 to 1.5 [19]. Additionally, z-standardized (ZSTD) fit statistics were evaluated, with acceptable values between ±1.96 [19].
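Both mean-square fit statistics derive from the squared residuals between observed responses and Rasch-expected probabilities. A simplified sketch for a single respondent, assuming the expected probabilities have already been obtained from the fitted model:

```python
def fit_mnsq(xs, ps):
    """Infit and outfit mean-square fit statistics for one respondent.

    xs: observed 0/1 responses across items.
    ps: Rasch-expected probabilities of success for the same items.
    """
    w = [p * (1 - p) for p in ps]                # item information (variance)
    sq = [(x - p) ** 2 for x, p in zip(xs, ps)]  # squared score residuals
    # Outfit: unweighted mean of standardized squared residuals -- sensitive
    # to lucky guesses and careless errors on off-target items.
    outfit = sum(s / v for s, v in zip(sq, w)) / len(xs)
    # Infit: information-weighted mean square -- sensitive to misfit on
    # items near the respondent's ability level.
    infit = sum(sq) / sum(w)
    return infit, outfit
```

Values near 1.0 indicate responses consistent with model expectations, which is why the 0.5–1.5 band is treated as acceptable.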

Finally, we created a person–item map (PKMAP) for each chatbot, providing a visual representation of its performance across questions of varying difficulty and showing how its ability matches the difficulty levels of the test items. We conducted a detailed analysis of its responses to identify patterns (as mentioned in the evaluation procedure section), with particular focus on the simplest questions it answered incorrectly and the most challenging ones it answered correctly.

Results

Chatbot performance overview

Table 1 lists the distribution of questions across two years of licensing exams, comprising 160 questions divided into six categories. The majority of questions were categorized under Pathophysiology and Epidemiology, Diagnostic Assessment and Clinical Examination, and Psychopharmacology and Other Therapeutic Modalities. Table 2 lists the chatbots analyzed in this study, along with their release dates and associated companies. These chatbots were released between July 2023 and November 2024. We evaluated the performance of 27 chatbots (Table 3), yielding an average raw score of 97.7 (61% accuracy), with a standard deviation of 19.5. The highest score was 129, while the lowest was 56. Cronbach’s alpha for test reliability was 0.93.

Table 3. Raw scores and Rasch model parameters for chatbots.

https://doi.org/10.1371/journal.pone.0330303.t003

Dimensionality and Rasch model analysis

The dimensionality analysis confirmed that the dataset aligned with the assumptions of the Rasch model. The raw variance explained by the measures was 38.9%, while the essential unidimensionality, calculated as the proportion of Rasch-common variance, reached 76.5%. These values exceed the commonly accepted thresholds of 20% for unidimensionality and 50% for essential unidimensionality, supporting the interpretation that the observed response patterns primarily reflect a single latent trait: psychiatric clinical knowledge.

The Rasch model parameters for the chatbots are presented in Table 3. ChatGPT-o1-preview achieved the highest performance among all models, with a JMLE ability score of 2.23, substantially surpassing the passing threshold (JMLE = 0.44, p < .001). Its infit (MNSQ = 1.08, ZSTD = 0.54) and outfit (MNSQ = 0.97, ZSTD = 0.07) statistics were within the acceptable ranges of 0.5–1.5 and ±1.96, respectively, indicating strong consistency with the Rasch model’s expectations. The chatbot’s responses demonstrated both reliability and validity in assessing psychiatric clinical knowledge.

Performance and reasoning of ChatGPT-o1-preview

The person–item map (PKMAP) for ChatGPT-o1-preview (Fig 1) visually represents its performance across questions of varying difficulty levels. In the PKMAP, the vertical axis represents the difficulty of questions in logits, with more difficult items located higher on the map, while the horizontal axis separates correct responses (on the left) from incorrect responses (on the right). The upper-left quadrant (e.g., Items 52, 54, 79, and 140) of the map highlights areas where ChatGPT-o1-preview demonstrated strong capabilities, successfully answering challenging questions. Conversely, the lower-right quadrant (e.g., Items 27, 36, 42, 43, 77, 92, 106, 117, 131, and 146) indicates areas where the chatbot struggled, with incorrect answers to relatively easier questions. This distribution provides a clear visualization of the chatbot’s strengths and weaknesses in its performance on the questions.

Fig 1. The person–item map (PKMAP) of ChatGPT-o1-preview.

It illustrates the relationship between the chatbot’s ability and the difficulty of the test items. The vertical axis, measured in logits, represents the difficulty level of the questions.

https://doi.org/10.1371/journal.pone.0330303.g001

A detailed analysis of the chatbot’s answering process (Tables 4 and 5) revealed key strengths and weaknesses in its reasoning. ChatGPT-o1-preview excelled in areas such as diagnostic reasoning and broader therapeutic concepts (e.g., recognizing paraphilic disorders and treatment paradigms for schizophrenia), and pharmacological principles (e.g., drug mechanisms, indications, and side effects). However, it exhibited notable limitations in recalling specific factual details (e.g., remission timelines for transvestic disorder, concordance rates for generalized anxiety disorder in twin studies, or diagnostic definitions of negative symptoms). Additionally, biases in reasoning were observed, such as overemphasizing lithium’s efficacy in depression augmentation therapy while underestimating the role of antipsychotics, or dismissing hypnosis as a therapeutic option. The chatbot also demonstrated a capacity for self-correction: in several cases (e.g., Items 27, 42, 77, 131, and 145), it revised its initial incorrect answers upon re-evaluation, ultimately producing accurate solutions.

Table 4. Strengths analysis of ChatGPT-o1-preview’s answering process.

https://doi.org/10.1371/journal.pone.0330303.t004

Table 5. Weaknesses analysis of ChatGPT-o1-preview’s answering process.

https://doi.org/10.1371/journal.pone.0330303.t005

The symbol “XXX” indicates the chatbot’s estimated ability. Each item on the map represents a specific question number from the exam, with correct responses (marked “1”) displayed on the left side of the map and incorrect responses (marked “0”) on the right. The vertical positioning of each item reflects its difficulty.

Discussion

To our knowledge, this study is the first to apply Rasch analysis to evaluate chatbots’ clinical knowledge and reasoning in psychiatry using expert-designed multiple-choice questions. Although the Rasch model may simplify certain aspects of clinical reasoning complexity, it provides a robust framework for quantifying chatbot performance and identifying specific strengths and weaknesses. A fundamental understanding of chatbots’ internal knowledge and reasoning is essential for advancing their clinical applications and scalability. Among the 27 chatbots assessed, ChatGPT-o1-preview emerged as the top-performing model, achieving a correct response rate of 80.6% and a JMLE ability estimate of 2.23. According to the Rasch model, this ability estimate corresponds to an expected correct-response probability of 85.6% on an item at the passing-threshold difficulty. This performance not only surpassed the passing threshold by a significant margin but also placed its score well within the range of successful human candidates seeking board certification. The strengths of ChatGPT-o1-preview were particularly evident in its understanding of high-level diagnostic, therapeutic, and pharmacological concepts. For instance, the model showcased advanced reasoning in areas such as schizophrenia management across various stages, diagnostic clarity in paraphilic disorders, and a thorough understanding of psychopharmacology, including drug mechanisms, indications, and side effects. Moreover, ChatGPT-o1-preview exhibited a strong ability to self-correct during re-evaluation.

Despite these strengths, the chatbot demonstrated notable limitations. It struggled with questions requiring precise factual recall, such as the exact Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision (DSM-5-TR) criteria [20], remission timelines for transvestic disorder, and rare statistical data (e.g., concordance rates for generalized anxiety disorder). This highlights the need for caution regarding the chatbot’s susceptibility to hallucinations or confabulations (a term used when a model produces factually incorrect statements with great confidence, analogous to neuro-psychiatric confabulation) [21]. When probed for highly specific details, ChatGPT-o1-preview occasionally generated inaccurate or fabricated information. This limitation is consistent with prior studies on chatbot performance in factual recall tasks. For example, the SimpleQA benchmark [22], which focuses on short, fact-seeking queries across diverse domains, requires each question to meet strict criteria: it must have a single, indisputable answer that is easy to grade, and the answer should remain constant over time. Chatbots tend to perform poorly on SimpleQA tasks. Similarly, the Chinese SimpleQA [23], a localized version of the SimpleQA benchmark, has demonstrated similar challenges in chatbot performance. Additionally, the chatbot occasionally displayed reasoning biases, such as overemphasizing lithium’s efficacy in depression augmentation therapy while underestimating the role of antipsychotics, or dismissing hypnosis as a viable treatment for dissociative identity disorder.

Our findings highlight how biases in medical reasoning, such as the overemphasis on lithium for major depressive disorder augmentation therapy, could pose clinical risks. Identifying these domain-specific biases is crucial, as they may directly impact patient safety. Such biases may stem from the training data’s inherent limitations or uneven exposure to certain clinical concepts [24–26]. These findings align with previous research, including BiasMedQA, which highlights how biased information in prompts can lead to biased clinical judgments by chatbots [27]. As for ethical considerations, adhering to the principle of ‘do no harm’ remains a fundamental criterion in psychiatric care. This principle is especially critical given the unique vulnerabilities of psychiatric patients. For patient populations with compromised reality testing, such as individuals with psychotic disorders or severe cognitive impairment, a single piece of misinformation from an AI could reinforce a delusion, undermine therapeutic trust, or precipitate a clinical crisis. Ensuring accountability for AI-related harm requires robust regulatory frameworks. To mitigate potential errors and their impact on patient care, it is essential to implement strategies such as human oversight, model interpretability, and real-time validation in clinical environments. Additionally, clinicians must be equipped with guidelines to effectively assess and intervene when chatbot-generated recommendations deviate from best practices.

Our study underscores the importance of both training data quality and ongoing performance optimization in enhancing chatbot reliability for psychiatric applications. While ChatGPT-o1-preview demonstrated superior knowledge in psychiatry, its performance in niche areas was sometimes biased and less reliable. This discrepancy reflects the accessibility of training resources; open-access journals, books, and widely available materials likely dominate the training corpus, while proprietary textbooks and leading psychiatric journals remain less accessible. Recent partnerships between publishers and chatbot developers signal potential solutions to this issue. For instance, Axel Springer has partnered with OpenAI to integrate journalism with AI technologies [28], and Elsevier Health has collaborated with OpenEvidence to develop ClinicalKey AI [29]. Collaborative efforts such as these could bridge the gap between accessible and proprietary knowledge, benefiting chatbot training. Techniques like retrieval-augmented generation (RAG) [30,31], which uses vectorized data to enhance responses by retrieving relevant information, or fine-tuning [32,33], which optimizes models by training them on domain-specific datasets, can further improve chatbot recall ability in specialized fields.

From a clinical perspective, the findings suggest that chatbots like ChatGPT-o1-preview hold significant potential as tools for delivering timely and relatively accurate feedback to support clinical decision-making. While chatbots, even multimodal or more advanced models, may hold potential for clinical use, they cannot yet be directly integrated into psychiatric practice without rigorous experimental validation and regulatory considerations. Human oversight remains indispensable to ensure safe and effective implementation. First, given the chatbot’s performance limitations, it is advisable to present it with pre-defined clinical options and guide it to provide explanations, rather than relying on it to generate responses independently. Employing adequate prompt engineering [34,35], which involves designing specific and structured queries, can substantially improve the quality of chatbot responses. Second, encouraging chatbots to perform iterative reasoning or re-evaluate their outputs has been shown to enhance accuracy, as demonstrated in this study. Third, leveraging multiple models simultaneously can provide diverse perspectives and help mitigate biases inherent in individual systems. Finally, rather than depending on chatbots for factual recall, they are better suited for reasoning and decision support, where their outputs can effectively complement clinical expertise. By adopting a thoughtful and cautious approach, chatbots can serve as valuable adjuncts to enhance patient care while minimizing risks.

The rapid advancements in chatbot capabilities cannot be ignored. Prior research suggests that as of early 2024, chatbots had achieved performance sufficient to pass licensing exams, with some models, such as ChatGPT-o1-preview, significantly surpassing that benchmark [4]. This accelerated improvement suggests that conclusions from earlier studies, particularly those conducted in 2023 or earlier [36,37], may no longer accurately reflect the current state of chatbot capabilities. Therefore, ongoing evaluation of new chatbots is essential to keep pace with their evolving capabilities and to better understand their applicability in clinical practice.

Limitations

This study has several limitations. First, the sample size of 27 chatbots, although sufficient for a pilot study using Rasch analysis [38], may limit the generalizability of our findings. However, the large number of questions used (n = 160) ensured a low standard error for ability estimates, enhancing the reliability of the results. Additionally, our focus on state-of-the-art (SOTA) models may limit the generalizability of our findings to smaller or less advanced language models. Second, while the multiple-choice questions used in this study provided a standardized method of assessing psychiatric knowledge, they may not fully capture the breadth of clinical expertise required in practice. Nevertheless, establishing a clear understanding of chatbots’ knowledge base and reasoning processes remains a fundamental step before integrating them into AI-assisted clinical decision-making, especially given the potential risks associated with their deployment. Third, all questions were presented in traditional Mandarin, and the results may vary across different languages. However, core psychiatric knowledge and reasoning are largely language-independent, as fundamental clinical principles remain consistent across different linguistic contexts. Fourth, while possessing clinical knowledge is important, it does not necessarily translate to the effective application of that knowledge in real-world scenarios. Although we have highlighted key aspects of potential clinical application, future research should prioritize systematic testing for biases to ensure reliability and evaluate the efficacy of chatbots in supporting clinical decision-making through controlled trials in real-world psychiatric settings [39–41].

Conclusion

This study demonstrates that ChatGPT-o1-preview, released by OpenAI in September 2024, outperformed other chatbots in a standardized evaluation of psychiatric clinical knowledge. The model’s strengths lie in its understanding of diagnostic frameworks, treatment paradigms, and pharmacological concepts. However, its limitations in recalling specific details, addressing niche knowledge, and overcoming biases highlight the need for cautious interpretation of its outputs. Building on these findings, we have proposed key practical considerations to support the integration of chatbots into psychiatric practice. As chatbot technologies continue to evolve, ongoing assessments of their capabilities and controlled clinical trials will be crucial for understanding their broader applicability and ensuring safe and effective implementation in clinical settings.

References

  1. 1. Wang Z, Chu Z, Doan TV, Ni S, Yang M, Zhang W. History, development, and principles of large language models-an introductory survey. arXiv. 2024. http://arxiv.org/abs/2402.06853
  2. 2. Chakraborty C, Pal S, Bhattacharya M, Dash S, Lee S-S. Overview of Chatbots with special emphasis on artificial intelligence-enabled ChatGPT in medical science. Front Artif Intell. 2023;6:1237704. pmid:38028668
  3. Reyhan AH, Mutaf Ç, Uzun İ, Yüksekyayla F. A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity. J Clin Med. 2024;13(21):6512. pmid:39518652
  4. Chang Y, Su C-Y, Liu Y-C. Assessing the Performance of Chatbots on the Taiwan Psychiatry Licensing Examination Using the Rasch Model. Healthcare (Basel). 2024;12(22):2305. pmid:39595502
  5. Liu M, Okuhara T, Dai Z, Huang W, Okada H, Furukawa E, et al. Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study. medRxiv. 2024. 2024.07.09.24310129. https://www.medrxiv.org/content/10.1101/2024.07.09.24310129v1
  6. Li D-J, Kao Y-C, Tsai S-J, Bai Y-M, Yeh T-C, Chu C-S, et al. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin Neurosci. 2024;78(6):347–52. pmid:38404249
  7. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025;333(4):319–28. pmid:39405325
  8. Goktas P, Grzybowski A. Assessing the Impact of ChatGPT in Dermatology: A Comprehensive Rapid Review. J Clin Med. 2024;13(19):5909. pmid:39407969
  9. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med. 2023;3(1):1–8.
  10. Whyte J, Ward P, Eccles DW. The relationship between knowledge and clinical performance in novice and experienced critical care nurses. Heart & Lung. 2009;38(6):517–25.
  11. Kao H-J, Chien T-W, Wang W-C, Chou W, Chow JC. Assessing ChatGPT’s capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine (Baltimore). 2023;102(25):e34068. pmid:37352054
  12. Thirunavukarasu AJ, Mahmood S, Malem A, Foster WP, Sanghera R, Hassan R, et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study. PLOS Digit Health. 2024;3(4):e0000341. pmid:38630683
  13. Rasch G. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Oxford, England: Nielsen & Lydiche. 1960:184.
  14. Adams NE. Bloom’s taxonomy of cognitive learning objectives. J Med Libr Assoc. 2015;103(3):152–3. pmid:26213509
  15. Chiang WL, Zheng L, Sheng Y, Angelopoulos AN, Li T, Li D. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv. 2024. http://arxiv.org/abs/2403.04132
  16. Bond T, Yan Z, Heene M. Applying the Rasch model: fundamental measurement in the human sciences. New York; London: Routledge. 2021:376.
  17. Rasch analysis of the Indonesian mental health screening tools. The Open Psychology Journal. 2021;14:198–203.
  18. Liu X, Cao P, Lai X, Wen J, Yang Y. Assessing essential unidimensionality of scales and structural coefficient bias. Educational and Psychological Measurement. 2023;83(1):28–47.
  19. Pitaloka DAE, Kusuma IY, Pratiwi H, Pradipta IS. Development and validation of assessment instrument for the perception and attitude toward tuberculosis among the general population in Indonesia: a Rasch analysis of psychometric properties. Front Public Health. 2023;11:1143120.
  20. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision. Washington, DC. 2022:1050.
  21. Hatem R, Simmons B, Thornton JE. Chatbot Confabulations Are Not Hallucinations. JAMA Intern Med. 2023;183(10):1177. pmid:37578766
  22. Wei J, Karina N, Chung HW, Jiao YJ, Papay S, Glaese A. Measuring short-form factuality in large language models. arXiv. 2024. http://arxiv.org/abs/2411.04368
  23. He Y, Li S, Liu J, Tan Y, Wang W, Huang H, et al. Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models. arXiv. 2024. http://arxiv.org/abs/2411.07140
  24. Gallegos IO, Rossi RA, Barrow J, Tanjim MM, Kim S, Dernoncourt F, et al. Bias and Fairness in Large Language Models: A Survey. Computational Linguistics. 2024;50(3):1097–179.
  25. Ayoub NF, Balakrishnan K, Ayoub MS, Barrett TF, David AP, Gray ST. Inherent bias in large language models: a random sampling analysis. Mayo Clinic Proceedings: Digital Health. 2024;2(2):186–91.
  26. Wang J, Redelmeier DA. Cognitive Biases and Artificial Intelligence. NEJM AI. 2024;1(12):AIcs2400639.
  27. Schmidgall S, Harris C, Essien I, Olshvang D, Rahman T, Kim JW. Evaluation and mitigation of cognitive biases in medical language models. npj Digital Medicine. 2024;7(1):1–9.
  28. Partnership with Axel Springer to deepen beneficial use of AI in journalism. Available from: https://openai.com/index/axel-springer-partnership/.
  29. Elsevier Health partners with OpenEvidence to launch next generation ClinicalKey AI. Available from: https://www.elsevier.com/about/press-releases/elsevier-health-partners-with-openevidence-to-deliver-trusted-evidence-based.
  30. Miao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia OA, Cheungpasitporn W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina (Kaunas). 2024;60(3):445. pmid:38541171
  31. Wang C, Long Q, Xiao M, Cai X, Wu C, Meng Z. BioRAG: A RAG-LLM Framework for Biological Question Reasoning. arXiv. 2024. http://arxiv.org/abs/2408.01107
  32. Maharjan J, Garikipati A, Singh NP, Cyrus L, Sharma M, Ciobanu M, et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci Rep. 2024;14(1):14156. pmid:38898116
  33. Tan Y, Zhang Z, Li M, Pan F, Duan H, Huang Z, et al. MedChatZH: A tuning LLM for traditional Chinese medicine consultations. Comput Biol Med. 2024;172:108290. pmid:38503097
  34. Zhang X, Talukdar N, Vemulapalli S, Ahn S, Wang J, Meng H, et al. Comparison of prompt engineering and fine-tuning strategies in large language models in the classification of clinical notes. medRxiv. 2024. 2024.02.07.24302444. https://www.medrxiv.org/content/10.1101/2024.02.07.24302444v1
  35. Meskó B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res. 2023;25:e50638.
  36. Weng T-L, Wang Y-M, Chang S, Chen T-J, Hwang S-J. ChatGPT failed Taiwan’s Family Medicine Board Exam. J Chin Med Assoc. 2023;86(8):762–6. pmid:37294147
  37. Au Yeung J, Kraljevic Z, Luintel A, Balston A, Idowu E, Dobson RJ, et al. AI chatbots not yet ready for clinical use. Front Digit Health. 2023;5:1161098. pmid:37122812
  38. Linacre J. Sample size and item calibration stability. 1994. Available from: https://www.semanticscholar.org/paper/Sample-size-and-item-calibration-stability-Linacre/bd9943b087bfeb266bf8a77290ecb15bc1ccc9be
  39. Chang Y, Ju PC, Hsieh MH, Chang CC. Evaluating the impact of authoritative and subjective cues on large language model reliability for clinical inquiries: An experimental study. medRxiv. 2025. 2025.07.15.25331607.
  40. Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024;7(10):e2440969. pmid:39466245
  41. Chang Y, Liu Y-C, Huang S-S, Hsu W-Y. Assessing bias in AI-driven psychiatric recommendations: A comparative cross-sectional study of chatbot-classified and CANMAT 2023 guideline for adjunctive therapy in difficult-to-treat depression. Psychiatry Res. 2025;348:116501. pmid:40267866