
Evaluating chatbots in psychiatry: Rasch-based insights into clinical knowledge and reasoning

  • Yu Chang,

    Roles Conceptualization, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Department of Psychiatry, Chung Shan Medical University Hospital, Taichung, Taiwan, School of Medicine, Chung Shan Medical University, Taichung, Taiwan

  • Si-Sheng Huang,

    Roles Conceptualization, Writing – original draft

    Affiliations Post Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, Department of Psychiatry, Changhua Christian Hospital, Changhua, Taiwan

  • Wen-Yu Hsu,

    Roles Methodology, Writing – original draft

    Affiliations School of Medicine, Chung Shan Medical University, Taichung, Taiwan, Post Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, Department of Psychiatry, Changhua Christian Hospital, Changhua, Taiwan

  • Yi-Chun Liu

    Roles Conceptualization, Formal analysis, Validation, Writing – original draft, Writing – review & editing

    183711@cch.org.tw

    Affiliations Post Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, Department of Psychiatry, Changhua Christian Hospital, Changhua, Taiwan, Department of Psychiatry, Changhua Christian Children’s Hospital, Changhua, Taiwan, Department of Health Business Administration, Hungkuang University, Taichung City, Taiwan

Abstract

Chatbots are increasingly being recognized as valuable tools for clinical support in psychiatry. This study systematically evaluated the clinical knowledge and reasoning of 27 leading chatbots in psychiatry. Using 160 multiple-choice questions from the Taiwan Psychiatry Licensing Examinations and Rasch analysis, we quantified performance and qualitatively assessed reasoning processes. OpenAI’s ChatGPT-o1-preview emerged as the top performer, achieving a Rasch ability score of 2.23, significantly surpassing the passing threshold (p < 0.001). While it excelled in diagnostic and therapeutic reasoning, it also demonstrated notable limitations in factual recall, niche topics, and occasional reasoning biases. Our findings indicate that while advanced chatbots hold significant potential as clinical decision-support tools, their current limitations underscore that rigorous human oversight is indispensable for patient safety. Continuous evaluation and domain-specific training are crucial for the safe integration of these technologies into clinical practice.

Introduction

Chatbots, powered by generative artificial intelligence and trained through deep learning algorithms, are designed to engage in natural conversations [1]. These systems, often referred to as large language models (LLMs), exhibit a remarkable ability to process and respond with contextually relevant information [1]. Recent advancements in large-scale data training and sophisticated reasoning mechanisms have expanded chatbots’ capabilities from general knowledge dissemination to specialized applications [2]. In the medical field, research has demonstrated that chatbots can achieve acceptable levels of performance in various professional examinations and assessments [3–8].

The growing interest in chatbots for clinical support highlights their potential to enhance healthcare delivery [9]. Previous studies have established a connection between clinical competence and foundational knowledge [10], suggesting that an evaluation of chatbots’ clinical knowledge could provide valuable insights into their practical utility. While previous research has used Rasch analysis to assess chatbots’ knowledge in psychiatry, gaps in understanding remain, particularly in certain specialized areas [4]. A comprehensive evaluation of chatbots’ strengths and limitations in psychiatric clinical knowledge is still needed.

Measuring psychiatric clinical knowledge poses unique challenges due to the complexity and context-dependent nature of psychiatric assessments. Past research has attempted to evaluate chatbots in this domain [6,11,12], with one study administering a set of ten multiple-choice questions focused on differential diagnosis to both chatbots and psychiatrists [6]. However, the study primarily relied on surface-level comparisons of scores and lacked sufficient depth to quantify or qualitatively understand the underlying knowledge structures of the chatbots. This limited scope has left aspects of chatbot capabilities underexplored, such as their reasoning processes.

To address these limitations, our study employs Rasch analysis [13], a statistical method commonly used in educational and psychological testing to quantify item difficulty and respondent ability on a unified scale. By applying Rasch analysis to a larger and more diverse set of psychiatry-related questions, we aim to provide a detailed assessment of the strengths and weaknesses of the most advanced chatbots. Our study focuses on exploring these foundational aspects to provide insights that can guide their effective integration into clinical practice.

Materials and methods

Study design and question selection

To evaluate the clinical knowledge of chatbots, this study utilized multiple-choice questions (MCQs) from the Taiwan Psychiatry Licensing Examination administered by the Taiwanese Society of Psychiatry. To comprehensively evaluate chatbot performance, we included questions related to any clinically relevant topic based on all levels outlined in Bloom’s Taxonomy [14]. Specifically, we selected MCQs from the Taiwan Psychiatry Licensing Examination from the past two years (2023 and 2024). These exams represent the first stage of obtaining board certification and consist entirely of MCQs. Questions from these exams are primarily derived from the 12th edition of Kaplan & Sadock’s Synopsis of Psychiatry, ensuring that the content reflects the latest psychiatric knowledge. The questions were crafted by experienced board-certified psychiatrists, and each exam consisted of 100 questions, each worth one point. A score of 60 was set as the passing threshold. To maintain relevance and consistency, questions involving Taiwan-specific laws and policies or those solely testing basic medical knowledge were excluded. This exclusion ensures the broader applicability of our findings, as legal and regulatory frameworks vary across regions; our study focuses on evaluating core psychiatric knowledge and reasoning rather than jurisdiction-specific legal aspects. The remaining questions were categorized into six domains: pathophysiology and epidemiology, diagnostic assessment and clinical examination, psychopharmacology and other therapeutic modalities, psychosocial and cultural influences, neuroscience and behavioral science, and forensic psychiatry and ethics.

Chatbot selection

We selected chatbots based on their rankings from the LMarena website [15], which provides comparative evaluations of multiple chatbots. This platform allows users to input queries and compare responses from two chatbots, with the final rankings determined by aggregated user choices. The top 10 ranked chatbots were initially screened in this study, excluding those with restricted availability (e.g., geofenced in mainland China). This process identified leading chatbot companies for inclusion. The final selection comprised chatbots from OpenAI, Google, Anthropic, Meta, xAI, Alibaba, and Mistral. We included all available models from each company to ensure diversity in capabilities and performance. The chatbots evaluated in this study included OpenAI’s GPT-4o, GPT-4o-mini, GPT-o1-preview, GPT-o1-mini, and GPT-4-Turbo; Google’s Gemini-1.5-Pro, Gemini-1.5-Flash, Gemini-1.5-Flash-8B, Gemini-Exp1121, and LearnLM1.5 Pro; Anthropic’s Claude-2, Claude-2.1, Claude-3-Haiku, Claude-3-Sonnet, Claude-3-Opus, Claude-3.5-Haiku, Claude-3.5-Sonnet, and Claude-3.5-Sonnet-June; Meta’s Llama-3.1-70B, Llama-3.1-405B, Llama-3.1-Nemotron, and Llama-3.2-90B; xAI’s Grok-beta; Alibaba’s Qwen-2 and Qwen-2.5; and Mistral’s Large-2.

Evaluation procedure

The evaluation was conducted from November 29 to December 1, 2024. In this study, a zero-shot testing approach was employed, meaning that the chatbots were presented with questions without any prior examples, demonstrations, or contextual information related to the test items. Each chatbot underwent testing with batches of 10 MCQs. The standardized prompt provided to the chatbots was: “Below are 10 multiple-choice questions with their options. Please provide the answers as numbers only.” Limiting each batch to 10 MCQs ensured that all questions fit within the context length constraints of the chatbots. Each chatbot’s responses were recorded, and the correctness of each answer was documented. For a deeper understanding of chatbot reasoning, we further prompted the chatbots to explain their answers for specific questions using the standardized query: “Please explain your reasoning for Question X in detail.” This use of uniform prompts was essential to ensure comparability across models by minimizing performance variations that can arise from different phrasing and batching strategies to which chatbots are sensitive.
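The batching and scoring procedure above can be sketched as follows. This is an illustrative sketch only: the sample question and answer key are hypothetical, and the API call that would deliver each prompt to a given chatbot is omitted.

```python
# Illustrative sketch of the batching/scoring workflow; the question shown
# and the answer key are hypothetical, and the vendor-specific API call
# that would send the prompt to each chatbot is intentionally omitted.

PROMPT_HEADER = ("Below are 10 multiple-choice questions with their options. "
                 "Please provide the answers as numbers only.")

def build_prompt(batch):
    """Assemble the standardized zero-shot prompt for one batch of MCQs."""
    lines = [PROMPT_HEADER]
    for i, q in enumerate(batch, start=1):
        lines.append(f"{i}. {q['stem']}")
        for j, opt in enumerate(q["options"], start=1):
            lines.append(f"   ({j}) {opt}")
    return "\n".join(lines)

def score_batch(replies, answer_key):
    """Count how many numeric replies match the key (one point per item)."""
    return sum(int(r == k) for r, k in zip(replies, answer_key))
```

The same two helpers cover both exam years, since every item is a single-best-answer MCQ worth one point.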

The primary outcomes of this study were twofold: (1) the raw score for each chatbot, which was subsequently converted into a logit ability estimate using the Rasch model, and (2) a qualitative assessment of the strengths and weaknesses of the best-performing chatbot. The latter focused on its explanations for individual questions, with particular attention to instances where it correctly answered difficult questions or incorrectly answered simple ones. We assessed the explanations in three aspects: (1) Factual Accuracy: Was the explanation based on correct clinical and pharmacological facts? (2) Logical Coherence: Did the reasoning follow a logical path from premises to conclusion? (3) Identification of Nuance and Bias: Did the model grasp the core clinical principle being tested and avoid common reasoning errors?

Statistical analysis

First, we performed descriptive analysis, including calculations of the mean, standard deviation, maximum and minimum scores, and Cronbach’s alpha to assess the internal consistency of the test. We then conducted further analysis using the Rasch model, a statistical method widely used in psychometrics to evaluate the relationship between item difficulty and the ability of respondents. Rasch analysis is based on modern test theory and provides estimates of item difficulty and respondent ability on the same scale, expressed in logits (log-odds units). Since there is no absolute measure of question difficulty, we employed the Rasch model to assess the relative difficulty levels of the selected questions. This approach also facilitates future comparisons with other medical licensing examinations beyond Taiwan, providing a broader reference for chatbot assessment. The analysis was performed using WINSTEPS (version 5.8.2), a leading commercial package for Rasch analysis, on a Windows 10 operating system.
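For dichotomous (correct/incorrect) item scores such as these, Cronbach’s alpha reduces to KR-20 and can be computed directly from the respondent-by-item score matrix. A minimal sketch, using made-up data rather than the study’s actual responses:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per respondent (chatbot),
    one column per item (question), entries 0/1. For dichotomous items this
    is equivalent to KR-20."""
    k = len(scores[0])                          # number of items
    cols = list(zip(*scores))                   # item-wise response vectors

    def var(xs):                                # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var = sum(var(c) for c in cols)        # sum of per-item variances
    total_var = var([sum(row) for row in scores])  # variance of raw scores
    return (k / (k - 1)) * (1 - item_var / total_var)
```

Applied to the full 27 × 160 response matrix, this is the quantity reported as 0.93 in the Results.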

In the Rasch model, the probability of a chatbot answering a question correctly is determined by the difference between the chatbot’s ability (β) and the item’s difficulty (δ) [16]. This is calculated using the formula P = exp(β − δ) / (1 + exp(β − δ)), where P represents the probability of a correct response. When the chatbot’s ability (β) matches the difficulty of the item (δ), the probability of answering correctly is 50%. The Rasch model iteratively adjusts these estimates to minimize error, producing reliable measures of both ability and difficulty.
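The response-probability formula is a one-liner in code; this is a minimal illustration, not the WINSTEPS implementation:

```python
import math

def rasch_p(beta, delta):
    """Rasch probability of a correct response, given respondent ability
    beta and item difficulty delta (both in logits)."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

# Equal ability and difficulty gives exactly a 50% chance:
#   rasch_p(1.0, 1.0) == 0.5
# Plugging in the values reported in the Results (ability 2.23 vs. the
# passing-threshold difficulty of 0.44) gives roughly 0.86.
```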

Before conducting the Rasch analysis, we first checked whether the dataset met the model’s basic assumptions. This involved using principal component analysis (PCA) on the residuals to ensure that the test mainly measured one underlying concept. In Rasch analysis, it’s important that secondary dimensions (other factors besides the main concept) explain no more than 20% of the residual variance [4,17]. We also calculated essential unidimensionality, which shows how much of the variance is explained by the main concept—in this case, psychiatric clinical knowledge. A value of 50% or higher was considered acceptable [18]. After confirming unidimensionality, we estimated chatbot ability and item difficulty using joint maximum likelihood estimation (JMLE). The passing threshold was set at 60% accuracy for the selected questions and was also estimated using JMLE. To evaluate whether the best-performing chatbot significantly surpassed this threshold, we performed Wald tests, considering a two-tailed p-value of < 0.05 as statistically significant.
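The Wald test described above compares the ability estimate with the passing threshold relative to its standard error. A minimal sketch follows; the standard-error value used in the comment is illustrative, as the text does not report it at this point:

```python
import math

def wald_test(beta_hat, threshold, se):
    """Two-tailed Wald test of H0: the estimated ability equals the
    passing threshold. Returns the z statistic and the p-value computed
    from the standard normal CDF."""
    z = (beta_hat - threshold) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# With an illustrative standard error of 0.2 logits, an ability of 2.23
# against the 0.44 threshold yields p < 0.001, i.e., a significant result
# at the two-tailed 0.05 level used in the study.
```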

To ensure the model’s validity, we analyzed fit statistics to determine if the data aligned with the expectations of the Rasch model. Infit mean square (MNSQ) was utilized to evaluate the consistency of responses to items matching the chatbot’s ability level, while outfit MNSQ detected unusual response patterns to extremely easy or difficult items. Acceptable values for both metrics ranged from 0.5 to 1.5 [19]. Additionally, z-standardized (ZSTD) fit statistics were evaluated, with acceptable values between ±1.96 [19].
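Both mean-square fit statistics derive from the squared residuals between observed responses and Rasch-expected probabilities. A simplified sketch for a single respondent, assuming the expected probabilities have already been obtained from the fitted model:

```python
def fit_mnsq(xs, ps):
    """Infit and outfit mean-square fit statistics for one respondent.

    xs: observed 0/1 responses across items.
    ps: Rasch-expected probabilities of success for the same items.
    """
    w = [p * (1 - p) for p in ps]                # item information (variance)
    sq = [(x - p) ** 2 for x, p in zip(xs, ps)]  # squared score residuals
    # Outfit: unweighted mean of standardized squared residuals -- sensitive
    # to lucky guesses and careless errors on off-target items.
    outfit = sum(s / v for s, v in zip(sq, w)) / len(xs)
    # Infit: information-weighted mean square -- sensitive to misfit on
    # items near the respondent's ability level.
    infit = sum(sq) / sum(w)
    return infit, outfit
```

Values near 1.0 indicate responses consistent with model expectations, which is why the 0.5–1.5 band is treated as acceptable.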

Finally, we created a person–item map (PKMAP) for each chatbot, providing a visual representation of its performance across questions of varying difficulty and showing how its ability matches the difficulty levels of the test items. We conducted a detailed analysis of its responses to identify patterns (as mentioned in the evaluation procedure section), with particular focus on the simplest questions it answered incorrectly and the most challenging ones it answered correctly.

Results

Chatbot performance overview

Table 1 lists the distribution of questions across two years of licensing exams, comprising 160 questions divided into six categories. The majority of questions were categorized under Pathophysiology and Epidemiology, Diagnostic Assessment and Clinical Examination, and Psychopharmacology and Other Therapeutic Modalities. Table 2 lists the chatbots analyzed in this study, along with their release dates and associated companies. These chatbots were released between July 2023 and November 2024. We evaluated the performance of 27 chatbots (Table 3), yielding an average raw score of 97.7 (61% accuracy), with a standard deviation of 19.5. The highest score was 129, while the lowest was 56. Cronbach’s alpha for test reliability was 0.93.

Table 3. Raw scores and Rasch model parameters for chatbots.

https://doi.org/10.1371/journal.pone.0330303.t003

Dimensionality and Rasch model analysis

The dimensionality analysis confirmed that the dataset aligned with the assumptions of the Rasch model. The raw variance explained by the measures was 38.9%, while the essential unidimensionality, calculated as the proportion of Rasch-common variance, reached 76.5%. These values exceed the commonly accepted thresholds of 20% for unidimensionality and 50% for essential unidimensionality, supporting the interpretation that the observed response patterns primarily reflect a single latent trait: psychiatric clinical knowledge.

The Rasch model parameters for the chatbots are presented in Table 3. ChatGPT-o1-preview achieved the highest performance among all models, with a JMLE ability score of 2.23, substantially surpassing the passing threshold (JMLE = 0.44, p < .001). Its infit (MNSQ = 1.08, ZSTD = 0.54) and outfit (MNSQ = 0.97, ZSTD = 0.07) statistics were within the acceptable ranges of 0.5–1.5 and ±1.96, respectively, indicating strong consistency with the Rasch model’s expectations. The chatbot’s responses demonstrated both reliability and validity in assessing psychiatric clinical knowledge.

Performance and reasoning of ChatGPT-o1-preview

The person–item map (PKMAP) for ChatGPT-o1-preview (Fig 1) visually represents its performance across questions of varying difficulty levels. In the PKMAP, the vertical axis represents the difficulty of questions in logits, with more difficult items located higher on the map, while the horizontal axis separates correct responses (on the left) from incorrect responses (on the right). The upper-left quadrant (e.g., Items 52, 54, 79, and 140) of the map highlights areas where ChatGPT-o1-preview demonstrated strong capabilities, successfully answering challenging questions. Conversely, the lower-right quadrant (e.g., Items 27, 36, 42, 43, 77, 92, 106, 117, 131, and 146) indicates areas where the chatbot struggled, with incorrect answers to relatively easier questions. This distribution provides a clear visualization of the chatbot’s strengths and weaknesses in its performance on the questions.

Fig 1. The person–item map (PKMAP) of ChatGPT-o1-preview.

It illustrates the relationship between the chatbot’s ability and the difficulty of the test items. The vertical axis, measured in logits, represents the difficulty level of the questions.

https://doi.org/10.1371/journal.pone.0330303.g001

A detailed analysis of the chatbot’s answering process (Tables 4 and 5) revealed key strengths and weaknesses in its reasoning. ChatGPT-o1-preview excelled in areas such as diagnostic reasoning and broader therapeutic concepts (e.g., recognizing paraphilic disorders and treatment paradigms for schizophrenia), and pharmacological principles (e.g., drug mechanisms, indications, and side effects). However, it exhibited notable limitations in recalling specific factual details (e.g., remission timelines for transvestic disorder, concordance rates for generalized anxiety disorder in twin studies, or diagnostic definitions of negative symptoms). Additionally, biases in reasoning were observed, such as overemphasizing lithium’s efficacy in depression augmentation therapy while underestimating the role of antipsychotics, or dismissing hypnosis as a therapeutic option. The chatbot also demonstrated a capacity for self-correction: in several cases (e.g., Items 27, 42, 77, 131, and 145), it revised its initial incorrect answers upon re-evaluation, ultimately producing accurate solutions.

Table 4. Strengths analysis of ChatGPT-o1-preview’s answering process.

https://doi.org/10.1371/journal.pone.0330303.t004

Table 5. Weaknesses analysis of ChatGPT-o1-preview’s answering process.

https://doi.org/10.1371/journal.pone.0330303.t005

The symbol “XXX” indicates the chatbot’s estimated ability. Each item on the map represents a specific question number from the exam, with correct responses (marked “1”) displayed on the left side of the map and incorrect responses (marked “0”) on the right. The vertical positioning of each item reflects its difficulty.

Discussion

To our knowledge, this study is the first to apply Rasch analysis to evaluate chatbots’ clinical knowledge and reasoning in psychiatry using expert-designed multiple-choice questions. Although the Rasch model may simplify certain aspects of clinical reasoning complexity, it provides a robust framework for quantifying chatbot performance and identifying specific strengths and weaknesses. A fundamental understanding of chatbots’ internal knowledge and reasoning is essential for advancing their clinical applications and scalability. Among the 27 chatbots assessed, ChatGPT-o1-preview emerged as the top-performing model, achieving a correct response rate of 80.6% and a JMLE ability estimate of 2.23. According to the Rasch model, this ability estimate corresponds to an expected correct-response probability of 85.6% on an item at the passing-threshold difficulty. This performance not only surpassed the passing threshold by a significant margin but also placed its score well within the range of successful human candidates seeking board certification. The strengths of ChatGPT-o1-preview were particularly evident in its understanding of high-level diagnostic, therapeutic, and pharmacological concepts. For instance, the model showcased advanced reasoning in areas such as schizophrenia management across various stages, diagnostic clarity in paraphilic disorders, and a thorough understanding of psychopharmacology, including drug mechanisms, indications, and side effects. Moreover, ChatGPT-o1-preview exhibited a strong ability to self-correct during re-evaluation.

Despite these strengths, the chatbot demonstrated notable limitations. It struggled with questions requiring precise factual recall, such as the exact Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision (DSM-5-TR) criteria [20], remission timelines for transvestic disorder, and rare statistical data (e.g., concordance rates for generalized anxiety disorder). This highlights the need for caution regarding the chatbot’s susceptibility to hallucinations or confabulations (a term used when a model produces factually incorrect statements with great confidence, analogous to neuro-psychiatric confabulation) [21]. When probed for highly specific details, ChatGPT-o1-preview occasionally generated inaccurate or fabricated information. This limitation is consistent with prior studies on chatbot performance in factual recall tasks. For example, the SimpleQA benchmark [22], which focuses on short, fact-seeking queries across diverse domains, requires each question to meet strict criteria: it must have a single, indisputable answer that is easy to grade, and the answer should remain constant over time. Chatbots tend to perform poorly on SimpleQA tasks. Similarly, the Chinese SimpleQA [23], a localized version of the SimpleQA benchmark, has demonstrated similar challenges in chatbot performance. Additionally, the chatbot occasionally displayed reasoning biases, such as overemphasizing lithium’s efficacy in depression augmentation therapy while underestimating the role of antipsychotics, or dismissing hypnosis as a viable treatment for dissociative identity disorder.

Our findings highlight how biases in medical reasoning, such as the overemphasis on lithium for major depressive disorder augmentation therapy, could pose clinical risks. Identifying these domain-specific biases is crucial, as they may directly impact patient safety. Such biases may stem from the training data’s inherent limitations or uneven exposure to certain clinical concepts [24–26]. These findings align with previous research, including BiasMedQA, which highlights how biased information in prompts can lead to biased clinical judgments by chatbots [27]. As for ethical considerations, adhering to the principle of ‘do no harm’ remains a fundamental criterion in psychiatric care. This principle is especially critical given the unique vulnerabilities of psychiatric patients. For patient populations with compromised reality testing, such as individuals with psychotic disorders or severe cognitive impairment, a single piece of misinformation from an AI could reinforce a delusion, undermine therapeutic trust, or precipitate a clinical crisis. Ensuring accountability for AI-related harm requires robust regulatory frameworks. To mitigate potential errors and their impact on patient care, it is essential to implement strategies such as human oversight, model interpretability, and real-time validation in clinical environments. Additionally, clinicians must be equipped with guidelines to effectively assess and intervene when chatbot-generated recommendations deviate from best practices.

Our study underscores the importance of both training data quality and ongoing performance optimization in enhancing chatbot reliability for psychiatric applications. While ChatGPT-o1-preview demonstrated superior knowledge in psychiatry, its performance in niche areas was sometimes biased and less reliable. This discrepancy reflects the accessibility of training resources; open-access journals, books, and widely available materials likely dominate the training corpus, while proprietary textbooks and leading psychiatric journals remain less accessible. Recent partnerships between publishers and chatbot developers signal potential solutions to this issue. For instance, Axel Springer has partnered with OpenAI to integrate journalism with AI technologies [28], and Elsevier Health has collaborated with OpenEvidence to develop ClinicalKey AI [29]. Collaborative efforts such as these could bridge the gap between accessible and proprietary knowledge, benefiting chatbot training. Techniques like retrieval-augmented generation (RAG) [30,31], which uses vectorized data to enhance responses by retrieving relevant information, or fine-tuning [32,33], which optimizes models by training them on domain-specific datasets, can further improve chatbot recall ability in specialized fields.

From a clinical perspective, the findings suggest that chatbots like ChatGPT-o1-preview hold significant potential as tools for delivering timely and relatively accurate feedback to support clinical decision-making. While chatbots, even multimodal or more advanced models, may hold potential for clinical use, they cannot yet be directly integrated into psychiatric practice without rigorous experimental validation and regulatory considerations. Human oversight remains indispensable to ensure safe and effective implementation. First, given the chatbot’s performance limitations, it is advisable to present it with pre-defined clinical options and guide it to provide explanations, rather than relying on it to generate responses independently. Employing adequate prompt engineering [34,35], which involves designing specific and structured queries, can substantially improve the quality of chatbot responses. Second, encouraging chatbots to perform iterative reasoning or re-evaluate their outputs has been shown to enhance accuracy, as demonstrated in this study. Third, leveraging multiple models simultaneously can provide diverse perspectives and help mitigate biases inherent in individual systems. Finally, rather than depending on chatbots for factual recall, they are better suited for reasoning and decision support, where their outputs can effectively complement clinical expertise. By adopting a thoughtful and cautious approach, chatbots can serve as valuable adjuncts to enhance patient care while minimizing risks.

The rapid advancements in chatbot capabilities cannot be ignored. Prior research suggests that as of early 2024, chatbots had achieved performance sufficient to pass licensing exams, with some models, such as ChatGPT-o1-preview, significantly surpassing that benchmark [4]. This accelerated improvement suggests that conclusions from earlier studies, particularly those conducted in 2023 or earlier [36,37], may no longer accurately reflect the current state of chatbot capabilities. Therefore, ongoing evaluation of new chatbots is essential to keep pace with their evolving capabilities and to better understand their applicability in clinical practice.

Limitations

This study has several limitations. First, the sample size of 27 chatbots, although sufficient for a pilot study using Rasch analysis [38], may limit the generalizability of our findings. However, the large number of questions used (n = 160) ensured a low standard error for ability estimates, enhancing the reliability of the results. Additionally, our focus on state-of-the-art (SOTA) models may limit the generalizability of our findings to smaller or less advanced language models. Second, while the multiple-choice questions used in this study provided a standardized method of assessing psychiatric knowledge, they may not fully capture the breadth of clinical expertise required in practice. Nevertheless, establishing a clear understanding of chatbots’ knowledge base and reasoning processes remains a fundamental step before integrating them into AI-assisted clinical decision-making, especially given the potential risks associated with their deployment. Third, all questions were presented in traditional Mandarin, and the results may vary across different languages. However, core psychiatric knowledge and reasoning are largely language-independent, as fundamental clinical principles remain consistent across different linguistic contexts. Fourth, while possessing clinical knowledge is important, it does not necessarily translate to the effective application of that knowledge in real-world scenarios. Although we have highlighted key aspects of potential clinical application, future research should prioritize systematic testing for biases to ensure reliability and evaluate the efficacy of chatbots in supporting clinical decision-making through controlled trials in real-world psychiatric settings [39–41].

Conclusion

This study demonstrates that ChatGPT-o1-preview, released by OpenAI in September 2024, outperformed other chatbots in a standardized evaluation of psychiatric clinical knowledge. The model’s strengths lie in its understanding of diagnostic frameworks, treatment paradigms, and pharmacological concepts. However, its limitations in recalling specific details, addressing niche knowledge, and overcoming biases highlight the need for cautious interpretation of its outputs. Building on these findings, we have proposed key practical considerations to support the integration of chatbots into psychiatric practice. As chatbot technologies continue to evolve, ongoing assessments of their capabilities and controlled clinical trials will be crucial for understanding their broader applicability and ensuring safe and effective implementation in clinical settings.

References

  1. 1. Wang Z, Chu Z, Doan TV, Ni S, Yang M, Zhang W. History, development, and principles of large language models-an introductory survey. arXiv. 2024. http://arxiv.org/abs/2402.06853
  2. 2. Chakraborty C, Pal S, Bhattacharya M, Dash S, Lee S-S. Overview of Chatbots with special emphasis on artificial intelligence-enabled ChatGPT in medical science. Front Artif Intell. 2023;6:1237704. pmid:38028668
  3. Reyhan AH, Mutaf Ç, Uzun İ, Yüksekyayla F. A Performance Evaluation of Large Language Models in Keratoconus: A Comparative Study of ChatGPT-3.5, ChatGPT-4.0, Gemini, Copilot, Chatsonic, and Perplexity. J Clin Med. 2024;13(21):6512. pmid:39518652
  4. Chang Y, Su C-Y, Liu Y-C. Assessing the Performance of Chatbots on the Taiwan Psychiatry Licensing Examination Using the Rasch Model. Healthcare (Basel). 2024;12(22):2305. pmid:39595502
  5. Liu M, Okuhara T, Dai Z, Huang W, Okada H, Furukawa E, et al. Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study. medRxiv. 2024. 2024.07.09.24310129. https://www.medrxiv.org/content/10.1101/2024.07.09.24310129v1
  6. Li D-J, Kao Y-C, Tsai S-J, Bai Y-M, Yeh T-C, Chu C-S, et al. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin Neurosci. 2024;78(6):347–52. pmid:38404249
  7. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025;333(4):319–28. pmid:39405325
  8. Goktas P, Grzybowski A. Assessing the Impact of ChatGPT in Dermatology: A Comprehensive Rapid Review. J Clin Med. 2024;13(19):5909. pmid:39407969
  9. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med. 2023;3(1):1–8.
  10. Whyte J, Ward P, Eccles DW. The relationship between knowledge and clinical performance in novice and experienced critical care nurses. Heart & Lung. 2009;38(6):517–25.
  11. Kao H-J, Chien T-W, Wang W-C, Chou W, Chow JC. Assessing ChatGPT’s capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine (Baltimore). 2023;102(25):e34068. pmid:37352054
  12. Thirunavukarasu AJ, Mahmood S, Malem A, Foster WP, Sanghera R, Hassan R, et al. Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study. PLOS Digit Health. 2024;3(4):e0000341. pmid:38630683
  13. Rasch G. Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Oxford, England: Nielsen & Lydiche. 1960:184.
  14. Adams NE. Bloom’s taxonomy of cognitive learning objectives. J Med Libr Assoc. 2015;103(3):152–3. pmid:26213509
  15. Chiang WL, Zheng L, Sheng Y, Angelopoulos AN, Li T, Li D. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv. 2024. http://arxiv.org/abs/2403.04132
  16. Bond T, Yan Z, Heene M. Applying the Rasch model: fundamental measurement in the human sciences. New York; London: Routledge. 2021:376.
  17. Rasch analysis of the Indonesian mental health screening tools. The Open Psychology Journal. 2021;14:198–203.
  18. Liu X, Cao P, Lai X, Wen J, Yang Y. Assessing essential unidimensionality of scales and structural coefficient bias. Educational and Psychological Measurement. 2023;83(1):28–47.
  19. Pitaloka DAE, Kusuma IY, Pratiwi H, Pradipta IS. Development and validation of assessment instrument for the perception and attitude toward tuberculosis among the general population in Indonesia: a Rasch analysis of psychometric properties. Front Public Health. 2023;11:1143120.
  20. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision. Washington, DC. 2022:1050.
  21. Hatem R, Simmons B, Thornton JE. Chatbot Confabulations Are Not Hallucinations. JAMA Intern Med. 2023;183(10):1177. pmid:37578766
  22. Wei J, Karina N, Chung HW, Jiao YJ, Papay S, Glaese A. Measuring short-form factuality in large language models. arXiv. 2024. http://arxiv.org/abs/2411.04368
  23. He Y, Li S, Liu J, Tan Y, Wang W, Huang H, et al. Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models. arXiv. 2024. http://arxiv.org/abs/2411.07140
  24. Gallegos IO, Rossi RA, Barrow J, Tanjim MM, Kim S, Dernoncourt F, et al. Bias and Fairness in Large Language Models: A Survey. Computational Linguistics. 2024;50(3):1097–179.
  25. Ayoub NF, Balakrishnan K, Ayoub MS, Barrett TF, David AP, Gray ST. Inherent bias in large language models: a random sampling analysis. Mayo Clinic Proceedings: Digital Health. 2024;2(2):186–91.
  26. Wang J, Redelmeier DA. Cognitive Biases and Artificial Intelligence. NEJM AI. 2024;1(12):AIcs2400639.
  27. Schmidgall S, Harris C, Essien I, Olshvang D, Rahman T, Kim JW. Evaluation and mitigation of cognitive biases in medical language models. npj Digital Medicine. 2024;7(1):1–9.
  28. Partnership with Axel Springer to deepen beneficial use of AI in journalism. Available from: https://openai.com/index/axel-springer-partnership/.
  29. Elsevier Health partners with OpenEvidence to launch next generation ClinicalKey AI. Available from: https://www.elsevier.com/about/press-releases/elsevier-health-partners-with-openevidence-to-deliver-trusted-evidence-based.
  30. Miao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia OA, Cheungpasitporn W. Integrating Retrieval-Augmented Generation with Large Language Models in Nephrology: Advancing Practical Applications. Medicina (Kaunas). 2024;60(3):445. pmid:38541171
  31. Wang C, Long Q, Xiao M, Cai X, Wu C, Meng Z. BioRAG: A RAG-LLM Framework for Biological Question Reasoning. arXiv. 2024. http://arxiv.org/abs/2408.01107
  32. Maharjan J, Garikipati A, Singh NP, Cyrus L, Sharma M, Ciobanu M, et al. OpenMedLM: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models. Sci Rep. 2024;14(1):14156. pmid:38898116
  33. Tan Y, Zhang Z, Li M, Pan F, Duan H, Huang Z, et al. MedChatZH: A tuning LLM for traditional Chinese medicine consultations. Comput Biol Med. 2024;172:108290. pmid:38503097
  34. Zhang X, Talukdar N, Vemulapalli S, Ahn S, Wang J, Meng H, et al. Comparison of prompt engineering and fine-tuning strategies in large language models in the classification of clinical notes. medRxiv. 2024. 2024.02.07.24302444. https://www.medrxiv.org/content/10.1101/2024.02.07.24302444v1
  35. Meskó B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res. 2023;25:e50638.
  36. Weng T-L, Wang Y-M, Chang S, Chen T-J, Hwang S-J. ChatGPT failed Taiwan’s Family Medicine Board Exam. J Chin Med Assoc. 2023;86(8):762–6. pmid:37294147
  37. Au Yeung J, Kraljevic Z, Luintel A, Balston A, Idowu E, Dobson RJ, et al. AI chatbots not yet ready for clinical use. Front Digit Health. 2023;5:1161098. pmid:37122812
  38. Linacre J. Sample size and item calibration stability. 1994. Available from: https://www.semanticscholar.org/paper/Sample-size-and-item-calibration-stability-Linacre/bd9943b087bfeb266bf8a77290ecb15bc1ccc9be
  39. Chang Y, Ju PC, Hsieh MH, Chang CC. Evaluating the impact of authoritative and subjective cues on large language model reliability for clinical inquiries: An experimental study. medRxiv. 2025. 2025.07.15.25331607.
  40. Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024;7(10):e2440969. pmid:39466245
  41. Chang Y, Liu Y-C, Huang S-S, Hsu W-Y. Assessing bias in AI-driven psychiatric recommendations: A comparative cross-sectional study of chatbot-classified and CANMAT 2023 guideline for adjunctive therapy in difficult-to-treat depression. Psychiatry Res. 2025;348:116501. pmid:40267866