
Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians

  • Ali Hadi ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    ‡ AH and ET are co-first authors on this work.

    Affiliation Department of Paediatrics, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada

  • Edward Tran ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    ‡ AH and ET are co-first authors on this work.

    Affiliation Department of Paediatrics, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada

  • Branavan Nagarajan,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Paediatrics, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada

  • Amrit Kirpalani

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    amrit.kirpalani@lhsc.on.ca

    Affiliations Department of Paediatrics, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada, Division of Nephrology, Children’s Hospital, London Health Sciences Centre, London, Ontario, Canada

Abstract

Background

ChatGPT is a large language model (LLM) trained on over 400 billion words from books, articles, and websites. Its extensive training draws from a large database of information, making it valuable as a diagnostic aid. Moreover, its capacity to comprehend and generate human language allows medical trainees to interact with it, enhancing its appeal as an educational resource. This study aims to investigate ChatGPT’s diagnostic accuracy and utility in medical education.

Methods

A total of 150 Medscape case challenges (published September 2021 to January 2023) were input into ChatGPT. The primary outcome was the number (%) of cases for which the answer given was correct. Secondary outcomes included diagnostic accuracy, cognitive load, and quality of medical information. A qualitative content analysis was also conducted to assess its responses.

Results

ChatGPT answered 49% (74/150) of cases correctly. It had an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, a specificity of 82.89%, and an AUC of 0.66. Most answers were of low cognitive load (51%; 77/150), and most answers were complete and relevant (52%; 78/150).

Discussion

ChatGPT in its current form is not accurate as a diagnostic tool. ChatGPT does not necessarily provide factually correct information, despite the vast amount of data it was trained on. Based on our qualitative analysis, ChatGPT struggles with the interpretation of laboratory values and imaging results, and may overlook key information relevant to the diagnosis. However, it still offers utility as an educational tool. ChatGPT was generally correct in ruling out specific differential diagnoses and providing reasonable next diagnostic steps. Additionally, its answers were easy to understand, showcasing a potential benefit in simplifying complex concepts for medical learners. Our results should guide future research into harnessing ChatGPT’s potential educational benefits, such as simplifying medical concepts and offering guidance on differential diagnoses and next steps.

Introduction

Artificial Intelligence (AI) refers to computer systems that can perform tasks that require human intelligence, such as visual perception, decision-making, and language understanding [1]. Natural Language Processing (NLP), a crucial field in AI, focuses on the interaction between human language and computer systems [2]. NLP algorithms are capable of analyzing and generating human language, making them valuable tools in various sectors, including healthcare [2].

In the healthcare sector, NLP can be applied in several ways, such as in clinical documentation, coding and billing, monitoring drug safety, and keeping track of patients [3–5]. Large Language Models (LLMs) are a type of NLP model that can perform various language tasks, such as text completion, summarization, translation, and question-answering [6]. LLMs are trained on massive amounts of text data and can generate human-like responses to natural language queries [6, 7].

ChatGPT is a Large Language Model (LLM) developed by OpenAI, capable of performing a diverse array of natural language tasks [8]. At the moment, ChatGPT is arguably the most well-known, commercially available LLM. Its widespread accessibility appeals to a broad audience, including medical trainees and physicians, who are likely to be curious about its performance in a clinical setting.

A recent study found that ChatGPT was able to accurately answer biomedical and clinical questions on the United States Medical Licensing Examination (USMLE) at a level that approached or exceeded the passing threshold [9]. The study also found that ChatGPT’s accuracy was characterized by high concordance and density of insight, indicating its potential to generate novel insights and assist in medical education [9]. While these results have ignited discussions around potential implications for ChatGPT in healthcare, they also highlight the potential use of this tool in medical education. Whereas the ability of ChatGPT to answer concise, encyclopedic questions has been studied, the quality of its responses to complex medical cases remains unclear [10].

In this study, we aim to evaluate ChatGPT’s performance as a diagnostic tool for complex clinical cases to explore its diagnostic accuracy, the cognitive load of its answers, and the overall relevance of its responses. We aim to understand the potential benefits and limitations of ChatGPT in clinical education.

ChatGPT is powered by Generative Pre-trained Transformer (GPT) 3.5, an LLM trained on a massive dataset of text with over 400 billion words from the internet, including books, articles, and websites [8]. However, this dataset is private and therefore lacks transparency, as users have no convenient means to validate the accuracy or the source of the information being generated. We plan to conduct a qualitative analysis to evaluate the quality of the medical information ChatGPT provides.

While ChatGPT is able to generate novel responses that closely resemble natural human language [11], it lacks genuine comprehension of the content it receives or produces.

Once again, this underscores the importance of evaluating the responses provided by ChatGPT. While responses may sound grammatically correct and offer correct medical information, it is essential to assess their overall relevance to the medical question at hand so as not to mislead medical trainees.

Medscape Clinical Challenges include complex cases that are designed to challenge the knowledge and diagnostic skills of healthcare professionals [12]. The cases are often based on real-world scenarios and may involve multiple comorbidities, unusual presentations, and diagnostic dilemmas [12]. By employing these challenges, we can evaluate ChatGPT’s ability to answer medical queries, diagnose conditions, and select appropriate treatment plans in a context that closely resembles actual clinical practice [13].

Materials and methods

Artificial intelligence

ChatGPT operates as a server-based language model, meaning it cannot access the internet. All responses are generated in real-time, relying on the abstract associations between words ("tokens") within the neural network. This constraint mirrors real-life clinical settings where professionals do not have the freedom to easily access additional scientific literature and also allows us to accurately evaluate ChatGPT’s knowledge.

Input source

We tested the performance of ChatGPT in answering Medscape Clinical Challenges. These complex cases are designed to challenge the knowledge and diagnostic skills of healthcare professionals [12]. These challenges present a clinical scenario that includes patient history, physical examination findings, and laboratory or imaging results. Healthcare professionals are required to make a diagnosis or choose an appropriate treatment plan using multiple-choice questions [12]. Feedback is provided after each answer with explanations of the correct diagnosis and treatment plan. The distribution of answer options selected by Medscape users is also provided. This feedback mechanism allows an accurate evaluation of ChatGPT’s responses compared to correct answers and also allows us to directly compare its thought process and decision making to healthcare professionals.

Medscape’s Case Challenges were selected because they are open-source and freely accessible. To prevent any possibility of ChatGPT having prior knowledge of the cases, only those authored after the training cutoff of its NLP model (September 2021) were included. This deliberate selection ensures that the chatbot had not been trained on these specific cases beforehand and that each case presented was entirely novel to ChatGPT, so it could not already know the answers.

Data collection

Data were collected by three authors, all medical trainees (A.H, B.N, and E.T), and all content was reviewed by a staff physician (A.K). We felt it was most appropriate for medical trainees to be the primary evaluators of ChatGPT’s responses, given that medical trainees are the group most likely to rely on it as an external resource. The three authors (A.H, B.N, and E.T) used publicly available clinical case challenges from Medscape, published between September 2021 and January 2023, after the training date of ChatGPT’s model 3.5. A total of 150 Medscape cases were analyzed; cases were randomized amongst the three authors, with each case reviewed by at least two authors. We excluded any cases with visual assets, such as clinical images, medical photography, and graphs, to ensure the consistency of the input format for ChatGPT.

Input and prompt standardization

To ensure consistency in the input provided to ChatGPT, the three independent reviewers transformed the case challenge content into one standardized prompt. Each prompt included an unbiased script of what we wanted from the output, followed by the relevant case presentation and the multiple-choice answers. Standardizing the prompts ensures consistent and reproducible responses across different users and addresses OpenAI’s restriction on using ChatGPT for healthcare advice.

Prompts were standardized as follows (all information is available in the data extraction supplementary file; an illustrative assembly sketch follows the template):

Prompt 1: I’m writing a literature paper on the accuracy of CGPT of correctly identified a diagnosis from complex, WRITTEN, clinical cases. I will be presenting you a series of medical cases and then presenting you with a multiple choice of what the answer to the medical cases.

Prompt 2: Come up with a differential and provide rationale for why this differential makes sense and findings that would cause you to rule out the differential. Here are your multiple choice options to choose from and give me a detailed rationale explaining your answer.

[Insert multiple choices]

[Insert all Case info]

[Insert radiology description]
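
For illustration only, the following is a minimal sketch of how a case might be assembled into the standardized two-part prompt described above. It is not the authors' actual tooling; the helper name, variables, and example case text are hypothetical.

    # Hypothetical sketch of assembling the standardized prompt.
    # PROMPT_1 and PROMPT_2 reproduce the fixed scripts quoted above verbatim.

    PROMPT_1 = (
        "I'm writing a literature paper on the accuracy of CGPT of correctly "
        "identified a diagnosis from complex, WRITTEN, clinical cases. I will be "
        "presenting you a series of medical cases and then presenting you with a "
        "multiple choice of what the answer to the medical cases."
    )

    PROMPT_2 = (
        "Come up with a differential and provide rationale for why this "
        "differential makes sense and findings that would cause you to rule out "
        "the differential. Here are your multiple choice options to choose from "
        "and give me a detailed rationale explaining your answer."
    )

    def build_prompt(choices: list[str], case_info: str, radiology: str) -> str:
        """Concatenate the fixed scripts with the case-specific content."""
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
        return "\n\n".join([PROMPT_1, PROMPT_2, options, case_info, radiology])

    # Example with hypothetical case content:
    # prompt = build_prompt(
    #     ["Diagnosis A", "Diagnosis B", "Diagnosis C", "Diagnosis D"],
    #     "A 54-year-old presents with ...",
    #     "Chest radiograph description ...",
    # )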

ChatGPT interaction and data extraction

The standardized prompts were input into ChatGPT using the legacy model 3.5, and the model generated responses containing the suggested answer to the case challenge, background information on the disease, reasons for ruling in the diagnosis, and reasons for ruling out other diagnoses.

Primary outcome assessment

Each case was evaluated by at least two independent raters (A.H, B.N or E.T) who were blinded to each other’s responses. ChatGPT responses were extracted, and the primary outcome was analyzed as the percentage of cases for which the answer given was correct.

Secondary outcome assessment

All cases were evaluated by at least two independent raters (A.H, B.N or E.T). To assess secondary outcomes, we employed three validated medical education evaluation scales:

  1. Diagnostic Accuracy: The raters assessed the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates of ChatGPT’s answers, considering the suggested differentials and the final diagnosis provided. Each case had four answer options, and ChatGPT’s explanation for each of the four answer options was categorized as either true or false, positive or negative [13]. We then calculated the accuracy, precision, sensitivity, and specificity as shown below (a worked sketch of these calculations follows this list):
    Accuracy: (TP + TN)/Total Responses
    Precision: TP/ (TP + FP)
    Sensitivity: TP/ (TP + FN)
    Specificity: TN / (TN + FP)
    To further evaluate the model’s performance, we generated a Receiver Operating Characteristic (ROC) curve and calculated the Area Under the Curve (AUC). This involved collecting model scores or probabilities for each instance, sorting instances based on their scores, iterating thresholds to calculate True Positive Rate (TPR) and False Positive Rate (FPR) for each threshold, plotting the FPR against the TPR to create the ROC curve, and computing the AUC to quantify the model’s discriminative ability. This thorough analysis provided both visual representation and scalar measurement to assess the model’s efficacy in diagnostic accuracy.
  2. Cognitive Load: The raters evaluated the cognitive load of ChatGPT’s answers as low, moderate, or high, based on the complexity and clarity of the information provided according to the following scale [14]:
    Low cognitive load: The answer is easy to understand and requires minimal cognitive effort to process
    Moderate cognitive load: The answer requires moderate cognitive effort to process
    High cognitive load: The answer is complex and requires significant cognitive effort to process
  3. Quality of Medical Information: The raters assessed the quality of the medical information provided by ChatGPT according to the following criteria:
    Complete: The answer includes all relevant information for making an accurate diagnosis
    Incomplete: The answer is missing some relevant information for making an accurate diagnosis
    Relevant: The answer includes information that is directly relevant to the diagnosis
    Irrelevant: The answer includes information that is not directly relevant to the diagnosis
    Using the above scale answers were categorized as one of: complete/relevant, complete/irrelevant, incomplete/relevant, and incomplete/irrelevant [15].
    Discrepancies between raters were resolved through discussion and consensus. In order to assess the inter-rater reliability of our outcomes, we used Cohen’s Kappa coefficient. This statistical measure evaluates the agreement between two raters who each classify items into mutually exclusive categories. It is particularly useful in this study, as it accounts for any agreement that might occur by chance, which is important given the variability of responses from ChatGPT.
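
As a worked illustration of the outcome calculations described above, the following is a minimal sketch, not the authors' analysis code: it applies the four formulas, builds a ROC/AUC by iterating thresholds, and computes Cohen's kappa. Function names are ours and the inputs are placeholders.

    # Minimal sketch of the outcome calculations described above.

    def diagnostic_metrics(tp, fp, tn, fn):
        """Accuracy, precision, sensitivity, and specificity per the formulas above."""
        total = tp + fp + tn + fn
        return {
            "accuracy": (tp + tn) / total,
            "precision": tp / (tp + fp),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }

    def roc_auc(scores, labels):
        """AUC from a ROC curve built by iterating thresholds over model scores.

        scores: score/probability that a given answer option is the correct diagnosis
        labels: 1 if the option truly is the correct diagnosis, 0 otherwise
        """
        pos = sum(labels)
        neg = len(labels) - pos
        points = [(0.0, 0.0), (1.0, 1.0)]
        for t in sorted(set(scores), reverse=True):
            tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
            points.append((fp / neg, tp / pos))  # (FPR, TPR) at this threshold
        points.sort()
        # Trapezoidal integration of TPR over FPR
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    def cohens_kappa(rater_a, rater_b):
        """Chance-corrected agreement between two raters over the same items."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                       for c in set(rater_a) | set(rater_b))
        return (observed - expected) / (1 - expected)

    # Example call with the counts reported in the Results:
    # diagnostic_metrics(tp=73, fp=77, tn=373, fn=77)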

Content analysis

A content analysis was conducted on ChatGPT’s responses to identify patterns of strength and weakness. This analysis focused on the model’s ability to rule out specific differential diagnoses, provide reasonable diagnostic steps, and interpret laboratory values, specialized diagnostic testing, and imaging results. Additionally, we assessed the model’s ability to consider key information relevant to the diagnosis.

Data analysis

Results

A total of 150 Medscape cases were included in the analysis (see Table 1), with a total of 600 answer options (four per case) provided to ChatGPT.

Table 1. Summary of ChatGPT’s performance on MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.t001

Primary outcome

ChatGPT provided correct answers for 74/150 (49%) of cases (Fig 1). In 92/150 (61%) of cases, ChatGPT provided the same answer as the majority of Medscape users for that question.

Fig 1. Percentage of correct answers, most common answer and correct answer despite the majority incorrect by ChatGPT 3.5 with MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.g001

Secondary outcomes

Diagnostic accuracy.

There were a total of 150 questions, each with four multiple-choice options, resulting in 600 possible answers with only one correct answer per question. We found true positives for 73/600 (12%), false positives for 77/600 (13%), true negatives for 373/600 (62%), and false negatives for 77/600 (13%) (Fig 2). ChatGPT demonstrated an accuracy of 74% and a precision of 49%. Its sensitivity was 49%, while its specificity was 83%. The AUC for the ROC was 0.66.
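
Substituting these counts into the formulas from the Methods reproduces the reported figures (values rounded):

    Accuracy: (73 + 373) / 600 = 74.3%
    Precision: 73 / (73 + 77) = 48.7%
    Sensitivity: 73 / (73 + 77) = 48.7%
    Specificity: 373 / (373 + 77) = 82.9%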

Fig 2. Confusion matrix evaluating the diagnostic accuracy of ChatGPT 3.5, considering each answer within the 150 MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.g002

Cognitive load.

Out of the 150 responses, 78/150 (52%) were categorized as low cognitive load, 61/150 (41%) were found to be of moderate cognitive load, and 11/150 (7%) were classified as high cognitive load (Fig 3).

Fig 3. Receiver Operator Curve (ROC) for the diagnostic accuracy of ChatGPT 3.5 answers within 150 MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.g003

Quality of medical information.

Responses were complete and relevant for 78/150 (52%) of cases. No responses (0/150, 0%) were complete but irrelevant, 64/150 (43%) were deemed incomplete yet relevant, and 8/150 (5%) were classified as both incomplete and irrelevant (Fig 4).

Fig 4. Cognitive load of ChatGPT 3.5 answers given in response to 150 MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.g004

Cohen’s kappa for diagnostic accuracy, cognitive load, and quality of medical information was 0.78 (substantial inter-rater reliability), 0.64 (substantial inter-rater reliability), and 1.0 (perfect agreement), respectively (Fig 5).

Fig 5. Quality of medical answers given by ChatGPT 3.5 response to 150 MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.g005

Content analysis

We collated the main strengths of ChatGPT’s responses into four themes: clinical rationale, identifying pertinent positives and negatives, ruling out differential diagnoses, and suggesting future investigations. A representative sample of responses, including rationale, is provided in Table 2.

Table 2. Qualitative analysis of the strengths associated with ChatGPT’s answers in response to MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.t002

The model’s main weaknesses were categorized into: misinterpretation of numerical values, inability to handle images, difficulty with nuanced diagnoses, hallucinations, and neglected information (Table 3).

Table 3. Qualitative analysis and examples of the weaknesses associated with ChatGPT’s answers in response to MedScape clinical case challenges.

https://doi.org/10.1371/journal.pone.0307383.t003

Discussion

Diagnostic accuracy

ChatGPT demonstrated a case accuracy of 49%, an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, and a specificity of 82.89%. ChatGPT’s AUC was 0.66, indicating moderate discriminative ability between correct and incorrect diagnoses.

In our assessment of ChatGPT’s diagnostic accuracy in 150 complex clinical cases from Medscape, it is important to distinguish between case accuracy and overall accuracy. Case accuracy, which reflects the proportion of cases where the model correctly identified the single correct answer, stood at 49%. However, the overall accuracy, which considers the model’s success in correctly rejecting incorrect options across all multiple-choice elements, reached 74.33%. This higher value is due to ChatGPT’s ability to identify true negatives (incorrect options), which significantly contributes to the overall accuracy, enhancing its utility in eliminating incorrect choices. This difference highlights ChatGPT’s high specificity, indicating that it excels at ruling out incorrect diagnoses. However, it needs improvement in precision and sensitivity to reliably identify the correct diagnosis. Precision and sensitivity are crucial for a diagnostic tool because missed diagnoses can lead to significant consequences for patients, such as the lack of necessary treatments or further diagnostic testing, resulting in worse health outcomes.

Overall, these results raise concerns about its accuracy as a diagnostic and education tool for clinicians and medical learners. Several factors led to ChatGPT’s mediocre performance in diagnosing complex clinical cases. Its training data is sourced from diverse texts like books, articles, and websites [8]. These sources offer the AI model a broad understanding of everyday topics and English language nuances but may lack in-depth knowledge in specialized fields like medicine, hindering its ability to diagnose complex cases [16]. Additionally, the training data only includes information up until September 2021 [8]. As a result, recent advancements in various fields may not be reflected in ChatGPT’s knowledge, potentially leading to outdated or inaccurate information being provided by the AI model. To improve diagnostic accuracy, it is crucial that ChatGPT’s training data be augmented with up-to-date, specialized medical information and that the model’s architecture be adapted to handle the nuances of clinical case analysis better.

ChatGPT produced a considerable number of false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool in clinical practice. In the context of false positives and false negatives, it is crucial to consider the role of AI hallucinations, as they can significantly impact the accuracy of the information given [17]. Hallucinations refer to outputs generated by an AI model that seem coherent but are not based on factual information, arising from biases, errors, or over-optimization in the model’s training data or its inability to accurately decipher ambiguous or incomplete input data [16]. False positives occur when the AI model incorrectly identifies a condition or disease that is not present, which could lead to unnecessary treatments or interventions that may cause undue stress and anxiety. False negatives occur when an AI model fails to identify a condition or disease that is present, potentially delaying necessary treatments or interventions and allowing worse outcomes. AI hallucinations contribute partially to the emergence of false positives and false negatives, emphasizing the importance of refining AI models’ training and enhancing their capacity to process intricate information. By doing so, we can potentially improve diagnostic accuracy and reduce the influence of AI hallucinations on medical diagnoses and decision-making processes [17].

Completeness and relevance of medical answers

ChatGPT’s extensive training on diverse textual data has enabled it to generate complete and coherent responses with proficiency in grammar, context, and a wide range of topics [8]. In most cases, the responses produced by ChatGPT were either complete and relevant (78/150; 52%) or incomplete but still relevant (64/150; 43%) to the user’s inquiry. However, despite its capabilities, ChatGPT may still produce irrelevant responses due to factors such as lack of true understanding, ambiguity or insufficient input, and over-optimization for coherence [16].

Despite ChatGPT’s proficiency in pattern-matching and generating text based on those patterns, its lack of genuine understanding of the content may result in incomplete answers [18]. In some instances, the AI model produces responses that, while syntactically correct and logical in appearance, only partially address the core issue or question. This can be attributed to the model’s struggle to grasp broader context or nuances, such as the interconnectedness of symptoms, patient history, and risk factors. When crucial information is overlooked or relevant details are not connected, the generated answers might be incomplete, not fully meeting the user’s needs or expectations. However, these incomplete answers can still hold some relevance to the topic at hand, providing users with partial information or guidance that could be of value.

In the context of medical learners, ChatGPT’s ability to generate incomplete but still relevant answers can provide valuable insights and learning opportunities. Although the AI model may not always deliver a comprehensive response, the partial information it offers can still contribute to the learner’s understanding of various medical concepts, symptoms, patient histories, or risk factors. These relevant fragments can encourage medical learners to actively engage in critical thinking and problem-solving, prompting them to seek further information to fill in the gaps and develop a more comprehensive understanding of the subject matter. In this way, ChatGPT may hold potential as a supplementary learning tool, however, learners and educators must be wary of the potential for inaccuracy and concepts should be cross-referenced from trusted sources.

In real-world clinical settings, patient information can be ambiguous, incomplete or even incorrect, which poses a challenge for ChatGPT [19]. While patients may not provide all the details relevant to their clinical case, a human healthcare provider can make inferences and use their medical knowledge to put ambiguous details into context, helping them to make informed medical decisions [20]. In contrast, ChatGPT may struggle to make these inferences and as a result, generate irrelevant responses due to an over-reliance on the information provided. As a result, while ChatGPT may assist healthcare providers, it cannot yet replace the expertise and judgment of a human provider [21]. A human healthcare provider can also take into account nonverbal cues and recognize when a patient may omit or miss important details that could affect their diagnosis or treatment [22]. These factors are not easily captured in text-based interactions, making human expertise essential in the diagnostic and treatment process.

Cognitive load

ChatGPT tends to generate responses with low (77/150; 51%) to moderate (61/150; 41%) cognitive load, emphasizing accessibility and readability for users. This characteristic may be advantageous for novice medical students, as it facilitates improved learner engagement and information retention [23]. However, the combination of this ease of understanding with potentially incorrect or irrelevant information can result in misconceptions and a false sense of comprehension. This issue poses a significant challenge for ChatGPT’s application as a medical education tool, as the efficacy of the tool is heavily influenced by the learner’s preexisting knowledge, expertise, and cognitive capacity. In the absence of tailored approaches for these factors, ChatGPT may hinder learners’ ability to apply their knowledge in complex or unfamiliar situations. Addressing this limitation necessitates the development of adaptive algorithms to adjust cognitive load levels based on individual users and the integration of supplementary resources to ensure a comprehensive understanding of the content [24]. Consequently, it is crucial to exercise caution and verify information when relying on ChatGPT for medical inquiries.

Content analysis strengths and weakness

Our analysis revealed several key limitations in ChatGPT’s diagnostic capabilities. First, the model had difficulty interpreting numerical values, likely due to its reliance on context and language patterns learned during training, which occasionally led to overlooking or misinterpreting critical lab values [9]. ChatGPT’s inability to evaluate laboratory images hindered its diagnostic performance, especially when such images were vital for accurate diagnosis.

ChatGPT also struggled to distinguish between diseases with subtly different presentations, and it occasionally generated incorrect or implausible information, known as AI hallucinations, emphasizing the risk of relying solely on ChatGPT for medical guidance and the necessity of human expertise in the diagnostic process [17].

Finally, ChatGPT sometimes ignored key information relevant to the diagnosis. The lack of contextualizing all given information highlights the importance of human input in ensuring critical information is considered during the diagnostic process [21].

Ethical considerations

As technology becomes increasingly integrated into healthcare, with Electronic Medical Records (EMRs) and other digital tools becoming commonplace, the imperative to securely manage sensitive medical data has never been more critical. Patient privacy and data security are not just ethical imperatives but also crucial for maintaining trust in medical systems [21]. However, as we integrate more advanced technologies into healthcare, new challenges emerge. One significant concern is the potential for algorithms to perpetuate existing biases present in their training data. The selection of this data, often influenced by human biases, can inadvertently reinforce disparities in medical diagnoses and treatment plans, further exacerbating racial and other disparities in healthcare outcomes [25]. Moreover, while AI can provide valuable insights, the importance of human oversight cannot be overstated. Physicians must consider the broader clinical context and individual patient needs, recognizing that the most statistically accurate diagnosis or treatment plan might not align with a patient’s cultural or religious values [21]. As AI’s role in healthcare grows, so does the need for a clear legal framework addressing liability. Questions arise regarding responsibility for misdiagnosis: Should the onus lie with AI development teams, the physicians who rely on these tools, or a combination of both? As we navigate these complexities, the overarching goal remains to ensure that AI serves as a tool to enhance, not replace, the human touch in medicine.

Limitations

There are several limitations to consider in this study on ChatGPT’s use in medical education. First, our study focused on a single AI model (ChatGPT model 3.5), which may not be representative of other AI models or future iterations of ChatGPT. Our study only utilized Medscape Clinical Challenges, which, while complex and diverse, may not cover all aspects of medical education [12]. The initial approach was to develop a meaningful list of cases encompassing other aspects of medicine, such as management, pharmacotherapy, and pathophysiology; however, the 150 cases from Medscape primarily focus on differential diagnosis [12]. Finally, the input and prompt standardization process relied on the expertise of the authors, and alternative methods of standardization could potentially influence the model’s performance. Future studies should explore different AI models, case sources, and educational contexts to further assess the utility of AI in medical education.

Future perspectives

ChatGPT has gained significant popularity as a teaching tool in medical education [26]. Its access to extensive medical knowledge combined with its ability to deliver real-time, unique, insightful responses is invaluable. In conjunction with traditional teaching methods, ChatGPT can help students bridge gaps in knowledge and simplify complex concepts by delivering instantaneous and personalized answers to clinical questions [27–31]. However, the use of ChatGPT in medical education poses challenges; outdated databases and hallucinations can lead to the dissemination of inaccurate and misleading information to students [32–35]. To overcome this problem, we foresee future advancements in other LLMs, either trained on medical literature or integrated with real-time medical databases. These specialized models would offer users the advantage of access to accurate medical knowledge and up-to-date clinical guidelines. Beyond its integration, it is important to explore the long-term implications of using LLMs, such as ChatGPT, in health care and medical education. Although numerous studies, including ours, have evaluated ChatGPT for medical education, further research is essential to determine the quality and efficacy of ChatGPT as a tool in this field [32–35].

Future research should focus on discerning the competency of medical professionals who are over-reliant on ChatGPT, assessing patient confidence in AI-supported diagnoses, and evaluating their overall impact on clinical outcomes. These future studies will aid in the development of guidelines for integrating AI into both medical education and clinical practice. While many agree that there is an urgent need for appropriate guidelines and regulations for the application of ChatGPT in healthcare and medical education, it is equally important to proceed cautiously, ensuring that LLMs like ChatGPT are implemented in a responsible and ethical manner [26, 36, 37]. As we learn to embrace AI in healthcare, further research in this field will shape the future of patient care and medical training. It is imperative to proceed with caution when using ChatGPT as a diagnostic tool or teaching aid, and to ensure that it is used responsibly and ethically.

Conclusion

The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may be misleading [24]. While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial inter-rater reliability, they also reveal the tool’s shortcomings in providing factually correct medical information, as evidenced by its low diagnostic accuracy. Additional research should focus on enhancing the accuracy and dependability of ChatGPT as a diagnostic instrument. Integrating ChatGPT into medical education and clinical practice necessitates a thorough examination of its educational and clinical limitations. Transparent guidelines should be established for ChatGPT’s clinical usage, and medical students and clinicians should be trained on how to effectively and responsibly employ the tool.

References

  1. Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ: Prentice Hall; 2010.
  2. Khurana D, Koli A, Khatter K, et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023;82:3713–3744. pmid:35855771
  3. Friedman C, Hripcsak G. Natural language processing and its future in medicine. Acad Med. 2013;84(8):890–5.
  4. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008;17(1):128–44. pmid:18660887
  5. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221–30. pmid:24201027
  6. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33.
  7. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. pmid:30617335
  8. OpenAI. ChatGPT [Internet]. 2021 [cited 2023 Apr 11]. Available from: https://www.openai.com/research/chatgpt/
  9. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. 2022.
  10. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. J Med Syst. 2023;47(1):33. pmid:36869927
  11. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. OpenAI Blog. 2018;1.
  12. Medscape. Clinical challenges [Internet]. [cited 2023 Apr 11]. Available from: https://www.medscape.com/casechallengehub
  13. Deeks JJ, Altman DG, Gatsonis C. Cochrane handbook for systematic reviews of diagnostic test accuracy. Cochrane book series; 2004.
  14. Paas F, van Merriënboer JJ. Cognitive-load theory: Methods to manage working memory load in the learning of complex tasks. Current Directions in Psychological Science. 2020;29(4):394–8.
  15. Demner-Fushman D, Lin J. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics. 2007;33(1):63–103.
  16. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A review of challenges and opportunities in machine learning for health. AMIA Jt Summits Transl Sci Proc. 2020;2020:191–200. pmid:32477638
  17. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023 Feb 19;15(2). pmid:36811129
  18. Shortliffe EH, Sepúlveda MJ. Clinical Decision Support in the Era of Artificial Intelligence. JAMA. 2018;320(21):2199–2200. pmid:30398550
  19. Hollander JE, Carr BG. Virtually Perfect? Telemedicine for Covid-19. N Engl J Med. 2020;382(18):1679–1681. pmid:32160451
  20. Eva KW. What every teacher needs to know about clinical reasoning. Med Educ. 2005;39(1):98–106. pmid:15612906
  21. Char DS, Shah NH, Magnus D, Hsiao AL, Scherer RW. Implementing machine learning in health care—addressing ethical challenges. N Engl J Med. 2018;378(11):981–3. pmid:29539284
  22. Matheny ME, Whicher D, Thadaney Israni S. Artificial intelligence in health care: A report from the National Academy of Medicine. JAMA. 2019;323(6):509–10.
  23. Wartman SA, Combs CD. Reimagining Medical Education in the Age of AI. AMA J Ethics. 2019;21(2):E146–152. pmid:30794124
  24. Miller DD, Brown EW. Artificial Intelligence in Medical Practice: The Question to the Answer? Am J Med. 2018;131(2):129–133. pmid:29126825
  25. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. pmid:31649194
  26. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023;11(6):887. pmid:36981544
  27. Khan A, Jawaid M, Khan A, Sajjad M. ChatGPT-Reshaping medical education and clinical management. Pak J Med Sci. 2023;39:605–7. pmid:36950398
  28. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
  29. Gunawan J. Exploring the future of nursing: Insights from the ChatGPT model. Belitung Nurs J. 2023;9:1–5. pmid:37469634
  30. Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med. 2019;380(14):1347–1358. pmid:30943338
  31. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: Five priorities for research. Nature. 2023;614:224–6. pmid:36737653
  32. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of its Successes and Shortcomings. medRxiv. 2023; Preprint. pmid:37334036
  33. Ahn C. Exploring ChatGPT for information of cardiopulmonary resuscitation. Resuscitation. 2023;185:109729. pmid:36773836
  34. Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: A descriptive study. J Educ Eval Health Prof. 2023;20:1.
  35. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an Adjunct for Radiologic Decision-Making. medRxiv. 2023.
  36. Alberts IL, Mercolli L, Pyka T, Prenosil G, Shi K, Rominger A, et al. Large language models (LLM) and ChatGPT: What will the impact on nuclear medicine be? Eur J Nucl Med Mol Imaging. 2023; Online ahead of print.
  37. Sallam M, Salim NA, Barakat M, Al-Tammemi AB. ChatGPT applications in medical, dental, pharmacy, and public health education: A descriptive study. Narra J. 2023;3:e103.