Abstract
Background
ChatGPT is a large language model (LLM) trained on over 400 billion words from books, articles, and websites. Its extensive training draws on a large body of information, suggesting potential value as a diagnostic aid. Moreover, its capacity to comprehend and generate human language allows medical trainees to interact with it, enhancing its appeal as an educational resource. This study aims to investigate ChatGPT’s diagnostic accuracy and utility in medical education.
Methods
A total of 150 Medscape case challenges (published between September 2021 and January 2023) were input into ChatGPT. The primary outcome was the number (%) of cases for which the answer given was correct. Secondary outcomes included diagnostic accuracy, cognitive load, and quality of medical information. A qualitative content analysis was also conducted to assess its responses.
Results
ChatGPT answered 49% (74/150) of cases correctly. It had an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, a specificity of 82.89%, and an AUC of 0.66. Most answers were of low cognitive load (52%, 78/150), and most answers were complete and relevant (52%, 78/150).
Discussion
ChatGPT in its current form is not accurate as a diagnostic tool. It does not necessarily provide factually correct information, despite the vast amount of data it was trained on. Based on our qualitative analysis, ChatGPT struggles with the interpretation of laboratory values and imaging results and may overlook key information relevant to the diagnosis. However, it still offers utility as an educational tool. ChatGPT was generally correct in ruling out a specific differential diagnosis and providing reasonable next diagnostic steps. Additionally, its answers were easy to understand, showcasing a potential benefit in simplifying complex concepts for medical learners. Our results should guide future research into harnessing ChatGPT’s potential educational benefits, such as simplifying medical concepts and offering guidance on differential diagnoses and next steps.
Citation: Hadi A, Tran E, Nagarajan B, Kirpalani A (2024) Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians. PLoS ONE 19(7): e0307383. https://doi.org/10.1371/journal.pone.0307383
Editor: Fateen Ata, Hamad Medical Corporation, QATAR
Received: April 25, 2023; Accepted: July 3, 2024; Published: July 31, 2024
Copyright: © 2024 Hadi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Artificial Intelligence (AI) refers to computer systems that can perform tasks that require human intelligence, such as visual perception, decision-making, and language understanding [1]. Natural Language Processing (NLP), a crucial field in AI, focuses on the interaction between human language and computer systems [2]. NLP algorithms are capable of analyzing and generating human language, making them valuable tools in various sectors, including healthcare [2].
In the healthcare sector, NLP can be applied in several ways, such as in clinical documentation, coding and billing, monitoring drug safety, and keeping track of patients [3–5]. Large Language Models (LLMs) are a type of NLP model that can perform various language tasks, such as text completion, summarization, translation, and question-answering [6]. LLMs are trained on massive amounts of text data and can generate human-like responses to natural language queries [6, 7].
ChatGPT is a Large Language Model (LLM) developed by OpenAI, capable of performing a diverse array of natural language tasks [8]. At the moment, ChatGPT is arguably the most well-known, commercially available LLM. Its widespread accessibility appeals to a broad audience, including medical trainees and physicians, who are likely to be curious about its performance in a clinical setting.
A study recently found that ChatGPT was able to accurately answer biomedical and clinical questions on the United States Medical Licensing Examination (USMLE) at a level that approached or exceeded the passing threshold [9]. The study also found that ChatGPT’s accuracy was characterized by high concordance and density of insight, indicating its potential to generate novel insights and assist in medical education [9]. While these results have ignited discussions around potential implications for ChatGPT in healthcare, they also highlight the potential use of this tool in medical education. Whereas the ability of ChatGPT to answer concise, encyclopedic questions has been studied, the quality of its responses to complex medical cases remains unclear [10].
In this study, we aim to evaluate ChatGPT’s performance as a diagnostic tool for complex clinical cases to explore its diagnostic accuracy, the cognitive load of its answers, and the overall relevance of its responses. We aim to understand the potential benefits and limitations of ChatGPT in clinical education.
ChatGPT is powered by Generative Pre-trained Transformer (GPT) 3.5, an LLM trained on a massive text dataset of over 400 billion words from the internet, including books, articles, and websites [8]. However, this dataset is private and therefore lacks transparency, as users have no convenient means to validate the accuracy or the source of the information being generated. We therefore also conducted a qualitative analysis to evaluate the quality of the medical information ChatGPT provides.
While ChatGPT is able to generate novel responses that closely resemble natural human language [11], it lacks genuine comprehension of the content it receives or produces.
Once again, this underscores the importance of evaluating the responses provided by ChatGPT. While responses may sound grammatically correct and offer correct medical information, it is essential to assess their overall relevance to the medical question at hand so as not to mislead medical trainees.
Medscape Clinical Challenges include complex cases that are designed to challenge the knowledge and diagnostic skills of healthcare professionals [12]. The cases are often based on real-world scenarios and may involve multiple comorbidities, unusual presentations, and diagnostic dilemmas [12]. By employing these challenges, we can evaluate ChatGPT’s ability to answer medical queries, diagnose conditions, and select appropriate treatment plans in a context that closely resembles actual clinical practice [13].
Materials and methods
Artificial intelligence
ChatGPT operates as a server-based language model that does not retrieve information from the internet at inference time. All responses are generated in real time from the abstract associations between words ("tokens") encoded in the neural network. This constraint mirrors real-life clinical settings, where professionals do not always have the freedom to easily access additional scientific literature, and it allows us to accurately evaluate ChatGPT’s own knowledge.
Input source
We tested the performance of ChatGPT in answering Medscape Clinical Challenges. These complex cases are designed to challenge the knowledge and diagnostic skills of healthcare professionals [12]. Each challenge presents a clinical scenario that includes patient history, physical examination findings, and laboratory or imaging results. Healthcare professionals are required to make a diagnosis or choose an appropriate treatment plan from multiple-choice options [12]. Feedback is provided after each answer, with explanations of the correct diagnosis and treatment plan. The distribution of answer options selected by Medscape users is also provided. This feedback mechanism allows an accurate evaluation of ChatGPT’s responses against the correct answers and also allows us to directly compare its reasoning and decision-making to that of healthcare professionals.
Medscape’s Case Challenges were selected because they are open-source and freely accessible. To prevent any possibility of ChatGPT having prior knowledge of the cases, only those authored after the cut-off of its NLP model’s training data were included. This deliberate selection ensures that the chatbot had not been trained on these specific cases beforehand and that each case presented is entirely novel to ChatGPT, so it could not already know the answers.
Data collection
Data were collected by three authors, all medical trainees (A.H, B.N, and E.T), and all content was reviewed by a staff physician (A.K). We felt it was most appropriate for medical trainees to be the primary evaluators of ChatGPT’s responses, given that medical trainees are the group most likely to rely heavily on it as an external resource. The three authors (A.H, B.N, and E.T) used publicly available clinical case challenges from Medscape published between September 2021 and January 2023, after the training date of ChatGPT model 3.5. A total of 150 Medscape cases were analyzed; cases were randomized amongst the three authors, with each case reviewed by at least two authors. We excluded any cases with visual assets, such as clinical images, medical photography, and graphs, to ensure consistency of the input format for ChatGPT.
Input and prompt standardization
To ensure consistency in the input provided to ChatGPT, the three independent reviewers transformed the case challenge content into one standardized prompt. Each prompt included an unbiased script stating what we wanted from the output, followed by the relevant case presentation and the multiple-choice answers. This standardization of prompts ensures consistent and reproducible responses across different users and addresses OpenAI’s restriction on using ChatGPT for healthcare advice.
Prompts were standardized as follows, with all information available in the data extraction supplementary file (an illustrative sketch of how these elements can be assembled is given after the template):
Prompt 1: I’m writing a literature paper on the accuracy of CGPT of correctly identified a diagnosis from complex, WRITTEN, clinical cases. I will be presenting you a series of medical cases and then presenting you with a multiple choice of what the answer to the medical cases.
Prompt 2: Come up with a differential and provide rationale for why this differential makes sense and findings that would cause you to rule out the differential. Here are your multiple choice options to choose from and give me a detailed rationale explaining your answer.
[Insert multiple choices]
[Insert all Case info]
[Insert radiology description]
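As an illustration only (not the authors’ actual workflow, in which prompts were pasted manually into the ChatGPT interface), the sketch below shows one way the two standardized turns could be assembled from a case’s fields. The dataclass and function names are hypothetical; the ordering follows the template above.

```python
# Minimal, illustrative sketch: assembling the two standardized turns for one case.
# The CaseChallenge fields and build_turns helper are hypothetical; the ordering
# follows the template above (Prompt 2, then choices, case info, and radiology text).
from dataclasses import dataclass

@dataclass
class CaseChallenge:
    multiple_choices: str        # the four answer options, verbatim
    case_info: str               # history, examination, and laboratory findings
    radiology_description: str   # written description of any imaging

def build_turns(prompt_1: str, prompt_2: str, case: CaseChallenge) -> list:
    """Return the ordered messages for one case: framing turn, then case turn."""
    case_turn = "\n\n".join(
        [prompt_2, case.multiple_choices, case.case_info, case.radiology_description]
    )
    return [prompt_1, case_turn]
```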
ChatGPT interaction and data extraction
The standardized prompts were input into ChatGPT using the legacy model 3.5, and the model generated responses containing the suggested answer to the case challenge as well as background information on the disease, reasons for ruling in the diagnosis, and reasons for ruling out other diagnoses.
Primary outcome assessment
Each case was evaluated by at least two independent raters (A.H, B.N, or E.T) who were blinded to each other’s responses. ChatGPT’s responses were extracted, and the primary outcome was analyzed as the percentage of cases for which the answer given was correct.
Secondary outcome assessment
All cases were evaluated by at least two independent raters (A.H, B.N or E.T). To assess secondary outcomes, we employed three validated medical education evaluation scales:
- Diagnostic Accuracy: The raters assessed the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) rates of ChatGPT’s answers, considering the suggested differentials and the final diagnosis provided. Each case had four answer options, and ChatGPT’s explanation for each of the four answer options was categorized as either true or false, positive or negative [13]. We then calculated the accuracy, precision, sensitivity, and specificity as shown below:
Accuracy: (TP + TN)/Total Responses
Precision: TP/ (TP + FP)
Sensitivity: TP/ (TP + FN)
Specificity: TN / (TN + FP)
To further evaluate the model’s performance, we generated a Receiver Operating Characteristic (ROC) curve and calculated the Area Under the Curve (AUC). This involved collecting model scores or probabilities for each instance, sorting instances by score, iterating over thresholds to calculate the True Positive Rate (TPR) and False Positive Rate (FPR) at each threshold, plotting FPR against TPR to create the ROC curve, and computing the AUC to quantify the model’s discriminative ability. This analysis provided both a visual representation and a scalar measure of the model’s diagnostic performance (a computational sketch of these metrics and the ROC/AUC calculation is given after this list).
- Cognitive Load: The raters evaluated the cognitive load of ChatGPT’s answers as low, moderate, or high, based on the complexity and clarity of the information provided, according to the following scale [14]:
Low cognitive load: The answer is easy to understand and requires minimal cognitive effort to process
Moderate cognitive load: The answer requires moderate cognitive effort to process
High cognitive load: The answer is complex and requires significant cognitive effort to process
- Quality of Medical Information: The raters assessed the quality of the medical information provided by ChatGPT according to the following criteria:
Complete: The answer includes all relevant information for making an accurate diagnosis
Incomplete: The answer is missing some relevant information for making an accurate diagnosis
Relevant: The answer includes information that is directly relevant to the diagnosis
Irrelevant: The answer includes information that is not directly relevant to the diagnosis
Using the above scale, answers were categorized as one of: complete/relevant, complete/irrelevant, incomplete/relevant, or incomplete/irrelevant [15].
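To make the scoring concrete, the sketch below shows one way the confusion-matrix metrics and the ROC/AUC described above could be computed. It is a minimal illustration using scikit-learn; the counts, labels, and scores are placeholders, not study data, and this is not the authors’ analysis code.

```python
# Minimal, illustrative sketch (not the authors' analysis code): confusion-matrix
# metrics and ROC/AUC. All counts, labels, and scores below are placeholders.
from sklearn.metrics import roc_curve, roc_auc_score

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, and specificity from raw counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

print(confusion_metrics(tp=10, fp=10, tn=70, fn=10))  # placeholder counts

# ROC/AUC: y_true marks whether each answer option was the correct diagnosis;
# y_score is a score or probability assigned to ChatGPT endorsing that option.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]                       # placeholder labels
y_score = [0.9, 0.2, 0.4, 0.1, 0.3, 0.6, 0.2, 0.1]      # placeholder scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)       # one point per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")
```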
Discrepancies between raters were resolved through discussion and consensus. In order to assess the inter-rater reliability of our outcomes, we used Cohen’s Kappa coefficient. This statistical measure evaluates the agreement between two raters who each classify items into mutually exclusive categories. It is particularly useful in this study, as it accounts for any agreement that might occur by chance, which is important given the variability of responses from ChatGPT.
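As a minimal illustration of this calculation (again, not the authors’ code), scikit-learn’s cohen_kappa_score computes the kappa coefficient between two raters’ categorical ratings; the ratings below are placeholders.

```python
# Minimal, illustrative sketch: Cohen's kappa between two raters' categorical
# ratings (placeholder data, e.g. cognitive-load categories).
from sklearn.metrics import cohen_kappa_score

rater_1 = ["low", "moderate", "low", "high", "low", "moderate"]
rater_2 = ["low", "moderate", "moderate", "high", "low", "moderate"]
print(f"Cohen's kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")
```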
Content analysis
A content analysis was conducted on ChatGPT’s responses to identify patterns of strength and weakness. This analysis focused on the model’s ability to rule out specific differential diagnoses, provide reasonable diagnostic steps, and interpret laboratory values, specialized diagnostic testing, and imaging results. Additionally, we assessed the model’s ability to consider key information relevant to the diagnosis.
Results
A total of 150 Medscape cases were included in the analysis (see Table 1), with a total of 600 answer options (four per case) provided to ChatGPT.
Primary outcome
ChatGPT provided the correct answer in 74/150 (49%) of cases (Fig 1). In 92/150 (61%) of cases, ChatGPT selected the same answer as the majority of Medscape users for the same question.
Secondary outcomes
Diagnostic accuracy.
There were a total of 150 questions, each with four multiple-choice options, yielding 600 possible answers with only one correct answer per question. We found true positives for 73/600 (12%), false positives for 77/600 (13%), true negatives for 373/600 (62%), and false negatives for 77/600 (13%) (Fig 2). ChatGPT demonstrated an accuracy of 74% and a precision of 49%. Its sensitivity was 49%, while it achieved a specificity of 83%. The AUC for the ROC curve was 0.66.
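For reference, the reported metrics follow directly from these counts: overall accuracy = (73 + 373)/600 ≈ 74.3%, precision = 73/(73 + 77) ≈ 48.7%, sensitivity = 73/(73 + 77) ≈ 48.7%, and specificity = 373/(373 + 77) ≈ 82.9%.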
Cognitive load.
Out of the 150 responses, 78/150 (52%) were categorized as low cognitive load, 61/150 (41%) as moderate cognitive load, and 11/150 (7%) as high cognitive load (Fig 3).
Quality of medical information.
Responses were complete and relevant in 78/150 (52%) of cases. No responses (0/150, 0%) were complete but irrelevant, 64/150 (43%) were deemed incomplete yet relevant, and 8/150 (5%) were classified as both incomplete and irrelevant (Fig 4).
Cohen’s kappa for diagnostic accuracy, cognitive load, and quality of medical information was 0.78 (substantial inter-rater reliability), 0.64 (substantial inter-rater reliability), and 1.0 (perfect agreement), respectively (Fig 5).
Content analysis
We collated the main strengths of ChatGPT’s responses into four themes: clinical rationale, identifying pertinent positives and negatives, ruling out differential diagnoses, and suggesting future investigations. A representative sample of responses, including the rationale, is provided in Table 2.
The model’s main weaknesses were categorized as: misinterpretation of numerical values, inability to handle images, difficulty with nuanced diagnoses, hallucinations, and neglected information (Table 3).
Discussion
Diagnostic accuracy
ChatGPT demonstrated a case accuracy of 49%, an overall accuracy of 74%, a precision of 48.67%, a sensitivity of 48.67%, and a specificity of 82.89%. Its AUC was 0.66, indicating moderate ability to discriminate between correct and incorrect diagnoses.
In assessing ChatGPT’s diagnostic accuracy across 150 complex clinical cases from Medscape, it is important to distinguish between case accuracy and overall accuracy. Case accuracy, which reflects the proportion of cases in which the model identified the single correct answer, stood at 49%. Overall accuracy, which considers the model’s success in correctly rejecting incorrect options across all multiple-choice elements, reached 74.33%. This higher value is driven by ChatGPT’s ability to identify true negatives (incorrect options), which contributes substantially to overall accuracy and enhances its utility in eliminating incorrect choices. This difference highlights ChatGPT’s high specificity, indicating that it excels at ruling out incorrect diagnoses. However, it needs improvement in precision and sensitivity to reliably identify the correct diagnosis. Precision and sensitivity are crucial for a diagnostic tool because missed diagnoses can have significant consequences for patients, such as forgoing necessary treatments or further diagnostic testing, resulting in worse health outcomes.
Overall, these results raise concerns about ChatGPT’s accuracy as a diagnostic and educational tool for clinicians and medical learners. Several factors likely contributed to ChatGPT’s mediocre performance in diagnosing complex clinical cases. Its training data are sourced from diverse texts such as books, articles, and websites [8]. These sources offer the AI model a broad understanding of everyday topics and English-language nuances but may lack in-depth knowledge of specialized fields such as medicine, hindering its ability to diagnose complex cases [16]. Additionally, the training data only include information up until September 2021 [8]. As a result, recent advancements in various fields may not be reflected in ChatGPT’s knowledge, potentially leading to outdated or inaccurate information being provided by the AI model. To improve diagnostic accuracy, it is crucial that ChatGPT’s training data be augmented with up-to-date, specialized medical information and that the model’s architecture be adapted to better handle the nuances of clinical case analysis.
ChatGPT provided a considerable number of false positives (13%) and false negatives (13%), which has implications for its use as a diagnostic tool in clinical practice. In the context of false positives and false negatives, it is crucial to consider the role of AI hallucinations, as they can significantly impact the accuracy of the information given [17]. Hallucinations refer to outputs generated by an AI model that seem coherent but are not based on factual information, arising from biases, errors, or over-optimization in the model’s training data or from its inability to accurately decipher ambiguous or incomplete input data [16]. False positives occur when the AI model incorrectly identifies a condition or disease that is not present, which could lead to unnecessary treatments or interventions and cause undue stress and anxiety. False negatives occur when an AI model fails to identify a condition or disease that is present, potentially delaying necessary treatments or interventions and allowing outcomes to worsen. AI hallucinations contribute in part to the emergence of false positives and false negatives, emphasizing the importance of refining AI models’ training and enhancing their capacity to process intricate information. By doing so, we can potentially improve diagnostic accuracy and reduce the influence of AI hallucinations on medical diagnoses and decision-making processes [17].
Completeness and relevance of medical answers
ChatGPT’s extensive training on diverse textual data has enabled it to generate complete and coherent responses with proficiency in grammar, context, and a wide range of topics [8]. In most cases, the responses produced by ChatGPT were either complete and relevant (78/150, 52%) or incomplete but still relevant (64/150, 43%) to the user’s inquiry. However, despite its capabilities, ChatGPT may still produce irrelevant responses due to factors such as a lack of true understanding, ambiguous or insufficient input, and over-optimization for coherence [16].
Despite ChatGPT’s proficiency in pattern-matching and generating text based on those patterns, its lack of genuine understanding of the content may result in incomplete answers [18]. In some instances, the AI model produces responses that, while syntactically correct and logical in appearance, only partially address the core issue or question. This can be attributed to the model’s struggle to grasp broader context or nuances, such as the interconnectedness of symptoms, patient history, and risk factors. When crucial information is overlooked or relevant details are not connected, the generated answers might be incomplete, not fully meeting the user’s needs or expectations. However, these incomplete answers can still hold some relevance to the topic at hand, providing users with partial information or guidance that could be of value.
In the context of medical learners, ChatGPT’s ability to generate incomplete but still relevant answers can provide valuable insights and learning opportunities. Although the AI model may not always deliver a comprehensive response, the partial information it offers can still contribute to the learner’s understanding of various medical concepts, symptoms, patient histories, or risk factors. These relevant fragments can encourage medical learners to actively engage in critical thinking and problem-solving, prompting them to seek further information to fill in the gaps and develop a more comprehensive understanding of the subject matter. In this way, ChatGPT may hold potential as a supplementary learning tool, however, learners and educators must be wary of the potential for inaccuracy and concepts should be cross-referenced from trusted sources.
In real-world clinical settings, patient information can be ambiguous, incomplete or even incorrect, which poses a challenge for ChatGPT [19]. While patients may not provide all the details relevant to their clinical case, a human healthcare provider can make inferences and use their medical knowledge to put ambiguous details into context, helping them to make informed medical decisions [20]. In contrast, ChatGPT may struggle to make these inferences and as a result, generate irrelevant responses due to an over-reliance on the information provided. As a result, while ChatGPT may assist healthcare providers, it cannot yet replace the expertise and judgment of a human provider [21]. A human healthcare provider can also take into account nonverbal cues and recognize when a patient may omit or miss important details that could affect their diagnosis or treatment [22]. These factors are not easily captured in text-based interactions, making human expertise essential in the diagnostic and treatment process.
Cognitive load
ChatGPT tends to generate responses of low (78/150, 52%) to moderate (61/150, 41%) cognitive load, emphasizing accessibility and readability for users. This characteristic may be advantageous for novice medical students, as it facilitates improved learner engagement and information retention [23]. However, the combination of this ease of understanding with potentially incorrect or irrelevant information can result in misconceptions and a false sense of comprehension. This issue poses a significant challenge for ChatGPT’s application as a medical education tool, as the efficacy of the tool is heavily influenced by the learner’s preexisting knowledge, expertise, and cognitive capacity. In the absence of approaches tailored to these factors, ChatGPT may hinder learners’ ability to apply their knowledge in complex or unfamiliar situations. Addressing this limitation necessitates the development of adaptive algorithms that adjust cognitive load levels to individual users and the integration of supplementary resources to ensure a comprehensive understanding of the content [24]. Consequently, it is crucial to exercise caution and verify information when relying on ChatGPT for medical inquiries.
Content analysis strengths and weaknesses
Our analysis revealed several key limitations in ChatGPT’s diagnostic capabilities. First, the model had difficulty interpreting numerical values, likely due to its reliance on context and language patterns learned during training, which occasionally led it to overlook or misinterpret critical laboratory values [9]. Second, ChatGPT’s inability to evaluate laboratory images hindered its diagnostic performance, especially when such images were vital for an accurate diagnosis.
ChatGPT also struggled to distinguish between diseases with subtly different presentations, and it occasionally generated incorrect or implausible information (AI hallucinations), emphasizing the risk of relying solely on ChatGPT for medical guidance and the necessity of human expertise in the diagnostic process [17].
Finally, ChatGPT sometimes ignored key information relevant to the diagnosis. This failure to contextualize all of the given information highlights the importance of human input in ensuring that critical information is considered during the diagnostic process [21].
Ethical considerations
As technology becomes increasingly integrated into healthcare, with Electronic Medical Records (EMRs) and other digital tools becoming commonplace, the imperative to securely manage sensitive medical data has never been more critical. Patient privacy and data security are not just ethical imperatives but also crucial for maintaining trust in medical systems [21]. However, as we integrate more advanced technologies into healthcare, new challenges emerge. One significant concern is the potential for algorithms to perpetuate existing biases present in their training data. The selection of this data, often influenced by human biases, can inadvertently reinforce disparities in medical diagnoses and treatment plans, further exacerbating racial and other disparities in healthcare outcomes [25]. Moreover, while AI can provide valuable insights, the importance of human oversight cannot be overstated. Physicians must consider the broader clinical context and individual patient needs, recognizing that the most statistically accurate diagnosis or treatment plan might not align with a patient’s cultural or religious values [21]. As AI’s role in healthcare grows, so does the need for a clear legal framework addressing liability. Questions arise regarding responsibility for misdiagnosis: Should the onus lie with AI development teams, the physicians who rely on these tools, or a combination of both? As we navigate these complexities, the overarching goal remains to ensure that AI serves as a tool to enhance, not replace, the human touch in medicine.
Limitations
There are several limitations to consider in this study of ChatGPT’s use in medical education. First, our study focused on a single AI model (ChatGPT model 3.5), which may not be representative of other AI models or future iterations of ChatGPT. Second, our study only utilized Medscape Clinical Challenges, which, while complex and diverse, may not cover all aspects of medical education [12]. The initial approach was to develop a meaningful list of cases encompassing other aspects of medicine, such as management, pharmacotherapy, and pathophysiology; however, the 150 cases from Medscape primarily focus on differential diagnosis [12]. Finally, the input and prompt standardization process relied on the expertise of the authors, and alternative methods of standardization could potentially influence the model’s performance. Future studies should explore different AI models, case sources, and educational contexts to further assess the utility of AI in medical education.
Future perspectives
ChatGPT has gained significant popularity as a teaching tool in medical education [26]. Its access to extensive medical knowledge combined with its ability to deliver real-time, unique, insightful responses is invaluable. In conjunction with traditional teaching methods, ChatGPT can help students bridge gaps in knowledge and simplify complex concepts by delivering instantaneous and personalized answers to clinical questions [27–31]. However, the use of ChatGPT in medical education poses challenges; outdated databases and hallucinations can lead to the dissemination of inaccurate and misleading information to students [32–35]. To overcome this problem, we foresee future advancements in other LLMs, either trained on medical literature or integrated with real-time medical databases. These specialized models would offer users the advantage of access to accurate medical knowledge and up-to-date clinical guidelines. Beyond its integration, it is important to explore the long-term implications of using LLMs, such as ChatGPT, in health care and medical education. Although numerous studies, including ours, have evaluated ChatGPT for medical education, further research is essential to determine the quality and efficacy of ChatGPT as a tool in this field [32–35].
Future research should focus on discerning the competency of medical professionals who over-rely on ChatGPT, assessing patient confidence in AI-supported diagnoses, and evaluating the overall impact on clinical outcomes. These future studies will aid in the development of guidelines for integrating AI into both medical education and clinical practice. While many agree that there is an urgent need for appropriate guidelines and regulations for the application of ChatGPT in healthcare and medical education, it is equally important to proceed cautiously, ensuring that LLMs like ChatGPT are implemented in a responsible and ethical manner [26, 36, 37]. As we learn to embrace AI in healthcare, further research in this field will shape the future of patient care and medical training. It is imperative to proceed with caution when using ChatGPT as a diagnostic tool or a teaching aid and to ensure that it is used responsibly and ethically.
Conclusion
The combination of high relevance with relatively low accuracy advises against relying on ChatGPT for medical counsel, as it can present important information that may nonetheless be misleading [24]. While our results indicate that ChatGPT consistently delivers the same information to different users, demonstrating substantial inter-rater reliability, they also reveal the tool’s shortcomings in providing factually correct medical information, as evidenced by its low diagnostic accuracy. Additional research should focus on enhancing the accuracy and dependability of ChatGPT as a diagnostic instrument. Integrating ChatGPT into medical education and clinical practice necessitates a thorough examination of its educational and clinical limitations. Transparent guidelines should be established for ChatGPT’s clinical usage, and medical students and clinicians should be trained on how to employ the tool effectively and responsibly.
References
- 1. Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ: Prentice Hall; 2010.
- 2. Khurana D, Koli A, Khatter K, et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023;82:3713–3744. pmid:35855771
- 3. Friedman C, Hripcsak G. Natural language processing and its future in medicine. Acad Med. 2013;84(8):890–5.
- 4. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008;17(1):128–44. pmid:18660887
- 5. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221–30. pmid:24201027
- 6. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33.
- 7. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. pmid:30617335
- 8. OpenAI. ChatGPT [Internet]. 2021 [cited 2023 Apr 11]. Available from: https://www.openai.com/research/chatgpt/
- 9. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. 2022.
- 10. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. J Med Syst. 2023; 47(1): 33. pmid:36869927
- 11. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. OpenAI Blog. 2018;1.
- 12. Medscape. Clinical challenges [Internet]. [cited 2023 Apr 11]. Available from: https://www.medscape.com/casechallengehub
- 13. Deeks JJ, Altman DG, Gatsonis C. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Cochrane Book Series; 2004.
- 14. Paas F, van Merriënboer JJ. Cognitive-load theory: Methods to manage working memory load in the learning of complex tasks. Current Directions in Psychological Science. 2020;29(4):394–8.
- 15. Demner-Fushman D, Lin J. Answering clinical questions with knowledge-based and statistical techniques. Comput Linguist. 2007;33(1):63–103.
- 16. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A review of challenges and opportunities in machine learning for health. AMIA Jt Summits Transl Sci Proc. 2020;2020:191–200. pmid:32477638
- 17. Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023 Feb 19;15(2). pmid:36811129
- 18. Shortliffe EH, Sepúlveda MJ. Clinical Decision Support in the Era of Artificial Intelligence. JAMA. 2018;320(21):2199–2200. pmid:30398550
- 19. Hollander JE, Carr BG. Virtually Perfect? Telemedicine for Covid-19. N Engl J Med. 2020;382(18):1679–1681. pmid:32160451
- 20. Eva KW. What every teacher needs to know about clinical reasoning. Med Educ. 2005;39(1):98–106. pmid:15612906
- 21. Char DS, Shah NH, Magnus D, Hsiao AL, Scherer RW. Implementing machine learning in health care—addressing ethical challenges. New England Journal of Medicine. 2018;378(11):981–3. pmid:29539284
- 22. Matheny ME, Whicher D, Thadaney Israni S. Artificial intelligence in health care:A report from the National Academy of Medicine. JAMA. 2019;323(6):509–10.
- 23. Wartman SA, Combs CD. Reimagining Medical Education in the Age of AI. AMA J Ethics. 2019;21(2):E146–152. pmid:30794124
- 24. Miller DD, Brown EW. Artificial Intelligence in Medical Practice: The Question to the Answer? Am J Med. 2018;131(2):129–133. pmid:29126825
- 25. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. pmid:31649194
- 26. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023;11(6):887. pmid:36981544
- 27. Khan A, Jawaid M, Khan A, Sajjad M. ChatGPT-Reshaping medical education and clinical management. Pak J Med Sci. 2023;39:605–7. pmid:36950398
- 28. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
- 29. Gunawan J. Exploring the future of nursing: Insights from the ChatGPT model. Belitung Nurs J. 2023;9:1–5. pmid:37469634
- 30. Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med. 2019;380(14):1347–1358. pmid:30943338
- 31. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: Five priorities for research. Nature. 2023;614:224–6. pmid:36737653
- 32. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of its Successes and Shortcomings. medRxiv. 2023; Preprint. pmid:37334036
- 33. Ahn C. Exploring ChatGPT for information of cardiopulmonary resuscitation. Resuscitation. 2023;185:109729. pmid:36773836
- 34. Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: A descriptive study. J Educ Eval Health Prof. 2023;20:1.
- 35. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an Adjunct for Radiologic Decision-Making. medRxiv. 2023.
- 36. Alberts IL, Mercolli L, Pyka T, Prenosil G, Shi K, Rominger A, et al. Large language models (LLM) and ChatGPT: What will the impact on nuclear medicine be? Eur J Nucl Med Mol Imaging. 2023; Online ahead of print.
- 37. Sallam M, Salim NA, Barakat M, Al-Tammemi AB. ChatGPT applications in medical, dental, pharmacy, and public health education: A descriptive study. Narra J. 2023;3:e103.