Abstract
Objectives
As a large language model (LLM) trained on a vast data set, ChatGPT can perform a wide array of tasks without additional training. We evaluated the performance of ChatGPT on postgraduate UK medical examinations through a systematic literature review of its performance in UK postgraduate medical assessments, supplemented by testing its performance on the Membership of the Royal College of Physicians (MRCP) Part 1 examination.
Methods
Medline, Embase and Cochrane databases were searched. Articles discussing the performance of ChatGPT in UK postgraduate medical examinations were included in the systematic review. Information was extracted on exam performance including percentage scores and pass/fail rates.
MRCP UK Part 1 sample paper questions were inserted into ChatGPT-3.5 and -4 four times each and the scores marked against the correct answers provided.
Results
12 studies were ultimately included in the systematic literature review.
ChatGPT-3.5 scored 66.4% and ChatGPT-4 scored 84.8% on the MRCP Part 1 sample paper, 4.4% and 22.8% above the historical pass mark respectively. Both models performed significantly above the historical pass mark for MRCP Part 1, indicating they would likely pass this examination.
ChatGPT-3.5 failed eight of the nine postgraduate exams it attempted, with an average score 5.0% below the pass mark.
ChatGPT-4 passed nine of the eleven postgraduate exams it attempted, with an average score 13.56% above the pass mark. ChatGPT-4 performed significantly better than ChatGPT-3.5 in all examinations on which both models were tested.
Conclusion
ChatGPT-4 performed at above passing level for the majority of UK postgraduate medical examinations it was tested on. ChatGPT is prone to hallucinations, fabrications and reduced explanation accuracy which could limit its potential as a learning tool. The potential for these errors is an inherent part of LLMs and may always be a limitation for medical applications of ChatGPT.
Citation: Vij O, Calver H, Myall N, Dey M, Kouranloo K (2024) Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments. PLoS ONE 19(7): e0307372. https://doi.org/10.1371/journal.pone.0307372
Editor: Thiago P. Fernandes, Federal University of Paraiba, BRAZIL
Received: March 10, 2024; Accepted: July 3, 2024; Published: July 31, 2024
Copyright: © 2024 Vij et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Artificial intelligence (AI) has transformed the way we approach a huge number of industries and tasks ranging from consumer products to computer programming, as well as medicine [1,2]. Previously, the development of clinical AI models required significant time and resources with highly domain-specific training data [3]. However, with the release of ChatGPT by OpenAI in November 2022, the application of AI to many industries, including medicine, became far more accessible [4]. Within the medical field, ChatGPT has potential applications in medical education, clinical reasoning and research [5–7].
ChatGPT is a large language model (LLM) powered by OpenAI’s Generative Pre-trained Transformer (GPT)-3.5 or -4 and was seen as a major breakthrough in AI [8]. Previous iterations of AI language models were largely based on sequential neural networks, such as recurrent neural networks (RNN) and long short-term memory (LSTM) networks [9–12]. Since its introduction by Vaswani et al. in 2017, the transformer architecture has been widely adopted to develop LLMs such as GPT [13]. The key innovations of the transformer architecture are two-fold. Firstly, it processes sequences non-sequentially, so the model does not ‘forget’ tokens far back in a sequence and next-word prediction considers the whole context, not just the last few words. Secondly, it uses self-attention, which weighs each token differently depending on the context; without this, the model could base its predictions on irrelevant information [13]. This ultimately means that LLMs can generate novel sequences never previously observed by the model [1,14]. As LLMs are trained on a vast data set, ChatGPT can perform a wide array of tasks without the need for any additional specific training. It is able to generate computer code and analyse data, translate between languages, write discharge summaries and answer examination questions [8,15–18]. This has led to ChatGPT being tested on several medical examinations since its release to the public [1,19–26]. Most notably, ChatGPT-3.5 was shown to operate at or near pass level in the well-known United States Medical Licensing Examination (USMLE) [1]. GPT-4 was then released on 14th March 2023, reportedly able to solve more difficult problems with greater accuracy due to broader general knowledge and improved problem-solving ability [27]. Indeed, with the upgraded GPT-4, ChatGPT’s USMLE Step 1 score increased from 64.5% to 88% [1,28].
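The self-attention operation described above can be sketched in a few lines. The function below is a minimal, illustrative single-head scaled dot-product attention in NumPy, following the formulation of Vaswani et al.; it is a teaching sketch, not a reproduction of GPT's actual implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each row of Q attends over every row of K, so a token's output
    # depends on the whole sequence, however far back a token sits.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax: the context-dependent weights per token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted mixture of the value vectors
    return weights @ V, weights
```

For a sequence of 3 tokens with 4-dimensional embeddings, the returned weight matrix is 3×3, each row summing to 1: the learned "relevance" of every token to every other token.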
Following its success on the USMLE, ChatGPT-4 has now passed numerous national undergraduate medical licensing examinations, including those of Australia, China, Iran, Japan, Korea, Peru, Saudi Arabia and the UK [19–26]. ChatGPT’s performance has also been tested on several postgraduate medical examinations.
We wanted to further evaluate the ability of ChatGPT-3.5 and -4 to undertake postgraduate medical exams without any additional training. We conducted a systematic literature review (SLR) of ChatGPT’s performance in UK postgraduate medical assessments to analyse the level of knowledge it is able to accurately report. To consolidate this review, we tested ChatGPT-3.5 and -4 on Part 1 of the Membership of the Royal College of Physicians (MRCP Part 1) examination and included these results in the analysis. MRCP Part 1 is a postgraduate UK medical qualification accessible to doctors with a minimum of 12 months’ experience in medical employment, which tests candidates using multiple choice questions [29]. MRCP Part 1 was selected firstly because it has been demonstrated to be a reliable examination [30], and secondly because it is an internationally recognised postgraduate medical assessment, attempted by around 30% of UK medical graduates, which forms a critical part of career progression for aspiring physicians in the UK [31].
Methods
Systematic Literature Review (SLR)
This SLR was undertaken in accordance with the Cochrane Handbook and reported as per the Preferred Reporting Items for Systematic Review and Meta-Analysis [32,33].
The review question was: What is the performance of ChatGPT on UK postgraduate medical examinations?
Databases were searched for the performance of ChatGPT on UK postgraduate medical examinations. Outcomes extracted from each paper included: the iteration of ChatGPT used, pass/fail rates, percentage score on the examination and factors influencing exam outcome.
Search strategy, databases and study selection
To ensure comprehensive coverage, indexing terms (MeSH, applicable to Medline and Cochrane, and Emtree headings on Embase) in addition to keyword searching were used. The full search strategy is available in the supplementary material.
Medline, Embase and Cochrane databases were searched for articles discussing the performance of ChatGPT-3.5 and -4 on UK postgraduate medical examinations from their conception until 24th January 2024. The search was restricted to English-language articles. Eligible articles included: observational studies, qualitative studies, and randomised control trials.
Full-length articles were uploaded onto Rayyan (www.rayyan.ai) with duplicates removed. Articles that met the inclusion criteria were examined by one author at abstract and full paper stage, with a 20% validity screening. Information was extracted on exam performance including percentage scores and pass/fail rates as well as factors that influenced exam outcome.
Assessment of performance of ChatGPT-4 in MRCP Part 1
The MRCP UK Part 1 sample questions, accessible via The Royal College of Physicians’ website, were used [34]. These consisted of 197 questions, each with five multiple choice options from A to E. The questions, along with their options, were compiled into a single text file, which was inserted as individual questions into ChatGPT-3.5 and as a single text into ChatGPT-4 to mimic an examination paper. Answers given by ChatGPT were recorded and marked against the correct answers provided by the MRCP UK mark scheme. The performance of ChatGPT-3.5 and -4 was recorded as a total score out of 197 and as a percentage. The examination paper was entered into ChatGPT-3.5 and -4 four times and the results recorded. Four repeats were used because a study evaluating ChatGPT response consistency found no statistically significant difference between ChatGPT-3.5 or -4 performance following three rounds of questions [35]. The Shapiro-Wilk and Levene’s tests were applied to the data, with all P values >0.5, suggesting the data were normally distributed with equal variance. An independent samples t-test was used to test the significance of the difference between the historical pass mark of MRCP Part 1 and the average scores of ChatGPT-3.5 and -4. MRCP Part 1 questions were categorised into factual recall or complex reasoning question types, and the number of each question type that ChatGPT got incorrect was counted. A two-proportion Z test was used to assess whether the number of each question category ChatGPT got incorrect differed significantly from the number of these questions within the MRCP Part 1 sample questions.
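As an illustration, the t-test procedure above can be reproduced from the per-attempt scores and the summary statistics of the eight historical pass marks (mean 62.0%, SD 1.76). This is a sketch in SciPy, not the authors' original analysis script:

```python
import statistics
from scipy.stats import shapiro, levene, ttest_ind_from_stats

# Percentage scores across the four attempts (raw scores out of 197)
gpt35 = [s / 197 * 100 for s in (126, 136, 128, 133)]
gpt4 = [s / 197 * 100 for s in (164, 168, 169, 167)]

# Normality (Shapiro-Wilk) and equal-variance (Levene) checks
sw35, sw4 = shapiro(gpt35), shapiro(gpt4)
lev = levene(gpt35, gpt4)

def vs_pass_mark(scores):
    # Independent-samples t-test against the historical pass marks,
    # known only as summary statistics (mean 62.0%, SD 1.76, n = 8)
    return ttest_ind_from_stats(
        mean1=statistics.mean(scores), std1=statistics.stdev(scores),
        nobs1=len(scores), mean2=62.0, std2=1.76, nobs2=8,
    )

t35, p35 = vs_pass_mark(gpt35)
t4, p4 = vs_pass_mark(gpt4)
```

Run on the reported scores, both tests return P values below the 0.05 significance threshold, consistent with the Results section.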
To assess whether MRCP (UK) Part 1 sample questions were in ChatGPT’s training data, or whether the models were working out answers for themselves, we applied an algorithm known as the memorisation effects Levenshtein detector (MELD) [36]. To our knowledge, this is the only algorithm for detecting whether an LLM was trained on inputted data, and it has been applied in other studies with similar methodology [37–39]. The proposed data set is split into halves, with the first half inserted into the LLM and the LLM output matched against the second half. If the LLM output and the second half of the data set share a 95% overlap or more, it is likely that the LLM was trained on that data set. MELD was implemented using the algorithm in the appendix of the paper that introduced this approach, with all 197 sample questions and answers tokenised and split into pairs [36]. The full code for the application of MELD is available on GitHub (https://github.com/calvh3/gpt_tester). GPT-4’s API was set with a temperature of 0. Following the implementation of MELD, none of GPT-4’s completions had greater than a 95% match with the MRCP (UK) Part 1 sample paper questions, indicating that it is unlikely ChatGPT-4 was trained on this data set.
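The decision rule at the heart of MELD can be illustrated with a short, self-contained sketch. The helper names below are our own illustrative choices, and the similarity metric is a plain Levenshtein ratio; the full implementation used in the study is in the linked GitHub repository:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Levenshtein ratio in [0, 1]; 1.0 means an exact match
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def likely_memorised(true_second_half: str, model_completion: str,
                     threshold: float = 0.95) -> bool:
    # MELD decision rule: a near-verbatim completion of the held-out
    # second half suggests the question was in the training data
    return similarity(model_completion, true_second_half) >= threshold
```

In use, the first half of each question is sent to the model (at temperature 0) and `likely_memorised` is applied to the completion against the withheld second half.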
Results
Systematic literature review
Initially, 1116 articles were retrieved, with 12 included (Fig 1). This gave a total of 12 postgraduate UK examinations on which ChatGPT performance was tested. The examinations encompassed the following specialties: anaesthetics (n = 2), general practice (GP; n = 1), neurology (n = 1), obstetrics and gynaecology (n = 1), ophthalmology (n = 2), orthopaedics (n = 1), surgery (n = 1) and multi-specialty (n = 1). The average stage at which a UK candidate sits these examinations is 1.92 years after graduation.
All examinations were in the format of a written exam, comprising single best answer or multiple choice questions, except one [40]. Li et al. tested ChatGPT-4 performance on Membership of the Royal College of Obstetricians and Gynaecologists (MRCOG) Part 3 virtual Objective Structured Clinical Exam (OSCE) circuit in which ChatGPT-4 outperformed two human participants with a score of 77.2%, compared with 73.7% for human participants [40].
MRCP (UK) Part 1
The average pass mark for the previous eight MRCP Part 1 examinations was 62.0% (SD 1.76, range 58.2–64.3%). ChatGPT-3.5 took the MRCP UK Part 1 sample paper four times and scored 126, 136, 128 and 133 out of 197, giving an average percentage of 66.4% (SD 2.32, range 63.4–69.0%), which is significantly higher than the historical pass mark (P = 0.0042, 95% CI [1.74, 7.05]). This score was 4.4% above the pass mark.
ChatGPT-4 took the MRCP UK Part 1 sample paper four times and scored 164, 168, 169 and 167 out of 197, giving an average percentage of 84.8% (SD 1.10, range 83.2–85.8%) which is significantly above the historical pass mark (P = 0.0001, 95% CI [20.62, 24.97]). This score was 22.8% above the pass mark. ChatGPT-4 performance was also significantly better than ChatGPT-3.5 (P = 0.0001, 95% CI [15.25, 21.54]).
ChatGPT-4 got 16 questions incorrect across all four attempts. Of these, four were categorised as factual recall and 12 as complex reasoning. This was not statistically significant compared with the proportions of factual recall and complex reasoning questions in the MRCP Part 1 sample questions (Z = 1.0263, P = 0.30302).
Combined analysis
ChatGPT-3.5 failed eight of the nine postgraduate UK medical exams it attempted, achieving a mark within or above the passing range only in MRCP Part 1. The average score for ChatGPT-3.5 was 56% (SD 0.084, range 43%–69.7%), with an average difference between the GPT-3.5 score and the pass mark of -5.0%.
ChatGPT-4 passed nine of the eleven examinations it attempted, with an average score of 76.2% (SD 1.23, range 50%–100%) and an average difference between the GPT-4 score and the pass mark of +13.56%. The specialties that ChatGPT-4 failed were orthopaedics and radiology. The specific examinations failed were the Fellowship of the Royal College of Surgeons in Orthopaedics (FRCS (Orth)) Part 1 and the Fellowship of the Royal College of Radiologists (FRCR) Part 1, which require a minimum of six years and two years of postgraduate clinical experience respectively [37,41]. Therefore, ChatGPT-4 passed all postgraduate examinations requiring 0–1 year of clinical experience.
ChatGPT-4 outperformed ChatGPT-3.5 in all examinations that were taken by both LLMs. This was reported to be significantly greater in four examinations, including MRCP Part 1 (P<0.001) [37,42,43]. The above results are outlined in Table 1.
Discussion
ChatGPT-4 performed above passing level on the majority of postgraduate UK medical exams that it was tested on, with no additional training. However, these results must be interpreted with caution.
Our results demonstrate that ChatGPT-4 can answer MRCP Part 1 questions at a level above the historical pass mark. The MELD analysis suggests that this performance reflects ChatGPT’s ability to interpret and process information rather than simply recall answers. The candidate scores for MRCP are historically variable. It has previously been shown that performance in MRCP Part 1 varies with medical school for UK applicants, and with score on the Professional and Linguistic Assessments Board (PLAB) examinations for international applicants [31,49]. Previous studies have demonstrated that the top performers in MRCP Part 1 are applicants who attended Oxford or Cambridge University, followed by those who scored ≥35 marks above the PLAB1 pass mark (PLAB1 A1 35+) [31,49]. Oxford and Cambridge students scored an average of around 5–10% above the MRCP Part 1 pass mark, with the PLAB1 A1 35+ cohort just below this [49]. Our data indicate that ChatGPT-4 would outperform all UK medical schools by a considerable margin: its average of 22.8% above the pass mark is more than double the margin achieved by those who attended Oxford University. In addition, while ChatGPT-3.5 would be outperformed by candidates from Oxford and Cambridge Universities, its score on the sample paper would exceed those of all other UK universities and of international applicants.
Despite the evidence that ChatGPT-4 can perform well in postgraduate UK medical exams, there are still concerns with using LLMs both in medical education and in clinical practice. Due to their design, LLMs can produce text that sounds correct but is not guaranteed to be accurate [50]. This has been demonstrated with ChatGPT, and while GPT-4 is an improvement on GPT-3.5, it is still prone to hallucination and fabrication [51]. ChatGPT has been shown to inaccurately report the content of genuine publications, and when asked to generate a short literature review, 18% of GPT-4 citations were fabricated [52,53]. With many medical students and doctors using ChatGPT as an adjunct for education and clinical practice, this propensity for fabrication becomes problematic. Indeed, many papers express concerns regarding factual inaccuracies and the spread of misinformation that could occur with widespread use of LLMs such as ChatGPT [54]. This propensity for misinformation is further exemplified by LLMs’ poor explanation accuracy: in the FRCR Part 2A examination, GPT-4’s explanation accuracy was 65.9%, down from its correct-answer rate of 80.5% [42]. This suggests that while ChatGPT-4 can pass many of these postgraduate examinations, it may not be able to explain the answers correctly at a passing level. Both these factors, fabrication and reduced explanation accuracy, highlight the potential limitations of LLMs, including ChatGPT, as learning tools. However, the effects of fabrication and hallucination could potentially be minimised with prompts that require LLMs to provide the rationale behind their clinical decision making [55,56]. Savage et al. propose a new LLM workflow in which LLMs provide a clinical reasoning rationale before their output, offering clinicians an interpretable means to assess whether the answer given by the model is true or false.
Incorrect model responses are often accompanied by rationales containing factual inaccuracies, rendering inaccurate answers identifiable [55]. Savage et al. suggest that having LLMs provide a clinical reasoning rationale is achievable through diagnostic Chain-of-Thought (CoT) prompting, a method whereby input prompts are altered to instruct an LLM to divide tasks into smaller reasoning steps. This more accurately reflects the step-by-step cognitive processes used by clinicians in medical practice. Despite this, only GPT-4, and not GPT-3.5, was able to imitate advanced clinical reasoning processes to arrive at an accurate diagnosis when tested with diagnostic CoT prompts, and while GPT-4 can imitate clinical reasoning thought processes, it cannot apply clinical reasoning in a way comparable to a human [55].
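A diagnostic CoT prompt of this kind can be sketched as follows. The template wording, function name and reasoning steps are illustrative assumptions of ours, not the exact prompt used by Savage et al.:

```python
def diagnostic_cot_prompt(vignette: str, options: list[str]) -> str:
    # Illustrative diagnostic chain-of-thought template: the model is
    # asked to surface its clinical reasoning before committing to an
    # answer, so a clinician can inspect the rationale for factual errors.
    steps = (
        "1. Summarise the salient clinical findings.\n"
        "2. List the differential diagnoses these findings suggest.\n"
        "3. Weigh the evidence for and against each differential.\n"
        "4. State the single best answer and the rationale behind it."
    )
    # Letter the answer options A, B, C, ... as in the exam paper
    lettered = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (f"{vignette}\n\n{lettered}\n\n"
            f"Work through the following steps before answering:\n{steps}")
```

The same question asked without the numbered steps would elicit only a bare answer letter, which gives the clinician nothing to audit.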
ChatGPT’s performance appeared to be enhanced on factual recall and diminished when more complex reasoning was required. It performed well on shorter questions and on those testing anatomy and pharmacology, but poorly on physiology and legal or ethical questions [37,38,41]. Indeed, in MRCP Part 1, ChatGPT-4 performed worst on questions asking for further investigations, answering 29.4% incorrectly. This was thought to be a consequence of limited exposure: ChatGPT can recall facts similar to those in its training data but struggles with more complex questions due to a lack of higher-order thinking and limited clinical experience [41]. In contrast, other studies found no significant difference in performance between higher and lower order questions (P = 0.816 for GPT-3.5, P = 0.427 for GPT-4) and equal proficiency in answering basic science and clinical questions [42,43]. Of the questions ChatGPT-4 got consistently incorrect on MRCP Part 1, there was no statistically significant difference between the number of incorrectly answered factual recall and complex reasoning questions (Z = 1.0263, P = 0.30302). Interestingly, ChatGPT’s performance was found to mimic that of human candidates, with worse results on questions humans found more challenging [39]. These differences in performance could be explained firstly by an undertrained model, with under-representation of data for the questions where ChatGPT performed poorly [1,43]. Indeed, one study found that ChatGPT performed significantly better on general medicine than on neuro-ophthalmology, suggesting a lack of training on more esoteric topics [39]. However, with the improved performance of GPT-4 compared with -3.5 and a more up-to-date data set, this issue will likely become less prevalent.
It could also be suggested that an insufficiency in human judgement at the initial reinforcement stages of model development, a significant liability of LLMs, may additionally be responsible for this pattern. In short, this means that AI ability becomes concomitant with human ability, and would explain why ChatGPT performed worse on questions humans found more challenging [1,39,57]. This reduction in LLM performance on more complex reasoning could be mitigated through methods similar to those used to minimise hallucination. The Tree-of-Thoughts (ToT) system constructs a tree-like structure in which each node represents a partial solution in the problem-solving process. The tree comprises coherent sequences of language called “thoughts”, each of which is evaluated by the LLM using a deliberative reasoning process similar to type 2 human reasoning [56,58]. The ToT method has been applied to LLMs undertaking various complex reasoning tasks, including Sudoku and crosswords, and the results suggest there cannot be non-hallucinatory, reliable human-like reasoning in LLMs without a strategy that employs a ToT, or similar system, as well as elements that mimic human-like working memory [56].
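The ToT idea can be illustrated with a minimal best-first search over "thoughts". In a real system the `propose` and `evaluate` callbacks would be LLM calls; here they are pluggable placeholders, and the code is a simplified sketch of the published method rather than its original implementation:

```python
import heapq
import itertools

def tree_of_thoughts(root, propose, evaluate, is_solution, max_steps=100):
    # Each node in the tree is a partial solution ("thought").
    # propose() generates candidate next thoughts; evaluate() scores a
    # thought, standing in for the LLM's deliberative self-evaluation.
    counter = itertools.count()  # tie-breaker for the priority queue
    frontier = [(-evaluate(root), next(counter), root)]
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, thought = heapq.heappop(frontier)  # most promising first
        if is_solution(thought):
            return thought
        for child in propose(thought):
            heapq.heappush(frontier, (-evaluate(child), next(counter), child))
    return None  # search budget exhausted without a solution
```

As a toy usage example, thoughts can be digit strings built toward a target answer, with `evaluate` counting matching prefix digits; the search expands the highest-scoring partial string at each step.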
The majority of the examinations discussed assessed ChatGPT’s factual knowledge and reasoning. However, many medical examinations require adequate clinical skills, patient manner and professionalism: qualities required to be a safe and effective clinician [45]. It has been suggested that part of the reason ChatGPT was unable to perform well on some examination questions was a lack of clinical experience [41]. Despite this, ChatGPT was able to outperform human candidates at a postgraduate virtual OSCE and achieved strong marks in communication, for which empathy is essential [40]. This demonstrates ChatGPT’s ability to rapidly assemble complex clinical information into a coherent response [40]. Not only was this strong performance a surprise, but ChatGPT’s success in a field where human interaction and communication are fundamental has led some to suggest that the way medical examinations are performed may need to change [43].
The strong performance of ChatGPT-4 on postgraduate UK medical exams highlights the huge potential for collaboration with AI for decision making, as well as medical education [38,44]. With the noticeable progress made from GPT-3.5 to -4 it is probable that future iterations of GPT will be even more competent [37]. Although LLMs have limitations to their utility, their ability to process complex information designed to assess high-level clinical reasoning, demonstrates the large potential AI models have as an adjunct role in medicine. It is essential for doctors to be aware of the limitations of LLMs, such as ChatGPT, so that the models can be used and applied effectively and safely into medical practice.
Limitations
It cannot be definitively excluded that ChatGPT may have been trained on the data sets that it was tested on and was therefore simply recalling facts rather than interpreting data. However, MELD was used in this paper to show that it was unlikely that ChatGPT was trained on the MRCP (UK) Part 1 sample paper. MELD was also used in three other papers to suggest ChatGPT was not trained on the data set it was tested on [37–39].
All studies included in our SLR used proxies for the examination they were attempting to test, such as question banks or sample papers, rather than an exact examination paper. While these are likely a good representation of the examination, it is difficult to be certain of how the historical pass marks translate to sample paper and question bank performance.
Although it was possible to break down the question types ChatGPT got incorrect into factual recall and complex reasoning, we were unable to assess the reasoning processes that led the LLM to make these errors. This was largely due to the style of the examination questions inputted into ChatGPT, which did not necessitate a step-by-step breakdown of responses into smaller reasoning steps, unlike CoT prompting. This means we were unable to comment on the types of reasoning errors that commonly lead ChatGPT to give incorrect responses.
Conclusions
ChatGPT-4 performed at above passing level for the majority of UK postgraduate medical examinations it was tested on, with no additional training. ChatGPT-4’s performance on MRCP(UK) Part 1 greatly outperformed the historical average performance of applicants from all medical schools.
ChatGPT-4 outperformed ChatGPT-3.5 in all examinations that both undertook, in keeping with their respective performances on previous examinations. While improved with GPT-4, ChatGPT is prone to hallucinations, fabrications and reduced explanation accuracy, which could limit its potential as a learning tool. The potential for these errors is an inherent part of LLMs; thus, despite the impressive trajectory of ChatGPT from -3.5 to -4, this may always be a limitation for the medical applications of LLMs. However, there is potential to minimise these effects through strategies that mimic human deliberative reasoning, such as Chain-of-Thought prompting and Tree-of-Thoughts search systems.
It appears that ChatGPT performed better on factual recall than higher order complex clinical scenarios, and its performance mimicked the performance of human participants, performing worse on more difficult questions. This convincing performance is likely to improve with future iterations of ChatGPT, whereby ChatGPT may become an increasingly valuable medical education resource if applied appropriately. However, it is essential that clinicians are aware of the limitations of LLMs, such as ChatGPT, if these models are to be applied safely to medical practice.
Supporting information
S1 File. Search strategies for systematic review on the capabilities of Chat-GPT in postgraduate medical assessments.
https://doi.org/10.1371/journal.pone.0307372.s001
(DOCX)
Acknowledgments
The authors would like to acknowledge OpenAI for providing access to ChatGPT-3.5 and -4 and the memorisation effects Levenshtein detector (MELD) algorithm, which were used in this study.
References
- 1. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2: e0000198. pmid:36812645
- 2. Gore JC. Artificial intelligence in medical imaging. Magnetic Resonance Imaging. 2020;68: A1–A4. pmid:31857130
- 3. Chen P-HC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nat Mater. 2019;18: 410–414. pmid:31000806
- 4. ChatGPT. [cited 20 Mar 2023]. Available: https://chat.openai.com.
- 5. Lee H. The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ. 2023. pmid:36916887
- 6. Hirosawa T, Shimizu T. Enhancing clinical reasoning with Chat Generative Pre-trained Transformer: a practical guide. Diagnosis (Berl). 2023. pmid:37779351
- 7. Ruksakulpiwat S, Kumar A, Ajibade A. Using ChatGPT in Medical Research: Current Status and Future Directions. J Multidiscip Healthc. 2023;16: 1513–1520. pmid:37274428
- 8. Rudolph J, Tan S, Tan S. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning and Teaching. 2023;6.
- 9. Gharehbaghi A, Partovi E, Babic A. Parralel Recurrent Convolutional Neural Network for Abnormal Heart Sound Classification. Caring is Sharing–Exploiting the Value in Data for Health and Innovation. IOS Press; 2023. pp. 526–530. pmid:37203741
- 10. Jia Y. Application of Recurrent Neural Network Algorithm in Intelligent Detection of Clinical Ultrasound Images of Human Lungs. Comput Intell Neurosci. 2022;2022: 9602740. pmid:35785081
- 11. Koo KC, Lee KS, Kim S, Min C, Min GR, Lee YH, et al. Long short-term memory artificial neural network model for prediction of prostate cancer survival outcomes according to initial treatment strategy: development of an online decision-making support system. World J Urol. 2020;38: 2469–2476. pmid:31925552
- 12. Yu K, Zhang M, Cui T, Hauskrecht M. Monitoring ICU Mortality Risk with A Long Short-Term Memory Recurrent Neural Network. Pac Symp Biocomput. 2020;25: 103–114. pmid:31797590
- 13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. arXiv; 2023.
- 14. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29: 1930–1940. pmid:37460753
- 15. Stokel-Walker C. AI bot ChatGPT writes smart essays — should professors worry? Nature. 2022 [cited 9 Jan 2024]. pmid:36494443
- 16. Patel SB, Lam K. ChatGPT: the future of discharge summaries? The Lancet Digital Health. 2023;5: e107–e108. pmid:36754724
- 17. Sahari Y, Al-Kadi AMT, Ali JKM. A Cross Sectional Study of ChatGPT in Translation: Magnitude of Use, Attitudes, and Uncertainties. J Psycholinguist Res. 2023;52: 2937–2954. pmid:37934302
- 18. Shue E, Liu L, Li B, Feng Z, Li X, Hu G. Empowering beginners in bioinformatics with ChatGPT. Quant Biol. 2023;11: 105–108. pmid:37378043
- 19. Kleinig O, Gao C, Bacchi S. This too shall pass: the performance of ChatGPT-3.5, ChatGPT-4 and New Bing in an Australian medical licensing examination. Med J Aust. 2023;219: 237. pmid:37528548
- 20. Fang C, Wu Y, Fu W, Ling J, Wang Y, Liu X, et al. How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language. PLOS Digit Health. 2023;2: e0000397. pmid:38039286
- 21. Ebrahimian M, Behnam B, Ghayebi N, Sobhrakhshankhah E. ChatGPT in Iranian medical licensing examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model. BMJ Health Care Inform. 2023;30: e100815. pmid:38081765
- 22. Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med Educ. 2023;9: e48002. pmid:37384388
- 23. Jang D, Yun T-R, Lee C-Y, Kwon Y-K, Kim C-E. GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors. PLOS Digit Health. 2023;2: e0000416. pmid:38100393
- 24. Torres-Zegarra BC, Rios-Garcia W, Ñaña-Cordova AM, Arteaga-Cisneros KF, Chalco XCB, Ordoñez MAB, et al. Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study. J Educ Eval Health Prof. 2023;20: 30. pmid:37981579
- 25. Aljindan FK, Al Qurashi AA, Albalawi IAS, Alanazi AMM, Aljuhani HAM, Falah Almutairi F, et al. ChatGPT Conquers the Saudi Medical Licensing Exam: Exploring the Accuracy of Artificial Intelligence in Medical Knowledge Assessment and Implications for Modern Medical Education. Cureus. 15: e45043. pmid:37829968
- 26. Lai UH, Wu KS, Hsu T-Y, Kan JKC. Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment. Front Med (Lausanne). 2023;10: 1240915. pmid:37795422
- 27. OpenAI. GPT-4. [cited 9 Jan 2024]. Available: https://openai.com/gpt-4.
- 28. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2023;0: 1–7. pmid:37839017
- 29. Royal College of Physicians of United Kingdom. Part 1 | MRCPUK. [cited 20 Mar 2023]. Available: https://www.mrcpuk.org/mrcpuk-examinations/part-1.
- 30. McManus IC, Mooney-Somers J, Dacre JE, Vale JA, MRCP(UK) Part I Examining Board, Federation of Royal Colleges of Physicians, MRCP(UK) Central Office. Reliability of the MRCP(UK) Part I Examination, 1984–2001. Med Educ. 2003;37: 609–611. pmid:12834418
- 31. McManus I, Elder AT, de Champlain A, Dacre JE, Mollon J, Chis L. Graduates of different UK medical schools show substantial differences in performance on MRCP(UK) Part 1, Part 2 and PACES examinations. BMC Med. 2008;6: 5. pmid:18275598
- 32. Cochrane Handbook for Systematic Reviews of Interventions. [cited 16 Jun 2023]. Available: https://training.cochrane.org/handbook/current.
- 33. Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372: n160. pmid:33781993
- 34. Part 1 sample questions | MRCPUK. [cited 9 Jan 2024]. Available: https://www.mrcpuk.org/mrcpuk-examinations/part-1/part-1-sample-questions.
- 35. Funk PF, Hoch CC, Knoedler S, Knoedler L, Cotofana S, Sofo G, et al. ChatGPT’s Response Consistency: A Study on Repeated Queries of Medical Examination Questions. Eur J Investig Health Psychol Educ. 2024;14: 657–668. pmid:38534904
- 36. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems. arXiv; 2023.
- 37. Ariyaratne S, Jenko N, Mark Davies A, Iyengar KP, Botchu R. Could ChatGPT Pass the UK Radiology Fellowship Examinations? Acad Radiol. 2023;29: S1076–6332(23)00661-X. pmid:38160089
- 38. Birkett L, Fowler T, Pullen S. Performance of ChatGPT on a primary FRCA multiple choice question bank. Br J Anaesth. 2023;131: e34–e35. pmid:37210281
- 39. Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol. 2023;6: bjo-2023-324091. pmid:37932006
- 40. Li SW, Kemp MW, Logan SJS, Dimri PS, Singh N, Mattar CNZ, et al. ChatGPT outscored human candidates in a virtual objective structured clinical examination in obstetrics and gynecology. Am J Obstet Gynecol. 2023;229: 172.e1-172.e12. pmid:37088277
- 41. Saad A, Iyengar KP, Kurisunkal V, Botchu R. Assessing ChatGPT’s ability to pass the FRCS orthopaedic part A exam: A critical analysis. Surgeon. 2023;21: 263–266. pmid:37517980
- 42. Ghosn Y, El Sardouk O, Jabbour Y, Jrad M, Kamareddine MH, Abbas N, et al. ChatGPT 4 Versus ChatGPT 3.5 on The Final FRCR Part A Sample Questions. Assessing Performance and Accuracy of Explanations. medRxiv; 2023. p. 2023.09.06.
- 43. Raimondi R, Tzoumas N, Salisbury T, Di Simplicio S, Romano MR, Bommireddy T, et al. Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams. Eye. 2023;37: 3530–3533. pmid:37161074
- 44. Aldridge MJ, Penders R. Artificial intelligence and anaesthesia examinations: exploring ChatGPT as a prelude to the future. Br J Anaesth. 2023;131: e36–e37. pmid:37244834
- 45. Armitage RC. Performance of Generative Pre-trained Transformer-4 (GPT-4) in Membership of the Royal College of General Practitioners (MRCGP)-style examination questions. Postgrad Med J. 2023;23: 23. pmid:38142282
- 46. Giannos P. Evaluating the limits of AI in medical specialisation: ChatGPT’s performance on the UK Neurology Specialty Certificate Examination. BMJ Neurol Open. 2023;5: e000451. pmid:37337531
- 47. Tsoutsanis P, Tsoutsanis A. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam. Comput Biol Med. 2023;168: 107794. pmid:38043471
- 48. Yiu A, Lam K. Performance of large language models at the MRCS Part A: a tool for medical education? Ann R Coll Surg Engl. 2023;1: 01. pmid:38037955
- 49. McManus IC, Wakeford R. PLAB and UK graduates’ performance on MRCP(UK) and MRCGP examinations: data linkage study. BMJ. 2014;348: g2621. pmid:24742473
- 50. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6. Available: https://www.frontiersin.org/articles/10.3389/frai.2023.1169595.
- 51. Currie GM. GPT-4 in Nuclear Medicine Education: Does It Outperform GPT-3.5? J Nucl Med Technol. 2023;51: 314–317. pmid:37852647
- 52. Emsley R. ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophrenia (Heidelb). 2023;9: 52. pmid:37598184
- 53. Walters WH, Wilder EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep. 2023;13: 14045. pmid:37679503
- 54. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023;11: 887. pmid:36981544
- 55. Savage T, Nayak A, Gallo R, Rangan E, Chen JH. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit Med. 2024;7: 20. pmid:38267608
- 56. Bellini-Leite SC. Dual Process Theory for Large Language Models: An overview of using Psychology to address hallucination and reliability issues. Adaptive Behavior. 2023; 10597123231206604.
- 57. Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions. Cureus. 2023;15: e40822. pmid:37485215
- 58. Yao S, Yu D, Zhao J, Shafran I, Griffiths TL, Cao Y, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv; 2023.