Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam

Recently developed chatbots based on large language models (hereafter called bots) have promising features that could facilitate medical education. Several bots are freely available, but their proficiency has been insufficiently evaluated. In this study the authors tested the current performance of six widely used bots on the multiple-choice medical licensing exam of the University of Antwerp (Belgium): ChatGPT (OpenAI), Bard (Google), New Bing (Microsoft), Claude Instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI). The primary outcome was the performance on the exam, expressed as the proportion of correct answers. Secondary analyses were done for a variety of features of the exam questions: easy versus difficult questions, grammatically positive versus negative questions, and clinical vignettes versus theoretical questions. Reasoning errors and untruthful statements (hallucinations) in the bots' answers were examined. All bots passed the exam; Bing and GPT-4 (both 76% correct answers) outperformed the other bots (62–67%, p = 0.03) and the students (61%). Bots performed worse on difficult questions (62%, p = 0.06), but outperformed students (32%) on those questions by an even larger margin (p < 0.01). Hallucinations were found in 7% of Bing's and GPT-4's answers, significantly fewer than for Bard (22%, p < 0.01) and Claude Instant (19%, p = 0.02). Although the creators of all bots try, to some extent, to prevent their bots from being used as a medical doctor, none succeeded, as none refused to answer all clinical case questions. Bing was able to detect weak or ambiguous exam questions. Bots could be used as a time-efficient tool to improve the quality of a multiple-choice exam.


Introduction
The development of AI applications announces a new era in many fields of society, including medicine and medical education. In particular, artificial intelligence (AI) chatbots based on large language models (hereafter called bots) have promising features that could facilitate education by offering simulation training, by personalizing learning experiences with individualised feedback, or by acting as decision support in clinical training situations. However, before adopting this technology in the medical curriculum, its capabilities have yet to be thoroughly tested [1,2].
Soon after the first bots became publicly available, higher medical education institutes started to report on their performance in medical exam simulations [3]. A scoping review listed their potential uses in medical teaching: automated scoring, teaching assistance, personalized learning, research assistance, quick access to information, generating case scenarios and exam questions, content creation for learning facilitation, and language translation [4].
Whereas bots seem informative and logical in many of their responses, in others they answer with obvious, sometimes dangerous, hallucinations (confident responses that nevertheless contain reasoning errors or are unjustified by the current state of the art) [5]. They will reproduce flaws in the datasets they are trained on; they may reflect or even amplify societal inequalities or biases, or generate inaccurate or fake information [6].
Mostly, bots perform near the passing mark [6][7][8][9], although they outperform students in some reports [10][11][12]. Performance is in general better on easier questions and when the exam is written in English [13,14]. Notably, their scores generally worsen as exams from more advanced stages of the medical curriculum are offered. However, bots seem to learn rapidly, and new versions do considerably better than their prototypes [15][16][17]. As bots evolve, their proficiency needs continuous monitoring and updating.
Whereas media articles state that higher education institutes already anticipate the dangers of bots in terms of possible exam fraud, bots also offer opportunities to assist in developing exams, for example by identifying ambiguous or badly formulated exam questions.
Very few comparisons between different bots have been made, and those that do exist only compare two or three bots and do not report hallucination rates [18,19].
In this study, we use the final theory exam that all medical students need to pass to obtain the degree of Medical Doctor. It is followed by an oral exam, which is not part of this study. The exam used here was administered in 2021 at the University of Antwerp, Belgium. It is similar to countrywide exams used in other countries, such as the United States Medical Licensing Examination Step 1 and Step 2 CK [20].
In this study we tested the current performance of six publicly available bots on the University of Antwerp medical licensing exam. The primary outcomes concern the performance of each bot on the exam. Secondary outcomes include performance on subsets of questions, interrater variability, the proportion of hallucinations, and the detection of possibly weak exam questions.

Ethics
This experiment has been approved by the Ethics Committee of the University of Antwerp and the Antwerp University Hospital (reference number MDF 21/03/037, amendment number 5462).

Materials
At the end of the undergraduate medical training at the University of Antwerp, medical students must pass a general medical knowledge examination before being licensed as a medical doctor. Besides an oral viva examination, this general medical knowledge examination contains 102 multiple-choice questions covering the entire range of curricular courses. In this study, the exam as it was presented to the students in their second master year (before their final year of clinical training) was used. The scoring system was adapted afterwards, so the students' scores in this paper do not reflect the actual grades given to the students. The questions were not available online, so they were not used for the training of the studied bots.

Bot selection
Six bots that are publicly available and can currently be used by teachers and students were tested. The most widely used free bots were selected: ChatGPT (OpenAI), Bard (Google), and New Bing (Microsoft, called Bing Chat at the time of writing and Microsoft Copilot at the time of publication). Claude Instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI) were added to the list because they allow for an evaluation of the difference between a free and a paid version. Even though Bing is based on the GPT-4 large language model, it also uses other sources such as Bing Search, so it is a customized version of the pure GPT-4 bot [21].

Data extraction
The exam was translated using DeepL (DeepL SE), a neural machine translation service. Clear translation errors were corrected by author SM, but the writing style and grammar were not improved, in order to mimic an everyday testing situation. Questions containing images or tables (N = 2) and local questions (N = 5) were excluded. Local questions were excluded because they concern theories, frameworks or models that have only been described in Dutch and are only applicable to Belgium and the Netherlands. Literal translation of these questions leads to nonsense questions in English.
Details on how and when the bots were used can be found in Table 1. By coincidence, the authors found that when Bard refuses to answer a medical question, prompting it with "please regenerate draft" may force it to answer the question anyhow. This was not the case for the other bots. In all cases where Bard refused to answer, this additional prompt was used.

Outcomes
The primary outcome was the performance on the exam expressed as the proportion of correct answers (score). This outcome was also measured in the same way as the students were rated on this exam (adapted score): eleven questions contained a second-best answer (an acceptable alternative to the best answer), for which a score of 0.33 was awarded when this option was chosen; twenty questions contained a fatal answer (an option dangerous for the patient), leading to a score of -1. For the calculation of the students' scores, the image, table, and local questions were excluded as well.
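As an illustration, the adapted scoring rule described above can be sketched in a few lines of Python. This is a hypothetical sketch of the rule, not the authors' actual scoring workflow; the function name and data structures are our own.

```python
def adapted_score(answers, key, second_best=(), fatal=()):
    """Score one exam attempt under the adapted rules:
    best answer +1, second-best answer +0.33, fatal answer -1,
    any other wrong answer 0. Returns a percentage.

    answers, key: dicts mapping question id -> chosen / best option.
    second_best:  collection of (question, option) pairs worth +0.33.
    fatal:        collection of (question, option) pairs scored -1.
    """
    total = 0.0
    for q, choice in answers.items():
        if choice == key[q]:
            total += 1.0           # best answer
        elif (q, choice) in second_best:
            total += 0.33          # acceptable alternative
        elif (q, choice) in fatal:
            total -= 1.0           # dangerous option
        # any other wrong answer contributes 0
    return total / len(key) * 100


# Toy example: one best answer, one second-best, one fatal answer.
key = {1: "a", 2: "b", 3: "c"}
answers = {1: "a", 2: "d", 3: "b"}
print(adapted_score(answers, key,
                    second_best={(2, "d")}, fatal={(3, "b")}))  # → 11.0 (rounded)
```

Note how the +0.33 and -1 contributions can cancel out across an exam, which is consistent with the near-identical score and adapted score reported for the bots.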
The primary outcomes were assessed in four subsets of answers. Firstly, the difficulty of the questions: thirteen questions were difficult (recorded P-value in the question bank below 0.30, meaning that fewer than 30% of the students answered the question correctly [22]), 36 were easy (recorded P-value above 0.80) and 46 were moderate (recorded P-value between 0.30 and 0.80). Secondly, the grammar of the questions: negatively formulated questions (e.g., "which statement is not correct?") versus positive statements. Five questions were negatively formulated. Thirdly, the type of question: theory (50 questions) or describing a patient (clinical vignette, 45 questions). Finally, questions with versus without fatal answers.
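The difficulty classification by recorded P-value amounts to a simple threshold rule, sketched below (the function name is ours, not from the paper):

```python
def difficulty(p_value):
    """Classify a question by its recorded P-value (the proportion of
    students who answered it correctly): below 0.30 is difficult,
    above 0.80 is easy, anything in between is moderate."""
    if p_value < 0.30:
        return "difficult"
    if p_value > 0.80:
        return "easy"
    return "moderate"


print(difficulty(0.25), difficulty(0.85), difficulty(0.55))
# → difficult easy moderate
```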
In those cases where a bot answered a question incorrectly with a fatal answer, the proportion of selected fatal answers among all wrong answers was calculated.
The primary outcome was also assessed for a virtual bot (called the Ensemble Bot); the answer of this bot was the most common value (mode) of the answers of all six bots [23]. The reasoning behind an ensemble bot is that it can improve a decision system's robustness and accuracy by combining several bots and thus reducing variance [24].
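The Ensemble Bot's decision rule is just the mode of the six answers. A minimal sketch, assuming ties are broken by the first bot to give the winning option (a detail the paper does not specify):

```python
from collections import Counter

def ensemble_answer(bot_answers):
    """Return the Ensemble Bot's answer: the most common option (mode)
    among the individual bots' answers. Counter.most_common is stable,
    so ties go to the option that appeared first in the list."""
    return Counter(bot_answers).most_common(1)[0][0]


# Six hypothetical bot answers to one multiple-choice question.
print(ensemble_answer(["a", "b", "a", "c", "a", "b"]))  # → a
```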
Three additional outcomes were assessed. Firstly, the proportion of hallucinations, as rated by the authors, among the incorrect answers of the best scoring bot. Authors VV and DM read all incorrect answers and judged whether or not they contained a hallucination. In case of discordance, author SM made a final decision. A hallucination was previously defined as content that is nonsensical or untruthful in relation to certain sources [25]. This definition is not usable for the current research, so the authors defined a hallucination as content that either contains clear reasoning errors or is untruthful in relation to the current evidence-based medical literature. To detect reasoning errors, no medical knowledge is required. For example: "the risk is about 1 in 100 (3%)" (1 in 100 equals 1%, not 3%). To detect untruthful answers, the authors had to use their own background knowledge combined with common online resources to verify the AI answers. One clear example of an untruthful answer given by several bots: "This is a commonly used mnemonic to remember the order: "NAVEL"-Nerve, Artery, Vein, Empty space (from medial to lateral)." The bots presented this as the order of the inguinal structures from medial to lateral. The mnemonic does exist, but it runs from lateral to medial. Because a multiple-choice exam was studied, the hallucinations could not be found in the answer itself but in the arguments supporting the selected answer. Bots never answer with a simple letter; they all produce written-out answers of varying length. The authors wanted to report reasoning errors and untruthful answers separately but found that the two were often both present in a bot's answer, so this outcome was suspended. Secondly, the proportion of possibly weak questions among the incorrect answers of the best scoring bot was assessed. For this outcome, all authors discussed all incorrect answers of the best scoring bot and reached unanimous consensus.
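The two-raters-plus-adjudicator procedure for labelling hallucinations can be sketched as follows. This is a schematic reading of the protocol described above; the function and variable names are hypothetical.

```python
def consensus_label(rater1, rater2, adjudicator):
    """Consensus rule for one incorrect answer: if the two primary
    raters agree, take their shared label; in case of discordance,
    the adjudicator's label decides."""
    return rater1 if rater1 == rater2 else adjudicator


def hallucination_proportion(labels):
    """Proportion of incorrect answers labelled as hallucinations."""
    return sum(labels) / len(labels)


# Toy example: three incorrect answers, each labelled True/False
# ("contains a hallucination") by raters VV, DM and adjudicator SM.
ratings = [(True, True, False), (True, False, True), (False, False, True)]
labels = [consensus_label(*r) for r in ratings]
print(hallucination_proportion(labels))  # → 0.666...
```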
Thirdly, the interrater variability was examined. Originally, the authors planned to test whether user interpretation of the answers would differ from strict interpretation of the bot's answer, as this difference was significant in a previous study [9]. This outcome was suspended because such cases occurred only with ChatGPT and Bard.

Analysis
The differences in performance among the bots/students, the differences in performance among categories of questions, and the differences in the proportion of hallucinations were tested with a one-way ANOVA and pairwise unpaired two-sample t-tests. P-values were two-tailed where applicable, and a p-value of less than 0.05 was considered statistically significant. A p-value between 0.05 and 0.10 was considered a trend. For the wrong answers on questions with a fatal answer, a chi-squared test was used to assess the difference between the bot's proportion of fatal answers and the random proportion of fatal answers (which equals 0.33). Fleiss' kappa was used to assess the overall agreement among the bots. Cohen's kappa was used to assess pairwise interrater agreement between the different bots. Raw data were collected using Excel 2023 (Microsoft). JMP Pro version 17 (JMP Statistical Discovery LLC) was used for all analyses except Fleiss' kappa, which was calculated in R version 4.31 (DescTools package).
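For readers who want to reproduce a similar analysis outside JMP and R, the main tests can be illustrated with SciPy. This is not the authors' pipeline: the per-question correctness data below are simulated, and the bot names and success rates are placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-question correctness (1 = correct, 0 = wrong) for three
# hypothetical bots over a 95-question exam.
bot_a = rng.binomial(1, 0.76, 95)
bot_b = rng.binomial(1, 0.62, 95)
bot_c = rng.binomial(1, 0.67, 95)

# One-way ANOVA across the bots' per-question scores.
f_stat, p_anova = stats.f_oneway(bot_a, bot_b, bot_c)

# Pairwise unpaired two-sample t-test (two-tailed by default).
t_stat, p_ttest = stats.ttest_ind(bot_a, bot_b)

# Chi-squared goodness-of-fit: observed fatal answers among the wrong
# answers versus the random expectation of one third.
wrong, fatal_observed = 25, 1
chi2, p_chi2 = stats.chisquare(
    [fatal_observed, wrong - fatal_observed],
    f_exp=[wrong / 3, 2 * wrong / 3])

print(p_anova, p_ttest, p_chi2)
```

Fleiss' kappa is not in SciPy; `statsmodels.stats.inter_rater.fleiss_kappa` is a commonly used Python alternative to the R DescTools implementation.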

Overall exam performance
See Table 2 for an overview of the scores of the tested bots. Bing and GPT-4 scored best, with 76% correct answers and an adapted score (the way students were rated) of 76% as well. *This is the score that was used to assess students. A second-best answer was rated as +0.33 and a fatal answer as -1.

CI: confidence interval for the score (%)
The mean score of all bots was 68%; the scores of the individual bots were not significantly different from this mean (p = 0.12). However, Bing and GPT-4 scored significantly better than Bard (p = 0.03) and Claude Instant (p = 0.03). GPT-4 had the same score as Bing but gave more wrong answers (25 versus 13). Claude+ did not score significantly better than Claude Instant. All bots gave one fatal answer (on different questions) except Bard, which did not give any fatal answers. Bing gave four second-best answers; ChatGPT, Bard and GPT-4 each gave three, Claude+ two and Claude Instant only one. For thirteen questions, Bard refused to answer. After prompting Bard up to five times with "regenerate draft", it still refused to answer four questions; seven were answered correctly and two incorrectly. The performance of the bots using the adapted score was very similar because the points added for second-best answers were smoothed out by the points lost due to fatal answers. The mean score of the 95 students was 61% (standard deviation 9); the mean adapted score for the students was 60% (standard deviation 21). The Ensemble Bot (which answers with the most common answer among the six bots) scored the same as Bing (72 correct answers, 76%).
An example of a possibly weak question concerns renal replacement therapy: "Complete. Renal function replacement therapy is indicated. . . a) in any symptomatic patient with an eGFR <15 ml/min/1.73m2. b) only in patients under 65 years of age. c) in anyone with an eGFR <6 ml/min/1.73m2 d) only when urea is elevated". Bing answered "a)". After review of the current literature, the authors judge that an eGFR below 15 is indeed a commonly used cut-off value for starting renal replacement therapy, but it is not the only reason to start dialysis. Because statement a contains "any", Bing's answer is wrong, but the authors understand why Bing gave this answer and why a student might give this answer as well. The same argument applies to answer c, which is supposed to be the correct answer. Moreover, the eGFR cut-off of six is odd. This question needs improvement.

Discussion
In this study, significant differences in the performance of publicly available AI chatbots on the Antwerp medical license exam were found. GPT-4 and Bing both scored best, but Bing turned out to be more reliable as it produced fewer wrong answers. This performance is in line with previous research [15][16][17]. An ensemble bot combining all tested bots scored equally well, so we cannot recommend its use based on the current study. The proportion of hallucinations was much lower for Bing than for Bard and Claude+/Claude Instant. The improvement of these new bots, both in scores and in the proportion of hallucinations, is impressive; it might, however, increase risk, as users will place more confidence in wrong or even dangerous answers because the bots (in general) answer more correctly. The risk of replicating biases in the data on which these models are trained remains. Other authors have already pointed out the meaning of these results: bots can pass exams, but this does not make them medical doctors, as that requires far more capacities than the reproduction of knowledge alone. The current study raises the question of whether a multiple-choice exam is a useful way to assess the competencies modern doctors need (mostly concerning human interactions) [27]. Since Bing performed equally to GPT-4 but with fewer wrong answers, it is currently not worth paying for a bot in order to test a medical exam, nor is it useful to create an ensemble bot based on the mode of all bots' answers. Ensemble bots based on more complex rules than just the mode of all answers should be studied further. We can recommend the use of Bing to detect weak questions among the wrong answers. This is a time-efficient way to improve the quality of a multiple-choice exam. In this study, the labour-intensive work of discussing and revising questions was narrowed down from all 95 included questions to the 23 questions that Bing answered incorrectly. Bing's argumentation was used to check these questions. Machine translating, inputting into Bing and recording the answers for the entire exam took about two working hours. Three questions were improved for future exams. Further research on the efficiency of this method is necessary.
The trend we found towards better bot performance on easy questions is in line with previous research [13]. However, the difference in performance between students and bots was large for difficult questions and absent for easy questions. This compelling new finding demands further research. Perhaps bots are most useful in those situations that are difficult for humans?
The lack of a significant difference in performance between positive and negative questions, and between clinical vignettes and theory questions, needs confirmation on larger datasets and on other exams. The finding on clinical vignettes has been reported before [12].
Next to the field of medical education, bots might also be useful in clinical practice [28,29]. Numerous authors in various fields have tried to pose clinical questions. The results are variable, but all authors conclude that thus far AI cannot compete with a real doctor [30][31][32][33][34]. In a study on paediatric emergencies, for example, ChatGPT/GPT-4 reliably advised calling emergency services in only 54% of the cases, gave correct first aid instructions in 45%, and incorrectly advised advanced life support techniques to parents in 13.6% of cases [35]. However, some companies are developing new AI tools that might assist clinicians. Google's medicine-specific large language model, Med-PaLM, delivered highly accurate answers to multiple-choice and long-form medical questions but fell short of clinicians' responses to those queries [36,37]. The aim of this study was not to assess this aspect, but by coincidence we noticed that in some cases bots refuse to answer because they are not medical doctors. The creators of all studied bots try, to a certain extent, to prevent their bots from being used as a medical doctor. None of the tested bots succeeded, as none refused to answer all clinical case questions. Only Claude+ and Claude Instant refused (at times) to answer the question and closed the conversation. For all other bots, users can try to persuade them to answer the question anyhow. This finding was most compelling for Bard: after entering the same question repeatedly, Bard answered it in nine out of thirteen cases.
The rise of generative AI also raises many ethical and legal issues: enormous energy consumption, the use of data sources without permission, the use of sources protected by copyright, a lack of reporting guidelines, and many more. Before AI is widely implemented in medical exams, more legislation and knowledge on these topics are necessary [38,39].
The strengths of this study mainly concern its novelty: a comparison of six different bots had not been published yet. The bots tested are available to the public, so our methodology can easily be re-used. This study, however, has several limitations as well. It concerned only one exam with a moderately sized set of questions. There was no usable definition of hallucinations, nor a validated approach to detect them, available at the time of writing. The definition we used (chatbot-generated content that either contains clear reasoning errors or is untruthful in relation to the current evidence-based medical literature) might inspire other authors, although we found that a distinction between reasoning errors and untruthful statements was not feasible. The exclusion of tables, local questions and images limits the comparison with real students. Future bots will most likely be able to process such questions as well. Finally, the exam was translated into English to make the current paper understandable for a broad audience. Further research on other languages is necessary.

Conclusion
Six generative AI chatbots passed the Antwerp multiple-choice exam necessary for obtaining a license as a medical doctor. Bing (and to a lesser extent GPT-4) outperformed all other bots and the students. Bots performed worse on difficult questions but outperformed students on those questions by an even larger margin. Bing can be used to detect weak multiple-choice questions. Creators should improve their bots' algorithms if they do not want them to be used as a tool for medical advice.

Table 1 . Overview of the tested generative chat bots.
*: Poe (Platform for Open Exploration, Quora) was used because it allows fluent testing of multiple bots at the same time. A trial subscription of one week was used. https://doi.org/10.1371/journal.pdig.0000349.t001