Abstract
Purpose
Large language models (LLMs) have the potential to change medical education. Whether LLMs can generate multiple-choice questions (MCQs) that are of similar quality to those created by humans is unclear. This investigation assessed the quality of MCQs generated by LLMs compared to humans.
Methods
This review was registered with PROSPERO (CRD42025608775). A systematic review and frequentist random-effects network meta-analysis (NMA) or pairwise meta-analysis was performed. Ovid MEDLINE, Ovid EMBASE, and Scopus were searched from inception to November 1, 2024. The quality of MCQs was assessed with seven pre-defined outcomes: question relevance, question clarity, question accuracy/correctness, distractor quality, item difficulty analysis, and item discrimination analysis (point biserial correlation and item discrimination index). Continuous data were transformed to a 10-point scale to facilitate statistical analysis and reported as mean differences (MD). The MERSQI and the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) NMA guidelines were used to assess risk of bias and certainty of evidence.
Results
Five LLMs were included. NMA demonstrated that ChatGPT 4 generated MCQs of similar quality to humans with regards to question relevance (MD −0.13; 95% CI: −0.44, 0.18; GRADE: VERY LOW), question clarity (MD −0.03; 95% CI: −0.15, 0.10; GRADE: VERY LOW), and distractor quality (MD −0.10; 95% CI: −0.24, 0.04; GRADE: VERY LOW); however, MCQs generated by Llama 2 performed worse than humans with regards to question clarity (MD −1.21; 95% CI: −1.60, −0.82; GRADE: VERY LOW) and distractor quality (MD −1.50; 95% CI: −2.03, −0.97; GRADE: VERY LOW). Exploratory post-hoc t-tests demonstrated that ChatGPT 3.5 performed worse than Llama 2 and ChatGPT 4 with regards to question clarity and distractor quality (p < 0.001).
Citation: Riehm L, Nanji K, Lakhani M, Pankiv E, Hasanee D, Pfeifer W (2026) The use of large language models in generating multiple choice questions for health professions education: A systematic review and network meta-analysis. PLoS One 21(1): e0340277. https://doi.org/10.1371/journal.pone.0340277
Editor: Nikhilesh Anand, The University of Texas Rio Grande Valley School of Medicine, UNITED STATES OF AMERICA
Received: September 1, 2025; Accepted: December 18, 2025; Published: January 2, 2026
Copyright: © 2026 Riehm et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Rationale
Large Language Models (LLMs) are language-based foundation models that can comprehend and generate natural language, as well as participate in human-like conversations [1]. There are a variety of LLMs that can be utilized for diverse purposes; since the release of ChatGPT in November 2022, many published studies have evaluated the performance of LLMs on medical licensing exams, such as the USMLE [2,3], and on various subspecialty examinations (including neurology, neurosurgery, and ophthalmology board exams) [4–6]. Recently, there has been growing interest in the use of LLMs in the field of medical education [1,7].
LLMs can simplify complex topics from academic materials and provide feedback with scientifically backed explanations, making them an invaluable tool for preparing revision aids and testing the foundational knowledge of medical learners [2,3,8]. LLMs also benefit medical educators by enhancing the ability to rapidly obtain and assemble information; this reduces the time required to develop educational resources and improves overall efficiency in environments where medical educators must balance limited time between education and clinical duties [9]. LLMs may be useful in creating and reviewing multiple choice questions (MCQs), as they can enhance comprehension for medical learners by providing explanations and reasoning for incorrect options [10]. In addition to their role in knowledge acquisition, there has been growing interest in the role of LLMs in generating examination material and testing learners’ knowledge acquisition. The MCQ has classically been a popular tool to test factual knowledge in a variety of fields; more recent guidelines within the field of medical education have emphasized ways the MCQ can be developed to test higher-order concepts, including knowledge integration, interpretation, and problem-solving [11].
It is unclear whether MCQs generated by LLMs are of comparable quality to those generated by humans; for example, whether they are of an appropriate level of difficulty, clearly written, able to discriminate between low- and high-performing students, or factually accurate. LLMs are plagued by issues such as hallucination, where the LLM produces content that diverges from the source material input by the user (input-conflicting hallucination), directly contradicts information it previously produced (context-conflicting hallucination), or is not consistent with factual knowledge (fact-conflicting hallucination) [12]. It is essential that MCQs generated by LLMs are compared to those written by humans to assess whether LLM-generated exam questions are appropriate for use in medical examinations. However, there are few published studies evaluating the quality of MCQs written by LLMs, and even fewer comparing the quality of LLM-generated questions to those written by humans. A direct comparison of available LLM systems to determine which is most reliable has not yet been conducted. Previous reviews on this topic did not conduct a network meta-analysis and had search strategies that are now outdated [13].
Objectives
The focus of this systematic review is to address the following question: in health professions education, what is the quality of multiple-choice questions generated by the most commonly used large language models compared to those generated by humans? Specifically, which LLMs produce MCQs of comparable quality to human-written items? Quality was evaluated based on the following outcomes: question relevance, question clarity, distractor quality, question accuracy/correctness, item difficulty analysis, and item discrimination analysis (point biserial correlation and item discrimination index).
Methods
This systematic review was reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) statement [14] (S1 Table) and was registered with PROSPERO (CRD42025608775). Ethics approval was not required for this systematic review, as all analyses were performed using study data that had previously obtained informed consent and ethics approval.
Eligibility criteria
Inclusion and exclusion criteria were created utilizing the PICOTS framework [15]. Studies were included if they involved the generation of MCQs by LLMs or humans for undergraduate and postgraduate learners from allied health care professions (including medicine, dentistry, pharmacy, nursing, occupational therapy, physiotherapy, and optometry). Studies were excluded if they involved the generation of questions for non-medical learners or if the exam questions were in a non-MCQ format. All data from published RCTs, cohort studies, case control studies, case series, commentaries/letters to the editor, and conference proceedings were included. Review articles were excluded.
Information sources and search strategy
A literature search of the Ovid MEDLINE, Ovid EMBASE, and Scopus databases was performed on November 1, 2024, for studies in all allied health care professions meeting our eligibility criteria. This search strategy was jointly developed by a multidisciplinary team comprised of clinicians and health research librarians. To identify any potentially overlooked studies, the reference lists of screened studies were searched for further eligible studies. The “Similar Articles” section of each screened study was also searched in its respective database for additional studies. Covidence (Veritas Health Innovation, Melbourne, Australia) was utilized to house exported studies for further screening. S1 File [16–18] outlines the search strategy utilized for each database and registry.
Selection process
Two reviewers (LR and ML) worked independently and in duplicate to screen the titles and abstracts, and subsequently all full texts, of studies that were identified during the initial search. Any discrepancies were discussed amongst reviewers to reach unanimous agreement. Covidence (Veritas Health Innovation, Melbourne, Australia) was used to track triage decisions.
Data collection
Three reviewers (LR, ML, and DH) underwent a calibration exercise and subsequently worked independently and in duplicate to extract data utilizing a standardized pilot-tested collection form designed in Microsoft Excel (Version 16.92). Discrepancies in data collection were resolved through a third party (KN). Any unclear information from the published manuscripts was clarified with the corresponding author.
Data items
Information extracted regarding study design included the following: author(s), year of publication, journal of publication, the medical field MCQs tested upon (including subspecialty), the LLM(s) utilized including version, the presence of control MCQs generated by humans, the number of MCQs generated by the LLM(s) and humans, whether LLM quality was evaluated by humans and the qualifications of the individuals performing the evaluation, whether the MCQs were administered to students, the outcomes of interest, and risk of bias items.
Outcome measures
To determine the quality of MCQs generated by LLMs, a literature review was performed [19,20] and seven pre-defined outcomes were identified: question relevance, question clarity, distractor quality, question accuracy/correctness, item difficulty analysis, and item discrimination analysis (point biserial correlation and item discrimination index). The item difficulty analysis assesses the proportion of examinees who respond correctly to a given item, whereas the item discrimination index reflects how well a given item distinguishes between high- and low-scoring students.
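The item difficulty and item discrimination statistics named above are standard classical test theory quantities. As an illustrative sketch only (in Python, with hypothetical data; this is not the authors' analysis code), they can be computed from a binary examinee-by-item response matrix as follows:

```python
import numpy as np

def item_difficulty(responses):
    """Difficulty index: proportion of examinees answering each item correctly.
    `responses` is an (examinees x items) 0/1 matrix."""
    return responses.mean(axis=0)

def discrimination_index(responses, frac=0.27):
    """Upper-lower discrimination index: difficulty among the top-scoring
    examinees minus difficulty among the bottom-scoring examinees
    (classical 27% split)."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    k = max(1, int(frac * len(totals)))
    low, high = responses[order[:k]], responses[order[-k:]]
    return high.mean(axis=0) - low.mean(axis=0)

def point_biserial(responses):
    """Point biserial correlation of each item with the rest-of-test score."""
    out = []
    for j in range(responses.shape[1]):
        rest = responses.sum(axis=1) - responses[:, j]
        out.append(np.corrcoef(responses[:, j], rest)[0, 1])
    return np.array(out)
```

An item answered correctly by 75% of examinees has a difficulty index of 0.75; a discrimination index near 1 indicates the item separates high from low scorers almost perfectly.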
Risk of bias
Two reviewers (LR and ML) underwent calibration and training, and subsequently worked independently and in duplicate to assess the methodological quality of included studies using the MERSQI [21]. Discrepancies were resolved through a third party (KN). The quality of each medical education study was assessed across the following domains: study design, sampling (number of institutions and response rate), type of data (subjective or objective), validity (internal structure, content, and relationships to other variables), data analysis (appropriateness and complexity), and outcomes. Studies with a MERSQI score less than 11.6, as per Cook et al., were classified as low methodologic quality [22].
Statistical analysis
The method of evidence synthesis for each outcome depended on the availability of data. When sufficient data were available, a frequentist network meta-analysis using random effects modelling was performed comparing the results across the different LLMs and human control. The network meta-analysis allowed comparison of multiple different LLMs, even if particular LLMs were not directly compared in any study. Frequentist NMAs treat the estimated heterogeneity as fixed, whereas Bayesian approaches incorporate a degree of uncertainty in the estimation of heterogeneity. Despite these differences, both are valid approaches, as a study that re-analyzed NMAs using both methods found no major differences in the results obtained [23]. A frequentist approach was chosen in this analysis given that our group had more experience with this framework. If there were insufficient data for a network meta-analysis, random effects pairwise meta-analyses were performed comparing the LLMs to a human control. If all the available data were not incorporated into the network meta-analysis or pairwise meta-analysis, then random-effect single-arm analyses were performed to calculate pooled summary effect estimates across all LLMs or human controls.
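The analyses here were run with R's “meta” and “netmeta” packages. Purely as an illustration of the random-effects pooling principle (not the authors' code), a minimal DerSimonian-Laird pooled mean difference with hypothetical inputs might look like:

```python
import math

def random_effects_pool(effects, ses):
    """Pool study effect estimates (e.g., mean differences) under a
    DerSimonian-Laird random-effects model. Returns (pooled MD, 95% CI)."""
    k = len(effects)
    w = [1 / se**2 for se in ses]                                 # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)   # fixed-effect estimate
    q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))   # Cochran's Q
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                            # between-study variance
    w_star = [1 / (se**2 + tau2) for se in ses]                   # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return pooled, (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
```

When the studies agree exactly, the between-study variance estimate collapses to zero and the model reduces to a fixed-effect inverse-variance pool.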
Data reported on a Likert scale were transformed to a 10-point scale to facilitate statistical analyses. For continuous outcomes, the summary effect estimates reported the pooled mean difference with corresponding 95% confidence interval (CI) for comparative analyses and the pooled mean for single-arm analyses. For dichotomous outcomes, relative risks were calculated and reported with their corresponding 95% CI. For outcomes reporting correlation coefficients, data were transformed to Fisher’s z scale to stabilize the variance and normalize the distribution prior to meta-analysis [24].
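The exact rescaling formula is not given in the text; a common linear mapping (a sketch under that assumption) and the standard Fisher z transform are:

```python
import math

def rescale_to_10(x, lo, hi):
    """Linearly map a Likert rating from the range [lo, hi] onto a 0-10 scale.
    (Illustrative only: the review does not specify its exact mapping.)"""
    return 10 * (x - lo) / (hi - lo)

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient: stabilizes
    the variance and approximately normalizes the sampling distribution."""
    return 0.5 * math.log((1 + r) / (1 - r))   # equivalently atanh(r)

def inverse_fisher_z(z):
    """Back-transform a pooled z value to the correlation scale."""
    return math.tanh(z)
```

Correlations are pooled on the z scale and the pooled value is back-transformed for reporting.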
Complete case analyses were performed for all analyses. In the event of missing measures of variability for continuous outcome measures, standard deviations were imputed in line with the methodology from the Cochrane Handbook of Systematic Reviews [25].
Analyses were performed in R 4.0.4 (The R Project, Auckland, New Zealand) using the “meta” and “netmeta” packages.
Confidence in cumulative evidence
The certainty of the evidence was assessed following the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) guidelines [26]. The GRADE approach involves separately grading the quality of evidence for each outcome, followed by determining an overall quality of evidence across outcomes. Two reviewers rated each domain for each comparison separately and resolved discrepancies by consensus. The certainty for each comparison and outcome was rated as ‘high’, ‘moderate’, ‘low’ or ‘very low’, based on the type of studies included, risk of bias, inconsistency, indirectness, publication bias, intransitivity, incoherence, and imprecision. Node-splitting models were used to assess for incoherence [27]. Heterogeneity was assessed through a combination of visually inspecting the forest plots and assessing the I² statistic. Publication bias was assessed by visually inspecting funnel plots and Egger’s test for comparisons informed by more than 10 studies. If there were fewer than 10 studies, publication bias was assessed by evaluating the quality of the search and the measures taken to obtain all possible evidence to inform the outcome of interest.
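As background on the I² statistic referenced above, it expresses the share of total variability attributable to between-study heterogeneity rather than chance; a minimal illustration of the standard formula (not the authors' code):

```python
def i_squared(q, k):
    """I-squared heterogeneity statistic (in percent) from Cochran's Q
    across k studies: I^2 = max(0, (Q - df) / Q) * 100, with df = k - 1."""
    df = k - 1
    if q <= df:
        return 0.0          # observed variability no greater than expected by chance
    return 100 * (q - df) / q
```

For example, Q = 2 across two studies (df = 1) gives I² = 50%, conventionally read as moderate heterogeneity.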
Subgroup analyses
Subgroup analyses were pre-specified in the protocol registration and were performed for all single-arm analyses to compare the effects of the different LLMs. If the chi-squared test for the subgroup analysis was significant, exploratory post-hoc t-tests were conducted comparing results between question sources. A correction for multiple hypothesis testing was not performed given that the post-hoc tests were strictly exploratory in nature.
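The chi-squared subgroup test compares each subgroup's pooled estimate against a weighted grand mean; a minimal sketch of that idea (not the authors' implementation; the resulting Q is referred to a chi-squared distribution with the stated degrees of freedom):

```python
def subgroup_q_test(estimates, ses):
    """Chi-squared test for subgroup differences: weighted squared deviations
    of each subgroup's pooled estimate from the inverse-variance-weighted
    grand mean. Returns (Q, degrees of freedom)."""
    w = [1 / se**2 for se in ses]
    grand = sum(wi * ei for wi, ei in zip(w, estimates)) / sum(w)
    q = sum(wi * (ei - grand)**2 for wi, ei in zip(w, estimates))
    return q, len(estimates) - 1
```

Identical subgroup estimates yield Q = 0 (no evidence of a subgroup effect); a Q exceeding the chi-squared critical value for the given degrees of freedom would trigger the exploratory post-hoc comparisons described above.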
Results
Study selection
The Preferred Reporting Items for Systematic Reviews and Meta-Analysis flow diagram for study selection is outlined in Fig 1. A total of 1,017 articles were identified through the database search; 306 were removed as duplicates, leaving 711 remaining articles that proceeded to title and abstract screening. Fifty-nine articles underwent full-text screening. Ultimately, 15 studies [28–42] met the review’s inclusion criteria.
Study characteristics
Table 1 outlines the key characteristics of the included studies. The most frequently employed LLM was ChatGPT, with 10 studies utilizing ChatGPT 3.5 and four studies utilizing ChatGPT 4. Other LLMs used to create MCQs included Bard 1.0, Llama 2, and ChatPDF.com. Two studies compared the quality of MCQs generated by different LLMs, one of which additionally compared the quality of MCQs generated by different LLMs to those created by humans. Only five (33%) of the included studies compared MCQs created by LLMs to those created by humans. Eight (53%) studies administered the generated MCQs to students. Thirteen (87%) of the studies used a panel of experts in the medical field and/or medical education to evaluate the quality of MCQs created by LLMs; nine of the panels comprised experts in the medical field, three comprised experts in medical education, and only two comprised individuals with qualifications in both the relevant medical field and medical education. Eleven (73%) of the studies generated MCQs for undergraduate/postgraduate medicine; two (13%) studies created MCQs for college-level or university-level courses; and only two (13%) studies generated MCQs for other allied health care professions (dentistry, pharmacy).
Risk of bias
Table 2 summarizes the methodological quality of the included studies. All the included studies utilized a quantitative design and methodology; the mean (SD, range) MERSQI score of these studies was 11.82 (1.44, 9.27–15) out of 18. Eight (53%) of the studies utilized a single-group cross-sectional or post-test design, and seven (47%) utilized a two-group non-randomized design. Most studies were conducted at a single institution (73%) and collected objective data (100%) that went beyond a descriptive analysis (87%). All included studies had a data analysis appropriate for the study design and type of data. The domain with the poorest scores was the validity of the evaluation instrument. Eight (53%) of the studies were of relatively low quality, with scores below 11.6.
Outcomes
A summary of the type of analysis performed for each outcome, as well as the LLMs evaluated for each outcome, can be found in Table 3.
Relevance.
Five studies assessed the relevance of MCQs on a Likert scale utilizing an expert panel. Two studies evaluated two or more interventions and were included in the NMA. The remaining three investigations were single-arm studies, and their results were solely included in the single-arm analysis. LLMs assessed included ChatGPT 3.5, ChatGPT 4 and Llama 2, with a human control arm present.
Network meta-analysis between ChatGPT 4, Llama 2, and humans demonstrated that there was no significant difference in the relevance of questions generated by ChatGPT 4 compared to humans (MD −0.13; 95% CI: −0.44, 0.18; GRADE: VERY LOW), but MCQs generated by Llama 2 may be less relevant than those created by humans (MD −0.76; 95% CI: −1.27, −0.25; GRADE: VERY LOW) (Fig 2a, S1 Fig).
(a) Forest plot of network meta-analysis for question relevancy. (b) Forest plot of single-arm analysis for question relevancy.
Single-arm analysis demonstrated a mean pooled transformed score of 8.82 (95% CI: 8.01, 9.62; GRADE: VERY LOW); subgroup analysis between ChatGPT 3.5, ChatGPT 4, Llama 2, and human-generated questions did not demonstrate a significant difference in question relevancy (chi-squared: 0.79; df = 3; p = 0.85) (Fig 2b).
Clarity.
Four studies assessed the clarity of MCQs on a Likert scale via an expert panel. Two studies evaluated two or more interventions and were included in the NMA. The remaining two investigations were single-arm studies, and their results were solely included in the single-arm analysis. LLMs assessed included ChatGPT 3.5, ChatGPT 4, and Llama 2, with a human control arm present.
NMA evaluating ChatGPT 4, Llama 2, and humans demonstrated that there was no significant difference in the clarity of questions generated by ChatGPT 4 compared to humans (MD −0.03; 95% CI: −0.15, 0.10; GRADE: VERY LOW), but MCQs generated by Llama 2 may be less clear than those created by humans (MD −1.21; 95% CI: −1.60, −0.82; GRADE: VERY LOW) (Fig 3a, S2 Fig). Subsequent single-arm analysis demonstrated a mean pooled score of 8.5 (95% CI: 7.47, 9.52; GRADE: VERY LOW). Tests for subgroup analysis between ChatGPT 3.5, ChatGPT 4, Llama 2, and human-generated questions demonstrated a significant difference in question clarity (chi-squared: 57.37; df = 3, p < 0.01) (Fig 3b). Exploratory post-hoc t-tests demonstrated that ChatGPT 3.5 had lower clarity scores than ChatGPT 4 (p < 0.001), Llama 2 (p < 0.001), and humans (p = 0.022).
(a) Forest plot of network meta-analysis for question clarity. (b) Forest plot of single-arm analysis for question clarity.
Distractor quality.
Four studies assessed the quality of MCQ distractors on a Likert scale via an expert panel. Two studies evaluated two or more interventions and were included in the NMA. The remaining two investigations were single-arm studies, and their results were solely included in the single-arm analysis. LLMs assessed included ChatGPT 3.5, ChatGPT 4, and Llama 2, with a human control arm present.
Network meta-analysis between ChatGPT 4, Llama 2, and humans demonstrated that there was no significant difference in the quality of distractors generated by ChatGPT 4 compared to humans (MD −0.10; 95% CI: −0.24, 0.04; GRADE: VERY LOW), but MCQs generated by Llama 2 may have lower quality of distractors than those created by humans (MD −1.50; 95% CI: −2.03, −0.97; GRADE: VERY LOW) (Fig 4a, S3 Fig). Subsequent single-arm analysis demonstrated a mean pooled score of 8.03 (95% CI: 6.83, 9.23; GRADE: VERY LOW). Tests for subgroup analysis between ChatGPT 3.5, ChatGPT 4, Llama 2, and human-generated questions demonstrated a significant difference in distractor quality (chi-squared: 78.57; df = 3, p < 0.01) (Fig 4b). Exploratory post-hoc t-tests demonstrated that ChatGPT 3.5 had lower distractor quality scores than ChatGPT 4 (p < 0.001), Llama 2 (p < 0.001), and humans (p = 0.008).
(a) Forest plot of network meta-analysis for distractor quality. (b) Forest plot of single-arm analysis for distractor quality.
Accuracy.
Four studies assessed the accuracy of MCQs by utilizing an expert panel to determine the percentage of correct multiple choice questions. One study compared two interventions and was evaluated in a pairwise analysis; the remaining three studies were included in single-arm analyses. LLMs assessed included ChatGPT 3.5 and Bard 1.0, with no human control arm.
Pairwise analysis demonstrated that there was no significant difference in accuracy between ChatGPT 3.5 and Bard 1.0 (RR: 0.84; 95% CI: 0.59, 1.22; GRADE: VERY LOW) (Fig 5a).
(a) Forest plot of pairwise analysis for question accuracy. (b) Forest plot of single-arm analysis for question accuracy.
Single-arm analysis demonstrated a mean pooled accuracy of 0.73 (95% CI: 0.57, 0.87; GRADE: VERY LOW); subgroup analysis between ChatGPT 3.5- and Bard-generated questions did not demonstrate a significant difference in question accuracy (chi-squared: 0.12; df = 1; p = 0.73) (Fig 5b).
Item difficulty analysis.
Seven studies performed an item difficulty analysis by assessing the difficulty index of MCQs. Three studies compared ChatGPT 3.5 to a human control and were included in the pairwise analysis. LLMs assessed included ChatGPT 3.5, ChatGPT 4, and ChatPDF.com, with a human control arm present.
Pairwise analysis between studies comparing ChatGPT 3.5 and humans demonstrated that there was no significant difference in the difficulty index (MD 5.86; 95% CI: −8.49, 20.20; GRADE: VERY LOW) (Fig 6a).
(a) Forest plot of pairwise analysis for question difficulty index. (b) Forest plot of single-arm analysis for question difficulty index.
Single-arm analysis demonstrated a mean pooled score of 67.88 (95% CI: 60.92, 74.84; GRADE: VERY LOW). Tests for subgroup analysis between ChatGPT 3.5, ChatGPT 4, ChatPDF.com, and human-generated questions demonstrated a significant difference in the difficulty index (chi-squared: 116.72; df = 3, p < 0.01) (Fig 6b). Exploratory post-hoc t-tests demonstrated that ChatPDF.com had a lower difficulty index than ChatGPT 3.5 (p < 0.001), ChatGPT 4 (p = 0.010), and humans (p < 0.001).
Item discrimination analysis – item discrimination index.
Five studies performed an item discrimination analysis of MCQs by assessing the average item discrimination index. Two studies compared ChatGPT 3.5 to human controls and were included in the pairwise meta-analysis. LLMs assessed included ChatGPT 3.5, ChatGPT 4, and ChatPDF.com, with a human control arm present.
Pairwise analysis between ChatGPT 3.5 and humans demonstrated that there was no significant difference in the average item discrimination index (MD −0.07; 95% CI: −0.34, 0.20; GRADE: VERY LOW) (Fig 7a).
(a) Forest plot of pairwise analysis for item discrimination index. (b) Forest plot of single-arm analysis for item discrimination index.
Single-arm analysis demonstrated a mean pooled score of 0.28 (95% CI: 0.21, 0.35; GRADE: VERY LOW). Tests for subgroup analysis between ChatGPT 3.5, ChatGPT 4, ChatPDF.com, and human-generated questions demonstrated a significant difference in the average item discrimination index (chi-squared: 33.53; df = 3, p < 0.01) (Fig 7b). Exploratory post-hoc t-tests demonstrated that ChatPDF.com had a higher average item discrimination index than ChatGPT 3.5 (p < 0.001) and ChatGPT 4 (p < 0.001). Additionally, questions produced by ChatGPT 4 had a lower item discrimination index than those produced by humans (p = 0.025) and ChatGPT 3.5 (p = 0.010).
Item discrimination analysis – point biserial correlation.
Four studies performed an item discrimination analysis of MCQs by assessing the point biserial correlation. Two studies compared ChatGPT 3.5 to human controls and were included in the pairwise meta-analysis. The only LLM assessed was ChatGPT 3.5, with a human control arm present.
Pairwise meta-analysis demonstrated no significant difference in scores between ChatGPT 3.5- and human-generated questions (MD: −0.05; 95% CI: −0.21, 0.11; GRADE: VERY LOW) (Fig 8a).
(a) Forest plot of pairwise analysis for point biserial correlation. (b) Forest plot of single-arm analysis for point biserial correlation.
Single-arm analysis demonstrated a mean pooled score of 0.29 (95% CI: 0.11, 0.46; GRADE: VERY LOW). There was no significant difference in the subgroup analysis (chi-squared = 0.07; p = 0.80) (Fig 8b).
Certainty of evidence
The certainty of evidence for all outcomes was rated as very low. Outcomes were most frequently downgraded due to serious concerns regarding indirectness, imprecision, or risk of bias (S2 Table).
Discussion
Utilizing LLMs to generate valid, reliable, and accurate multiple-choice questions has the potential to change the field of medical education. In this systematic review and network meta-analysis, the quality of multiple-choice questions generated by five LLMs for undergraduate and postgraduate medical examinations was evaluated based on seven pre-specified outcomes. To our knowledge, this is the first systematic review on this topic that was able to perform a statistical analysis of the data. The results demonstrate that the quality of multiple-choice questions generated by ChatGPT 4 and ChatPDF.com may be comparable to those generated by humans, while questions generated by ChatGPT 3.5 and Llama 2 may be of worse quality than those generated by humans; however, the certainty of evidence is very low across all outcomes, and significantly more research is needed.
There may be no significant difference between the quality of MCQs generated by ChatGPT 4 and humans with regards to qualitative measures of question quality (question relevance, question clarity, and distractor quality); this is of very low certainty of evidence. However, only one study evaluated the quality of ChatGPT 4 on more quantitative measures such as the difficulty index, item discrimination index, and point biserial correlation, and no studies directly compared the quality of MCQs written by ChatGPT 4 to those written by humans on these quantitative indices. To delineate whether MCQs generated by ChatGPT 4 are of similar quality to those written by humans, adequately controlled head-to-head studies that are administered to students are needed to assess quantitative measures of quality.
Multiple choice questions generated by ChatGPT 3.5 may be of worse quality than those generated by humans, and previous studies have shown that instructions written by ChatGPT 3.5 are less understandable than those written by humans [43]. In our analysis, ChatGPT 3.5 performed worse than ChatGPT 4 and Llama 2 with regards to question clarity and distractor quality. Given that NMA suggests the quality of MCQs generated by Llama 2 is worse than those created by humans with regards to question clarity and distractor quality, we hypothesize that ChatGPT 3.5 may also perform worse than humans on these indices. There was no significant difference in the difficulty index, item discrimination index, or point biserial correlation between ChatGPT 3.5 and humans; however, the trend suggests that questions written by humans are more difficult and more discriminative between high- and low-performing students than those written by ChatGPT 3.5. The literature suggests that easy MCQs have a difficulty index of 70% and greater, whereas moderately difficult MCQs have a difficulty index of less than 70% [44]; in our study, MCQs written by ChatGPT 3.5 had a difficulty index of 73.28% whereas those written by humans had a difficulty index of 68.11%. Additionally, an ideal item discrimination index and point biserial correlation for large-scale standardized tests is 0.3 [34]; ChatGPT 3.5 had an item discrimination index and point biserial correlation below 0.3, whereas humans were above the 0.3 threshold for both indices. The quality of this evidence is very low, and studies systematically comparing the quality of questions written by ChatGPT 3.5 to those written by humans are needed.
Finally, ChatPDF.com may create MCQs that are more difficult and better able to discriminate between high- and low-scoring students than ChatGPT 3.5 and ChatGPT 4. However, this is of very low certainty of evidence, and only one study evaluated ChatPDF.com, with no control data. Further investigation of this LLM is required.
It is vital to highlight that the quality of multiple-choice questions generated by LLMs can be negatively affected in a multitude of ways, including the design of the prompt provided to the LLM as well as hallucination [12,45]. In practice, hallucination can lead to the generation of inaccurate material that may appear plausible on the surface, disrupt difficulty calibration by masking what the model knows, and undermine discrimination by making questions and explanations inconsistent and unreliable. Human experts within the field of study are required to routinely fact-check the question stem and answer key generated by AI to ensure that content is accurate, coherent, unbiased, and able to accurately discriminate between students [46]. It may be beneficial for educators to have formal training in prompt design, as well as in critically analyzing output from LLMs to ensure content accuracy. Constraining the LLM to specific resources, such as course textbooks or instructor-provided resources, can help ensure the content is generated from reputable and reliable sources. Educators should consider piloting LLM-generated MCQs on a small subset of students prior to integrating questions into formal examinations. Of the LLMs evaluated in this network meta-analysis, ChatGPT 4 is potentially the best resource for educators to generate drafts of MCQs, although quality review of the generated content is required, and the performance of questions created by ChatGPT 4 on quantitative indices is unknown. Other LLMs, such as ChatGPT 3.5 or Llama 2, are less likely to produce accurate and unbiased content, given their worse performance on qualitative and quantitative measures of question quality. Finally, while our study suggests that ChatPDF.com may produce more difficult and discriminating questions, this signal requires further study and evaluation to truly delineate the quality of the questions.
Our review has several limitations. There was variability in how MCQ quality was defined and measured across studies. Only a minority of studies were comparative; the remainder relied on single-group cross-sectional or non-randomized two-group designs, usually from a single institution and with small sample sizes. These characteristics underpin the “very low” GRADE ratings. There are no published randomized controlled trials comparing the quality of MCQs written by LLMs with those written by humans. Two-thirds of the studies included in our review did not perform head-to-head comparisons of LLM-generated MCQs with those written by humans. Approximately half of the studies did not administer the MCQs to students; thus, the conclusions regarding the ability of LLMs to generate high-quality MCQs are drawn from a relatively small and contextually heterogeneous set of studies. The certainty of evidence is very low and was limited by risk of bias within the studies, inconsistency, and imprecision; this was particularly evident given the limited data available for the network meta-analysis. Although inter-rater reliability among experts may be adequate, the limited validity evidence warrants restraint in inferring meaningful distinctions from variations in Likert-scale ratings of clarity, relevance, and distractor quality. One notable limitation is that most included studies focused on undergraduate and postgraduate medicine, with relatively few conducted in other allied health care professions. The extent to which our findings generalize beyond the field of medicine remains unclear. Finally, given the rapid evolution of LLMs, future or novel models may not align with the patterns observed in this systematic review and network meta-analysis. Nevertheless, the findings provide meaningful insights and highlight the need for further high-quality research across a broader range of health care fields to better understand the role of LLM-generated MCQs in medical education.
Conclusion
To our knowledge, this is the first systematic review to apply network meta-analysis comparing all eligible studies evaluating the quality of MCQs generated by LLMs in the field of medical education. The results suggest that ChatGPT 4 may create MCQs of similar quality to those written by humans on certain qualitative outcomes. Questions generated by Llama 2 and ChatGPT 3.5 perform worse on key outcomes that pertain to question quality, whereas those generated by ChatPDF.com may perform better on key quantitative outcomes. To address the limitations of this analysis, further studies are needed that directly and systematically compare these LLMs with human-generated MCQs, administering the questions to students across varied allied health care fields to evaluate whether LLMs can be effectively used to generate examination questions in health professions education. Given the rapid evolution of LLM capabilities, our findings provide a timely contribution to the growing literature, offering a contemporary benchmark for educators and researchers to inform the integration of these tools into medical assessment design.
Supporting information
S2 Table. Summary of GRADE ratings for each outcome.
https://doi.org/10.1371/journal.pone.0340277.s005
(PDF)
References
- 1. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ. 2023;9:e48291. pmid:37261894
- 2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. pmid:36812645
- 3. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. pmid:36753318
- 4. Schubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board–Style Examination. JAMA Netw Open. 2023;6(12):e2346721–e2346721.
- 5. Ali R, Tang OY, Connolly ID, Zadnik Sullivan PL, Shin JH, Fridley JS, et al. Neurosurgery. 2023;93(6):1353–65.
- 6. Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of Generative Large Language Models on Ophthalmology Board–Style Questions. Am J Ophthalmol. 2023;254:141–9.
- 7. Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR Med Educ. 2023;9:e46885. pmid:36863937
- 8. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J-N, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3(1):141. pmid:37816837
- 9. Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Med Educ. 2024;58(11):1276–85. pmid:38639098
- 10. Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The Role of Large Language Models in Medical Education: Applications and Implications. JMIR Med Educ. 2023;9:e50945. pmid:37578830
- 11. Coughlin PA, Featherstone CR. How to Write a High Quality Multiple Choice Question (MCQ): A Guide for Clinicians. Eur J Vasc Endovasc Surg. 2017;54(5):654–8. pmid:28870436
- 12. Zhang Y, Li Y, Cui L, Cai D, Liu L, Fu T. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. https://github.com/HillZhang1999/
- 13. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354. pmid:38553693
- 14. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. pmid:33782057
- 15. McKenzie JE, Brennan SE, Ryan RE, Thomson HJ, Johnston RV, Thomas J. Defining the criteria for including studies and how they will be grouped for synthesis. Cochrane Handbook for Systematic Reviews of Interventions. Cochrane. 2022.
- 16. Campbell S, Kung J. Filter to Retrieve Studies Related to Artificial Intelligence from the OVID MEDLINE Database. Geoffrey & Robyn Sperber Health Sciences Library, University of Alberta. 2024.
- 17. Campbell S, Kung J. Filter to Retrieve Studies Related to Artificial Intelligence from the OVID EMBASE Database. Geoffrey & Robyn Sperber Health Sciences Library, University of Alberta. 2024.
- 18. Campbell S, Kung J. Filter to Retrieve Studies Related to Artificial Intelligence from the Scopus Database. Geoffrey & Robyn Sperber Health Sciences Library, University of Alberta. 2024.
- 19. Eleragi AMS, Miskeen E, Hussein K, Rezigalla AA, Adam MIE, Al-Faifi JA, et al. Evaluating the multiple-choice questions quality at the College of Medicine, University of Bisha, Saudi Arabia: a three-year experience. BMC Med Educ. 2025;25(1).
- 20. Netere AK, Hughes T, Babey AM, Hawes M, Mifsud J, Kelly JP. Evaluating the quality of multiple-choice question pilot database: A global educator-created tool for concept-based pharmacology learning. Pharmacol Res Perspect. 2024;12(5).
- 21. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298(9):1002–9. pmid:17785645
- 22. Cook DA, Hatala R, Brydges R, Zendejas B, Szostek JH, Wang AT, et al. Technology-enhanced simulation for health professions education: a systematic review and meta-analysis. JAMA. 2011;306(9):978–88.
- 23. Sadeghirad B, Foroutan F, Zoratti MJ, Busse JW, Brignardello-Petersen R, Guyatt G, et al. Theory and practice of Bayesian and frequentist frameworks for network meta-analysis. BMJ Evid Based Med. 2023;28(3):204–9. pmid:35760451
- 24. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-Analysis, Converting Among Effect Sizes. Wiley. 2009.
- 25. Higgins JP, Li T, Deeks JJ. Choosing effect measures and computing estimates of effect. Cochrane Training.
- 26. GRADE handbook. https://gdt.gradepro.org/app/handbook/handbook.html
- 27. van Valkenhoef G, Dias S, Ades AE, Welton NJ. Automated generation of node-splitting models for assessment of inconsistency in network meta-analysis. Res Synth Methods. 2016;7(1):80–93. pmid:26461181
- 28. Edwards CJ, Erstad BL. Evaluation of a Generative Language Model Tool for Writing Examination Questions. Am J Pharm Educ. 2024;88(4):100684. pmid:38479646
- 29. Ayub I, Hamann D, Hamann CR, Davis MJ. Dermatology Residency. Dermatology. 2023.
- 30. Mistry NP, Saeed H, Rafique S, Le T, Obaid H, Adams SJ. Large language models as tools to generate radiology board-style multiple-choice questions. 2024. http://creativecommons.org/licenses/by/4.0/
- 31. Hwang K, Wang K, Alomair M, Choa F-S, Chen LK. Towards Automated Multiple Choice Question Generation and Evaluation: Aligning with Bloom’s Taxonomy. Lecture Notes in Computer Science. Springer Nature Switzerland. 2024. 389–96.
- 32. Laohawetwanit T, Apornvirat S, Kantasiripitak C. ChatGPT as a teaching tool: Preparing pathology residents for board examination with AI-generated digestive system pathology tests. Am J Clin Pathol. 2024;162(5):471–9. pmid:38795049
- 33. Lotto C, Sheppard SC, Anschuetz W, Stricker D, Molinari G, Huwendiek S, et al. ChatGPT Generated Otorhinolaryngology Multiple-Choice Questions: Quality, Psychometric Properties, and Suitability for Assessments. OTO Open. 2024;8(3):70018.
- 34. Kıyak YS, Coşkun Ö, Budakoğlu Iİ, Uluoğlu C. ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. Eur J Clin Pharmacol. 2024;80(5):729–35. pmid:38353690
- 35. Ngo A, Gupta S, Perrine O, Reddy R, Ershadi S, Remick D. ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Acad Pathol. 2023;11(1):100099. pmid:38162414
- 36. Ho B, Id HC, Kui G, Lau K, Tin G, Wong C, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). 2023.
- 37. Ahmed WM, Azhari AA, Alfaraj A, Alhamadani A, Zhang M, Lu C-T. The Quality of AI-Generated Dental Caries Multiple Choice Questions: A Comparative Analysis of ChatGPT and Google Bard Language Models. Heliyon. 2024;10(7):e28198. pmid:38596020
- 38. Laupichler MC, Rother JF, Grunwald Kadow IC, Ahmadi S, Raupach T. Large Language Models in Medical Education: Comparing ChatGPT to Human-Generated Exam Questions. 2023. www.random.org
- 39. Rezigalla AA. Open Access BMC Medical Education. http://creativecommons.org/licenses/by/4.0/
- 40. Schneid SD, Armour C, Evans S, Brandl K. Alexa, write my exam: ChatGPT for MCQ creation. Medical Education. 2024;58(11):1373–4.
- 41. Coşkun Ö, Kıyak YS, Budakoğlu Iİ. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. 2024. https://www.tandfonline.com/action/journalInformation?journalCode=imte20
- 42. Haynes Z, Elliott B, Goodwin A, Fabyan K. ChatGPT vs subject matter expert generated pulmonary multiple-choice questions. CHEST. 2024;166(4):A3858.
- 43. Nanji K, Yu CW, Wong TY, Sivaprasad S, Steel DH, Wykoff CC, et al. Evaluation of postoperative ophthalmology patient instructions from ChatGPT and Google Search. Can J Ophthalmol. 2024;59(1):e69–71. pmid:37884271
- 44. Abdulghani H, Ahmad F, Salah Khalil M, Mohammad Abdulghani H, Ponnamperuma GG, Khalil MS. The relationship between non-functioning distractors and item difficulty of multiple choice questions: A descriptive analysis. Journal of Health Specialties. 2014.
- 45. Kıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858–65. pmid:38840505
- 46. Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing?. Crit Care. 2023;27(1):75. pmid:36841840