
Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy

  • Trang Thi Nguyen,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Validation, Writing – original draft, Writing – review & editing

    Affiliation Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam

  • Linh Nguyen,

    Roles Conceptualization, Data curation, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    nguyenlinh.hmu@gmail.com

    Affiliation School of Dentistry, Hanoi Medical University, Hanoi, Vietnam

  • Ha Thi Nguyet Do,

    Roles Investigation, Visualization, Writing – review & editing

    Affiliation School of Dentistry, Hanoi Medical University, Hanoi, Vietnam

  • Huong Thi Thu Nguyen,

    Roles Investigation, Methodology, Resources, Supervision

    Affiliation Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam

  • Son Minh Tong

    Roles Conceptualization, Methodology, Project administration, Supervision

    Affiliation Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam

Abstract

Introduction

While LLMs are used to generate medical and dental MCQs, their alignment with Bloom’s Taxonomy remains unexplored.

Materials and Methods

Five widely used LLMs were evaluated: ChatGPT-4o (OpenAI), Copilot Pro (Microsoft), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), and DeepSeek R1 (DeepSeek). Each model generated 60 MCQs (300 in total) based on content from an oral and maxillofacial anatomy textbook, spanning the five cognitive levels of Bloom’s Taxonomy. Two independent investigators assessed each item on a 5-point Likert scale for remembering, understanding, applying, analyzing, and evaluating/creating. Inter-rater reliability was measured using weighted Cohen’s kappa. Model performance and inter-model differences were analyzed using the Kruskal–Wallis test.

Results

Inter-rater reliability was moderate to strong (kappa = 0.74–0.86). Median scores for remembering, understanding, applying, and evaluating/creating were above 4 across all LLMs, while the analyzing level scored a median of 3.5 for ChatGPT-4o and DeepSeek R1. No significant difference was found between models at the remembering and understanding levels (p > 0.05). Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively). Within-model analysis showed that only Copilot Pro and Claude Sonnet 4 consistently aligned with Bloom’s cognitive levels across all categories. In contrast, ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels (p < 0.001, p < 0.001, and p = 0.001, respectively).

Conclusions

All LLMs performed well at lower cognitive levels, while Claude Sonnet 4 achieved the highest alignment at higher-order levels.

Introduction

Multiple Choice Questions (MCQs) are widely used in educational systems as a cost-effective way to assess knowledge, comprehension, and problem-solving in large cohorts [1–5], but creating high-quality, cognitively targeted items requires expertise and resources [6,7].

To ensure that MCQs assess a broad range of cognitive skills, educators frequently employ Bloom’s Taxonomy—a hierarchical model that classifies learning objectives from basic recall (Remember) to higher-order thinking skills such as application (Apply), analysis (Analyze), and evaluation/creation (Evaluate/Create) [8,9]. This taxonomy helps guide the construction of questions that target specific educational outcomes and cognitive domains [10–13].

In recent years, artificial intelligence (AI) and, more specifically, large language models (LLMs) have demonstrated remarkable capabilities in natural language generation. These models have been used to assist in various educational tasks, such as analyzing data, answering questions, grading, tutoring, and even generating test items [14–17]. With the rapid and ongoing development of LLMs, newer versions often outperform their predecessors in terms of linguistic capacities, contextual understanding, and accuracy [18,19].

While previous studies have explored AI’s role in educational applications, including automated MCQ generation [14,20,21], there is a lack of systematic evaluation of how well different LLMs generate questions that align with the cognitive levels outlined in Bloom’s Taxonomy. To date, no comparative study has assessed the ability of state-of-the-art LLMs to generate MCQs across Bloom’s levels.

This study aims to fill this gap by comparing several current-generation LLMs in their ability to generate MCQs across the spectrum of Bloom’s Taxonomy. By analyzing the cognitive depth of AI-generated questions, this research provides insights into the use of LLMs and evaluates their potential as tools for high-quality question generation in medical and dental education.

Materials and methods

Study design and duration

This cross-sectional comparative study was conducted from May 25, 2025, to June 25, 2025, to evaluate the capacity of LLMs to generate MCQs aligned with Bloom’s Taxonomy. The study was approved by the Institutional Ethical Board of the XXX University with the approval number XXX. Informed verbal consent was obtained from all participants, documented in the study records, and witnessed by a member of the research team. A schematic representation of the study workflow is presented in Fig 1.

MCQs and answers were generated using several widely used LLMs, followed by evaluation from two experienced clinicians.

LLM Selection

Five widely used LLMs were selected based on their public availability, documented widespread use, and technical relevance as of mid-2025. These included: ChatGPT-4o (OpenAI, https://chatgpt.com), released in May 2024; Copilot Pro (Microsoft, powered by GPT-4, https://copilot.microsoft.com), launched in March 2024; Claude Sonnet 4 (Anthropic, https://claude.ai/new), released in May 2025; Grok 3 (xAI, https://grok.com), released in February 2025; and DeepSeek R1 (DeepSeek, https://chat.deepseek.com), released in January 2025. Among the selected models, ChatGPT-4o and Copilot Pro required paid subscriptions, whereas Claude Sonnet 4, Grok 3, and DeepSeek R1 were freely accessible. All models were accessed through their respective web interfaces to ensure accessibility, standardization, and reproducibility of inputs and outputs.

Sample size calculation

Sample size was determined using G*Power software (version 3.1) [22], assuming an effect size of 0.1, α level of 0.05, and a statistical power of 0.95. The minimum required sample size per group was calculated to be 38. To increase analytical robustness, 60 MCQs were generated per LLM, resulting in a total of 300 items.
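As a rough cross-check of the G*Power result, the calculation can be reproduced with the noncentral F distribution, assuming the reported effect size of 0.1 refers to η² (which G*Power converts to Cohen’s f). The helper below is a sketch, not the software the authors used; exact numbers depend on the effect-size convention.

```python
from scipy.stats import f as f_dist, ncf

def anova_n_per_group(eta2, k, alpha=0.05, target_power=0.95):
    """Smallest per-group n for a one-way ANOVA with k groups,
    found by stepping n until the noncentral-F power reaches target."""
    f2 = eta2 / (1 - eta2)  # Cohen's f squared from eta-squared
    n = 2
    while True:
        total = n * k
        df1, df2 = k - 1, total - k
        crit = f_dist.ppf(1 - alpha, df1, df2)
        # Noncentrality parameter lambda = f^2 * N
        power = 1 - ncf.cdf(crit, df1, df2, f2 * total)
        if power >= target_power:
            return n
        n += 1

# Five LLM groups, eta-squared = 0.1, alpha = 0.05, power = 0.95
print(anova_n_per_group(0.1, 5))
```

With these assumptions the required per-group n lands in the high 30s, consistent with the reported minimum of 38.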

Source material and prompt design

Input content was derived from the Textbook of Oral and Maxillofacial Anatomy, published by Phenikaa University, Hanoi, Vietnam, in January 2025. Six chapters were selected based on their alignment with curricular learning objectives and clinical relevance:

  1. The skeletal anatomy of the head and neck
  2. Musculature of the head and neck
  3. Temporomandibular joint anatomy
  4. Vascular anatomy of the head and neck
  5. Neuroanatomy of the head and neck
  6. Lymphatic anatomy of the head and neck

Each LLM was prompted independently using a standardized instruction:

“Generate 10 dental board-style multiple-choice questions based on the anatomy of the head and neck as covered in the uploaded files. The number of questions should be distributed across Bloom’s Taxonomy as follows: 3 at the Remembering level, 2 at the Understanding level, 3 at the Applying level, 1 at the Analyzing level, and 1 at the Evaluating/Creating level. Each question must include four clearly written and distinct answer choices (A–D), with one correct answer and three plausible distractors. For each question, provide a brief explanation indicating why the correct answer is correct and why each incorrect option is incorrect.”

Ten MCQs were generated for each chapter, resulting in 60 MCQs per LLM. Examples of questions generated by ChatGPT-4o at each cognitive level of Bloom’s Taxonomy are presented in Table 1.
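The per-chapter batching can be sketched as follows. The prompt is abridged, and the model and chapter names are taken from the study; actual prompting was done manually through each model’s web interface, so this only enumerates the runs rather than calling any API.

```python
# Abridged version of the standardized instruction given to every model.
PROMPT = (
    "Generate 10 dental board-style multiple-choice questions based on the "
    "anatomy of the head and neck as covered in the uploaded files. "
    "Distribute them across Bloom's Taxonomy: 3 Remembering, 2 Understanding, "
    "3 Applying, 1 Analyzing, and 1 Evaluating/Creating. ..."
)

MODELS = ["ChatGPT-4o", "Copilot Pro", "Claude Sonnet 4", "Grok 3", "DeepSeek R1"]
CHAPTERS = [
    "The skeletal anatomy of the head and neck",
    "Musculature of the head and neck",
    "Temporomandibular joint anatomy",
    "Vascular anatomy of the head and neck",
    "Neuroanatomy of the head and neck",
    "Lymphatic anatomy of the head and neck",
]

def build_jobs(models, chapters):
    """One (model, chapter, prompt) run per combination: 5 x 6 = 30 runs
    of 10 MCQs each, i.e., 60 MCQs per model and 300 in total."""
    return [(m, c, PROMPT) for m in models for c in chapters]

jobs = build_jobs(MODELS, CHAPTERS)
print(len(jobs))  # 30 prompting runs
```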

Table 1. Example of generated questions and their respective responses by ChatGPT-4o.

https://doi.org/10.1371/journal.pone.0341317.t001

Question evaluation

Each question was independently evaluated by two board-certified dental practitioners with over five years of clinical experience and academic appointments as lecturers in head and neck anatomy, each having at least three years of teaching experience. The evaluation was based on the five levels of Bloom’s Taxonomy, using a 5-point Likert scale. Final scores were calculated as the average of the two individual ratings. Table 2 summarizes the scoring scheme used for each level of Bloom’s Taxonomy.

Table 2. Scoring scheme for evaluating the five Bloom’s cognitive levels using a 5-point Likert scale.

https://doi.org/10.1371/journal.pone.0341317.t002

Statistical analysis

Statistical analyses were performed using SPSS software (version 23.0; IBM, Armonk, NY). Inter-rater reliability for each of the five evaluation criteria was assessed using weighted Cohen’s kappa. The performance of each LLM was reported as the median and interquartile range (IQR). As the data were ordinal, inter- and intra-model comparisons were conducted using the Kruskal–Wallis test, followed by Bonferroni-corrected post hoc tests. Graphs were generated using Python version 3.12.8 (https://www.python.org).
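The analysis pipeline can be sketched in Python (the paper used SPSS; this is an open-source approximation). Pairwise comparisons are shown with Bonferroni-corrected Mann–Whitney U tests as a stand-in for SPSS’s Dunn procedure, and the H-based η² estimate is one common convention, assumed here rather than stated by the authors.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def compare_models(scores):
    """scores: dict mapping model name -> list of ordinal Likert ratings.
    Returns the Kruskal-Wallis omnibus p-value, an eta-squared estimate,
    and Bonferroni-corrected pairwise p-values."""
    groups = list(scores.values())
    h, p = kruskal(*groups)
    k = len(groups)
    n = sum(len(g) for g in groups)
    eta2 = (h - k + 1) / (n - k)  # H-based eta-squared estimate
    m = k * (k - 1) // 2          # number of pairwise comparisons
    pairs = {}
    for a, b in combinations(scores, 2):
        _, p_ab = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
        pairs[(a, b)] = min(1.0, p_ab * m)  # Bonferroni correction
    return p, eta2, pairs
```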

Results

Inter-rater reliability

To assess the consistency of two evaluators scoring AI-generated MCQs, we computed weighted Cohen’s kappa coefficients (κ) for each level of Bloom’s Taxonomy. Weighted Cohen’s kappa measures agreement for ordinal data, with values ranging from 0 (no agreement) to 1 (perfect agreement) [23]. The results showed moderate to strong agreement between evaluators: Remembering (κ = 0.74), Understanding (κ = 0.77), Applying (κ = 0.86), Analyzing (κ = 0.81), and Evaluating/Creating (κ = 0.78).
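A weighted kappa of this kind can be computed with scikit-learn. The ratings below are hypothetical, and linear weighting is an assumption (the paper does not state linear versus quadratic weights).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two evaluators on the 1-5 Likert scale.
# weights="linear" penalizes disagreements by their distance on the scale.
rater1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater2 = [5, 4, 3, 3, 5, 3, 4, 4, 3, 4]

kappa = cohen_kappa_score(rater1, rater2, weights="linear")
print(round(kappa, 2))
```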

Comparison among five LLMs

Table 3 summarizes the median and interquartile range (IQR) of Likert ratings for each LLM across the five levels of Bloom’s Taxonomy.

Table 3. The median and interquartile range (IQR) of the score of each LLM across different Bloom’s cognitive levels.

https://doi.org/10.1371/journal.pone.0341317.t003

Values in parentheses represent the interquartile range (IQR).

Overall, scores clustered around 4 and 5, indicating that all models generally demonstrated adequate to strong performance in generating cognitively aligned multiple-choice questions (S1 Fig).


To further visualize performance patterns across cognitive domains, a heatmap of the mean ± standard deviation (SD) Likert scores is presented in Fig 2, illustrating how each model performed across the taxonomy levels.
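A heatmap of this kind can be produced with Matplotlib (the paper reports using Python for graphs, though the exact plotting code is not given). The score matrix below is a random placeholder, not the study’s actual data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted rendering
import matplotlib.pyplot as plt
import numpy as np

models = ["ChatGPT-4o", "Copilot Pro", "Claude Sonnet 4", "Grok 3", "DeepSeek R1"]
levels = ["Remember", "Understand", "Apply", "Analyze", "Evaluate/Create"]
# Placeholder mean Likert scores; substitute the real 5x5 matrix of means.
means = np.random.default_rng(0).uniform(3.5, 5.0, size=(5, 5))

fig, ax = plt.subplots()
im = ax.imshow(means, cmap="viridis", vmin=1, vmax=5)
ax.set_xticks(range(5))
ax.set_xticklabels(levels, rotation=45, ha="right")
ax.set_yticks(range(5))
ax.set_yticklabels(models)
for i in range(5):
    for j in range(5):
        ax.annotate(f"{means[i, j]:.1f}", (j, i), ha="center", va="center")
fig.colorbar(im, ax=ax, label="Mean Likert score")
fig.tight_layout()
fig.savefig("bloom_heatmap.png", dpi=200)
```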

Fig 2. Heatmap of mean ± standard deviation (SD) Likert scores for Bloom’s cognitive levels across five LLMs.

https://doi.org/10.1371/journal.pone.0341317.g002

All models achieved mean scores above 4, indicating generally strong performance. Among them, Claude Sonnet 4 demonstrated the most consistent and highest scores across all cognitive levels.

Statistical comparisons revealed no significant differences among the models at the Remembering and Understanding levels (p > 0.05, η² = 0.00). However, notable differences emerged at higher cognitive levels. At the Applying level, Claude Sonnet 4 (rank = 61.44) significantly outperformed Grok 3 (rank = 37.53, p = 0.033) and DeepSeek R1 (rank = 35.56, p = 0.015) (η² = 0.11). Similarly, at the Analyzing level, Claude Sonnet 4 again demonstrated superior performance compared to ChatGPT-4o (p = 0.011) and DeepSeek R1 (p = 0.015) (η² = 0.47). At the Evaluating/Creating level, both Claude Sonnet 4 (rank = 22.58) and Grok 3 (rank = 21.42) significantly outperformed ChatGPT-4o (rank = 7.67; p < 0.05) (η² = 0.44) (Fig 3).

Fig 3. Distribution of Likert scores across Bloom’s cognitive levels.

https://doi.org/10.1371/journal.pone.0341317.g003

Boxplots represent the median (horizontal line), interquartile range (IQR), and mean (black dot) of Likert scores for AI-generated MCQs, categorized by LLMs. A) Remembering, B) Understanding, C) Applying, D) Analyzing, E) Evaluating/Creating. No significant differences were found at the Remembering and Understanding levels (p > 0.05). At higher levels, Claude Sonnet 4 showed superior performance at Applying and Analyzing (p < 0.05), and together with Grok 3 at Evaluating/Creating (p < 0.05). Statistical significance: *p < 0.05; Kruskal–Wallis test with Bonferroni correction.

Performance of LLMs across cognitive levels

Copilot Pro and Claude Sonnet 4 showed no significant variation in question quality across cognitive levels (p > 0.05, η² = 0.01). In contrast, ChatGPT-4o, Grok 3, and DeepSeek R1 exhibited significant differences (p < 0.01). For ChatGPT-4o, Remembering items (mean rank = 41.42) scored higher than Analyzing (rank = 12.67) and Evaluating/Creating (rank = 16.00) items (p < 0.01, η² = 0.31). Grok 3 also performed better at the Remembering level than at the Applying level (p < 0.01, η² = 0.25). For DeepSeek R1, lower-order levels (Remembering, Understanding) scored significantly higher than higher-order ones (Applying, Analyzing, Evaluating/Creating) (p ≤ 0.05, η² = 0.50) (Fig 4).

Fig 4. Distribution of Likert scores across Bloom’s cognitive levels for each LLM.

https://doi.org/10.1371/journal.pone.0341317.g004

Boxplots represent the median (horizontal line), interquartile range (IQR), and mean (black dot) of Likert scores for AI-generated MCQs, categorized by Bloom’s cognitive levels. A) ChatGPT-4o, B) Copilot Pro, C) Claude Sonnet 4, D) Grok 3, E) DeepSeek R1. Copilot Pro and Claude Sonnet 4 showed consistent performance across all levels (p > 0.05). ChatGPT-4o, Grok 3, and DeepSeek R1 demonstrated significant variation (p < 0.01), with higher quality at lower cognitive levels compared to higher-order levels. Statistical significance: *p < 0.05, **p < 0.001; Kruskal–Wallis test with Bonferroni correction.

Discussion

This study provides a detailed evaluation of five LLMs—ChatGPT-4o, Copilot Pro, Claude Sonnet 4, Grok 3, and DeepSeek R1—in generating multiple-choice questions aligned with Bloom’s Taxonomy, a widely used framework for categorizing cognitive skills in educational settings. To our knowledge, no prior study has systematically investigated the alignment of LLM-generated questions with Bloom’s cognitive taxonomy. By including a range of LLMs, this study offers a comprehensive view of their potential and applicability in real-world educational tasks.

The results revealed that while all models excelled at generating MCQs for foundational cognitive levels (Remembering and Understanding), their performance varied significantly at higher-order levels (Applying, Analyzing, Evaluating/Creating). Notably, Claude Sonnet 4 achieved higher alignment at higher levels than ChatGPT-4o, Grok 3, and DeepSeek R1, highlighting variability in LLM capabilities for tasks requiring abstraction and synthesis. Although no previous studies have directly compared LLMs’ capabilities to generate MCQs aligned with Bloom’s Taxonomy, several investigations have demonstrated that LLMs can produce dental and medical questions with high levels of accuracy, relevance, and complexity [14,20]. Previous studies comparing the question-answering performance of ChatGPT and Claude have reported mixed results. For example, one study evaluating pediatric dentistry questions found that ChatGPT-4 outperformed Claude [24]. In contrast, a separate investigation focusing on colorectal cancer-related queries showed that Claude achieved higher accuracy (82.67%) compared to ChatGPT-4 Turbo (78.44%) and Bard (Gemini) (70%) [25]. These discrepancies may be attributed to differences in the specific models evaluated, as well as the rapid pace of AI development, which has led to the release of newer models with enhanced reasoning capabilities.

When comparing scores across cognitive levels within each model, Copilot Pro and Claude Sonnet 4 demonstrated similar capabilities in generating questions across cognitive levels. In contrast, ChatGPT-4o and DeepSeek R1 performed better at lower cognitive levels than at higher ones. This observation is consistent with findings by Law et al. [21], who reported that although ChatGPT-4o was able to generate questions of comparable quality to those produced by human experts, the difficulty index of its questions was significantly lower. Similarly, research assessing LLM performance on the Japanese medical licensing exam found that accuracy rates were higher for easier questions compared to more challenging ones [26]. However, it is important to note that while Law et al. concluded that LLMs mainly generate questions at lower cognitive levels, their study did not specifically instruct the models to produce questions aligned with Bloom’s Taxonomy. In our evaluation, even at higher cognitive levels, models such as Claude, Grok 3, and Copilot Pro were able to generate questions with relatively high scores (above 4), with a narrow IQR (0–1). These findings underscore the importance of selecting LLMs based on the specific cognitive demands of educational tasks, particularly in healthcare education, where higher-order reasoning skills are critical [27]. Future research should explore advanced fine-tuning methods, curated datasets, and prompt engineering techniques, such as chain-of-thought prompting, to further address these performance gaps [28,29].

Several factors may contribute to variability in LLM performance when generating multiple-choice questions aligned with Bloom’s Taxonomy. These include differences in architecture, training data, and fine-tuning strategies, which influence effectiveness across cognitive levels from Remembering to Evaluating/Creating. LLMs rely on transformer architectures using self-attention to process complex inputs [30]. Architectural choices—such as layer depth, attention mechanisms, or Mixture of Experts (MoE)—can influence cognitive performance. Optimized attention or well-configured MoE may support higher-order tasks, while decoder-only models often favor fluency over deep reasoning [31,32]. Training data quality and diversity also matter. Pretraining on abstract or technical content may enhance reasoning and synthesis abilities [33,34]. Fine-tuning methods—such as Low-Rank Adaptation (LoRA), reinforcement learning with human feedback (RLHF), or Retrieval-Augmented Generation (RAG)—can further improve performance by aligning models with specific educational goals [35–37].

The observed performance differences have implications for educational applications, particularly in fields like healthcare, where higher-order cognitive engagement is critical for long-term knowledge retention and clinical problem-solving [27,29]. For instance, Claude Sonnet 4’s ability to generate high-quality MCQs at the higher cognitive levels suggests it can support educators in designing assessments that foster critical thinking and innovation, such as creating novel treatment plans or evaluating competing clinical approaches. Similarly, Grok 3’s strong performance at the Evaluating/Creating level indicates its potential for generating questions that require students to synthesize information and solve problems, skills essential for evidence-based practice. In contrast, ChatGPT-4o’s lower performance at Analyzing and DeepSeek R1’s moderate scores at higher levels suggest limitations in their ability to handle tasks requiring deeper reasoning or complex concepts. Automatically generated MCQs also enable medical students to reinforce their knowledge efficiently by providing virtually unlimited practice questions in a short period. In addition, these tools can tailor both the difficulty and cognitive level of the questions to match learners’ needs and stages of training. For example, undergraduate students may benefit from questions with less emphasis on case-based scenarios or complex reasoning, while those preparing for national board examinations or postgraduate assessments often require more advanced, higher-order questions [38]. In practice, educators may selectively use Claude Sonnet 4 for developing advanced assessments targeting higher-order cognition, whereas ChatGPT-4o may be more suitable for foundational knowledge testing. While LLMs can markedly shorten the time needed to generate multiple-choice questions, some level of expert review remains necessary to refine item clarity, ensure accuracy, and verify cognitive alignment.
The overall time required for MCQ development may therefore shift from question creation to quality assurance. Additionally, effective utilization of LLMs requires a basic understanding of prompt formulation to elicit high-quality outputs. Although proficiency in prompt engineering may initially represent a barrier to adoption, the increasing accessibility of LLM interfaces and guided educational tools is expected to minimize this challenge over time.

A strength of our study is that we compared both free and paid LLMs, providing a broad and balanced perspective on model performance across a rapidly evolving AI landscape. Although subscription models offer greater usage limits and faster response speeds, certain free LLMs, particularly Claude Sonnet 4 and Grok 3, demonstrated equally strong or even superior performance at advanced cognitive levels (Table 4).

Table 4. Comparative Performance of Paid and Free Large Language Models (LLMs) in Generating MCQs Aligned with Bloom’s Taxonomy.

https://doi.org/10.1371/journal.pone.0341317.t004

Furthermore, by identifying models’ strengths and limitations at each cognitive level, this study establishes a valuable baseline for future work in AI-assisted instructional design, adaptive learning systems, and multimodal assessment tools. Despite its contributions, this study has several limitations that should be acknowledged. While two independent evaluators achieved strong inter-rater reliability (κ = 0.74–0.86), the process remained subjective, with potential variation in interpreting Bloom’s levels and question quality. The evaluation was also limited to MCQs generated from oral and maxillofacial anatomy content, which may restrict the generalizability of the findings to other medical or dental disciplines. In addition, the structure and wording of the prompts used to guide LLM outputs may have influenced the quality and cognitive alignment of the generated questions, introducing potential bias. Moreover, our study focused specifically on alignment with Bloom’s Taxonomy and did not assess aspects such as clarity, clinical relevance, or suitability for dental exams; future research should address these dimensions and explore their impact on student learning outcomes. Finally, the analysis was limited to text-based MCQs, restricting its relevance to multimodal assessments. As advancements in medical imaging continue to address challenges in data generation and quality enhancement, future research exploring the potential of LLMs could offer deeper insights into their capabilities and transformative applications in this domain [39].

Conclusions

In conclusion, this study highlights significant variation among LLMs in their ability to generate MCQs aligned with Bloom’s Taxonomy, with Claude Sonnet 4 demonstrating the strongest ability to generate higher-order questions, while all evaluated LLMs performed well at lower cognitive levels. The findings offer insights for educators seeking to integrate AI into curriculum design and for developers aiming to enhance LLM capabilities for educational purposes. Continued research and model improvement may further enhance the effectiveness of LLMs in supporting critical thinking and deeper learning within healthcare education.

Supporting information

S1 Fig. Distribution of Likert ratings assigned by both evaluators across all large language models.

https://doi.org/10.1371/journal.pone.0341317.s001

(TIFF)

S1 File. Raw expert scoring data for MCQs generated by each LLM, used for all analyses reported in the manuscript.

https://doi.org/10.1371/journal.pone.0341317.s002

(XLSX)

Acknowledgments

We are grateful to our colleagues for their valuable scientific discussions and contributions throughout the course of this research.

References

  1. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: modified essay or multiple choice questions? Research paper. BMC Med Educ. 2007;7:49. pmid:18045500
  2. Ricketts C, Brice J, Coombes L. Are multiple choice tests fair to medical students with specific learning disabilities? Adv Health Sci Educ Theory Pract. 2010;15(2):265–75. pmid:19763855
  3. Riggs CD, Kang S, Rennie O. Positive impact of multiple-choice question authoring and regular quiz participation on student learning. CBE Life Sci Educ. 2020;19(2):ar16. pmid:32357094
  4. Zaidi NLB, Grob KL, Monrad SM, Kurtz JB, Tai A, Ahmed AZ. Pushing critical thinking skills with multiple-choice questions: does Bloom’s taxonomy work? Acad Med. 2018;93(6):856–9.
  5. Parekh P, Bahadoor V. The Utility of Multiple-Choice Assessment in Current Medical Education: A Critical Review. Cureus. 2024;16(5):e59778. pmid:38846235
  6. Dell KA, Wantuch GA. How-to-guide for writing multiple choice questions for the pharmacy instructor. Curr Pharm Teach Learn. 2017;9(1):137–44. pmid:29180146
  7. Shin J, Guo Q, Gierl MJ. Multiple-choice item distractor development using topic modeling approaches. Front Psychol. 2019;10:825. pmid:31133911
  8. Anderson LW, Krathwohl DR, Airasian PW, Cruikshank KA, Mayer RE, Pintrich PR, Raths J, Wittrock MC. A taxonomy for learning, teaching, and assessing: a revision of Bloom’s Taxonomy of Educational Objectives. 2nd ed. New York: Longman; 2001.
  9. Stringer JK, Santen SA, Lee E, Rawls M, Bailey J, Richards A, et al. Examining Bloom’s Taxonomy in multiple choice questions: students’ approach to questions. Med Sci Educ. 2021;31(4):1311–7. pmid:34457973
  10. Adams NE. Bloom’s taxonomy of cognitive learning objectives. J Med Libr Assoc. 2015;103(3):152–3. pmid:26213509
  11. Larsen TM, Endo BH, Yee AT, Do T, Lo SM. Probing internal assumptions of the revised Bloom’s taxonomy. CBE Life Sci Educ. 2022;21(4):ar66.
  12. Gonzalez-Cabezas C, Anderson OS, Wright MC, Fontana M. Association between dental student-developed exam questions and learning at higher cognitive levels. J Dent Educ. 2015;79(11):1295–304. pmid:26522634
  13. Kim M-K, Patel RA, Uchizono JA, Beck L. Incorporation of Bloom’s Taxonomy into multiple-choice examination questions for a pharmacotherapeutics course. Am J Pharm Educ. 2012;76(6):114. pmid:22919090
  14. AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R. Multimodal large language models in health care: applications, challenges, and future outlook. J Med Internet Res. 2024;26:e59505.
  15. Meskó B. The impact of multimodal large language models on health care’s future. J Med Internet Res. 2023;25:e52865. pmid:37917126
  16. Zhang DW, Boey M, Tan YY, Jia AHS. Evaluating large language models for criterion-based grading from agreement to consistency. NPJ Sci Learn. 2024;9(1):79.
  17. Küchemann S, Avila KE, Dinc Y, Hortmann C, Revenga N, Ruf V, et al. On opportunities and challenges of large multimodal foundation models in education. NPJ Sci Learn. 2025;10(1):11. pmid:40000649
  18. Bojić L, Kovačević P, Čabarkapa M. Does GPT-4 surpass human performance in linguistic pragmatics? Humanit Soc Sci Commun. 2025;12(1).
  19. Alohali KI, Almusaeeb LA, Almubarak AA, Alohali AI, Muaygil RA. Reasoning-based LLMs surpass average human performance on medical social skills. Scientific Reports. 2025;15(1):36453.
  20. Kim H-S, Kim G-T. Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study. J Dent Sci. 2025;20(2):895–900. pmid:40224064
  21. Law AK, So J, Lui CT, Choi YF, Cheung KH, Kei-Ching Hung K, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ. 2025;25(1):208. pmid:39923067
  22. Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39(2):175–91. pmid:17695343
  23. Yilmaz AE, Demirhan H. Weighted kappa measures for ordinal multi-class classification performance. Applied Soft Computing. 2023;134:110020.
  24. Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: A pilot study. J Dent. 2024;144:104938. pmid:38499280
  25. Zhou S, Luo X, Chen C, Jiang H, Yang C, Ran G, et al. The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries. Int J Surg. 2024;110(10):6509–17. pmid:38935100
  26. Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using Japanese national medical examination. Int J Med Inform. 2025;193:105673. pmid:39471700
  27. Pickering JD. Cognitive engagement: a more reliable proxy for learning? Med Sci Educ. 2017;27(4):821–3.
  28. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022;35:24824–37.
  29. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354. pmid:38553693
  30. Sajjadi Mohammadabadi SM, Kara BC, Eyupoglu C, Uzay C, Tosun MS, Karakuş O. A survey of large language models: evolution, architectures, adaptation, benchmarking, applications, challenges, and societal implications. Electronics. 2025;14(18):3580.
  31. Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J Machine Learning Research. 2022;23(120):1–39.
  32. Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A. Language models are few-shot learners. arXiv preprint. 2020.
  33. Ruis L, Mozes M, Bae J, Kamalakara SR, Talupuru D, Locatelli A. Procedural knowledge in pretraining drives reasoning in large language models. arXiv preprint. 2024.
  34. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Machine Learning Res. 2020;21(140):1–67.
  35. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S. LoRA: low-rank adaptation of large language models. ICLR. 2022.
  36. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022;35:27730–44.
  37. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems. 2020;33:9459–74.
  38. de Armendi AJ, Butt AL, Konrad KM, Shukry M, Marek IIA. Factors associated with healthcare students’ cognitive levels measured by the classroom test of scientific reasoning. Graduate Medical Education Res J. 2024;6(2):2.
  39. Rouzrokh P, Khosravi B, Faghani S, Moassefi M, Shariatnia MM, Rouzrokh P, et al. A current review of generative AI in medicine: core concepts, applications, and current limitations. Curr Rev Musculoskelet Med. 2025;18(7):246–66. pmid:40304941