
Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy

  • Trang Thi Nguyen,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Validation, Writing – original draft, Writing – review & editing

    Affiliation Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam

  • Linh Nguyen,

    Roles Conceptualization, Data curation, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    nguyenlinh.hmu@gmail.com

    Affiliation School of Dentistry, Hanoi Medical University, Hanoi, Vietnam

  • Ha Thi Nguyet Do,

    Roles Investigation, Visualization, Writing – review & editing

    Affiliation School of Dentistry, Hanoi Medical University, Hanoi, Vietnam

  • Huong Thi Thu Nguyen,

    Roles Investigation, Methodology, Resources, Supervision

    Affiliation Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam

  • Son Minh Tong

    Roles Conceptualization, Methodology, Project administration, Supervision

    Affiliation Faculty of Dentistry, Phenikaa University, Hanoi, Vietnam

Abstract

Introduction

While LLMs are used to generate medical and dental MCQs, their alignment with Bloom’s Taxonomy remains unexplored.

Materials and Methods

Five widely used LLMs were evaluated: ChatGPT-4o (OpenAI), Copilot Pro (Microsoft), Claude Sonnet 4 (Anthropic), Grok 3 (xAI), and DeepSeek R1 (DeepSeek). Each model generated 60 MCQs (300 in total) based on content from an oral and maxillofacial anatomy textbook, spanning the five cognitive levels of Bloom’s Taxonomy. Two independent investigators assessed each item on a 5-point Likert scale for remembering, understanding, applying, analyzing, and evaluating/creating. Inter-rater reliability was measured using weighted Cohen’s kappa. Model performance and inter-model differences were analyzed using the Kruskal–Wallis test.

Results

Inter-rater reliability was moderate to strong (kappa = 0.74–0.86). Median scores for remembering, understanding, applying, and evaluating/creating were above 4 across all LLMs, while the analyzing level scored a median of 3.5 for ChatGPT-4o and DeepSeek R1. No significant difference was found between models at the remembering and understanding levels (p > 0.05). Claude Sonnet 4 outperformed the other models at the applying, analyzing, and evaluating/creating levels (p = 0.01, 0.003, and 0.005, respectively). Within-model analysis showed that only Copilot Pro and Claude Sonnet 4 consistently aligned with Bloom’s cognitive levels across all categories. In contrast, ChatGPT-4o, DeepSeek R1, and Grok 3 performed significantly better at the lower cognitive levels (p < 0.001, p < 0.001, and p = 0.001, respectively).

Conclusions

All LLMs performed well at lower cognitive levels, while Claude Sonnet 4 achieved the highest alignment at higher-order levels.

Introduction

Multiple Choice Questions (MCQs) are widely used in educational systems as a cost-effective way to assess knowledge, comprehension, and problem-solving in large cohorts [1–5], but creating high-quality, cognitively targeted items requires expertise and resources [6,7].

To ensure that MCQs assess a broad range of cognitive skills, educators frequently employ Bloom’s Taxonomy—a hierarchical model that classifies learning objectives from basic recall (Remember) to higher-order thinking skills such as application (Apply), analysis (Analyze), and evaluation/creation (Evaluate/Create) [8,9]. This taxonomy helps guide the construction of questions that target specific educational outcomes and cognitive domains [10–13].

In recent years, artificial intelligence (AI) and, more specifically, large language models (LLMs) have demonstrated remarkable capabilities in natural language generation. These models have been used to assist in various educational tasks, such as analyzing data, answering questions, grading, tutoring, and even generating test items [14–17]. With the rapid and ongoing development of LLMs, newer versions often outperform their predecessors in terms of linguistic capacities, contextual understanding, and accuracy [18,19].

While previous studies have explored AI’s role in educational applications, including automated MCQ generation [14,20,21], there is a lack of systematic evaluation of how well different LLMs generate questions that align with the cognitive levels outlined in Bloom’s Taxonomy. To date, no comparative study has assessed the ability of state-of-the-art LLMs to generate MCQs across Bloom’s levels.

This study aims to fill this gap by comparing several current-generation LLMs in their ability to generate MCQs across the spectrum of Bloom’s Taxonomy. By analyzing the cognitive depth of AI-generated questions, this research provides insights into the use of LLMs and evaluates their potential as tools for high-quality question generation in medical and dental education.

Materials and methods

Study design and duration

This cross-sectional comparative study was conducted from May 25, 2025, to June 25, 2025, to evaluate the capacity of LLMs to generate MCQs aligned with Bloom’s Taxonomy. The study was approved by the Institutional Ethical Board of the XXX University with the approval number XXX. Informed verbal consent was obtained from all participants, documented in the study records, and witnessed by a member of the research team. A schematic representation of the study workflow is presented in Fig 1.

MCQs and answers were generated using several widely used LLMs, followed by evaluation from two experienced clinicians.

LLM Selection

Five widely used LLMs were selected based on their public availability, documented widespread use, and technical relevance as of mid-2025. These included: ChatGPT-4o (OpenAI, https://chatgpt.com), released in May 2024; Copilot Pro (Microsoft, powered by GPT-4, https://copilot.microsoft.com), launched in March 2024; Claude Sonnet 4 (Anthropic, https://claude.ai/new), released in May 2025; Grok 3 (xAI, https://grok.com), released in February 2025; and DeepSeek R1 (DeepSeek, https://chat.deepseek.com), released in January 2025. Among the selected models, ChatGPT-4o and Copilot Pro required paid subscriptions, whereas Claude Sonnet 4, Grok 3, and DeepSeek R1 were freely accessible. All models were accessed through their respective web interfaces to ensure accessibility, standardization, and reproducibility of inputs and outputs.

Sample size calculation

Sample size was determined using G*Power software (version 3.1) [22], assuming an effect size of 0.1, α level of 0.05, and a statistical power of 0.95. The minimum required sample size per group was calculated to be 38. To increase analytical robustness, 60 MCQs were generated per LLM, resulting in a total of 300 items.
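As a rough cross-check of the G*Power result, the calculation can be reproduced with the noncentral F distribution, assuming the reported effect size of 0.1 refers to η² (which G*Power converts to Cohen’s f). The helper below is a sketch, not the software the authors used; exact numbers depend on the effect-size convention.

```python
from scipy.stats import f as f_dist, ncf

def anova_n_per_group(eta2, k, alpha=0.05, target_power=0.95):
    """Smallest per-group n for a one-way ANOVA with k groups,
    found by stepping n until the noncentral-F power reaches target."""
    f2 = eta2 / (1 - eta2)  # Cohen's f squared from eta-squared
    n = 2
    while True:
        total = n * k
        df1, df2 = k - 1, total - k
        crit = f_dist.ppf(1 - alpha, df1, df2)
        # Noncentrality parameter lambda = f^2 * N
        power = 1 - ncf.cdf(crit, df1, df2, f2 * total)
        if power >= target_power:
            return n
        n += 1

# Five LLM groups, eta-squared = 0.1, alpha = 0.05, power = 0.95
print(anova_n_per_group(0.1, 5))
```

With these assumptions the required per-group n lands in the high 30s, consistent with the reported minimum of 38.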

Source material and prompt design

Input content was derived from the Textbook of Oral and Maxillofacial Anatomy, published by Phenikaa University, Hanoi, Vietnam, in January 2025. Six chapters were selected based on their alignment with curricular learning objectives and clinical relevance:

  1. The skeletal anatomy of the head and neck
  2. Musculature of the head and neck
  3. Temporomandibular joint anatomy
  4. Vascular anatomy of the head and neck
  5. Neuroanatomy of the head and neck
  6. Lymphatic anatomy of the head and neck

Each LLM was prompted independently using a standardized instruction:

“Generate 10 dental board-style multiple-choice questions based on the anatomy of the head and neck as covered in the uploaded files. The number of questions should be distributed across Bloom’s Taxonomy as follows: 3 at the Remembering level, 2 at the Understanding level, 3 at the Applying level, 1 at the Analyzing level, and 1 at the Evaluating/Creating level. Each question must include four clearly written and distinct answer choices (A–D), with one correct answer and three plausible distractors. For each question, provide a brief explanation indicating why the correct answer is correct and why each incorrect option is incorrect.”

Ten MCQs were generated for each chapter, resulting in 60 MCQs per LLM. Examples of questions generated by ChatGPT-4o at each cognitive level of Bloom’s Taxonomy are presented in Table 1.
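The per-chapter batching can be sketched as follows. The prompt is abridged, and the model and chapter names are taken from the study; actual prompting was done manually through each model’s web interface, so this only enumerates the runs rather than calling any API.

```python
# Abridged version of the standardized instruction given to every model.
PROMPT = (
    "Generate 10 dental board-style multiple-choice questions based on the "
    "anatomy of the head and neck as covered in the uploaded files. "
    "Distribute them across Bloom's Taxonomy: 3 Remembering, 2 Understanding, "
    "3 Applying, 1 Analyzing, and 1 Evaluating/Creating. ..."
)

MODELS = ["ChatGPT-4o", "Copilot Pro", "Claude Sonnet 4", "Grok 3", "DeepSeek R1"]
CHAPTERS = [
    "The skeletal anatomy of the head and neck",
    "Musculature of the head and neck",
    "Temporomandibular joint anatomy",
    "Vascular anatomy of the head and neck",
    "Neuroanatomy of the head and neck",
    "Lymphatic anatomy of the head and neck",
]

def build_jobs(models, chapters):
    """One (model, chapter, prompt) run per combination: 5 x 6 = 30 runs
    of 10 MCQs each, i.e., 60 MCQs per model and 300 in total."""
    return [(m, c, PROMPT) for m in models for c in chapters]

jobs = build_jobs(MODELS, CHAPTERS)
print(len(jobs))  # 30 prompting runs
```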

Table 1. Example of generated questions and their respective responses by ChatGPT-4o.

https://doi.org/10.1371/journal.pone.0341317.t001

Question evaluation

Each question was independently evaluated by two board-certified dental practitioners with over five years of clinical experience and academic appointments as lecturers in head and neck anatomy, each having at least three years of teaching experience. The evaluation was based on the five levels of Bloom’s Taxonomy, using a 5-point Likert scale. Final scores were calculated as the average of the two individual ratings. Table 2 summarizes the scoring scheme used for each level of Bloom’s Taxonomy.

Table 2. Scoring scheme for evaluating the five Bloom’s cognitive levels using a 5-point Likert scale.

https://doi.org/10.1371/journal.pone.0341317.t002

Statistical analysis

Statistical analyses were performed using SPSS software (version 23.0; IBM, Armonk, NY). Inter-rater reliability for each of the five evaluation criteria was assessed using weighted Cohen’s kappa. The performance of each LLM was reported as the median and interquartile range (IQR). As the data were ordinal, inter- and intra-model comparisons were conducted using the Kruskal–Wallis test, followed by Bonferroni-corrected post hoc tests. Graphs were generated using Python version 3.12.8 (https://www.python.org).
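The analysis pipeline can be sketched in Python (the paper used SPSS; this is an open-source approximation). Pairwise comparisons are shown with Bonferroni-corrected Mann–Whitney U tests as a stand-in for SPSS’s Dunn procedure, and the H-based η² estimate is one common convention, assumed here rather than stated by the authors.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def compare_models(scores):
    """scores: dict mapping model name -> list of ordinal Likert ratings.
    Returns the Kruskal-Wallis omnibus p-value, an eta-squared estimate,
    and Bonferroni-corrected pairwise p-values."""
    groups = list(scores.values())
    h, p = kruskal(*groups)
    k = len(groups)
    n = sum(len(g) for g in groups)
    eta2 = (h - k + 1) / (n - k)  # H-based eta-squared estimate
    m = k * (k - 1) // 2          # number of pairwise comparisons
    pairs = {}
    for a, b in combinations(scores, 2):
        _, p_ab = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
        pairs[(a, b)] = min(1.0, p_ab * m)  # Bonferroni correction
    return p, eta2, pairs
```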

Results

Inter-rater reliability

To assess the consistency of two evaluators scoring AI-generated MCQs, we computed weighted Cohen’s kappa coefficients (κ) for each level of Bloom’s Taxonomy. Weighted Cohen’s kappa measures agreement for ordinal data, with values ranging from 0 (no agreement) to 1 (perfect agreement) [23]. The results showed moderate to strong agreement between evaluators: Remembering (κ = 0.74), Understanding (κ = 0.77), Applying (κ = 0.86), Analyzing (κ = 0.81), and Evaluating/Creating (κ = 0.78).
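A weighted kappa of this kind can be computed with scikit-learn. The ratings below are hypothetical, and linear weighting is an assumption (the paper does not state linear versus quadratic weights).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two evaluators on the 1-5 Likert scale.
# weights="linear" penalizes disagreements by their distance on the scale.
rater1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater2 = [5, 4, 3, 3, 5, 3, 4, 4, 3, 4]

kappa = cohen_kappa_score(rater1, rater2, weights="linear")
print(round(kappa, 2))
```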

Comparison among five LLMs

Table 3 summarizes the median and interquartile range (IQR) of Likert ratings for each LLM across the five levels of Bloom’s Taxonomy.

Table 3. The median and interquartile range (IQR) of the score of each LLM across different Bloom’s cognitive levels.

https://doi.org/10.1371/journal.pone.0341317.t003

Values in parentheses represent the interquartile range (IQR).

Overall, scores clustered around 4 and 5, indicating that all models generally demonstrated adequate to strong performance in generating cognitively aligned multiple-choice questions (S1 Fig).


To further visualize performance patterns across cognitive domains, a heatmap of the mean ± standard deviation (SD) Likert scores is presented in Fig 2, illustrating how each model performed across the taxonomy levels.
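A heatmap of this kind can be produced with Matplotlib (the paper reports using Python for graphs, though the exact plotting code is not given). The score matrix below is a random placeholder, not the study’s actual data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted rendering
import matplotlib.pyplot as plt
import numpy as np

models = ["ChatGPT-4o", "Copilot Pro", "Claude Sonnet 4", "Grok 3", "DeepSeek R1"]
levels = ["Remember", "Understand", "Apply", "Analyze", "Evaluate/Create"]
# Placeholder mean Likert scores; substitute the real 5x5 matrix of means.
means = np.random.default_rng(0).uniform(3.5, 5.0, size=(5, 5))

fig, ax = plt.subplots()
im = ax.imshow(means, cmap="viridis", vmin=1, vmax=5)
ax.set_xticks(range(5))
ax.set_xticklabels(levels, rotation=45, ha="right")
ax.set_yticks(range(5))
ax.set_yticklabels(models)
for i in range(5):
    for j in range(5):
        ax.annotate(f"{means[i, j]:.1f}", (j, i), ha="center", va="center")
fig.colorbar(im, ax=ax, label="Mean Likert score")
fig.tight_layout()
fig.savefig("bloom_heatmap.png", dpi=200)
```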

Fig 2. Heatmap of mean ± standard deviation (SD) Likert scores for Bloom’s cognitive levels across five LLMs.

https://doi.org/10.1371/journal.pone.0341317.g002

All models achieved mean scores above 4, indicating generally strong performance. Among them, Claude Sonnet 4 demonstrated the most consistent and highest scores across all cognitive levels.

Statistical comparisons revealed no significant differences among the models at the Remembering and Understanding levels (p > 0.05, η² = 0.00). However, notable differences emerged at higher cognitive levels. At the Applying level, Claude Sonnet 4 (rank = 61.44) significantly outperformed Grok 3 (rank = 37.53, p = 0.033) and DeepSeek R1 (rank = 35.56, p = 0.015) (η² = 0.11). Similarly, at the Analyzing level, Claude Sonnet 4 again demonstrated superior performance compared to ChatGPT-4o (p = 0.011) and DeepSeek R1 (p = 0.015) (η² = 0.47). At the Evaluating/Creating level, both Claude Sonnet 4 (rank = 22.58) and Grok 3 (rank = 21.42) significantly outperformed ChatGPT-4o (rank = 7.67; p < 0.05) (η² = 0.44) (Fig 3).

Fig 3. Distribution of Likert scores across Bloom’s cognitive levels.

https://doi.org/10.1371/journal.pone.0341317.g003

Boxplots represent the median (horizontal line), interquartile range (IQR), and mean (black dot) of Likert scores for AI-generated MCQs, categorized by LLMs. A) Remembering, B) Understanding, C) Applying, D) Analyzing, E) Evaluating/Creating. No significant differences were found at the Remembering and Understanding levels (p > 0.05). At higher levels, Claude Sonnet 4 showed superior performance at Applying and Analyzing (p < 0.05), and together with Grok 3 at Evaluating/Creating (p < 0.05). Statistical significance: *p < 0.05; Kruskal–Wallis test with Bonferroni correction.

Performance of LLMs across cognitive levels

Copilot Pro and Claude Sonnet 4 showed no significant variation in question quality across cognitive levels (p > 0.05, η² = 0.01). In contrast, ChatGPT-4o, Grok 3, and DeepSeek R1 exhibited significant differences (p < 0.01). For ChatGPT-4o, Remembering items (mean rank = 41.42) scored higher than Analyzing (rank = 12.67) and Evaluating/Creating (rank = 16.00) items (p < 0.01, η² = 0.31). Grok 3 also performed better at the Remembering level than at the Applying level (p < 0.01, η² = 0.25). For DeepSeek R1, lower-order levels (Remembering, Understanding) scored significantly higher than higher-order ones (Applying, Analyzing, Evaluating/Creating) (p ≤ 0.05, η² = 0.50) (Fig 4).

Fig 4. Distribution of Likert scores across Bloom’s cognitive levels for each LLM.

https://doi.org/10.1371/journal.pone.0341317.g004

Boxplots represent the median (horizontal line), interquartile range (IQR), and mean (black dot) of Likert scores for AI-generated MCQs, categorized by Bloom’s cognitive levels. A) ChatGPT-4o, B) Copilot Pro, C) Claude Sonnet 4, D) Grok 3, E) DeepSeek R1. Copilot Pro and Claude Sonnet 4 showed consistent performance across all levels (p > 0.05). ChatGPT-4o, Grok 3, and DeepSeek R1 demonstrated significant variation (p < 0.01), with higher quality at lower cognitive levels compared to higher-order levels. Statistical significance: *p < 0.05, **p < 0.001; Kruskal–Wallis test with Bonferroni correction.

Discussion

This study provides a detailed evaluation of five LLMs—ChatGPT-4o, Copilot Pro, Claude Sonnet 4, Grok 3, and DeepSeek R1—in generating multiple-choice questions aligned with Bloom’s Taxonomy, a widely used framework for categorizing cognitive skills in educational settings. To our knowledge, no prior study has systematically investigated the alignment of LLM-generated questions with Bloom’s cognitive taxonomy. By including a range of LLMs, this study offers a comprehensive view of their potential and applicability in real-world educational tasks.

The results revealed that while all models excelled at generating MCQs for foundational cognitive levels (Remembering and Understanding), their performance varied significantly at higher-order levels (Applying, Analyzing, Evaluating/Creating). Notably, Claude Sonnet 4 achieved higher alignment at higher levels than ChatGPT-4o, Grok 3, and DeepSeek R1, highlighting variability in LLM capabilities for tasks requiring abstraction and synthesis. Although no previous studies have directly compared LLMs’ capabilities to generate MCQs aligned with Bloom’s Taxonomy, several investigations have demonstrated that LLMs can produce dental and medical questions with high levels of accuracy, relevance, and complexity [14,20]. Previous studies comparing the question-answering performance of ChatGPT and Claude have reported mixed results. For example, one study evaluating pediatric dentistry questions found that ChatGPT-4 outperformed Claude [24]. In contrast, a separate investigation focusing on colorectal cancer-related queries showed that Claude achieved higher accuracy (82.67%) compared to ChatGPT-4 Turbo (78.44%) and Bard (Gemini) (70%) [25]. These discrepancies may be attributed to differences in the specific models evaluated, as well as the rapid pace of AI development, which has led to the release of newer models with enhanced reasoning capabilities.

When comparing scores across cognitive levels within each model, Copilot Pro and Claude Sonnet 4 demonstrated similar capabilities in generating questions across cognitive levels. In contrast, ChatGPT-4o and DeepSeek R1 performed better at lower cognitive levels than at higher ones. This observation is consistent with findings by Law et al. [21], who reported that although ChatGPT-4o was able to generate questions of comparable quality to those produced by human experts, the difficulty index of its questions was significantly lower. Similarly, research assessing LLM performance on the Japanese medical licensing exam found that accuracy rates were higher for easier questions compared to more challenging ones [26]. However, it is important to note that while Law et al. concluded that LLMs mainly generate questions at lower cognitive levels, their study did not specifically instruct the models to produce questions aligned with Bloom’s Taxonomy. In our evaluation, even at higher cognitive levels, models such as Claude, Grok 3, and Copilot Pro were able to generate questions with relatively high scores (above 4), with a narrow IQR (0–1). These findings underscore the importance of selecting LLMs based on the specific cognitive demands of educational tasks, particularly in healthcare education, where higher-order reasoning skills are critical [27]. Future research should explore advanced fine-tuning methods, curated datasets, and prompt engineering techniques, such as chain-of-thought prompting, to further address these performance gaps [28,29].

Several factors may contribute to variability in LLM performance when generating multiple-choice questions aligned with Bloom’s Taxonomy. These include differences in architecture, training data, and fine-tuning strategies, which influence effectiveness across cognitive levels from Remembering to Evaluating/Creating. LLMs rely on transformer architectures using self-attention to process complex inputs [30]. Architectural choices—such as layer depth, attention mechanisms, or Mixture of Experts (MoE)—can influence cognitive performance. Optimized attention or well-configured MoE may support higher-order tasks, while decoder-only models often favor fluency over deep reasoning [31,32]. Training data quality and diversity also matter. Pretraining on abstract or technical content may enhance reasoning and synthesis abilities [33,34]. Fine-tuning methods—such as Low-Rank Adaptation (LoRA), reinforcement learning with human feedback (RLHF), or Retrieval-Augmented Generation (RAG)—can further improve performance by aligning models with specific educational goals [35–37].

The observed performance differences have implications for educational applications, particularly in fields like healthcare, where higher-order cognitive engagement is critical for long-term knowledge retention and clinical problem-solving [27,29]. For instance, Claude Sonnet 4’s ability to generate high-quality MCQs at the higher cognitive levels suggests it can support educators in designing assessments that foster critical thinking and innovation, such as creating novel treatment plans or evaluating competing clinical approaches. Similarly, Grok 3’s strong performance at the Evaluating/Creating level indicates its potential for generating questions that require students to synthesize information and solve problems, skills essential for evidence-based practice. In contrast, ChatGPT-4o’s lower performance at Analyzing and DeepSeek R1’s moderate scores at higher levels suggest limitations in their ability to handle tasks requiring deeper reasoning or complex concepts. Automatically generated MCQs also enable medical students to reinforce their knowledge efficiently by providing virtually unlimited practice questions in a short period. In addition, these tools can tailor both the difficulty and cognitive level of the questions to match learners’ needs and stages of training. For example, undergraduate students may benefit from questions with less emphasis on case-based scenarios or complex reasoning, while those preparing for national board examinations or postgraduate assessments often require more advanced, higher-order questions [38]. In practice, educators may selectively use Claude Sonnet 4 for developing advanced assessments targeting higher-order cognition, whereas ChatGPT-4o may be more suitable for foundational knowledge testing. While LLMs can markedly shorten the time needed to generate multiple-choice questions, some level of expert review remains necessary to refine item clarity, ensure accuracy, and verify cognitive alignment.
The overall time required for MCQ development may therefore shift from question creation to quality assurance. Additionally, effective utilization of LLMs requires a basic understanding of prompt formulation to elicit high-quality outputs. Although proficiency in prompt engineering may initially represent a barrier to adoption, the increasing accessibility of LLM interfaces and guided educational tools is expected to minimize this challenge over time.

A strength of our study is that we compared both free and paid LLMs, providing a broad and balanced perspective on model performance across a rapidly evolving AI landscape. Although subscription models offer greater usage limits and faster response speeds, certain free LLMs, particularly Claude Sonnet 4 and Grok 3, demonstrated equally strong or even superior performance at advanced cognitive levels (Table 4).

Table 4. Comparative Performance of Paid and Free Large Language Models (LLMs) in Generating MCQs Aligned with Bloom’s Taxonomy.

https://doi.org/10.1371/journal.pone.0341317.t004

Furthermore, by identifying models’ strengths and limitations at each cognitive level, this study establishes a valuable baseline for future work in AI-assisted instructional design, adaptive learning systems, and multimodal assessment tools. Despite its contributions, this study has several limitations that should be acknowledged. While two independent evaluators achieved strong inter-rater reliability (κ = 0.74–0.86), the process remained subjective, with potential variation in interpreting Bloom’s levels and question quality. The evaluation was also limited to MCQs generated from oral and maxillofacial anatomy content, which may restrict the generalizability of the findings to other medical or dental disciplines. In addition, the structure and wording of the prompts used to guide LLM outputs may have influenced the quality and cognitive alignment of the generated questions, introducing potential bias. Moreover, our study focused specifically on alignment with Bloom’s Taxonomy and did not assess aspects such as clarity, clinical relevance, or suitability for dental exams; future research should address these dimensions and explore their impact on student learning outcomes. Finally, the analysis was limited to text-based MCQs, restricting its relevance to multimodal assessments. As advancements in medical imaging continue to address challenges in data generation and quality enhancement, future research exploring the potential of LLMs could offer deeper insights into their capabilities and transformative applications in this domain [39].

Conclusions

In conclusion, this study highlights significant variation among LLMs in their ability to generate MCQs aligned with Bloom’s Taxonomy, with Claude Sonnet 4 demonstrating the strongest ability to generate higher-order questions, while all evaluated LLMs performed well at lower cognitive levels. The findings offer insights for educators seeking to integrate AI into curriculum design and for developers aiming to enhance LLM capabilities for educational purposes. Continued research and model improvement may further enhance the effectiveness of LLMs in supporting critical thinking and deeper learning within healthcare education.

Supporting information

S1 Fig. Distribution of Likert ratings assigned by both evaluators across all large language models.

https://doi.org/10.1371/journal.pone.0341317.s001

(TIFF)

S1 File. Raw expert scoring data for MCQs generated by each LLM, used for all analyses reported in the manuscript.

https://doi.org/10.1371/journal.pone.0341317.s002

(XLSX)

Acknowledgments

We are grateful to our colleagues for their valuable scientific discussions and contributions throughout the course of this research.

References

  1. Palmer EJ, Devitt PG. Assessment of higher order cognitive skills in undergraduate education: modified essay or multiple choice questions? Research paper. BMC Med Educ. 2007;7:49. pmid:18045500
  2. Ricketts C, Brice J, Coombes L. Are multiple choice tests fair to medical students with specific learning disabilities? Adv Health Sci Educ Theory Pract. 2010;15(2):265–75. pmid:19763855
  3. Riggs CD, Kang S, Rennie O. Positive impact of multiple-choice question authoring and regular quiz participation on student learning. CBE Life Sci Educ. 2020;19(2):ar16. pmid:32357094
  4. Zaidi NLB, Grob KL, Monrad SM, Kurtz JB, Tai A, Ahmed AZ. Pushing critical thinking skills with multiple-choice questions: does Bloom’s taxonomy work? Acad Med. 2018;93(6):856–9.
  5. Parekh P, Bahadoor V. The Utility of Multiple-Choice Assessment in Current Medical Education: A Critical Review. Cureus. 2024;16(5):e59778. pmid:38846235
  6. Dell KA, Wantuch GA. How-to-guide for writing multiple choice questions for the pharmacy instructor. Curr Pharm Teach Learn. 2017;9(1):137–44. pmid:29180146
  7. Shin J, Guo Q, Gierl MJ. Multiple-choice item distractor development using topic modeling approaches. Front Psychol. 2019;10:825. pmid:31133911
  8. Anderson LW, Krathwohl DR, Airasian PW, Cruikshank KA, Mayer RE, Pintrich PR, Raths J, Wittrock MC. A taxonomy for learning, teaching, and assessing: a revision of Bloom’s Taxonomy of Educational Objectives. 2nd ed. New York: Longman; 2001.
  9. Stringer JK, Santen SA, Lee E, Rawls M, Bailey J, Richards A, et al. Examining Bloom’s Taxonomy in multiple choice questions: students’ approach to questions. Med Sci Educ. 2021;31(4):1311–7. pmid:34457973
  10. Adams NE. Bloom’s taxonomy of cognitive learning objectives. J Med Libr Assoc. 2015;103(3):152–3. pmid:26213509
  11. Larsen TM, Endo BH, Yee AT, Do T, Lo SM. Probing internal assumptions of the revised Bloom’s taxonomy. CBE Life Sci Educ. 2022;21(4):ar66.
  12. Gonzalez-Cabezas C, Anderson OS, Wright MC, Fontana M. Association between dental student-developed exam questions and learning at higher cognitive levels. J Dent Educ. 2015;79(11):1295–304. pmid:26522634
  13. Kim M-K, Patel RA, Uchizono JA, Beck L. Incorporation of Bloom’s Taxonomy into multiple-choice examination questions for a pharmacotherapeutics course. Am J Pharm Educ. 2012;76(6):114. pmid:22919090
  14. AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R. Multimodal large language models in health care: applications, challenges, and future outlook. J Med Internet Res. 2024;26:e59505.
  15. Meskó B. The impact of multimodal large language models on health care’s future. J Med Internet Res. 2023;25:e52865. pmid:37917126
  16. Zhang DW, Boey M, Tan YY, Jia AHS. Evaluating large language models for criterion-based grading from agreement to consistency. NPJ Sci Learn. 2024;9(1):79.
  17. Küchemann S, Avila KE, Dinc Y, Hortmann C, Revenga N, Ruf V, et al. On opportunities and challenges of large multimodal foundation models in education. NPJ Sci Learn. 2025;10(1):11. pmid:40000649
  18. Bojić L, Kovačević P, Čabarkapa M. Does GPT-4 surpass human performance in linguistic pragmatics? Humanit Soc Sci Commun. 2025;12(1).
  19. Alohali KI, Almusaeeb LA, Almubarak AA, Alohali AI, Muaygil RA. Reasoning-based LLMs surpass average human performance on medical social skills. Scientific Reports. 2025;15(1):36453.
  20. Kim H-S, Kim G-T. Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study. J Dent Sci. 2025;20(2):895–900. pmid:40224064
  21. Law AK, So J, Lui CT, Choi YF, Cheung KH, Kei-Ching Hung K, et al. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ. 2025;25(1):208. pmid:39923067
  22. Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39(2):175–91. pmid:17695343
  23. Yilmaz AE, Demirhan H. Weighted kappa measures for ordinal multi-class classification performance. Applied Soft Computing. 2023;134:110020.
  24. Rokhshad R, Zhang P, Mohammad-Rahimi H, Pitchika V, Entezari N, Schwendicke F. Accuracy and consistency of chatbots versus clinicians for answering pediatric dentistry questions: A pilot study. J Dent. 2024;144:104938. pmid:38499280
  25. Zhou S, Luo X, Chen C, Jiang H, Yang C, Ran G, et al. The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries. Int J Surg. 2024;110(10):6509–17. pmid:38935100
  26. Liu M, Okuhara T, Dai Z, Huang W, Gu L, Okada H, et al. Evaluating the effectiveness of advanced large language models in medical knowledge: a comparative study using Japanese national medical examination. Int J Med Inform. 2025;193:105673. pmid:39471700
  27. Pickering JD. Cognitive engagement: a more reliable proxy for learning? Med Sci Educ. 2017;27(4):821–3.
  28. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022;35:24824–37.
  29. Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024;24(1):354. pmid:38553693
  30. Sajjadi Mohammadabadi SM, Kara BC, Eyupoglu C, Uzay C, Tosun MS, Karakuş O. A survey of large language models: evolution, architectures, adaptation, benchmarking, applications, challenges, and societal implications. Electronics. 2025;14(18):3580.
  31. Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J Machine Learning Research. 2022;23(120):1–39.
  32. Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A. Language models are few-shot learners. arXiv preprint. 2020.
  33. Ruis L, Mozes M, Bae J, Kamalakara SR, Talupuru D, Locatelli A. Procedural knowledge in pretraining drives reasoning in large language models. arXiv preprint. 2024.
  34. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Machine Learning Res. 2020;21(140):1–67.
  35. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S. LoRA: low-rank adaptation of large language models. ICLR. 2022.
  36. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022;35:27730–44.
  37. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems. 2020;33:9459–74.
  38. de Armendi AJ, Butt AL, Konrad KM, Shukry M, Marek IIA. Factors associated with healthcare students’ cognitive levels measured by the classroom test of scientific reasoning. Graduate Medical Education Res J. 2024;6(2):2.
  39. Rouzrokh P, Khosravi B, Faghani S, Moassefi M, Shariatnia MM, Rouzrokh P, et al. A current review of generative AI in medicine: core concepts, applications, and current limitations. Curr Rev Musculoskelet Med. 2025;18(7):246–66. pmid:40304941