Evaluating cognitive depth of AI-generated multiple-choice questions with Bloom’s Taxonomy

Flowchart illustrating the study design.

Example of generated questions and their respective responses by ChatGPT-4o.

Scoring scheme for evaluating the five cognitive levels of Bloom’s Taxonomy on a 5-point Likert scale.

Median and interquartile range (IQR) of each LLM’s scores across Bloom’s cognitive levels.

Heatmap of mean ± standard deviation (SD) Likert scores for Bloom’s cognitive levels across five LLMs.

Distribution of Likert scores across Bloom’s cognitive levels.

Distribution of Likert scores across Bloom’s cognitive levels for each LLM.

Comparative performance of paid and free large language models (LLMs) in generating MCQs aligned with Bloom’s Taxonomy.