Table 1.

Comparison of overall and category-specific accuracy rates among large language models.

Fig 1.

Comparative accuracy of large language models on ACLS questions.

Bar graph illustrating the percentage of correct responses provided by four large language models (ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1) across three predefined accuracy criteria: overall accuracy (all three responses correct), strict accuracy (at least two responses correct), and ideal accuracy (at least one correct response). ChatGPT-4o achieved 100% on all accuracy metrics. Claude 3.5 and Gemini 2.0 also demonstrated high performance, while DeepSeek R1 exhibited significantly lower accuracy, particularly under the “overall” criterion. Pairwise differences between models were assessed with McNemar’s test; the resulting p values are annotated above the bars. All group comparisons were statistically significant (p < 0.001).
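For readers who wish to see how such pairwise comparisons can be computed, the following is a minimal sketch of an exact McNemar test on paired correct/incorrect outcomes for two models answering the same items. The outcome vectors are illustrative placeholders rather than the study data, and the use of scipy's binomial test is an assumption about tooling, not the authors' implementation.

# Minimal sketch of an exact McNemar test for paired binary outcomes,
# e.g. per-question correctness of two models on the same set of ACLS items.
# The example vectors below are illustrative, not the study data.
from scipy.stats import binomtest

model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # 1 = correct, 0 = incorrect
model_b = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1]

# Only discordant pairs drive the test:
# b = model A correct, model B wrong; c = model A wrong, model B correct.
b = sum(1 for a, m in zip(model_a, model_b) if a == 1 and m == 0)
c = sum(1 for a, m in zip(model_a, model_b) if a == 0 and m == 1)

# Exact McNemar p value: two-sided binomial test of b successes
# in b + c trials with success probability 0.5.
result = binomtest(b, n=b + c, p=0.5, alternative="two-sided")
print(f"b = {b}, c = {c}, exact McNemar p = {result.pvalue:.4f}")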

Fig 2.

Performance percentages of large language models by question type.

Bar charts displaying the accuracy rates of four large language models (ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1) across three question categories: visual (n = 12), knowledge-based (n = 29), and case-based (n = 9). Each chart presents three predefined accuracy metrics: overall accuracy (all responses correct), strict accuracy (at least two correct responses), and ideal accuracy (at least one correct response). ChatGPT-4o achieved perfect performance (100%) across all question types and metrics. Claude 3.5 showed excellent accuracy on knowledge-based and visual items but relatively lower performance on case-based questions. Gemini 2.0 performed comparably on knowledge-based questions but underperformed in the visual and case-based categories. DeepSeek R1 achieved 100% accuracy on knowledge-based questions but failed to answer any visual question correctly and showed moderate performance on case-based items. These findings highlight model-specific variability in reasoning, domain knowledge, and visual recognition capabilities.

Fig 3.

Response accuracy distribution of large language models across question types.

Scatter plot depicting the distribution of response accuracy scores (0 to 3) for each of the 50 ACLS questions, categorized by question type: visual (n = 12), case-based (n = 9), and knowledge-based (n = 29). Accuracy scores were defined as follows: 3 = Strict accuracy (all three responses correct), 2 = Overall accuracy (at least two correct responses), 1 = Ideal accuracy (at least one correct response), and 0 = No accuracy (all responses incorrect). Each symbol represents the performance of one large language model on a specific question: ChatGPT-4o (green circles), Gemini 2.0 (blue diamonds), Claude 3.5 (purple triangles), and DeepSeek R1 (orange squares). ChatGPT-4o maintained perfect accuracy (score = 3) across all items, while DeepSeek R1 consistently failed (score = 0) on visual questions. Gemini 2.0 and Claude 3.5 exhibited variable performance, particularly in the case-based and visual domains. The figure highlights performance heterogeneity across both question types and model architectures.
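As a companion to the score definitions above, the following is a minimal sketch of how the 0 to 3 accuracy score could be assigned from three graded responses per question. The function name and input format are hypothetical; the score labels follow the definitions given in this caption.

# Minimal sketch of the 0-3 accuracy scoring described above, assuming each
# question was posed to a model three times and each response was graded
# correct (True) or incorrect (False). Function name is hypothetical.
def accuracy_score(responses_correct):
    """Map the number of correct responses (out of 3) to the figure's score scale."""
    n = sum(responses_correct)
    if n == 3:
        return 3  # Strict accuracy: all three responses correct
    if n == 2:
        return 2  # Overall accuracy: at least two correct responses
    if n == 1:
        return 1  # Ideal accuracy: at least one correct response
    return 0      # No accuracy: all responses incorrect

print(accuracy_score([True, True, False]))  # -> 2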
