Abstract
Background
The cross-lingual and question-type variations affecting the accuracy of large language models (LLMs) on the Chinese national medical licensing examination remain insufficiently explored.
Methods
In this cross-sectional study (May 13–20, 2025), 396 examination questions (198 English–Chinese pairs) were extracted from the Chinese national medical licensing examination. ChatGPT-4o, ChatGPT-o3, Gemini-2.5-pro, Deepseek-V3, Deepseek-R1, and Doubao-1.5-pro were prompted to provide answers. Responses were compared against reference answers, and accuracy was computed for three question types: basic knowledge (Type A), case analysis (Type B), and integrative judgment (Type C).
Results
Across all question types and languages, Doubao-1.5-pro achieved the highest accuracy at 92.0% ± 1.3%, whereas ChatGPT-4o had the lowest at 82.8% ± 3.7%. There was a significant main effect of question type (P = 0.0038) but no main effect of language (P = 0.56). Post hoc tests confirmed that Type A performance exceeded that on Type B (P = 0.002) and Type C (P = 0.025), while Type B and Type C did not differ. Among the models, Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated notable cross-lingual stability, with accuracy differences between Chinese and English versions remaining below 5%.
Conclusion
Question type was a key factor affecting LLM performance on Chinese medical licensing examination questions, whereas language had no significant impact. Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated particularly strong cross-lingual consistency. These findings point to the potential value of specialized LLMs for enhancing medical education in China.
Citation: Tang Y, Chen J, Wang S (2026) Performance benchmarking of LLMs on Chinese national medical licensing education: Cross-lingual and question-type effects. PLoS One 21(4): e0346518. https://doi.org/10.1371/journal.pone.0346518
Editor: Mohmed Isaqali Karobari, University of Puthisastra, CAMBODIA
Received: August 21, 2025; Accepted: March 18, 2026; Published: April 8, 2026
Copyright: © 2026 Tang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript files.
Funding: This work was supported by Chongqing Natural Science Foundation (CSTB2024NSCQ-KJFZMSX0065) and the Foundation of Nanjing Medical University (No. JX1231600602), Jiangsu Province Hospital (TS202401, CZ0121002010039). There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Artificial intelligence has advanced rapidly in recent years, leading to the integration of generative large language models (LLMs) into essential healthcare services. These models are now used in applications ranging from clinical knowledge management to telemedicine consultations. The medical field, with its complex knowledge requirements and critical decision-making contexts, provides an important opportunity to evaluate how well artificial intelligence performs in real-world scenarios. A study in medical education demonstrated that GPT-4 can improve medical students’ diagnostic accuracy by 19% [1]. Furthermore, extensive research shows that LLMs encode substantial clinical knowledge, performing at expert level across various medical specialties [2–4].
The Chinese medical licensing examination is one of the world’s largest clinical competency assessments, using various question types to evaluate foundational knowledge, clinical reasoning, and practical application skills. While researchers have investigated large language model performance on the United States medical licensing exams and similar assessments [5,6], there is a lack of comprehensive research examining how LLMs perform when evaluated across different languages and question types [7].
This research presents a new mixed-format evaluation framework for the Chinese medical licensing examination, utilizing six major large language models: OpenAI’s ChatGPT-o3 and ChatGPT-4o, Google’s Gemini-2.5-pro, DeepSeek’s R1 and V3 models, and ByteDance’s Doubao-1.5-pro [8,9]. The purpose of this study is to investigate the performance of LLMs on the Chinese medical licensing examination, with the goal of providing valuable insights for medical education and training.
Materials and methods
Question extraction
The Chinese medical licensing examination utilizes three standardized formats: Type A (independent items), Type B (shared-stem items), and Type C (shared-option items). While Type A items focus on foundational knowledge retrieval, Type B and Type C formats introduce significant contextual complexity: Type B items require the model to sustain clinical reasoning across a multi-stage vignette, while Type C items assess its precision in differential diagnosis when faced with a static set of competing distractors.

Since official examination papers are not publicly released, all items for this study were sourced from widely recognized, publicly available preparatory materials provided by Beijing Medical Examination Assistance Technology Co., Ltd. We selected materials from 2020–2022 to ensure the inclusion of the most comprehensive and contemporary collections with verified answer keys. To assess the adequacy of this sample size, we calculated that 198 items drawn from an annual population of approximately 600 questions yield a margin of error of approximately ±5.8% at a 95% confidence level (assuming 50% response accuracy for maximum variance, as reproduced in the sketch below), a degree of precision widely considered sufficient for the evaluative analysis of LLMs.

To maintain consistency for cross-lingual evaluation, only text-based items were included; questions containing images, tables, or chemical formulas were excluded. From this pool of available text-based items, we employed stratified random sampling to select 198 questions, strictly preserving the original distribution of Type A, B, and C formats. This yielded a dataset of 76 Type A, 62 Type B, and 60 Type C items. The English versions were generated through a standardized forward-translation process conducted by two authors (YX Tang and SJ Wang), both of whom have over ten years of clinical experience and more than one year of formal medical training in the United States. The study was conducted from May 13–20, 2025, using ChatGPT-o3 (released April 17, 2025), ChatGPT-4o (released May 13, 2025), Gemini-2.5-pro (released March 25, 2025), Doubao-1.5-pro (released January 22, 2025), Deepseek-R1 (released January 20, 2025), and Deepseek-V3 (released December 26, 2024).
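For transparency, the reported sampling precision follows from a standard margin-of-error formula with a finite population correction. The minimal R sketch below (R being the language the paper reports for its analyses) reproduces the arithmetic; the population size of 600 is the paper's approximation, and the exact decimal depends on whether z is rounded to 1.96 or 2.

```r
# Margin of error for n = 198 items sampled from a pool of ~600 questions,
# assuming p = 0.5 (maximum variance) at a 95% confidence level.
n <- 198           # sampled items
N <- 600           # approximate annual question pool (paper's estimate)
p <- 0.5           # assumed accuracy: the most conservative (widest) bound
z <- qnorm(0.975)  # two-sided 95% critical value, ~1.96

fpc <- sqrt((N - n) / (N - 1))          # finite population correction
moe <- z * sqrt(p * (1 - p) / n) * fpc  # margin of error
round(100 * moe, 1)                     # ~5.7 (~5.8 if z is rounded to 2)
```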
Response generation
All models were accessed through their official web interfaces using default parameters. Input methods differed by question type: each Type A question was entered in a fresh conversation session; Type B questions sharing the same stem were entered together in a single session, with a new session opened for the next set sharing a different stem; and Type C questions sharing the same set of options were entered together in one session, with a new session initiated for the next group sharing different options. During the experiment, research assistants used a standardized prompt: “You should act as an educational expert of Medicine. When faced with a choice question of type A, where each question has only one best answer, you need to provide the optimal response. The question is ‘The primary damage to DNA by ultraviolet rays is the formation of: A. Base deletion; B. Base insertion; C. Base substitution; D. Thymine dimers; E. Phosphodiester bond cleavage’. You are required to present your answer in the following structure: Response: Answer.” (Fig 1). Two medically trained assistants independently verified each model’s answer against the official key, labeling responses as correct or incorrect.
The yellow placeholder denotes the question type (A, B, C), and the blue placeholder denotes the question stem. Type A (single-best-answer questions): direct, standalone multiple-choice items that test retrieval of basic medical facts. Type B (case-based analysis questions): full-case prompts, including chief complaint, medical history, and diagnostic findings, designed to evaluate a candidate’s clinical reasoning. Type C (shared-option questions): multi-part items that present a common set of five diagnoses and then pose sequential questions on pathogen identification and radiologic feature matching, probing candidates’ ability to transfer and apply knowledge in multidimensional diagnostic scenarios.
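The prompt template is fixed except for the question type and the item text. As an illustration only (the study entered prompts manually through each model's web interface, not programmatically), the R sketch below assembles the standardized prompt from Fig 1 for a given item; the build_prompt() helper is hypothetical and not part of the study's materials.

```r
# Hypothetical helper illustrating how the standardized prompt from Fig 1
# is assembled; the study itself used manual entry via web interfaces.
build_prompt <- function(question_type, question_text) {
  sprintf(paste0(
    "You should act as an educational expert of Medicine. ",
    "When faced with a choice question of type %s, where each question has ",
    "only one best answer, you need to provide the optimal response. ",
    "The question is \"%s\". ",
    "You are required to present your answer in the following structure: ",
    "Response: Answer."
  ), question_type, question_text)
}

# Example with the Type A item quoted in the text.
cat(build_prompt("A", paste(
  "The primary damage to DNA by ultraviolet rays is the formation of:",
  "A. Base deletion; B. Base insertion; C. Base substitution;",
  "D. Thymine dimers; E. Phosphodiester bond cleavage")))
```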
Statistical analysis
Statistical analyses began with descriptive summaries of each model’s mean accuracy across every combination of question type (A, B, C) and language (Chinese, English). To formally assess differences, a two-way repeated-measures ANOVA was conducted with mean accuracy as the dependent variable, model as the subject factor, and question type (three levels) and language (two levels) as within-subject factors; this allowed testing of the main effects of question type and language as well as their interaction. Following a significant main effect of question type, post hoc paired t-tests were performed between each pair of question types (A vs. B, A vs. C, B vs. C), and a separate paired t-test compared Chinese versus English, with all resulting p-values adjusted via the Bonferroni correction. All statistical analyses were performed in R (version 4.2) and figures were produced with ggplot2.
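A minimal sketch of this analysis in R is given below, assuming a long-format data frame acc with one accuracy value per model × question type × language cell; the column names (model, type, language, accuracy) are illustrative, as the authors' actual scripts are not published.

```r
# Illustrative data layout: 6 models x 3 types x 2 languages = 36 rows.
acc$model    <- factor(acc$model)
acc$type     <- factor(acc$type)
acc$language <- factor(acc$language)

# Two-way repeated-measures ANOVA: each model is a "subject"; question type
# and language are within-subject factors (df as reported: F(2,10), F(1,5)).
fit <- aov(accuracy ~ type * language + Error(model / (type * language)),
           data = acc)
summary(fit)

# Post hoc paired t-tests between question types, Bonferroni-corrected;
# assumes rows are ordered identically (by model and language) within types.
with(acc, pairwise.t.test(accuracy, type, paired = TRUE,
                          p.adjust.method = "bonferroni"))

# Paired Chinese vs. English comparison on per-model mean accuracies.
zh <- with(subset(acc, language == "Chinese"), tapply(accuracy, model, mean))
en <- with(subset(acc, language == "English"), tapply(accuracy, model, mean))
t.test(zh, en, paired = TRUE)
```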
Results
Each model’s overall accuracy (mean ± SD) across all question types and languages was: ChatGPT-4o 82.8% ± 3.7%, ChatGPT-o3 86.1% ± 2.5%, Deepseek-V3 89.8% ± 1.4%, Deepseek-R1 91.0% ± 1.5%, Doubao-1.5-pro 92.0% ± 1.3%, and Gemini-2.5-pro 88.7% ± 1.7% (Fig 2). All models performed most reliably on Type A questions, maintaining average accuracy above 88%. Type C questions, however, showed the greatest variability, with accuracy ranging from 72% to 95%. ChatGPT-4o scored 88.2% on Type A, 82.3% on Type B, and 71.7% on Type C in Chinese; in English the corresponding scores were 97.4%, 77.4%, and 80.0%. Deepseek-R1 stood out on Type C questions, achieving 95.0% accuracy in Chinese and 88.3% in English, superior to most other models in our evaluation. Notably, Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 all showed strong cross-lingual stability, with language-induced accuracy differences under 5%. For Type C questions, the gap between Deepseek-R1 (the top model) and ChatGPT-4o (the lowest-performing model) was nearly 25 percentage points (Fig 3).
There was a significant main effect of question type (F(2,10) = 10.22, P = 0.0038), but no significant effect of language (F(1,5) = 0.39, P = 0.56) (Fig 4). A marginal interaction between question type and language was observed (F(2,10) = 3.57, P = 0.068). The performance of Type A was significantly better than both Type B (P = 0.002) and Type C (P = 0.025). However, neither the Type B vs. Type C comparison nor the Chinese vs. English comparison reached statistical significance after correction.
The central line indicates the median score, with box edges marking the 25th and 75th percentiles, defining the interquartile range (IQR) of the distributions.
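A boxplot of this kind can be produced with ggplot2, which the paper reports using for all figures; the sketch below assumes the same illustrative acc data frame described under Statistical analysis.

```r
# Sketch of the described boxplot: each box spans the interquartile range
# (25th-75th percentile) and the central line marks the median.
library(ggplot2)

ggplot(acc, aes(x = type, y = accuracy, fill = language)) +
  geom_boxplot() +
  labs(x = "Question type", y = "Accuracy", fill = "Language") +
  theme_minimal()
```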
Discussion
This study systematically evaluated the performance of six LLMs on Chinese medical licensing examination questions across both Chinese and English versions and different question types. Our analysis yielded several key findings. Question type emerged as the primary factor influencing model performance, with a significant main effect observed. All models performed most consistently on basic knowledge questions, whereas complex integrative questions revealed substantial performance variability. Language had minimal impact on overall performance, with no significant main effect detected. This stability suggests these models have developed robust semantic representations that transcend surface linguistic differences.
We also found that complex questions discriminate between models, in line with cognitive theories, which suggest that higher-order thinking tasks better expose differences in reasoning ability. Several studies have shown that models with specialized reasoning capabilities, such as those using chain-of-thought prompting or other reasoning-enhancement techniques, usually outperform general-purpose models on complex medical questions [10,11]. Our results echo this: models like Doubao-1.5-pro and Deepseek-R1, which may adopt more advanced reasoning architectures, performed particularly well on complex questions [12,13]. Our work also extends previous findings by showing that as question complexity increases, the performance gap between models widens. For example, on Type C questions, the accuracy difference between Deepseek-R1 and ChatGPT-4o was nearly 25 percentage points. This indicates that complex integrative tasks may be especially useful for evaluating and comparing the reasoning abilities of LLMs in the medical field.
Our findings align with and extend previous research evaluating LLM performance on medical licensing examinations in China and internationally. Prior studies have shown that LLMs generally perform well on foundational recall questions but exhibit greater variability on complex reasoning tasks, consistent with our observation of a significant question-type effect [12,14]. Other evaluations of Chinese medical licensing examinations similarly reported that English prompts may improve performance on basic knowledge tasks but offer limited benefit for higher-order reasoning [15]. Additionally, recent work on dental and specialty examinations demonstrated that cross-lingual performance varies substantially across models, highlighting the importance of bilingual semantic alignment [16]. Our study contributes to this literature by providing a direct cross-lingual comparison across three question types and identifying models with particularly strong cross-lingual stability.
Some previous studies have found that English prompts improve model performance on basic knowledge tasks [17–19]. Our results, however, show that this advantage fades on more complicated clinical reasoning tasks, in line with research indicating that although large English-language corpora help with simple questions, they offer little benefit for complex medical reasoning [20,21]. Notably, Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 showed strong stability across languages, consistent with studies showing that models trained on bilingual biomedical data with semantic alignment methods can achieve strong cross-lingual performance [22–24].
This study has several limitations. Although the passing standard for this examination is 60% accuracy, our results do not indicate that these LLMs could pass the real examination: the analysis was based on only 198 questions rather than the complete examination set, and it was limited to text-based single-choice items, excluding questions containing images, tables, chemical formulas, and other non-text modalities that the full examination typically includes. The study also focused only on accuracy, neglecting consistency, rationale quality, educational value, and clinical validity. Future studies extending assessment to these dimensions and to real-world scenarios will better establish LLMs’ utility and reliability in clinical education.
In conclusion, this study provides a comprehensive evaluation of six LLMs on the Chinese national medical licensing examination across languages and question types. A central finding is that question type, rather than language, is the primary determinant of model performance. All models showed consistent accuracy on basic knowledge questions but exhibited substantial variability on complex questions. Among the models evaluated, Doubao-1.5-pro and Deepseek-R1 demonstrated superior overall performance and maintained strong cross-lingual stability, with low language-related accuracy differences. These findings have important implications for the development and deployment of LLMs in medical education.
Acknowledgments
We would like to express our sincere gratitude to Dr. Chuanbing Wang for his kind assistance in using ChatGPT and Gemini during his stay in the United States in May 2025.
References
- 1. Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large language models in medical education: applications and implications. JMIR Med Educ. 2023;9:e50945. pmid:37578830
- 2. Choi J. Large language models in medicine. Healthc Inform Res. 2025;31(2):111–3.
- 3. Jung KH. Large language models in medicine: Clinical applications, technical challenges, and ethical considerations. Healthc Inform Res. 2025;31(2):114–24.
- 4. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80. pmid:37438534
- 5. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. pmid:36812645
- 6. Schubert MC, Wick W, Venkataramani V. Performance of Large Language Models on a Neurology Board-Style Examination. JAMA Netw Open. 2023;6(12):e2346721. pmid:38060223
- 7. Eysenbach G. The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers. JMIR Med Educ. 2023;9:e46885. pmid:36863937
- 8. DeepSeek-AI, Guo D, Yang D, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv. 2025.
- 9. Tordjman M, Liu Z, Yuce M, Fauveau V, Mei Y, Hadjadj J, et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med. 2025;31(8):2550–5. pmid:40267969
- 10. Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ. 2024;24(1):143. pmid:38355517
- 11. Zong H, et al. Large language models in worldwide medical exams: platform development and comprehensive analysis. J Med Internet Res. 2024;26:e66114.
- 12. Diao Y, et al. Multiple large language models’ performance on the Chinese medical licensing examination: quantitative comparative study. JMIR Hum Factors. 2025;12:e77978.
- 13. Li S. Towards A Fair Duel: reflections on the evaluation of DeepSeek-R1 and ChatGPT-4o in Chinese medical education. J Med Syst. 2025;49(1):172. pmid:41315131
- 14. Ming S, Guo Q, Cheng W, Lei B. Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study. JMIR Med Educ. 2024;10:e52784. pmid:39140269
- 15. Tseng LW, et al. Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study. JMIR Med Educ. 2025;11:e58897.
- 16. Chau RCW, et al. Performance of generative artificial intelligence in dental licensing examinations. Int Dent J. 2024;74(3):616–21.
- 17. Fang C, Wu Y, Fu W, Ling J, Wang Y, Liu X, et al. How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language. PLOS Digit Health. 2023;2(12):e0000397. pmid:38039286
- 18. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W, et al. Evaluating the Efficacy of ChatGPT in Navigating the Spanish Medical Residency Entrance Examination (MIR): Promising Horizons for AI in Clinical Medicine. Clin Pract. 2023;13(6):1460–87. pmid:37987431
- 19. Tong W, Guan Y, Chen J, Huang X, Zhong Y, Zhang C, et al. Artificial intelligence in global health equity: an evaluation and discussion on the application of ChatGPT, in the Chinese National Medical Licensing Examination. Front Med (Lausanne). 2023;10:1237432. pmid:38020160
- 20. Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, et al. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res. 2024;26:e60807. pmid:39052324
- 21. Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep. 2023;13(1):20512. pmid:37993519
- 22. Wang H, Wu W, Dou Z, He L, Yang L. Performance and exploration of ChatGPT in medical examination, records and education in Chinese: Pave the way for medical AI. Int J Med Inform. 2023;177:105173. pmid:37549499
- 23. Sozen Yanik I, Sahin Hazir D, Bilgin Avsar D. Cross-lingual performance of large language models in maxillofacial prosthodontics: a comparative evaluation. BMC Oral Health. 2025;25(1):1630. pmid:41107888
- 24. Zhong W. Performance of ChatGPT-4o and four open-source large language models in generating diagnoses based on China’s rare disease catalog: comparative study. J Med Internet Res. 2025;27:e69929.