Performance of DeepSeek and ChatGPT on the Chinese Health Professional and Technical Examination: A comparative study

doi:10.1371/journal.pone.0338328

Fig 1.

Comparison of overall accuracy of DeepSeek-R1 and GPT-4o API in different units.

For overall accuracy by question type, DeepSeek-R1 achieved 89.1% accuracy on Type A questions compared with 69.1% for the GPT-4o API (P < 0.001), and 86.5% versus 64.0% on Type B questions (P = 0.001), indicating significant differences between the two models across both formats.
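
As a rough illustration of how a difference between two accuracy rates can be tested, the sketch below runs a Pearson chi-square test on a 2x2 table of correct and incorrect answers. The excerpt does not say which statistical test the authors used, and it does not give the number of questions, so both the choice of test and the counts are assumptions made purely for illustration. The sketch also treats the two models' answers as independent samples; because both models answered the same questions, a paired test such as McNemar's may be more appropriate.

```python
import math


def two_proportion_chi2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) comparing two proportions via a 2x2 table.

    Returns (chi2 statistic, two-sided p-value).
    """
    wrong_a = total_a - correct_a
    wrong_b = total_b - correct_b
    n = total_a + total_b

    observed = [[correct_a, wrong_a], [correct_b, wrong_b]]
    row_totals = [total_a, total_b]
    col_totals = [correct_a + correct_b, wrong_a + wrong_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# HYPOTHETICAL counts: 1000 questions per model is an assumed figure,
# chosen only so the proportions match 89.1% and 69.1%.
chi2, p = two_proportion_chi2(891, 1000, 691, 1000)
print(f"chi2 = {chi2:.2f}, p < 0.001: {p < 0.001}")
```

With real data, the per-model counts of correct answers and the actual number of questions would replace the hypothetical values.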


Fig 2.

Comparison of overall accuracy between DeepSeek-R1 and GPT-4o API in different disciplines.

A comparison of answer consistency showed that the GPT-4o API was more consistent than DeepSeek-R1, with consistency rates of 96.5% and 88.5%, respectively (P < 0.001). However, the consistent accuracy rate of the GPT-4o API was only 66.7%, compared with 84.0% for DeepSeek-R1, and this difference was also significant (P < 0.001).
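
The two metrics above can be made concrete with a minimal sketch. The excerpt does not define them, so the definitions here are assumptions: a question counts as "consistent" when every repeated run returns the same answer, and as "consistently correct" when that shared answer also matches the answer key, with both rates taken over all questions. The function name, question IDs, and answers are hypothetical.

```python
def consistency_metrics(runs, answer_key):
    """Compute (consistency rate, consistent accuracy rate).

    runs:       dict mapping question id -> list of answers from repeated runs
    answer_key: dict mapping question id -> correct answer

    Assumed definitions (not taken from the source):
      consistency rate    = share of questions with identical answers in all runs
      consistent accuracy = share of questions that are consistent AND correct
    """
    total = len(runs)
    consistent = 0
    consistent_correct = 0
    for qid, answers in runs.items():
        if len(set(answers)) == 1:  # every run gave the same answer
            consistent += 1
            if answers[0] == answer_key[qid]:
                consistent_correct += 1
    return consistent / total, consistent_correct / total


# Hypothetical toy data: four questions, three runs each.
runs = {
    "q1": ["A", "A", "A"],  # consistent and correct
    "q2": ["B", "C", "B"],  # inconsistent
    "q3": ["D", "D", "D"],  # consistent but wrong
    "q4": ["A", "A", "A"],  # consistent and correct
}
answer_key = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}

consistency_rate, consistent_accuracy = consistency_metrics(runs, answer_key)
print(consistency_rate, consistent_accuracy)  # 0.75 0.5
```

Under these assumed definitions a model can be highly consistent yet have low consistent accuracy, which is the pattern reported for the GPT-4o API.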


Table 1.

Comparison of consistent accuracy between DeepSeek-R1 and GPT-4o API in different categorizations.
