Comparing large language models and search engine responses to common orthodontic questions

doi:10.1371/journal.pone.0339908

Fig 2

Overall Quality (A), Overall Empathy (B), Overall Readability (C), and Overall Satisfaction (D) of LLM and search engine responses to questions. Within each panel, A indicates GPT-4o; B, GPT-4o mini; C, Claude 3.5 Sonnet; D, Kimi AI; E, ERNIE Bot; F, Google; G, Microsoft Bing; and H, Baidu. The midline indicates the median (50th percentile); the box, the 25th and 75th percentiles; and the density distribution plot, the probability density of the response scores. Kruskal-Wallis tests were used for comparisons among LLMs and among search engines, and Mann-Whitney U tests were used for comparisons between LLMs and search engines.

doi: https://doi.org/10.1371/journal.pone.0339908.g002
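
The two rank-based tests named in the caption can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names (average_ranks, kruskal_wallis_h, mann_whitney_u) and the sample scores are hypothetical, the H statistic is shown without a tie correction, and p-values are omitted. A statistics package would normally handle both.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic for k groups (no tie correction)."""
    pooled = list(chain.from_iterable(groups))
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x relative to sample y."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Hypothetical scores, invented for illustration only (not data from the study).
llm_scores = {"A": [5, 4, 5, 4], "B": [4, 4, 3, 5], "C": [3, 4, 4, 5]}
engine_scores = {"F": [2, 3, 3, 2], "G": [3, 2, 2, 3]}

# Comparison among LLMs (Kruskal-Wallis).
h_among_llms = kruskal_wallis_h(*llm_scores.values())

# Comparison between all LLM responses and all search engine responses (Mann-Whitney U).
u_llm_vs_engine = mann_whitney_u(
    list(chain.from_iterable(llm_scores.values())),
    list(chain.from_iterable(engine_scores.values())),
)
```

Both tests operate on ranks rather than raw scores, which suits ordinal rating scales such as those plotted in the figure.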