Comparing large language models and search engine responses to common orthodontic questions

doi:10.1371/journal.pone.0339908

Fig 1.

Flow chart of the research design.

Fig 2.

Overall Quality (A), Overall Empathy (B), Overall Readability (C), and Overall Satisfaction (D) of LLM and search engine responses to questions. A indicates GPT-4o; B, GPT-4o mini; C, Claude 3.5 Sonnet; D, Kimi AI; E, ERNIE Bot; F, Google; G, Microsoft Bing; and H, Baidu. The midline indicates the median (50th percentile); the box, the 25th and 75th percentiles; and the density distribution plot represents the probability density of the response score distribution. Kruskal-Wallis tests were used for comparisons among LLMs and among search engines, and Mann-Whitney U tests were used for comparisons between LLMs and search engines.
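
As a rough illustration of the two rank-based tests named in the caption, the following minimal sketch computes the Kruskal-Wallis H statistic and the Mann-Whitney U statistic in plain Python. The function names and the toy scores are hypothetical and are not taken from the study; the H statistic is shown without tie correction, and p-values are omitted. This is not the authors' analysis code.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic (no tie correction) for two or more independent groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def mann_whitney_u(x, y):
    """U statistic for sample x compared with sample y."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Toy scores, invented purely for illustration (not data from the study).
llm_a = [4, 5, 5, 4]
llm_b = [3, 4, 4, 5]
engine_f = [2, 3, 2, 3]

h_among_llms = kruskal_wallis_h([llm_a, llm_b])
u_llm_vs_engine = mann_whitney_u(llm_a + llm_b, engine_f)
print(h_among_llms, u_llm_vs_engine)
```

In practice a statistics package would normally be used instead, since it also applies the tie correction and returns p-values.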

Table 1.

Quality, empathy, readability, and satisfaction scores of LLM and search engine responses to questions.

Table 2.

Comparisons between LLM and search engine scores on the indicators for the different question categories.

Fig 3.

(A) Heat map of evaluation indicator scores for the LLMs. (B) Heat map of evaluation indicator scores for the search engines. The horizontal axis lists the evaluation indicators, including medical accuracy, completeness, and so on; the vertical axis shows the question number. Color shading indicates whether a score is high or low, and the color bars on the right mark the range of scores. Median scores are shown in parentheses. Numerical data underlying this figure are provided in S13 Appendix.
