Fig 1.
Flow chart of the research design.
Fig 2.
Overall Quality (A), Overall Empathy (B), Overall Readability (C), and Overall Satisfaction (D) of LLM and search engine responses to questions. Within each panel, A indicates GPT-4o; B, GPT-4o mini; C, Claude 3.5 Sonnet; D, Kimi AI; E, ERNIE Bot; F, Google; G, Microsoft Bing; and H, Baidu. The midline indicates the median (50th percentile); the box, the 25th and 75th percentiles; and the density plot, the probability density of the response scores. Kruskal-Wallis tests were used for comparisons among LLMs and among search engines, and Mann-Whitney U tests were used for comparisons between LLMs and search engines.
Table 1.
Quality, Empathy, Readability, and Satisfaction scores of LLM and search engine responses to questions.
Table 2.
Comparisons between LLM and search engine scores on the indicators for the different question categories.
Fig 3.
(A) Heat map of evaluation indicator scores for the LLMs. (B) Heat map of evaluation indicator scores for the search engines. The horizontal axis lists the evaluation indicators, including medical accuracy, completeness, and so on; the vertical axis lists the question numbers. Color shading indicates higher or lower scores, and the color bars on the right show the range of scores. Median scores are shown in parentheses. Numerical data underlying this figure are provided in S13 Appendix.