Abstract
Background
Large Language Models (LLMs) have shown potential in supporting patient education and self-management, but their performance in responding to common orthodontic questions has yet to be explored.
Objectives
This study aims to compare the quality, empathy, readability, and satisfaction of responses from LLMs and search engines to common orthodontic questions.
Methods
Forty-five common orthodontic questions (six categories) and a prompt were developed, and a self-designed multidimensional evaluation questionnaire was constructed. The questions were presented to 5 LLMs and 3 search engines on December 22, 2024. The primary outcomes were the median expert-rated scores of LLM versus search engine responses on quality, empathy, readability, and satisfaction, rated on 5- or 10-point Likert scales.
Results
LLMs scored significantly higher than search engines in quality (4.00 vs. 3.50, p < 0.001), empathy (3.75 vs. 3.50, p < 0.001), readability (4.00 vs. 3.75, p < 0.001), and satisfaction (8.00 vs. 7.25, p < 0.001). LLM-generated responses were also rated significantly higher than those from search engines in the therapeutic outcomes, appliance selection, and cost categories.
Citation: Ren Y, Sun J (2026) Comparing large language models and search engine responses to common orthodontic questions. PLoS One 21(1): e0339908. https://doi.org/10.1371/journal.pone.0339908
Editor: Bekalu Tadesse Moges, Woldia University, ETHIOPIA
Received: September 16, 2025; Accepted: December 12, 2025; Published: January 2, 2026
Copyright: © 2026 Ren, Sun. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Patient education and self-management improve outcomes such as medication adherence (d = 0.316) and reduce 30-day readmissions by 21% and hospitalizations by 60% [1–3]. However, limited consultation time and heavy workloads hinder communication, increasing the risk of adverse outcomes (HR = 1.27–1.75) [4–6].
The Internet has become a primary source for individuals seeking health information [7]. Search engines deliver health information by retrieving and ranking existing web content through keyword-based algorithms [8]; they were used by approximately 83% of health-information seekers from 2016 to 2021, and 75% of seekers reported that the results influenced their self-management decisions [9]. Artificial intelligence (AI), which seeks to create machines that perform tasks requiring human-like cognition [10], advanced markedly with the deep learning breakthroughs that began in 2012 [11]. A national survey [12], published as a peer-reviewed Data Brief in JAMA, reported that 63% of respondents used AI tools for information. Large Language Models (LLMs), a class of AI systems, leverage deep neural networks trained on vast datasets to generate natural language text [13]. LLMs excel at complex tasks such as natural language processing and human-computer interaction, demonstrating advanced contextual understanding and linguistic pattern recognition [14]. Their applications in healthcare have expanded rapidly [15,16], transforming areas such as documentation, clinical diagnosis, and patient education [17].
Wang et al. [18] conducted a bibliometric analysis of 5,284 articles on generative AI in medicine and identified "medical education" as the burst keyword (intensity: 4.58) during 2023–2024. Ma et al. [19] used GPT-4 to role-play patients and deliver personalized ICU discharge education. Hao et al. [20] developed MedEduChat, an LLM-based chatbot for prostate cancer patient education that enables personalized, semi-structured health education interactions. The EHRTutor framework, developed by Zhang et al. [21], enables personalized education on patient discharge instructions. Aydin et al. [22] comprehensively reviewed advances in LLMs for patient education, emphasized that the accuracy and readability of LLM-generated content require further research, and suggested evaluating the performance of LLM-generated responses in patient education.
Malocclusion has a prevalence of 43.5% to 67.2% in the general population [23], and early orthodontic intervention is recommended [24,25]. Orthodontic treatment lasts 27.9 months on average and requires 23.8 appointments [26]. Although Zhao et al. [27] demonstrated that improved oral health knowledge enhances self-management (β = 0.527) and treatment outcomes (p < 0.001), the demand for health education remains high and unmet [28].
This study aims to systematically evaluate and compare the performance of LLMs and mainstream search engines in responding to common orthodontic questions, and to assess the suitability of LLMs as tools for patient education and self-management.
Methods
This study was reviewed and approved by Peking University Institutional Review Board (Approval No: IRB00001052–24162). All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional research committee. Written informed consent was obtained from all individual participants prior to their inclusion in the study. Four expert raters provided written informed consent before participation and received no financial compensation. The participant recruitment and data collection period spanned from December 20, 2024, to January 5, 2025. This study followed the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) reporting guideline.
Common orthodontic questions (45 questions, 6 categories) and one prompt
Question pool development (117 questions).
The China National Knowledge Infrastructure (CNKI), the largest Chinese academic database, was searched with the subject terms ‘Orthodontics’ and ‘Orthodontic Treatment’; researchers read 420 articles and collected 24 common orthodontic questions based on the principle of saturation (Fig 1).
On four high-traffic online platforms, Zhihu (https://www.zhihu.com/), DXY (https://dxy.com/), Chunyu Health (https://www.chunyuyisheng.com/), and Reddit (https://www.reddit.com/), the researchers searched with “orthodontics” as the keyword and collected 52 common orthodontic questions based on the principle of saturation.
Semi-structured interviews (S1 Appendix) with two orthodontic experts and three orthodontic patients (basic information in S2 Appendix) were conducted, and 41 common orthodontic questions were collected based on the principle of saturation.
Common orthodontic questions (45 questions, 6 categories) and one prompt.
Two researchers independently collated the questions, merging duplicates and splitting multi-part items. Disagreements were resolved by consulting a third researcher. The 117 questions were collated into 45 common orthodontic questions (S3 Appendix). To ensure content validity and relevance, four orthodontic experts reviewed the final 45 questions, achieving good inter-rater agreement (ICC = 0.80) and excellent overall content validity (S-CVI/Ave = 0.93). Based on literature search and expert consultation, the questions were categorized into six clinical consultation categories: indications, therapeutic outcomes, appliance selection, cost, tooth extraction, and maintenance.
Through interviews with an NLP expert (specializing in LLMs) and an orthodontic expert, and with reference to [29,30], we optimized the response length limit (approximately 125 words), refined the 45 questions, and determined one prompt in RTF form (Role: “You are an orthodontic specialist.”; Task: “Please provide an accurate and comprehensible answer to the following patient question: Under what circumstances is orthodontic treatment necessary?” (Question 1); Format: “Answer in Chinese within approximately 125 words.”). All AI-generated responses were reviewed by four orthodontic experts under researcher supervision to ensure factual accuracy and compliance with AI research ethics principles.
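To illustrate how such a standardized RTF prompt can be assembled for each of the 45 questions, a minimal R sketch is given below; only the Role/Task/Format wording comes from the prompt described above, while the question vector, function name, and output file are hypothetical placeholders.

```r
# Minimal sketch: assembling the standardized RTF prompt for each question.
# The Role, Task, and Format strings follow the wording reported above;
# the `questions` vector and output file are illustrative placeholders.
questions <- c(
  "Under what circumstances is orthodontic treatment necessary?"  # Question 1; remaining 44 omitted
)

build_prompt <- function(question) {
  paste0(
    "Role: You are an orthodontic specialist. ",
    "Task: Please provide an accurate and comprehensible answer to the following patient question: ",
    question, " ",
    "Format: Answer in Chinese within approximately 125 words."
  )
}

prompts <- vapply(questions, build_prompt, character(1))
writeLines(prompts, "rtf_prompts.txt")  # each prompt was pasted into a fresh chat session
```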
Responses from 5 LLMs and 3 search engines to the 45 common orthodontic questions
Responses were collected on December 22, 2024, within a single day to ensure consistency and reduce the potential impact of subsequent system updates. We selected the five most popular LLMs and the top three search engines supporting Chinese-language interaction. All 45 questions were formulated in Chinese and presented to five LLMs—GPT-4o [A] (OpenAI; released May 2024; API version 2024-11-20), GPT-4o mini [B] (OpenAI; released July 2024; API version 2024-11-20), Claude 3.5 Sonnet [C] (Anthropic; model version claude-3-5-sonnet-20241022; released October 22, 2024), Kimi AI [D] (Moonshot AI; accessed December 2024), and ERNIE Bot [E] (Baidu; version 4.0; accessed December 2024)—and three mainstream search engines—Google [F], Microsoft Bing [G], and Baidu [H] (all accessed December 22, 2024). Each question was submitted in a new LLM chat session with the standardized prompt, while the top-ranked search engine result was used for comparison. The response content table (S4 Appendix) was obtained. All original Chinese materials were translated by two bilingual domain experts with clinical and linguistic backgrounds, following a forward-backward translation and reconciliation process commonly adopted in international scale adaptation studies (S1 Appendix, S3 Appendix–S7 Appendix). No major discrepancies were identified between the forward and backward translations.
Self-designed multidimensional evaluation questionnaire
A multidimensional evaluation questionnaire was developed based on existing tools, literature review, and expert input. The 4 primary indicators and 11 items were quality (medical accuracy, completeness, focus, and overall quality scores), empathy (cognitive empathy, emotional empathy, and overall empathy scores), readability (specialized vocabulary, logical clarity, and overall readability score), and satisfaction (overall satisfaction score). Cognitive empathy appears in phrases such as “This may be what you’re trying to understand”, while emotional empathy appears in phrases like “I understand this may worry you”. Detailed item definitions are provided in S5 Appendix. A 5-point Likert scale was used for quality, empathy, and readability, and a 10-point Likert scale was used for satisfaction (S6 Appendix).
To ensure the reliability and validity of the self-designed multidimensional evaluation questionnaire, four orthodontic experts evaluated item relevance, clarity, and comprehensiveness (average CVI = 0.92). A pilot test of 20 responses was conducted, and internal consistency was assessed using Cronbach’s α after rescaling all 5-point and 10-point items to a common 5-point metric by dividing 10-point scores by two (e.g., 8 became 4). The questionnaire showed excellent overall reliability (α = 0.95), with subscale α values of 0.86 for quality, 0.86 for empathy, 0.79 for readability, and 0.64 for satisfaction.
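As a minimal illustration of this rescaling and reliability check, assuming the 20 pilot responses are stored one per row with the 11 items as columns (variable and file names below are hypothetical), Cronbach's α could be computed in R as follows:

```r
# Minimal sketch of the rescaling and Cronbach's alpha computation.
# Assumes `pilot` holds 20 pilot responses (rows) x 11 items (columns), with the
# satisfaction item on a 10-point scale; column and file names are illustrative.
library(psych)

pilot <- read.csv("pilot_ratings.csv")

# Rescale the 10-point satisfaction item to the common 5-point metric (e.g., 8 -> 4).
pilot$overall_satisfaction <- pilot$overall_satisfaction / 2

# Overall internal consistency across all 11 items.
psych::alpha(pilot)$total$raw_alpha

# Subscale reliability, e.g., the four quality items.
quality_items <- pilot[, c("medical_accuracy", "completeness", "focus", "overall_quality")]
psych::alpha(quality_items)$total$raw_alpha
```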
Expert-evaluated scores
Four experts (basic information in S2 Appendix) were selected from the Department of Dentistry (each with >10 years of clinical experience in orthodontics) to evaluate responses generated by the 5 LLMs and 3 search engines across the 11 predefined evaluation indicators. Each indicator was independently rated by the 4 evaluators, and the ratings were averaged to determine a consensus score. All evaluators were blinded to the identity of the models during scoring to minimize potential bias.
Statistical analyses
The intraclass correlation coefficient (ICC) was calculated using a two-way mixed-effects model [ICC(C,1)] to assess overall inter-rater consistency across all evaluation items. Consensus scores for each item were computed as the mean of all raters’ scores. The strength of agreement was interpreted as poor (<0.50), moderate (0.50–0.75), good (0.75–0.90), or excellent (>0.90), following Koo and Li [31]. Continuous variables were summarized as median (interquartile range, IQR), and nonparametric tests were employed because the assumptions of normality and homogeneity of variance were violated. The Kruskal-Wallis H test (two-sided) was applied to assess multi-group differences in indicator scores among the five LLMs and among the three search engines. To verify robustness, Friedman tests using question ID as a repeated-measures factor were additionally performed for the LLMs (A–E) and the search engines (F–H). For statistically significant findings (p < 0.05), post hoc multiple comparisons were conducted using the Mann-Whitney U test (two-sided) with Bonferroni correction (adjusted p < 0.05 considered significant). The Mann-Whitney U test was also used to assess overall differences between LLMs and search engines on the 11 indicators and the question categories, as well as on the scores for each question. All statistical analyses were performed using R software (version 4.4.2), and the heatmap was generated using Manus.
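A minimal R sketch of this analysis pipeline is shown below, assuming long-format data with one row per rater, source, question, and indicator score; variable and file names are hypothetical, and the code is illustrative rather than the authors' exact script.

```r
# Illustrative sketch of the analysis pipeline; data layout and names are assumed.
# `scores` is long-format: rater, response_id, source (A-H), group (LLM / search engine),
# question_id, and one indicator score per analysis (here, overall_quality).
library(irr)

scores <- read.csv("expert_scores.csv")

# Inter-rater consistency: two-way mixed-effects, consistency, single rater -> ICC(C,1).
wide <- reshape(scores[, c("rater", "response_id", "overall_quality")],
                idvar = "response_id", timevar = "rater", direction = "wide")
icc(wide[, -1], model = "twoway", type = "consistency", unit = "single")

# Consensus score per response = mean of the four raters' scores.
consensus <- aggregate(overall_quality ~ source + group + question_id,
                       data = scores, FUN = mean)

# Multi-group comparison among the five LLMs (and analogously among the three search engines).
llm <- subset(consensus, group == "LLM")
kruskal.test(overall_quality ~ source, data = llm)

# Robustness check: Friedman test with question ID as the repeated-measures factor.
friedman.test(overall_quality ~ source | question_id, data = llm)

# Post hoc pairwise Mann-Whitney U tests with Bonferroni correction.
pairwise.wilcox.test(llm$overall_quality, llm$source,
                     p.adjust.method = "bonferroni", exact = FALSE)

# Overall LLM vs. search engine comparison on the same indicator.
wilcox.test(overall_quality ~ group, data = consensus, exact = FALSE)
```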
Temporal reproducibility and dynamic model behavior
Because LLMs and search engines are continuously updated, their outputs may vary over time. To reduce temporal variability, all responses were collected on a single day. However, the findings reflect model performance only at that time point (December 22, 2024), and reproducibility beyond this date may not be guaranteed. To support future replication, the full anonymized dataset is archived in Supplementary File S14 Appendix, and re-evaluation after major model updates is recommended.
Results
ICC was calculated using a two-way mixed model and indicated moderate inter-expert agreement (ICC = 0.68, 95% CI [0.67, 0.69]), a level generally considered acceptable for studies involving subjective expert ratings.
Across the four dimensions (quality, empathy, readability, and satisfaction), GPT-4o [A] and GPT-4o mini [B] consistently outperformed other models, while Kimi AI, ERNIE Bot, and Google ranked lowest overall. The relative rankings across dimensions remained stable, highlighting consistent performance patterns among the models (Fig 2A–D and S8 Appendix).
The LLMs scored significantly higher than the search engines in quality, empathy, readability, and satisfaction (all p < 0.001) (S9 Appendix). The best-performing LLM, GPT-4o [A], achieved the highest median scores with relatively narrow IQRs compared with other LLMs in overall measures of quality (median: 4.25, IQR: 4.00–4.25), empathy (median: 4.25, IQR: 4.00–4.25), readability (median: 4.25, IQR: 4.00–4.50), and satisfaction (median: 8.75, IQR: 8.50–9.00) (Table 1, Bonferroni-adjusted p-values in S10 Appendix). Among LLMs, effect sizes were in the large range (η² ≈ 0.40–0.57) across all indicators.
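For reference, a common way to obtain η² from the Kruskal-Wallis H statistic is η² = (H - k + 1)/(n - k), where k is the number of groups and n the total number of observations; the short R sketch below assumes this estimator and reuses the hypothetical `llm` data frame from the sketch in the Methods section.

```r
# Eta-squared from the Kruskal-Wallis H statistic (a common estimator, assumed here):
# eta2 = (H - k + 1) / (n - k), with k groups and n total observations.
kw   <- kruskal.test(overall_quality ~ source, data = llm)
k    <- length(unique(llm$source))   # 5 LLMs
n    <- nrow(llm)                    # 5 models x 45 questions = 225 consensus scores
eta2 <- (unname(kw$statistic) - k + 1) / (n - k)
eta2
```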
As shown in Table 2, LLMs significantly outperformed search engines (p < 0.05) in most dimensions for “Cost” questions and in medical accuracy, completeness, focus, readability, and satisfaction for the “Appliance Selection” category. For the “Therapeutic Outcomes” category, LLMs also showed significant advantages in medical accuracy, quality, and satisfaction (p < 0.05). No significant differences were observed in other categories.
As shown in Fig 3A, among the 45 questions addressed by LLMs, Question 17 (“Does orthodontic treatment cause dental caries?”) achieved the highest score of 4.12. The second highest-scoring question was Question 25 (“What types of orthodontic appliances are available?”) with 4.15, followed by Question 23 (“What preparations are needed before starting orthodontic treatment?”) scoring 4.10. Compared with search engine responses, the differences in scores for these questions were all statistically significant (p < 0.01).
As shown in Fig 3B, among the 45 search engine responses, the highest score (4.29) was achieved on Question 16 (“How can orthodontic relapse be prevented after treatment?”), on which search engines outscored the LLMs. Question 36 (“How should oral hygiene be maintained during orthodontic treatment?”) ranked second with 4.00. The lowest score, 2.90, was observed for Question 45 (“Can orthodontic treatment cause speech difficulties?”), and Question 27 (“What is the cost of orthodontic treatment?”) received the second-lowest score of 3.00. Questions 27–30, which fall under the cost category, all received significantly lower scores than the corresponding LLM responses (p < 0.01).
Discussion
This study systematically compiled 45 common orthodontic questions across six categories through literature search, online forums, and semi-structured interviews. Utilizing a self-designed multidimensional evaluation questionnaire (quality, empathy, readability, satisfaction), we comparatively analyzed responses from 5 LLMs and 3 search engines.
The high expert ratings obtained by GPT-4o and GPT-4o mini in this study indicate their potential for generating high-quality responses to patient education questions. This aligns with prior research in which GPT-4o outperformed ERNIE Bot and other models in answering health questions [32]. The results suggest that LLMs outperform search engines in response quality, empathy, readability, and satisfaction. Prior research found that LLMs achieved significantly higher scores in comprehensiveness and quality than search engines (p < 0.05) when evaluated on patient queries about common chronic conditions [33]. LLMs have also outperformed human physicians on critical care questions, achieving accuracy scores as high as 93.3%, compared with an average physician score of 61.9% (p < 0.001) [34]. The higher expert ratings received by GPT-4o suggest its potential to provide comparatively higher-quality responses than the other LLMs evaluated in this study. In nutrition-related tasks, GPT-4o achieved an accuracy of 94.5%, outperforming Claude 3.5 Sonnet (92.7%) [35]. In radiology report evaluations, GPT-4o showed significantly higher agreement with expert radiologists than other LLMs (κ = 0.72 vs. κ = 0.15; p < 0.05) [36].
In contrast to traditional retrieval-based search engines, LLMs are generative, allowing them to synthesize information and produce logically consistent and highly readable responses [37]. The deep learning architecture of LLMs supports contextual understanding and empathetic expression, and these models are typically trained on vast corpora of human-generated text that include clinical guidelines [38], which may explain their relatively strong performance in providing accurate and complete responses. Observed performance differences between models may stem from differences in training datasets (including size, diversity, and specificity) as well as differences in algorithm design [39]. The superior performance of GPT-4o could be attributed to its end-to-end multimodal architecture [40], which seamlessly integrates text, image, and audio inputs within a unified Transformer framework, enhancing cross-modal alignment to generate more accurate, complete, and readable responses.
The expert-rated results indicate that the LLMs evaluated in this study have the potential to generate higher-quality responses for analytical reasoning questions (appliance selection and therapeutic outcomes) and for several cost-related questions. A benchmarking study [41] reported that LLMs achieved 100% appropriateness in responding to critical migraine treatment-related questions (e.g., “How is migraine treated?”, “I have a migraine, what will happen if I don’t treat it?”), demonstrating high accuracy in delivering guideline-aligned therapeutic information. According to research [42] published in Mayo Clinic Proceedings: Digital Health in 2024, LLMs showed high appropriateness compared with an information website when answering questions in the category of ophthalmic appliance selection (e.g., “How should I decide which intraocular lens to choose?”). The significantly lower performance of search engines relative to LLMs on cost-related orthodontic questions may be attributed to their inability to model interdependent variables, which transformer-based LLMs effectively link through a non-linear attention mechanism [43]. Unlike the fragmented outputs of search engines, LLMs appear capable of integrating treatment time, aligner type, and geographic pricing to generate more realistic cost estimates.
Incorporating principles from Self-Determination Theory, LLMs could potentially support patients in setting personalized health goals and providing stage-based guidance to enhance autonomy and motivation [44]. Future health educators may increasingly shift their core responsibilities to “human-machine collaboration”, guiding patients from passively receiving information to actively using LLMs for self-management.
During expert review, occasional factual inaccuracies were identified, typically involving minor explanatory details, and occurred in fewer than 4% of the evaluated responses. Although LLM responses show high readability and apparent accuracy, they may still contain factual or contextual inaccuracies [45]. Professional oversight remains essential to ensure the reliability of health information.
The limitations of this study include the lack of patient raters to evaluate response empathy and satisfaction; reliance on expert-only validation may introduce scoring bias in these indicators. Future studies should integrate patient-reported evaluations to better capture subjective experiences and preferences, which are critical for evaluating health education tools in real-world contexts. Our study showed moderate inter-rater consistency, reflecting variability in subjective expert evaluations. Despite blinding and standardized scoring, future studies should involve more diverse raters to enhance reliability. The rapid evolution of LLMs, with data collected in December 2024, limits temporal validity. Future replication using updated, version-controlled APIs is needed to verify model stability. A combined API-UI approach may better balance reproducibility with ecological validity in subsequent studies.
Conclusion
In this cross-sectional study, LLMs, particularly GPT-4o, demonstrated superior performance compared to search engines in expert evaluations, suggesting potential usefulness for orthodontic patient education and self-management.
Supporting information
S2 Appendix. Basic Information for Clients and Experts.
https://doi.org/10.1371/journal.pone.0339908.s002
(PDF)
S5 Appendix. Evaluation Indicators and Criteria.
https://doi.org/10.1371/journal.pone.0339908.s005
(PDF)
S6 Appendix. Self-designed multidimensional evaluation questionnaire.
https://doi.org/10.1371/journal.pone.0339908.s006
(PDF)
S7 Appendix. Original Chinese-language model outputs generated by five LLMs and three search engines.
https://doi.org/10.1371/journal.pone.0339908.s007
(PDF)
S8 Appendix. Medical accuracy (A), completeness (B), focus (C), emotional empathy (D), cognitive empathy (E), specialized vocabulary (F), and logical clarity (G) of LLM and search engine responses to questions.
A indicates GPT-4o; B, GPT-4o mini; C, Claude 3.5 Sonnet; D, Kimi AI; E, ERNIE Bot; F, Google; G, Microsoft Bing; and H, Baidu. The midline indicates the median (50th percentile); the box, the 25th and 75th percentiles; the whiskers, the 5th and 95th percentiles; and the density distribution plot represents the probability density of the response score distribution.
https://doi.org/10.1371/journal.pone.0339908.s008
(PDF)
S9 Appendix. Quality, Empathy, Readability, and Satisfaction Scores of LLMs and search engine responses to questions.
p-value: conducting statistical significance tests on the score differences between LLMs and search engines.
https://doi.org/10.1371/journal.pone.0339908.s009
(PDF)
S10 Appendix. Post hoc pairwise comparisons with Bonferroni correction following Kruskal–Wallis tests.
https://doi.org/10.1371/journal.pone.0339908.s010
(XLSX)
S11 Appendix. Effect sizes for Kruskal-Wallis and Mann-Whitney U tests.
https://doi.org/10.1371/journal.pone.0339908.s011
(XLSX)
S13 Appendix. Numerical data underlying Fig 3, including all heatmap cell values used for figure visualization.
https://doi.org/10.1371/journal.pone.0339908.s013
(XLSX)
S14 Appendix. The minimal anonymized data set.
https://doi.org/10.1371/journal.pone.0339908.s014
(XLSX)
Acknowledgments
Thank you to all the experts who participated in this research, and all the students who have offered help.
References
- 1. Simonsmeier BA, Flaig M, Simacek T, Schneider M. What sixty years of research says about the effectiveness of patient education on health: a second order meta-analysis. Health Psychol Rev. 2022;16(3):450–74. pmid:34384337
- 2. Leppin AL, Gionfriddo MR, Kessler M, Brito JP, Mair FS, Gallacher K, et al. Preventing 30-day hospital readmissions: a systematic review and meta-analysis of randomized trials. JAMA Intern Med. 2014;174(7):1095–107. pmid:24820131
- 3. Zhao Q, Chen C, Zhang J, Ye Y, Fan X. Effects of self-management interventions on heart failure: Systematic review and meta-analysis of randomized controlled trials. Int J Nurs Stud. 2020;110:103689. pmid:32679402
- 4. Barry A, Shahbaz A. The challenges and opportunities clinical education in the context of psychological, educational and therapeutic dimensions in teaching hospital. BMC Med Educ. 2025;25(1):154. pmid:39885476
- 5. Hakim A. Investigating the challenges of clinical education from the viewpoint of nursing educators and students: A cross-sectional study. SAGE Open Med. 2023;11:20503121221143578. pmid:36760513
- 6. Berkman ND, Sheridan SL, Donahue KE, Halpern DJ, Crotty K. Low health literacy and health outcomes: an updated systematic review. Ann Intern Med. 2011;155(2):97–107. pmid:21768583
- 7. Soldaini L. The Knowledge and Language Gap in Medical Information Seeking. SIGIR Forum. 2019;52(2):178–9.
- 8. Allam A, Schulz PJ, Nakamoto K. The impact of search engine selection and sorting criteria on vaccination beliefs and attitudes: two experiments manipulating Google output. J Med Internet Res. 2014;16(4):e100. pmid:24694866
- 9. Jia X, Pang Y, Liu LS. Online Health Information Seeking Behavior: A Systematic Review. Healthcare (Basel). 2021;9(12):1740. pmid:34946466
- 10. Mian SM, Khan MS, Shawez M, Kaur A. Artificial Intelligence (AI), Machine Learning (ML) & Deep Learning (DL): A comprehensive overview on techniques, applications and research directions. In: 2024 2nd International Conference on Sustainable Computing and Smart Systems (ICSCSS); 2024; Coimbatore, India. p. 1404–9.
- 11. Zhang C, Lu Y. Study on artificial intelligence: The state of the art and future prospects. Journal of Industrial Information Integration. 2021;23:100224.
- 12. Orrall A, Rekito A. Poll: Trust in AI for Accurate Health Information Is Low. JAMA. 2025;333(16):1383–4. pmid:40116838
- 13. Mitchell M, Krakauer DC. The debate over understanding in AI’s large language models. Proc Natl Acad Sci U S A. 2023;120(13):e2215907120. pmid:36943882
- 14. Haupt CE, Marks M. AI-Generated medical advice-GPT and beyond. JAMA. 2023;329(16):1349–50. pmid:36972070
- 15. Shibue K. Artificial intelligence and machine learning in clinical medicine. N Engl J Med. 2023;388(25):2397–9.
- 16. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40. pmid:37460753
- 17. Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, et al. The application of large language models in medicine: A scoping review. iScience. 2024;27(5):109713. pmid:38746668
- 18. Jimenez G, Lum E, Car J. Examining diabetes management apps recommended from a google search: content analysis. JMIR Mhealth Uhealth. 2019;7(1):e11848. pmid:30303485
- 19. Ma X, Zhu R, Wang Z, Xiong J, Chen Q, Tang H, et al. Enhancing patient-centric communication: Leveraging LLMs to simulate patient perspectives. arXiv. 2025.
- 20. Hao Y, Holmes J, Waddle M, Yu N, Vickers K, Preston H. Outlining the borders for LLM applications in patient education: developing an expert-in-the-loop LLM-powered chatbot for prostate cancer patient education. arXiv. 2024. 19100.
- 21. Zhang Z, Hao Y, Li X, Luo Y, Xie L, Sun J. EHRTutor: Enhancing patient understanding of discharge instructions. 2023. https://arxiv.org/abs/2310.19212
- 22. Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne). 2024;11:1477898. pmid:39534227
- 23. Lombardo G, Vena F, Negri P, Pagano S, Barilotti C, Paglia L, et al. Worldwide prevalence of malocclusion in the different stages of dentition: A systematic review and meta-analysis. Eur J Paediatr Dent. 2020;21(2):115–22. pmid:32567942
- 24. Batista KB, Thiruvenkatachari B, Harrison JE, O’Brien KD. Orthodontic treatment for prominent upper front teeth (Class II malocclusion) in children and adolescents. Cochrane Database Syst Rev. 2018;3(3):CD003452. pmid:29534303
- 25. Zhou X, Chen S, Zhou C, Jin Z, He H, Bai Y, et al. Expert consensus on early orthodontic treatment of class III malocclusion. Int J Oral Sci. 2025;17(1):20. pmid:40164594
- 26. Dehghani M, Fazeli F, Sattarzadeh AP. Efficiency and duration of orthodontic/orthognathic surgery treatment. J Craniofac Surg. 2017;28(8):1997–2000. pmid:28968333
- 27. Zhao J, Cao A, Xie L, Shao L. Knowledge, attitude, and practice toward oral health management among orthodontic patients: a cross-sectional study. BMC Oral Health. 2024;24(1):1500. pmid:39695598
- 28. Wang Y, Long H, Zhao Z, Bai D, Han X, Wang J, et al. Expert consensus on the clinical strategies for orthodontic treatment with clear aligners. Int J Oral Sci. 2025;17(1):19. pmid:40074738
- 29. Giray L. Prompt engineering with chatgpt: a guide for academic writers. Ann Biomed Eng. 2023;51(12):2629–33. pmid:37284994
- 30. Chen D, Parsa R, Hope A, Hannon B, Mak E, Eng L, et al. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions From Social Media. JAMA Oncol. 2024;10(7):956–60. pmid:38753317
- 31. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63. pmid:27330520
- 32. Wang W, Fu J, Zhang Y, Hu K. A comparative analysis of GPT-4o and ERNIE Bot in a Chinese radiation oncology exam. J Cancer Educ. 2025. doi:10.1007/s13187-025-02652-9. pmid:40418520
- 33. Rao A, Mu A, Enichen E, Gupta D, Hall N, Koranteng E, et al. A future of self-directed patient internet research: large language model-based tools versus standard search engines. Ann Biomed Eng. 2025;53(5):1199–208. pmid:40025252
- 34. Workum JD, Volkers BWS, van de Sande D, Arora S, Goeijenbier M, Gommers D, et al. Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study. Crit Care. 2025;29(1):72. pmid:39930514
- 35. Azimi I, Qi M, Wang L, Rahmani AM, Li Y. Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval. Sci Rep. 2025;15(1):1506. pmid:39789057
- 36. Atsukawa N, Tatekawa H, Oura T, Matsushita S, Horiuchi D, Takita H, et al. Evaluation of radiology residents’ reporting skills using large language models: an observational study. Jpn J Radiol. 2025;43(7):1204–12. pmid:40056344
- 37. Bakhshandeh S. Benchmarking medical large language models. Nat Rev Bioeng. 2023;1(8):543.
- 38. Oniani D, Wu X, Visweswaran S, Kapoor S, Kooragayalu S, Polanska K, et al. Enhancing Large Language Models for Clinical Decision Support by Incorporating Clinical Practice Guidelines. Proc (IEEE Int Conf Healthc Inform). 2024;2024:694–702. pmid:40092288
- 39. Silhadi M, Nassrallah WB, Mikhail D, Milad D, Harissi-Dagher M. Assessing the performance of Microsoft Copilot, GPT-4 and Google Gemini in ophthalmology. Can J Ophthalmol. 2025;60(4):e507–14. pmid:39863285
- 40. Mao Y, Xu N, Wu Y, Wang L, Wang H, He Q, et al. Assessments of lung nodules by an artificial intelligence chatbot using longitudinal CT images. Cell Rep Med. 2025;6(3):101988. pmid:40043704
- 41. Li L, Li P, Wang K, Zhang L, Ji H, Zhao H. Benchmarking state-of-the-art large language models for migraine patient education: performance comparison of responses to common queries. J Med Internet Res. 2024;26:e55927. pmid:38828692
- 42. Tailor PD, Xu TT, Fortes BH, Iezzi R, Olsen TW, Starr MR, et al. Appropriateness of ophthalmology recommendations from an online chat-based artificial intelligence model. Mayo Clin Proc Digit Health. 2024;2(1):119–28. pmid:38577703
- 43. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. In: 34th Conference on Neural Information Processing Systems (NeurIPS); 2020 Dec 6–12. arXiv [Preprint]. 2020. arXiv:2005.14165v4. Accessed January 20, 2025.
- 44. Deci EL, Ryan RM. The “What” and “Why” of goal pursuits: human needs and the self-determination of behavior. Psychological Inquiry. 2000;11(4):227–68.
- 45. Liu A, Sheng Q, Hu X. Preventing and detecting misinformation generated by large language models. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. p. 3001–4.