Abstract
Large language models (LLMs) such as ChatGPT are widely available to any dental professional. However, there is limited evidence on the reliability and reproducibility of ChatGPT-4 in relation to implant-supported prostheses, or on the impact of prompt design on its responses, which constrains understanding of its application in this specific area of dentistry. The purpose of this study was to evaluate the performance of ChatGPT-4 in generating answers about implant-supported prostheses using different prompts. Thirty questions on implant-supported and implant-retained prostheses were posed, with 30 answers generated per question under each prompt type (general and specific), totaling 1800 answers. Experts assessed reliability (agreement with expert grading) and repeatability (response consistency) using a 3-point Likert scale. General prompts achieved 70.89% reliability, with repeatability ranging from moderate to almost perfect. Specific prompts performed better, with 78.8% reliability and substantial to almost perfect repeatability; the specific prompt significantly improved reliability compared with the general prompt. Despite these promising results, ChatGPT’s ability to generate reliable answers on implant-supported prostheses remains limited, highlighting the need for professional oversight. The use of a specific prompt might improve the answer-generation performance of ChatGPT.
Citation: Freire Y, Santamaría Laorden A, Orejas Pérez J, Ortiz Collado I, Gómez Sánchez M, Thuissard Vasallo IJ, et al. (2025) Evaluating the influence of prompt formulation on the reliability and repeatability of ChatGPT in implant-supported prostheses. PLoS One 20(5): e0323086. https://doi.org/10.1371/journal.pone.0323086
Editor: Jafar Kolahi, Dental Hypothesis, IRAN, ISLAMIC REPUBLIC OF
Received: February 4, 2025; Accepted: April 2, 2025; Published: May 30, 2025
Copyright: © 2025 Freire et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Relevant data used for this study are publicly available in the Open Science Framework (OSF) repository under the following DOI: 10.17605/OSF.IO/Y4E9B.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Large Language Models (LLMs) are a category of artificial intelligence (AI) specifically designed to emulate human language processing capabilities [1]. These models have been developed through extensive training on massive databases. They are characterized by their strong ability to contextualize and interpret human language [2], which allows them to generate human-like responses [3].
In the field of LLMs, the Generative Pre-trained Transformer (GPT) series from OpenAI [4] has gained prominence as one of the most advanced implementations. ChatGPT-3.5 was introduced in November 2022 [5] and ChatGPT-4 in March 2023 [6]. Currently, ChatGPT has positioned itself as one of the most comprehensive and publicly accessible language models [7], widely used by millions of users for various purposes [4], including the search for health-related information [2]. In dentistry, several studies have evaluated the performance of ChatGPT in areas such as Endodontics [8,9], Oral and Maxillofacial Surgery [10,11], Periodontics [12,13], Orthodontics [14–16], and Prosthodontics [17]. In prosthodontics, ChatGPT was found to have limited ability to generate answers related to removable dental prostheses and tooth-supported fixed dental prostheses [17]. However, to date, evidence on its performance in implant-supported prostheses is limited. This area is particularly relevant due to the complexity of implant prosthodontics, which requires the integration of biomechanical principles, prosthetic design, and peri-implant health considerations [18–21]. Unlike other fields of dentistry, where treatment protocols may be more standardized, implant-supported restorations demand highly individualized decision-making based on patient-specific factors [22]. Given that AI tools like ChatGPT are increasingly used for educational and clinical support, assessing their reliability in this domain is essential to determine their potential usefulness and limitations.
In addition, when using these models, possible response bias should be considered. Among the most significant biases is the possibility of producing meaningless content [23] by presenting incorrect information as if it were accurate [24]. The phenomenon where the model generates answers that appear reliable but lack substance or relevance has been described as artificial hallucination [25]. Another possible bias could be related to the use of prompts, as these models have the ability to capture the nuances and complexities of human language through input prompts [7]. Therefore, the generation of answers by ChatGPT depends on prompts entered by users [4]. In this context, prompt engineering is becoming increasingly important, focusing on the design, improvement, and implementation of these prompts to guide the results of LLMs towards concrete answers, thus optimising the interaction with artificial intelligence systems. However, there is a lack of studies analysing the influence of prompts on the answers obtained [26].
Therefore, given the lack of studies on implant-supported prostheses, it is important to evaluate the performance of ChatGPT to determine its reliability and reproducibility, as well as how this performance depends on the type of prompt used, to provide critical insight into its potential use.
Thus, the aim of this study was to analyze the reliability and repeatability of ChatGPT-4 answer generation to specific implant-supported and implant-retained prostheses questions, and to compare different prompts in generating the answers.
The research hypothesis was that the implant prosthetics answers provided by ChatGPT-4 would not exhibit reliability and repeatability, and that there would be no significant differences between the use of different prompts.
Materials and methods
This research adhered to the Declaration of Helsinki and did not require ethical approval as it did not involve human participants.
The methodology of this study was based on previous studies published in the literature [8,10,17], adapted to the objectives of this research. The STAGER: Standardized Testing and Assessment Guidelines for Evaluating Generative Artificial Intelligence Reliability [27] and the TRIPOD [28] checklists were used to guide the reporting of this study (S1 and S2 Appendix).
Two authors (A.S., Y.F.) with experience in the design of questions for answer generation in ChatGPT-4 [8,10,17], developed an initial set of 60 questions related to implant-supported and implant-retained prostheses. The questions were designed based on clinical practice guidelines, specifically The Proceedings of the Seventh ITI Consensus Conference of the International Team for Implantology (ITI) [29]. These guidelines were selected as a reference to ensure that the questions covered key aspects of implant-supported prostheses in a structured and evidence-based manner.
An expert judgement approach was used to assess the clarity and clinical relevance of the candidate questions. These questions were independently evaluated by 2 prosthodontic graduate program faculty members (I.O.C., J.O.P.) for clarity, relevance, and inclusion of key concepts, using a 3-point Likert scale (0 = disagree; 1 = neutral; 2 = agree). To minimize selection bias, the evaluation was blinded to the study objectives. Discrepancies in this evaluation were reviewed by a third prosthodontic graduate program faculty member (A.S.L.). Based on the evaluation scores, the 30 highest-rated questions were selected to ensure a representative and unbiased assessment of ChatGPT’s performance.
To compare the performance of ChatGPT-4 according to the type of prompt, 2 question formats were designed, one general and one specific. The general prompt consisted of a question with no additional instructions [9,11]. The specific prompt was more detailed and direct, aiming to guide ChatGPT towards more relevant answers [10]. Accordingly, ChatGPT was instructed to assume the role of a prosthodontist addressing a general dentist, and to answer the questions accurately and directly, without digressions or creative answers. The selected prompt was ‘Imagine that you are a prosthodontist and I am a general dentist. Please answer the following question accurately and directly, without rambling or creative answers.’
Two authors (M.G.S., V.DF.G.), independently and using 2 different ChatGPT-4 Plus accounts, entered the 30 previously selected questions using the 2 types of prompts (Tables 1 and 2). As a result, 60 answers were generated per round, 30 for each type of prompt. To assess repeatability, 30 answers were obtained for each question and prompt. This process was repeated 3 times during the day (morning, afternoon, and evening) in March 2024, using the “new chat” option for each question to reduce memory-retention bias [30,31].
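The generation protocol described above can be sketched in code. The following is a minimal illustration, assuming the OpenAI Python SDK (v1); the helper names (`build_messages`, `collect_answers`) are hypothetical and do not correspond to the authors’ actual scripts.

```python
# Sketch of the answer-generation protocol: 30 questions x 2 prompt types
# x 30 repetitions = 1800 answers, each question asked in a fresh chat to
# avoid memory-retention bias. Helper names are illustrative (hypothetical).

SPECIFIC_PREAMBLE = (
    "Imagine that you are a prosthodontist and I am a general dentist. "
    "Please answer the following question accurately and directly, "
    "without rambling or creative answers."
)

def build_messages(question: str, prompt_type: str) -> list[dict]:
    """Return the chat messages for one question under a given prompt type."""
    if prompt_type == "general":
        # General prompt: the bare question, no additional instructions.
        return [{"role": "user", "content": question}]
    # Specific prompt: role, audience, and answer-style instructions prepended.
    return [{"role": "user", "content": f"{SPECIFIC_PREAMBLE}\n\n{question}"}]

def collect_answers(client, questions: list[str], repetitions: int = 30) -> list[dict]:
    """Ask every question under both prompt types, `repetitions` times each.

    Each call starts a new conversation (no shared history), mirroring the
    study's use of the 'new chat' option for every question.
    """
    records = []
    for question in questions:
        for prompt_type in ("general", "specific"):
            for rep in range(repetitions):
                resp = client.chat.completions.create(
                    model="gpt-4",
                    messages=build_messages(question, prompt_type),
                )
                records.append({
                    "question": question,
                    "prompt_type": prompt_type,
                    "repetition": rep,
                    "answer": resp.choices[0].message.content,
                })
    return records

# Expected volume: 30 questions x 2 prompt types x 30 repetitions = 1800 answers.
```

Starting a new chat per call is the programmatic analogue of the “new chat” option in the web interface: no prior turns are included in the request, so earlier answers cannot bias later ones.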
The 1800 answers generated by ChatGPT-4 were independently evaluated by 2 prosthodontic graduate program faculty members (expert 1, J.O.P.; expert 2, I.O.C) who were blinded to the study objectives. A 3-point Likert scale was used for assessment (Table 3). Discrepancies in the evaluation were resolved by a third prosthodontic graduate program faculty member (expert 3, A.S.L.). The experts had 3 years of experience in natural language processing and artificial intelligence in this field.
All data obtained were recorded in an Excel spreadsheet (Excel version 16; Microsoft Corp). The STATA statistical software program (STATA version BE 14; StataCorp) was used to analyze the data. The absolute frequency (n) and relative frequency (%) of the generated answers were calculated, categorized according to the gradings given by the experts (0 = incorrect; 1 = incomplete or partially correct; 2 = correct). The consistency of each expert and the level of agreement between the experts’ gradings were assessed for the entire set of answers generated by ChatGPT-4.
To evaluate the performance of ChatGPT-4 in generating answers on implant-supported prostheses, reliability and repeatability were examined for each prompt used (general or specific). Reliability was calculated as the proportion of questions yielding an answer with a grade of 2 (correct), along with its 95% confidence interval (Wald binomial method). This calculation was performed for the total set of answers as well as for each individual question. The difference in reliability between the general and specific prompts was examined using the chi-square test, and Cramer’s V effect size was calculated. Repeatability was examined using concordance analysis weighted by ordinal categories and multiple repetitions, with 95% confidence intervals (percent agreement, Brennan and Prediger coefficient, Conger’s generalized Cohen kappa, Fleiss kappa, Gwet’s AC, and Krippendorff alpha). According to the benchmark scale proposed by Gwet [32], the estimated coefficients were classified as follows: < 0.0 Poor; 0.0–0.2 Slight; 0.2–0.4 Fair; 0.4–0.6 Moderate; 0.6–0.8 Substantial; 0.8–1.0 Almost Perfect. The difference in repeatability between the general and specific prompts was analyzed by examining the overlap of the 95% confidence intervals for the coefficients.
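The reliability analysis above (Wald binomial confidence interval, chi-square test, and Cramer’s V) can be sketched as follows. The counts used are illustrative values chosen to be consistent with the reported percentages (roughly 638/900 and 709/900 correct answers per prompt), not the authors’ raw data.

```python
# Sketch of the reliability analysis: Wald 95% CI for the proportion of
# correct (grade 2) answers, plus a chi-square test and Cramer's V effect
# size comparing the two prompts. Counts are illustrative, chosen to match
# the reported percentages (70.89% and 78.8% of 900 answers per prompt).
import math
from scipy.stats import chi2_contingency

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and Wald 95% confidence interval for a binomial proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

def cramers_v(table: list[list[int]]) -> float:
    """Cramer's V effect size for a contingency table."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# correct vs. not-correct counts per prompt (illustrative)
general = (638, 900)   # ~70.89% reliability
specific = (709, 900)  # ~78.8% reliability

p_gen, lo_gen, hi_gen = wald_ci(*general)
p_spec, lo_spec, hi_spec = wald_ci(*specific)

# 2x2 table: rows = prompt type, columns = correct / not correct
table = [[638, 900 - 638], [709, 900 - 709]]
chi2, p_value, _, _ = chi2_contingency(table, correction=False)
v = cramers_v(table)
```

With these illustrative counts the chi-square p-value falls below 0.001 and Cramer’s V lands near 0.09, in the same small-effect range as the value reported in the study; minor differences from the published figures are expected, since the exact per-prompt counts and any continuity correction are not reproduced here.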
Results
The distribution of expert gradings of the 1800 implant-supported prostheses answers generated by ChatGPT for the general prompt and the specific prompt is shown in Table 4.
Regarding the consistency of the reviewers’ gradings, expert 1 achieved an agreement percentage of 94.98% in grading the 1,800 answers generated by ChatGPT, with a Gwet’s AC1 of 91.94%. Similarly, expert 2 achieved an agreement percentage of 95.43%, with a Gwet’s AC1 of 92.57% for the same set of answers. Expert agreement was observed in 1,688 (93.78%) of the 1,800 answers generated by ChatGPT-4. In only 112 of the 1,800 responses did the experts disagree, requiring the intervention of Expert 3 to resolve the discrepancy.
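Gwet’s AC1, used above to quantify each expert’s consistency, can be computed from first principles. The sketch below implements the standard two-rater formula for nominal categories, together with Gwet’s benchmark scale used in this study; it is illustrative and does not reproduce the authors’ analysis code.

```python
# Gwet's AC1 chance-corrected agreement for two raters and q categories
# (here the 3-point grading scale: 0 = incorrect, 1 = partially correct,
# 2 = correct). AC1 = (pa - pe) / (1 - pe), where pe is Gwet's
# chance-agreement term based on average category marginals.

def gwet_ac1(ratings_a: list[int], ratings_b: list[int],
             categories=(0, 1, 2)) -> float:
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    q = len(categories)
    # Observed agreement: proportion of items both raters graded identically.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pi_k: average marginal proportion of category k across the two raters.
    pe = 0.0
    for k in categories:
        pi_k = (ratings_a.count(k) + ratings_b.count(k)) / (2 * n)
        pe += pi_k * (1 - pi_k)
    pe /= (q - 1)
    return (pa - pe) / (1 - pe)

def gwet_benchmark(coef: float) -> str:
    """Classify a coefficient on the benchmark scale proposed by Gwet."""
    bounds = [(0.8, "Almost Perfect"), (0.6, "Substantial"),
              (0.4, "Moderate"), (0.2, "Fair"), (0.0, "Slight")]
    for lower, label in bounds:
        if coef >= lower:
            return label
    return "Poor"
```

For example, for the toy gradings `[2, 2, 2, 1, 0]` and `[2, 2, 1, 1, 0]`, observed agreement is 0.8 and AC1 works out to about 0.71, i.e., “Substantial” on Gwet’s scale. AC1 is often preferred over Cohen’s kappa when the marginal distributions are highly skewed, as happens here with most answers graded “correct”.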
The percentage of reliable answers ranged from 0 to 100% depending on the question and prompt used. Of the 30 questions posed to ChatGPT using the general prompt, 19 received correct answers in all 30 repetitions (i.e., 100% reliability for those questions). Conversely, 5 questions did not receive any correct answers across the 30 repetitions (i.e., 0% reliability for those questions). Meanwhile, using the specific prompt, 18 questions achieved 100% reliability, and only 2 questions had 0% reliability (Fig 1).
Overall, the set of questions asked with the general prompt showed a reliability of 70.89%, with a 95% confidence interval ranging from 67.84% to 73.76%. The specific prompt showed a reliability of 78.8%, with a 95% confidence interval from 75.29% to 80.69%. Thus, the reliability of the specific prompt was significantly higher than that of the general prompt (p < 0.001; Cramer’s V effect size = 0.083). However, when analysing each question separately (general prompt vs. specific prompt), no statistically significant differences were found (Table 4).
The repeatability of the experts’ gradings of the generated answers ranged from moderate to almost perfect for the general prompt (Table 5) and from substantial to almost perfect for the specific prompt (Table 6). The pronounced overlap of the 95% confidence intervals for the different coefficients indicates a lack of significant differences in the repeatability of answers between general and specific prompts (Fig 2).
Discussion
This evaluation of ChatGPT aimed to analyse its generation of answers in the field of implant-supported prostheses. For this purpose, questions were formulated using two types of prompts, and the generated answers were graded by experts to measure their reliability and repeatability. The research hypothesis that the implant-supported prostheses answers provided by ChatGPT-4 would not exhibit reliability and repeatability was partially rejected, as the answers showed limited levels of reliability and repeatability, although better performance was observed when using the specific prompt.
As the accessibility of AI has shown a significant increase, the performance of ChatGPT in generating answers to dental questions needs to be evaluated. The results of this study show a reliability of 70.89% with the general prompt, and a reliability of 78.8% with the specific prompt. These results were higher than those observed in a study that analysed the performance of ChatGPT on questions about implant-supported and implant-retained prostheses, where a mean reliability of 25.6% was observed [17]. However, different performance rates have been reported in other dental specialties. Similar results to this study have been observed in Oral Surgery: a mean reliability of 71.7% has been reported for oral surgery answers, with the proportion of correct answers varying between 0 and 100% [10], and values of 3.94, 3.85, and 3.96 out of 4 for answers related to anatomical landmarks, oral and maxillofacial pathologies, and radiographic features of the pathologies, respectively [6]. Nevertheless, better values were found for patient questions (4.62 out of 5) than for technical questions (3.1 out of 5) [11]. Furthermore, in Periodontology, several studies evaluating the performance of ChatGPT on patient questions found that the quality of most answers was rated as “good” based on the DISCERN instrument [12], while the accuracy and completeness for periodontal questions were 5.5 out of 6 and 2.3 out of 3, respectively [13]. These differences between the studies might be related to the specific dental specialty, as ChatGPT retrieves information from different Internet sources [1] whose origin is unknown [10]. In addition to comparing the performance of ChatGPT in different dental specialties, it is also important to evaluate its performance in relation to other LLMs. In this regard, several studies [33,34] have highlighted ChatGPT-4 as the best performing model.
However, another study in the field of orthodontics [35] found no significant differences between the models, with Microsoft Bing Chat ranking highest, followed by ChatGPT-4, Google Bard, and ChatGPT-3.5. That said, the different model versions used, the data collection periods, and the different specialties may have influenced the variability of results between studies.
Repeatability is a factor to consider, as ChatGPT might not always give the same answer to the same question [8]. Furthermore, ChatGPT might randomly give seemingly correct answers mixed with incorrect answers [25]. In this study, repeatability ranged from moderate to almost perfect for the general prompt, and from substantial to almost perfect for the specific prompt. However, the number of studies investigating the repeatability of ChatGPT’s answers is limited. Previous research has shown that the repeatability for generating dichotomous endodontic answers was 85.44% [8], and was similar between ChatGPT-3.5, Google Bard, and Bing [9]. However, the repeatability of ChatGPT-4 varied depending on the dental specialty analyzed. In Oral Surgery, repeatability ranged from moderate to almost perfect [10], while in Prosthodontics, substantial to moderate ranges were found [17].
Regarding the prompts used, it was observed that the specific prompt had a better performance than the general prompt, and this difference was statistically significant in the reliability of the generated answers. These results are in line with previous studies that emphasise the importance of careful prompt design to ensure high quality outputs [36].
Therefore, the way a prompt is designed, known as prompt engineering, should be considered [37], as it is a key factor in optimising model performance. Prompt engineering involves formulating effective instructions that efficiently guide models to generate the desired response [38,39].
The observed results could therefore be attributed to the additional clarity given to ChatGPT about its role, the target audience, and the instruction to respond in a precise and direct manner, without digressions or creative answers. This increased guidance may have facilitated the generation of more accurate and relevant answers, so the use of a specific prompt by the professional could be recommended to optimise the performance of ChatGPT. In the field of dentistry, most studies [8,9,17] have used general prompts to formulate the question, and only a few studies specifically framed the question in ChatGPT [5,10]. Further research is therefore needed to determine the impact of different types of prompts on the quality of answers generated by ChatGPT, and to optimise prompt design to improve the model’s performance.
According to the results obtained, the use of ChatGPT to generate answers on implant-supported and implant-retained prostheses is promising. However, given the limited reliability and repeatability observed, it needs to be used under the supervision of a professional. In addition, the use of a specific prompt, which yields more accurate answers, would be recommended. Further research is needed to analyze the performance of ChatGPT in implant-supported prostheses, as well as to analyze different prompts to obtain the most accurate answers.
An advantage of this study is the number of answers analyzed, a total of 1800. This large dataset provides a solid basis for evaluating the performance of ChatGPT in implant-supported prostheses, allowing more robust and representative conclusions to be drawn about its reliability and repeatability. In addition, this study compares the use of different prompts, thus contributing to the understanding of how question wording might affect the performance of ChatGPT. To ensure the high quality of the questions entered into ChatGPT, they were designed by the researchers based on The Proceedings of the Seventh ITI Consensus Conference [29]. These questions were evaluated by experts with over 15 years of experience in the field using a 3-point Likert scale (0 = disagree, 1 = neutral, 2 = agree). The 30 highest-scoring questions were selected for the study, reducing potential biases related to the questions. Regarding the quality of answer grading, the two experts graded the 1,800 answers generated by ChatGPT, both with high consistency, and the level of agreement between expert 1 and expert 2 was high. Additionally, in case of discrepancies, expert 3 intervened to resolve the differences, which helped to minimise potential biases in the grading process.
One of the main limitations of this study was that only technical questions were analysed, which may not fully represent the variety of questions encountered in clinical practice. The performance of ChatGPT may vary depending on the complexity and context of the questions, particularly in real-world clinical scenarios or patient-generated queries.
Future research should investigate its reliability in these contexts, as well as its potential impact on clinical decision-making. Further studies should also evaluate its performance with patient-generated questions and analyse the influence of different prompt design strategies to optimise response reliability. Moreover, comparing ChatGPT’s performance with that of other AI models would provide valuable insight into its relative strengths and weaknesses, and expanding the scope of research in these areas would provide a more comprehensive understanding of the capabilities and limitations of ChatGPT in implant prosthodontics.
Conclusions
ChatGPT showed promising reliability and repeatability in generating answers in implant-supported and implant-retained prostheses. Better results were obtained when using a specific prompt compared to a general prompt. However, the results suggest that ChatGPT should always be used under the supervision of a professional who can identify and manage limitations.
References
- 1. Deiana G, Dettori M, Arghittu A, Azara A, Gabutti G, Castiglia P. Artificial intelligence and public health: evaluating ChatGPT responses to vaccination myths and misconceptions. Vaccines (Basel). 2023;11(7):1217. pmid:37515033
- 2. Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int. 2024;44(3):509–15. pmid:37747564
- 3. Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun C-H, Lam JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. pmid:37625267
- 4. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9:e48291. pmid:37261894
- 5. Dashti M, Londono J, Ghasemi S, Moghaddasi N. How much can we rely on artificial intelligence chatbots such as the ChatGPT software program to assist with scientific writing? J Prosthet Dent. 2025;133(4):1082–8. pmid:37438164
- 6. Mago J, Sharma M. The potential usefulness of ChatGPT in oral and maxillofacial radiology. Cureus. 2023;15(7):e42133. pmid:37476297
- 7. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. pmid:37215063
- 8. Suárez A, Díaz-Flores García V, Algar J, Gómez Sánchez M, Llorente de Pedro M, Freire Y. Unveiling the ChatGPT phenomenon: evaluating the consistency and accuracy of endodontic question answers. Int Endod J. 2024;57(1):108–13. pmid:37814369
- 9. Mohammad-Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endod J. 2024;57(3):305–14. pmid:38117284
- 10. Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, et al. Beyond the scalpel: assessing ChatGPT’s potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J. 2023;24:46–52. pmid:38162955
- 11. Balel Y. Can ChatGPT be used in oral and maxillofacial surgery? J Stomatol Oral Maxillofac Surg. 2023;124(5):101471. pmid:37061037
- 12. Alan R, Alan BM. Utilizing ChatGPT-4 for providing information on periodontal disease to patients: a DISCERN quality analysis. Cureus. 2023;15(9):e46213. pmid:37908933
- 13. Chatzopoulos GS, Koidou VP, Tsalikis L, Kaklamanos EG. Large language models in periodontology: assessing their performance in clinically relevant questions. J Prosthet Dent. 2024:S0022–3913(24)00714-5. pmid:39562221
- 14. Tanaka OM, Gasparello GG, Hartmann GC, Casagrande FA, Pithon MM. Assessing the reliability of ChatGPT: a content analysis of self-generated and self-answered questions on clear aligners, TADs and digital imaging. Dental Press J Orthod. 2023;28(5):e2323183. pmid:37937680
- 15. Abu Arqub S, Al-Moghrabi D, Allareddy V, Upadhyay M, Vaid N, Yadav S. Content analysis of AI-generated (ChatGPT) responses concerning orthodontic clear aligners. Angle Orthod. 2024;94(3):263–72. pmid:38195060
- 16. Makrygiannakis MA, Giannakopoulos K, Kaklamanos EG. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing. Eur J Orthod. 2024:cjae017. pmid:38613510
- 17. Freire Y, Santamaría Laorden A, Orejas Pérez J, Gómez Sánchez M, Díaz-Flores García V, Suárez A. ChatGPT performance in prosthodontics: assessment of accuracy and repeatability in answer generation. J Prosthet Dent. 2024;131(4):659.e1-659.e6. pmid:38310063
- 18. Anitua E, Larrazabal Saez de Ibarra N, Saracho Rotaeche L. Implant-supported prostheses in the edentulous mandible: biomechanical analysis of different implant configurations via finite element analysis. Dent J (Basel). 2022;11(1):4. pmid:36661541
- 19. Mistry G, Rathod A, Singh S, Kini A, Mehta K, Mistry R. Digital versus traditional workflows for fabrication of implant-supported rehabilitation: a systematic review. Bioinformation. 2024;20(9):1075–85. pmid:39917200
- 20. Froimovici F-O, Butnărașu CC, Montanari M, Săndulescu M. Fixed full-arch implant-supported restorations: techniques review and proposal for improvement. Dent J (Basel). 2024;12(12):408. pmid:39727465
- 21. Tabarak N, Srivastava G, Padhiary SK, Manisha J, Choudhury GK. Zirconia-ceramic versus metal-ceramic implant-supported multiunit fixed dental prostheses: a systematic review and meta-analysis. Dent Res J. 2024;21:5.
- 22. Vazouras K, Taylor T. Full-arch removable vs fixed implant restorations: a literature review of factors to consider regarding treatment choice and decision-making in elderly patients. Int J Prosthodont. 2021;34:s93–101. pmid:33571329
- 23. Eggmann F, Weiger R, Zitzmann NU, Blatz MB. Implications of large language models such as ChatGPT for dental medicine. J Esthet Restor Dent. 2023;35(7):1098–102. pmid:37017291
- 24. Meyer JG, Urbanowicz RJ, Martin PCN, O’Connor K, Li R, Peng P-C, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 2023;16(1):20. pmid:37443040
- 25. Gajjar K, Balakumaran K, Kim AS. Reversible left ventricular systolic dysfunction secondary to pazopanib. Cureus. 2018;10(10):e3517. pmid:30648052
- 26. Meskó B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res. 2023;25:e50638. pmid:37792434
- 27. Chen J, Zhu L, Mou W, Lin A, Zeng D, Qi C, et al. STAGER checklist: Standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability. iMetaOmics. 2024;1(1).
- 28. Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025;31(1):60–9. pmid:39779929
- 29. Proceedings of the Seventh ITI Consensus Conference. Special issue. Clin Oral Implants Res. 2023.
- 30. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. pmid:36812645
- 31. Shieh A, Tran B, He G, Kumar M, Freed JA, Majety P. Assessing ChatGPT 4.0’s test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci Rep. 2024;14(1):9330. pmid:38654011
- 32. Gwet KL. Handbook of inter-rater reliability. 4th ed. Gaithersburg, MD: Advanced Analytics, LLC; 2014.
- 33. Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the performance of generative AI large language models ChatGPT, Google Bard, and Microsoft Bing Chat in supporting evidence-based dentistry: comparative mixed methods study. J Med Internet Res. 2023;25:e51580. pmid:38009003
- 34. Yamaguchi S, Morishita M, Fukuda H, Muraoka K, Nakamura T, Yoshioka I, et al. Evaluating the efficacy of leading large language models in the Japanese national dental hygienist examination: a comparative analysis of ChatGPT, Bard, and Bing Chat. J Dent Sci. 2024;19(4):2262–7. pmid:39347065
- 35. Makrygiannakis MA, Giannakopoulos K, Kaklamanos EG. Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing. Eur J Orthod. 2024:cjae017. pmid:38613510
- 36. Kıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024;100(1189):858–65. pmid:38840505
- 37. Toyama Y, Harigai A, Abe M, Nagano M, Kawabata M, Seki Y, et al. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Jpn J Radiol. 2024;42(2):201–7. pmid:37792149
- 38. Cesur T, Güneş YC. Optimizing diagnostic performance of ChatGPT: the impact of prompt engineering on thoracic radiology cases. Cureus. 2024;16(5):e60009. pmid:38854352
- 39. Almeida LC, Farina EMJM, Kuriki PEA, Abdala N, Kitamura FC. Performance of ChatGPT on the Brazilian Radiology and Diagnostic Imaging and Mammography Board Examinations. Radiol Artif Intell. 2024;6(1):e230103. pmid:38294325