Abstract
Efforts are being made to improve the time effectiveness of healthcare providers. Artificial intelligence tools can help transcribe and summarize physician-patient encounters and produce medical notes and medical recommendations. However, in addition to medical information, discussions between healthcare providers and patients include small talk and other information irrelevant to medical concerns. Since Large Language Models (LLMs) are predictive models that build their responses based on the words in the prompts, there is a risk that small talk and irrelevant information may alter the response and the suggestions given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data, in both multiple-choice and open-ended formats. First, we gathered small talk sentences from human participants using the Mechanical Turk platform. Second, both sets of USMLE questions were arranged in a pattern where each sentence from the original question was followed by a small talk sentence. ChatGPT-3.5 and ChatGPT-4 were asked to answer both sets of questions with and without the small talk sentences. Finally, a board-certified physician analyzed the answers given by ChatGPT and compared them to the formal correct answers. The analysis demonstrates that the ability of ChatGPT-3.5 to answer correctly was impaired when small talk was added to medical data (66.8% vs. 56.6%; p = 0.025); specifically, accuracy dropped for multiple-choice questions (72.1% vs. 68.9%; p = 0.67) and for open-ended questions (61.5% vs. 44.3%; p = 0.01), respectively. In contrast, small talk phrases did not impair ChatGPT-4's ability on either type of question (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and small talk does not appear to impair its capability to provide medical recommendations.
Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
Citation: Safrai M, Azaria A (2024) Does small talk with a medical provider affect ChatGPT’s medical counsel? Performance of ChatGPT on USMLE with and without distractions. PLoS ONE 19(4): e0302217. https://doi.org/10.1371/journal.pone.0302217
Editor: Massimo Stella, Università di Trento: Universita degli Studi di Trento, ITALY
Received: September 18, 2023; Accepted: March 28, 2024; Published: April 30, 2024
Copyright: © 2024 Safrai, Azaria. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: https://datadryad.org/stash/share/y1HZFsTgw4W6LaodQm5ROY6oJ7M4CaEaMo0BFGBn-2I.
Funding: Amos Azaria: Ministry of Science and Technology, Israel. No award number. https://www.gov.il/he/departments/general/most_rfp_application_guide The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
One of the key, yet most time-consuming, healthcare tasks is charting and creating medical notes, a task requiring extensive time from healthcare providers [1]. In fact, this task often requires healthcare providers to spend as much time as, if not more than, they do in direct patient interaction [2, 3]. For example, in one survey, 67% of residents reported spending in excess of 4 hours daily on documentation [1]. Despite the importance of medical notes [4, 5], no changes have been made to their format, beyond transferring the responsibility of writing them from other medical team members to physicians [6]. This shift has created a burden for medical providers [7] and contributed to physician burnout [8]. Moreover, the recent implementation of electronic health records (EHRs) has significantly increased clinician documentation time [9], making it the most time-consuming physician activity [10]. This emphasizes the pressing need to improve the way medical notes are charted and written.
Large Language Models (LLMs) have been suggested as a possible solution for improving healthcare documentation, creating notes, summarizing physician-patient encounters, and even providing meaningful suggestions for further treatment [11, 12]. For example, Chat Generative Pre-trained Transformer 3.5 (ChatGPT-3.5) has been shown to generate a correct diagnosis for 93% of clinical cases with common chief complaints [13] and to screen for breast cancer with an average correct rate of 88.9% [14]. In addition, ChatGPT-3.5 was able to provide general medical information on common retinal diseases [15], on almost every subject in gynecology [16], and on cancer topics [17]. Moreover, another article demonstrated ChatGPT-3.5's ability to generate clinical letters with high overall accuracy and humanization [18]. Recent investigations have also shown ChatGPT-3.5's ability to write medical notes [12] and to generate a discharge note based on a brief description [19]. More recently, a newer version, ChatGPT-4, was released. ChatGPT-4 can process a greater word limit, has a stronger ability to solve complex problems, and supports image recognition [20]. This version has additionally shown greater capabilities in terms of clinical evaluation [21, 22]. Namely, while ChatGPT-3.5 obtained a score of 60.9% on a US sample clinical exam, ChatGPT-4 obtained a score of 89.8% on the same exam [21]. A similar result was obtained on the Japanese medical exam, in which ChatGPT-3.5 obtained an average score of 121.3 on the first part of the exam and 149.7 on the second part, while ChatGPT-4 obtained average scores of 167.7 and 221.5, respectively [22].
Following the success of ChatGPT in the medical field, the technology has been tested to summarize physician-patient encounters [23]. Those appointments between healthcare providers and patients form the foundation of medical care [24]. They necessitate medical evaluations, including the provider’s focus on patient needs, obtaining medical histories [25, 26], conducting physical examinations [27, 28] and performing additional tests if necessary [29, 30]. Moreover, they also entail non-medical tasks such as documenting patient records, organizing notes, and making referrals.
However, since healthcare provider and patient discussions are unique and based on trust, in addition to medical information they often include small talk and other information irrelevant to medical concerns [31, 32]. These unique exchanges are an important part of the relationship between medical providers and patients and are common across different cultures [33]. In traditional Chinese medicine, doctors actively initiate small talk to acquire holistic information for diagnosis and attach significant importance to it [33]. In contrast, such interaction involving small talk has been found to alter the technical skills and performance of medical students [34]. These contrasting findings raise concerns regarding the potential impact of small talk on LLMs.
LLMs have the potential to streamline the note-taking process for health practitioners by automating it in two steps. Current technologies enable converting audio information, such as conversations, into transcripts [35]. LLMs can then analyze the transcripts and create medical notes, such as discharge summaries, in real time [19]. However, the unique nature of discussions between healthcare providers and patients, which often interweave small talk with critical information, poses a challenge. Given that LLMs are predictive models that generate responses based on the input prompt [36], further investigation is needed to assess whether small talk impedes an LLM's ability to formulate accurate medical notes and summaries.
Despite the growing number of studies on the potential of using AI for healthcare purposes, to the best of our knowledge, none have assessed this unique aspect of healthcare provider-patient interactions: the effect that casual conversation and unrelated information could have on the efficacy of ChatGPT in processing medical information, and therefore on its use for writing medical notes that summarize physician-patient interactions. This study aims to investigate the impact of interspersing medical data with casual conversation on the precision of medical recommendations provided by ChatGPT-3.5 and ChatGPT-4.
To clearly delineate the practical application of this paper: the study targets the use of LLMs alongside a practitioner for charting and summarizing healthcare provider-patient interactions, including small talk, with practitioners having significantly less manual work to accomplish. LLMs can save time and resources by automating practitioners' note-taking, which requires the ability of the LLM to process the relevant information. However, investigation is required to determine whether small talk inhibits the LLM in formulating accurate medical conclusions. Moreover, LLMs can suggest medical conclusions, such as potential treatments and diagnoses, saving time for the practitioner. If small talk interferes with the performance of LLMs, practitioners cannot rely on them for note-taking, as this 'noise' would cause inaccuracies in medical records.
Material and methods
Medical information
To assess ChatGPT’s capabilities in medical reasoning, we evaluated its responses to questions from the United States Medical Licensing Examination (USMLE). This exam has been successfully used to assess the medical logic of LLMs in previous studies [37]. Specifically, to evaluate the LLM’s proficiency in addressing clinical queries, we selected the Step 3 exam, the final examination in the USMLE sequence, which qualifies individuals to practice medicine unsupervised. The multiple-choice questions in this exam primarily test knowledge related to diagnosis and clinical management and reflect clinical situations that a general physician might encounter (https://www.usmle.org/step-exams/step-3/step-3-exam-content).
USMLE Step 3 questions were sourced from the dataset provided by Kung et al. [37]. Two distinct sets of questions were utilized in the study. The first comprised the original multiple-choice (MC) questions from the USMLE Step 3 exam, while the second presented the same questions in an open-ended (OE) format. Each set contained 122 questions. This study is exempt from Institutional Review Board (IRB) review. It did not involve any interaction or intervention with human subjects nor did it access identifiable private information. Participants provided unidentifiable, non-medical, generic sentences for analysis. The primary focus of this research was ChatGPT, not human subjects. All other applicable ethical guidelines were adhered to during the conduct of this study.
Obtaining small talk sentences
We conducted a survey on Amazon’s Mechanical Turk platform, which allows researchers to recruit participants for various tasks, including online surveys and experiments. Mechanical Turk has gained considerable popularity in recent years as a tool for research due to its efficiency, cost-effectiveness, and the ability to reach a vast pool of participants [38]. The survey was conducted from 7/24/2023 to 7/25/2023.
In our survey, we required the participants to write sentences with at least 10 words, to encourage more thoughtful and meaningful responses and reduce the likelihood of individuals providing rushed, brief answers (e.g., “I ate something”, “I saw someone”, etc.). We aimed for participants to produce meaningful sentences that emulate small talk, ensuring they convey information in a casual, conversational manner. The participants confirmed their approval for the provided information to be used for research.
The participants were provided the following instructions. “Please write 5 different sentences as if you were talking to your friend. Each sentence must describe something that has happened to you or an action that you have performed in the past few days. The sentences should not depend on each-other. It is OK to write sentences about simple everyday occurrences (e.g., “1. I sat on a chair on my balcony and looked at the cars passing by.”). Each sentence should be at least 10 words long.”
We note that we intentionally framed the small talk in the context of “talking to a friend” rather than talking to a physician, since we did not want the small talk sentences to have any true influence on the correct answer. By framing the small talk in the context of talking to a friend, we aimed for the correct diagnosis to remain unchanged.
We recruited 35 participants, each of whom provided 5 sentences, resulting in 175 sentences. The following are some examples of sentences we received from the Mechanical Turk workers:
- I had a great time catching up with my friends at the coffee shop.
- I finished reading a great book and I’m looking for my next one.
- I biked to the park and watched the birds for an hour.
All sentences shorter than 10 words were removed. The remaining sentences were converted to a third-person view, to better align with the USMLE format. This resulted in a list of 143 small talk sentences, which are provided in the appendix.
Converting the three aforementioned sentences to a third-person view yields the following:
- The person had a great time catching up with their friends at the coffee shop.
- The person finished reading a great book and is looking for their next one.
- The person biked to the park and watched the birds for an hour.
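As a minimal sketch of the length filter described above, the following Python snippet keeps only sentences of at least 10 words (the third-person conversion was a separate editing step and is not shown; the function name is ours, for illustration only):

```python
def filter_small_talk(sentences, min_words=10):
    """Keep only sentences with at least `min_words` words,
    mirroring the 10-word threshold applied to the survey responses."""
    return [s for s in sentences if len(s.split()) >= min_words]

raw = [
    "I had a great time catching up with my friends at the coffee shop.",
    "I ate something",  # fewer than 10 words; removed
]
kept = filter_small_talk(raw)
print(kept)  # only the first sentence survives the filter
```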
Small talk integration into medical information
A program was developed that executed the following procedure on the USMLE Step 3 questions. Through sentence tokenization, each question was broken down into individual sentences. Each sentence from the USMLE question was then followed by a sentence from the small talk file, creating an alternating sequence, as shown in Fig 1.
The small talk sentences, added for this illustration, are highlighted in green (the actual dataset does not contain any color highlighting).
The final dataset included a total of 488 questions: 122 multiple-choice questions and 122 open-ended questions, each presented with and without small talk.
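The interleaving procedure can be sketched as follows. This is an illustration under assumptions: the paper does not name the sentence tokenizer used, so a simple punctuation-based split stands in here, and the cyclic draw from the small talk list is our simplification:

```python
import re
from itertools import cycle

def interleave_small_talk(question, small_talk):
    """Split a question into sentences (simple punctuation-based
    tokenization; the study's actual tokenizer is unspecified) and
    insert one small talk sentence after each question sentence."""
    sentences = re.split(r'(?<=[.!?])\s+', question.strip())
    talk = cycle(small_talk)  # reuse small talk sentences cyclically
    mixed = []
    for s in sentences:
        mixed.append(s)
        mixed.append(next(talk))
    return " ".join(mixed)

question = "The patient is a 45-year-old man. He reports chest pain."
st = ["The person biked to the park and watched the birds for an hour."]
print(interleave_small_talk(question, st))
```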
ChatGPT queries
ChatGPT was prompted using the OpenAI API (in Python). Each query was submitted separately as a new query, i.e., our program read each question from the file and submitted it to ChatGPT. We used the openai.ChatCompletion.create function with the default parameters (https://openai.com/blog/openai-api). The full set of questions was submitted as user queries without a system message for both versions of ChatGPT. In addition, the questions including small talk were submitted to ChatGPT-3.5 using the following system message: “You will be asked a question that may contain some irrelevant information. You must first write all the relevant information, then reason about the person’s medical condition, and only then attempt to answer the question.” We refer to the version with the system message as ChatGPT-3.5 ST-Identify.
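A sketch of how each query can be assembled for the legacy openai.ChatCompletion.create interface named above. The helper function name is ours, and the actual API call (shown commented out, since it requires an API key, and the exact model identifiers are assumptions) would use the returned message list:

```python
# System message used for the ChatGPT-3.5 ST-Identify condition (quoted
# verbatim from the study's description).
ST_IDENTIFY_SYSTEM = (
    "You will be asked a question that may contain some irrelevant "
    "information. You must first write all the relevant information, then "
    "reason about the person's medical condition, and only then attempt "
    "to answer the question."
)

def build_messages(question, use_system_message=False):
    """Construct the chat messages for one question; each question is
    submitted separately, as a fresh conversation."""
    messages = []
    if use_system_message:
        messages.append({"role": "system", "content": ST_IDENTIFY_SYSTEM})
    messages.append({"role": "user", "content": question})
    return messages

# The call itself (requires an API key; the model name is an assumption):
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo",
#     messages=build_messages(question, use_system_message=True))
```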
ChatGPT answers assessment
All the responses from ChatGPT to the various datasets were evaluated by a single board-certified physician (MS). For both multiple-choice and open-ended formats, ChatGPT’s responses were validated against the official answers of the original multiple-choice questions.
Statistical analysis
The study investigates the impact of small talk mixed with medical data on the accuracy of medical advice provided by ChatGPT, comparing its performance between versions 3.5 and 4. The primary outcome measures were the accuracy of responses to USMLE Step 3 questions, both multiple-choice and open-ended, with and without small talk sentences included. The study employed a mixed-model design, incorporating both within-subject and between-subject factors. The primary independent variable was the presence or absence of small talk sentences added to the USMLE Step 3 questions, while the version of ChatGPT (3.5 vs. 4) served as a between-subject factor. For each question type (multiple-choice and open-ended), the accuracy of ChatGPT’s responses with and without small talk sentences was compared using statistical tests appropriate for the data distribution and study design. Specifically, paired t-tests or Wilcoxon signed-rank tests were conducted to assess within-subject differences in accuracy between conditions (with vs. without small talk sentences) for each ChatGPT version. Additionally, independent t-tests or Mann-Whitney U tests were used to compare the accuracy of ChatGPT-3.5 and ChatGPT-4 across conditions. Descriptive statistics, including means, standard deviations, medians, and interquartile ranges, were reported to summarize the accuracy of ChatGPT’s responses in each condition and for each version. Inferential statistics, such as p-values, were provided to determine the significance of the differences observed. The statistical analyses were performed using Python (version 3.10.1) with the SciPy library (version 1.10.1). The chi2_contingency function from the scipy.stats library was utilized to compare the different groups, and p-values of less than 0.05 were considered statistically significant.
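As an illustration of the chi-square comparison, the overall ChatGPT-3.5 contrast (66.8% vs. 56.6% correct, 244 questions per condition) can be checked with scipy.stats.chi2_contingency; the correct/incorrect counts below are reconstructed from the reported percentages, not taken from the raw data:

```python
from scipy.stats import chi2_contingency

# Counts reconstructed from the reported percentages: 244 questions per
# condition (122 multiple-choice + 122 open-ended).
correct_no_st = round(0.668 * 244)  # correct answers without small talk
correct_st = round(0.566 * 244)     # correct answers with small talk
table = [
    [correct_no_st, 244 - correct_no_st],
    [correct_st, 244 - correct_st],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # p close to the reported 0.025
```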
Results
The overall performance of ChatGPT-4 was significantly better than that of ChatGPT-3.5, with 75.4% vs. 61.7% correct responses, respectively (p < 0.001). ChatGPT-4 scored significantly better both on the USMLE questions without the addition of small talk (75.4% vs. 66.8%; p = 0.045) and on the questions including small talk (75.4% vs. 56.6%; p < 0.001) (Fig 2). In addition, the effect of small talk integration within medical information differed between the two ChatGPT versions. ChatGPT-3.5 showed a clear decrease in answer accuracy when small talk sentences were added to the medical data, with a significant drop from 66.8% to 56.6% across all ChatGPT-3.5 answers (p = 0.025).
The figure shows the significant difference between ChatGPT-3.5 and ChatGPT-4 performance with and without the addition of small talk sentences. In addition, it demonstrates the significant difference in the performance of ChatGPT-3.5 between the datasets with and without small talk. ST—Small talk, with the addition of small talk to the original question. * and ** indicate statistical significance at levels p < 0.05 and p < 0.001, respectively.
When examining each dataset of questions separately, the influence of small talk integration on each type of question is more prominent. ChatGPT-3.5 demonstrates a non-significant reduction from 72.1% to 68.9% for the multiple-choice questions, and a more considerable, significant drop in performance from 61.5% to 44.3% (p = 0.01) for the open-ended questions. In contrast, the performance of ChatGPT-4 remained unchanged despite the introduction of small talk, with 67.2% and 83.6% correct answers for open-ended and multiple-choice questions, respectively (Fig 3).
ChatGPT-4 performed significantly better than ChatGPT-3.5 (p<0.001). The small talk seemed to have a larger effect on the performance of ChatGPT-3.5 in the open-ended questions. ST—Small talk, with the addition of small talk to the original question. OE—Open-ended questions, MC—Multiple-choice questions. * and ** indicate statistical significance at levels p < 0.05 and p < 0.001, respectively.
Upon closer examination of ChatGPT's answers to each question, a pattern of error can be observed when the correct answer is that no further test or investigation is required. For instance, each dataset included two questions whose correct answer was “No further evaluation is necessary” or “No additional study is indicated.” Both ChatGPT versions responded incorrectly to the open-ended versions of these questions, suggesting further investigation or treatment regardless of small talk addition. In contrast, when prompted with the multiple-choice dataset, ChatGPT-3.5 answered one of the two questions correctly when no small talk was inserted, but was disturbed by the addition of small talk and responded incorrectly to both questions afterwards. ChatGPT-4 also improved its score on the multiple-choice versions and answered one of them correctly. In contrast to ChatGPT-3.5, its answers were not impaired by the added small talk, and its performance remained the same even after adding irrelevant information.
In other questions, where the correct answer was a diagnosis or treatment, the addition of small talk impaired ChatGPT-3.5 performance. For example, as seen in Fig 4a, the response was correct before adding small talk. However, as shown in Fig 4b, once small talk phrases were added to the question, ChatGPT-3.5 failed and provided an incorrect response. Interestingly, even though the small talk caused ChatGPT to respond incorrectly, it does not explicitly mention any of the small talk information in its answer and does not explain its wrong answer based on the specific interference added to this question.
(a) ChatGPT-3.5 provides a correct answer to a question that does not include small talk interference. (b) ChatGPT-3.5 incorrectly responds to a question mixed with small talk (highlighted in green).
Finally, we compare ChatGPT-3.5 ST (without a system message) with ChatGPT-3.5 ST-Identify, which contained a system message encouraging ChatGPT-3.5 first to identify the important information and only then answer. The system message did not improve the overall performance of ChatGPT-3.5 on the questions with small talk (p = 0.577). While the performance on the open-ended questions slightly increased from 44.3% to 50.0%, the performance on the multiple-choice questions decreased from 68.9% to 62.3%, with an average performance of 56.1%.
Discussion
The primary purpose of this study was to investigate the effect of adding small talk to medical data on the accuracy of medical advice provided by ChatGPT. First, as expected, ChatGPT-4 outperformed ChatGPT-3.5 with an overall higher score on both open-ended and multiple-choice questions. This matches expectations, as ChatGPT-4 is a more advanced version and has been shown to surpass ChatGPT-3.5 on multiple-choice questions in the US and Japanese medical exams [21, 37]. However, this is the first study to show a similar improvement in the capacity of ChatGPT-4 to surpass ChatGPT-3.5 in giving medical recommendations for open-ended questions that simulate daily clinical needs. ChatGPT-4's high score on the open-ended questions in our study indicates its ability to process medical information. These findings suggest the capacity of ChatGPT-4 to respond and provide medical advice and demonstrate its potential future use in the medical field.
When evaluating the effect of adding small talk to the different datasets, ChatGPT-3.5 showed a slight drop in performance on the multiple-choice questions and a significant one on the open-ended questions. In contrast, ChatGPT-4's performance was consistent regardless of small talk, with stable accuracy rates for both question types. To our knowledge, this is the first study evaluating the effect of small talk, i.e., informal or irrelevant information, on the efficacy of ChatGPT and other LLMs in processing medical information. Our study demonstrates the varying impact of adding small talk on different versions of ChatGPT. It implies that the addition of small talk does not impair ChatGPT-4's performance in processing medical data, so it can provide the same accuracy in medical recommendations as in a 'medical only' conversation. During a provider-patient interaction, irrelevant information is often mixed with the medical data that needs to be processed and summarized. A previous study has demonstrated that ChatGPT can summarize and provide a note for 'medical only' physician-patient encounters [23]. Therefore, our data suggest that ChatGPT-4 can assist in this task without being impaired by the casual patient-provider discussion that might occur and be included in a transcript provided to ChatGPT. These findings provide important answers for medical practitioners and LLM developers regarding the potential implications of ChatGPT and other LLMs as tools in medicine. This is especially important as it is predicted that chatbots will be used by medical professionals, as well as by patients, with increasing frequency [23].
The analysis of the exact scores of ChatGPT in our study shows that ChatGPT-3.5 answered 72.1% of the multiple-choice questions correctly without small talk integration. This score is higher than those reported by Kung et al. [37], which ranged from 61.5% to 68.8%. It should be noted, however, that our study was conducted approximately 8 months after the original assessment. A possible explanation for this difference is that ChatGPT, as an artificial intelligence system, has learned and adapted from data. As it encounters more information, it refines its models, which often leads to improved performance and accuracy [39]. It is plausible that the elevated scores observed in our research can be attributed to such a learning enhancement. These findings likely underscore the continuous improvement of ChatGPT over time. We are optimistic that subsequent studies will yield even more favorable outcomes, enhancing ChatGPT's ability to offer even better medical recommendations and furnish dependable support to healthcare providers in medical record documentation.
Each dataset included two questions for which the correct answer was that no further investigation was required. Both versions of ChatGPT answered the open-ended forms of these questions incorrectly. In contrast, on the multiple-choice forms, ChatGPT-3.5 answered one of the two questions correctly when no small talk was added and both incorrectly after this addition, whereas ChatGPT-4 was not influenced by the small talk addition and consistently answered one of the two questions correctly. Our study is the first to report the need for, and the complexity of, LLMs responding to these types of questions. Such answers are crucial in medicine, as patients can easily be referred to countless further tests and investigations, burdening both the patients and the medical system [40]. These queries challenge LLMs, for which the specific wording of the prompt dramatically influences the answer provided [41]. In these examples, asking what the next step should be may imply that a next step is indeed required. This finding demonstrates the complexity of using ChatGPT for different queries and the need to acknowledge the limits of this technology at its current stage of development.
Finally, we sought to analyze the cause of the small talk disturbance to ChatGPT-3.5's processing. We hypothesized that introducing different subjects and specific words would cause failures in ChatGPT-3.5's processing. However, while the presence of small talk impaired the performance of ChatGPT-3.5 on the question datasets, the answers it provided did not attribute the wrong answer to any specific subject or word included in the small talk. This result is concerning: by delivering incorrect responses without mentioning any of the unrelated information, ChatGPT-3.5 makes it difficult for a health provider reviewing the answers to pinpoint errors. A sub-analysis of ChatGPT-3.5 did not show significant differences when assessing a different prompt that indicated the irrelevance of some of the information. This is noteworthy, as previous studies suggested varying prompts could yield different responses from ChatGPT. However, due to the endless space of possible prompts, it is essential to be cautious about generalizing these findings, even though they strengthen our results. We recommend exploring different prompts across various versions to assess their impact.
Therefore, according to these findings, the study suggests that ChatGPT-3.5 should not be utilized by practitioners for charting and summarizing healthcare provider-patient interactions. Additionally, our findings suggest that LLMs have the capacity to streamline the note-taking process for health practitioners, enabling doctors to have more time to assess each case and provide medical advice. However, it is important to note that our results also reflect the errors made by both versions of ChatGPT. Therefore, a health practitioner should always be involved, as ChatGPT cannot be used independently.
Our study has several limitations. The most prominent is that it is challenging to mimic the small talk that occurs between a health provider and a patient. In our model, we framed the small talk in the context of “talking to a friend” rather than to a physician, to avoid bias and the integration of medical terms. However, in practice, the patient will be talking to a physician; thus, even the small talk may resemble medical information being conveyed. Such small talk might further deteriorate the performance of ChatGPT-3.5 and might even affect the performance of ChatGPT-4, which, in our analysis, seemed immune to small talk.
To elucidate the differences in communication context, specifically between sentences someone might express to a friend versus those they might share with their physician, we conducted a supplementary survey using Mechanical Turk. In this survey, we asked 10 participants to produce 5 sentences each that they would likely convey to their physician. We then utilized BERT embeddings [42] to assess the average cosine similarity between pairs of sentences: one from the original USMLE questions and the other from this collection of physician-directed sentences. This yielded a similarity value of 0.6451. Subsequently, we computed the cosine similarity between the USMLE questions and the original small talk sentences, obtaining a value of 0.6007. For perspective, we used ChatGPT-4 to generate 50 random sentences and measured their similarity with the USMLE questions, producing a result of 0.5620. This demonstrates that sentences spoken in a medical context are more semantically related to the USMLE questions than those from casual conversations or random utterances. This distinction emphasizes the need to recognize the unique characteristics of different conversational contexts and the risks of drawing broad conclusions without considering these nuances.
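The similarity computation can be sketched as follows. The cosine and averaging steps are shown in full; obtaining the sentence embeddings themselves requires a BERT model (the text specifies only BERT embeddings [42], so the pooling strategy and model choice would be assumptions and are left as a commented sketch):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_pairwise_similarity(embs_a, embs_b):
    """Average cosine similarity over all cross pairs, as used to compare
    the USMLE questions against each sentence collection."""
    sims = [cosine_similarity(a, b) for a in embs_a for b in embs_b]
    return sum(sims) / len(sims)

# Obtaining BERT sentence embeddings would look roughly like (sketch,
# model name and pooling are assumptions):
# from transformers import AutoTokenizer, AutoModel
# tok, model = AutoTokenizer.from_pretrained("bert-base-uncased"), ...
# embedding = mean of the last hidden states over the sentence's tokens
```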
In addition, in our work, the small talk sentences and the medical information were added in an alternating sequence to the USMLE questions, and each small talk sentence was added as a standalone piece of information. However, in medical practice, the transcript of a physician-patient interaction may be much longer than a USMLE question, and the small talk might be structured differently. USMLE questions have been used previously to assess medical data processing [37], reinforcing the use of this dataset for such a purpose and allowing us to compare our results. Nevertheless, it is possible that different patterns of small talk integration in different transcripts might have various effects on ChatGPT's ability to provide medical counsel. We would also like to stress that this work focuses on both medical information and small talk conveyed in text; however, in practice, the irrelevant information can be conveyed in different modes, such as images (either medical-related images or pictures of the patient's family, pets, etc.) or sounds (either caused by a medical condition of the patient, or the patient laughing in response to a joke, imitating their boss, etc.). Despite this, the present analysis provides important new information about the impact of the most common mode of communication [19, 26], including irrelevant information, in physician-patient encounters on the ability of the different versions of ChatGPT to provide medical advice.
Another potential limitation of this study is that it examined only ChatGPT and did not assess other LLMs; its findings therefore cannot be generalized to other LLMs. Future research could thus investigate whether the addition of small talk interferes with the ability of other LLMs (such as BERT, Claude, LLaMA, and Llama 2) to provide medical advice.
In this paper, we took a first step toward understanding the performance of two ChatGPT versions when faced with physician-patient interactions in which medical information is mixed with irrelevant information. These interactions pose the challenge of discerning the impact of casual conversation on the accuracy and reliability of the medical recommendations these LLMs produce. Our analysis shows that while ChatGPT-3.5's performance was significantly impaired by the addition of small talk, ChatGPT-4's performance was not affected. The results demonstrate that for some LLMs (i.e., ChatGPT-4 in our case), adding casual conversation does not impair the medical advice or diagnoses they provide. Therefore, some LLMs could potentially be used to generate clinical notes from a written transcript. Today's technology already automates the conversion of audio to written transcripts in real time; combining these technologies could reduce the time healthcare workers invest in generating medical notes. However, LLM developers, and especially healthcare providers, must be aware of the limitations of other LLMs (ChatGPT-3.5 in our case) that do not perform well once clinical information is mixed with casual conversation.
Acknowledgments
This work was supported, in part, by the Ministry of Science and Technology, Israel.
References
- 1. Oxentenko AS, West CP, Popkave C, Weinberger SE, Kolars JC. Time spent on clinical documentation: a survey of internal medicine residents and program directors. Archives of internal medicine. 2010;170(4):377–380. pmid:20177042
- 2. Ammenwerth E, Spötl HP. The time needed for clinical documentation versus direct patient care. Methods of information in medicine. 2009;48(01):84–91. pmid:19151888
- 3. Füchtbauer LM, Nørgaard B, Mogensen CB. Emergency department physicians spend only 25% of their working time on direct patient care. Dan Med J. 2013;60(1):A4558. pmid:23340186
- 4. Harvey MA. More Documentation? Who Needs It? Critical Care Medicine. 2022;50(9):1394–1396. pmid:35984052
- 5. Epstein AS, Riley M, Nelson JE, Bernal C, Martin S, Xiao H. Goals of care documentation by medical oncologists and oncology patient end-of-life care outcomes. Cancer. 2022;128(18):3400–3407. pmid:35866716
- 6. Preiksaitis C, Sinsky CA, Rose C. Chatgpt is not the solution to physicians’ documentation burden. Nature Medicine. 2023; p. 1–2.
- 7. Apathy NC, Rotenstein L, Bates DW, Holmgren AJ. Documentation dynamics: Note composition, burden, and physician efficiency. Health Services Research. 2023;58(3):674–685. pmid:36342001
- 8. Sanderson AL, Burns JP. Clinical documentation for intensivists: the impact of diagnosis documentation. Critical Care Medicine. 2020;48(4):579–587. pmid:32205605
- 9. Poissant L, Pereira J, Tamblyn R, Kawasumi Y. The impact of electronic health records on time efficiency of physicians and nurses: a systematic review. Journal of the American Medical Informatics Association. 2005;12(5):505–516. pmid:15905487
- 10. Hill RG Jr, Sears LM, Melanson SW. 4000 clicks: a productivity analysis of electronic medical records in a community hospital ED. The American journal of emergency medicine. 2013;31(11):1591–1594. pmid:24060331
- 11. Liu J, Wang C, Liu S. Utility of chatgpt in clinical practice. Journal of Medical Internet Research. 2023;25:e48568. pmid:37379067
- 12. Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. Journal of Medical Systems. 2023;47(1):33. pmid:36869927
- 13. Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: A pilot study. International journal of environmental research and public health. 2023;20(4):3378. pmid:36834073
- 14. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv. 2023; p. 2023–02.
- 15. Potapenko I, Boberg-Ans LC, Hansen S, Klefter ON, van Dijk EH, Subhi Y. Artificial intelligence-based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmologica. 2023;. pmid:36912780
- 16. Grünebaum A, Chervenak J, Pollet SL, Katz A, Chervenak FA. The exciting potential for ChatGPT in obstetrics and gynecology. American Journal of Obstetrics and Gynecology. 2023;228(6):696–705. pmid:36924907
- 17. Johnson SB, King AJ, Warner EL, Aneja S, Kann BH, Bylund CL. Using ChatGPT to evaluate cancer myths and misconceptions: artificial intelligence and cancer information. JNCI cancer spectrum. 2023;7(2):pkad015. pmid:36929393
- 18. Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. Using ChatGPT to write patient clinic letters. The Lancet Digital Health. 2023;5(4):e179–e181. pmid:36894409
- 19. Patel SB, Lam K. ChatGPT: the future of discharge summaries? The Lancet Digital Health. 2023;5(3):e107–e108. pmid:36754724
- 20. Waisberg E, Ong J, Masalkhi M, Kamran SA, Zaman N, Sarker P, et al. GPT-4: a new era of artificial intelligence in medicine. Irish Journal of Medical Science (1971-). 2023; p. 1–4. pmid:37076707
- 21. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375. 2023.
- 22. Kasai J, Kasai Y, Sakaguchi K, Yamada Y, Radev D. Evaluating GPT-4 and ChatGPT on Japanese medical licensing examinations. arXiv preprint arXiv:2303.18027. 2023.
- 23. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388(13):1233–1239. pmid:36988602
- 24. Bickley L, Szilagyi PG. Bates' guide to physical examination and history-taking. Lippincott Williams & Wilkins; 2012.
- 25. Peterson MC, Holbrook JH, Von Hales D, Smith N, Staker L. Contributions of the history, physical examination, and laboratory investigation in making medical diagnoses. Western Journal of Medicine. 1992;156(2):163. pmid:1536065
- 26. Roshan M, Rao A. A study on relative contributions of the history, physical examination and investigations in making medical diagnosis. The Journal of the Association of Physicians of India. 2000;48(8):771–775. pmid:11273467
- 27. Woolf AD. History and physical examination. Best Practice & Research Clinical Rheumatology. 2003;17(3):381–402. pmid:12787508
- 28. Jarvis C. Physical examination and health assessment: Canadian e-book. Elsevier Health Sciences; 2023.
- 29. Naucler P, Ryd W, Törnberg S, Strand A, Wadell G, Elfgren K, et al. Human papillomavirus and Papanicolaou tests to screen for cervical cancer. New England Journal of Medicine. 2007;357(16):1589–1597. pmid:17942872
- 30. Dugdale DC III, Zieve D, Conaway B. Health screenings for women age 65 and older. MedlinePlus Medical Encyclopedia; 2023. Available from: https://medlineplus.gov/ency/article/007463.htm.
- 31. Jin Y. Small talk in medical conversations: Data from China. Journal of Pragmatics. 2018;134:31–44.
- 32. Alvaro Aranda C, Lazaro Gutierrez R. Functions of small talk in healthcare interpreting: an exploratory study in medical encounters facilitated by healthcare interpreters. Language and Intercultural Communication. 2022;22(1):21–34.
- 33. Wei S, Mao Y. Small talk is a big deal: A discursive analysis of online off-topic doctor-patient interaction in Traditional Chinese Medicine. Social Science & Medicine. 2023;317:115632.
- 34. Posner GD, Hamstra SJ. Too much small talk? Medical students’ pelvic examination skills falter with pleasant patients. Medical education. 2013;47(12):1209–1214. pmid:24206154
- 35. Alharbi S, Alrazgan M, Alrashed A, Alnomasi T, Almojel R, Alharbi R, et al. Automatic speech recognition: Systematic literature review. IEEE Access. 2021;9:131858–131876.
- 36. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the dangers of stochastic parrots: Can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; 2021. p. 610–623.
- 37. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS digital health. 2023;2(2):e0000198. pmid:36812645
- 38. Paolacci G, Chandler J, Ipeirotis PG. Running experiments on amazon mechanical turk. Judgment and Decision making. 2010;5(5):411–419.
- 39. Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time? arXiv preprint arXiv:2307.09009. 2023.
- 40. Carpenter CR, Raja AS, Brown MD. Overtesting and the downstream consequences of overtreatment: implications of “preventing overdiagnosis” for emergency medicine. Academic Emergency Medicine. 2015;22(12):1484–1492. pmid:26568269
- 41. Qin G, Eisner J. Learning how to ask: Querying LMs with mixtures of soft prompts. arXiv preprint arXiv:2104.06599. 2021.
- 42. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.