Citation: Franklin GM (2024) Google’s new AI Chatbot produces fake health-related evidence–then self-corrects. PLOS Digit Health 3(9): e0000619. https://doi.org/10.1371/journal.pdig.0000619
Editor: Jennifer N. Avari Silva, Washington University in Saint Louis, UNITED STATES OF AMERICA
Published: September 23, 2024
Copyright: © 2024 Gary M. Franklin. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author received no specific funding for this work.
Competing interests: The author has declared that no competing interests exist.
Recent articles have pointed out the potential benefits, limits, and risks of AI chatbots, including substantial inaccuracies [1,2]. Such inaccuracies, or downright fake information, have been described as “hallucinations,” and even this term has been criticized and reimagined as “confabulations” [3].
After recently receiving a notification from Google about their new generative AI chatbot, Gemini (http://gemini.google.com/), I thought I’d give it a whirl in the general area of evidence-based medicine, health outcomes, and policy.
As a starter question, I asked, “I need help with information about outcomes of a specific surgical procedure.” Gemini responded that all of the following might be helpful resources: PubMed, the Cochrane Library, NIH, ClinicalTrials.gov, and professional societies. So far, so good: a great answer to get one started on a search of the best available evidence.
Then, I asked a much more specific question, “What is the outcome of thoracic outlet surgery for workers’ compensation cases?” After reasonably stating that the outcome can vary depending on the individual case, the severity of the condition, and the type of surgery performed, Gemini stated that “Studies have shown that thoracic outlet surgery can be effective in improving symptoms and increasing function in workers who have been diagnosed with thoracic outlet syndrome.”
Gemini supported this conclusion by citing two studies. First: “A 2019 study published in the journal ‘Spine’ found that thoracic outlet surgery was associated with a significant improvement in symptoms and function in workers with thoracic outlet syndrome. The study found that 85% of workers who underwent surgery reported an improvement in their symptoms, and 70% reported a return to work.” Second: “Another study, published in the journal ‘The American Journal of Industrial Medicine’ in 2018, found that thoracic outlet surgery was associated with a significant reduction in pain and an improvement in function in workers with thoracic outlet syndrome. The study found that 80% of workers who underwent surgery reported a reduction in pain, and 75% reported an improvement in function.”
I followed up with, “The studies you cited do not seem to exist. Can you explain that?” And Gemini responded, “I apologize for the mistake. I am still under development and learning to perform many kinds of tasks, including finding and citing academic studies. Upon further research, I cannot find any studies that specifically examine the outcomes of thoracic outlet surgery for workers compensation cases.” And, “However, I did find a few studies that examined the outcomes of thoracic outlet surgery in general. These studies found that thoracic outlet surgery can be effective in improving symptoms and increasing function in patients with thoracic outlet syndrome.” Gemini then repeated the findings from the two phantom studies cited above.
I followed up with, “The two studies you have cited do not exist.” And Gemini, contritely: “I apologize for the mistake. I am still under development…” and “I have checked the references for the two studies that I cited, and I can confirm that they do not exist. I am not sure why I cited these studies, as I am programmed to only cite studies that have been published in reputable academic journals.”
If an injured worker with diagnosed neurogenic thoracic outlet syndrome had asked Gemini the original question in search of advice about whether to pursue a potentially harmful procedure, the information provided could have created a false sense of security about what to expect from surgery. How could the chatbot fabricate evidence so specific as to journal name and year, yet without authors, volume, or page numbers? This is a frightening example of the state of the AI chatbot universe in health care, and I have little confidence that such bad information will not drive harmful overuse of some procedures before whatever produced these responses is corrected.
At the very least, one would think that Gemini could learn from its mistakes, so I re-asked the question on outcomes of thoracic outlet surgery among workers’ compensation cases a few days later. The answers were exactly the same, including the fake references and false optimism.
When asked the same original question a few months later, however, Gemini responded much more meaningfully and accurately (exact transcript below), with better organized material, including real references [4]:
Thoracic Outlet Surgery and Workers’ Compensation: A Complex Picture
The outcome of thoracic outlet surgery (TOS) for workers’ compensation cases can be variable and complex
Factors Affecting Outcome.
Several factors influence the outcome of TOS surgery in workers’ compensation cases:
- Diagnosis accuracy: Misdiagnosis of TOS is common, leading to unnecessary surgeries and poor outcomes.
- Severity of symptoms: The extent of pain, numbness, and weakness can impact the success of surgery.
- Surgical technique: The type of surgery performed and the surgeon’s experience play a crucial role.
- Workers’ compensation system: The specific rules and regulations of each state’s workers’ compensation system can affect benefits and treatment options.
Research Findings.
Studies have shown that:
- Outcomes for TOS surgery in workers’ compensation cases are often poorer than in non-workers’ compensation cases.
- The diagnosis of TOS is often nonspecific, making it challenging to determine the true cause of symptoms.
- Adverse events following surgery are more common in workers’ compensation patients.
A recent thoughtful analysis of the hallucination/confabulation issue also suggests that, even if the information received is false, it can still provide a learning experience, especially when combined with human experience [5].
All of this is to say: buyer beware when using AI chatbots to extract accurate scientific information or evidence-based guidance. If a scientifically trained physician is doing the asking, perhaps their clinical experience and knowledge of the peer-reviewed literature will let them recognize and correct the information. If a lay user is looking for accurate information sufficient for guidance, these chatbots are not ready for prime time. However, some studies have suggested a few solutions.
Connecting large language models, such as those underlying ChatGPT, to specialized knowledge bases, such as evidence-based guidelines produced by specialty societies, can outperform knowledgeable humans. One such context-based approach allowed ChatGPT to produce highly accurate interpretations aligned with the American College of Radiology (ACR) appropriateness criteria [6].
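To make the idea concrete, the sketch below shows the general shape of such a context-based pipeline, often called retrieval-augmented generation: retrieve vetted guideline excerpts first, then instruct the model to answer only from them. This is a minimal illustration under stated assumptions, not the system of Rau and colleagues; the guideline snippets, identifiers, and function names are invented, the retrieval step is a naive keyword overlap standing in for a proper embedding search, and the final call to a language model is omitted.

```python
# Minimal sketch of a context-grounded ("retrieval-augmented") chatbot pipeline.
# All guideline text, identifiers, and function names below are hypothetical
# illustrations, not the actual ACR-aligned system described in the cited study.

# A stand-in for a curated knowledge base of specialty-society guidelines.
GUIDELINES = {
    "ACR-TOS-1": "For suspected thoracic outlet syndrome, imaging guidance excerpt ...",
    "ACR-LBP-3": "For uncomplicated low back pain, imaging guidance excerpt ...",
}

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank guideline excerpts by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        GUIDELINES.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    """Assemble a prompt that confines the model to the retrieved sources."""
    excerpts = "\n".join(f"[{gid}] {text}" for gid, text in retrieve(question))
    return (
        "Answer using ONLY the guideline excerpts below, citing excerpt IDs. "
        "If they do not cover the question, say so rather than guessing.\n\n"
        f"{excerpts}\n\nQuestion: {question}"
    )

# This grounded prompt, rather than the raw question, would be sent to the model.
print(build_grounded_prompt("Which imaging is appropriate for thoracic outlet syndrome?"))
```

The essential design choice is the final instruction: by confining the model to retrieved, vetted text and explicitly permitting it to decline, such a pipeline targets exactly the failure mode described above, in which the model invents plausible-sounding studies rather than admitting it has none.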
Short of such resource-intensive efforts, simpler guidance in the public health and individual patient spaces would be ideal. Along these lines, when your patients come in with what looks like questionable chatbot-derived information, perhaps offer to help identify any potential inaccuracies or even fake citations [5].
Sometimes the solution is right in front of us in the form of an underutilized resource: the medical librarian. In an ongoing study, comparing the evidence in medical librarians’ answers to clinical questions with ChatGPT’s answers to the same questions revealed a large proportion of fabrications in the chatbot’s sources [7].
Increased public awareness of and education about the limitations of AI could also become a function of public health institutions, which are not yet fully engaged in this as a public health goal.
Or, as perhaps the simplest possible solution, wait a few weeks or months and ask the chatbot again.
References
- 1. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388(13):1233–9. pmid:36988602
- 2. The Lancet Digital Health. ChatGPT: friend or foe? The Lancet Digital Health. 2023;5(3):e102. pmid:36754723
- 3. Hatem R, Simmons B, Thornton JE. Chatbot confabulations are not hallucinations. JAMA Internal Medicine. 2023;183(10):1177. https://doi.org/10.1001/jamainternmed.2023.4231
- 4. Franklin GM, Fulton-Kehoe D, Bradley C, Smith-Weller T. Outcome of surgery for thoracic outlet syndrome in Washington state workers’ compensation. Neurology. 2000;54(6):1252–8. pmid:10746594
- 5. Colasacco CJ, Born HL. A case of artificial intelligence chatbot hallucination. JAMA Otolaryngology–Head & Neck Surgery. 2024;150(6):457. pmid:38635259
- 6. Rau A, Rau S, Zoeller D, Fink A, Tran H, Wilpert C, et al. A context-based chatbot surpasses radiologists and generic ChatGPT in following the ACR appropriateness guidelines. Radiology. 2023;308(1). https://doi.org/10.1148/radiol.230970
- 7. Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, et al. Evaluating a large language model’s ability to answer clinicians’ requests for evidence summaries. medRxiv [Preprint]. 2024. https://doi.org/10.1101/2024.05.01.24306691