Large language models and their performance for the diagnosis of histoplasmosis

Progressive disseminated histoplasmosis remains a major but largely undiagnosed AIDS-defining opportunistic infection [1]. Low awareness of the disease, its protean presentation, and the lack of rapid diagnostic tools in most endemic areas still cause potentially fatal delays in antifungal treatment [2]. This is particularly true in South and Central America, where disease incidence is high, but probably also in Africa and Asia, where it is suspected to be mostly overlooked [3–5]. In this context, large language models (LLMs) processing written case descriptions could help avoid missed diagnoses and reduce dangerous treatment delays [6,7]. About a year ago we published a letter on the poor capacity of ChatGPT 3.5 to identify vignettes of HIV-associated histoplasmosis: when prompted “what are the diagnostic hypotheses”, it missed 16 of 20 vignettes [8]. Since then, the capacity of AI has continued to improve, and we hypothesized that the poor performance we had observed with older versions of ChatGPT may have improved. We also wished to test the ability of other LLMs to identify histoplasmosis.

We thus retested the ability of different LLMs to suggest a diagnosis of histoplasmosis from the same 20 clinical vignettes of people with HIV-associated histoplasmosis on which ChatGPT 3.5 had stumbled in early 2024. Ten of these vignettes were drawn from published case reports and 10 were from our histoplasmosis cohort (S1 Appendix) [8]. We removed the identification of Histoplasma capsulatum before asking ChatGPT 3.5, ChatGPT 4.0, Microsoft Copilot, Google’s Gemini, and DeepSeek for the diagnostic hypotheses for the 20 clinical vignettes. We then examined the outputs to see whether histoplasmosis was mentioned as a differential diagnosis.
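The check described above (was histoplasmosis mentioned in the output?) can be sketched in Python. The article describes reading the outputs, not automated matching, so the keyword patterns, category labels, and example outputs below are illustrative assumptions. A filter like this would only pre-sort outputs for human review.

```python
import re

# Assumed keyword patterns (hypothetical; not taken from the article).
HISTO_PATTERN = re.compile(r"histoplasm", re.IGNORECASE)
FUNGAL_PATTERN = re.compile(r"fungal infection|mycos[ie]s", re.IGNORECASE)


def classify_output(text: str) -> str:
    """Label one LLM output by the most specific diagnosis it mentions.

    Naive keyword matching: it does not detect negation
    (e.g. "histoplasmosis is unlikely"), so a reader must still check.
    """
    if HISTO_PATTERN.search(text):
        return "histoplasmosis"
    if FUNGAL_PATTERN.search(text):
        return "fungal infection"
    return "not mentioned"


def tally(outputs: list[str]) -> dict[str, int]:
    """Count how many outputs fall into each category."""
    counts = {"histoplasmosis": 0, "fungal infection": 0, "not mentioned": 0}
    for text in outputs:
        counts[classify_output(text)] += 1
    return counts


# Invented example outputs, for illustration only.
example_outputs = [
    "1. Tuberculosis\n2. Disseminated histoplasmosis\n3. Lymphoma",
    "Consider an invasive fungal infection or visceral leishmaniasis.",
    "Tuberculosis, lymphoma, cytomegalovirus infection.",
]
print(tally(example_outputs))
```

Keeping "fungal infection" as a separate category mirrors the way the results distinguish a specific mention of histoplasmosis from a broader mention of fungal disease.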

Thus, on March 3rd, 2025, after uploading the 20 vignettes, we prompted the different LLMs with “what are the diagnostic hypotheses”. The results are summarized in Fig 1 and detailed by vignette in S1 Table. ChatGPT 3.5 and 4.0 listed histoplasmosis or fungal infection among the differential diagnoses in 15 and 14 of the 20 histoplasmosis vignettes, respectively. By contrast, Gemini and DeepSeek each listed histoplasmosis for only 3 of the 20 vignettes. We then replaced all case locations with Indianapolis, a well-known histoplasmosis hotspot, to see whether this affected the results. This led ChatGPT 4.0 to revise its diagnostic hypotheses for 16 of the 20 vignettes. This was, however, not the case for ChatGPT 3.5, DeepSeek, or Gemini, whose outputs looked quite stereotypical for the shorter vignettes (S1 Appendix). Surprisingly, Microsoft Copilot, which is ChatGPT 4-based, listed histoplasmosis for 18 of the 20 vignettes and invasive fungal infection for 1 more, which is still information suggesting the need for rapid initiation of antifungal treatment.

Admittedly, different prompts may yield more or fewer alternative diagnoses, and asking for the top 10 diagnoses will increase the probability that histoplasmosis is listed [9]. However, since culture takes weeks and treatment delays may be fatal, having histoplasmosis ranked as one of a long list of diseases is not likely to translate into early treatment. By contrast, experienced clinicians from areas endemic for Histoplasma capsulatum will usually zoom in rapidly on the diagnosis of histoplasmosis when given the same vignettes [10]. This is exactly what ChatGPT now does. ChatGPT uses Bayesian-like reasoning but does not explicitly compute probabilities. Instead, based on prior knowledge, it weighs likelihoods and updates diagnoses dynamically when new data are introduced. When we asked whether the publication of more prevalence or incidence studies in more countries would improve ChatGPT’s ability to provide the correct diagnosis, the answer was: “if more prevalence or incidence studies were published and incorporated into CHATGPT’s knowledge base, it would significantly enhance diagnostic accuracy by refining the prior probabilities of diseases in different regions”.
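To make the idea of "refining the prior probabilities of diseases in different regions" concrete, here is a minimal sketch of Bayes' rule applied to a differential diagnosis. It does not describe how ChatGPT works internally; as noted above, the model does not explicitly compute probabilities. Every number below is a made-up placeholder, not an epidemiological estimate, and the disease list and region labels are assumptions chosen for illustration.

```python
def posterior(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """Bayes' rule over a closed set of diagnoses.

    prior[d]      = P(d) before seeing the vignette (depends on region)
    likelihood[d] = P(findings in the vignette | d)
    Returns P(d | findings), normalized to sum to 1.
    """
    unnormalized = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(unnormalized.values())
    return {d: value / total for d, value in unnormalized.items()}


# Hypothetical likelihoods of one vignette's findings under each diagnosis.
likelihood = {"histoplasmosis": 0.6, "tuberculosis": 0.5, "other": 0.1}

# Hypothetical priors: same patient, two different assumed locations.
prior_low_awareness = {"histoplasmosis": 0.02, "tuberculosis": 0.50, "other": 0.48}
prior_known_hotspot = {"histoplasmosis": 0.30, "tuberculosis": 0.30, "other": 0.40}

for label, prior in [("low-awareness region", prior_low_awareness),
                     ("known hotspot", prior_known_hotspot)]:
    post = posterior(prior, likelihood)
    print(f"{label}: P(histoplasmosis | findings) = {post['histoplasmosis']:.2f}")
```

With identical clinical findings, only the prior changes between the two runs, which is the mechanism by which better regional incidence data, or a change of case location, could move histoplasmosis up the list.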

Fig 1. Performance of different large language models when faced with histoplasmosis vignettes and prompted with "what are the diagnostic hypotheses".

https://doi.org/10.1371/journal.pntd.0013151.g001

We show that, in 2025, the best results (90% sensitivity) were given by Microsoft Copilot, that ChatGPT has greatly improved at identifying histoplasmosis, and that it performs much better than Gemini or DeepSeek. Differences in training focus, fine-tuning, context handling, and safety constraints could explain the variations in diagnostic accuracy between models. The striking performance of Microsoft Copilot was on par with the sensitivity of recent antigen detection tests, which is above 90% [11,12]. Given the rapid progress of LLMs, such performance should be reevaluated regularly and perhaps extended to vignettes of persons with advanced HIV without histoplasmosis in order to estimate specificity. The strength of the present data is that they rely on a fixed set of histoplasmosis vignettes, which allows comparison over time and between LLMs. To our knowledge, the use of LLMs for the diagnosis of HIV-associated histoplasmosis from clinical vignettes has not been studied. More generally, in the context of infectious disease consultations, the black-box nature of LLMs and their tendency to confabulate have raised legitimate concerns about the safety of their indiscriminate use, alongside recognition that this is a rapidly evolving field [13]. Others have warned that LLMs performed significantly worse than physicians and thus were not ready for autonomous clinical decision-making [14]. A randomized study found that diagnostic reasoning performance was not improved by LLMs [15]. Other recent studies emphasized the potential use of LLMs to identify differential diagnoses and their gradual improvement [16]. However, the authors pointed out that correct responses were correlated with availability in the literature but not with actual disease incidence, which is an important avenue for improvement. Less surprisingly, others have emphasized that incorporating laboratory results improved performance [17].
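The sensitivity figures follow directly from the counts reported above, and with only 20 vignettes per model the uncertainty around them is wide. The sketch below recomputes each proportion and adds a 95% Wilson score interval. The interval is an addition for illustration and is not part of the article's analysis. The ChatGPT counts include "histoplasmosis or fungal infection" as reported, and the Copilot count uses the 18 explicit mentions of histoplasmosis.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_VIGNETTES = 20

# Counts as reported in the text (all vignettes are true histoplasmosis cases).
hits = {
    "Microsoft Copilot": 18,
    "ChatGPT 3.5": 15,
    "ChatGPT 4.0": 14,
    "Gemini": 3,
    "DeepSeek": 3,
}

for model, k in hits.items():
    low, high = wilson_interval(k, N_VIGNETTES)
    print(f"{model}: {k}/{N_VIGNETTES} = {k / N_VIGNETTES:.0%} "
          f"(95% CI {low:.0%} to {high:.0%})")
```

Because every vignette is a confirmed case, these proportions estimate sensitivity only. Estimating specificity would require vignettes without histoplasmosis, as suggested above.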

How to integrate this concretely into clinical practice remains to be clarified, as Copilot and ChatGPT operate on written text. However, voice-to-text software could transcribe the clinical vignette at the bedside before the model is asked for diagnostic hypotheses. This suggests that, as for classical grand rounds, quality clinical notes and the ability to verbalize a concise and accurate clinical description of cases will remain crucial skills if LLMs make their way into the physician’s toolkit.

In conclusion, although we were highly skeptical a year ago about the usefulness of LLMs in the very specific context of HIV-associated histoplasmosis, we must admit that a year of progress has tilted our position toward a more favorable view. We show for the first time that LLMs may have potential as a point-of-care tool for the differential diagnosis of diseases that are neglected [18] and hard to diagnose.

Supporting information

S1 Appendix. The supplementary file lists 20 vignettes of HIV-associated disseminated histoplasmosis case reports.

https://doi.org/10.1371/journal.pntd.0013151.s001

(DOCX)

S1 Table. Is a diagnosis of histoplasmosis suggested by AI when asked to give diagnostic hypotheses?

https://doi.org/10.1371/journal.pntd.0013151.s002

(DOCX)

References

  1. Adenis AA, Valdes A, Cropet C, McCotter OZ, Derado G, Couppie P, et al. Burden of HIV-associated histoplasmosis compared with tuberculosis in Latin America: a modelling study. Lancet Infect Dis. 2018;18(10):1150–9. pmid:30146320
  2. Caceres DH, Adenis A, de Souza JVB, Gomez BL, Cruz KS, Pasqualotto AC, et al. The Manaus declaration: current situation of histoplasmosis in the Americas, Report of the II regional meeting of the international histoplasmosis advocacy group. Curr Fungal Infect Rep. 2019;13(4):244–9.
  3. Oladele RO, Ayanlowo OO, Richardson MD, Denning DW. Histoplasmosis in Africa: an emerging or a neglected disease? PLoS Negl Trop Dis. 2018;12(1):e0006046. pmid:29346384
  4. Mandengue CE, Ngandjio A, Atangana PJA. Histoplasmosis in HIV-infected persons, Yaoundé, Cameroon. Emerg Infect Dis. 2015;21(11):2094–6. pmid:26488076
  5. Baker J, Setianingrum F, Wahyuningsih R, Denning DW. Mapping histoplasmosis in South East Asia—implications for diagnosis in AIDS. Emerg Microbes Infect. 2019;8(1):1139–45. pmid:31364950
  6. Kuroiwa T, Sarcon A, Ibara T, Yamada E, Yamamoto A, Tsukamoto K, et al. The potential of ChatGPT as a self-diagnostic tool in common orthopedic diseases: exploratory study. J Med Internet Res. 2023;25:e47621. pmid:37713254
  7. Hirosawa T, Kawamura R, Harada Y, Mizuta K, Tokumasu K, Kaji Y, et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform. 2023;11:e48808. pmid:37812468
  8. Nacher M, Françoise U, Adenis A. ChatGPT neglects a neglected disease. Lancet Infect Dis. 2024;24(2):e76. pmid:38211603
  9. Armitage R. ChatGPT does not neglect a neglected disease, if appropriately prompted. Lancet Infect Dis. 2024;24(3):e155. pmid:38242141
  10. Nacher M, Françoise U, Adenis A. ChatGPT does not neglect a neglected disease, if appropriately prompted—authors’ reply. Lancet Infect Dis. 2024;24(4):e213. pmid:38359856
  11. Martínez-Gamboa A, Niembro-Ortega MD, Torres-González P, Santiago-Cruz J, Velázquez-Zavala NG, Rangel-Cordero A, et al. Diagnostic accuracy of antigen detection in urine and molecular assays testing in different clinical samples for the diagnosis of progressive disseminated histoplasmosis in patients living with HIV/AIDS: a prospective multicenter study in Mexico. PLoS Negl Trop Dis. 2021;15(3):e0009215. pmid:33684128
  12. Caceres DH, Knuth M, Derado G, Lindsley MD. Diagnosis of progressive disseminated histoplasmosis in advanced HIV: a meta-analysis of assay analytical performance. J Fungi (Basel). 2019;5(3):76. pmid:31426618
  13. Schwartz IS, Link KE, Daneshjou R, Cortés-Penfield N. Black box warning: large language models and the future of infectious diseases consultation. Clin Infect Dis. 2024;78(4):860–6. pmid:37971399
  14. Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613–22. pmid:38965432
  15. Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open. 2024;7(10):e2440969. pmid:39466245
  16. Ríos-Hoyo A, Shan NL, Li A, Pearson AT, Pusztai L, Howard FM. Evaluation of large language models as a diagnostic aid for complex medical cases. Front Med (Lausanne). 2024;11:1380148. pmid:38966538
  17. Bhasuran B, Jin Q, Xie Y, Yang C, Hanna K, Costa J, et al. Preliminary analysis of the impact of lab results on large language model generated differential diagnoses. NPJ Digit Med. 2025;8(1):166. pmid:40102561
  18. Nacher M, Adenis A, Mc Donald S, Do Socorro Mendonca Gomes M, Singh S, Lopes Lima I, et al. Disseminated histoplasmosis in HIV-infected patients in South America: a neglected killer continues on its rampage. PLoS Negl Trop Dis. 2013;7(11):e2319. pmid:24278485