Abstract
Traditional deep learning models for lung sound analysis require large, labeled datasets, whereas multimodal large language models (LLMs) may offer a flexible, prompt-based alternative. This study aimed to evaluate the utility of a general-purpose multimodal LLM, GPT-4o, for lung sound classification from mel-spectrograms and assess whether a few-shot prompt approach improves performance over zero-shot prompting. Using the ICBHI 2017 Respiratory Sound Database, 6898 annotated respiratory cycles were converted into mel-spectrograms. GPT-4o was prompted to classify each spectrogram using both zero-shot and few-shot strategies. Model outputs were evaluated against ground truth labels using performance metrics including accuracy, precision, recall, and F1-score. Few-shot prompting improved overall accuracy (0.363 vs. 0.320) and yielded modest gains in precision (0.316 vs. 0.283), recall (0.300 vs. 0.287), and F1-score (0.308 vs. 0.285) across labels. McNemar’s test indicated a statistically significant difference in performance between prompting strategies (p < 0.001). Model repeatability analysis demonstrated high agreement (κ = 0.76–0.88; agreement: 89–96%), indicating excellent consistency. GPT-4o demonstrated limited but statistically significant performance gains using few-shot prompting for lung sound classification. While current performance remains insufficient for clinical deployment, this prompt-based approach provides a baseline for spectrogram-based multimodal tasks and a foundation for future exploration of prompt-based multimodal inference.
Author summary
Lung sounds, such as wheezes and crackles, can offer important clues about respiratory health. Traditionally, doctors use a stethoscope to listen for these sounds, but interpreting them accurately can be difficult and often requires experience. We wanted to explore whether a new type of artificial intelligence (AI), called a multimodal language model, could help identify different lung sounds by looking at visual representations of those sounds known as spectrograms. Specifically, we used a model called GPT-4o, which is designed to understand both images and text, and tested whether giving it a few examples of labeled lung sounds would help it perform better. We found that this few-example or “few-shot” prompting approach led to modest but meaningful improvements in how accurately the model could identify different types of lung sounds compared to giving it no examples at all. While the model’s current performance is insufficient for clinical deployment, our findings establish a foundational baseline, demonstrating that general-purpose AI tools can exhibit in-context learning to improve lung sound classification. This provides a direction for developing flexible and accessible AI support in resource-limited settings.
Citation: Dietrich N, McShannon D, Rzepka MF (2026) Evaluating few-shot prompting for spectrogram-based lung sound classification using a multimodal language model. PLOS Digit Health 5(1): e0001179. https://doi.org/10.1371/journal.pdig.0001179
Editor: Dhiya Al-Jumeily OBE, Liverpool John Moores University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: July 31, 2025; Accepted: December 17, 2025; Published: January 7, 2026
Copyright: © 2026 Dietrich et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code used for spectrogram generation and model inference, along with the specific prompt text and associated processing parameters, is available from rdm@utoronto.ca upon reasonable request. The dataset used in this study is available from the ICBHI 2017 Database: Rocha, B. M., Filos, D., Mendes, L., Serbes, G., Ulukaya, S., Kahya, Y. P., Jakovljevic, N., Turukalo, T. L., Vogiatzis, I. M., Perantoni, E., Kaimakamis, E., Natsiavas, P., Oliveira, A., Jácome, C., Marques, A., Maglaveras, N., Pedro Paiva, R., Chouvarda, I., & de Carvalho, P. (2019). An open access database for the evaluation of respiratory sound classification algorithms. Physiological measurement, 40(3), 035001. https://doi.org/10.1088/1361-6579/ab03ea. Email contact: icbhi_challenge@med.auth.gr.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Lung sounds provide critical insights into pulmonary health and underlying disease. Traditionally, auscultation with a stethoscope has been the gold standard for bedside respiratory assessment [1–3]. However, advancements in technology have now enabled the use of electronic devices, such as digital stethoscopes and audio recorders, to capture and analyze respiratory sounds. These tools can be used to facilitate longitudinal monitoring, remote sharing, and integration of audio data into research pipelines for model development [4,5].
In addition, recent efforts have focused on developing machine learning (ML) models for automated lung sound classification, particularly in detecting adventitious sounds such as crackles and wheezes [6,7]. A common approach involves converting audio recordings into spectrograms and applying deep learning architectures, such as convolutional neural networks (CNNs), to build binary classifiers or detection models [8–10]. Despite promising performance, these models typically require substantial computational resources and extensive dataset-specific optimization, constraining their generalizability and limiting their use beyond specialized centres.
A potential alternative to CNNs for automated lung sound analysis is the use of general-purpose multimodal large language models (LLMs) [11]. Unlike traditional deep learning models, which rely on large, labeled datasets and require dedicated training for specific tasks, multimodal LLMs leverage pre-trained knowledge across diverse data formats, such as images, videos, and code. This flexibility may reduce the need for building task-specific architectures from scratch and instead shift the focus toward designing effective prompts. In this context, prompting strategies, which refer to structured instructions that shape how the model interprets and responds to an input, serve as a mechanism to optimize performance without explicit retraining. Such approaches may offer a scalable pathway for applying multimodal LLMs to defined tasks, like respiratory sound classification [12].
One such strategy is few-shot prompting, where a model is given a small set of labeled examples within the prompt to guide its output. Unlike zero-shot prompting, which provides no prior examples, few-shot prompting enables the model to infer patterns from limited contextual information. This strategy has been shown to improve LLM output accuracy for well-defined text-based tasks [12–14]. For example, prior research has demonstrated that few-shot prompting enhances performance in tasks such as differential diagnosis generation and text classification [15]. However, whether this in-context learning extends to multimodal tasks, particularly in the classification of lung sound spectrograms, remains unexplored.
To our knowledge, no prior studies have evaluated the performance of a general-purpose multimodal LLM for lung sound analysis or investigated the impact of few-shot prompting in this context. Therefore, this study aims to address these gaps by assessing the ability of a multimodal LLM to classify lung sounds from mel-spectrograms and by comparing the baseline performance of few-shot versus zero-shot prompting.
Materials and methods
Population
This study utilized the previously validated International Conference on Biomedical and Health Informatics (ICBHI) 2017 Respiratory Sound Database [16], a publicly available dataset comprising lung sound recordings from 126 subjects across multiple clinical sites. The cohort included 46 females, 79 males, and 1 participant with missing sex data. Age distribution included 49 pediatric (<18 years), 22 adult (18–64 years), and 54 older adult (≥65 years) participants (1 participant with missing age data), with a median age of 60.0 years (interquartile range: 4.0–70.25). Audio recordings were collected using four types of electronic stethoscopes and microphones. The dataset contains 920 annotated recordings segmented into 6898 respiratory cycles (inspiration and expiration), labeled as “crackles” (n = 1864), “wheezes” (n = 886), “both” (presence of both crackles and wheezes; n = 506), or “normal” (absence of both crackles and wheezes; n = 3642) based on expert consensus.
Spectrogram generation
Each audio file was segmented into individual respiratory cycles based on the start and end times provided in the ICBHI database [16]. Respiratory cycles were resampled to 44.1 kHz for uniformity, and a custom Python pipeline was developed to generate mel-spectrograms from these cycles, applying a Short-Time Fourier transform for frequency decomposition [17,18]. The mel-spectrograms were generated using 128 mel bands, a window size of 2048 samples, and a hop length of 512 samples, ensuring adequate time-frequency resolution. A viridis color filter was applied to enhance visual interpretability [19], and each spectrogram was structured with frequency (Hz) on the y-axis, time (s) on the x-axis, and amplitude in decibels (dB) to preserve critical acoustic features. Each spectrogram was saved in PNG format to maintain consistency in downstream analysis.
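For illustration, the following is a minimal sketch of how such a preprocessing pipeline could be implemented with librosa and matplotlib. The parameters (44.1 kHz resampling, 128 mel bands, 2048-sample window, 512-sample hop, viridis colormap, dB amplitude scale, PNG output) follow the description above; the function name, figure dimensions, and file handling are illustrative assumptions rather than the study's exact code, which is available on request.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np


def cycle_to_mel_png(wav_path, start_s, end_s, out_png,
                     sr=44100, n_mels=128, n_fft=2048, hop_length=512):
    """Convert one annotated respiratory cycle to a mel-spectrogram PNG."""
    # Load only this respiratory cycle, resampled to 44.1 kHz for uniformity
    y, sr = librosa.load(wav_path, sr=sr, offset=start_s, duration=end_s - start_s)

    # STFT-based mel-spectrogram (128 mel bands, 2048-sample window, 512-sample hop),
    # converted to decibels to preserve amplitude information
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)

    # Render with the viridis colormap: time (s) on the x-axis, frequency (Hz) on the y-axis
    fig, ax = plt.subplots(figsize=(6, 4))
    img = librosa.display.specshow(S_db, sr=sr, hop_length=hop_length,
                                   x_axis="time", y_axis="mel",
                                   cmap="viridis", ax=ax)
    fig.colorbar(img, ax=ax, format="%+2.0f dB")
    fig.savefig(out_png, dpi=150, bbox_inches="tight")
    plt.close(fig)
```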
Large language model selection
GPT-4o (model GPT-4o-2024-08-06) [20] was selected as the multimodal LLM for this study due to its native vision capabilities and widespread use in prior multimodal research, including recent work demonstrating its ability to interpret non-medical spectrograms [21]. Developed by OpenAI, GPT-4o was trained on a diverse dataset that includes image, video, audio, and text modalities up to October 2023. The model was accessed via the OpenAI API, which allowed precise control over temperature (set to 0 to minimize non-determinism), response length, and token constraints. Following common practice, the API was utilized in a stateless manner with data retention disabled, ensuring the model’s underlying weights did not change or “learn” during the study period [22]. Prompts were submitted programmatically and sequentially through the API, maintaining uniformity across model queries while minimizing variability in responses.
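As a sketch of this configuration, the example below shows how a single spectrogram could be submitted through the OpenAI Python SDK with temperature set to 0 and a constrained response length. The prompt wording, token limit, and helper function names are assumptions for demonstration and do not reproduce the study's exact prompt text.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_png(path: str) -> str:
    """Base64-encode a spectrogram PNG for inline submission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def classify_zero_shot(png_path: str) -> str:
    """Submit one spectrogram with a zero-shot instruction and return the raw label text."""
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        temperature=0,      # minimize non-determinism
        max_tokens=20,      # constrain the reply to a short label
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("This is a mel-spectrogram of one respiratory cycle. "
                          "Classify it as exactly one of: normal, crackles, wheezes, both.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_png(png_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```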
Prompt engineering and development
A standardized prompt structure was developed for both zero-shot and few-shot strategies. For the zero-shot strategy, GPT-4o was presented with the spectrogram and a predefined task question regarding lung sound classification, without any prior exposure to labeled examples (Fig 1). In contrast, the few-shot prompts incorporated examples of labeled reference spectrograms, providing the model with input-output pairs before classification of the spectrogram under evaluation (Fig 2). The reference spectrograms were derived from previously validated audio samples, representing crackles, wheezes, both, and normal sounds, obtained from an external dataset [23], and generated using the same preprocessing pipeline as the ICBHI database. This approach ensured a strict separation from the ICBHI testing set to prevent data leakage [24,25]. Few-shot exemplars were manually curated based on their spectral clarity and characteristic representation of the intended auscultatory category. The few-shot prompt design was informed by multimodal prompting strategies from prior studies, which have been shown to improve classification accuracy when contextual examples are provided [21,26].
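The sketch below illustrates one way the few-shot structure described above could be assembled: each labeled reference spectrogram is interleaved as an input-output pair ahead of the spectrogram under evaluation. The exemplar file names, system instruction, and wording are hypothetical placeholders, not the study's curated prompt.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical curated exemplars from the external reference dataset (one per class)
EXEMPLARS = [
    ("ref_normal.png", "normal"),
    ("ref_crackles.png", "crackles"),
    ("ref_wheezes.png", "wheezes"),
    ("ref_both.png", "both"),
]


def image_part(path: str) -> dict:
    """Wrap a PNG as an inline image content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}


def classify_few_shot(test_png: str) -> str:
    """Interleave labeled example spectrograms (input-output pairs) before the test case."""
    messages = [{"role": "system",
                 "content": "You classify lung sound mel-spectrograms as normal, crackles, wheezes, or both."}]
    for path, label in EXEMPLARS:
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": "Example spectrogram:"},
                                     image_part(path)]})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user",
                     "content": [{"type": "text",
                                  "text": "Classify this spectrogram using the same single-word labels."},
                                 image_part(test_png)]})
    resp = client.chat.completions.create(model="gpt-4o-2024-08-06",
                                          temperature=0, max_tokens=20,
                                          messages=messages)
    return resp.choices[0].message.content.strip()
```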
Data processing architecture
Each spectrogram was classified using the zero-shot and the few-shot prompts, with model settings and computational conditions identical across both prompting strategies. A synchronous processing pipeline was implemented to ensure that each spectrogram was sequentially processed, with each API request waiting for the previous request to complete before execution of the next. This controlled approach minimized variability and ensured that all classifications were performed under standardized conditions. The retrieved text outputs generated by GPT-4o were stored in .txt format, and a Python post-processing script was used to convert these outputs into a structure consistent with the ICBHI annotation format [16].
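A minimal sketch of this post-processing step is shown below, assuming each stored .txt output contains a short free-text label for one respiratory cycle. The parsing rules, file naming, and CSV layout are illustrative assumptions consistent with the ICBHI convention of separate crackle and wheeze presence flags.

```python
import csv
from pathlib import Path


def parse_label(text: str) -> tuple[int, int]:
    """Map the model's free-text reply to ICBHI-style (crackles, wheezes) presence flags."""
    t = text.lower()
    both = "both" in t
    return int("crackle" in t or both), int("wheeze" in t or both)


def postprocess(output_dir: str, out_csv: str) -> None:
    """Convert stored .txt outputs (one per respiratory cycle) into annotation rows."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["cycle_id", "crackles", "wheezes"])
        for txt in sorted(Path(output_dir).glob("*.txt")):
            crackles, wheezes = parse_label(txt.read_text())
            writer.writerow([txt.stem, crackles, wheezes])
```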
Statistical analysis
The classification outputs from GPT-4o, using both zero-shot and few-shot strategies, were compared to the ground truth labels from the ICBHI dataset [16]. Performance metrics included accuracy, sensitivity (recall), specificity, precision (positive predictive value [PPV]), negative predictive value (NPV), and F1-score, computed as previously described [8,27]. These metrics were calculated both overall and on a per-label basis. Balanced accuracy (macro-averaged recall across the four classes) was also calculated to account for class imbalance. Raw classification counts for both prompting strategies were summarized using confusion matrices. Differences in overall classification performance between zero-shot and few-shot prompt strategies were assessed using McNemar’s test. Statistical significance was set at a two-sided p-value of 0.05.
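The sketch below indicates how the headline metrics and the paired McNemar comparison could be computed with scikit-learn and statsmodels. The exact metric definitions used in the study (for example, macro-averaged versus count-aggregated precision and recall) may differ, so this is a schematic rather than the study's analysis code.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)
from statsmodels.stats.contingency_tables import mcnemar

LABELS = ["normal", "crackles", "wheezes", "both"]


def evaluate(y_true, y_pred):
    """Overall accuracy, balanced accuracy, macro precision/recall/F1, and the 4x4 confusion matrix."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, labels=LABELS,
                                           average="macro", zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, labels=LABELS,
                                     average="macro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, labels=LABELS,
                             average="macro", zero_division=0),
        "confusion": confusion_matrix(y_true, y_pred, labels=LABELS),
    }


def compare_prompts(y_true, pred_zero, pred_few):
    """McNemar's test on the paired correct/incorrect outcomes of the two prompting strategies."""
    zero_correct = np.asarray(pred_zero) == np.asarray(y_true)
    few_correct = np.asarray(pred_few) == np.asarray(y_true)
    table = [[np.sum(zero_correct & few_correct), np.sum(zero_correct & ~few_correct)],
             [np.sum(~zero_correct & few_correct), np.sum(~zero_correct & ~few_correct)]]
    return mcnemar(table, exact=False, correction=True)  # .statistic, .pvalue
```

The McNemar table is built from the discordant pairs (cycles correct under one strategy but not the other), which is what drives the test statistic.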
For the per-label analysis, two analytical approaches were employed [16,27]. First, in the multiclass classification approach, the model was treated as a four-label classifier with mutually exclusive categories: only crackles, only wheezes, both, and normal. Second, in the binary classification approach, the model was treated as two independent binary classifiers, one for crackles (present/absent) and one for wheezes (present/absent), allowing for separate identification of each sound even when both were present in a single respiratory cycle. We also conducted a subgroup analysis for the “both” and “normal” ground truth labels to compare model performance on detecting crackles versus wheezes within these categories. This included calculating the frequency of each label prediction and evaluating differences between zero- and few-shot prompting using McNemar’s test for paired binary outcomes.
To assess model output repeatability, 10% of the dataset was randomly selected and re-evaluated using the same zero- and few-shot prompts under identical model conditions. This repeatability analysis was chosen over cross-validation because the study involves inference on a pre-trained LLM without model training or weight updates [28]. Classification labels for crackles and wheezes were compared between the initial and repeated outputs. For each condition, both Cohen’s kappa and percentage agreement were calculated to quantify the consistency between runs.
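A brief sketch of the agreement statistics is shown below, assuming per-cycle binary presence flags for a single label (e.g., crackles) from the initial and repeated runs on the 10% subset; scikit-learn's cohen_kappa_score is used here for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def repeatability(run1_flags, run2_flags):
    """Cohen's kappa and percentage agreement between an initial and a repeated run
    for one binary label (present/absent)."""
    run1 = np.asarray(run1_flags)
    run2 = np.asarray(run2_flags)
    kappa = cohen_kappa_score(run1, run2)
    agreement_pct = float(np.mean(run1 == run2)) * 100.0
    return kappa, agreement_pct
```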
Ethics statement
This study utilized the publicly available ICBHI 2017 Respiratory Sound Database [16], which comprises previously collected and anonymized recordings of human lung sounds. No new data were collected, and no direct interaction with human participants occurred. As the dataset is fully de-identified and publicly accessible, informed consent and ethical review were not required. Thus, in accordance with institutional research ethics guidelines, this study was determined to be exempt from ethics board review.
Results
Overall classification performance
Overall classification accuracy was 0.320 (95% confidence interval [CI]: 0.309–0.331) for zero-shot prompting and 0.363 (95% CI: 0.352–0.374) for few-shot prompting. For the zero-shot prompting, the model achieved a true positive (TP) count of 935, true negative (TN) of 1271, false positive (FP) of 2371, and false negative (FN) of 2321. These counts corresponded to a precision of 0.283 (95% CI: 0.267–0.298), recall of 0.287 (95% CI: 0.272–0.303), F1-score of 0.285, specificity of 0.349 (95% CI: 0.334–0.364), and NPV of 0.354 (95% CI: 0.338–0.369).
Using few-shot prompting, the model produced counts of 976 TP, 1529 TN, 2113 FP, and 2280 FN. Corresponding performance metrics included a precision of 0.316 (95% CI: 0.300–0.332), recall of 0.300 (95% CI: 0.284–0.315), F1-score of 0.308, specificity of 0.420 (95% CI: 0.404–0.436), and NPV of 0.401 (95% CI: 0.386–0.417). McNemar’s test revealed a statistically significant difference in performance (χ² = 86.47, p < 0.001), indicating that the few-shot strategy significantly outperformed the zero-shot strategy.
For zero-shot prompting, balanced accuracy was 0.292, and few-shot prompting achieved a balanced accuracy of 0.299. Confusion matrices illustrating the distribution of correct and incorrect predictions across all four lung sound categories are provided for zero-shot (Fig 3) and few-shot (Fig 4) prompting strategies.
Fig 3. The matrix displays the distribution of predicted versus ground truth labels across the four lung sound categories (normal, crackles, wheezes, both) for the zero-shot prompting strategy. Values represent the number of respiratory cycles classified into each category. Rows indicate ground truth labels, and columns indicate predicted labels.
Repeatability analysis
The repeatability analysis demonstrated Cohen’s kappa values of 0.88 and 0.78 for zero-shot crackles and wheezes, respectively, and 0.83 and 0.76 for few-shot crackles and wheezes, indicating substantial to almost perfect agreement across model runs. Agreement percentages were high for all comparisons, reported as 95.5% for zero-shot crackles, 89.3% for zero-shot wheezes, 92.6% for few-shot crackles, and 90.6% for few-shot wheezes. No statistically significant differences were observed in overall classification performance between the initial and repeated runs (p > 0.05), supporting the consistency and reliability of model predictions across repeated evaluations.
Per-label classification performance
In the multiclass analysis, accuracy for the “only crackles” label was 0.603 (95% CI: 0.592–0.615) for zero-shot prompting and 0.582 (95% CI: 0.571–0.594) for few-shot prompting. For “only wheezes,” accuracy improved from 0.634 (95% CI: 0.623–0.645) in the zero-shot strategy to 0.737 (95% CI: 0.726–0.747) in the few-shot strategy. The “both” category showed high accuracy for both strategies, with 0.924 (95% CI: 0.918–0.931) for zero-shot and 0.915 (95% CI: 0.909–0.922) for few-shot prompting. Accuracy for the “normal” label was 0.478 (95% CI: 0.466–0.490) for zero-shot and 0.492 (95% CI: 0.480–0.504) for few-shot prompting.
Using the binary classification analysis, accuracy for detecting crackles was 0.563 (95% CI: 0.552–0.575) in the zero-shot strategy and 0.545 (95% CI: 0.533–0.557) in the few-shot strategy. For wheezes, accuracy improved from 0.628 (95% CI: 0.617–0.640) in the zero-shot strategy to 0.710 (95% CI: 0.700–0.721) with few-shot prompting. The remaining per-label performance metrics are reported in Table 1 for the four classes and Table 2 for the two binary classes.
Fig 4. The matrix displays the distribution of predicted versus ground truth labels across the four lung sound categories (normal, crackles, wheezes, both) for the few-shot prompting strategy. Values represent the number of respiratory cycles classified into each category. Rows indicate ground truth labels, and columns indicate predicted labels.
On sub-analysis of the classification performance for the “both” label (where both crackles and wheezes were present in the ground truth), the model identified crackles in 126 zero-shot and 170 few-shot cases, and wheezes in 244 zero-shot and 186 few-shot cases. Only 8 instances had both crackles and wheezes correctly predicted in the zero-shot strategy, compared to 18 in the few-shot strategy. McNemar’s test revealed statistically significant differences in performance between zero- and few-shot strategies for both crackles (χ² = 27.191, p < 0.001) and wheezes (χ² = 38.679, p < 0.001).
For the “normal” label (where both crackles and wheezes were absent), the model correctly predicted absence of crackles in 2,643 zero-shot and 2,383 few-shot cases, and absence of wheezes in 2,256 and 2,737 cases, respectively. Both crackles and wheezes were jointly absent in 1,271 zero-shot and 1,529 few-shot cases. Again, McNemar’s test showed significant differences between zero- and few-shot prompting for both crackles (χ² = 140.337, p < 0.001) and wheezes (χ² = 344.395, p < 0.001).
Discussion
To our knowledge, this study is the first to evaluate the classification of lung sound spectrograms using a general-purpose multimodal LLM and to compare few-shot and zero-shot prompting strategies in a respiratory sound context. Few-shot prompting produced marginal improvements in spectrogram-based classification performance, yielding statistically significant gains over zero-shot prompting across several metrics, including accuracy, specificity, and F1-score for the majority of lung sound labels. These findings align with recent work demonstrating that few-shot prompting can enhance performance on text-based medical tasks [25,26,29]. Despite these improvements, overall model performance remains insufficient for clinical deployment, positioning this work as an exploratory benchmark for visual in-context learning in this domain.
Although GPT-4o was evaluated on a large and clinically diverse dataset using a standardized preprocessing pipeline, the overall performance of both prompting strategies fell below clinically acceptable thresholds, with overall accuracy for zero-shot and few-shot strategies at 32.0% and 36.3%, respectively [8]. These values are lower than those reported for task-specific deep learning models trained on the ICBHI dataset [16], including CNNs, audio spectrogram transformer models, and other architectures, which typically achieve accuracy ranges of approximately 50 to 90% depending on preprocessing, model design, and class balancing techniques [8,27,30–32]. Human benchmark studies similarly report diagnostic accuracy between 55 and 90%, with crackles and wheezes consistently posing greater interpretive difficulty than normal breath sounds [31,33]. For comparison, in the multiclass setting, a random-chance classifier would achieve an accuracy of 25%; thus, the observed 32.0% (zero-shot) and 36.3% (few-shot) performance, while above random chance, underscores the limited discriminative ability of both strategies on this task. Consistent with these findings, multiclass balanced accuracy also remained low (0.292 for zero-shot vs. 0.299 for few-shot), indicating that explicitly adjusting for class imbalance does not materially change the overall performance profile. At the same time, the slight improvement with few-shot prompting suggests a small but measurable benefit in more evenly distributing predictions across classes, though this remains far from closing the gap with task-specific models.
Both prompting strategies showed notably poor performance in identifying lung cycles containing simultaneous crackles and wheezes, indicating that the multimodal LLM may struggle to integrate compound acoustic features. In the multiclass setting, recall for this category remained below 4%, and F1-scores were less than 10%. These findings suggest that when acoustic features overlap or co-occur with considerable variability, GPT-4o may be unable to separate individual sounds without task-specific training [34]. Furthermore, the inconsistent class-wise trends, such as the drop in crackles accuracy (from 60.3 to 58.2%) alongside a rise in wheezes accuracy (from 63.4 to 73.7%) with few-shot prompting, suggest that the effect of few-shot learning is not uniformly beneficial across all classes. This may be due to the more continuous and tonal nature of wheezes, which manifest in spectrograms as sustained horizontal patterns, features that may be more visually distinctive to a multimodal LLM. Crackles, on the other hand, are brief, discontinuous events whose sporadic nature may be less readily interpretable by a vision model without prior domain training [35]. In contrast, the “normal” class showed comparatively strong and consistent performance, likely due to the relative visual uniformity of spectrograms lacking adventitious sounds and the absence of unpredictable crackle or wheeze patterns within a respiratory cycle [35].
Taken together, our results suggest that GPT-4o shows limited but measurable responsiveness to few-shot prompting for spectrogram-based classification. Although its performance remained substantially lower than that of task-specific deep learning models, and well below the improvements typically reported for text-based tasks [26], the modest gains observed with contextual examples indicate a small degree of visual in-context learning without domain-specific training [26]. Considering these findings, exploration of other prompt-based approaches (for example, varying the number or visual characteristics of exemplars, or applying techniques such as chain-of-thought or retrieval-augmented prompting) could help clarify the extent to which prompting strategies influence model performance [36–39]. In parallel, broader evaluation across diverse patient populations, real-world or noisy recordings, and additional adventitious lung sounds such as rhonchi and stridor may help characterize how these models behave across a fuller spectrum of pulmonary acoustics [40–42].
Beyond technical performance, translation of multimodal LLMs into clinical respiratory assessment tools requires addressing significant practical, regulatory, and ethical hurdles. As no LLMs are currently cleared for standalone diagnostic use, robust “human-in-the-loop” oversight is critical for ensuring safety and accountability as these technologies advance toward clinical evaluation and potential deployment [25,43]. This approach aligns with existing regulatory precedents for AI-augmented digital auscultation tools, where authorization is limited to assistive interpretation requiring clinician oversight [44–46]. Additional concerns include liability arising from model errors or “hallucinations”, particularly regarding who bears responsibility if incorrect or misleading outputs contribute to diagnostic delay, inappropriate management, or missed pathology. These issues are further compounded by uncertainty about sustaining reliable performance across diverse patient populations and variable recording conditions, as well as the need for mechanisms to monitor and detect performance drift over time [12,47,48]. The susceptibility of LLMs to adversarial inputs, subtle prompt manipulations, and context-dependent errors adds another layer of complexity that must be carefully examined before clinical adoption [49,50]. Advancing toward regulated medical use will also require clear pathways for accountability and validation. While formal regulatory structures for generative AI and multimodal models are limited, emerging frameworks, such as Health Canada’s pre-market guidance for ML-enabled medical devices [51] and the FDA’s draft guidance for AI-enabled device software function [52], provide an early structure for safety documentation and risk mitigation. Ultimately, meaningful clinical integration requires large-scale real-world validation studies, standardized evaluation protocols, and a focus on equitable access to ensure that AI-augmented respiratory assessment tools are usable across both high- and low-resource environments [53–55].
This study has several limitations. First, although spectrograms were generated from the ICBHI 2017 dataset using a standardized preprocessing pipeline to ensure visual consistency, this transformation may not preserve all acoustic nuances present in the original recordings, particularly subtle transient features or sounds masked by background noise [16]. Nonetheless, such noise and variability reflect real-world conditions and support generalizability. Second, while the zero- and few-shot prompts followed previously validated approaches and were applied consistently across all 6898 cases, model outputs remain inherently sensitive to prompt formulation, and alternative prompt structures could yield different results [37]. Lastly, this study evaluated the performance of a single multimodal LLM; although GPT-4o is a popular and well-studied foundation model, its performance may not reflect that of smaller, domain-specific, or medically fine-tuned models [56]. Future work should benchmark performance across emerging multimodal architectures to contextualize these findings within the rapidly evolving LLM landscape.
Conclusion
Our study demonstrates that few-shot prompting achieves modest but statistically significant improvements over zero-shot prompting for mel-spectrogram-based lung sound classification using a general-purpose multimodal LLM. Although overall performance remains far below that of task-specific deep learning models and insufficient for clinical use, this approach provides a baseline for visual in-context learning and a foundation for subsequent refinement. Further work should focus on optimizing prompting strategies and evaluating performance across more diverse, clinically representative datasets.
References
- 1. Sarkar M, Madabhavi I, Niranjan N, Dogra M. Auscultation of the respiratory system. Ann Thorac Med. 2015;10(3):158–68. pmid:26229557
- 2. Benbassat J, Baumal R. Narrative review: should teaching of the respiratory physical examination be restricted only to signs with proven reliability and validity?. J Gen Intern Med. 2010;25(8):865–72. pmid:20349154
- 3. Brooks D, Thomas J. Interrater reliability of auscultation of breath sounds among physical therapists. Phys Ther. 1995;75(12):1082–8. pmid:7501711
- 4. Prodhan P, Dela Rosa RS, Shubina M, Haver KE, Matthews BD, Buck S, et al. Wheeze detection in the pediatric intensive care unit: comparison among physician, nurses, respiratory therapists, and a computerized respiratory sound monitor. Respir Care. 2008;53(10):1304–9. pmid:18811991
- 5. Kevat AC, Kalirajah A, Roseby R. Digital stethoscopes compared to standard auscultation for detecting abnormal paediatric breath sounds. Eur J Pediatr. 2017;176(7):989–92. pmid:28508991
- 6. Grzywalski T, Piecuch M, Szajek M, Bręborowicz A, Hafke-Dys H, Kociński J, et al. Practical implementation of artificial intelligence algorithms in pulmonary auscultation examination. Eur J Pediatr. 2019;178(6):883–90. pmid:30927097
- 7. Kevat A, Kalirajah A, Roseby R. Artificial intelligence accuracy in detecting pathological breath sounds in children using digital stethoscopes. Respir Res. 2020;21(1):253. pmid:32993620
- 8. Park JS, Kim K, Kim JH, Choi YJ, Kim K, Suh DI. A machine learning approach to the development and prospective evaluation of a pediatric lung sound classification model. Sci Rep. 2023;13(1):1289. pmid:36690658
- 9. Acharya J, Basu A. Deep Neural Network for Respiratory Sound Classification in Wearable Devices Enabled by Patient Specific Model Tuning. IEEE Trans Biomed Circuits Syst. 2020;14(3):535–44. pmid:32191898
- 10. Rocha BM, Pessoa D, Marques A, Carvalho P, Paiva RP. Automatic Classification of Adventitious Respiratory Sounds: A (Un)Solved Problem?. Sensors (Basel). 2020;21(1):57. pmid:33374363
- 11. Meskó B. The Impact of Multimodal Large Language Models on Health Care’s Future. J Med Internet Res. 2023;25:e52865. pmid:37917126
- 12. Dietrich N, Bradbury NC, Loh C. Prompt Engineering for Large Language Models in Interventional Radiology. AJR Am J Roentgenol. 2025;225(2):e2532956. pmid:40334089
- 13. Shah K, Xu AY, Sharma Y, Daher M, McDonald C, Diebo BG, et al. Large Language Model Prompting Techniques for Advancement in Clinical Medicine. J Clin Med. 2024;13(17):5101. pmid:39274316
- 14. Russe MF, Reisert M, Bamberg F, Rau A. Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning. Rofo. 2024;196(11):1166–70. pmid:38408477
- 15. Brin D, Sorin V, Barash Y, Konen E, Glicksberg BS, Nadkarni GN, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. 2025;35(4):1959–65. pmid:39214893
- 16. Rocha BM, Filos D, Mendes L, Serbes G, Ulukaya S, Kahya YP, et al. An open access database for the evaluation of respiratory sound classification algorithms. Physiol Meas. 2019;40(3):035001. pmid:30708353
- 17. Wanasinghe T, Bandara S, Madusanka S, Meedeniya D, Bandara M, Díez IDLT. Lung Sound Classification With Multi-Feature Integration Utilizing Lightweight CNN Model. IEEE Access. 2024;12:21262–76.
- 18. Owens FJ, Murphy MS. A short-time Fourier transform. Signal Processing. 1988;14(1):3–10.
- 19. Nuñez JR, Anderton CR, Renslow RS. Optimizing colormaps with consideration for color vision deficiency to enable accurate interpretation of scientific data. PLoS One. 2018;13(7):e0199239. pmid:30067751
- 20. OpenAI. GPT-4o-2024-08-06. Large language model. Preprint. 2024. https://www.openai.com/research/gpt-4o-2024-08-06
- 21. Dixit S, Heller LM, Donahue C. Vision language models are few-shot audio spectrogram classifiers. ArXiv. 2024.
- 22. Park CR, Heo H, Suh CH, Shim WH. Uncover This Tech Term: Application Programming Interface for Large Language Models. Korean J Radiol. 2025;26(8):793–6. pmid:40736411
- 23. Fraiwan M, Fraiwan L, Khassawneh B, Ibnian A. A dataset of lung sounds recorded from the chest wall using an electronic stethoscope. Data Brief. 2021;35:106913. pmid:33732827
- 24. Kapoor S, Narayanan A. Leakage and the reproducibility crisis in machine-learning-based science. Patterns (N Y). 2023;4(9):100804. pmid:37720327
- 25. Dietrich N, Bradbury NC, Loh C. Advanced Prompt Engineering for Large Language Models in Interventional Radiology: Practical Strategies and Future Perspectives. AJR Am J Roentgenol. 2025;:10.2214/AJR.25.33947. pmid:41222246
- 26. Zhang J, Sun K, Jagadeesh A, Falakaflaki P, Kayayan E, Tao G, et al. The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant. J Am Med Inform Assoc. 2024;31(9):1884–91. pmid:39018498
- 27. Bae S, Kim JW, Cho WY. Patch-mix contrastive learning with audio spectrogram transformer on respiratory sound classification. ArXiv. 2024. https://github.com/raymin0223/
- 28. Berrar D. Cross-Validation. Encyclopedia of Bioinformatics and Computational Biology. Elsevier. 2019. p. 542–5.
- 29. Brown TB, Mann B, Ryder N. Language Models are Few-Shot Learners. ArXiv. 2020.
- 30. Yu S, Yu J, Chen L, Zhu B, Liang X, Xie Y, et al. Advances and Challenges in Respiratory Sound Analysis: A Technique Review Based on the ICBHI2017 Database. Electronics. 2025;14(14):2794.
- 31. Kim Y, Hyon Y, Jung SS, Lee S, Yoo G, Chung C, et al. Respiratory sound classification for crackles, wheezes, and rhonchi in the clinical field using deep learning. Sci Rep. 2021;11(1):17186. pmid:34433880
- 32. Xiao L, Fang L, Yang Y, Tu W. LungAdapter: Efficient Adapting Audio Spectrogram Transformer for Lung Sound Classification. In: Interspeech 2024, 2024. 4738–42.
- 33. Moriki D, Koumpagioti D, Kalogiannis M, Sardeli O, Galani A, Priftis KN, et al. Physicians’ ability to recognize adventitious lung sounds. Pediatr Pulmonol. 2023;58(3):866–70. pmid:36453611
- 34. Jang H, McCormack D, Tong F. Noise-trained deep neural networks effectively predict human vision and its neural responses to challenging images. PLoS Biol. 2021;19(12):e3001418. pmid:34882676
- 35. Aviles-Solis JC, Storvoll I, Vanbelle S, Melbye H. The use of spectrograms improves the classification of wheezes and crackles in an educational setting. Sci Rep. 2020;10(1):8461. pmid:32440001
- 36. Chen W, Si C, Zhang Z, Wang L, Wang Z, Tan T. Semantic Prompt for Few-Shot Image Recognition. ArXiv. 2023.
- 37. Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med Inform. 2024;12:e55318. pmid:38587879
- 38. Dietrich N, Stubbert B. Evaluating Adherence to Canadian Radiology Guidelines for Incidental Hepatobiliary Findings Using RAG-Enabled LLMs. Can Assoc Radiol J. 2025;76(4):674–82. pmid:40016861
- 39. Chen B, Zhang Z, Langrené N, Zhu S. Unleashing the potential of prompt engineering for large language models. Patterns. 2025;6(6):101260.
- 40. Jensen EA, Panitch H, Feng R, Moore PE, Schmidt B. Interobserver Reliability of the Respiratory Physical Examination in Premature Infants: A Multicenter Study. The Journal of Pediatrics. 2016;178:87–92.
- 41. Dietrich N. Navigating the Noise: Thoughtful Artificial Intelligence Engagement in Medicine and Radiology. AJR Am J Roentgenol. 2025;225(1):e2532918. pmid:40105384
- 42. Paik KE, Hicklen R, Kaggwa F, Puyat CV, Nakayama LF, Ong BA, et al. Digital Determinants of Health: Health data poverty amplifies existing health disparities-A scoping review. PLOS Digit Health. 2023;2(10):e0000313. pmid:37824445
- 43. Dietrich N. Agentic AI in radiology: emerging potential and unresolved challenges. Br J Radiol. 2025;98(1174):1582–4. pmid:40705666
- 44. Tyto Care Ltd. TytoCare Lung Sounds Analyzer. 2023. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K221614. Accessed 2025 November 1.
- 45. Tyto Care Ltd. Tyto insights for crackles detection. 2024.
- 46. Respiri Limited. Wheezo WheezeRate Detector. 2021. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K202062. Accessed 2025 November 1.
- 47. Ong JCL, Chang SY-H, William W, Butte AJ, Shah NH, Chew LST, et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. 2024;6(6):e428–32. pmid:38658283
- 48. Weissman GE, Mankowitz T, Kanter GP. Unregulated large language models produce medical device-like output. NPJ Digit Med. 2025;8(1):148. pmid:40055537
- 49. Dietrich N, Gong B, Patlas MN. Adversarial artificial intelligence in radiology: Attacks, defenses, and future considerations. Diagn Interv Imaging. 2025;106(11):375–84. pmid:40404555
- 50. Dietrich N, Patlas MN. Adversarial AI in Radiology: A Hidden Threat. Can Assoc Radiol J. 2025;76(4):564–5. pmid:40170271
- 51. Government of Canada. Pre-market guidance for machine learning-enabled medical devices. Health Canada. 2025. https://www.canada.ca/en/health-canada/services/drugs-health-products/medical-devices/application-information/guidance-documents/pre-market-guidance-machine-learning-enabled-medical-devices.html. Accessed 2025 October 8.
- 52. U.S. Food and Drug Administration. Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations. 2025.
- 53. Doo FX, Vosshenrich J, Cook TS, Moy L, Almeida EPRP, Woolen SA, et al. Environmental Sustainability and AI in Radiology: A Double-Edged Sword. Radiology. 2024;310(2):e232030. pmid:38411520
- 54. Dietrich N, Hanneman K. Greener by Design: Weighing the Environmental Impact of Radiology AI Development. Can Assoc Radiol J. 2025;:8465371251364698. pmid:40792465
- 55. Hanneman K, Szava-Kovats A, Burbridge B, Leswick D, Nadeau B, Islam O, et al. Canadian Association of Radiologists Statement on Environmental Sustainability in Medical Imaging. Can Assoc Radiol J. 2025;76(1):44–54. pmid:39080832
- 56. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems. ArXiv. 2023.