Towards reliable use of artificial intelligence to classify otitis media using otoscopic images: Addressing bias and improving data quality

  • Yixi Xu,

    Contributed equally to this work with: Yixi Xu, Al-Rahim Habib

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    yixx@microsoft.com

    Affiliation AI for Good Lab, Microsoft, Redmond, Washington, United States of America

  • Al-Rahim Habib,

    Contributed equally to this work with: Yixi Xu, Al-Rahim Habib

    Roles Conceptualization, Data curation, Funding acquisition, Project administration, Validation, Writing – original draft, Writing – review & editing

    Affiliations Sydney Medical School, Faculty of Medicine and Health, University of Sydney, Camperdown, New South Wales, Australia, Department of Otolaryngology, Head and Neck Surgery, Westmead Hospital, Sydney, New South Wales, Australia, Department of Otolaryngology – Head and Neck Surgery, Queensland Children’s Hospital, South Brisbane, Queensland, Australia

  • Graeme Crossland,

    Roles Writing – review & editing

    Affiliation Department of Otolaryngology – Head and Neck Surgery, Royal Darwin Hospital, Tiwi, Northern Territory, Australia

  • Hemi Patel,

    Roles Writing – review & editing

    Affiliation Department of Otolaryngology – Head and Neck Surgery, Royal Darwin Hospital, Tiwi, Northern Territory, Australia

  • Chris Perry,

    Roles Writing – review & editing

    Affiliation University of Queensland Medical School, Brisbane, Queensland, Australia

  • Kris Bock,

    Roles Software, Writing – review & editing

    Affiliation Azure FastTrack Engineering, Microsoft, Brisbane, Queensland, Australia

  • Tony Lian,

    Roles Writing – review & editing

    Affiliations Sydney Medical School, Faculty of Medicine and Health, University of Sydney, Camperdown, New South Wales, Australia, Department of Otolaryngology, Head and Neck Surgery, Westmead Hospital, Sydney, New South Wales, Australia

  • William B. Weeks,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation AI for Good Lab, Microsoft, Redmond, Washington, United States of America

  • Rahul Dodhia,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation AI for Good Lab, Microsoft, Redmond, Washington, United States of America

  • Juan Lavista Ferres,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation AI for Good Lab, Microsoft, Redmond, Washington, United States of America

  • Narinder Pal Singh

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliations Sydney Medical School, Faculty of Medicine and Health, University of Sydney, Camperdown, New South Wales, Australia, Department of Otolaryngology, Head and Neck Surgery, Westmead Hospital, Sydney, New South Wales, Australia

Abstract

Ear disease contributes significantly to global hearing loss, with recurrent otitis media being a primary preventable cause in children, impacting development. Artificial intelligence (AI) offers promise for early diagnosis via otoscopic image analysis, but dataset biases and inconsistencies limit model generalizability and reliability. This retrospective study systematically evaluated three public otoscopic image datasets (Chile; Ohio, USA; Türkiye) using quantitative and qualitative methods. Two counterfactual experiments were performed: (1) obscuring clinically relevant features to assess model reliance on non-clinical artifacts, and (2) evaluating the impact of hue, saturation, and value on diagnostic outcomes. Quantitative analysis revealed significant biases in the Chile and Ohio, USA datasets. Counterfactual Experiment I found high internal performance (AUC > 0.90) but poor external generalization, because of dataset-specific artifacts. The Türkiye dataset had fewer biases, with AUC decreasing from 0.86 to 0.65 as masking increased, suggesting higher reliance on clinically meaningful features. Counterfactual Experiment II identified common artifacts in the Chile and Ohio, USA datasets. A logistic regression model trained on clinically irrelevant features from the Chile dataset achieved high internal (AUC = 0.89) and external (Ohio, USA: AUC = 0.87) performance. Qualitative analysis identified redundancy in all the datasets and stylistic biases in the Ohio, USA dataset that correlated with clinical outcomes. In summary, dataset biases significantly compromise reliability and generalizability of AI-based otoscopic diagnostic models. Addressing these biases through standardized imaging protocols, diverse dataset inclusion, and improved labeling methods is crucial for developing robust AI solutions, improving high-quality healthcare access, and enhancing diagnostic accuracy.

Introduction

Ear disease represents a significant global public health challenge, affecting individuals across all age groups and socioeconomic backgrounds [1]. It is a leading cause of disability, contributing to communication barriers, social isolation, and reduced quality of life. Among children, recurrent acute otitis media and chronic otitis media are preventable causes of hearing loss that can profoundly impact speech and language development, academic achievement, and long-term social integration [2]. Long-term hearing loss can hinder their ability to participate in social activities and achieve educational milestones, potentially reducing quality of life and future employment opportunities [3]. It is estimated that nearly 60% of hearing loss in children is due to avoidable causes such as vaccine-preventable diseases, ear infections, birth-related causes and ototoxic medicines [1]. In underserved populations, where access to timely healthcare is limited, these conditions often go undiagnosed or untreated, exacerbating health disparities [3].

Timely diagnosis and intervention are critical for mitigating the effects of ear disease, yet access to specialty ear disease and hearing health services poses a significant barrier [4]. Many low- and middle-income countries have limited access to audiologists and otolaryngologists, leaving general practitioners and community health workers to manage complex cases with limited diagnostic tools [5–7]. For instance, only 56% of countries in the African region have one or more otolaryngologists per 1 million people, while 67% of European countries have more than 50 otolaryngologists per million population [1]. This disparity is particularly acute in rural and remote areas, where delays in diagnosis can lead to irreversible complications and poor otologic and hearing outcomes [8]. To address these challenges, artificial intelligence (AI) and deep learning have emerged as promising adjunct tools for ear disease diagnosis. These technologies have been explored to analyse otoscopic images, with the aim of helping healthcare workers detect abnormalities, reducing diagnostic variability, and supporting triage in busy clinics or resource-constrained settings [9]. For example, AI models have shown potential in identifying conditions such as acute and chronic otitis media, offering healthcare workers a powerful tool to enhance diagnostic accuracy and streamline workflows [10–14]. However, the clinical implementation of these models is hindered by limitations in the existing literature, including biases in training datasets and challenges with generalizability [15].

AI models excel at recognizing patterns within training data, sometimes unintentionally linking certain confounding features to clinical outcomes. For instance, melanoma detection models might rely on shortcuts based on surgical skin markings [16]. Unfortunately, assessing data bias has remained challenging and typically requires both domain expertise and technical ability [17]. Existing AI models in healthcare image analysis often perform well in controlled settings but may underperform in real-world clinical environments. Many models rely on datasets collected from a single institution, leading to biases that compromise their performance across diverse populations and settings. These biases, such as models learning irrelevant features like lighting conditions or camera settings, can undermine the reliability of diagnostic decisions [17,18]. Addressing these gaps is essential to ensure that AI tools improve patient outcomes rather than introduce new risks.

Otoscopy plays an important role in identifying middle ear pathology; however, it is not a comprehensive hearing screening modality. Otoscopic assessment cannot detect sensorineural hearing loss or retro-cochlear pathology and therefore does not replace audiometric evaluation. In this context, artificial intelligence–based otoscopic image analysis is positioned as a diagnostic support tool for the identification and triage of middle ear disease, particularly otitis media, within a broader ear and hearing care framework. It is intended to complement, rather than substitute, formal hearing assessment and comprehensive ear and hearing health services.

The aim of this study was to investigate the potential of deep learning to enhance otoscopic workflows by addressing the limitations of existing datasets. The specific objectives were to identify and analyse biases in publicly available otoscopic image datasets, to develop practical guidelines to mitigate these biases, and to improve data collection practices. By focusing on the clinical utility of AI tools and their integration into real-world settings, this study provides a foundation for developing reliable, equitable, and impactful solutions to support clinicians and improve patient care.

Methods

Dataset Overview

We used three publicly available otoscopic image datasets in this study (Table 1). The Chile dataset comprised 880 images collected from 180 patients aged 7–65 years using a Firefly DE500 digital video otoscope [19]. The Ohio, USA dataset included 454 images from the Ohio State University and Nationwide Children’s Hospital using a JEDMED HORUS+ digital video otoscope [20]. The Türkiye dataset consisted of 956 images from the Ozel Van Akdamar Hospital using a standard otoscope device, with the specific brand not reported [21,22].

Table 1. Image counts of sub-types in the otoscopy datasets.

https://doi.org/10.1371/journal.pone.0338867.t001

Adjustments were made for consistency: tube images in the Ohio, USA dataset were excluded, and tympanostomy tube images and the ‘other’ category in the Türkiye dataset were excluded, retaining only tympanosclerosis images. All images included in this study had complete diagnostic labels, and no imputation or special handling of missing data was required.

Ethics Statement

This study is a secondary analysis of publicly available otoscopic image datasets that were originally collected and published in prior studies. No new data were collected, and no additional interaction with human participants occurred. Ethical approval and informed consent procedures were obtained by the relevant institutional review boards or ethics committees in the original studies, as indicated by each of the authorship groups (University of Chile Scientific and Research Ethics Committee [19], Ohio State University Institutional Review Board [20], Bitlis Eren University Ethics Committee [21,22]).

Data Splits and Validation Setups

All analysis was conducted at the image level. For internal validation, the Türkiye and Chile datasets used the predefined training and validation splits provided in the original source studies. For the Ohio, USA dataset, which did not include predefined splits, 20% of images were randomly sampled for internal validation using stratified random sampling to preserve class balance. Sampling was performed with shuffling enabled and a fixed random seed to ensure reproducibility.
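The stratified hold-out described for the Ohio, USA dataset can be sketched as follows. This is a minimal standard-library illustration rather than the authors' code: the function name `stratified_holdout`, the seed value, and the example labels are all hypothetical, and the original study does not state which library or seed was used.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, holdout_fraction=0.2, seed=0):
    """Return (train_indices, holdout_indices), sampling within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed: the same split on every run
    train, holdout = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle before slicing off the hold-out part
        n_holdout = round(len(indices) * holdout_fraction)
        holdout.extend(indices[:n_holdout])
        train.extend(indices[n_holdout:])
    return sorted(train), sorted(holdout)


# Hypothetical labels, for illustration only.
labels = ["normal"] * 50 + ["abnormal"] * 100
train_idx, val_idx = stratified_holdout(labels, holdout_fraction=0.2, seed=42)
```

Because the sampling is done separately inside each class, the hold-out set keeps the same normal-to-abnormal ratio as the full dataset.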

In the rest of the paper, internal validation refers to evaluation on held‑out data from the same dataset used for model training. External testing refers to evaluation on entirely separate datasets collected from different institutions and geographic regions. No images were shared across datasets.

Table 2 summarizes the training, internal validation, and external testing set sizes for each training source. When a dataset was used as the training source, it was not used for external testing.

Table 2. Dataset sizes used for model training, internal validation, and external validation across the three otoscopic image datasets.

https://doi.org/10.1371/journal.pone.0338867.t002

Bias Identification: Quantitative assessment

Experiment I: Eclipse Extent.

To evaluate whether AI models relied on irrelevant visual artifacts, a dataset manipulation technique was used. As shown in Fig 1, portions of the otoscopic images were covered with an elliptical black mask to obscure the tympanic membrane. A metric termed “Eclipse Extent” is the ratio of the mask’s dimensions to those of the image. It quantifies the degree of masking, ranging from 0 (original, unmasked image) to 1 (almost fully obscured). Four deep learning models—ResNet-50, DenseNet-161, ViT-B-16, and ViT-B-16–384—were trained on the masked dataset (referred to as the “Eclipsed Dataset”) and evaluated on their ability to classify normal and abnormal images both internally (within the same dataset) and externally (using different datasets). This experiment aimed to detect reliance on confounding features, such as background or lighting, that could artificially inflate model performance.
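The masking step can be illustrated with a short sketch. It assumes that the ellipse is centred on the image centre and that its axes equal the Eclipse Extent multiplied by the image width and height; the text defines the metric only as a ratio of dimensions, so both points are assumptions. The function names are hypothetical, and plain nested lists are used so the example has no dependencies (an array library would be the practical choice for real images).

```python
def eclipse_mask(width, height, extent):
    """Boolean grid that is True where a pixel lies under the elliptical mask."""
    if not 0.0 <= extent <= 1.0:
        raise ValueError("extent must be in [0, 1]")
    if extent == 0.0:
        return [[False] * width for _ in range(height)]
    cx, cy = (width - 1) / 2, (height - 1) / 2        # assumed centre: image centre
    a, b = extent * width / 2, extent * height / 2    # semi-axes of the ellipse
    return [
        [((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0 for x in range(width)]
        for y in range(height)
    ]


def apply_eclipse(image, extent):
    """Black out masked pixels of an image given as rows of (R, G, B) tuples."""
    height, width = len(image), len(image[0])
    mask = eclipse_mask(width, height, extent)
    return [
        [(0, 0, 0) if mask[y][x] else image[y][x] for x in range(width)]
        for y in range(height)
    ]


def masked_fraction(width, height, extent):
    """Share of pixels hidden by the mask."""
    mask = eclipse_mask(width, height, extent)
    return sum(map(sum, mask)) / (width * height)
```

Under these assumptions an Eclipse Extent of 1.0 gives an ellipse inscribed in the image frame, so only the corners stay visible, which matches the description "almost fully obscured".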

Fig 1. A sample otoscopic image and its eclipsed versions under varying Eclipse Extents of 0.0 (original), 0.9, and 1.0 from the (a) Chile, (b) Ohio, USA and (c) Türkiye datasets.

https://doi.org/10.1371/journal.pone.0338867.g001

Experiment II: Image Saturation.

To investigate the impact of image acquisition settings, otoscopic images were converted into hue, saturation, and value (HSV) color space. Two logistic regression models were developed using these color features: one trained on a pre-defined HSV feature set comprising hue mean, hue standard deviation (std), saturation mean, saturation std, value mean, and value std, and the other using a single feature, saturation std. No feature selection or feature standardization was performed. These models underwent internal and external validation to determine whether diagnostic performance was influenced by image color characteristics, which are highly dependent on filming conditions (e.g., lighting, camera settings).
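The six color features can be computed with a few lines of standard-library Python, as in the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the function name `hsv_features` is hypothetical, the population standard deviation is assumed, and hue is averaged arithmetically even though it is a circular quantity (the study does not say how hue was summarized).

```python
import colorsys
import statistics


def hsv_features(pixels):
    """Six color statistics from an iterable of (R, G, B) pixels in 0-255."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    features = {}
    # zip(*hsv) regroups the per-pixel triples into three per-channel tuples.
    for name, channel in zip(("hue", "saturation", "value"), zip(*hsv)):
        features[f"{name}_mean"] = statistics.fmean(channel)
        features[f"{name}_std"] = statistics.pstdev(channel)  # population std (assumed)
    return features


# One pure-red and one white pixel: saturation is 1 and 0 respectively.
example = hsv_features([(255, 0, 0), (255, 255, 255)])
```

The single-feature model described above would use only `saturation_std` from this dictionary as its predictor.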

Bias Identification: Qualitative assessment

To detect redundancy and stylistic biases in the datasets, distance-based clustering methods were employed. Three open-source datasets were split into five stratified folds for 5-fold cross-validation. Five ViT-B-16–384 models were trained, and their image feature embeddings were averaged. Cosine distance thresholds were used to group near-duplicate images (images that are visually nearly identical) and stylistically similar images. The clustering threshold (α) was adjusted iteratively to identify redundancies and stylistic correlations with clinical outcomes. This analysis highlighted instances where dataset biases, such as over-representation of certain image styles or redundant images, could undermine the generalizability of AI models.
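A minimal sketch of the grouping step follows. The study specifies averaged embeddings and a cosine distance threshold α but not the exact clustering procedure, so single-linkage grouping via union-find is an assumption here, as are the function names. Embeddings are assumed to be non-zero vectors.

```python
import math


def mean_embedding(embeddings):
    """Element-wise average of one image's embeddings from several models."""
    return [sum(values) / len(values) for values in zip(*embeddings)]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def group_by_threshold(vectors, alpha):
    """Single-linkage groups: link any two images closer than alpha."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < alpha:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with at least two members count as redundant sets.
    return [members for members in groups.values() if len(members) > 1]
```

A small α isolates near-duplicates, while a larger α merges images that merely share a visual style, which is why the threshold was adjusted iteratively. The pairwise loop is quadratic in the number of images, which is acceptable for datasets of around a thousand images.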

Deep learning model training

To illustrate the susceptibility of deep learning models to data bias, four distinct architectures commonly used in medical imaging and deep learning were chosen: ResNet-50, a convolutional neural network utilizing residual connections to enable the training of deeper networks; DenseNet-161, another convolutional architecture noted for its dense connectivity where each layer is directly connected to all preceding layers; ViT-B-16, which employs the Transformer architecture directly on sequences of image patches; and ViT-B-16–384, a vision transformer with an input resolution of 384x384.

Following the original studies of the three public datasets, duplicates were retained for model training. Data augmentation was applied during training including random resized cropping, horizontal and vertical flipping, color jitter, and elastic transformations. All models were trained using stochastic gradient descent with a learning rate of 0.01 and a batch size of 32 for 100 epochs. No hyperparameter tuning was performed beyond using a fixed learning rate and batch size, in order to avoid implicit information leakage. The final model used for evaluation was the model checkpoint saved at the 100th epoch. Model training was performed using Python on a single Nvidia V100 GPU.

Statistical analysis

Model performance was evaluated using AUC and 95% confidence interval (CI) by DeLong’s method. DeLong’s test was used for pairwise AUC comparisons. One-sided p values < 0.05 were considered statistically significant when comparing model performance on different subsets. For logistic regression models, the Wald method was used to derive the confidence interval for odds ratios and to assess the statistical significance of individual predictors. All statistical analysis was performed using R.
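For readers unfamiliar with DeLong's method, the sketch below computes an AUC, its DeLong variance estimate, and a normal-approximation confidence interval. The study's analysis was performed in R; this standard-library Python version is only an illustration, the function name `delong_auc` is hypothetical, and clipping the interval to [0, 1] is a choice made here. It requires at least two positive and two negative cases.

```python
import math
from statistics import NormalDist


def _psi(x, y):
    """Pairwise kernel: 1 if the positive outranks the negative, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def delong_auc(pos_scores, neg_scores, level=0.95):
    """AUC with a DeLong variance estimate and a normal-approximation CI."""
    m, n = len(pos_scores), len(neg_scores)
    # Structural components: one per positive case and one per negative case.
    v10 = [sum(_psi(x, y) for y in neg_scores) / n for x in pos_scores]
    v01 = [sum(_psi(x, y) for x in pos_scores) / m for y in neg_scores]
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    variance = s10 / m + s01 / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(variance)
    interval = (max(0.0, auc - half_width), min(1.0, auc + half_width))
    return auc, variance, interval


# Hypothetical scores for three abnormal and three normal images.
auc, variance, ci = delong_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

In this toy example eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9, and the small sample makes the interval very wide.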

Results

Quantitative bias identification

Counterfactual Experiment I: Eclipse Extent.

Models trained on images with clinically relevant features obscured by elliptical masks (Eclipse Extent = 1.0) demonstrated high internal performance (AUC > 0.90) for the Chile and Ohio, USA datasets, despite minimal view of the tympanic membrane being available (Table 3). However, these models performed poorly on external datasets, indicating reliance on dataset-specific artifacts, such as ear canal skin or lighting conditions, rather than clinically meaningful features. By contrast, the Türkiye dataset exhibited less bias. Using ViT-B-16–384 as an example, as masking increased (Eclipse Extent from 0 to 0.9 to 1.0), the internal AUC decreased from 0.88 (95% CI 0.83, 0.94) to 0.62 (0.53, 0.71) to 0.53 (0.44, 0.62), suggesting a stronger reliance on meaningful diagnostic information.

Table 3. Internal and external performance of models trained on eclipsed images (Eclipse Extent = 0, 0.9, 1.0) from a single data source.

https://doi.org/10.1371/journal.pone.0338867.t003

Counterfactual Experiment II: Image Saturation.

Models trained on HSV color features achieved high internal AUCs (Chile: 0.91 (0.86, 0.96), Ohio, USA: 0.92 (0.86, 0.99)) but showed reduced generalizability in external testing (Table 4). Abnormal images in the Chile and Ohio, USA datasets exhibited higher saturation variability, likely influenced by lighting conditions or camera settings during image acquisition (Table 5). A logistic regression model trained using the saturation std value on the Chile dataset showed high internal performance (AUC = 0.89 (0.83, 0.95)) and generalized effectively to the Ohio, USA dataset (AUC = 0.87 (0.84, 0.91)). A logistic regression model trained using the saturation std value on the Ohio, USA dataset performed well both internally (AUC = 0.86 (0.77, 0.95)) and externally on the Chile dataset (AUC = 0.85 (0.83, 0.88)) (Table 4). The Türkiye dataset displayed weaker correlations between saturation and clinical outcomes, resulting in lower internal AUCs (0.52 (0.43, 0.61)) but better external performance, indicating reduced reliance on artifacts (Tables 4 and 6).

Table 4. Internal and external validation of logistic regression models using the HSV feature set and the single feature set.

https://doi.org/10.1371/journal.pone.0338867.t004

Table 5. Odds ratios for various variables in logistic regression models utilizing the HSV feature set.

https://doi.org/10.1371/journal.pone.0338867.t005

Table 6. The odds ratio for saturation standard deviation in logistic regression models utilizing the single feature set.

https://doi.org/10.1371/journal.pone.0338867.t006

Qualitative bias identification

Near-duplicate images.

Fig 2 presents three sets of near-duplicate images from the Chile, Ohio, USA and Türkiye datasets, respectively. The Chile dataset contained 145 sets of near-duplicate images, which accounted for 61% of the dataset; 52% of the testing data had a near-duplicate copy in the training set (Table 7). These redundant images likely contributed to inflated internal performance by allowing models to memorize rather than generalize from the data. Model performance on testing samples with near-duplicates in the training set was significantly (p-value < 0.01) higher than on the rest of the testing data, regardless of model architecture and Eclipse Extent (Table 8). On average, across four model architectures, the difference in AUC increased from 0.08 to 0.26 (Eclipse Extent = 0.9). This suggests that models exploiting dataset-specific artifacts tend to overestimate performance on testing samples that have near-duplicates in the training set.

Table 7. Statistics of identified near-duplicate image sets.

https://doi.org/10.1371/journal.pone.0338867.t007

Table 8. Comparison of model performance on images with and without near duplicates in the Chile training set.

https://doi.org/10.1371/journal.pone.0338867.t008

Fig 2. Near-duplicate image sets in the (a) Chile, (b) Ohio, USA and (c) Türkiye datasets.

https://doi.org/10.1371/journal.pone.0338867.g002

Stylistic biases.

In the Ohio, USA dataset, two dominant image style categories were identified. As shown in Fig 3, Style I included 117 images, all classified as normal cases, while Style II comprised 90 images, predominantly representing effusion cases. These stylistic differences, driven by variations in imaging techniques and settings, strongly influenced model predictions, emphasizing the role of image acquisition protocols in shaping AI performance (Fig 3).

Fig 3. One hundred random samples from (a) the Style I set and (b) the Style II set.

https://doi.org/10.1371/journal.pone.0338867.g003

Discussion

We sought to evaluate the biases inherent in publicly available otoscopic image datasets and their impact on the performance of deep learning models for diagnosing middle ear conditions. Two counterfactual experiments focused on identifying biases linked to non-clinical features such as image masking and saturation. Models trained on the Chile and Ohio, USA datasets achieved high internal performance (AUC > 0.90) but failed to generalize to external datasets, largely due to reliance on dataset-specific artifacts such as lighting and ear canal skin patterns. In contrast, the Türkiye dataset demonstrated less bias, with internal AUCs decreasing as masking increased, indicating greater reliance on clinically meaningful features. Qualitative analyses revealed redundant and stylistically biased images, particularly in the Chile and Ohio, USA datasets, where duplication and stylistic correlations with clinical outcomes were evident. These findings underscore the critical influence of dataset bias on AI model performance and reliability.

The Ohio, USA dataset demonstrated a notable bias in image framing, with abnormal cases frequently focusing on specific regions of interest, such as partial views of the eardrum, while normal cases typically encompassed the entire tympanic membrane. This discrepancy allows models to exploit non-clinical cues, such as image framing, rather than clinically relevant features. To mitigate this form of bias, otoscopic images should consistently capture the entire tympanic membrane regardless of the diagnosis. The use of cameras equipped with automatic tympanic membrane detection systems could help standardize image quality while minimizing the need for extensive training for healthcare workers.

The second counterfactual experiment revealed that image saturation, a factor demonstrably influenced by lighting and camera settings, could effectively differentiate between normal and abnormal cases in both the Chile and Ohio, USA datasets (AUC > 0.85). This highlights the potential for bias introduced during image acquisition. This finding raises concerns that the gold standard of external validation may no longer be reliable if data bias is not adequately addressed. Ensuring that otoscope models and imaging protocols are independent of diagnostic outcomes and expert input is essential to reduce reliance on artifacts and enhance the clinical reliability of AI tools.

Analysis of the Chile dataset showed that a significant number of test images had near-identical counterparts in the training set, leading to artificially inflated performance metrics. For instance, models trained on masked (eclipsed) images still achieved high internal AUCs due to these redundancies, even when clinically relevant information was obscured. To address this, patient-based rather than image-based partitioning should be used when creating training and testing datasets to ensure a robust evaluation of model generalizability. Further, as a general quality check of publicly available datasets, duplicate images routinely should be eliminated prior to beginning analytic processes or splitting into training and testing datasets.
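Patient-based partitioning can be sketched as follows. The example assumes that each image carries a patient identifier, which public datasets do not always provide; the function name `split_by_patient`, the record layout, and the seed are hypothetical. The sketch does not stratify by diagnosis, which a real split would probably also need.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (image_id, patient_id) records so no patient spans both sets."""
    images_by_patient = defaultdict(list)
    for image_id, patient_id in records:
        images_by_patient[patient_id].append(image_id)

    # Shuffle patients, not images, so all images of a patient stay together.
    patients = sorted(images_by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [img for p in patients[n_test:] for img in images_by_patient[p]]
    test = [img for p in patients[:n_test] for img in images_by_patient[p]]
    return train, test, test_patients


# Hypothetical records: ten patients with three images each.
records = [(f"img{i}", f"p{i // 3}") for i in range(30)]
train, test, test_patients = split_by_patient(records, test_fraction=0.2, seed=1)
```

Because repeated captures of the same ear belong to the same patient, splitting at the patient level keeps most near-duplicate pairs on one side of the train/test boundary.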

These findings have significant implications for the analysis of publicly available datasets, and, thereby, use of AI models in otoscopic diagnosis. For AI to become a reliable adjunct in clinical workflows, datasets must be curated to minimize biases, include diverse populations, and reflect varied imaging conditions. Rigorous external validation across geographically and demographically diverse datasets is essential to bridge the gap between research settings and real-world applications. This study highlights the need for standardized imaging protocols and improved data collection practices to enhance the reliability of AI models in otoscopy and elsewhere.

The common use of binary “normal/abnormal” classification can be expanded for clinically robust AI-otoscopy systems, notably when visually abnormal findings (e.g., tympanostomy tubes, tympanosclerosis, healed perforations) may represent stable or functionally healthy states. Recent expert recommendations on otoscopy image collection and annotation [23] provide a structured foundation for improving dataset quality, including standardized acquisition parameters and pathology-specific descriptors. Building on this framework, future otoscopy AI datasets should adopt a simplified, management-aligned classification system similar to those used in established tele-otology programs in the Northern Territory of Australia, where categories reflect active disease processes, follow-up requirements, and treatment implications [24].

A tiered framework could be based on structured qualifiers, such as effusion presence, perforation status, tympanic membrane retraction severity and location (pars flaccida versus pars tensa), features of Eustachian tube dysfunction, discharge, post-surgical changes, external canal obstruction, and image adequacy. These attributes could map to universal clinical labels: (1) Normal; (2) Normal with sequelae; (3) Active middle ear disease requiring medical management; (4) Structural pathology requiring specialist assessment; (5) External canal pathology/obstruction; and (6) Uninterpretable image. A standardized, clinically meaningful labeling structure may reduce misclassification, improve generalisability, and enhance the translational impact of otoscopic AI systems (Table 9).

Table 9. Proposed recommendations for standardised otoscopic image labels.

https://doi.org/10.1371/journal.pone.0338867.t009

This study has several limitations. First, the focus on publicly available datasets may not encompass the full range of conditions encountered in routine clinical practice, including variation in otoscope devices, imaging protocols, and patient demographics. Second, the study’s retrospective nature limits the ability to control for confounding factors related to image acquisition, including differences in operator technique, lighting conditions, and anatomical variations between adult and paediatric patients. Third, the methodology used to assess bias, specifically the application of an elliptical mask centered on the tympanic membrane, does not fully reflect real-world diagnostic challenges. In clinical practice, abnormalities may be evaluated in specific anatomical regions of the tympanic membrane [25], which were not specifically targeted in the masking approach. For example, clinically significant pathology (e.g., attic retractions, cholesteatoma, marginal perforations) frequently occurs at the periphery of the tympanic membrane rather than its center. Future studies could explore region-specific masking techniques to better understand bias in AI model predictions. Fourth, the simulated masking and image saturation manipulations used in this study provide an artificial framework to evaluate bias but may not fully replicate the complexities of clinical image variability, such as motion artifacts, cerumen obstruction, or variability in clinician positioning. Finally, the reliance on a limited number of public datasets restricts the generalizability of findings, as publicly available datasets may not accurately reflect the diversity of patients, clinical environments, and healthcare settings globally. Expanding the analysis to include larger, more diverse datasets from multiple institutions and geographic regions would strengthen the applicability of these findings.

This analysis did not identify evidence that one camera system was inherently superior to another. Differences in generalization performance appeared to be influenced more by acquisition protocols, such as lighting consistency, field-of-view completeness, and image quality standards. A definitive assessment of device performance would require a controlled comparison using a standardized image acquisition protocol and user training.

The strengths of this study lie in its comprehensive evaluation of dataset biases through both quantitative and qualitative methods. Counterfactual experiments combined with clustering-based analyses provided a nuanced understanding of factors influencing AI model performance. The inclusion of diverse datasets and systematic evaluation of non-clinical features, such as lighting and image style, offers valuable insights into the challenges of AI generalizability. The proposed data collection guidelines offer actionable recommendations for improving dataset quality and clinical applicability. Further, the tests conducted in this study may be considered as part of a pre-analytic data quality assessment to identify bias prior to model creation.

Future research should focus on mitigating dataset biases during model training through advanced techniques such as feature disentanglement and augmentation strategies that reduce reliance on confounding features [26]. Expanding the analysis to include larger and more diverse private datasets would provide a broader understanding of biases in otoscopic imaging. Incorporating multi-expert annotations and assessing inter-observer variability could further enhance dataset quality and reliability. Testing AI tools in real-world clinical environments with standardized imaging protocols will be critical for translating these findings into impactful healthcare solutions.

Conclusion

This study revealed significant biases in publicly available otoscopic image datasets that impact the generalizability and reliability of AI models in diagnosing middle ear conditions. The findings demonstrate that AI models often rely on dataset-specific artifacts, such as lighting conditions and framing inconsistencies, rather than only clinical features in the region of the tympanic membrane, leading to compromised performance in external settings. Addressing these biases requires rigorous data curation, including the elimination of redundant images, standardized imaging protocols, patient-centered data partitioning to improve model generalizability, and rigorous quality assessment prior to embarking on analysis. Furthermore, ensuring consistency in image acquisition settings and implementing automated tympanic membrane detection can enhance diagnostic accuracy while reducing variability introduced by different operators and imaging equipment. Future research should focus on refining AI training strategies through feature disentanglement techniques, expanding dataset diversity to include varied populations and clinical settings, and integrating multi-expert annotations to improve data quality and reliability. These efforts are crucial for developing AI-based diagnostic tools that are robust, promote access to high-quality healthcare, and have clinical applicability across diverse healthcare environments.

Acknowledgments

We extend our gratitude to Anthony Ortiz and Zhongqi Miao for their insightful discussions that helped shape this research.

References

  1. World Health Organization. World report on hearing. World Health Organization. 2021. https://www.who.int/publications/i/item/9789240020481
  2. DeLacy J, Dune T, Macdonald JJ. The social determinants of otitis media in aboriginal children in Australia: are we addressing the primary causes? A systematic content review. BMC Public Health. 2020;20(1):492. pmid:32295570
  3. Butler CC, Williams RG. The Etiology, Pathophysiology, and Management of Otitis Media with Effusion. Curr Infect Dis Rep. 2003;5(3):205–12. pmid:12760817
  4. World Health Organization. Chronic suppurative otitis media: burden of illness and management options. Geneva, Switzerland: World Health Organization. 2004. https://iris.who.int/bitstream/handle/10665/42941/9241591587.pdf
  5. Graydon K, Waterworth C, Miller H, Gunasekera H. Global burden of hearing impairment and ear disease. J Laryngol Otol. 2019;133(1):18–25. pmid:30047343
  6. Gunasekera H, O’Connor TE, Vijayasekaran S, Del Mar CB. Primary care management of otitis media among Australian children. Med J Aust. 2009;191(S9):S55–9. pmid:19883358
  7. Gunasekera H, Morris PS, Daniels J, Couzos S, Craig JC. Otitis media in Aboriginal children: the discordance between burden of illness and access to services in rural/remote and urban Australia. J Paediatr Child Health. 2009;45(7–8):425–30. pmid:19722295
  8. Bright T, Mújica OJ, Ramke J, Moreno CM, Der C, Melendez A, et al. Inequality in the distribution of ear, nose and throat specialists in 15 Latin American countries: an ecological study. BMJ Open. 2019;9(7):e030220. pmid:31326937
  9. Habib A-R, Kajbafzadeh M, Hasan Z, Wong E, Gunasekera H, Perry C, et al. Artificial intelligence to classify ear disease from otoscopy: A systematic review and meta-analysis. Clin Otolaryngol. 2022;47(3):401–13. pmid:35253378
  10. Habib A-R, Wong E, Sacks R, Singh N. Artificial intelligence to detect tympanic membrane perforations. J Laryngol Otol. 2020;134(4):311–5. pmid:32238202
  11. Habib A-R, Crossland G, Patel H, Wong E, Kong K, Gunasekera H, et al. An Artificial Intelligence Computer-vision Algorithm to Triage Otoscopic Images From Australian Aboriginal and Torres Strait Islander Children. Otol Neurotol. 2022;43(4):481–8. pmid:35239622
  12. Alhudhaif A, Cömert Z, Polat K. Otitis media detection using tympanic membrane images with a novel multi-class machine learning algorithm. PeerJ Comput Sci. 2021;7:e405. pmid:33817048
  13. Yüce S, Polat K, Önder İ, Doğan M, Müderris S. Chronic otitis media with multiple complications. J Craniofac Surg. 2023;13.
  14. Zeng J, Kang W, Chen S, Lin Y, Deng W, Wang Y, et al. A Deep Learning Approach to Predict Conductive Hearing Loss in Patients With Otitis Media With Effusion Using Otoscopic Images. JAMA Otolaryngol Head Neck Surg. 2022;148(7):612–20. pmid:35588049
  15. Habib A-R, Xu Y, Bock K, Mohanty S, Sederholm T, Weeks WB, et al. Evaluating the generalizability of deep learning image classification algorithms to detect middle ear disease using otoscopy. Sci Rep. 2023;13(1):5368. pmid:37005441
  16. Winkler JK, Fink C, Toberer F, Enk A, Deinlein T, Hofmann-Wellenhof R, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for Melanoma Recognition. JAMA Dermatol. 2019;155(10):1135–41. pmid:31411641
  17. Banerjee I, Bhattacharjee K, Burns JL, Trivedi H, Purkayastha S, Seyyed-Kalantari L, et al. “Shortcuts” Causing Bias in Radiology Artificial Intelligence: Causes, Evaluation, and Mitigation. J Am Coll Radiol. 2023;20(9):842–51. pmid:37506964
  18. Geirhos R, Jacobsen JH, Michaelis C, Zemel R, Brendel W, Bethge M. Shortcut learning in deep neural networks. Nature Machine Intelligence. 2020;2:665–73.
  19. Viscaino M, Maass JC, Delano PH, Torrente M, Stott C, Auat Cheein F. Computer-aided diagnosis of external and middle ear conditions: A machine learning approach. PLoS One. 2020;15(3):e0229226. pmid:32163427
  20. Camalan S, Niazi MKK, Moberly AC, Teknos T, Essig G, Elmaraghy C, et al. OtoMatch: Content-based eardrum image retrieval using deep learning. PLoS One. 2020;15(5):e0232776. pmid:32413096
  21. Zafer C. Fusing fine-tuned deep features for recognizing different tympanic membranes. Biocybern Biomed Eng. 2020;40:40–51.
  22. Başaran E, Cömert Z, Çelik Y, Velappan S, Toğaçar M. Determination of tympanic membrane region in the middle ear otoscope images with convolutional neural network based YOLO method. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi. 2020;22:919–28.
  23. Cai Y, Zeng J, Lan L, Chen S, Ou Y, Zeng L, et al. Expert recommendations on collection and annotation of otoscopy images for intelligent medicine. Intelligent Medicine. 2022;2(4):230–4.
  24. Habib A-R, Crossland G, Sacks R, Singh N, Patel H. Tele-otology for Aboriginal and Torres Strait Islander People Living in Rural and Remote Areas. Laryngoscope. 2024;134(12):5096–102. pmid:38982868
  25. Mahomed F, De Wet Swanepoel JA. Open access guide to audiology and hearing aids for otolaryngologists. Pretoria; 2014.
  26. Trivedi A, Robinson C, Blazes M, Ortiz A, Desbiens J, Gupta S, et al. Deep learning models for COVID-19 chest x-ray classification: Preventing shortcut learning using feature disentanglement. PLoS One. 2022;17(10):e0274098. pmid:36201483