
Automating detection of diagnostic error of infectious diseases using machine learning

  • Kelly S. Peterson ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    kelly.peterson@hsc.utah.edu

    Affiliations Veterans Health Administration, Office of Analytics and Performance Integration, Washington D.C., District of Columbia, United States of America, Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America

  • Alec B. Chapman,

    Roles Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America, Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America

  • Wathsala Widanagamaachchi,

    Roles Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America, Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America

  • Jesse Sutton,

    Roles Validation

    Affiliation Veterans Affairs Health Care System, Minneapolis, Minnesota, United States of America

  • Brennan Ochoa,

    Roles Validation

    Affiliation Rocky Mountain Infectious Diseases Specialists, Aurora, Colorado, United States of America

  • Barbara E. Jones,

    Roles Conceptualization

    Affiliations Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America, Division of Pulmonary & Critical Care Medicine, University of Utah, Salt Lake City, Utah, United States of America

  • Vanessa Stevens,

    Roles Writing – original draft, Writing – review & editing

    Affiliations Veterans Health Administration, Office of Analytics and Performance Integration, Washington D.C., District of Columbia, United States of America, Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America

  • David C. Classen,

    Roles Conceptualization, Methodology

    Affiliation Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America

  • Makoto M. Jones

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliations Veterans Health Administration, Office of Analytics and Performance Integration, Washington D.C., District of Columbia, United States of America, Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America, Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America

Abstract

Diagnostic error, a cause of substantial morbidity and mortality, is largely discovered and evaluated through self-report and manual review, which is costly and not suited to real-time intervention. Opportunities exist to leverage electronic health record data for automated detection of potential misdiagnosis, executed at scale and generalized across diseases. We propose a novel automated approach to identifying diagnostic divergence considering both diagnosis and risk of mortality. Our objective was to identify cases of emergency department infectious disease misdiagnoses by measuring the deviation between predicted diagnosis and documented diagnosis, weighted by mortality. Two machine learning models were trained to predict infectious disease and mortality using the first 24 hours of data. Charts were manually reviewed by clinicians to determine whether there could have been a more correct or timely diagnosis. The proposed approach was validated against manual reviews and compared using the Spearman rank correlation. We analyzed 6.5 million ED visits and over 700 million associated clinical features from over one hundred emergency departments. The testing set performances of the infectious disease model (Macro F1 = 86.7, AUROC 90.6 to 94.7) and mortality model (Macro F1 = 97.6, AUROC 89.1 to 89.1) were in expected ranges. Human reviews and the proposed automated metric demonstrated positive correlations ranging from 0.231 to 0.358. The proposed approach for diagnostic deviation shows promise as a potential tool for clinicians to find diagnostic errors. Given the vast number of clinical features used in this analysis, further improvements likely need to either take greater account of data structure (what occurs before when) or involve natural language processing. Further work is needed to explain the potential reasons for divergence and to refine and validate the approach for implementation in real-world settings.

Author summary

Identifying diagnostic error is challenging because it is often found only through manual review, which is so time consuming that not all patient data can be reviewed, let alone quickly enough to potentially prevent harm. In this work we address this gap by proposing machine learning methods which leverage millions of patient encounters. Since such methods could potentially be automated, they could scale to identify situations when the diagnosis may not be correct or timely, or when there may be a risk of death to the patient. This approach was validated by clinicians and shows promise for continued development. Future work will be needed to translate this work to protect patients and support clinicians.

Introduction

Diagnostic errors are harmful to patients, traumatic for providers, and costly for healthcare systems. A recent study showed that infectious diseases are one of three major disease categories causing the majority of misdiagnosis-related harms [1]. It estimated 40,000 to 80,000 deaths in hospitals in the US related to misdiagnosis. Diagnostic error evaluation can be conducted with instruments such as Safer Dx [2,3] to identify areas of improvement; however, these instruments require manual case review from clinicians with expertise and are most efficient when applied to known or probable cases of error. Thus, while such instruments are useful, they are not scalable to large populations or ultimately amenable to providing rapid feedback at the point of care. Since current gaps cannot be addressed with audits and voluntary reporting systems alone, additional capabilities are needed.

There are different approaches to measuring diagnostic accuracy. SPADE uses large amounts of administrative and billing data to quantify diagnostic errors where a change in diagnosis or treatment indicates an outcome that may have been preventable had the correct diagnosis or treatment occurred more promptly [4]. This statistical approach quantifies diagnostic errors on average but is difficult to interpret on the individual level, lacking traceability. Others have long used expert rules, which are challenging to maintain and scale to additional diseases or contexts [5].

One means of scaling up capabilities is through automation, including machine learning (ML), where models may be applied at large volume. Machine learning models have already been used to predict infectious disease. Examples include sepsis [6–9], pneumonia [10–12], upper respiratory infections (URI) [13], urinary tract infections (UTI) [14,15], and skin and soft tissue infections (SSTI) [16]. The difference between documented and predicted diagnosis can represent a kind of diagnostic divergence. Because diagnostic specificity is likely lower when the stakes are low, we also need to classify acuity. Several prior studies predict adverse outcomes such as post-procedural 30-day mortality or adverse events [17–19].

We developed automated models using electronic health records to construct a flexible approach to detect diagnostic divergence which considers infectious disease diagnosis as well as risk of potential mortality. This approach was then validated by expert reviewers.

Materials and methods

Ethics statement

This project was reviewed by the VA Salt Lake City Research & Development Committee and the Institutional Review Board at the University of Utah, and a waiver of consent and authorization was granted (127273).

Data

This study was performed using data from the Veterans Health Administration (VHA) Health Care System which cares for more than 9 million living Veterans at over one hundred emergency departments [20]. The study population included all emergency department (ED) visits to a VA medical center from January 1, 2017, to December 31, 2019. Data were extracted from the Corporate Data Warehouse (CDW), VHA’s repository for electronic clinical and administrative records.

Short visits with either little or no clinical detail (e.g., patients visiting ED for a medication refill) were excluded from further analysis if they did not have a minimum of 5 features in the first 24 hours of the ED visit. This minimum feature threshold was set to avoid uninformative visits. Additionally, ED visits were excluded from these sets if the patient was on hospice care or placed on hospice care within 72 hours. While both exclusions were made in consultation with a technical expert panel, the latter creates the possibility of missing diagnostic errors that lead to hospice and death. However, including these data in the training set would lead to learning hospice practices as “normal” and thus the decision was to exclude.

Feature extraction

Features for each visit were extracted from clinical data starting at the time of the visit through 24 hours following the emergency department encounter, including inpatient admission for patients who were subsequently hospitalized.

Features for the present ED visit included orders, medications, laboratory results, radiology imaging results, and vital signs. Medical orders were normalized where possible to values in standard vocabularies: Logical Observation Identifiers Names and Codes (LOINC) for laboratory tests, RxNorm for medications, and Current Procedural Terminology (CPT) for procedures [21–23]. Medication features were normalized to the RxNorm ingredient level. Laboratory results standardized to LOINC were assigned to the categories High, Low, or Normal based upon available structured data flags, categorical findings, or numerical results with respect to reference ranges (e.g., a serum creatinine of 2.5 mg/dL was simply coded as a high serum creatinine). Radiology imaging results were categorized as either Normal or Abnormal based on an available structured data flag. Vital signs were only included as features if noted as abnormally High or Low.

To provide context on comorbidities existing prior to the visit, with an equal lookback period for all patients, all diagnosis codes from the 365 days prior to the ED visit were extracted from both inpatient and outpatient settings.

All features were treated as binary, set to true only if the specific event was documented for the patient during the encounter. As a concrete example, a feature named “order_comprehensive_metabolic_panel” would be true only if this order was placed. The laboratory result was represented in a separate feature using suffixes added to feature names. For example, “lab_hemoglobin_n” represented that a laboratory test for hemoglobin had a Normal result. “lab_erythrocytes_l” and “lab_leukocytes_h” were interpreted similarly, except that “l” and “h” represented Low and High results, respectively. Radiology imaging result features such as “radiology_x_ray_exam_of_knee_a” and “radiology_x_ray_chest_left_view_n” represented Abnormal and Normal results, respectively. Vital sign features such as “vitals_blood_pressure_h” and “vitals_temperature_l” represented High and Low findings, respectively. An order for a medication was represented as, for example, “order_gabapentin”, with a separate feature of “medication_gabapentin” present if the medication was administered. Finally, historical diagnosis codes before the ED visit, such as “Chronic obstructive pulmonary disease, unspecified”, were represented as “icd10_dx_J44.9”.
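
To make this encoding concrete, the following Python sketch maps individual clinical events to the binary feature names described above. The helper and field names are hypothetical illustrations, not the study's actual extraction code.

```python
# Sketch of the binary feature encoding described above (hypothetical
# helper and field names, not the study's extraction pipeline).

def encode_event(event: dict):
    """Map a single clinical event to a binary feature name, or None if excluded."""
    kind = event["kind"]
    if kind == "order":
        return f"order_{event['name']}"              # e.g., order_comprehensive_metabolic_panel
    if kind == "medication":
        return f"medication_{event['name']}"         # administered medication, e.g., medication_gabapentin
    if kind == "lab":
        suffix = {"High": "h", "Low": "l", "Normal": "n"}[event["flag"]]
        return f"lab_{event['name']}_{suffix}"       # e.g., lab_leukocytes_h
    if kind == "radiology":
        suffix = "a" if event["flag"] == "Abnormal" else "n"
        return f"radiology_{event['name']}_{suffix}" # e.g., radiology_x_ray_chest_left_view_n
    if kind == "vital":
        if event["flag"] not in ("High", "Low"):
            return None                              # normal vital signs were not included as features
        return f"vitals_{event['name']}_{'h' if event['flag'] == 'High' else 'l'}"
    if kind == "icd10_dx":
        return f"icd10_dx_{event['code']}"           # prior-year diagnosis code, e.g., icd10_dx_J44.9
    return None


def encode_visit(events: list) -> dict:
    """Binary (present/absent) feature vector for one ED visit."""
    return {name: 1 for name in (encode_event(e) for e in events) if name is not None}
```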

Data splits

Included ED visits were split into 3 datasets: training, validation, and testing. The testing set was defined to ensure that a set of ED visits was held out from all training and validation activities [24]. Because documentation may vary over time within a medical center or across medical centers, the testing set included ED visits from all facilities from July 1, 2019, to December 31, 2019. Additionally, five VA medical centers were selected at random for this set, including all of their data for the entire time period. All remaining ED visits were assigned to the training and validation sets, where 90% of visits were randomly assigned to the training set and the remainder to the validation set.
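
This split logic can be summarized in a short sketch; the pandas representation and the column names ("facility_id", "visit_date") are assumptions for illustration, not the study's actual code.

```python
# Minimal sketch of the described split, assuming a pandas DataFrame of visits
# with hypothetical columns "facility_id" and "visit_date".
import numpy as np
import pandas as pd

def split_visits(visits: pd.DataFrame, holdout_facilities: set, seed: int = 0):
    """Return (training, validation, testing) subsets of the visits table."""
    rng = np.random.default_rng(seed)
    temporal_holdout = visits["visit_date"] >= pd.Timestamp("2019-07-01")
    facility_holdout = visits["facility_id"].isin(holdout_facilities)

    testing = visits[temporal_holdout | facility_holdout]
    remaining = visits[~(temporal_holdout | facility_holdout)]

    # 90% of the remaining visits to training, the rest to validation
    to_train = rng.random(len(remaining)) < 0.9
    return remaining[to_train], remaining[~to_train], testing
```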

Classification models

Two separate classification models were trained to predict the outcomes used in the proposed diagnostic deviation metric. These are referred to here as the mortality and infectious disease models.

The mortality model was trained as a binary classifier where the label was defined by whether the patient died in the 30 days following the visit, regardless of cause. Predictors included the Elixhauser score computed from diagnosis codes in the 365 days prior to the visit [25], as well as the other features described above.

The infectious disease model was a multiclass classifier which predicted one of the following: pneumonia, sepsis, SSTI, UTI, URI, or no infection. These diagnosis outcome labels were assigned using sets of ICD-10 codes from previously published studies on pneumonia [26], sepsis [27–29], SSTI [30–32], and UTI [33,34]. Labels for URI were assigned by manual curation of 45 ICD-10 codes for bronchitis (e.g., J20.9), pharyngitis (e.g., J03.91), sinusitis (e.g., J01.21), or upper respiratory infection generally (e.g., J06.9). The definition for UTI also included microbiology results to expand the evidence used in assigning a label. A diagnosis was considered present for a visit if any one of these diagnosis codes (or microbiology data for UTI) was identified in the period from 24 hours before to 48 hours after the ED visit.
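
As an illustration of this labeling logic, the sketch below assigns an infection class and a 30-day mortality label to a visit. The code sets shown are abbreviated placeholders rather than the full published definitions, and the handling of visits with codes from multiple classes is simplified.

```python
# Sketch of label assignment under the definitions above. The ICD-10 code sets
# are abbreviated placeholders, and tie handling across classes is simplified.
from datetime import timedelta

INFECTION_CODES = {
    "pneumonia": {"J18.9"},
    "sepsis": {"A41.9"},
    "ssti": {"L03.90"},
    "uti": {"N39.0"},
    "uri": {"J06.9", "J20.9", "J03.91", "J01.21"},
}

def infection_label(visit_start, coded_dx, uti_micro_positive=False):
    """Return one infection class or 'no_infection' for a visit.

    coded_dx: iterable of (timestamp, icd10_code) pairs assigned around the visit.
    """
    lo, hi = visit_start - timedelta(hours=24), visit_start + timedelta(hours=48)
    in_window = {code for ts, code in coded_dx if lo <= ts <= hi}
    for label, codes in INFECTION_CODES.items():
        if in_window & codes:
            return label            # first matching class wins in this sketch
    if uti_micro_positive:
        return "uti"                # UTI definition also allowed microbiology evidence
    return "no_infection"

def mortality_label(visit_start, death_date):
    """1 if the patient died within 30 days of the visit, else 0 (any cause)."""
    return int(death_date is not None and death_date <= visit_start + timedelta(days=30))
```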

Both the mortality and infectious disease models were trained using gradient boosted trees with the scikit-learn [35] package in Python. While many types of models could have been trained and evaluated in this work, given available computing resources one model type was selected for evaluation: XGBoost [36], an implementation of gradient boosted trees chosen for its familiarity and its proficiency in managing overfitting to promote generalization. Additionally, this implementation has been shown to scale to millions or billions of data instances while requiring fewer computing resources than some model types [36]. The implementation includes several hyperparameters which enforce model regularization to prevent models from becoming overly complex and fit to training data alone.

Hyperparameters for both models were tuned using randomized search and cross validation [37,38]. This was performed on the training set only, using 3-fold cross validation with the scikit-learn implementation, and the model with the best hyperparameters was then measured against the validation set. No metrics were gathered for the testing set until hyperparameters were finalized for both models. Hyperparameters tuned for this model included the learning rate, maximum number of trees, and maximum depth permitted per tree. No explicit feature selection or feature weighting was performed prior to model training, as each tree constructed in the training iterations could incorporate different features.
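
A minimal sketch of this tuning procedure for the binary mortality model is shown below, assuming the XGBoost scikit-learn wrapper; the parameter ranges and search settings are illustrative rather than the study's actual values, and the multiclass infectious disease model would use an analogous search.

```python
# Sketch of the randomized hyperparameter search for the binary mortality model
# (parameter ranges and search settings are illustrative, not the study's values).
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

def tune_mortality_model(X_train, y_train, seed=0):
    """Randomized search with 3-fold cross validation on the training set only."""
    param_distributions = {
        "learning_rate": uniform(0.01, 0.3),
        "n_estimators": randint(100, 1000),   # maximum number of trees
        "max_depth": randint(3, 10),          # maximum depth permitted per tree
    }
    search = RandomizedSearchCV(
        XGBClassifier(tree_method="hist", n_jobs=-1, random_state=seed),
        param_distributions=param_distributions,
        n_iter=50,
        cv=3,
        scoring="roc_auc",
        random_state=seed,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_   # then assessed once against the validation set
```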

Negative classes were undersampled so that as many visits as possible could be leveraged while minimizing computational resource requirements. We used positive predictive value (PPV), sensitivity, and area under the receiver operating characteristic curve (AUROC) to assess model performance. The area under the precision-recall curve (PRAUC) was also used as a metric, since it has been shown to be informative in assessing performance on imbalanced classes [39].
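
These metrics can be computed with standard scikit-learn functions, as in the sketch below for a fitted binary classifier; the function and argument names are illustrative, and average precision is used here as one common estimator of PRAUC.

```python
# Sketch of the evaluation metrics for a fitted binary classifier
# (function and argument names are illustrative).
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate_binary(model, X_test, y_test, threshold=0.5):
    """PPV, sensitivity, AUROC, and PRAUC on held-out data."""
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "ppv": precision_score(y_test, y_pred),
        "sensitivity": recall_score(y_test, y_pred),
        "auroc": roc_auc_score(y_test, y_prob),
        # average precision is one common estimator of the area under the PR curve
        "prauc": average_precision_score(y_test, y_prob),
    }
```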

Diagnosis deviation

Using the predictions from the trained models as well as the diagnoses assigned and the estimated pre-visit mortality, a formula was developed to quantify potential deviation in diagnosis. This formula includes the predictions from the infectious disease classification model while also being weighted by predictions from the mortality model for increased mortality. The diagnostic deviation (DD) for a given disease class is calculated as:

DD = |d − d̂| × max(0, m̂ − m)

More specifically, for any given infectious disease class (e.g., pneumonia), d represents the observed diagnosis, a 0/1 indication of whether an ICD-10 code for that diagnosis was assigned, and d̂ represents the probability of that disease given predictions from the infectious disease classifier. Meanwhile, m reflects the pre-visit mortality probability from the Elixhauser score leading up to the visit when available, and m̂ represents the probability from the mortality model predicting 30-day mortality. Note that only increases in mortality probability are considered in this calculation, so that if the probability of mortality during the visit decreases compared to the pre-visit mortality, this term is 0.
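
A minimal sketch of this calculation, following the form of the equation above, is shown here; the variable names are illustrative.

```python
# Sketch of the diagnostic deviation calculation, following the form of the
# equation above (variable names are illustrative).
def diagnostic_deviation(d_observed: int, d_predicted: float,
                         m_pre_visit: float, m_predicted: float) -> float:
    """Deviation between the documented (0/1) diagnosis and the predicted
    probability, weighted by any increase in predicted 30-day mortality."""
    mortality_increase = max(0.0, m_predicted - m_pre_visit)  # decreases contribute 0
    return abs(d_observed - d_predicted) * mortality_increase
```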

Validation

To validate how well this proposed automated metric compares to clinician judgement, ED visits were manually reviewed. Reviewers assessed whether each of the five infectious disease classes or a non-infectious disease was initially documented. Next, reviewers assessed changes in diagnostic approach or the investigation of multiple diagnoses and, for the same disease classes, identified the final diagnosis for the reason the patient came to the ED.

Additionally, two questions were adapted from the revised Safer Dx framework to rate opportunities for improved diagnosis [3]. On a Likert scale of 1 (i.e., strongly disagree) to 7 (i.e., strongly agree), two questions were asked: “The final diagnosis was not an evolution of the care team’s initial presumed diagnosis” and “The patient was at risk of significant harm, or experienced harm, that could have generally been prevented by a correct and timely diagnosis”.

Case reviews were performed by 3 clinical experts in infectious disease. After an initial exploratory round of chart reviews, the scope of the reviews was narrowed to pneumonia given the time available to reviewers. Cases assigned to each reviewer were a stratified sample where half were assigned by the highest values of our metric with respect to pneumonia and the other half were randomly sampled. Cases were excluded from sampling if the difference between expected mortality (m) and predicted mortality (m̂) reflected a decrease.

The three reviewers triple annotated a set of 20 ED visits to measure inter-rater reliability (IRR) between reviewers. After these visits, reviewers were asked to continue reviewing visits as they had available time, such that a total of 130 unique ED visits were reviewed. After reviews were completed, we assessed the correlation between the diagnostic deviation and the Likert scores for each of the two questions, and we used weighted Cohen’s kappa [40] to evaluate IRR on the cases reviewed by multiple reviewers.

Comparisons between diagnosis deviation and reviewer scores on the two questions for review were calculated by Spearman rank correlation [41].
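
Both statistics are available in standard Python libraries. The sketch below uses illustrative data, and the linear weighting scheme for the kappa is an assumption for demonstration.

```python
# Sketch of the agreement and correlation statistics used in validation.
# The data and the linear weighting of the kappa are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Illustrative paired data: Likert ratings (1-7) from two reviewers and the
# automated diagnostic deviation score for the same six cases.
reviewer_1 = np.array([1, 4, 6, 2, 7, 3])
reviewer_2 = np.array([2, 4, 5, 1, 7, 3])
dd_scores = np.array([0.02, 0.31, 0.55, 0.05, 0.72, 0.18])

# Inter-rater reliability between the two reviewers (weighted Cohen's kappa)
irr = cohen_kappa_score(reviewer_1, reviewer_2, weights="linear")

# Spearman rank correlation between the automated metric and one reviewer's ratings
rho, p_value = spearmanr(dd_scores, reviewer_1)
print(f"kappa={irr:.2f}, rho={rho:.2f}, p={p_value:.3f}")
```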

Results

Data

A total of 6,536,315 ED visits were initially included from 104 distinct VA medical centers. These visits were across 2,141,271 unique patients where the mean age at the time of the ED visit was 60 years old and 88.1% were male.

Feature extraction

From these ED visits, over 738 million features were extracted. Over 656 million features were extracted with respect to the first 24 hours of data in the ED visit and over 82 million features were extracted from diagnosis codes prior to the visit. A summary of feature counts and distinct feature values is shown in Table 1. The median number of features in the first 24 hours of the visit was 22 and the median number of diagnosis codes prior to the visit was 8.

Table 1. Counts of features for each type as well as the count of distinct features values in each.

https://doi.org/10.1371/journal.pdig.0000528.t001

Data splits

A total of 4,261,730 visits were divided into sets. 1,211,698 visits were assigned to the testing set where 641,765 were from the temporal holdout and 569,933 were from one of the five random medical centers. The remaining visits were split among the training set and validation sets which resulted in sets of 2,733,493 and 316,539 visits, respectively.

Classification models

Following hyperparameter tuning and cross validation of both the mortality and infectious disease models using the training and validation sets, these models were finalized and applied to the held-out testing set.

Performance metrics on this testing set for the mortality model are shown in Table 2. The metrics on the same set for the infectious disease model are shown in Table 3. The receiver operating characteristic curves for these models on the testing set are shown in Fig 1 and Fig 2, respectively.

Fig 1. Receiver operating characteristic curve for the mortality binary model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.g001

Fig 2. Receiver operating characteristic curve for the infectious disease multiclass model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.g002

Table 2. Performance of the mortality binary model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.t002

Table 3. Performance of the infectious disease multiclass model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.t003

Diagnostic deviation

After models were trained and validated, the proposed Diagnostic Deviation Metric was applied to all ED visits in the testing set. The distribution of this metric for pneumonia is shown in Fig 3.

Fig 3. Distribution of the diagnostic deviation for pneumonia for each visit in the testing set of this analysis. Because most values are 0, these zero values are not shown.

https://doi.org/10.1371/journal.pdig.0000528.g003

Validation

The distribution of values assigned during review is shown in Fig 4. Pairwise inter-rater reliability is shown in Table 4 where the Cohen’s Kappa ranges from 0.242 to 0.528 across both questions posed.

Fig 4. Distribution of all scores assigned by all three reviewers for the two questions asked on each reviewed case.

https://doi.org/10.1371/journal.pdig.0000528.g004

Table 4. Inter-rater reliability between pairs of reviewers. Each question is measured using weighted Cohen’s Kappa.

https://doi.org/10.1371/journal.pdig.0000528.t004

Since our proposed measure combines predictions of both mortality and diagnosis, human review scores were compared not only to the composite metric but also to each of these components of the measure independently. These correlations for all reviewers are presented in Table 5. Our proposed composite measure shows a positive correlation between human reviewers and this automated approach, which may be interpreted as weak to moderate given the Spearman correlations between 0.231 and 0.358. The composite metric also showed a higher correlation than the individual components of mortality or diagnosis deviation, although the difference was greater for the human score of diagnostic evolution than for potential patient harm. Correlations between the metric and each reviewer’s responses to the two questions are provided in Table 6. For the question of “not a diagnostic evolution,” correlations ranged from 0.35 to 0.614, with Reviewer 1 showing the strongest correlation. Meanwhile, for potential patient harm, these ranged from 0.273 to 0.357, where the strongest correlation was with Reviewer 2.

Table 5. Comparisons of the individual components of our metric as well as the composite metric when compared to each of the human review scores for all reviewers.

https://doi.org/10.1371/journal.pdig.0000528.t005

Table 6. Correlation between our proposed metric (DD) and each reviewer by responses to each question. Each question is measured using Spearman rank correlation.

https://doi.org/10.1371/journal.pdig.0000528.t006

Reviewers were allowed to provide comments summarizing each case, including whether they agreed with the diagnoses assigned, noteworthy outcomes, and any other useful notes. Some review scores had higher correlation with our proposed measure, and the comments provided by the reviewers suggest that an automated review may have been useful in those cases. Some of these comments are presented in Table 7. They include cases where one treatment was provided but pneumonia was missed as the diagnosis, and cases where an initial diagnosis was made but an adverse outcome had occurred by the time the diagnosis changed.

Table 7. Comments from reviewers on case reviews where the measure was helpful and correlated with reviewer scoring.

https://doi.org/10.1371/journal.pdig.0000528.t007

Because the correlation between the proposed measure and review scores is imperfect, there were also cases where the score from the measure was high but the reviewer did not identify an opportunity for a more timely or correct diagnosis. These are shown in Table 8. For example, in some of these cases the reviewer agreed with the diagnoses assigned. In at least one, a reviewer indicated that the patient was already in hospice and thus should have been excluded from our analysis and review.

Table 8. Comments from reviewers where the measure was not clearly helpful and not correlated with reviewer scoring.

https://doi.org/10.1371/journal.pdig.0000528.t008

Discussion

Individual models for infectious diseases and mortality demonstrated reasonable diagnostic performance statistics but positive predictive value and PR AUC were low given the low prevalence of infectious diseases. These models were trained with a very large number of features, representing a substantial fraction of the structured data readily available from an EHR.

Our primary concern was whether our combined measure of diagnostic divergence was related to diagnostic error. Correlations between human review of cases and the proposed measure showed a weak positive correlation. When examining the inter-rater reliability of our subjective diagnostic error measures and when performing an analysis of discordant cases, it became apparent that there was substantial disagreement in interpreting cases and how to weigh the individual components of diagnostic error. This suggested a further need to explore robust instruments for diagnostic error assessment. It also underscores that diagnostic error is usually not clearly present or absent but is evaluated on a continuum. Analysis also showed that cases with high diagnostic divergence scores that were obviously not diagnostic errors generally fell into three categories: diagnostic miscoding, highly complex cases, and model failures, of which the first two were most common.

By searching on the very high end of the diagnostic divergence spectrum, we demonstrated the feasibility of enriching data sets with infectious disease-related diagnostic error cases using automated means. Such identified divergence could be used to highlight a diagnosis that should have been included or to offer information where the assigned diagnosis differs from how peers may have assigned a diagnosis. Further, leveraging methods for explainable machine learning could help to identify which specifics of diagnostic testing or treatment contributed to this divergence.

The performance of our individual models was similar to other reported ML models to date when comparing common metrics like AUROC, although study criteria and settings make comparisons non-trivial [6–8,11,12,14–19]. However, given the amount of data and number of features, this is somewhat disappointing. We included millions of clinical events in modeling efforts in which thousands of model iterations were trained and evaluated with a range of hyperparameters. This suggests that the added features that are not usually present in other models did not provide much additional information over more traditional features when predicting the ultimate diagnosis. This may be, in part, due to sparse documentation of observations supporting the diagnosis and/or selective documentation of features supporting the diagnosis, which would also make it difficult to identify diagnostic error. This work has limitations, particularly with respect to assumptions made with the data. First, since ICD-10-CM codes are used in training and validation of the infectious disease classification model, there are both false positives and false negatives in these assigned codes with respect to the actual working diagnosis [42–44]; e.g., a provider may diagnose dyspnea (a non-specific diagnosis) but be thinking about and treating pneumonia correctly. Issues like these are not unique to infectious disease, but there may be opportunities to improve classification of the intended diagnosis via other data sources such as medications, results from laboratory or microbiology specimens, or text processing. The inter-rater reliability of our instruments for assessing diagnostic error was low, so it is difficult to interpret correlation with our diagnostic divergence score. Also, further work on ontologies could help to disambiguate when the documented code represents an error as opposed to simply a different stage along a natural evolution of the diagnostic workup.

We also categorized quantitative results (e.g., laboratory values), which could have destroyed predictive information. We did not use unstructured data, which could have provided more nuance. Models did not consider temporal sequence, which also could have discarded predictive information. However, even with this feature engineering, we still had a very large number of features. It may be necessary to roll up features for optimal learning, perhaps with the use of ontologies [45]. An additional limitation is that while many model types exist, this work selected one type of model and evaluated it exclusively. In continued work, it would be worthwhile to evaluate other types of models to compare their performance on these tasks, as well as their respective computational burdens.

Another limitation is that while this VA data comes from several medical centers with diverse geographic locations, our findings and methods may not generalize to other healthcare systems; however, we believe that the same approach of model training and divergence measurement will hold. In this study, patients placed on hospice care were excluded, thereby excluding an important population for whom errors may have occurred. We believe that this is likely unavoidable without precise documentation of when hospice begins to be seriously considered. Training on data of patients on hospice without a diagnostic workup may result in models learning that this pattern is not as unusual as it should be outside of the hospice setting.

Perhaps most importantly, we were limited by the lack of a gold standard, which hampered us both in obtaining labeled data to train models on and in assembling a large data set to validate against. The first problem prompted our use of anomaly detection methods in the first place and will require further research. The second underscores a lingering lack of conceptual clarity around a multifaceted concept that also merits further work.

While the performance of these models may be acceptable for initial feasibility, future studies should pursue other data sources and methods to improve model performance, as both PPV and sensitivity are relatively low for some classes. We also aim to develop methods to provide better interpretation of flagged cases. Several studies with clinicians have shown that merely showing a score is not enough: clinicians benefit from more rationale for why a model made its prediction and what action might be taken with the information [46–48].

We imagine that this approach could eventually be used to support quality-oriented chart reviews by retrieving records enriched for diagnostic error. To be routinely implemented, however, the positive predictive value will need to be improved. Before use in other healthcare systems, generalizability will also need to be established in those environments. Evaluation methods also need to be improved to more adequately refine the types of misdiagnoses that we hope to focus on. We expect that, beyond continued improvements in the adaptation of anomaly detection algorithms to detect misdiagnosis, normal data (through improved data quality, reduced miscoding, and better documentation) will cluster together more clearly, thereby making misdiagnosis easier to detect as an anomaly. In the future, we anticipate that this approach will be accurate enough in detecting misdiagnosis that it could be incorporated into quality and safety measures.

Conclusion

Our proposed method for detecting diagnostic deviation yields candidate cases enriched for diagnostic error. It also finds miscodes and difficult cases. Further refinement could yield a tool for flagging charts for review. Comparisons between human review and our approach indicate preliminary feasibility. Increases in accuracy will likely require natural language processing and methods to leverage information on time and concept relatedness. Further work is needed to develop reliable instruments for rapidly evaluating diagnostic error. Continued development is necessary to allow reviewers and users to explore more detailed information and to be convinced that the measurements are valid before such a metric can be implemented in clinical practice.

Acknowledgments

This material is the result of work supported with resources and the use of facilities at the George E. Wahlen Department of Veterans Affairs Medical Center, Salt Lake City, Utah. Data was obtained and accessed through the VA Informatics and Computing Infrastructure (VINCI). The authors thank Battelle and the University of Utah. The authors also thank the members of the Technical Expert Panel (TEP) for their feedback and participation. The views expressed in this paper are those of the authors and do not necessarily represent the position or policy of the US Department of Veterans Affairs or the United States Government.

References

  1. Newman-Toker DE, Schaffer AC, Yu-Moe CW, Nassery N, Tehrani ASS, Clemens GD, et al. Serious misdiagnosis-related harms in malpractice claims: the “Big Three”–vascular events, infections, and cancers. Diagnosis. 2019;6: 227–240. pmid:31535832
  2. Singh H, Sittig DF. Advancing the science of measurement of diagnostic errors in healthcare: the Safer Dx framework. BMJ Qual Saf. 2015;24: 103–110. pmid:25589094
  3. Singh H, Khanna A, Spitzmueller C, Meyer AND. Recommendations for using the Revised Safer Dx Instrument to help measure and improve diagnostic safety. Diagnosis. 2019;6: 315–323. pmid:31287795
  4. Liberman AL, Newman-Toker DE. Symptom-Disease Pair Analysis of Diagnostic Error (SPADE): a conceptual framework and methodological approach for unearthing misdiagnosis-related harms using big data. BMJ Qual Saf. 2018;27: 557–566. pmid:29358313
  5. Campbell SM, Bell BG, Marsden K, Spencer R, Kadam U, Perryman K, et al. A patient safety toolkit for family practices. J Patient Saf. 2020;16: e182. pmid:29461334
  6. Calvert JS, Price DA, Chettipally UK, Barton CW, Feldman MD, Hoffman JL, et al. A computational approach to early sepsis detection. Comput Biol Med. 2016;74: 69–73. pmid:27208704
  7. Nemati S, Holder A, Razmi F, Stanley MD, Clifford GD, Buchman TG. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med. 2018;46: 547. pmid:29286945
  8. Mao Q, Jay M, Hoffman JL, Calvert J, Barton C, Shimabukuro D, et al. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open. 2018;8: e017833. pmid:29374661
  9. Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7. pmid:26246167
  10. Cooper GF, Aliferis CF, Ambrosino R, Aronis J, Buchanan BG, Caruana R, et al. An evaluation of machine-learning methods for predicting pneumonia mortality. Artif Intell Med. 1997;9: 107–138. pmid:9040894
  11. Luo Y, Tang Z, Hu X, Lu S, Miao B, Hong S, et al. Machine learning for the prediction of severe pneumonia during posttransplant hospitalization in recipients of a deceased-donor kidney transplant. Ann Transl Med. 2020;8.
  12. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015. pp. 1721–1730.
  13. Chen M-J, Yang P-H, Hsieh M-T, Yeh C-H, Huang C-H, Yang C-M, et al. Machine learning to relate PM2.5 and PM10 concentrations to outpatient visits for upper respiratory tract infections in Taiwan: A nationwide analysis. World J Clin Cases. 2018;6: 200.
  14. Taylor RA, Moore CL, Cheung K-H, Brandt C. Predicting urinary tract infections in the emergency department with machine learning. PLoS One. 2018;13: e0194085. pmid:29513742
  15. Møller JK, Sørensen M, Hardahl C. Prediction of risk of acquiring urinary tract infection during hospital stay based on machine-learning: A retrospective cohort study. PLoS One. 2021;16: e0248636. pmid:33788888
  16. O’Brien WJ, Ramos RD, Gupta K, Itani KMF. Neural network model to detect long-term skin and soft tissue infection after hernia repair. Surg Infect (Larchmt). 2021;22: 668–674. pmid:33253060
  17. Shouval R, Hadanny A, Shlomo N, Iakobishvili Z, Unger R, Zahger D, et al. Machine learning for prediction of 30-day mortality after ST elevation myocardial infarction: an Acute Coronary Syndrome Israeli Survey data mining study. Int J Cardiol. 2017;246: 7–13. pmid:28867023
  18. Blom MC, Ashfaq A, Sant’Anna A, Anderson PD, Lingman M. Training machine learning models to predict 30-day mortality in patients discharged from the emergency department: a retrospective, population-based registry study. BMJ Open. 2019;9: e028015. pmid:31401594
  19. Heyman ET, Ashfaq A, Khoshnood A, Ohlsson M, Ekelund U, Holmqvist LD, et al. Improving Machine Learning 30-Day Mortality Prediction by Discounting Surprising Deaths. J Emerg Med. 2021;61: 763–773. pmid:34716042
  20. Veterans Health Administration. 30 Nov 2023 [cited 29 Nov 2023]. Available: https://www.va.gov/health/.
  21. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT Prof. 2005;7: 17–23.
  22. Forrey AW, McDonald CJ, DeMoor G, Huff SM, Leavelle D, Leland D, et al. Logical Observation Identifier Names and Codes (LOINC) database: a public use set of codes and names for electronic reporting of clinical laboratory test results. Clin Chem. 1996;42: 81–90. pmid:8565239
  23. Thorwarth WT Jr. CPT: an open system that describes all that you do. Journal of the American College of Radiology. 2008;5: 555–560. pmid:18359442
  24. Quiñonero-Candela J, Sugiyama M, Lawrence ND, Schwaighofer A. Dataset Shift in Machine Learning. MIT Press; 2009.
  25. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998; 8–27. pmid:9431328
  26. Stevenson KB, Khan Y, Dickman J, Gillenwater T, Kulich P, Myers C, et al. Administrative coding data, compared with CDC/NHSN criteria, are poor indicators of health care–associated infections. Am J Infect Control. 2008;36: 155–164. pmid:18371510
  27. Fleischmann-Struzek C, Thomas-Rüddel DO, Schettler A, Schwarzkopf D, Stacke A, Seymour CW, et al. Comparing the validity of different ICD coding abstraction strategies for sepsis case identification in German claims data. PLoS One. 2018;13: e0198847. pmid:30059504
  28. Bouza C, Lopez-Cuadrado T, Amate-Blanco JM. Use of explicit ICD9-CM codes to identify adult severe sepsis: impacts on epidemiological estimates. Crit Care. 2016;20: 313. pmid:27716355
  29. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA. 2016;315: 801–810. pmid:26903338
  30. Levine PJ, Elman MR, Kullar R, Townes JM, Bearden DT, Vilches-Tran R, et al. Use of electronic health record data to identify skin and soft tissue infections in primary care settings: a validation study. BMC Infect Dis. 2013;13: 171. pmid:23574801
  31. Walsh TL, Chan L, Konopka CI, Burkitt MJ, Moffa MA, Bremmer DN, et al. Appropriateness of antibiotic management of uncomplicated skin and soft tissue infections in hospitalized adult patients. BMC Infect Dis. 2016;16: 721. pmid:27899072
  32. Suaya JA, Eisenberg DF, Fang C, Miller LG. Skin and soft tissue infections and associated complications among commercially insured patients aged 0–64 years with and without diabetes in the US. PLoS One. 2013;8: e60057.
  33. Daniels KR, Lee GC, Frei CR. Trends in catheter-associated urinary tract infections among a national cohort of hospitalized adults, 2001–2010. Am J Infect Control. 2014;42: 17–22. pmid:24268457
  34. Carbo JF, Ruh CA, Kurtzhalts KE, Ott MC, Sellick JA, Mergenhagen KA. Male veterans with complicated urinary tract infections: Influence of a patient-centered antimicrobial stewardship program. Am J Infect Control. 2016;44: 1549–1553. pmid:27388268
  35. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12: 2825–2830.
  36. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–794.
  37. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13: 281–305. Available: https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
  38. Hastie T, Friedman J, Tibshirani R. The Elements of Statistical Learning. 2001 [cited 29 Nov 2023].
  39. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: e0118432. pmid:25738806
  40. Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968;70: 213.
  41. Zar JH. Spearman Rank Correlation. Encyclopedia of Biostatistics. 2005.
  42. Linder JA, Kaleba EO, Kmetik KS. Using electronic health records to measure physician performance for acute conditions in primary care: empirical evaluation of the community-acquired pneumonia clinical quality measure set. Med Care. 2009;47: 208–216. pmid:19169122
  43. Aronsky D, Haug PJ, Lagor C, Dean NC. Accuracy of administrative data for identifying patients with pneumonia. Am J Med Qual. 2005;20: 319–328. pmid:16280395
  44. van de Garde EMW, Oosterheert JJ, Bonten M, Kaplan RC, Leufkens HGM. International classification of diseases codes showed modest sensitivity for detecting community-acquired pneumonia. J Clin Epidemiol. 2007;60: 834–838. pmid:17606180
  45. Choi E, Xiao C, Stewart WF, Sun J. MiME: Multilevel medical embedding of electronic health records for predictive healthcare. Advances in Neural Information Processing Systems. 2018 [cited 20 Dec 2023]. Available: https://proceedings.neurips.cc/paper/2018/hash/934b535800b1cba8f96a5d72f72f1611-Abstract.html
  46. Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. Proceedings of Machine Learning Research. 2019;106 [cited 26 Oct 2023]. Available: https://proceedings.mlr.press/v106/tonekaboni19a.html
  47. Henry KE, Kornfield R, Sridharan A, et al. Human–machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. npj Digital Medicine. 2022 [cited 26 Oct 2023]. Available: https://www.nature.com/articles/s41746-022-00597-7
  48. Sandhu S, Lin AL, Brajer N, Sperling J, Ratliff W, Bedoya AD, Balu S, O’Brien C, Sendak MP. Integrating a machine learning system into clinical workflows: qualitative study. Journal of Medical Internet Research. 2020;22: e22421 [cited 26 Oct 2023]. Available: https://www.jmir.org/2020/11/e22421/