
Automating detection of diagnostic error of infectious diseases using machine learning

  • Kelly S. Peterson ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    kelly.peterson@hsc.utah.edu

    Affiliations Veterans Health Administration, Office of Analytics and Performance Integration, Washington D.C., District of Columbia, United States of America, Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America

  • Alec B. Chapman,

    Roles Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America, Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America

  • Wathsala Widanagamaachchi,

    Roles Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing

    Affiliations Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America, Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America

  • Jesse Sutton,

    Roles Validation

    Affiliation Veterans Affairs Health Care System, Minneapolis, Minnesota, United States of America

  • Brennan Ochoa,

    Roles Validation

    Affiliation Rocky Mountain Infectious Diseases Specialists, Aurora, Colorado, United States of America

  • Barbara E. Jones,

    Roles Conceptualization

    Affiliations Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America, Division of Pulmonary & Critical Care Medicine, University of Utah, Salt Lake City, Utah, United States of America

  • Vanessa Stevens,

    Roles Writing – original draft, Writing – review & editing

    Affiliations Veterans Health Administration, Office of Analytics and Performance Integration, Washington D.C., District of Columbia, United States of America, Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America

  • David C. Classen,

    Roles Conceptualization, Methodology

    Affiliation Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America

  • Makoto M. Jones

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Validation, Writing – original draft, Writing – review & editing

    Affiliations Veterans Health Administration, Office of Analytics and Performance Integration, Washington D.C., District of Columbia, United States of America, Department of Internal Medicine, Division of Epidemiology, University of Utah, Salt Lake City, Utah, United States of America, Veterans Affairs Health Care System, Salt Lake City, Utah, United States of America

Abstract

Diagnostic error, a cause of substantial morbidity and mortality, is largely discovered and evaluated through self-report and manual review, which is costly and not suited to real-time intervention. Opportunities exist to leverage electronic health record data for automated detection of potential misdiagnosis, executed at scale and generalized across diseases. We propose a novel automated approach to identifying diagnostic divergence considering both diagnosis and risk of mortality. Our objective was to identify cases of emergency department infectious disease misdiagnoses by measuring the deviation between predicted diagnosis and documented diagnosis, weighted by mortality. Two machine learning models were trained to predict infectious disease and mortality using the first 24 hours of data. Charts were manually reviewed by clinicians to determine whether there could have been a more correct or timely diagnosis. The proposed approach was validated against manual reviews and compared using the Spearman rank correlation. We analyzed 6.5 million ED visits and over 700 million associated clinical features from over one hundred emergency departments. The testing set performances of the infectious disease model (Macro F1 = 86.7, AUROC 90.6 to 94.7) and mortality model (Macro F1 = 97.6, AUROC 89.1 to 89.1) were in expected ranges. Human reviews and the proposed automated metric demonstrated positive correlations ranging from 0.231 to 0.358. The proposed approach for diagnostic deviation shows promise as a potential tool for clinicians to find diagnostic errors. Given the vast number of clinical features used in this analysis, further improvements likely need to either take greater account of data structure (what occurs before when) or involve natural language processing. Further work is needed to explain the potential reasons for divergence and to refine and validate the approach for implementation in real-world settings.

Author summary

Identifying diagnostic error is challenging because it is often found only through manual review, which is so time consuming that not all patient data can be reviewed, let alone quickly enough to potentially prevent harm. In this work we address this gap by proposing machine learning methods which leverage millions of patient encounters. Since such methods could potentially be automated, they could scale to identify situations when the diagnosis may not be correct or timely, or when there may be a risk of death to the patient. This approach was validated by clinicians and shows promise for continued development. Future work will be needed to translate this work to protect patients and support clinicians.

Introduction

Diagnostic errors are harmful to patients, traumatic for providers, and costly for healthcare systems. A recent study showed that infectious diseases are one of three major disease categories causing the majority of misdiagnosis-related harms [1]. It estimated 40,000 to 80,000 deaths in hospitals in the US related to misdiagnosis. Diagnostic error evaluation can be conducted with instruments such as Safer Dx [2,3] to identify areas of improvement; however, these instruments require manual case review from clinicians with expertise and are most efficient when applied to known or probable cases of error. Thus, while such instruments are useful, they are not scalable to large populations or ultimately amenable to providing rapid feedback at the point of care. Since current gaps cannot be addressed with audits and voluntary reporting systems alone, additional capabilities are needed.

There are different approaches to measuring diagnostic accuracy. SPADE uses large amounts of administrative and billing data to quantify diagnostic errors where a change in diagnosis or treatment indicates an outcome that may have been preventable had the correct diagnosis or treatment occurred more promptly [4]. This statistical approach quantifies diagnostic errors on average but is difficult to interpret on the individual level, lacking traceability. Others have long used expert rules, which are challenging to maintain and scale to additional diseases or contexts [5].

One means of scaling up capabilities is through automation, including machine learning (ML), where models may be applied at large volume. Machine learning models have already been used to predict infectious disease. Examples include sepsis [6–9], pneumonia [10–12], upper respiratory infections (URI) [13], urinary tract infections (UTI) [14,15], and skin and soft tissue infections (SSTI) [16]. The difference between documented and predicted diagnosis can represent a kind of diagnostic divergence. Because diagnostic specificity is likely lower when the stakes are low, we also need to classify acuity. Several prior studies predict adverse outcomes such as post-procedural 30-day mortality or adverse events [17–19].

We developed automated models using electronic health records to construct a flexible approach to detect diagnostic divergence which considers infectious disease diagnosis as well as risk of potential mortality. This approach was then validated by expert reviewers.

Materials and methods

Ethics statement

This project was reviewed by the VA Salt Lake City Research & Development Committee and the Institutional Review Board at the University of Utah, and a waiver of consent and authorization was granted (127273).

Data

This study was performed using data from the Veterans Health Administration (VHA) Health Care System which cares for more than 9 million living Veterans at over one hundred emergency departments [20]. The study population included all emergency department (ED) visits to a VA medical center from January 1, 2017, to December 31, 2019. Data were extracted from the Corporate Data Warehouse (CDW), VHA’s repository for electronic clinical and administrative records.

Short visits with either little or no clinical detail (e.g., patients visiting ED for a medication refill) were excluded from further analysis if they did not have a minimum of 5 features in the first 24 hours of the ED visit. This minimum feature threshold was set to avoid uninformative visits. Additionally, ED visits were excluded from these sets if the patient was on hospice care or placed on hospice care within 72 hours. While both exclusions were made in consultation with a technical expert panel, the latter creates the possibility of missing diagnostic errors that lead to hospice and death. However, including these data in the training set would lead to learning hospice practices as “normal” and thus the decision was to exclude.

Feature extraction

Features for each visit were extracted from clinical data starting at the time of the visit through 24 hours following the emergency department encounter, including inpatient admission for patients who were subsequently hospitalized.

Features for the present ED visit included orders, medications, laboratory results, radiology imaging results, and vital signs. Medical orders were normalized where possible to values in standard vocabularies: Logical Observation Identifiers Names and Codes (LOINC) for laboratory tests, RxNorm for medications, and Current Procedural Terminology (CPT) for procedures [21–23]. Medication features were normalized to the RxNorm ingredient level. Laboratory results standardized to LOINC were assigned to the categories High, Low, or Normal based upon available structured data flags, categorical findings, or numerical results with respect to reference ranges (e.g., a serum creatinine of 2.5 mg/dL was simply coded as a high serum creatinine). Radiology imaging results were categorized as either Normal or Abnormal based on an available structured data flag. Vital signs were only included as features if noted as abnormally High or Low.

To provide context on comorbidities existing prior to the visit, with an equal lookback period for all patients, all diagnosis codes from the 365 days prior to the ED visit were extracted from both inpatient and outpatient settings.

All features were treated as binary, set to true only if the specific event was documented for the patient during the encounter. As a concrete example, a feature named “order_comprehensive_metabolic_panel” would be true only if this order was placed. The laboratory result was represented in a separate feature using suffixes added to feature names. For example, “lab_hemoglobin_n” represented that a laboratory test for hemoglobin had a Normal result. “lab_erythrocytes_l” and “lab_leukocytes_h” were interpreted similarly, except that “l” and “h” represented Low and High results, respectively. Radiology imaging result features such as “radiology_x_ray_exam_of_knee_a” and “radiology_x_ray_chest_left_view_n” represented Abnormal and Normal results, respectively. Vital sign features such as “vitals_blood_pressure_h” and “vitals_temperature_l” represented High and Low findings, respectively. An order for a medication was represented as, for example, “order_gabapentin”, with a separate feature of “medication_gabapentin” present if the medication was administered. Finally, historical diagnosis codes before the ED visit, such as “Chronic obstructive pulmonary disease, unspecified”, were represented as “icd10_dx_J44.9”.
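
To make this encoding concrete, the following Python sketch maps individual clinical events to the binary feature names described above. The helper and field names are hypothetical illustrations, not the study's actual extraction code.

```python
# Sketch of the binary feature encoding described above (hypothetical
# helper and field names, not the study's extraction pipeline).

def encode_event(event: dict):
    """Map a single clinical event to a binary feature name, or None if excluded."""
    kind = event["kind"]
    if kind == "order":
        return f"order_{event['name']}"              # e.g., order_comprehensive_metabolic_panel
    if kind == "medication":
        return f"medication_{event['name']}"         # administered medication, e.g., medication_gabapentin
    if kind == "lab":
        suffix = {"High": "h", "Low": "l", "Normal": "n"}[event["flag"]]
        return f"lab_{event['name']}_{suffix}"       # e.g., lab_leukocytes_h
    if kind == "radiology":
        suffix = "a" if event["flag"] == "Abnormal" else "n"
        return f"radiology_{event['name']}_{suffix}" # e.g., radiology_x_ray_chest_left_view_n
    if kind == "vital":
        if event["flag"] not in ("High", "Low"):
            return None                              # normal vital signs were not included as features
        return f"vitals_{event['name']}_{'h' if event['flag'] == 'High' else 'l'}"
    if kind == "icd10_dx":
        return f"icd10_dx_{event['code']}"           # prior-year diagnosis code, e.g., icd10_dx_J44.9
    return None


def encode_visit(events: list) -> dict:
    """Binary (present/absent) feature vector for one ED visit."""
    return {name: 1 for name in (encode_event(e) for e in events) if name is not None}
```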

Data splits

Included ED visits were split into 3 datasets: training, validation, and testing. The testing set was defined to ensure that a set of ED visits was held out from all training and validation activities [24]. Because documentation may vary over time within a medical center or across medical centers, the testing set included ED visits from all facilities from July 1, 2019, to December 31, 2019. Additionally, five VA medical centers were selected at random for this set, including all of their data for the entire time period. All remaining ED visits were assigned to the training and validation sets, where 90% of visits were randomly assigned to the training set and the remainder to the validation set.
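
This split logic can be summarized in a short sketch; the pandas representation and the column names ("facility_id", "visit_date") are assumptions for illustration, not the study's actual code.

```python
# Minimal sketch of the described split, assuming a pandas DataFrame of visits
# with hypothetical columns "facility_id" and "visit_date".
import numpy as np
import pandas as pd

def split_visits(visits: pd.DataFrame, holdout_facilities: set, seed: int = 0):
    """Return (training, validation, testing) subsets of the visits table."""
    rng = np.random.default_rng(seed)
    temporal_holdout = visits["visit_date"] >= pd.Timestamp("2019-07-01")
    facility_holdout = visits["facility_id"].isin(holdout_facilities)

    testing = visits[temporal_holdout | facility_holdout]
    remaining = visits[~(temporal_holdout | facility_holdout)]

    # 90% of the remaining visits to training, the rest to validation
    to_train = rng.random(len(remaining)) < 0.9
    return remaining[to_train], remaining[~to_train], testing
```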

Classification models

Two separate classification models were trained to predict the outcomes used in the proposed diagnostic deviation metric. These are referred to here as the mortality and infectious disease models.

The mortality model was trained as a binary classifier where the label was defined by whether the patient died in the 30 days following the visit, regardless of cause. Predictors included the Elixhauser score computed from diagnosis codes in the 365 days prior to the visit [25], as well as the other features described above.

The infectious disease model was a multiclass classifier which predicted one of the following: pneumonia, sepsis, SSTI, UTI, URI, or no infection. These diagnosis outcome labels were assigned using sets of ICD-10 codes from previously published studies on pneumonia [26], sepsis [27–29], SSTI [30–32], and UTI [33,34]. Labels for URI were assigned by manual curation of 45 ICD-10 codes for bronchitis (e.g., J20.9), pharyngitis (e.g., J03.91), sinusitis (e.g., J01.21), or upper respiratory infection generally (e.g., J06.9). The definition for UTI also included microbiology results to expand the evidence used in assigning a label. A diagnosis was considered present for a visit if any one of these diagnosis codes (or microbiology data for UTI) was identified in the period from 24 hours before to 48 hours after the ED visit.
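
As an illustration of this labeling logic, the sketch below assigns an infection class and a 30-day mortality label to a visit. The code sets shown are abbreviated placeholders rather than the full published definitions, and the handling of visits with codes from multiple classes is simplified.

```python
# Sketch of label assignment under the definitions above. The ICD-10 code sets
# are abbreviated placeholders, and tie handling across classes is simplified.
from datetime import timedelta

INFECTION_CODES = {
    "pneumonia": {"J18.9"},
    "sepsis": {"A41.9"},
    "ssti": {"L03.90"},
    "uti": {"N39.0"},
    "uri": {"J06.9", "J20.9", "J03.91", "J01.21"},
}

def infection_label(visit_start, coded_dx, uti_micro_positive=False):
    """Return one infection class or 'no_infection' for a visit.

    coded_dx: iterable of (timestamp, icd10_code) pairs assigned around the visit.
    """
    lo, hi = visit_start - timedelta(hours=24), visit_start + timedelta(hours=48)
    in_window = {code for ts, code in coded_dx if lo <= ts <= hi}
    for label, codes in INFECTION_CODES.items():
        if in_window & codes:
            return label            # first matching class wins in this sketch
    if uti_micro_positive:
        return "uti"                # UTI definition also allowed microbiology evidence
    return "no_infection"

def mortality_label(visit_start, death_date):
    """1 if the patient died within 30 days of the visit, else 0 (any cause)."""
    return int(death_date is not None and death_date <= visit_start + timedelta(days=30))
```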

Both the mortality and infectious disease models were trained using gradient boosted trees with the scikit-learn [35] package in Python. While many types of models could have been trained and evaluated in this work, given available computing resources one model type was selected for evaluation: XGBoost [36], an implementation of gradient boosted trees chosen for its familiarity and its proficiency in managing overfitting to promote generalization. Additionally, this implementation has been shown to scale to millions or billions of data instances while requiring fewer computing resources than some model types [36]. The implementation includes several hyperparameters which enforce model regularization to prevent models from becoming overly complex and fit to training data alone.

Hyperparameters for both models were tuned using randomized search and cross validation [37,38]. This was performed on the training set only, using 3-fold cross validation with the scikit-learn implementation, and the model with the best hyperparameters was then measured against the validation set. No metrics were gathered for the testing set until hyperparameters were finalized for both models. Hyperparameters tuned for this model included the learning rate, maximum number of trees, and maximum depth permitted per tree. No explicit feature selection or feature weighting was performed prior to model training, as each tree constructed in the training iterations could incorporate different features.
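
A minimal sketch of this tuning procedure for the binary mortality model is shown below, assuming the XGBoost scikit-learn wrapper; the parameter ranges and search settings are illustrative rather than the study's actual values, and the multiclass infectious disease model would use an analogous search.

```python
# Sketch of the randomized hyperparameter search for the binary mortality model
# (parameter ranges and search settings are illustrative, not the study's values).
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

def tune_mortality_model(X_train, y_train, seed=0):
    """Randomized search with 3-fold cross validation on the training set only."""
    param_distributions = {
        "learning_rate": uniform(0.01, 0.3),
        "n_estimators": randint(100, 1000),   # maximum number of trees
        "max_depth": randint(3, 10),          # maximum depth permitted per tree
    }
    search = RandomizedSearchCV(
        XGBClassifier(tree_method="hist", n_jobs=-1, random_state=seed),
        param_distributions=param_distributions,
        n_iter=50,
        cv=3,
        scoring="roc_auc",
        random_state=seed,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_   # then assessed once against the validation set
```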

Negative classes were undersampled so that as many visits as possible could be leveraged while minimizing computational resource requirements. We used positive predictive value (PPV), sensitivity, and area under the receiver operating characteristic curve (AUROC) to assess model performance. The area under the precision-recall curve (PRAUC) was also used as a metric, since it has been shown to be informative in assessing performance on imbalanced classes [39].
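
These metrics can be computed with standard scikit-learn functions, as in the sketch below for a fitted binary classifier; the function and argument names are illustrative, and average precision is used here as one common estimator of PRAUC.

```python
# Sketch of the evaluation metrics for a fitted binary classifier
# (function and argument names are illustrative).
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate_binary(model, X_test, y_test, threshold=0.5):
    """PPV, sensitivity, AUROC, and PRAUC on held-out data."""
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "ppv": precision_score(y_test, y_pred),
        "sensitivity": recall_score(y_test, y_pred),
        "auroc": roc_auc_score(y_test, y_prob),
        # average precision is one common estimator of the area under the PR curve
        "prauc": average_precision_score(y_test, y_prob),
    }
```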

Diagnosis deviation

Using the predictions from the trained models as well as the diagnoses assigned and the estimated pre-visit mortality, a formula was developed to quantify potential deviation in diagnosis. This formula includes the predictions from the infectious disease classification model while also being weighted by predictions from the mortality model for increased mortality. The diagnostic deviation (DD) for a given disease class is calculated as:

DD = |d − d̂| × max(0, m̂ − m)

More specifically, for any given infectious disease class (e.g., pneumonia), d represents the observed diagnosis, a 0/1 indication of whether an ICD-10 code for that diagnosis was assigned, and d̂ represents the probability of that disease given predictions from the infectious disease classifier. Meanwhile, m reflects the pre-visit mortality probability from the Elixhauser score leading up to the visit when available, and m̂ represents the probability from the mortality model predicting 30-day mortality. Note that only increases in mortality probability are considered in this calculation, so that if the probability of mortality during the visit decreases compared to the pre-visit mortality, this term is 0.
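
A minimal sketch of this calculation, following the form of the equation above, is shown here; the variable names are illustrative.

```python
# Sketch of the diagnostic deviation calculation, following the form of the
# equation above (variable names are illustrative).
def diagnostic_deviation(d_observed: int, d_predicted: float,
                         m_pre_visit: float, m_predicted: float) -> float:
    """Deviation between the documented (0/1) diagnosis and the predicted
    probability, weighted by any increase in predicted 30-day mortality."""
    mortality_increase = max(0.0, m_predicted - m_pre_visit)  # decreases contribute 0
    return abs(d_observed - d_predicted) * mortality_increase
```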

Validation

To validate how well this proposed automated metric compares to clinician judgement, ED visits were manually reviewed. Reviewers assessed whether each of the five infectious disease classes or a non-infectious disease was initially documented. Next, reviewers assessed changes in diagnostic approach or the investigation of multiple diagnoses and, for the same disease classes, identified the final diagnosis for the reason the patient came to the ED.

Additionally, two questions were adapted from the revised Safer Dx framework to rate opportunities for improved diagnosis [3]. On a Likert scale of 1 (i.e., strongly disagree) to 7 (i.e., strongly agree), two questions were asked: “The final diagnosis was not an evolution of the care team’s initial presumed diagnosis” and “The patient was at risk of significant harm, or experienced harm, that could have generally been prevented by a correct and timely diagnosis”.

Case reviews were performed by 3 clinical experts in infectious disease. After an initial exploratory round of chart reviews, the scope of the reviews was narrowed to pneumonia given the time available to reviewers. Cases assigned to each reviewer were a stratified sample where half were assigned by the highest values of our metric with respect to pneumonia and the other half were randomly sampled. Cases were excluded from sampling if the difference between expected mortality (m) and predicted mortality (m̂) reflected a decrease.

The three reviewers triple annotated a set of 20 ED visits to measure inter-rater reliability (IRR) between reviewers. After these visits, reviewers were asked to continue reviewing visits as they had available time, such that a total of 130 unique ED visits were reviewed. After reviews were completed, we assessed the correlation between the diagnostic deviation and the Likert scores for each of the two questions, and we used weighted Cohen’s kappa [40] to evaluate IRR on the cases reviewed by multiple reviewers.

Comparisons between diagnosis deviation and reviewer scores on the two questions for review were calculated by Spearman rank correlation [41].
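
Both statistics are available in standard Python libraries. The sketch below uses illustrative data, and the linear weighting scheme for the kappa is an assumption for demonstration.

```python
# Sketch of the agreement and correlation statistics used in validation.
# The data and the linear weighting of the kappa are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Illustrative paired data: Likert ratings (1-7) from two reviewers and the
# automated diagnostic deviation score for the same six cases.
reviewer_1 = np.array([1, 4, 6, 2, 7, 3])
reviewer_2 = np.array([2, 4, 5, 1, 7, 3])
dd_scores = np.array([0.02, 0.31, 0.55, 0.05, 0.72, 0.18])

# Inter-rater reliability between the two reviewers (weighted Cohen's kappa)
irr = cohen_kappa_score(reviewer_1, reviewer_2, weights="linear")

# Spearman rank correlation between the automated metric and one reviewer's ratings
rho, p_value = spearmanr(dd_scores, reviewer_1)
print(f"kappa={irr:.2f}, rho={rho:.2f}, p={p_value:.3f}")
```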

Results

Data

A total of 6,536,315 ED visits were initially included from 104 distinct VA medical centers. These visits were across 2,141,271 unique patients where the mean age at the time of the ED visit was 60 years old and 88.1% were male.

Feature extraction

From these ED visits, over 738 million features were extracted. Over 656 million features were extracted with respect to the first 24 hours of data in the ED visit and over 82 million features were extracted from diagnosis codes prior to the visit. A summary of feature counts and distinct feature values is shown in Table 1. The median number of features in the first 24 hours of the visit was 22 and the median number of diagnosis codes prior to the visit was 8.

Table 1. Counts of features for each type as well as the count of distinct features values in each.

https://doi.org/10.1371/journal.pdig.0000528.t001

Data splits

A total of 4,261,730 visits were divided into sets. 1,211,698 visits were assigned to the testing set where 641,765 were from the temporal holdout and 569,933 were from one of the five random medical centers. The remaining visits were split among the training set and validation sets which resulted in sets of 2,733,493 and 316,539 visits, respectively.

Classification models

Following hyperparameter tuning and cross validation of both the mortality and infectious disease models using the training and validation sets, these models were finalized and applied to the held-out testing set.

Performance metrics on this testing set for the mortality model are shown in Table 2. The metrics on the same set for the infectious disease model are shown in Table 3. The receiver operating characteristic curves for these models on the testing set are shown in Fig 1 and Fig 2, respectively.

Fig 1. Receiver operating characteristic curve for the mortality binary model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.g001

Fig 2. Receiver operating characteristic curve for the infectious disease multiclass model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.g002

Table 2. Performance of the mortality binary model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.t002

Table 3. Performance of the infectious disease multiclass model on the testing set.

https://doi.org/10.1371/journal.pdig.0000528.t003

Diagnostic deviation

After models were trained and validated, the proposed Diagnostic Deviation Metric was applied to all ED visits in the testing set. The distribution of this metric for pneumonia is shown in Fig 3.

Fig 3. Distribution of the diagnostic deviation for pneumonia for each visit in the testing set of this analysis. Because most values are 0, these zero values are not shown.

https://doi.org/10.1371/journal.pdig.0000528.g003

Validation

The distribution of values assigned during review is shown in Fig 4. Pairwise inter-rater reliability is shown in Table 4 where the Cohen’s Kappa ranges from 0.242 to 0.528 across both questions posed.

Fig 4. Distribution of all scores assigned by all three reviewers for the two questions asked on each reviewed case.

https://doi.org/10.1371/journal.pdig.0000528.g004

Table 4. Inter-rater reliability between pairs of reviewers. Each question is measured using weighted Cohen’s Kappa.

https://doi.org/10.1371/journal.pdig.0000528.t004

Since our proposed measure combines predictions of both mortality and diagnosis, human review scores were compared not only to the composite metric but also to each of these components of the measure independently. These correlations for all reviewers are presented in Table 5. Our proposed composite measure shows a positive correlation between human reviewers and this automated approach, which may be interpreted as weak to moderate given the Spearman correlations between 0.231 and 0.358. The composite metric also showed a higher correlation than the individual components of mortality or diagnosis deviation, although the difference was greater for the human score of diagnostic evolution than for potential patient harm. Correlations between the metric and each reviewer’s responses to the two questions are provided in Table 6. For the question of “not a diagnostic evolution,” correlations ranged from 0.35 to 0.614, with Reviewer 1 showing the strongest correlation. Meanwhile, for potential patient harm, these ranged from 0.273 to 0.357, where the strongest correlation was with Reviewer 2.

Table 5. Comparisons of the individual components of our metric as well as the composite metric when compared to each of the human review scores for all reviewers.

https://doi.org/10.1371/journal.pdig.0000528.t005

Table 6. Correlation between our proposed metric (DD) and each reviewer by responses to each question. Each question is measured using Spearman rank correlation.

https://doi.org/10.1371/journal.pdig.0000528.t006

Reviewers were allowed to provide comments summarizing each case, including whether they agreed with the diagnoses assigned, noteworthy outcomes, and any other useful notes. Some review scores had higher correlation with our proposed measure, and the comments provided by the reviewers suggest that an automated review may have been useful in those cases. Some of these comments are presented in Table 7. They include cases where one treatment was provided but pneumonia was missed as the diagnosis, and cases where an initial diagnosis was made but an adverse outcome had occurred by the time the diagnosis changed.

Table 7. Comments from reviewers on case reviews where the measure was helpful and correlated with reviewer scoring.

https://doi.org/10.1371/journal.pdig.0000528.t007

Because the correlation between the proposed measure and review scores is imperfect, there were also cases where the score from the measure was high but the reviewer did not identify an opportunity for a more timely or correct diagnosis. These are shown in Table 8. For example, in some of these cases the reviewer agreed with the diagnoses assigned. In at least one, a reviewer indicated that the patient was already in hospice and thus should have been excluded from our analysis and review.

Table 8. Comments from reviewers where the measure was not clearly helpful and not correlated with reviewer scoring.

https://doi.org/10.1371/journal.pdig.0000528.t008

Discussion

Individual models for infectious diseases and mortality demonstrated reasonable diagnostic performance statistics but positive predictive value and PR AUC were low given the low prevalence of infectious diseases. These models were trained with a very large number of features, representing a substantial fraction of the structured data readily available from an EHR.

Our primary concern was whether our combined measure of diagnostic divergence was related to diagnostic error. Correlations between human review of cases and the proposed measure showed a weak positive correlation. When examining the inter-rater reliability of our subjective diagnostic error measures and when performing an analysis of discordant cases, it became apparent that there was substantial disagreement in interpreting cases and how to weigh the individual components of diagnostic error. This suggested a further need to explore robust instruments for diagnostic error assessment. It also underscores that diagnostic error is usually not clearly present or absent but is evaluated on a continuum. Analysis also showed that cases with high diagnostic divergence scores that were obviously not diagnostic errors generally fell into three categories: diagnostic miscoding, highly complex cases, and model failures, of which the first two were most common.

By searching on the very high end of the diagnostic divergence spectrum, we demonstrated the feasibility of enriching data sets with infectious disease-related diagnostic error cases using automated means. Such identified divergence could be used to highlight a diagnosis that should have been included or to offer information where the assigned diagnosis differs from how peers may have assigned a diagnosis. Further, leveraging methods for explainable machine learning could help to identify which specifics of diagnostic testing or treatment contributed to this divergence.

The performance of our individual models was similar to other reported ML models to date when comparing common metrics like AUROC, although study criteria and settings make comparisons non-trivial [6–8,11,12,14–19]. However, given the amount of data and number of features, this is somewhat disappointing. We included millions of clinical events in modeling efforts in which thousands of model iterations were trained and evaluated with a range of hyperparameters. This suggests that the added features that are not usually present in other models did not provide much additional information over more traditional features when predicting the ultimate diagnosis. This may be, in part, due to sparse documentation of observations supporting the diagnosis and/or selective documentation of features supporting the diagnosis, which would also make it difficult to identify diagnostic error. This work has limitations, particularly with respect to assumptions made with the data. First, since ICD-10-CM codes are used in training and validation of the infectious disease classification model, there are both false positives and false negatives in these assigned codes with respect to the actual working diagnosis [42–44]; e.g., a provider may diagnose dyspnea (a non-specific diagnosis) but be thinking about and treating pneumonia correctly. Issues like these are not unique to infectious disease, but there may be opportunities to improve classification of the intended diagnosis via other data sources such as medications, results from laboratory or microbiology specimens, or text processing. The inter-rater reliability of our instruments for assessing diagnostic error was low, so it is difficult to interpret correlation with our diagnostic divergence score. Also, further work on ontologies could help to disambiguate when the documented code represents an error as opposed to simply a different stage along a natural evolution of the diagnostic workup.

We also categorized quantitative results (e.g., laboratory values), which could have destroyed predictive information. We did not use unstructured data, which could have provided more nuance. Models did not consider temporal sequence, which also could have discarded predictive information. However, even with this feature engineering, we still had a very large number of features. It may be necessary to roll up features for optimal learning, perhaps with the use of ontologies [45]. An additional limitation is that while many model types exist, this work selected one type of model and evaluated it exclusively. In continued work, it would be worthwhile to evaluate other types of models to compare their performance on these tasks, as well as their respective computational burdens.

Another limitation is that while this VA data comes from several medical centers with diverse geographic locations, our findings and methods may not generalize to other healthcare systems; however, we believe that the same approach of model training and divergence measurement will hold. In this study, patients placed on hospice care were excluded, thereby excluding an important population for whom errors may have occurred. We believe that this is likely unavoidable without precise documentation of when hospice begins to be seriously considered. Training on data of patients on hospice without a diagnostic workup may result in models learning that this pattern is not as unusual as it should be outside of the hospice setting.

Perhaps most importantly, we were limited by the lack of a gold standard, which hampered us both in obtaining labeled data to train models on and in assembling a large data set to validate against. The first problem prompted our use of anomaly detection methods in the first place and will require further research. The second underscores a lingering lack of conceptual clarity around a multifaceted concept that also merits further work.

While the performance of these models may be acceptable for initial feasibility, future studies should pursue other data sources and methods to improve model performance, as both PPV and sensitivity are relatively low for some classes. We also aim to develop methods to provide better interpretation of flagged cases. Several studies with clinicians have shown that merely showing a score is not enough: clinicians benefit from more rationale for why a model made its prediction and what action might be taken with the information [46–48].

We imagine that this approach could eventually be used to support quality-oriented chart reviews by retrieving records enriched for diagnostic error. To be routinely implemented, however, the positive predictive value will need to be improved. Before use in other healthcare systems, generalizability will also need to be established in those environments. Evaluation methods also need to be improved to more adequately refine the types of misdiagnoses that we hope to focus on. We expect that, beyond continued improvements in the adaptation of anomaly detection algorithms to detect misdiagnosis, normal data (through improved data quality, reduced miscoding, and better documentation) will cluster together more clearly, thereby making misdiagnosis easier to detect as an anomaly. In the future, we anticipate that this approach will be accurate enough in detecting misdiagnosis that it could be incorporated into quality and safety measures.

Conclusion

Our proposed method for detecting diagnostic deviation yields candidate cases enriched for diagnostic error. It also finds miscodes and difficult cases. Further refinement could yield a tool for flagging charts for review. Comparisons between human review and our approach indicate preliminary feasibility. Increases in accuracy will likely require natural language processing and methods to leverage information on time and concept relatedness. Further work is needed to develop reliable instruments for rapidly evaluating diagnostic error. Continued development is necessary to allow reviewers and users to explore more detailed information and to be convinced that the measurements are valid before such a metric can be implemented in clinical practice.

Acknowledgments

This material is the result of work supported with resources and the use of facilities at the George E. Wahlen Department of Veterans Affairs Medical Center, Salt Lake City, Utah. Data was obtained and accessed through the VA Informatics and Computing Infrastructure (VINCI). The authors thank Battelle and the University of Utah. The authors also thank the members of the Technical Expert Panel (TEP) for their feedback and participation. The views expressed in this paper are those of the authors and do not necessarily represent the position or policy of the US Department of Veterans Affairs or the United States Government.

References

  1. Newman-Toker DE, Schaffer AC, Yu-Moe CW, Nassery N, Tehrani ASS, Clemens GD, et al. Serious misdiagnosis-related harms in malpractice claims: the “Big Three”–vascular events, infections, and cancers. Diagnosis. 2019;6: 227–240. pmid:31535832
  2. Singh H, Sittig DF. Advancing the science of measurement of diagnostic errors in healthcare: the Safer Dx framework. BMJ Qual Saf. 2015;24: 103–110. pmid:25589094
  3. Singh H, Khanna A, Spitzmueller C, Meyer AND. Recommendations for using the Revised Safer Dx Instrument to help measure and improve diagnostic safety. Diagnosis. 2019;6: 315–323. pmid:31287795
  4. Liberman AL, Newman-Toker DE. Symptom-Disease Pair Analysis of Diagnostic Error (SPADE): a conceptual framework and methodological approach for unearthing misdiagnosis-related harms using big data. BMJ Qual Saf. 2018;27: 557–566. pmid:29358313
  5. Campbell SM, Bell BG, Marsden K, Spencer R, Kadam U, Perryman K, et al. A patient safety toolkit for family practices. J Patient Saf. 2020;16: e182. pmid:29461334
  6. Calvert JS, Price DA, Chettipally UK, Barton CW, Feldman MD, Hoffman JL, et al. A computational approach to early sepsis detection. Comput Biol Med. 2016;74: 69–73. pmid:27208704
  7. Nemati S, Holder A, Razmi F, Stanley MD, Clifford GD, Buchman TG. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med. 2018;46: 547. pmid:29286945
  8. Mao Q, Jay M, Hoffman JL, Calvert J, Barton C, Shimabukuro D, et al. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open. 2018;8: e017833. pmid:29374661
  9. Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7. pmid:26246167
  10. Cooper GF, Aliferis CF, Ambrosino R, Aronis J, Buchanan BG, Caruana R, et al. An evaluation of machine-learning methods for predicting pneumonia mortality. Artif Intell Med. 1997;9: 107–138. pmid:9040894
  11. Luo Y, Tang Z, Hu X, Lu S, Miao B, Hong S, et al. Machine learning for the prediction of severe pneumonia during posttransplant hospitalization in recipients of a deceased-donor kidney transplant. Ann Transl Med. 2020;8.
  12. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015. pp. 1721–1730.
  13. Chen M-J, Yang P-H, Hsieh M-T, Yeh C-H, Huang C-H, Yang C-M, et al. Machine learning to relate PM2.5 and PM10 concentrations to outpatient visits for upper respiratory tract infections in Taiwan: A nationwide analysis. World J Clin Cases. 2018;6: 200.
  14. Taylor RA, Moore CL, Cheung K-H, Brandt C. Predicting urinary tract infections in the emergency department with machine learning. PLoS One. 2018;13: e0194085. pmid:29513742
  15. Møller JK, Sørensen M, Hardahl C. Prediction of risk of acquiring urinary tract infection during hospital stay based on machine-learning: A retrospective cohort study. PLoS One. 2021;16: e0248636. pmid:33788888
  16. O’Brien WJ, Ramos RD, Gupta K, Itani KMF. Neural network model to detect long-term skin and soft tissue infection after hernia repair. Surg Infect (Larchmt). 2021;22: 668–674. pmid:33253060
  17. Shouval R, Hadanny A, Shlomo N, Iakobishvili Z, Unger R, Zahger D, et al. Machine learning for prediction of 30-day mortality after ST elevation myocardial infarction: an Acute Coronary Syndrome Israeli Survey data mining study. Int J Cardiol. 2017;246: 7–13. pmid:28867023
  18. Blom MC, Ashfaq A, Sant’Anna A, Anderson PD, Lingman M. Training machine learning models to predict 30-day mortality in patients discharged from the emergency department: a retrospective, population-based registry study. BMJ Open. 2019;9: e028015. pmid:31401594
  19. Heyman ET, Ashfaq A, Khoshnood A, Ohlsson M, Ekelund U, Holmqvist LD, et al. Improving Machine Learning 30-Day Mortality Prediction by Discounting Surprising Deaths. J Emerg Med. 2021;61: 763–773. pmid:34716042
  20. Veterans Health Administration. 30 Nov 2023 [cited 29 Nov 2023]. Available: https://www.va.gov/health/.
  21. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT Prof. 2005;7: 17–23.
  22. Forrey AW, McDonald CJ, DeMoor G, Huff SM, Leavelle D, Leland D, et al. Logical Observation Identifier Names and Codes (LOINC) database: a public use set of codes and names for electronic reporting of clinical laboratory test results. Clin Chem. 1996;42: 81–90. pmid:8565239
  23. Thorwarth WT Jr. CPT: an open system that describes all that you do. Journal of the American College of Radiology. 2008;5: 555–560. pmid:18359442
  24. Quiñonero-Candela J, Sugiyama M, Lawrence ND, Schwaighofer A. Dataset Shift in Machine Learning. MIT Press; 2009.
  25. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998; 8–27. pmid:9431328
  26. Stevenson KB, Khan Y, Dickman J, Gillenwater T, Kulich P, Myers C, et al. Administrative coding data, compared with CDC/NHSN criteria, are poor indicators of health care–associated infections. Am J Infect Control. 2008;36: 155–164. pmid:18371510
  27. Fleischmann-Struzek C, Thomas-Rüddel DO, Schettler A, Schwarzkopf D, Stacke A, Seymour CW, et al. Comparing the validity of different ICD coding abstraction strategies for sepsis case identification in German claims data. PLoS One. 2018;13: e0198847. pmid:30059504
  28. Bouza C, Lopez-Cuadrado T, Amate-Blanco JM. Use of explicit ICD9-CM codes to identify adult severe sepsis: impacts on epidemiological estimates. Crit Care. 2016;20: 313. pmid:27716355
  29. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA. 2016;315: 801–810. pmid:26903338
  30. Levine PJ, Elman MR, Kullar R, Townes JM, Bearden DT, Vilches-Tran R, et al. Use of electronic health record data to identify skin and soft tissue infections in primary care settings: a validation study. BMC Infect Dis. 2013;13: 171. pmid:23574801
  31. Walsh TL, Chan L, Konopka CI, Burkitt MJ, Moffa MA, Bremmer DN, et al. Appropriateness of antibiotic management of uncomplicated skin and soft tissue infections in hospitalized adult patients. BMC Infect Dis. 2016;16: 721. pmid:27899072
  32. Suaya JA, Eisenberg DF, Fang C, Miller LG. Skin and soft tissue infections and associated complications among commercially insured patients aged 0–64 years with and without diabetes in the US. PLoS One. 2013;8: e60057.
  33. Daniels KR, Lee GC, Frei CR. Trends in catheter-associated urinary tract infections among a national cohort of hospitalized adults, 2001–2010. Am J Infect Control. 2014;42: 17–22. pmid:24268457
  34. Carbo JF, Ruh CA, Kurtzhalts KE, Ott MC, Sellick JA, Mergenhagen KA. Male veterans with complicated urinary tract infections: Influence of a patient-centered antimicrobial stewardship program. Am J Infect Control. 2016;44: 1549–1553. pmid:27388268
  35. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12: 2825–2830.
  36. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–794.
  37. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13: 281–305. Available: https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
  38. Hastie T, Friedman J, Tibshirani R. The Elements of Statistical Learning. 2001 [cited 29 Nov 2023].
  39. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: e0118432. pmid:25738806
  40. Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968;70: 213.
  41. Zar JH. Spearman Rank Correlation. Encyclopedia of Biostatistics. 2005.
  42. Linder JA, Kaleba EO, Kmetik KS. Using electronic health records to measure physician performance for acute conditions in primary care: empirical evaluation of the community-acquired pneumonia clinical quality measure set. Med Care. 2009;47: 208–216. pmid:19169122
  43. Aronsky D, Haug PJ, Lagor C, Dean NC. Accuracy of administrative data for identifying patients with pneumonia. Am J Med Qual. 2005;20: 319–328. pmid:16280395
  44. van de Garde EMW, Oosterheert JJ, Bonten M, Kaplan RC, Leufkens HGM. International classification of diseases codes showed modest sensitivity for detecting community-acquired pneumonia. J Clin Epidemiol. 2007;60: 834–838. pmid:17606180
  45. Choi E, Xiao C, Stewart WF, Sun J. MiME: Multilevel medical embedding of electronic health records for predictive healthcare. Advances in Neural Information Processing Systems. 2018 [cited 20 Dec 2023]. Available: https://proceedings.neurips.cc/paper/2018/hash/934b535800b1cba8f96a5d72f72f1611-Abstract.html
  46. Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. Proceedings of Machine Learning Research. 2019;106 [cited 26 Oct 2023]. Available: https://proceedings.mlr.press/v106/tonekaboni19a.html
  47. Henry KE, Kornfield R, Sridharan A, et al. Human–machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. npj Digital Medicine. 2022 [cited 26 Oct 2023]. Available: https://www.nature.com/articles/s41746-022-00597-7
  48. Sandhu S, Lin AL, Brajer N, Sperling J, Ratliff W, Bedoya AD, Balu S, O’Brien C, Sendak MP. Integrating a machine learning system into clinical workflows: qualitative study. Journal of Medical Internet Research. 2020;22: e22421 [cited 26 Oct 2023]. Available: https://www.jmir.org/2020/11/e22421/