Abstract
Diagnostic error, a cause of substantial morbidity and mortality, is largely discovered and evaluated through self-report and manual review, which is costly and not suited to real-time intervention. Opportunities exist to leverage electronic health record data for automated detection of potential misdiagnosis, executed at scale and generalized across diseases. We propose a novel automated approach to identifying diagnostic divergence that considers both diagnosis and risk of mortality. Our objective was to identify cases of emergency department infectious disease misdiagnoses by measuring the deviation between predicted diagnosis and documented diagnosis, weighted by mortality. Two machine learning models were trained to predict infectious disease and mortality using the first 24 hours of data. Charts were manually reviewed by clinicians to determine whether there could have been a more correct or timely diagnosis. The proposed approach was validated against manual reviews and compared using the Spearman rank correlation. We analyzed 6.5 million ED visits and over 700 million associated clinical features from over one hundred emergency departments. The testing set performances of the infectious disease model (Macro F1 = 86.7, AUROC 90.6 to 94.7) and the mortality model (Macro F1 = 97.6, AUROC 89.1 to 89.1) were in expected ranges. Human reviews and the proposed automated metric demonstrated positive correlations ranging from 0.231 to 0.358. The proposed approach for diagnostic deviation shows promise as a potential tool for clinicians to find diagnostic errors. Given the vast number of clinical features used in this analysis, further improvements will likely need either to take greater account of data structure (what occurs before when) or to involve natural language processing. Further work is needed to explain the potential reasons for divergence and to refine and validate the approach for implementation in real-world settings.
Author summary
Identifying diagnostic error is challenging since it is often found only through manual review, which is so time consuming that not all patient data can be reviewed, let alone reviewed quickly enough to potentially prevent harm. In this work we address this gap by proposing machine learning methods that leverage millions of patient encounters. Since such methods could potentially be automated, they could scale to identify situations when the diagnosis may not be correct or timely, or when there may be a risk of death to the patient. This approach was validated by clinicians and shows promise for continued development. Future work will be needed to translate this work to protect patients and support clinicians.
Citation: Peterson KS, Chapman AB, Widanagamaachchi W, Sutton J, Ochoa B, Jones BE, et al. (2024) Automating detection of diagnostic error of infectious diseases using machine learning. PLOS Digit Health 3(6): e0000528. https://doi.org/10.1371/journal.pdig.0000528
Editor: Hualou Liang, Drexel University, UNITED STATES
Received: January 30, 2024; Accepted: May 7, 2024; Published: June 7, 2024
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: Per IRB requirements and VA regulations, patient-level data from this study cannot be shared directly. Source data can be accessed by VA-credentialed investigators with an approved IRB protocol and proper VA research authorization. Inquiries about this process for data access can be addressed to VINCI@VA.GOV.
Funding: This work was supported by Gordon and Betty Moore Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Diagnostic errors are harmful to patients, traumatic for providers, and costly for healthcare systems. A recent study showed that infectious diseases are one of three major disease categories causing the majority of misdiagnosis-related harms [1]; it estimated 40,000 to 80,000 misdiagnosis-related deaths in US hospitals. Diagnostic error evaluation can be conducted with instruments such as Safer Dx [2,3] to identify areas of improvement; however, these instruments require manual case review by clinicians with expertise and are most efficient when applied to known or probable cases of error. Thus, while such instruments are useful, they are not scalable to large populations or ultimately amenable to providing rapid feedback at the point of care. Since current gaps cannot be addressed with audits and voluntary reporting systems alone, additional capabilities are needed.
There are different approaches to measuring diagnostic accuracy. SPADE uses large amounts of administrative and billing data to quantify diagnostic errors where a change in diagnosis or treatment indicates an outcome that may have been preventable had the diagnosis or treatment occurred more promptly [4]. This statistical approach quantifies diagnostic errors on average but is difficult to interpret at the individual level, lacking traceability. Others have long used expert rules, which are challenging to maintain and to scale to additional diseases or contexts [5].
One means of scaling up capabilities is through automation, including machine learning (ML), where models may be applied at large volume. Machine learning models have already been used to predict infectious disease; examples include sepsis [6–9], pneumonia [10–12], upper respiratory infections (URI) [13], urinary tract infections (UTI) [14,15], and skin and soft tissue infections (SSTI) [16]. The difference between documented and predicted diagnosis can represent a kind of diagnostic divergence. Because diagnostic specificity is likely lower when the stakes are low, we also need to classify acuity. Several prior studies predict adverse outcomes such as post-procedural 30-day mortality or adverse events [17–19].
We developed automated models using electronic health records to construct a flexible approach to detect diagnostic divergence which considers infectious disease diagnosis as well as risk of potential mortality. This approach was then validated by expert reviewers.
Materials and methods
Ethics statement
This project was reviewed by the VA Salt Lake City Research & Development Committee and the Institutional Review Board at the University of Utah, and a waiver of consent and authorization was granted (127273).
Data
This study was performed using data from the Veterans Health Administration (VHA) Health Care System which cares for more than 9 million living Veterans at over one hundred emergency departments [20]. The study population included all emergency department (ED) visits to a VA medical center from January 1, 2017, to December 31, 2019. Data were extracted from the Corporate Data Warehouse (CDW), VHA’s repository for electronic clinical and administrative records.
Short visits with little or no clinical detail (e.g., patients visiting the ED for a medication refill) were excluded from further analysis if they did not have a minimum of 5 features in the first 24 hours of the ED visit. This minimum feature threshold was set to avoid uninformative visits. Additionally, ED visits were excluded from these sets if the patient was on hospice care or was placed on hospice care within 72 hours. While both exclusions were made in consultation with a technical expert panel, the latter creates the possibility of missing diagnostic errors that lead to hospice and death. However, including these data in the training set would lead the models to learn hospice practices as “normal,” and thus the decision was made to exclude them.
Feature extraction
Features for each visit were extracted from clinical data starting at the time of the visit through 24 hours following the emergency department encounter, including inpatient admission for patients who were subsequently hospitalized.
Features for the present ED visit included orders, medications, laboratory results, radiology imaging results, and vital signs. Medical orders were normalized where possible to values in standard vocabularies: Logical Observation Identifiers Names and Codes (LOINC) for laboratory tests, RxNorm for medications, and Current Procedural Terminology (CPT) for procedures [21–23]. Medication features were normalized to the RxNorm ingredient level. Laboratory results standardized to LOINC were assigned to categories High, Low, or Normal based upon available structured data flags, categorical findings, or numerical results with respect to reference ranges (e.g., a serum creatinine of 2.5 mg/dL was simply coded as a high serum creatinine). Radiology imaging results were categorized as either Normal or Abnormal based on an available structured data flag. Vital signs were only included as features if noted as abnormally High or Low.
To provide context on comorbidities existing prior to the visit with an equal lookback period, all diagnosis codes from the 365 days prior to the ED visit were extracted from both inpatient and outpatient settings.
All features were treated as binary, set to true only if the specific event was documented for the patient during the encounter. As a concrete example, a feature named “order_comprehensive_metabolic_panel” was present only if this order was placed. The corresponding laboratory result was represented in a separate feature using suffixes added to feature names. For example, “lab_hemoglobin_n” represented that a laboratory test for hemoglobin had a normal result; “lab_erythrocytes_l” and “lab_leukocytes_h” were interpreted similarly, except that “l” and “h” represented Low and High results, respectively. Radiology imaging result features such as “radiology_x_ray_exam_of_knee_a” and “radiology_x_ray_chest_left_view_n” represented Abnormal and Normal results, respectively. Vital sign features such as “vitals_blood_pressure_h” and “vitals_temperature_l” represented High and Low findings, respectively. An order for a medication was represented as “order_gabapentin” to reflect that the order was made, with a separate feature “medication_gabapentin” if the medication was administered. Finally, historical diagnosis codes before the ED visit, such as “Chronic obstructive pulmonary disease, unspecified,” were represented as “icd10_dx_J44.9”.
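As a minimal illustration of this binary encoding, the sketch below shows how such feature names might be constructed from individual clinical events; the function and field names are hypothetical and do not reflect the study’s actual extraction pipeline.

```python
# Illustrative sketch of the binary feature encoding described above.
# The function and field names ("kind", "name", "flag") are hypothetical;
# the study's actual extraction pipeline is not reproduced here.

def encode_event(kind: str, name: str, flag: str | None = None) -> str:
    """Build a feature name such as 'lab_hemoglobin_n' or 'order_gabapentin'."""
    feature = f"{kind}_{name.lower().replace(' ', '_')}"
    if flag:  # 'h'/'l'/'n' for labs and vitals, 'a'/'n' for radiology
        feature += f"_{flag.lower()}"
    return feature

def visit_to_features(events: list[dict]) -> dict[str, bool]:
    """Collapse a visit's events into a sparse set of binary (True-only) features."""
    return {encode_event(e["kind"], e["name"], e.get("flag")): True for e in events}

# Example using the features mentioned in the text:
features = visit_to_features([
    {"kind": "order", "name": "comprehensive metabolic panel"},
    {"kind": "lab", "name": "hemoglobin", "flag": "N"},
    {"kind": "vitals", "name": "blood pressure", "flag": "H"},
])
# {'order_comprehensive_metabolic_panel': True, 'lab_hemoglobin_n': True,
#  'vitals_blood_pressure_h': True}
```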
Data splits
Included ED visits were split into 3 datasets: training, validation, and testing. The testing set was defined to ensure that a set of ED visits was held out from all training and validation activities [24]. Because documentation may vary over time within a medical center or across medical centers, the testing set was defined to include ED visits from all facilities from July 1, 2019, to December 31, 2019. Additionally, five VA medical centers were selected at random for this set, including all of their data for the entire time period. All remaining ED visits were assigned to the training and validation sets, where 90% of visits were randomly assigned to the training set and the remainder to the validation set.
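A minimal sketch of this split logic is shown below, assuming a pandas DataFrame of visits with hypothetical “facility_id” and “visit_start” columns; the seed and column names are illustrative only.

```python
# Sketch of the data split described above: a temporal holdout plus five
# randomly selected facilities form the testing set, and the remainder is
# split 90/10 into training and validation. Column names are assumptions.
import numpy as np
import pandas as pd

def split_visits(visits: pd.DataFrame, n_holdout_sites: int = 5, seed: int = 0):
    rng = np.random.default_rng(seed)
    holdout_sites = rng.choice(visits["facility_id"].unique(),
                               size=n_holdout_sites, replace=False)
    # assumes visit_start is a datetime or ISO-formatted date column
    in_test = (visits["visit_start"] >= "2019-07-01") | \
              visits["facility_id"].isin(holdout_sites)
    test = visits[in_test]
    remainder = visits[~in_test]
    train_mask = rng.random(len(remainder)) < 0.90  # 90% training, 10% validation
    return remainder[train_mask], remainder[~train_mask], test
```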
Classification models
Two separate classification models were trained to predict outcomes in the proposed diagnosis metric. These are referred to here as the mortality and infectious disease models.
The mortality model was trained as a binary classifier where the label was defined by whether the patient died in the 30 days following the visit regardless of cause. Predictors included Elixhauser score using diagnosis codes from the 365 days prior to the visit [25] as well as other features mentioned above.
The infectious disease model was a multiclass classifier which predicted one of the following: pneumonia, sepsis, SSTI, UTI, URI, or no infection. These diagnosis outcome labels were assigned using sets of ICD-10 codes from previously published studies on pneumonia [26], sepsis [27–29], SSTI [30–32], and UTI [33,34]. Labels for URI were assigned by manual curation of 45 ICD-10 codes for bronchitis (e.g., J20.9), pharyngitis (e.g., J03.91), sinusitis (e.g., J01.21), or upper respiratory infection generally (e.g., J06.9). The definition for UTI also included microbiology results to expand the evidence used in assigning a label. A diagnosis was present for a visit if any one of these diagnosis codes (or microbiology data for UTI) was identified in the period from 24 hours before to 48 hours after the ED visit.
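For illustration, a sketch of this label assignment is given below. The code sets are truncated placeholders rather than the published sets, and because the text does not specify how overlapping classes were resolved, the fixed checking order here is an assumption.

```python
# Sketch of the diagnosis label assignment described above. Code sets are
# truncated placeholders for the published ICD-10 sets [26-34]; UTI labels
# additionally used microbiology results, which are not shown here.
CODE_SETS = {
    "pneumonia": {"J18.9"},                                # placeholder
    "sepsis":    {"A41.9"},                                # placeholder
    "ssti":      {"L03.90"},                               # placeholder
    "uti":       {"N39.0"},                                # placeholder
    "uri":       {"J20.9", "J03.91", "J01.21", "J06.9"},   # examples from the text
}

def assign_label(visit_codes: set[str]) -> str:
    """Return an infectious disease label, or 'no_infection' if none match.

    visit_codes: ICD-10 codes recorded from 24 hours before to 48 hours after
    the ED visit. The fixed checking order below is an assumption.
    """
    for label, codes in CODE_SETS.items():
        if visit_codes & codes:
            return label
    return "no_infection"
```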
Both mortality and infectious disease models were trained using gradient boosted trees with the scikit-learn [35] package in Python. While many types of models could have been trained and evaluated in this work, given available computing resources one model type was selected for evaluation: XGBoost [36], an implementation of gradient boosted trees, chosen for its familiarity and its proficiency in managing overfitting to promote generalization. Additionally, this implementation has been shown to scale to millions or billions of data instances while requiring fewer computing resources than some model types [36]. This implementation includes several hyperparameters which enforce model regularization to prevent models from becoming overly complex and fit to training data alone.
Hyperparameters for both models were tuned using randomized search and cross validation [37,38]. This was performed on the training set only, using 3-fold cross validation with the scikit-learn implementation, and the model with the best hyperparameters was then measured against the validation set. No metrics were gathered for the testing set until hyperparameters were finalized for both models. Some of the hyperparameters tuned included the learning rate, the maximum number of trees, and the maximum depth permitted per tree. No explicit feature selection or feature weighting was performed prior to model training, as each tree constructed during the training iterations could incorporate different features.
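A sketch of this tuning setup, shown for the binary mortality model, might look as follows; the search ranges and iteration count are illustrative rather than the study’s actual values.

```python
# Sketch of randomized hyperparameter search with 3-fold cross validation over
# an XGBoost classifier, as described above. Search ranges and n_iter are
# illustrative; the study's exact settings are not reported here.
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "learning_rate": uniform(0.01, 0.3),  # hyperparameters named in the text
    "n_estimators": randint(100, 1000),   # maximum number of trees
    "max_depth": randint(3, 10),          # maximum depth permitted per tree
}

search = RandomizedSearchCV(
    XGBClassifier(objective="binary:logistic", n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=50,
    cv=3,               # 3-fold cross validation on the training set only
    scoring="roc_auc",
    random_state=0,
)
# search.fit(X_train, y_train)          # fit on the training set
# best_model = search.best_estimator_   # then measure against the validation set
```

The multiclass infectious disease model would follow the same pattern with a multiclass objective (e.g., multi:softprob) and an appropriate scoring function.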
Negative classes were undersampled so that as many visits as possible could be leveraged while minimizing computational resource requirements. We used positive predictive value (PPV), sensitivity, and area under the receiver operating characteristic curve (AUROC) to assess model performance. The area under the precision-recall curve (PRAUC) was also used as a metric, since it has been shown to be informative in assessing performance on imbalanced classes [39].
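These metrics can be computed with scikit-learn as sketched below; the input arrays are assumed to hold true labels, predicted labels, and predicted probabilities for the positive class.

```python
# Sketch of the evaluation metrics named above for a binary classifier.
from sklearn.metrics import (
    average_precision_score,  # PRAUC summary of the precision-recall curve
    precision_score,          # positive predictive value (PPV)
    recall_score,             # sensitivity
    roc_auc_score,            # AUROC
)

def binary_metrics(y_true, y_pred, y_score) -> dict:
    """y_true/y_pred are 0/1 labels; y_score is the predicted probability."""
    return {
        "ppv": precision_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
        "prauc": average_precision_score(y_true, y_score),
    }
```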
Diagnosis deviation
Using the predictions from the trained models as well as the diagnoses assigned and the estimated pre-visit mortality, a formula was developed to quantify potential deviation in diagnosis. This formula incorporates the predictions from the infectious disease classification model and is weighted by predictions of increased mortality from the mortality model. The mathematical derivation of this diagnostic deviation (DD) is illustrated in the equation presented here.
More specifically, for any given infectious disease class (e.g., pneumonia), d represents the observed diagnosis, a 0/1 indicator of whether an ICD-10 code for that diagnosis was assigned, and d̂ represents the probability of that disease given predictions from the infectious disease classifier. Meanwhile, m reflects the pre-visit mortality probability from the Elixhauser score leading up to the visit when available, and m̂ represents the probability from the mortality model predicting 30-day mortality. Note that only increases in mortality probability are considered in this calculation, so that if the probability of mortality during the visit decreases compared to the pre-visit mortality, this term is 0.
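Since the published equation is not reproduced in this text, the sketch below shows one plausible reading of the description, assuming the deviation is the gap between predicted and observed diagnosis weighted by any increase in predicted mortality; it should be read as an illustration rather than the authors’ exact formula.

```python
# Hedged sketch of a diagnostic deviation (DD) computation consistent with the
# description above; the exact published functional form may differ.

def diagnostic_deviation(d: int, d_hat: float, m: float, m_hat: float) -> float:
    """
    d      observed diagnosis for the class (0/1 from ICD-10 codes)
    d_hat  predicted probability of that class from the infectious disease model
    m      pre-visit mortality probability (e.g., Elixhauser-based)
    m_hat  predicted 30-day mortality probability from the mortality model
    """
    mortality_increase = max(m_hat - m, 0.0)    # only increases count; otherwise 0
    return abs(d_hat - d) * mortality_increase  # diagnosis gap weighted by mortality
```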
Validation
To validate how well this proposed automated metric compares to clinician judgement, ED visits were manually reviewed. Reviewers assessed whether each of the five infectious disease classes or a non-infectious disease was initially documented. Next, reviewers assessed changes in diagnostic approach or the investigation of multiple diagnoses and, for the same disease classes, identified the final diagnosis for the problem that brought the patient to the ED.
Additionally, two questions were adapted from the revised Safer Dx framework to rate opportunities for improved diagnosis [3]. On a Likert scale of 1 (strongly disagree) to 7 (strongly agree), two questions were asked: “The final diagnosis was not an evolution of the care team’s initial presumed diagnosis” and “The patient was at risk of significant harm, or experienced harm, that could have generally been prevented by a correct and timely diagnosis”.
Case reviews were performed by 3 clinical experts in infectious disease. After an initial exploratory round of chart reviews, the scope of the reviews was narrowed to pneumonia given the time available to reviewers. Cases assigned to each reviewer were a stratified sample in which half were selected as the highest values of our metric with respect to pneumonia and the other half were randomly sampled. Cases were excluded from sampling if the difference between expected mortality (m) and predicted mortality (m̂) reflected a decrease.
The three reviewers triple annotated a set of 20 ED visits to measure inter-rater reliability (IRR) between reviewers. After these visits, reviewers were asked to continue reviewing visits as they had available time such that a total of 130 unique ED visits were reviewed. After reviews were completed, we assessed the correlation between the diagnosis deviation and the Likert scores for each of the two questions using weighted Cohen’s kappa [40]. We also used Cohen’s kappa to evaluate IRR on the cases reviewed by multiple reviewers.
Comparisons between diagnosis deviation and reviewer scores on the two review questions were calculated using the Spearman rank correlation [41].
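A sketch of these statistics using scipy and scikit-learn is shown below, with toy values standing in for the study’s data.

```python
# Sketch of the validation statistics: Spearman rank correlation between the
# automated metric and reviewer Likert scores, and weighted Cohen's kappa for
# inter-rater reliability. The values below are toy data, not study results.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

dd_scores = [0.02, 0.31, 0.07, 0.55, 0.12]   # automated metric per reviewed case
likert = [2, 6, 3, 7, 4]                     # one reviewer's Likert ratings

rho, p_value = spearmanr(dd_scores, likert)  # rank correlation between metric and scores

# Weighted kappa on doubly reviewed cases; "linear" or "quadratic" weights are
# both common choices for ordinal scales.
reviewer_a = [2, 6, 3, 7, 4]
reviewer_b = [3, 5, 3, 6, 4]
kappa = cohen_kappa_score(reviewer_a, reviewer_b, weights="quadratic")
```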
Results
Data
A total of 6,536,315 ED visits were initially included from 104 distinct VA medical centers. These visits were across 2,141,271 unique patients where the mean age at the time of the ED visit was 60 years old and 88.1% were male.
Feature extraction
From these ED visits, over 738 million features were extracted. Over 656 million features were extracted with respect to the first 24 hours of data in the ED visit and over 82 million features were extracted from diagnosis codes prior to the visit. A summary of feature counts and distinct feature values is shown in Table 1. The median number of features in the first 24 hours of the visit was 22 and the median number of diagnosis codes prior to the visit was 8.
Data splits
A total of 4,261,730 visits were divided into sets. 1,211,698 visits were assigned to the testing set where 641,765 were from the temporal holdout and 569,933 were from one of the five random medical centers. The remaining visits were split among the training set and validation sets which resulted in sets of 2,733,493 and 316,539 visits, respectively.
Classification models
Following hyperparameter tuning and cross validation of both the mortality and infectious disease models using the training and validation sets, these models were finalized and applied to the held-out testing set.
Performance metrics on this testing set for the mortality model are shown in Table 2. The metrics on the same set for the infectious disease model are shown in Table 3. The receiver operating characteristic curves for these models on the testing set are shown in Fig 1 and Fig 2, respectively.
Diagnostic deviation
After models were trained and validated, the proposed Diagnostic Deviation Metric was applied to all ED visits in the testing set. The distribution of this metric for pneumonia is shown in Fig 3.
Validation
The distribution of values assigned during review is shown in Fig 4. Pairwise inter-rater reliability is shown in Table 4 where the Cohen’s Kappa ranges from 0.242 to 0.528 across both questions posed.
Since our proposed measure combines predictions of both mortality and diagnosis, human review scores were compared not only to the composite metric but also to each of its components independently. These correlations for all reviewers are presented in Table 5. The composite measure shows a positive correlation between human reviewers and the automated approach, which may be interpreted as weak to moderate given Spearman correlations between 0.231 and 0.358. The composite metric also showed a higher correlation than the individual components of mortality or diagnosis deviation, although the difference was greater for the human score of diagnostic evolution than for potential patient harm. Correlations for each reviewer on responses to the two questions are provided in Table 6. For the question of “not a diagnostic evolution,” correlations ranged from 0.35 to 0.614, with Reviewer 1 showing the strongest correlation. Meanwhile, for potential patient harm, correlations ranged from 0.273 to 0.357, with the strongest correlation for Reviewer 2.
Reviewers were allowed to provide comments summarizing cases, including whether they agreed with the diagnoses assigned, noteworthy outcomes, and any other useful notes. For some cases whose review scores correlated more strongly with our proposed measure, the reviewers’ comments suggest that an automated review may have been useful. Some of these comments are presented in Table 7. They include cases where one treatment was provided but pneumonia was missed as the diagnosis, and others where there was an initial diagnosis but, by the time the diagnosis changed, there was an adverse outcome.
Because the correlation between the proposed measure and review scores was only modest, there were also cases where the score from the measure was high but the reviewer did not identify an opportunity for a more timely or correct diagnosis. These are shown in Table 8. For example, in some of these cases the reviewer agreed with the diagnoses assigned. In at least one case, the reviewer indicated that the patient was already in hospice and therefore should have been excluded from our analysis and review.
Discussion
Individual models for infectious diseases and mortality demonstrated reasonable diagnostic performance statistics, but positive predictive value and PRAUC were low given the low prevalence of infectious diseases. These models were trained with a very large number of features, representing a substantial fraction of the structured data readily available from an EHR.
Our primary concern was whether our combined measure of diagnostic divergence was related to diagnostic error. Correlations between human review of cases and the proposed measure showed a weak positive relationship. When examining the inter-rater reliability of our subjective diagnostic error measures and when performing an analysis of discordant cases, it became apparent that there was substantial disagreement in interpreting cases and in how to weigh the individual components of diagnostic error. This suggests a further need to explore robust instruments for diagnostic error assessment. It also underscores that diagnostic error is usually not clearly present or absent but is evaluated on a continuum. Analysis also showed that cases with high diagnostic divergence scores that were clearly not diagnostic errors generally fell into three categories: diagnostic miscoding, highly complex cases, and model failures, of which the first two were most common.
By searching on the very high end of the diagnostic divergence spectrum, we demonstrated the feasibility of enriching data sets with infectious disease-related diagnostic error cases using automated means. Such identified divergence could be used to highlight a diagnosis that should have been included or to offer information where the assigned diagnosis differs from how peers may have assigned a diagnosis. Further, leveraging methods for explainable machine learning could help to identify which specifics of diagnostic testing or treatment contributed to this divergence.
The performance of our individual models was similar to other reported ML models to date when comparing common metrics like AUROC, although differences in study criteria and settings make comparisons non-trivial [6–8,11,12,14–19]. However, given the amount of data and number of features, this is somewhat disappointing. We included millions of clinical events in modeling efforts in which thousands of model iterations were trained and evaluated across a range of hyperparameters. This suggests that the added features not usually present in other models did not provide much additional information over more traditional features when predicting the ultimate diagnosis. This may be due, in part, to sparse documentation of observations supporting the diagnosis and/or selective documentation of features supporting the diagnosis, which would also make it difficult to identify diagnostic error.

This work has limitations, particularly with respect to assumptions made with the data. First, since ICD-10-CM codes are used in training and validation of the infectious disease classification model, there are both false positives and false negatives in these assigned codes with respect to the actual working diagnosis [42–44]; for example, a provider may diagnose dyspnea (a non-specific diagnosis) but be thinking about and correctly treating pneumonia. Issues like these are not unique to infectious disease, but there may be opportunities to improve classification by intent via other data sources such as medications, results from laboratory or microbiology specimens, or text processing. The inter-rater reliability of our instruments for assessing diagnostic error was low, so it is difficult to interpret correlation with our diagnostic divergence score. Also, further work on ontologies could help to disambiguate when the documented code represents an error as opposed to simply a different stage along the natural evolution of the diagnostic workup.
We also categorized quantitative results (e.g., laboratory values) which could have destroyed predictive information. We did not use unstructured data which could have provided more nuance. Models did not consider temporal sequence which also could have lost predictive information. However, even with this feature engineering, we still had a very large number of features. It may be necessary to roll-up features for optimal learning, perhaps with the use of ontologies [45]. An additional limitation is that while many model types exist, this work selected one type of model and evaluated it exclusively. In continued work, it would be worthwhile to evaluate other types of models to compare their performance on these tasks, as well as their respective computational burdens.
Another limitation is that while this VA data comes from several medical centers with diverse geographic locations, our findings and methods may not generalize to other healthcare systems; however, we believe that the same approach of model training and divergence measurement will hold. In this study, patients placed on hospice care were excluded, thereby excluding an important population for whom errors may have occurred. We believe that this is likely unavoidable without precise documentation of when hospice begins to be seriously considered. Training on data of patients on hospice without a diagnostic workup may result in models learning that this pattern is not as unusual as it should be outside of the hospice setting.
Perhaps most importantly, we were limited by the lack of a gold standard, both for data with which to train models and for a large data set against which to validate. The first problem prompted our use of anomaly detection methods in the first place and will require further research. The second underscores a lingering lack of conceptual clarity around a multifaceted concept that also merits further work.
While performance of these models may be acceptable for initial feasibility, future studies should optimize for other data sources and methods to improve model performance as both PPV and sensitivity are relatively low on some classification classes. We also aim to develop methods to provide better interpretation of flagged cases. Several studies with clinicians have shown that merely showing a score is not enough—clinicians benefit from more rationale for why a model made its prediction and what action might be taken with the information [46–48].
We imagine that this approach could eventually be used to support quality-oriented chart reviews by retrieving records enriched for diagnostic error. To be routinely implemented, however, the positive predictive value will need to be improved. Before use in other healthcare systems, generalizability will also need to be established in those environments. Evaluation methods also need to be improved to more adequately refine the types of misdiagnoses that we hope to focus on. We expect that, beyond continued improvements in adapting anomaly detection algorithms to detect misdiagnosis, normal data (through improved data quality, reduced miscoding, and better documentation) will cluster together more clearly, thereby making misdiagnosis easier to detect as an anomaly. In the future, we anticipate that this approach will be accurate enough in detecting misdiagnosis that it could be incorporated into quality and safety measures.
Conclusion
Our proposed method for detecting diagnostic deviance yields candidate cases enriched for diagnostic error. It also finds miscodes and difficult cases. Further refinement could yield a tool for flagging charts for review. Comparisons between human review and our approach indicate preliminary feasibility. Increases in accuracy will likely require natural language processing and methods to leverage information on time and concept relatedness. Further work is needed to develop reliable instruments for rapidly evaluating diagnostic error. Continued development is necessary to allow reviewers and users to explore more detailed information and be convinced that the measurements are valid before a metric can be implemented in clinical practice.
Acknowledgments
This material is the result of work supported with resources and the use of facilities at the George E. Wahlen Department of Veterans Affairs Medical Center, Salt Lake City, Utah. Data was obtained and accessed through the VA Informatics and Computing Infrastructure (VINCI). The authors thank Battelle and the University of Utah. The authors also thank the members of the Technical Expert Panel (TEP) for their feedback and participation. The views expressed in this paper are those of the authors and do not necessarily represent the position or policy of the US Department of Veterans Affairs or the United States Government.
References
- 1. Newman-Toker DE, Schaffer AC, Yu-Moe CW, Nassery N, Tehrani ASS, Clemens GD, et al. Serious misdiagnosis-related harms in malpractice claims: the “Big Three”–vascular events, infections, and cancers. Diagnosis. 2019;6: 227–240. pmid:31535832
- 2. Singh H, Sittig DF. Advancing the science of measurement of diagnostic errors in healthcare: the Safer Dx framework. BMJ Qual Saf. 2015;24: 103–110. pmid:25589094
- 3. Singh H, Khanna A, Spitzmueller C, Meyer AND. Recommendations for using the Revised Safer Dx Instrument to help measure and improve diagnostic safety. Diagnosis. 2019;6: 315–323. pmid:31287795
- 4. Liberman AL, Newman-Toker DE. Symptom-Disease Pair Analysis of Diagnostic Error (SPADE): a conceptual framework and methodological approach for unearthing misdiagnosis-related harms using big data. BMJ Qual Saf. 2018;27: 557–566. pmid:29358313
- 5. Campbell SM, Bell BG, Marsden K, Spencer R, Kadam U, Perryman K, et al. A patient safety toolkit for family practices. J Patient Saf. 2020;16: e182. pmid:29461334
- 6. Calvert JS, Price DA, Chettipally UK, Barton CW, Feldman MD, Hoffman JL, et al. A computational approach to early sepsis detection. Comput Biol Med. 2016;74: 69–73. pmid:27208704
- 7. Nemati S, Holder A, Razmi F, Stanley MD, Clifford GD, Buchman TG. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit Care Med. 2018;46: 547. pmid:29286945
- 8. Mao Q, Jay M, Hoffman JL, Calvert J, Barton C, Shimabukuro D, et al. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open. 2018;8: e017833. pmid:29374661
- 9. Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med. 2015;7. pmid:26246167
- 10. Cooper GF, Aliferis CF, Ambrosino R, Aronis J, Buchanan BG, Caruana R, et al. An evaluation of machine-learning methods for predicting pneumonia mortality. Artif Intell Med. 1997;9: 107–138. pmid:9040894
- 11. Luo Y, Tang Z, Hu X, Lu S, Miao B, Hong S, et al. Machine learning for the prediction of severe pneumonia during posttransplant hospitalization in recipients of a deceased-donor kidney transplant. Ann Transl Med. 2020;8.
- 12. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015. pp. 1721–1730.
- 13. Chen M-J, Yang P-H, Hsieh M-T, Yeh C-H, Huang C-H, Yang C-M, et al. Machine learning to relate PM2.5 and PM10 concentrations to outpatient visits for upper respiratory tract infections in Taiwan: A nationwide analysis. World J Clin Cases. 2018;6: 200.
- 14. Taylor RA, Moore CL, Cheung K-H, Brandt C. Predicting urinary tract infections in the emergency department with machine learning. PLoS One. 2018;13: e0194085. pmid:29513742
- 15. Møller JK, Sørensen M, Hardahl C. Prediction of risk of acquiring urinary tract infection during hospital stay based on machine-learning: A retrospective cohort study. PLoS One. 2021;16: e0248636. pmid:33788888
- 16. O’Brien WJ, Ramos RD, Gupta K, Itani KMF. Neural network model to detect long-term skin and soft tissue infection after hernia repair. Surg Infect (Larchmt). 2021;22: 668–674. pmid:33253060
- 17. Shouval R, Hadanny A, Shlomo N, Iakobishvili Z, Unger R, Zahger D, et al. Machine learning for prediction of 30-day mortality after ST elevation myocardial infraction: an Acute Coronary Syndrome Israeli Survey data mining study. Int J Cardiol. 2017;246: 7–13. pmid:28867023
- 18. Blom MC, Ashfaq A, Sant’Anna A, Anderson PD, Lingman M. Training machine learning models to predict 30-day mortality in patients discharged from the emergency department: a retrospective, population-based registry study. BMJ Open. 2019;9: e028015. pmid:31401594
- 19. Heyman ET, Ashfaq A, Khoshnood A, Ohlsson M, Ekelund U, Holmqvist LD, et al. Improving Machine Learning 30-Day Mortality Prediction by Discounting Surprising Deaths. J Emerg Med. 2021;61: 763–773. pmid:34716042
- 20. Veterans Health Administration. 30 Nov 2023 [cited 29 Nov 2023]. Available: https://www.va.gov/health/.
- 21. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT Prof. 2005;7: 17–23.
- 22. Forrey AW, Mcdonald CJ, DeMoor G, Huff SM, Leavelle D, Leland D, et al. Logical observation identifier names and codes (LOINC) database: a public use set of codes and names for electronic reporting of clinical laboratory test results. Clin Chem. 1996;42: 81–90. pmid:8565239
- 23. Thorwarth WT Jr. CPT: an open system that describes all that you do. Journal of the American College of Radiology. 2008;5: 555–560. pmid:18359442
- 24. Quiñonero-Candela J, Sugiyama M, Lawrence ND, Schwaighofer A. Dataset shift in machine learning. MIT Press; 2009.
- 25. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative data. Med Care. 1998; 8–27. pmid:9431328
- 26. Stevenson KB, Khan Y, Dickman J, Gillenwater T, Kulich P, Myers C, et al. Administrative coding data, compared with CDC/NHSN criteria, are poor indicators of health care–associated infections. Am J Infect Control. 2008;36: 155–164. pmid:18371510
- 27. Fleischmann-Struzek C, Thomas-Rüddel DO, Schettler A, Schwarzkopf D, Stacke A, Seymour CW, et al. Comparing the validity of different ICD coding abstraction strategies for sepsis case identification in German claims data. PLoS One. 2018;13: e0198847. pmid:30059504
- 28. Bouza C, Lopez-Cuadrado T, Amate-Blanco JM. Use of explicit ICD9-CM codes to identify adult severe sepsis: impacts on epidemiological estimates. Crit Care. 2016;20: 313. pmid:27716355
- 29. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA. 2016;315: 801–810. pmid:26903338
- 30. Levine PJ, Elman MR, Kullar R, Townes JM, Bearden DT, Vilches-Tran R, et al. Use of electronic health record data to identify skin and soft tissue infections in primary care settings: a validation study. BMC Infect Dis. 2013;13: 171. pmid:23574801
- 31. Walsh TL, Chan L, Konopka CI, Burkitt MJ, Moffa MA, Bremmer DN, et al. Appropriateness of antibiotic management of uncomplicated skin and soft tissue infections in hospitalized adult patients. BMC Infect Dis. 2016;16: 721. pmid:27899072
- 32. Suaya JA, Eisenberg DF, Fang C, Miller LG. Skin and soft tissue infections and associated complications among commercially insured patients aged 0–64 years with and without diabetes in the US. PLoS One. 2013;8: e60057.
- 33. Daniels KR, Lee GC, Frei CR. Trends in catheter-associated urinary tract infections among a national cohort of hospitalized adults, 2001–2010. Am J Infect Control. 2014;42: 17–22. pmid:24268457
- 34. Carbo JF, Ruh CA, Kurtzhalts KE, Ott MC, Sellick JA, Mergenhagen KA. Male veterans with complicated urinary tract infections: Influence of a patient-centered antimicrobial stewardship program. Am J Infect Control. 2016;44: 1549–1553. pmid:27388268
- 35. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of machine learning research. 2011;12: 2825–2830.
- 36. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–794.
- 37. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13: 281–305. Available: https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf.
- 38. Hastie T, Friedman J, Tibshirani R. The Elements of Statistical Learning. 2001 [cited 29 Nov 2023].
- 39. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: e0118432. pmid:25738806
- 40. Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968;70: 213.
- 41. Zar JH. Spearman Rank Correlation. Encyclopedia of Biostatistics. 2005.
- 42. Linder JA, Kaleba EO, Kmetik KS. Using electronic health records to measure physician performance for acute conditions in primary care: empirical evaluation of the community-acquired pneumonia clinical quality measure set. Med Care. 2009;47: 208–216. pmid:19169122
- 43. Aronsky D, Haug PJ, Lagor C, Dean NC. Accuracy of administrative data for identifying patients with pneumonia. Am J Med Qual. 2005;20: 319–328. pmid:16280395
- 44. van de Garde EMW, Oosterheert JJ, Bonten M, Kaplan RC, Leufkens HGM. International classification of diseases codes showed modest sensitivity for detecting community-acquired pneumonia. J Clin Epidemiol. 2007;60: 834–838. pmid:17606180
- 45. Choi E, Xiao C, Stewart WF, Sun J. MiME: Multilevel medical embedding of electronic health records for predictive healthcare. Advances in Neural Information Processing Systems. 2018 [cited 20 Dec 2023]. Available: https://proceedings.neurips.cc/paper/2018/hash/934b535800b1cba8f96a5d72f72f1611-Abstract.html.
- 46. Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. Proceedings of Machine Learning Research. [cited 26 Oct 2023]. Available: https://proceedings.mlr.press/v106/tonekaboni19a.html.
- 47. Henry K, Kornfield R, Sridharan A, et al. Human–machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. npj Digital Medicine. 2022 [cited 26 Oct 2023]. Available: https://www.nature.com/articles/s41746-022-00597-7.
- 48. Sandhu S, Lin AL, Brajer N, Sperling J, Ratliff W, Bedoya AD, Balu S, O’Brien C, Sendak MP. Integrating a machine learning system into clinical workflows: qualitative study. Journal of Medical Internet Research. 2020 [cited 26 Oct 2023]. Available: https://www.jmir.org/2020/11/e22421/.