Triage is important in identifying high-risk patients amongst many less urgent patients as emergency department (ED) overcrowding has become a national crisis recently. This study aims to validate that a Deep-learning-based Triage and Acuity Score (DTAS) identifies high-risk patients more accurately than existing triage and acuity scores using a large national dataset.
We conducted a retrospective observational cohort study using data from the Korean National Emergency Department Information System (NEDIS), which collected data on visits in real time from 151 EDs. The NEDIS data was split into derivation data (January 2014-June 2016) and validation data (July-December 2016). We also used data from the Sejong General Hospital (SGH) for external validation (January-December 2017). We predicted in-hospital mortality, critical care, and hospitalization using initial information of ED patients (age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, trauma, initial vital signs and mental status as predictor variables).
A total of 11,656,559 patients were included in this study. The primary outcome was in-hospital mortality. The Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision and Recall Curve (AUPRC) of DTAS were 0.935 and 0.264. It significantly outperformed Korean triage and acuity score (AUROC:0.785, AUPRC:0.192), modified early warning score (AUROC:0.810, AUPRC:0.116), logistic regression (AUROC:0.903, AUPRC:0.209), and random forest (AUROC:0.910, AUPRC:0.179).
Citation: Kwon J-m, Lee Y, Lee Y, Lee S, Park H, Park J (2018) Validation of deep-learning-based triage and acuity score using a large national dataset. PLoS ONE 13(10): e0205836. https://doi.org/10.1371/journal.pone.0205836
Editor: Nan Liu, Duke-NUS Medical School, SINGAPORE
Received: March 19, 2018; Accepted: October 2, 2018; Published: October 15, 2018
Copyright: © 2018 Kwon et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying this study belong to the National Emergency Medical Center (NEMC) of Korea. NEMC provides de-identified National Emergency Department Information System (NEDIS) data to researchers for nonprofit academic research. Any researchers who propose a study subject and plans with a standardized proposal form and are approved by the NEMC review committee on research support can access the raw data. Details of this process and a provision guide are now available at the NEMC website (https://dw.nemc.or.kr) or contact point of NEMC review committee (email@example.com). The authors accessed the data used in this study in the same manner that they expect future researchers to do so and did not receive special privileges from the NEMC of Korea.
Funding: VUNO provided support in the form of salaries for authors (Youngnam Lee, Yeha Lee, Seungwoo Lee, and Hyunho Park), but did not have any additional role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript.
Competing interests: VUNO provided support in the form of salaries for authors (Youngnam Lee, Yeha Lee, Seungwoo Lee, and Hyunho Park), but did not have any additional role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript. Dr. Yeha Lee is the co-founder and stakeholder in VUNO Inc., a medical artificial intelligence company. Mr. Youngnam Lee and Hyunho Park are employees of VUNO Inc. Mr. Seungwoo Lee was an employee of VUNO Inc. There are no patents, products in development or marketed products to declare. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Overcrowding in an emergency department (ED) has been identified as a healthcare crisis in many nations.[1,2] Triage is important in identifying vulnerable and high-risk patients among a large number of less urgent patients as ED overcrowding and delay in care are associated with increased mortality in many conditions. The rapid assessment of the patient’s risk and urgency is necessary to identify high-risk patients and determine treatment priority on arrival at the ED.
The Canadian Triage and Acuity Scale (CTAS) was developed in 1999 after studying the successful National Triage Scale (NTS) from Australia. The Korean Triage and Acuity System (KTAS) was developed in 2012 based on CTAS and has been used nationwide for triage in Korea since 2016. Although these Triage and Acuity Scores (TASs) help identify patients at high risk of death, they have two limitations. First, they rely on the provider’s subjective judgement of critical care needs and pain intensity.[6,7] Because the decision can differ between providers, outcomes predicted by these TASs have high variation and low reliability. Second, they can be a bottleneck in ED patient flow because subjective information is often ambiguous and cannot be judged instantly. In addition, because subjective information is interpreted through clinical expertise, the judgement can take longer for less experienced providers. This delay is a risk to patient safety.
The Modified Early Warning Score (MEWS) is a widely used tool that overcomes these two limitations by using physiological parameters (systolic blood pressure, pulse rate, respiratory rate, temperature, and level of consciousness [Alert, Voice, Pain, Unresponsive]).[10–14] However, it is limited in capturing the relationships between parameters. MEWS is the sum of the scores for each parameter, and the score for each parameter is calculated independently. For example, systolic blood pressure is not considered when calculating the score for temperature, even though temperature is interpreted differently depending on systolic blood pressure.
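The additive structure described above can be sketched in a few lines. This is an illustration only: the cut-offs follow the commonly cited MEWS table and are an assumption, since the text does not list them, and the function names are hypothetical.

```python
# Sketch of an additive early warning score.
# ASSUMPTION: cut-offs follow the commonly cited MEWS table; the article
# itself does not state them.

def score_systolic_bp(sbp):
    if sbp <= 70:
        return 3
    if sbp <= 80:
        return 2
    if sbp <= 100:
        return 1
    if sbp < 200:
        return 0
    return 2


def score_pulse(hr):
    if hr < 40:
        return 2
    if hr <= 50:
        return 1
    if hr <= 100:
        return 0
    if hr <= 110:
        return 1
    if hr < 130:
        return 2
    return 3


def score_resp_rate(rr):
    if rr < 9:
        return 2
    if rr <= 14:
        return 0
    if rr <= 20:
        return 1
    if rr < 30:
        return 2
    return 3


def score_temperature(temp_c):
    if temp_c < 35.0:
        return 2
    if temp_c < 38.5:
        return 0
    return 2


AVPU_SCORE = {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}


def mews(sbp, hr, rr, temp_c, avpu):
    # Each term looks at one parameter only; no term sees another parameter.
    return (score_systolic_bp(sbp) + score_pulse(hr) + score_resp_rate(rr)
            + score_temperature(temp_c) + AVPU_SCORE[avpu])
```

In this sketch a temperature of 39.0 °C contributes the same points whether the systolic blood pressure is 85 or 140 mmHg, which is the independence the paragraph describes.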
Machine learning (ML) based TAS overcomes this limitation of MEWS and shows higher performance than MEWS. ML refers to algorithms that allow a computer to learn from given data (i.e., to improve its performance on a specific task) without being explicitly programmed. Until the last few years, several domains, including TAS, used ML methods such as logistic regression (LR) and random forest (RF).[16–18] LR finds the relationship between parameters and outcome and expresses it as a linear combination of the parameters. RF creates several decision trees with an ensemble technique and combines their results. A decision tree builds a tree-like graph (i.e., model) that predicts the outcome by learning discrete cut-points (i.e., rules). Recently, deep learning (DL) has achieved state-of-the-art performance in several domains through deep hierarchical feature construction.[19–21] One of the most important advantages of DL over conventional ML is feature learning. From a large amount of data, deep learning automatically learns the features or representations needed for a given task, such as classification or detection, using several non-linear modules. In this study, we developed a Deep-learning-based TAS (DTAS) and validated that DTAS significantly outperforms existing TASs using a large, multicenter dataset.
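To make the two conventional models concrete, the sketch below shows how each turns inputs into a risk. It is a simplified illustration, not the models used in the study: the weights, cut-points, and leaf risks passed in are hypothetical, and a real random forest also trains deeper trees on bootstrap samples with random feature subsets, which is omitted here.

```python
import math


def logistic_regression_risk(x, weights, bias):
    # LR: a linear combination of the parameters passed through a sigmoid.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def stump(feature_index, cut_point, risk_below, risk_above):
    # A one-split decision tree: a single learned cut-point (rule).
    def predict(x):
        return risk_below if x[feature_index] < cut_point else risk_above
    return predict


def forest_risk(trees, x):
    # Ensemble step: average the predictions of the individual trees.
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical example: x = [age, systolic blood pressure]
example_trees = [stump(0, 65, 0.1, 0.3), stump(1, 90, 0.5, 0.1)]
example_risk = forest_risk(example_trees, [70, 80])
```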
We conducted a retrospective observational cohort study using data from the Korean National Emergency Department Information System (NEDIS) which collected data on all visits in real time from 151 EDs in Korea. The NEDIS data was split into derivation data (January 2014-June 2016) and validation data (July-December 2016). Furthermore, we used data from the Sejong General Hospital (SGH) for external validation (January-December 2017). The hospital is a specialist cardiovascular teaching hospital, with approximately 14,000 patients visiting the ED each year. As shown in Table 1, internal and external validation data had different characteristics. We verified that DTAS was not biased towards specific characteristics through the validation of both data. The Sejong General Hospital Institutional Review Board approved this study and granted waivers of informed consent based on general impracticability and minimal harm. Patient information was anonymized and de-identified before the analysis.
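The temporal split can be expressed as a simple date rule. The sketch below assumes each visit carries an arrival date; the function name and return labels are hypothetical.

```python
from datetime import date

# Boundaries taken from the text: derivation January 2014 to June 2016,
# validation July to December 2016.
DERIVATION_START = date(2014, 1, 1)
VALIDATION_START = date(2016, 7, 1)
VALIDATION_END = date(2016, 12, 31)


def assign_split(visit_date):
    """Return the NEDIS split a visit date falls into, or None if outside."""
    if DERIVATION_START <= visit_date < VALIDATION_START:
        return "derivation"
    if VALIDATION_START <= visit_date <= VALIDATION_END:
        return "validation"
    return None
```

Splitting by calendar time rather than at random means the validation visits all occur after the derivation period.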
The NEDIS data included age, sex, arrival time, chief complaint, arrival mode, initial vital signs, trauma, ED treatment result, place of hospitalization, admission result, KTAS, discharge diagnosis, etc. The study subjects were adult patients (≥18 years); patients who were dead on arrival or had missing values were excluded.
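A minimal sketch of these inclusion criteria, assuming each visit is a dictionary; the field names and the list of required fields are hypothetical and do not reflect the actual NEDIS schema.

```python
# ASSUMPTION: illustrative field names, not the real NEDIS column names.
REQUIRED_FIELDS = ("age", "sex", "chief_complaint", "arrival_mode", "systolic_bp")


def is_eligible(visit):
    """Adult visit, not dead on arrival, with no missing required value."""
    if visit.get("dead_on_arrival"):
        return False
    if any(visit.get(field) is None for field in REQUIRED_FIELDS):
        return False
    return visit["age"] >= 18
```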
The primary outcome was in-hospital mortality. The secondary outcome was critical care, and the tertiary outcome was hospitalization. The critical care outcome comprised direct admission to the intensive care unit (ICU), transfer to other hospitals for ICU admission, and in-hospital mortality. The hospitalization outcome comprised direct admission to hospital, transfer to other hospitals for admission, and in-hospital mortality. Admitted patients who eventually died were included in both the critical care outcome and the hospitalization outcome. However, outcomes were not double counted because we predicted independently for each outcome whether it would occur or not: "hospitalization or non-hospitalization," "critical care or non-critical care," and "mortality or non-mortality." We used age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, trauma, initial vital signs, and mental status as predictor variables (Table 1).
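The three outcome definitions can be read as three separate binary labels derived from the same visit. The sketch below follows the definitions in the text literally; the field names are hypothetical, and "admitted" is assumed to mean direct admission to any ward.

```python
def outcome_labels(visit):
    """Derive the three independent binary targets for one visit."""
    died = bool(visit["died_in_hospital"])
    critical = died or bool(visit["icu_admission"]) or bool(visit["transfer_for_icu"])
    hospitalized = died or bool(visit["admitted"]) or bool(visit["transfer_for_admission"])
    return {
        "mortality": int(died),
        "critical_care": int(critical),
        "hospitalization": int(hospitalized),
    }
```

Because each label is predicted by its own binary task, a patient who died counts once in each task rather than several times in one.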
We developed DTAS using a multilayer perceptron, a deep learning method, with 5 hidden layers. Because adding more than 5 layers yielded no gain in accuracy, we used 5 layers to minimize the number of parameters to be learned. The first to fourth layers consisted of 32, 32, 16, and 8 nodes, respectively, and applied a rectified linear activation. The last layer consisted of 1 node, which represented the risk of each outcome and applied a sigmoid function. We trained DTAS with the Adam optimizer and used binary cross-entropy as the loss function. To validate our model, we used the hyperparameters of the model with the best performance on a 10% subset of the derivation data during the training process.
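The layer layout described above can be sketched as a plain forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: the input width (N_INPUTS), the weight initialisation, and the random seed are hypothetical, the weights are untrained, and the Adam training loop is omitted.

```python
import math
import random

LAYER_SIZES = [32, 32, 16, 8, 1]  # node counts given in the text
N_INPUTS = 12                     # ASSUMPTION: encoded input width is not stated


def init_layers(n_inputs, sizes, seed=0):
    # ASSUMPTION: He-style random initialisation, chosen only for illustration.
    rng = random.Random(seed)
    layers, fan_in = [], n_inputs
    for size in sizes:
        weights = [[rng.gauss(0.0, math.sqrt(2.0 / fan_in)) for _ in range(fan_in)]
                   for _ in range(size)]
        layers.append((weights, [0.0] * size))
        fan_in = size
    return layers


def forward(layers, x):
    activation = x
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activation = [max(0.0, v) for v in pre]                  # rectified linear
        else:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in pre]   # sigmoid
    return activation[0]  # single output node: predicted risk


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


layers = init_layers(N_INPUTS, LAYER_SIZES)
risk = forward(layers, [0.5] * N_INPUTS)
```

One such network would be used per outcome, since the text describes the single output node as the risk of each outcome.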
We compared the performance of DTAS with KTAS, MEWS, LR, and RF. KTAS has been used nationwide for triage in Korea since 2016. MEWS is widely used as a tool to identify patients at risk of deterioration, and several studies have shown good results with MEWS in predicting poor outcomes of ED patients.[13,23,24] In previous studies, LR and RF were the most commonly used machine learning algorithms and showed better performance than MEWS.[25–27]
We evaluated performance separately for each outcome. We used the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) as comparative measures. AUROC is one of the most widely used metrics for evaluating binary classifiers; the ROC curve plots sensitivity against 1-specificity. Compared with AUROC, AUPRC is more useful for imbalanced data such as ours; the precision-recall curve plots precision (i.e., positive predictive value) against recall (i.e., sensitivity). With imbalanced data, in which the number of negatives outweighs the number of positives, AUROC is limited for evaluating performance because the false positive rate (false positives/total real negatives) does not decrease dramatically when the total number of negatives is large.
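Both metrics can be computed directly from labels and predicted risks. The sketch below is an illustration on toy data, not the evaluation code of the study: AUROC is computed as the probability that a positive case is ranked above a negative one, and AUPRC is approximated by average precision, which assumes untied scores.

```python
def auroc(labels, scores):
    # Probability that a random positive outranks a random negative
    # (ties count as half).
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    # Step-wise area under the precision-recall curve: precision is
    # accumulated each time recall increases (i.e., at each positive).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(labels)
    true_positives, area = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            area += (true_positives / rank) / total_positives
    return area


# Toy data: 2 positives among 8 cases (hypothetical values).
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_auroc = auroc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)
```

On the toy data a single negative ranked above a positive lowers the precision-based metric more than the ROC-based one, which is the behaviour the paragraph attributes to imbalanced data.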
A total of 11,656,559 ED visits to 151 hospitals were included in the NEDIS. We excluded 689,041 visits (114,368 patients dead on arrival and 574,673 visits with missing values). The study subjects comprised 10,967,518 ED visits, and the outcomes were 153,217 in-hospital deaths (1.4%), 625,117 critical care admissions (5.7%), and 2,964,367 hospitalizations (27.0%) (Table 1). DTAS was developed using derivation data of 8,981,184 patients, and the validation study was performed using data of 1,986,334 patients in the NEDIS. External validation was performed using 13,989 visits to the SGH ED, where the outcomes were 150 in-hospital deaths (1.1%), 987 critical care admissions (7.1%), and 4,337 hospitalizations (31.0%).
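The cohort counts above are internally consistent, which can be checked with a few lines of arithmetic. All numbers are taken from the text; only the variable names are introduced here.

```python
total_visits = 11_656_559
dead_on_arrival = 114_368
missing_values = 574_673

excluded = dead_on_arrival + missing_values   # 689,041 excluded visits
study_visits = total_visits - excluded        # 10,967,518 study visits

derivation_visits = 8_981_184
validation_visits = 1_986_334

outcome_counts = {
    "in_hospital_death": 153_217,
    "critical_care": 625_117,
    "hospitalization": 2_964_367,
}
# Outcome rates as a percentage of study visits, to one decimal place.
outcome_rates = {name: round(100 * count / study_visits, 1)
                 for name, count in outcome_counts.items()}
```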
As shown in Fig 1 and Table 2, DTAS (AUROC: 0.935, AUPRC: 0.264) significantly outperformed KTAS (AUROC: 0.785, AUPRC: 0.192), MEWS (AUROC: 0.810, AUPRC: 0.116), LR (AUROC: 0.903, AUPRC: 0.209), and RF (AUROC: 0.910, AUPRC: 0.179) with respect to in-hospital mortality. DTAS also outperformed KTAS, MEWS, LR, and RF with respect to critical care and hospitalization (Table 2). With respect to external validation, DTAS consistently showed better performance than other TASs.
Fig 1 shows Receiver operating characteristic (ROC) curve and precision-recall (PR) curve for predicting in-hospital mortality. ROC curve of internal validation (A) and PR curve of internal validation (B) show that the Deep-learning-based Triage and Acuity Score (DTAS) predicted in-hospital mortality more accurately than Korean Triage and Acuity System (KTAS), Modified Early Warning Score (MEWS), Random Forest (RF), and Logistic Regression (LR) using the National Emergency Department Information System (NEDIS) data (Table 1). The ROC curve of external validation (C) and PR curve of external validation (D) demonstrated that DTAS predicted in-hospital mortality more accurately than other methods using the Sejong General Hospital (SGH) dataset. With respect to external validation, DTAS (AUROC: 0.92, AUPRC: 0.23) significantly outperformed KTAS (AUROC:0.80, AUPRC: 0.13), MEWS (AUROC: 0.74, AUPRC: 0.06), RF (AUROC: 0.89, AUPRC: 0.14), and LR (AUROC: 0.89, AUPRC:0.16).
As shown in Fig 1, the sensitivity of KTAS level 3 was 0.49 for predicting in-hospital mortality. At this point, the precisions of DTAS, KTAS, MEWS, RF, and LR were 0.24, 0.08, 0.09, 0.16, and 0.18, respectively.
We found that DTAS showed the best performance for predicting in-hospital mortality, critical care, and hospitalization based on a large, multicenter dataset. DTAS can reduce false positives by 67% compared to KTAS. This reduction in false positives increases the practical applicability of DTAS.
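The text does not show how the 67% figure is computed. The sketch below assumes it is derived from the precisions reported at a sensitivity of 0.49 (0.24 for DTAS and 0.08 for KTAS); at equal sensitivity both tools catch the same true positives, so the two can be compared per true positive. The helper names are hypothetical.

```python
# Precisions at the sensitivity of KTAS level 3 (0.49), as reported in the text.
precision_dtas = 0.24
precision_ktas = 0.08


def flagged_per_true_positive(precision):
    # precision = TP / (TP + FP), so (TP + FP) / TP = 1 / precision
    return 1.0 / precision


def false_positives_per_true_positive(precision):
    # FP / TP = 1 / precision - 1
    return 1.0 / precision - 1.0


# Relative reduction in all flagged patients (true and false positives).
reduction_flagged = 1.0 - (flagged_per_true_positive(precision_dtas)
                           / flagged_per_true_positive(precision_ktas))

# Relative reduction when only false positives are counted.
reduction_false_positives = 1.0 - (false_positives_per_true_positive(precision_dtas)
                                   / false_positives_per_true_positive(precision_ktas))
```

Under this assumption the number of flagged patients per true positive falls by about two thirds, in line with the reported figure; counting false positives alone gives a somewhat larger relative reduction.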
Several previous studies attempted to predict outcomes of ED patients. Taylor et al. reported a new random forest method for predicting in-hospital mortality of emergency department patients with sepsis. Ong et al. reported a conventional machine learning model for predicting cardiac arrest in critically ill patients presenting to the ED. However, these two studies used small populations and did not perform multicenter validation. The performance of algorithms based on given data rather than medical knowledge, such as machine learning and deep learning, is not guaranteed in other environments. Because these algorithms learn the relationship between the predictor variables and the outcome only from the given data, they may memorize only the characteristics of the derivation data. According to the No Free Lunch theorem described by Wolpert, a model optimized for one situation is not guaranteed to produce good results in other situations.
We used the national big data NEDIS to develop and validate DTAS, and the subjects of this study were those who visited EDs across the whole country. Therefore, DTAS learned the characteristics of patients nationwide rather than those of any particular area. However, DTAS could be biased toward the average characteristics of NEDIS (i.e., overfitting). We therefore verified DTAS using SGH data (external validation), which had different characteristics from NEDIS. Through multicenter validation, we showed that the performance of DTAS was not biased towards specific characteristics and was maintained in other environments.
Most patients do not experience rare events such as in-hospital mortality and critical care (i.e., imbalanced data). In this environment, AUPRC is a more important metric than AUROC. With imbalanced data, in which the number of negatives outweighs the number of positives, AUROC is limited for evaluating performance because the false positive rate (false positives/total real negatives) does not decrease dramatically when the total number of negatives is large. AUPRC, on the other hand, is suitable for imbalanced data, as it considers the fraction of true positives among positive predictions. Although DTAS can reduce false positives by 67% compared to KTAS, the AUROC of DTAS is only 19% higher than the AUROC of KTAS for predicting in-hospital mortality. On the other hand, the AUPRC of DTAS is 38% higher than the AUPRC of KTAS.
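The two relative differences quoted above follow from the internal validation values for in-hospital mortality. A short check, using only numbers reported in the text (the function name is hypothetical):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# DTAS versus KTAS, in-hospital mortality, internal validation.
auroc_gain = relative_gain(0.935, 0.785)   # about 0.19
auprc_gain = relative_gain(0.264, 0.192)   # about 0.375, reported as 38%
```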
Unfortunately, traditional triage tools are complex scoring methods that require detailed history taking and physical exams (e.g., pain score, evidence of dehydration, pitting edema, and blood sugar test result), as well as judgment based on clinical experience (e.g., expected emergency department resources).[4,7] These tools require considerable time for triage and are of limited use in resource-constrained settings or in circumstances in which junior triage providers with limited training and experience practice.[9,33] Numerous studies concluded that dedicating a senior doctor to triage reduced the waiting time for patients to see a doctor, decreased the length of stay (LOS), and lowered the proportion of patients who left without being seen.[33,34] However, this solution requires enormous cost.
On the other hand, DTAS requires only age, sex, chief complaint, time from symptom onset to ED visit, arrival mode, presence of trauma, initial vital signs, and mental status as input parameters. This gives DTAS three strengths. First, outcomes predicted by DTAS have low variation and high reliability because the input parameters are basic information with low inter-physician variation. Second, the input parameters do not require expert judgment and can be collected very quickly, which would be of great value in a resource-constrained ED setting. Third, the parameters of DTAS can be checked in a pre-hospital setting, so the DTAS score can be calculated during pre-hospital transport and in out-of-hospital situations. Therefore, DTAS has the potential to make pre-hospital emergency medical service (EMS) and ED processes more efficient. Our next area of research is a prospective study of EMS and ED triage to verify the performance and efficiency of DTAS.
The Deep-learning-based Triage and Acuity Score predicted in-hospital mortality, critical care, and hospitalization more accurately than existing triage and acuity scores, and it was validated using a large, multicenter dataset.
- 1. Pines JM, Hilton JA, Weber EJ, Alkemade AJ, Al Shabanah H, Anderson PD, et al. International perspectives on emergency department crowding. Acad Emerg Med. 2011;18: 1358–1370. pmid:22168200
- 2. Hoot NR, Aronsky D. Systematic Review of Emergency Department Crowding: Causes, Effects, and Solutions. Ann Emerg Med. 2008;52.
- 3. Bernstein SL, Aronsky D, Duseja R, Epstein S, Handel D, Hwang U, et al. The effect of emergency department crowding on clinically oriented outcomes. Acad Emerg Med. 2009;16: 1–10. pmid:19007346
- 4. Bullard MJ, Musgrave E, Warren D, Unger B, Skeldon T, Grierson R, et al. Revisions to the Canadian Emergency Department Triage and Acuity Scale (CTAS) Guidelines 2016. Can J Emerg Med. 2017;19: S18–S27.
- 5. Lee B, Kim DK, Park JD, Kwak YH. Clinical considerations when applying vital signs in pediatric korean triage and acuity scale. J Korean Med Sci. 2017;32: 1702–1707. pmid:28875617
- 6. Tanabe P, Gimbel R, Yarnold PR, Kyriacou DN, Adams JG. Reliability and Validity of Scores on the Emergency Severity Index Version 3. Acad Emerg Med. 2004;11: 59–65. pmid:14709429
- 7. Christ M, Grossmann F, Winter D, Bingisser R, Platz E. Modern triage in the emergency department. Dtsch Arztebl Int. 2010;107: 892–8. pmid:21246025
- 8. Farrohknia N, Castrén M, Ehrenberg A, Lind L, Oredsson S, Jonsson H, et al. Emergency Department Triage Scales and Their Components: A Systematic Review of the Scientific Evidence. Scand J Trauma Resusc Emerg Med. BioMed Central Ltd; 2011;19: 42. pmid:21718476
- 9. Welch SJ, Davidson SJ. The performance limits of traditional triage. Ann Emerg Med. Elsevier Inc.; 2011;58: 143–144. pmid:21601312
- 10. Burch VC, Tarr G, Morroni C. Modified early warning score predicts the need for hospital admission and inhospital mortality. Emerg Med J. 2008;25: 674–678. pmid:18843068
- 11. Subbe CP, Davies RG, Williams E, Rutherford P, Gemmell L. Effect of introducing the Modified Early Warning score on clinical outcomes, cardio-pulmonary arrests and intensive care utilisation in acute medical admissions*. Anaesthesia. 2003;58: 797–802. pmid:12859475
- 12. Armagan E, Yilmaz Y, Olmez OF, Simsek G, Gul CB. Predictive value of the modified early warning score in a turkish emergency department. Eur J Emerg Med. 2008;15: 338–340. pmid:19078837
- 13. Gottschalk SB, Wood D, Devries S, Wallis LA, Bruijns S. The cape triage score: A new triage system South Africa. Proposal from the cape triage group. Emerg Med J. 2006;23: 149–153. pmid:16439753
- 14. Mullan PC, Torrey SB, Chandra A, Caruso N, Kestler A. Reduced overtriage and undertriage with a new triage system in an urban accident and emergency department in Botswana: A cohort study. Emerg Med J. 2014;31: 356–360. pmid:23407375
- 15. Levin S, Toerper M, Hamrock E, Hinson JS, Barnes S, Gardner H, et al. Machine-Learning-Based Electronic Triage More Accurately Differentiates Patients With Respect to Clinical Outcomes Compared With the Emergency Severity Index. Ann Emerg Med. American College of Emergency Physicians; 2017; 18–20.
- 16. Zhai H, Brady P, Li Q, Lingren T, Ni Y, Wheeler DS, et al. Developing and evaluating a machine learning based algorithm to predict the need of pediatric intensive care unit transfer for newly hospitalized children. Resuscitation. 2014;85: 1065–1071. pmid:24813568
- 17. Blomberg SN, Folke F, Lippert F. Machine Learning–A novel approach to increase recognition of out-of-hospital cardiac arrest. Resuscitation. Elsevier Ireland Ltd; 2017;118: e19.
- 18. Green M, Lander H, Snyder A, Hudson P, Churpek M, Edelson D. Comparison of the Between the Flags calling criteria to the MEWS, NEWS and the electronic Cardiac Arrest Risk Triage (eCART) score for the identification of deteriorating ward patients. Resuscitation. European Resuscitation Council, American Heart Association, Inc., and International Liaison Committee on Resuscitation. Published by Elsevier Ireland Ltd; 2018;123: 86–91. pmid:29169912
- 19. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521: 436–444. pmid:26017442
- 20. Son J, Park SJ, Jung K-H. Retinal Vessel Segmentation in Fundoscopic Images with Generative Adversarial Networks. 2017; Available: http://arxiv.org/abs/1706.09318
- 21. Kwon J-M, Lee Y, Lee Y, Lee S, Park J. An Algorithm Based on Deep Learning for Predicting In-Hospital Cardiac Arrest. J Am Heart Assoc. 2018;7: e008678. pmid:29945914
- 22. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. 2017 IEEE Int Conf Consum Electron ICCE 2017. 2014; 434–435.
- 23. Liu N, Koh ZX, Goh J, Lin Z, Haaland B, Ting BP, et al. Prediction of adverse cardiac events in emergency department patients with chest pain using machine learning for variable selection. BMC Med Inform Decis Mak. 2014;14: 75. pmid:25150702
- 24. Subbe CP, Slater A, Menon D, Gemmell L. Validation of physiological scoring systems in the accident and emergency department. Emerg Med J. 2006;23: 841–845. pmid:17057134
- 25. Churpek MM, Yuen TC, Winslow C, Meltzer DO, Kattan MW, Edelson DP. Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards. Crit Care Med. 2016;44: 368–74. pmid:26771782
- 26. Mortazavi BJ, Downing NS, Bucholz EM, Dharmarajan K, Manhapra A, Li SX, et al. Analysis of Machine Learning Techniques for Heart Failure Readmissions. Circ Cardiovasc Qual Outcomes. 2016;9: 629–640. pmid:28263938
- 27. Shouval R, Hadanny A, Shlomo N, Iakobishvili Z, Unger R, Zahger D, et al. Machine learning for prediction of 30-day mortality after ST elevation myocardial infraction: An Acute Coronary Syndrome Israeli Survey data mining study. Int J Cardiol. Elsevier Ireland Ltd; 2017;246: 7–13. pmid:28867023
- 28. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proc 23rd Int Conf Mach Learn—ICML ‘06. 2006; 233–240.
- 29. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of In-hospital Mortality in Emergency Department Patients with Sepsis: A Local Big Data-Driven, Machine Learning Approach. Acad Emerg Med. 2016;23: 269–278. pmid:26679719
- 30. Ong MEH, Lee Ng CH, Goh K, Liu N, Koh Z, Shahidah N, et al. Prediction of cardiac arrest in critically ill patients presenting to the emergency department using a machine learning score incorporating heart rate variability compared with the modified early warning score. Crit Care. BioMed Central Ltd; 2012;16: R108. pmid:22715923
- 31. Wolpert DH. The Supervised Learning No-Free-Lunch Theorems. Proc 6th Online World Conf Soft Comput Ind Appl. 2001; 10–24.
- 32. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: 1–21.
- 33. Wiler JL, Gentle C, Halfpenny JM, Heins A, Mehrotra A, Mikhail MG, et al. Optimizing Emergency Department Front-End Operations. Ann Emerg Med. Elsevier Inc.; 2010;55: 142–160.e1. pmid:19556030
- 34. Abdulwahid MA, Booth A, Kuczawski M, Mason SM. The impact of senior doctor assessment at triage on emergency department performance measures: Systematic review and meta-analysis of comparative studies. Emerg Med J. 2016;33: 504–13. pmid:26183598
- 35. Partovi SN, Nelson BK, Bryan ED, Walsh MJ. Faculty triage shortens emergency department length of stay. Acad Emerg Med. 2001;8: 990–5. pmid:11581086