Application of a decision tree model in the early identification of severe patients with severe fever with thrombocytopenia syndrome

Background Severe fever with thrombocytopenia syndrome (SFTS) is a serious infectious disease with a fatality rate of up to 30%. Identifying the severity of SFTS precisely and quickly is important in clinical practice. Methods From June to July 2020, 71 patients admitted to the Infectious Department of Joint Logistics Support Force No. 990 Hospital were enrolled in this study. The most frequently observed symptoms and laboratory parameters on admission were collected from patients’ electronic records. Decision trees were built to identify the severity of SFTS. Accuracy and Youden’s index were calculated to evaluate the identification capacity of the models. Results Clinical characteristics, including body temperature (p = 0.011), the size of the lymphadenectasis (p = 0.021), and cough (p = 0.017), and neurologic symptoms, including lassitude (p<0.001), limb tremor (p<0.001), hypersomnia (p = 0.009), coma (p = 0.018) and dysphoria (p = 0.008), were significantly different between the mild and severe groups. As for laboratory parameters, PLT (p = 0.006), AST (p<0.001), LDH (p<0.001), and CK (p = 0.003) were significantly different between the mild and severe groups of SFTS patients. A decision tree based on laboratory parameters and one based on demographic and clinical characteristics were built. Compared with the decision tree based on demographic and clinical characteristics, the decision tree based on laboratory parameters had a stronger prediction capacity, as shown by its higher accuracy and Youden’s index. Conclusion Decision trees can be applied to predict the severity of SFTS.


Introduction
Severe fever with thrombocytopenia syndrome (SFTS) is a tick-borne infectious disease caused by a novel virus called SFTS virus (SFTSV) [1]. SFTS was first reported in China in 2006, and other Asian countries, such as Japan and Korea, reported it later [2][3][4]. SFTSV is a novel member of the Phlebovirus genus of the Phenuiviridae family, which can be transmitted by several routes, including tick bites and person-to-person transmission through blood or other body fluids [5][6][7][8][9]. The average fatality rate of SFTS is up to 30% [10], which makes SFTS a serious health threat, and the World Health Organization lists SFTSV as a priority pathogen requiring urgent attention.
According to previous research, there is no specific antiviral therapy for SFTSV infection, and the most essential parts of case management are symptomatic treatment and supportive therapy [11]. The timely referral of severe SFTS patients to the intensive care unit (ICU) has been associated with an increased survival rate [12]. Because of the high fatality rate of SFTS, it is important for physicians to recognize patients with severe disease as early as possible. An early and accurate identification model for the severity of SFTS could not only illustrate parameters that influence the evolution of SFTS but also help clinicians make better decisions and improve the efficiency of the treatment.
Decision trees are one of the most effective methods for data mining, as extracting meaningful information from measured data represents a plausible solution for massive data learning tasks [13]. Furthermore, decision trees have the advantages of a nonparametric setup, the tolerance of heterogeneous data, and immunity to noise [14]. This study aims to develop a parsimonious model in the form of a decision tree for classifying the severity of SFTS using the most frequently observed symptoms and laboratory parameters on admission.

Patients
From June to July 2020, a total of 86 suspected patients were admitted to the Infectious Department of the Joint Logistics Support Force No. 990 Hospital, and 71 of them were laboratory-confirmed by PCR assay. The real-time PCR assay using the PCR Diagnostic Kit for SFTSV RNA (BGI-GBI, China) was performed as previously described [15,16]. All 71 patients were farmers from Xinyang or an adjacent administrative region; 40 (56.3%) were male and 31 (43.7%) were female. The mean age of the patients was (62.58±11.89) years.

Inclusion criteria
According to the "Guideline for SFTS prevention and control (2010)" issued by the National Health and Family Planning Commission, SFTS cases were diagnosed by the following criteria: (1) epidemiological characteristics (e.g., history of tick bites, working in mountainous areas or teahouses, or direct contact with the blood of a confirmed patient during the two weeks prior to symptom onset); (2) clinical presentation (e.g., fever (>38˚C), headache, muscle aches, nausea, vomiting, diarrhea, skin bruising, bleeding, multiple-organ damage); (3) laboratory findings (e.g., decrease in leukocyte count and thrombocytopenia); and (4) exclusion of other diseases, that is, the patient was excluded if PCR assay of acute-phase blood samples was positive for other diseases such as hemorrhagic fever with renal syndrome (HFRS), dengue fever, and thrombocytopenic. A patient who met all of the above criteria was diagnosed with SFTS [16].
Patients were divided into two groups: mild and severe. Severe SFTS patients were defined as any patient who either required admission to an ICU or met at least one of the following criteria: a) acute lung injury (ALI) or acute respiratory distress syndrome (ARDS); b) heart failure; c) acute renal failure; d) encephalitis; e) shock; and f) disseminated intravascular coagulation (DIC) or death [17].
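The grouping rule above can be written as a simple boolean check. The following is a minimal sketch only: the record type and its field names are hypothetical and are not taken from the study database.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    """Hypothetical patient record; field names are illustrative assumptions."""
    icu_admission: bool = False
    ali_or_ards: bool = False          # criterion a)
    heart_failure: bool = False        # criterion b)
    acute_renal_failure: bool = False  # criterion c)
    encephalitis: bool = False         # criterion d)
    shock: bool = False                # criterion e)
    dic_or_death: bool = False         # criterion f)


def is_severe(p: Admission) -> bool:
    """Severe if ICU admission was required or at least one of a) to f) is met."""
    return any((
        p.icu_admission,
        p.ali_or_ards,
        p.heart_failure,
        p.acute_renal_failure,
        p.encephalitis,
        p.shock,
        p.dic_or_death,
    ))


def severity_group(p: Admission) -> str:
    """Map a record to the two groups used in this study."""
    return "severe" if is_severe(p) else "mild"
```

A patient with no flag set falls into the mild group; setting any single flag, for example `Admission(shock=True)`, moves the patient into the severe group.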

Data collection
All patients' electronic records were investigated, and serum samples were collected at admission. All samples were transported frozen to the pathogen laboratory of Henan Center for Disease Control and Prevention (Henan CDC). Patient demographic information, including age, sex, exposure history, history of tick bite, clinical characteristics (body temperature, cough, nausea, muscular aches, fatigue), and routine laboratory parameter results, was collected by investigating the patient's electronic record.

Ethics approval and consent to participate
The research was approved by the Ethics Committee of Henan Center for Disease Control and Prevention. All participants gave written informed consent for the use of their samples in research. All data analyzed were anonymized.

Decision trees
Decision trees have proven to be a valuable tool for extracting meaningful information from measured data and represent a plausible solution for massive data learning tasks [18]. There are three classic decision tree algorithms: ID3, C4.5, and CART. C4.5 and CART can handle both continuous variables and discrete variables, and they are not sensitive to incomplete data, whereas ID3 can only handle discrete variables [19]. CART generates binary trees, and C4.5 generates multiple branches. In this study, CART was employed to construct a prediction model. The details of the CART algorithm are as follows.
Suppose that there are C categories of data in sample dataset S. The Gini index formula is as follows:

$$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{C} P_i^{2}$$

where S represents the training data set, C represents the number of data classes, and P_i represents the ratio of the number of samples in class i to the number of all samples. Suppose the current node corresponds to the training data set S, and characteristic v divides S into k disjoint subsets S_1, S_2, S_3, ..., S_k, that is,

$$S = \bigcup_{j=1}^{k} S_j, \qquad S_j \cap S_l = \varnothing \ (j \neq l)$$

The information gain G(S, v) is as follows:

$$G(S, v) = \mathrm{Gini}(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{Gini}(S_j)$$

According to the definition of information gain, more information gain means stronger classification capacity.

SPSS 21.0 and R 4.0.2 were used to perform statistical analysis. Categorical variables are summarized as frequencies and proportions. Continuous variables with a normal distribution are described as mean and standard deviation (SD), whereas those with a non-normal distribution are described as median and interquartile range (IQR). The unpaired t test or Mann-Whitney U test was employed to test the differences in continuous variables between the mild and severe cases. Categorical clinical parameters were compared between the two groups with the Pearson χ² test (when sample size was over 40 and theoretical frequencies were over 5) or Fisher exact test (when sample size was smaller than 40 or theoretical frequencies were smaller than 1). The predictive value of the models was evaluated by indexes including accuracy and Youden's index. A P value < 0.05 was considered to be statistically significant.
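The Gini-based split criterion described for CART can be illustrated with a short sketch. It assumes the standard definitions (Gini index as one minus the sum of squared class proportions, and gain as parent impurity minus the size-weighted impurity of the subsets). The function names and the toy marker values are hypothetical and are not study data.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(labels, subsets):
    """Parent Gini index minus the size-weighted Gini index of the subsets."""
    n = len(labels)
    weighted = sum(len(s) / n * gini(s) for s in subsets)
    return gini(labels) - weighted


def best_threshold(values, labels):
    """Scan midpoints between distinct sorted values for the best binary split.

    Returns (threshold, gain); threshold is None if no split improves purity.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = gini_gain(labels, [left, right])
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy example (made-up values of a hypothetical laboratory marker).
marker = [20, 35, 50, 80, 120, 200]
group = ["mild", "mild", "mild", "severe", "severe", "severe"]
threshold, gain = best_threshold(marker, group)
```

In this toy example the two groups are perfectly separated by the marker, so the best binary split removes all impurity of the parent node. CART repeats this search over every candidate variable at every node and keeps the split with the largest gain.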

Patients' demographic and clinical characteristics
Through PCR assay, a total of 71 SFTS patients were enrolled (Fig 1), including 30 mild patients and 41 severe patients. There was no difference in sex distribution between the two groups (χ² = 1.975, p = 0.160). There was no significant difference in age between the two groups (t = -1.643, p = 0.105).

Comparison of laboratory testing results between mild and severe cases
Of all laboratory testing results, platelet count (PLT), lactate dehydrogenase (LDH), aspartate aminotransferase (AST), and creatine kinase (CK) in severe patients were significantly different from those in mild patients. PLT was lower in severe patients than in mild patients, whereas AST, LDH and CK were markedly higher. Other laboratory features (e.g., white blood cells (WBCs), lymphocytes (LYMs), neutrophils (NEUs), alanine aminotransferase (ALT), gamma-glutamyl transpeptidase (GGT), and creatinine (Cr)) were comparable between the two groups. Details of the comparison of laboratory features are shown in Table 2.

Decision tree analysis
The first decision tree, based on laboratory parameters generated by the optimized CART algorithm, is shown in

Validation of the models
Cross-validation was performed to select the optimal model, that is, the model with the lowest error rate on the test sets. Cross-validation makes maximum use of the acquired data by dividing the data set S into k non-overlapping subsets S_1, S_2, ..., S_k. In k-fold cross-validation, each subset is selected in turn as the test set, and the remaining data are used as the training set to train the model.
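The splitting scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure run in R for this study: the value of k, the shuffling seed, and the `fit` and `predict` callables are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validated_error(labels, k, fit, predict, seed=0):
    """Mean test-set error rate over the k folds.

    fit(train_indices) returns a model; predict(model, index) returns a label.
    Both callables are placeholders supplied by the caller.
    """
    fold_errors = []
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit(train)
        wrong = sum(predict(model, i) != labels[i] for i in test)
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / len(fold_errors)
```

Each sample appears in exactly one test fold, so every observation contributes to both training and testing. Candidate trees (for example, trees pruned to different depths) can then be compared by their mean test-set error.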
The performance of the decision trees was evaluated by confusion matrices based on true positives (TP: number of patients with severe SFTS who were correctly predicted), true negatives (TN: number of patients with mild SFTS who were correctly predicted), false positives (FP: number of patients with mild SFTS who were wrongly predicted to have a severe condition), and false negatives (FN: number of patients with severe SFTS who were wrongly predicted to have a mild condition). The sensitivity, specificity, accuracy and Youden's index were calculated based on the abovementioned parameters. These criteria were calculated as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Youden's index} = \mathrm{Sensitivity} - (1 - \mathrm{Specificity}) = (\mathrm{Sensitivity} + \mathrm{Specificity}) - 1$$

The confusion matrices of the two decision trees are summarized in Table 3. The sensitivity, specificity, accuracy and Youden's index of the two decision trees are displayed in Table 4. The decision tree based on the laboratory parameters of SFTS patients achieved a sensitivity of 92.7% and a specificity of 70.0%. The sensitivity and specificity of the decision tree based on demographic and clinical characteristics of the SFTS patients were 82.9% and 73.3%, respectively. According to Table 4, the first decision tree had a higher accuracy, which meant that the decision tree based on laboratory parameters had a more efficient classification strategy.

PLOS ONE
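The evaluation indexes used above follow directly from the four confusion-matrix counts, as the sketch below shows. The counts in the example are an assumption: they are back-calculated from the reported sensitivities and specificities together with the group sizes (41 severe, 30 mild) and are not copied from Table 3, which remains the authoritative source.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Youden's index from counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    youden = sensitivity + specificity - 1
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden": youden,
    }


# Assumed counts, chosen only to match the percentages reported in the text.
lab_tree = diagnostic_metrics(tp=38, fn=3, tn=21, fp=9)        # 92.7% / 70.0%
clinical_tree = diagnostic_metrics(tp=34, fn=7, tn=22, fp=8)   # 82.9% / 73.3%
```

Under these assumed counts, the tree based on laboratory parameters comes out ahead on both accuracy and Youden's index, which is in line with the comparison reported in the text.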

Discussion
The main clinical manifestations of SFTS are fever, fatigue, and muscular aches [20], and some patients also develop nausea at the initial stage of the disease, making it similar to many other viral infections. Like other hemorrhagic fevers, SFTS can also cause gastrointestinal symptoms, leukopenia, thrombocytopenia, elevated tissue enzymes, hematuria, and proteinuria [16], which makes it difficult to diagnose and treat patients. SFTS patients are mostly middle-aged and elderly farmers. In severe cases, neurological symptoms and bleeding symptoms may occur, and patients may even die due to multiple-organ failure [21]. The number of fatal cases has increased annually in China, although national intervention programs that promote public awareness, set up sentinel hospitals and improve clinicians' skills have been established [22]. Therefore, accurate prediction of the prognosis may help clinicians perform intervention measures in advance, control the disease progression and improve the prognosis.
In this study, demographic and clinical characteristics and laboratory parameters were compared between mild and severe SFTS patients. Univariate analysis showed that the body temperature and size of lymphadenectasis in severe patients were higher than those in mild patients. Cough, lassitude, limb tremor, hypersomnia, coma and dysphoria were risk factors for severity in SFTS patients. Although none of the gastrointestinal symptoms were significantly different between the two groups, in the decision tree based on demographic and clinical characteristics, vomiting was included as a discriminating factor. In the decision tree based on laboratory parameters, WBC was included as a discriminating factor, though it had no significant difference between mild and severe patients. This phenomenon demonstrates one of the uses of decision trees: variable selection [23].
Decision tree methods can be used to select the most relevant input variables that can be used to formulate clinical hypotheses and inform subsequent research, similar to stepwise variable selection in regression analysis.
To the best of our knowledge, there are four prediction models for SFTS patients, and most of them predict death [11,12,22,24]. According to previous studies, ALT, AST, CK, LDH and Cr levels are critical risk factors for fatal SFTS patients [11,22,24-26]. Consistent with previous studies, differences in PLT, AST, LDH and CK between mild and severe patients in this study were significant. The size of lymphadenectasis is also a critical factor in distinguishing SFTS from other hemorrhagic fevers in clinical practice. The performance of the prediction based on the independent factors is shown in Fig 4 and Table 5. Though these are significant factors for predicting the severity of SFTS, decision trees had a stronger classification capacity, as shown by their higher Youden's index.
In conclusion, decision trees can be applied to predict the severity of SFTS. Body temperature, size of the lymphadenectasis, PLT, AST, LDH and CK are classification factors whose prediction capacities are lower than those of decision trees. The classification strategy of the decision tree based on laboratory parameters was more efficient than the classification strategy of the decision tree based on clinical and demographic characteristics.

Acknowledgments
We wish to thank all the participants for volunteering their time to participate in this study.