The selection of indicators from initial blood routine test results to improve the accuracy of early prediction of COVID-19 severity

The global COVID-19 pandemic poses a huge threat to the health and lives of people all over the world and places unprecedented pressure on medical systems. A practical method is needed to improve the efficiency of treatment and optimize the allocation of medical resources. Because of the influx of large numbers of patients into hospitals and the run on medical resources, the blood routine test became the only feasible examination when COVID-19 patients first visited a fever clinic in a community hospital. This study aims to establish an efficient method to identify key indicators from initial blood routine test results for COVID-19 severity prediction. We determined that age is a key indicator for predicting the severity of COVID-19, with an accuracy of 0.77 and an AUC of 0.92. To improve the prediction accuracy, we proposed a Multi Criteria Decision Making (MCDM) algorithm, which combines the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) and the Naïve Bayes (NB) classifier, to further select effective indicators from patients' initial blood test results. The MCDM algorithm selected 3 dominant feature subsets: {Age, WBC, LYMC, NEUT} with a selection rate of 44%, {Age, NEUT, LYMC} with a selection rate of 38%, and {Age, WBC, LYMC} with a selection rate of 9%. Using these feature subsets, the optimized prediction model achieved an accuracy of 0.82 and an AUC of 0.93. These results indicated that Age, WBC, LYMC and NEUT were the key factors for COVID-19 severity prediction. Age, together with the indicators selected by the MCDM algorithm from initial blood routine test results, can effectively predict the severity of COVID-19. Our research could not only help medical workers identify patients with severe COVID-19 at an early stage, but also help doctors understand the pathogenesis of COVID-19 through key indicators.


Introduction
Currently, more than 40 million people worldwide are infected with the SARS-CoV-2 virus, and more than 10 million people are suffering from Coronavirus disease 2019 (COVID-19) and are receiving treatment [1]. This poses a huge threat to the health and lives of people all over the world and places unprecedented pressure on medical systems. Many infected patients cannot receive timely and effective treatment, which also reduces the treatment efficiency for other emergency patients [2].
Patients with suspicious symptoms and an epidemiological history first visit the fever clinic of a community hospital [3]. They usually undergo three initial tests: a SARS-CoV-2 RNA test to confirm SARS-CoV-2 infection, a blood routine test, and a chest CT scan to initially assess the severity of COVID-19 [4]. Timely and effective triage of COVID-19 patients based on the results of these three initial tests is of great significance for maintaining emergency capacity and optimizing treatment plans [2].
Although most COVID-19 patients are Mild-Moderate cases and can recover on their own, about 14% of patients are Severe cases and 5% are Critically Severe cases [5]. Severe-Critically Severe cases usually develop Acute Respiratory Distress Syndrome (ARDS) or Multiple Organ Dysfunction Syndrome (MODS) within two weeks of infection [6], which consumes most of the medical resources and leads to a high case fatality rate (up to 49%) [5,6]. Early prediction of the severity of COVID-19 can help triage patients quickly (i.e., quarantine, hospital admission or ICU assignment, etc.), optimize the use of medical resources and enable timely medical intervention [7,8]. The blood routine test is the most basic examination. Its results include red blood cell count (RBC), hemoglobin (HGB), platelets (PLT), white blood cell count (WBC), lymphocyte count (LYMC), lymphocyte ratio (LYMPH), neutrophil count (NEUT), neutrophil ratio (NEU), neutrophil to lymphocyte ratio (NLR), etc. [9][10][11]. For infectious diseases, a substantial increase or decrease in WBC indicates the severity of the infection. The neutrophil count and ratio (NEUT and NEU) can be used to determine the presence or absence of bacterial infection. A rise or fall in LYMC is characteristic of viral infection [12]. A decrease in lymphocytes is one of the most critical features of SARS-CoV-2 infection [13]. Of all the initial tests for COVID-19 patients, the blood routine test is in common use worldwide with good consistency, and its results are usually available within 2 hours. Because of the influx of large numbers of patients into hospitals and the run on medical resources, the blood routine test might be the only feasible examination when COVID-19 patients first visit a fever clinic in a community hospital [4].
When an emerging infectious disease breaks out, we need to quickly understand its pathogenic characteristics and the independent risk factors that affect its progression [14]. At this time, the outbreak area is often limited, and the number of patients is small at the very beginning [3]. How to comprehensively analyze the high-risk factors leading to severe illness in a small sample is a serious clinical challenge [15]. Up to now, there have been many studies on predictors of the severity of COVID-19 (i.e., older age, pulmonary micro-thrombosis, increased inflammatory factors (C-reactive protein (CRP), IL-6), hyper-lactic acidemia, progressively elevated D-dimer, decreased lymphocyte count (especially CD8+ T cell count) and short-term progression of lung lesions, etc.) [7][16][17][18][19]. However, the collection of these indicators requires multiple tests and takes a lot of time [19]. These studies certainly can help us improve treatment, but can hardly help us respond quickly to emerging infectious disease outbreaks [20][21][22].
In this paper, we aimed to select features from initial blood test results to predict the severity of COVID-19 quickly and accurately. We first defined feature selection as a Multiple Criteria Decision Making (MCDM) problem that considers both the correlation among input features and the correlation between input and output features [23][24][25][26]. In MCDM, some methods provide the priority of indicators, while others provide the ranking of indicators. One of the MCDM ranking methods is the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), which has been used in the selection of significant risk factors for healthcare and prognosis [27][28][29]. Unlike the existing TOPSIS methods [8,27], we used maximum relevance and minimum redundancy [30][31][32] as the criteria for feature selection in order to select independent risk factors. Maximal relevance means selecting the input features with the highest relevance to the output features. However, combinations of individually good features do not necessarily lead to good classification performance [30,31], so minimal redundancy is applied to reduce the redundancy among input features. We then used a series of intuitive measures of relevance and redundancy to select independent risk factors. Finally, we used the Naïve Bayes (NB) classifier to achieve the highest prediction accuracy with the fewest input features. Using TOPSIS MCDM, we successfully screened out "independent risk factors" that predict the severity of COVID-19 [25].
Our research established an easy and accurate method for early prediction of the severity of COVID-19 based on simple clinical characteristics, which could help medical workers identify patients with severe COVID-19 at an early stage, improve the efficiency of emergency triage of patients, and help doctors understand the pathogenesis of COVID-19 through key indicators.

Patient enrollment and study design
We performed this prospective cohort study from March 15 to March 20, 2020 in Wuhan Red Cross Hospital, a hospital designated to treat COVID-19 in Wuhan, China. We collected data on 196 COVID-19 patients diagnosed according to WHO guidance [33] from February 1, 2020 to March 15, 2020. The inclusion criteria were as follows: (1) diagnosis of COVID-19 pneumonia according to the WHO interim guidance published on 28 January 2020 (ref), and (2) availability of relevant medical record information, especially the patient's severity and the initial blood test results obtained when the patient first went to a fever clinic in a community hospital. Patients discharged within 24 h of admission were excluded.

Ethics
The study was approved by the ethics committee of Sichuan Provincial People's Hospital. Because no paper documents may be taken out of the quarantine area of Wuhan Red Cross Hospital, oral informed consent was obtained from all participants, recorded by the doctor and kept in the medical record. Before the predictive model was built, all data were completely anonymized and cleaned.

Definitions
COVID-19 was confirmed by a SARS-CoV-2 RNA test. According to the 5th edition of the China Guidelines for the Diagnosis and Treatment Plan of COVID-19 Infection by the National Health Commission (Trial Version 5) [34], the cases were classified into Mild-Moderate and Severe-Critically Severe.

Data collection
The following information was extracted from each patient: Gender, Age and patients' initial blood routine test results including WBC, LYMC, LYMPH, NEUT, NEU and NLR. The dataset contained 8 input features {Gender, Age, WBC, LYMC, LYMPH, NEUT, NEU, NLR}, and 1 output feature (Severity).

Statistical analysis
Quantitative variables were expressed as the mean ± standard deviation or the median with interquartile range, while categorical variables were expressed as absolute and relative frequencies. The t test or Wilcoxon test was performed to assess differences between quantitative data, and the χ2 test was performed to assess differences between qualitative data. According to the data characteristics, the correlation between clinical characteristics and COVID-19 severity was calculated with the Kendall correlation coefficient (Gender-severity) or the Spearman correlation coefficient. Logistic regression analysis was performed for independent variables with collinearity. The Wald test was used to determine the joint significance of variables. The standard deviation was used to measure the degree of dispersion. Statistical procedures were performed with R statistical software. P values of ≤0.05 were considered significant.
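As an illustration of the Spearman correlation used above, the following minimal sketch computes it as the Pearson correlation of average ranks. The study itself used R; Python is used here only for illustration, and the function names and the small age/severity sample are hypothetical and are not study data.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical toy sample (NOT study data): age vs. severity (0 = Mild-Moderate,
# 1 = Severe-Critically Severe).
age = [34, 45, 52, 61, 70, 78]
severity = [0, 0, 0, 1, 1, 1]
rho = spearman(age, severity)
print(round(rho, 3))
```

A significance test for the coefficient is omitted from this sketch; in practice a statistics package would supply the p-value.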

The MCDM algorithm design and implementation
The proposed algorithm is designed to predict COVID-19 severity as either a Mild-Moderate or a Severe-Critically Severe case. It reduces computation time, improves prediction performance, and provides a better understanding of the data in machine learning. It consists of 4 major stages: preprocessing, feature ranking, feature selection and performance evaluation. Preprocessing refines the collected raw data to de-noise it. Feature ranking orders the features by the value of some scoring function, which usually measures feature relevance. Feature selection aims to choose a small subset of relevant features from the original features by removing irrelevant, redundant, or noisy features. Performance evaluation measures the performance of the binary classification with statistical measures, i.e., Accuracy (ACC), True Positive Rate (TPR), False Positive Rate (FPR) and F1 score.
Preprocessing. We used stratified random sampling to divide the dataset into 2 subsets: a training set (80%) and a test set (20%). Across the 4 stages, we used the test set only for performance evaluation. Suppose there are m input features and n output features. Let X = {x | 1 ≤ x ≤ m} be the input feature set and Y = {y | m+1 ≤ y ≤ m+n} be the output feature set. Elements x and y are indexes of features. The feature set is F = X ∪ Y = {i | 1 ≤ i ≤ m+n}. We calculated and visualized an (m+n)×(m+n) correlation matrix R and an (m+n)×(m+n) p-value matrix P to show the correlations between all different feature pairs. To simplify the analysis, we then preprocessed R in 2 steps. STEP1: We ignored the sign of R.
Feature ranking. We defined a labeled feature set L and initialized it with L = ∅. We iterated the procedure of ranking the input features x ∈ X and moved the first feature in each ranking from X to L. The ranking criteria include 2 evaluations. EVAL1: The correlation between input feature x ∈ X and output feature y ∈ Y, R[x,y] or R[y,x]. EVAL2: The correlation between input feature x ∈ X and labeled feature v ∈ L, R[x,v] or R[v,x]. This explicitly evaluates multiple conflicting criteria in decision making. We proposed an algorithm to solve this Multiple Criteria Decision Making (MCDM) problem by using the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), which is a compensatory aggregation method. The algorithm, called MCDM, creates an evaluation matrix E consisting of p criteria and q alternatives to rank the input features.
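The preprocessing stage described above (a stratified 80%/20% split, and a correlation matrix whose sign is ignored) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stratified_split`, the fixed seed, the rounding rule for the per-class test size, and the tiny matrix `R` are all hypothetical. Only the class sizes (67 and 129) are taken from the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test, keeping class proportions.

    Assumption: the test size per class is round(class_size * test_fraction).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Class sizes reported in the study: 67 Mild-Moderate (0), 129 Severe-Critically Severe (1).
labels = [0] * 67 + [1] * 129
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# STEP1 of the preprocessing: ignore the sign of the correlation matrix.
# R below is a made-up 2x2 example, not a matrix from the study.
R = [[1.0, -0.55], [-0.55, 1.0]]
R_abs = [[abs(v) for v in row] for row in R]
```

Stratifying by class keeps the proportion of Severe-Critically Severe cases roughly equal in both subsets, which matters when the sample is small.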
According to Pareto's principle, the algorithm divided x into 2 types. Writing $e_{ij}$ for the entry of E for criterion i and alternative j, and $e_i^{w}$ and $e_i^{b}$ for the worst and best values of criterion i, the algorithm calculates the L2-distance between the target alternative j and the worst condition:
$$d_j^{w} = \sqrt{\sum_{i=1}^{p} \left(e_{ij} - e_i^{w}\right)^2}$$
It then calculates the distance between j's condition and the best condition:
$$d_j^{b} = \sqrt{\sum_{i=1}^{p} \left(e_{ij} - e_i^{b}\right)^2}$$
After that, it calculates the similarity to the worst condition:
$$s_j = \frac{d_j^{w}}{d_j^{w} + d_j^{b}}$$
Here $s_j = 1$ if and only if alternative j has the best condition, and $s_j = 0$ if and only if alternative j has the worst condition. Let $j^{*} = \arg\max_j \{s_j\}$, then $X = X \setminus \{c_{j^{*}}\}$ and $L = L \cup \{c_{j^{*}}\}$, where $c_{j^{*}}$ is the input feature corresponding to alternative $j^{*}$.
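One TOPSIS ranking step of this kind can be sketched as below. For each alternative (candidate feature) the sketch computes the Euclidean distance to the worst and to the best condition and scores it as distance-to-worst divided by the sum of both distances. Several points are assumptions and not taken from the paper: relevance to the output is treated as a benefit criterion and correlation with an already labeled feature as a cost criterion, no normalization or weighting is applied, and the redundancy values in the example are made up (the relevance values are the absolute correlations with severity reported in the Results).

```python
import math

def topsis_scores(rows, is_benefit):
    """Score alternatives with TOPSIS.

    rows[j][i] is the value of criterion i for alternative j.
    is_benefit[i] is True if larger is better for criterion i, else False.
    """
    p = len(is_benefit)
    cols = [[row[i] for row in rows] for i in range(p)]
    best = [max(c) if is_benefit[i] else min(c) for i, c in enumerate(cols)]
    worst = [min(c) if is_benefit[i] else max(c) for i, c in enumerate(cols)]
    scores = []
    for row in rows:
        d_worst = math.dist(row, worst)  # L2-distance to the worst condition
        d_best = math.dist(row, best)    # L2-distance to the best condition
        total = d_worst + d_best
        # Assumption: if all alternatives coincide, return a neutral 0.5.
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores

# Candidates after Age has been labeled. Column 1: |r| with Severity (benefit).
# Column 2: |r| with Age (cost); these redundancy values are hypothetical.
names = ["WBC", "LYMC", "NEUT", "NLR"]
rows = [[0.24, 0.10], [0.55, 0.30], [0.34, 0.20], [0.31, 0.40]]
scores = topsis_scores(rows, is_benefit=[True, False])
chosen = names[scores.index(max(scores))]
print(chosen)
```

In an iterative ranking, the chosen feature would be moved from the candidate set to the labeled set and the step repeated with one more redundancy criterion.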
The pseudocode of the MCDM algorithm is as follows:
Algorithm MCDM is
    Input: correlation matrix R, number of input features m, number of output features n, input feature set X, output feature set Y
    Output: labeled feature set L
    initialize L = ∅
    sort L ∪ Y and X in ascending order
    ... else ...
Feature selection. The goal of feature subset selection is to find the optimal input feature subset. We gradually increased the number of labeled features and trained a model with the Naïve Bayes classifier for each subset in turn. To find the optimal subset, we sequentially tested the accuracy of the trained models on the training set.
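The feature selection stage (growing the subset along the ranking and training a Naïve Bayes model each time) can be sketched as follows. This is a hedged illustration only: the paper does not state which Naïve Bayes variant was used, so a Gaussian variant is assumed here, and the helper names, the tie-breaking rule (prefer the smaller subset) and the synthetic data are all hypothetical.

```python
import math
import random

def fit_gaussian_nb(X, y):
    """Fit per-class priors and per-feature Gaussian mean/variance."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9  # smoothing
            stats.append((mu, var))
        model[label] = (math.log(len(rows) / len(y)), stats)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior for sample x."""
    def log_posterior(label):
        log_prior, stats = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, stats))
    return max(model, key=log_posterior)

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def best_prefix(X, y, ranked):
    """Try the top-1, top-2, ... ranked features; keep the shortest best prefix."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        Xk = [[row[c] for c in cols] for row in X]
        acc = accuracy(fit_gaussian_nb(Xk, y), Xk, y)  # training-set accuracy
        if acc > best_acc:  # strict comparison: ties keep the smaller subset
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

# Synthetic data (NOT study data): feature 0 is informative, feature 1 is noise.
rng = random.Random(0)
y = [0] * 100 + [1] * 100
X = [[rng.gauss(4.0 * t, 1.0), rng.gauss(0.0, 1.0)] for t in y]
subset, acc = best_prefix(X, y, ranked=[0, 1])
print(subset, round(acc, 3))
```

Evaluating each prefix on the training set mirrors the description above; the held-out test set is reserved for the final performance evaluation.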
Performance evaluation. To test the stability of the algorithm and observe the influence of dataset uncertainty on feature selection, we divided the dataset 100 times (80% training set and 20% test set) and ran the algorithm on each split. We used the test set to analyze the performance of feature selection in terms of ACC, TPR, FPR and F1 score.
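The four measures named above follow directly from the confusion matrix of a binary classifier. The sketch below shows one way to compute them; the function name and the ten example labels are hypothetical, and the Severe-Critically Severe class is assumed to be the positive class (coded as 1).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute ACC, TPR, FPR and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "ACC": (tp + tn) / len(pairs),
        "TPR": tp / (tp + fn) if tp + fn else 0.0,  # sensitivity / recall
        "FPR": fp / (fp + tn) if fp + tn else 0.0,  # 1 - specificity
        "F1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }

# Hypothetical predictions for 10 test patients (NOT study data).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Repeating this over many random splits and averaging the results gives a view of how stable the selected features are.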

Evaluation of the predictive value of selected features
Using stratified random sampling, we divided the dataset into 2 subsets: 80% as the "training set" and 20% as the "testing set". We used Receiver Operating Characteristic (ROC) curve analysis to calculate the Area Under the Curve (AUC), using the "ROC" package in R, to evaluate the prediction accuracy of our model.
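The AUC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The study used R for this step; Python is used here only for illustration, and the function name and the four example scores are hypothetical.

```python
def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of the severe class (NOT study data).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 0.75
```

This pairwise form is quadratic in the number of cases, which is acceptable for a test set of a few dozen patients.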

Difference in age and initial blood test results between Mild-Moderate and Severe-Critically Severe groups
According to the 5th edition of the China Guidelines for the Diagnosis and Treatment Plan of COVID-19 Infection by the National Health Commission, we divided patients into 2 groups: 67 cases in the Mild-Moderate group and 129 cases in the Severe-Critically Severe group (Table 1). Comparing the Mild-Moderate and Severe-Critically Severe groups, the basal features showed no difference in Gender (p = 0.26) (Fig 1A). The Severe-Critically Severe group was significantly older than the Mild-Moderate group (p<0.001) (Fig 1B). The initial blood routine test appears to be important for predicting the severity of COVID-19. The Severe-Critically Severe group had a higher WBC level (p = 0.02) (Fig 1C) and extremely low LYMC (p<0.001) and LYMPH (p<0.001) (Fig 1D and 1E). In contrast, NEUT (p<0.001) and NEU (p<0.001) in the Severe-Critically Severe group were extremely high (Fig 1F and 1G). As a result, the Severe-Critically Severe group had a higher NLR (p<0.001) (Fig 1H). These observations suggest that patients' age, together with WBC, LYMC, LYMPH, NEUT, NEU and NLR from the initial blood routine test, could be critical factors for predicting the severity of COVID-19.

Predictive value of age and initial blood test results for COVID-19 severity
By calculating the correlation between clinical characteristics and the severity of COVID-19, we found that Age (r = 0.73, p = 0.01), WBC (r = 0.24, p<0.01), NEUT (r = 0.34, p<0.01) and NLR (r = 0.31, p<0.01) were significantly positively correlated with the severity of COVID-19, while LYMC (r = -0.55, p = 0.01) was significantly negatively correlated with it (Fig 2A and 2B). These results indicated that Age and the initial blood routine test results WBC, LYMC, NEUT and NLR might be important for predicting the severity of COVID-19. The Wald test showed that only Age was a key indicator in predicting the severity of COVID-19 (Table 2). Using stratified random sampling (80% for the "training set" and 20% for the "testing set"), we generated the ROC curve to evaluate the predictive value. Using Age for prediction, we obtained an accuracy of 0.77 and an AUC of 0.92 (Fig 2C). Through dispersion analysis, we found that WBC, LYMC and LYMPH may be able to optimize prediction performance (Tables 3 and 4). The ROC curve showed that {Age, WBC, LYMC} had an accuracy of 0.82 and an AUC of 0.93 (Fig 2D). These results suggested that Age is a good predictor of COVID-19 severity, but its accuracy was only 0.77; adding WBC and LYMC from the initial blood routine test could raise the accuracy to 0.82.

Details of the MCDM algorithm to predict the severity of COVID-19
The MCDM algorithm was applied to further investigate whether other factors could improve the accuracy of prediction. The MCDM algorithm and logistic regression analysis gave consistent results: Age was a key indicator in predicting the severity of COVID-19. In addition, the MCDM algorithm verified that the {Age, WBC, LYMC} subset is one of the feature subsets with the highest prediction accuracy.

Predictive value of the features selected by the MCDM algorithm
Using stratified random sampling (80% for the "training set" and 20% for the "testing set"), we generated ROC curves to evaluate the predictive value of the subsets selected by the MCDM algorithm. Our analysis showed that {Age, WBC, LYMC, NEUT} (Fig 5A), {Age, NEUT, LYMC} (Fig 5B) and {Age, WBC, LYMC} (Fig 5C) all achieved an accuracy of 0.82 and an AUC of 0.93. The MCDM algorithm can steadily and accurately select Age and other features from initial blood routine test results to predict the severity of COVID-19.

Discussion
In this paper, we determined that age was the most critical indicator for predicting the severity of COVID-19. To improve the prediction accuracy, we proposed an MCDM algorithm, which combines TOPSIS and the NB classifier, to further select indicators from patients' initial blood routine test. By ranking features, the MCDM algorithm selected three subsets: {Age, WBC, LYMC, NEUT}, {Age, NEUT, LYMC} and {Age, WBC, LYMC}.
Previous studies have shown that elderly COVID-19 patients with multiple concomitant diseases tend to develop Multiple Organ Failure (MOFE), leading to high mortality in elderly patients infected by SARS-CoV-2 [7,10]. According to the latest meta-analysis of the elderly in the European community, the prevalence of frailty is around 15% among people aged 65 years and older [35], and the case fatality rate of patients over 85 years old is 1,000 times that of patients aged 5-17 years [36]. Our research indicated that age was the most important indicator for predicting the severity of COVID-19, with an accuracy of 0.77 and an AUC of 0.92. However, some elderly patients had a good prognosis, so prognostic evaluation and medical decision-making based on age alone might not be accurate enough.
We found that, in addition to age, WBC, LYMC and NEUT in the initial blood routine test results are also crucial for predicting the severity of COVID-19. Guo et al. [37] pointed out that the MuLBSTA score revealed that multi-lobar infiltrates, lymphocytes ≤0.8×10^9/L, bacterial infection, smoking status, hypertension, and age ≥60 years could help prognosticate outcomes in COVID-19 patients [38]. Elevated WBC/NEUT is an essential sign of bacterial infection. COVID-19 patients with bacterial co-infection may develop a severe form of the disease, complicating the clinical situation [39][40][41]. The control and elimination of viruses depend on humoral immunity. Viral infections usually lead to abnormal changes in lymphocyte subsets, which further impair immune system functionality. The decrease in LYMC is the most straightforward and intuitive indicator for predicting the humoral immune response, indicating that the patient's T cell function is defective [18,42,43]. The count of lymphocyte subsets (CD4+ and CD8+ T cells), especially CD8+ T cells, is directly proportional to the severity of COVID-19 [44,45]. Although logistic regression can determine the key indicator Age, and dispersion analysis can find a better subset {Age, WBC, LYMC}, it is difficult to determine the best subset because of the small sample size or multicollinearity. Previous studies used the MCDM algorithm to evaluate diagnostic tests [46] and to help doctors hasten COVID-19 treatment [47]. As far as we know, this is the first time the MCDM algorithm has been used to predict the severity of COVID-19. It first uses TOPSIS for feature ranking, and then combines the NB classifier for feature selection. Even if the sample size is small, the MCDM algorithm can select 3 effective subsets: {Age, WBC, LYMC}, {Age, WBC, LYMC, NEUT} and {Age, NEUT, LYMC}.
The selection process is visual and interpretable, helping doctors identify the features of progression of emerging infectious diseases early and make faster and better prevention and treatment plans. We used the ROC curve to evaluate the predictive value of the features selected by the MCDM algorithm. The results showed that the MCDM algorithm can not only find all effective subsets, but also predict stably and accurately.
Some recent studies point out that age [48][49][50][51], underlying diseases [17], systemic immune status [52], and blood test results can be used as key features to predict the severity of COVID-19. Although these features can improve the prediction accuracy (84% to 93%), the tests are time-consuming, expensive, and labor-intensive. Our algorithm can select features from blood test results to achieve a prediction accuracy of 82%. During the COVID-19 pandemic, this approach is more in line with clinical needs and is easy to promote and use in areas with different levels of medical care. Our research provides a possible and convenient strategy for the early prediction of COVID-19 severity. However, it has some limitations. First, there were only 196 cases, all from China, so the sample size was relatively small. We would like to collect more data and conduct multi-center evaluations. Second, the patient selection process may have been affected by referral bias because of the retrospective design. Third, the screening features are all derived from blood routine tests and are relatively simple. Other features, such as chest CT, absolute T cell count, etc., can be included during therapy to further evaluate and predict patients' prognosis.

Conclusion
Our research revealed that age and the indicators WBC/NEUT and LYMC, selected by the MCDM algorithm from initial blood routine test results, can effectively predict the severity of COVID-19. Advanced age, combined bacterial infection, and low immunity are the main factors leading to severe COVID-19. We treated feature selection as an MCDM problem so that the algorithm could provide a reference for clinical practice. Using the most common blood routine test, medical institutions could better decide on quarantine, hospital admission, or ICU assignment for COVID-19 patients. The MCDM algorithm can be used for small sample datasets, and its prediction is accurate and stable. This study not only provides a reference for establishing a rapid response mechanism in the early stage of emerging infectious disease outbreaks but also helps doctors understand the pathogenesis of new infectious diseases through key indicators.