A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study

Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors to create parsimonious scores, but such ‘black box’ variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability in variable importance across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions across models, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission after hospital discharge, ShapleyVIC selected 6 variables from 41 candidates to create a well-performing risk score, which had similar performance to a 16-variable model from machine-learning-based ranking. Our work contributes to the recent emphasis on interpretability of prediction models for high-stakes decision making, providing a disciplined solution to detailed assessment of variable importance and transparent development of parsimonious clinical risk scores.


Introduction
Scoring models are widely used in clinical settings to assess risks and guide decision making [1]. For example, the LACE index [2] for predicting unplanned readmission and early death after hospital discharge has been applied and validated in various clinical and public health settings since it was developed in 2010 [3][4][5]. The LACE index is appreciated for its simplicity, achieving moderate discrimination power [6] using only four basic components: inpatient length of stay (LOS), acute admission, comorbidity, and the number of emergency department (ED) visits in the past 6 months. There have been ongoing efforts to derive new readmission risk prediction models for improved performance and for specific subcohorts, which, despite the increasing availability of machine learning methods in recent years, are dominated by the traditional logistic regression approach [7,8].
Logistic regression generates transparent prediction models that are easily converted to risk scores, e.g., by rounding the regression coefficients [9], and model interpretability is mainly reflected in variable selection. Traditional variable selection approaches (e.g., stepwise selection and penalized likelihood) are straightforward, but there have been methodological discussions on their insufficiency in identifying important predictors and controlling model complexity [10,11]. Machine learning methods are well-performing alternatives, e.g., one study used decision trees and neural networks to heuristically develop an 11-variable logistic regression model from 76 candidate variables [12], and a recently developed automated risk score generator, AutoScore [13], uses the random forest (RF) to rank variables and has built well-performing scoring models using fewer than 10 variables in clinical applications [13,14]. Such machine learning methods rank variables by their importance to best-performing model(s) trained from data, but the 'black box' nature of most machine learning models (e.g., neural networks and tree ensembles such as the RF and XGBoost [15]) limits the interpretability of the variable selection steps. The interpretability issue may be alleviated via post-hoc explanation, for which the widely used Shapley additive explanations (SHAP) [16] is the current state-of-the-art. However, another general issue in current variable importance analyses has been largely ignored: restricting the assessment to best-performing models can lead to a biased perception of variable importance [17].
Since predictive performance is often not the only concern in practice, a recent study extended the investigation to a group of models that are 'good enough' to generate a variable importance cloud (VIC), which provides a comprehensive overview of how important (or unimportant) each variable could be to accurate prediction [17]. Our recent work combined VIC with the well-received Shapley values to devise a Shapley variable importance cloud (ShapleyVIC) [18], which easily integrates with current practice to provide less biased assessments of variable importance. In a recidivism risk prediction study, SHAP analysis of the optimal model reported high importance for race, whereas the low overall importance estimated by both ShapleyVIC and VIC suggested that the assessment based on the single model was likely an overclaim [18]. Moreover, ShapleyVIC reports uncertainty intervals for overall variable importance to support statistical assessments, which are not easily available from VIC [17,18]. In a mortality risk prediction study using logistic regression [18], ShapleyVIC handled strong collinearity among variables and effectively communicated findings through statistics and visualizations. These applications suggest ShapleyVIC as a suitable method for guiding variable selection when developing clinical scores from regression models.
In this paper, we explore the use of ShapleyVIC as an interpretable and robust variable selection approach for developing clinical scores. Specifically, we use an electronic health record dataset from the ED to predict death or unplanned readmission within 30 days after hospital discharge. We interpret the importance of candidate variables with respect to a group of 'good' logistic regression models using visualizations of ShapleyVIC values, combine information across models to generate an ensemble variable ranking, and use the estimated overall importance to filter out candidate variables that would likely add noise to the scoring model. To develop scoring models from ShapleyVIC-ranked candidate variables, we take advantage of the disciplined and modularized AutoScore framework, which grows a relatively simple, interpretable model from a ranked list of variables until the inclusion of any additional variable makes little improvement to predictive performance [13]. The current AutoScore framework ranks variables using the RF, a commonly used machine learning method in clinical applications that performs well and is easy to train. The inability of traditional variable selection approaches (i.e., stepwise variable selection and the penalized likelihood approach) to build sparse prediction models, and the good performance of the RF-based approach, have been previously demonstrated [13]. Hence, in this work we include RF-based scoring models as a baseline to evaluate the ShapleyVIC-based model and to demonstrate the benefits of ShapleyVIC that are not available from existing variable importance methods. We also compare these scoring models to the LACE index to show improvements over existing models.

Study cohort
This study aimed to develop a scoring model to predict unplanned readmission or death within 30 days after hospital discharge. The data were derived from a retrospective cohort study of cases who visited the ED of Singapore General Hospital and were subsequently admitted to the hospital. The full cohort consists of 411,137 cases, of which 388,576 were eligible. Among eligible cases, 63,938 (16.5%) developed the outcome of interest: 8,174 died and 55,764 were readmitted. As summarized in Table 1, cases with and without the outcome were significantly different (i.e., p-value < 0.05) in all 41 candidate variables except the use of ventilation. Specifically, compared to those without the outcome, cases with the outcome tended to be older and have shorter ED LOS, higher ED triage, longer inpatient LOS, and more ED visits in the past 6 months.

RF-generated models
We first describe scoring models generated using RF-based variable ranking, which ranked the 41 variables based on their importance to the RF trained on the training set and used this ranking in the variable ranking module of the AutoScore framework. Fig 1 presents the parsimony plot based on the RF variable ranking. The outcome can be predicted for a patient by applying a threshold to the derived risk score, for example, by using the optimal threshold defined as the point nearest to the upper-left corner of the receiver operating characteristic (ROC) curve (for which additional performance metrics are reported in Table 2), or by manually selecting a threshold that reaches a desirable level of sensitivity and/or specificity. Table 3 presents the scoring tables for the 6-variable Model 1A and the 16-variable Model 1B after fine-tuning the cut-off values.
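As a concrete illustration of the optimal threshold described above, the sketch below picks, from a set of hypothetical risk scores and outcomes (made-up numbers, not data from this study), the cut-off whose ROC point lies nearest the upper-left corner:

```python
import math

def roc_points(scores, labels):
    """(threshold, sensitivity, specificity) for each candidate threshold,
    classifying a case as positive when its score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((t, tp / pos, tn / neg))
    return points

def optimal_threshold(scores, labels):
    """Threshold whose ROC point is nearest the upper-left corner (0, 1)."""
    best = min(roc_points(scores, labels),
               key=lambda p: math.hypot(1 - p[1], 1 - p[2]))
    return best[0]

# Hypothetical risk scores and 30-day outcomes (1 = readmission/death)
scores = [2, 5, 7, 9, 12, 14, 15, 18]
labels = [0, 0, 1, 0, 1, 1, 0, 1]
print(optimal_threshold(scores, labels))  # 12
```

In practice this choice would be made on the validation set, and the threshold can then be adjusted manually toward a desired sensitivity or specificity.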

ShapleyVIC-generated model
Next, we describe the assessment of overall variable importance using ShapleyVIC values and its implications, the resulting ensemble ranking that accounts for variability in variable importance across models, the simplification of model building steps based on the significance of overall variable importance, and the final scoring model developed. Using the training set, we trained a logistic regression model on all 41 candidate predictors by minimizing model loss. Defining "nearly optimal" as model loss exceeding the minimum loss by no more than 5%, we randomly generated 350 models from the corresponding model space to empirically study variable importance across nearly optimal models. We used the first 3500 cases in the validation set to evaluate the ShapleyVIC values of these 350 models, which preliminary experiments found sufficient for the algorithm to converge and generate stable values. Unlike machine-learning-based methods (e.g., the RF described in the previous subsection) that focus on the relative importance of candidate variables for ranking purposes, ShapleyVIC quantifies the extent of variable importance for more in-depth inference, and communicates the findings through different forms of visualization to facilitate interpretation. Fig 2 visualizes the average ShapleyVIC values across the 350 models and the 95% prediction intervals (PIs), which estimate overall variable importance and its variability, respectively. Non-positive values indicate unimportance and are visualized using grey bars. Fig 3 presents the distribution of ShapleyVIC values from individual models to visualize the relationship between variable importance and model performance among the nearly optimal models.
While both the RF (see Fig 1) and the average ShapleyVIC values (see Fig 2) suggested the number of previous ED visits as the most important variable among the 41 candidates, the 95% PI of the average ShapleyVIC value (see Fig 2) further confirmed the significance of its overall importance, and the violin plot in Fig 3 revealed its much higher importance than other variables in all nearly optimal models studied. Metastatic cancer had the second highest average ShapleyVIC value among all variables, consistent with its considerable contribution to model performance observed from the RF-based variable ranking. Both metastatic cancer and age were important to all nearly optimal models studied (indicated by positive ShapleyVIC values), but the wider spread of ShapleyVIC values for age resulted in a wider 95% PI and hence higher uncertainty of its overall importance. Similarly, although admission type had similar overall importance (indicated by the average ShapleyVIC value) to neighboring variables (cancer and systolic blood pressure), the wider spread and long left tail for admission type resulted in a 95% PI containing zero, indicating non-significant overall importance for this variable. Among the 41 candidate variables, 20 variables had non-significant overall importance and were excluded from subsequent model building steps.
To rank variables using ShapleyVIC values, we ranked the 41 variables within each of the 350 nearly optimal models, averaged the ranks across models to generate an ensemble ranking for all variables, and subsequently focused on the 21 variables with significant overall importance. Unlike the parsimony plot from RF-based variable ranking (see Fig 1), the parsimony plot based on the ensemble ranking (see Fig 4) increased smoothly until the 6th variable entered the model, and the increments in model performance became small afterwards. The parsimony plot suggested a feasible model, Model 2, also with 6 variables: number of ED visits in past 6 months, metastatic cancer, age, sodium, renal disease, and ED triage. Four of these 6 variables were in common with the final models built using RF, but now metastatic cancer was selected automatically without the need for manual inspection. A detailed description of the steps and the corresponding R code for deriving Model 2 is provided online (see section "AutoScore-ShapleyVIC workflow" of the ShapleyVIC guidebook, available at https://github.com/nliulab/ShapleyVIC).
As a sensitivity analysis, we also generated a parsimony plot using the ensemble ranking of all 41 variables, visualized in Fig A in S1 Text. Although some variables with non-significant overall importance ranked higher than those with significant overall importance (e.g., admission type ranked 14th among all 41 variables, higher than ED LOS and the next seven variables in Fig 4), these non-significant variables made only small incremental changes to model performance and did not affect the development of Model 2. Hence, the exclusion of variables with non-significant overall importance simplified model building steps without negative impact on the performance of the final model. The scoring table of Model 2 (after fine-tuning) is shown in Table 4, and the AUC evaluated on the test set (0.756, 95% CI: 0.751-0.760) was comparable to that of the 16-variable Model 1B and significantly higher than that of Model 1A or the LACE index (see Table 2). When evaluated at the optimal threshold identified in the ROC analysis, Model 2 had higher accuracy than Models 1A, 1B, and the LACE index (see Table 2).

Discussion
Identifying variables important to the outcome is crucial for developing well-performing and interpretable prediction models [19], especially when developing clinical risk scores based on the simple structure of logistic regression models. The recently developed ShapleyVIC method [18] comprehensively assesses variable importance to accurate predictions, and quantifies the variability in importance to facilitate rigorous statistical inference. In this study, we propose ShapleyVIC as a variable selection tool for developing clinical risk scores, and illustrate its application for binary outcomes in conjunction with the modularized AutoScore framework [13] by developing a risk score for unplanned readmission or death within 30 days after hospital discharge. Based entirely on the score-generating logistic regression models, ShapleyVIC avoids bias in importance assessments due to the choice of variable ranking methods, and presents rich statistics on variable contributions through effective visualizations to support in-depth interpretation and guide variable selection. The ShapleyVIC-generated model used 6 of the 41 candidate variables, outperformed the widely used LACE index, and had comparable performance to a 16-variable model developed from RF-based variable ranking. Our work makes a novel contribution to the development of clinical risk scoring systems by providing an effective and robust variable selection method that is supported by statistical assessments of variable importance and is tailored to score-generating regression models.

Models that are 'good enough' can be as relevant in real-life clinical applications as the 'best-performing' model, especially with respect to practical considerations such as clinical meaningfulness and costs [17,18]. The variable importance cloud proposed by Dong and Rudin [17] was the first to formally extend variable importance to such a group of models with nearly optimal performance.
ShapleyVIC creates an ensemble of variable importance measures from a group of models using the state-of-the-art Shapley-based interpretable machine learning method, explicitly evaluates the variability of variable importance across models to support formal assessment of overall importance, and conveys such information through effective visualizations. In this work, we further propose a disciplined approach that takes into account such rich information on variable importance when ranking variables, and demonstrate its easy integration with the existing AutoScore framework [13] for automated development of scoring systems. In addition to using the average ShapleyVIC values across models as a measure of overall variable importance, we propose to use them as a screening tool to exclude candidate variables that are less useful.
Although machine learning methods (e.g., RF, XGBoost and neural networks) can be more efficient than traditional variable selection methods [12,13,20], they often assume complex non-linear and/or tree-based structures that differ drastically from the linear structure assumed by score-generating logistic regression. Our application highlights such misalignment, where an important contributor to the scoring model (metastatic cancer) was ranked only 16th among all 41 variables by the RF, and the 4th-ranked variable by the RF (ED boarding time) made minimal incremental contribution to the performance of the scoring model. Such issues are not limited to the RF but are generally shared by machine learning methods. For example, similar issues were observed for XGBoost-based variable ranking in an additional study (see Fig B in S1 Text for the parsimony plot). While it may be relevant to explore other machine learning methods (e.g., neural networks) for variable ranking, it can be challenging to train such models (e.g., to determine the number of layers and the number of nodes per layer in a neural network) to reasonably evaluate variable importance, making them practically less useful. Hence, we did not include additional variable ranking methods in our comparative evaluations.
The misalignment between the RF-based variable ranking and variable contributions to the scoring model is less problematic for the AutoScore framework, as it can be observed from the parsimony plot and handled by manual adjustment of the variable list, e.g., by excluding ED boarding time from Model 1A and adding metastatic cancer to build a new 6-variable model with comparable performance (AUC = 0.756, 95% CI: 0.751-0.760) to the 16-variable Model 1B. ShapleyVIC avoids such subjective assessments by providing an objective and data-driven ranking approach that is tailored to the score-generating logistic regression. Using an ensemble variable ranking across 350 well-performing logistic regression models, ShapleyVIC successfully assigned high ranks to all important contributors (including metastatic cancer), and excluded ED boarding time along with 19 other variables that had non-significant overall importance. Once the final list of variables is selected and the scoring table is derived using the AutoScore workflow, predicting the outcome for a new patient is simple: clinicians first compute a total score for the patient by summing the points corresponding to each variable value, and then predict the outcome if the total score exceeds a threshold (e.g., a data-driven threshold based on ROC analysis, or an adjusted threshold corresponding to a satisfactory sensitivity or specificity).
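To make the scoring step concrete, here is a minimal sketch of how a total score and threshold-based prediction could be computed; the variables, categories, point values and threshold below are purely illustrative, not those of Model 1A, 1B or 2:

```python
# Hypothetical scoring table in the style of an AutoScore-derived model;
# all point values and cut-offs are made up for illustration.
SCORING_TABLE = {
    "ed_visits_6m":      {"0": 0, "1": 4, ">=2": 9},
    "metastatic_cancer": {"no": 0, "yes": 8},
    "age":               {"<40": 0, "40-69": 3, ">=70": 6},
}
THRESHOLD = 10  # e.g., chosen from ROC analysis on the validation set

def total_score(patient):
    """Sum the points corresponding to each variable value."""
    return sum(SCORING_TABLE[var][cat] for var, cat in patient.items())

def predict_high_risk(patient):
    """Predict the outcome when the total score exceeds the threshold."""
    return total_score(patient) > THRESHOLD

patient = {"ed_visits_6m": ">=2", "metastatic_cancer": "no", "age": ">=70"}
print(total_score(patient), predict_high_risk(patient))  # 15 True
```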
Model 2, developed using ShapleyVIC-based variable ranking, used 6 variables, among which 5 (i.e., ED triage, age, number of ED visits in past 6 months, metastatic cancer and renal disease) are easily collected at ED registration or retrieved from the hospital database. The 6th variable, sodium level at ED, is less convenient to collect but may be imputed or assigned 0 points in the score if not available. Hence, Model 2 is feasible for clinical application, as is the 6-variable Model 1A for similar reasons, but the 16-variable Model 1B can be difficult to implement in clinical applications. Although the widely used LACE index has been recognized as a 4-variable scoring model, it is worth noting that one of its predictors, the Charlson comorbidity index (CCI), aggregates information across 17 medical conditions and therefore may not be convenient for quick risk stratification. Despite its simplicity, Model 2 outperformed Model 1A and the LACE index, and had comparable performance to Model 1B. These results demonstrate the ability of the integrated AutoScore-ShapleyVIC framework to develop well-performing and sparse risk scoring models.
Compared with the LACE index, Model 2 only requires information on 2 of the 17 comorbidities (i.e., metastatic cancer and renal disease), uses ED triage instead of inpatient LOS, and requires additional information on patient age and sodium level at ED. We find the use of ED triage rather than inpatient LOS in Model 2 reasonable in our setting, because the full cohort consists of acute admissions (i.e., inpatient admissions after ED visits), and ED triage is an important predictor for short-term clinical outcomes such as 30-day mortality or readmission [21]. The inclusion of age in Model 2 is not surprising, as old age is generally associated with elevated risk of adverse clinical outcomes, but the clinical basis for the use of sodium level may warrant further investigation. The focus on metastatic cancer and renal disease among the 17 comorbidities is consistent with a recent study of time to emergency readmission using a cohort similar to ours [22], which developed a 6-variable risk score using the number of ED visits in the previous year, age, cancer history, renal disease history and inpatient measures of creatinine and albumin. Three variables in Model 1B (i.e., number of ED visits in past 6 months, inpatient LOS and metastatic cancer) were also included in the LACE index. Model 1B did not include renal disease but included creatinine level at ED, which is reflective of renal function to some extent [23]. In addition to patient age, Model 1B also requires additional information on ED admission and several laboratory tests and vital signs at ED (including the sodium level used in Model 2). However, since the contribution of some of the variables in Model 1B to predictive performance is questionable (e.g., see the 4th and 7th-15th variables in Fig 1), we refrain from further discussing the clinical implications of these variables.
Readers should also note that the scoring models discussed above were developed for illustrative purposes and should not be used in clinical applications without further investigation and validation. Moreover, when developing risk prediction models for composite outcomes, e.g., in our study a composite of readmission or death, it is useful to evaluate the model separately for each event for improved interpretation of the prediction model and a more comprehensive evaluation of predictive performance.
Interestingly, additional analyses found that, when using the same variables as Model 1B, the RF had a slightly lower AUC (0.754, 95% CI: 0.749-0.758), and when using the same variables as Model 2, the RF had a much lower AUC (0.663, 95% CI: 0.659-0.668). This again highlights the drastic difference between RF and logistic regression models in their perception of variable importance, and shows that complex machine learning models do not necessarily outperform simple scoring models. However, compared to variable ranking based on a single RF, ShapleyVIC requires much longer run-time and larger memory space, and would benefit from high-performance computing and parallel computing (an option available in the ShapleyVIC R package [24]). Another limitation of ShapleyVIC is that the sampled set of models generated using our pragmatic sampling approach may not fully represent the entire set of nearly optimal models (formally referred to as the Rashomon set) [18], which remains a challenge in this area of research [1,17]. In addition, although collinearity was not present in our example (where the generalized variance inflation factors [VIF] for all 41 candidate variables were below 2), it is generally an unresolved difficulty in variable importance assessments [17,18,25-27]. Our pragmatic solution of using absolute SAGE values as model reliance measures for variables with VIF>2 worked well in a previous empirical experiment [18], and future work aims to devise more formal solutions. In the current study, we adopted the holdout method that randomly splits the full cohort into three datasets for model development and validation, which is common practice when working with large datasets. Future work will develop and validate an alternative workflow using cross-validation to accommodate smaller datasets.
Our work contributes to the recent emphasis on interpretability and transparency of prediction models for high-stakes decision making [28] by devising a robust and interpretable approach for developing clinical scores. Unlike existing statistical or machine learning methods for variable selection, which focus on ranking variables by their importance to the prediction, ShapleyVIC rigorously evaluates variable contributions to the scoring model and visualizes the findings to enable a robust and fully transparent variable ranking procedure. The ShapleyVIC-based variable ranking, which is tailored to the score-generating logistic regression, is easily applied within the AutoScore framework for generating sparse scoring models, with the flexibility to adjust for expert opinions and practical conventions. ShapleyVIC and AutoScore combine into an integrated approach for future interpretable machine learning research in clinical applications, which provides a disciplined solution for detailed assessment of variable importance and transparent development of clinical risk scores, and is easily implemented by executing a few commands from the respective R packages [24,29]. Although we have illustrated our approach for predicting early death or unplanned readmission, it is not limited to any specific clinical application. For example, AutoScore has recently been used to derive a risk score for ED triage that outperformed several existing indices [14], and to provide a well-performing risk score for survival after return of spontaneous circulation in out-of-hospital cardiac arrest that is easier to implement than existing alternatives [30]. Future work will explore the performance of AutoScore-ShapleyVIC in such diverse application scenarios, and validate its performance in risk score derivation against other score-generating systems and existing prediction models.
The current work describes the implementation of our proposed scoring system for binary outcomes, but the model-agnostic property of ShapleyVIC [18] and the recent extension of the AutoScore framework to survival outcomes [31] enable the use of our approach in a wider range of applications.

AutoScore framework for developing risk scoring models
The AutoScore framework provides an automated procedure for developing risk scoring models for binary outcomes. Interested readers may refer to the original paper [13] and the R package documentation [29] for a detailed description of the AutoScore methodology, and to a clinical application for an example use case [14]. In brief, AutoScore divides data into training, validation and test sets (typically consisting of 70%, 10% and 20% of the total sample size, respectively), and ranks all candidate variables based on their importance to an RF trained on the training set. Continuous variables are then automatically categorized (using percentiles or k-means clustering) [13] for preferable clinical interpretation and to account for potential non-linear relationships between predictors and outcomes. AutoScore creates scoring models based on logistic regression, where coefficients are scaled and rounded to non-negative integer values for convenient calculation. Following the RF-based variable ranking, AutoScore grows the scoring model by adding one variable at a time and inspects the resulting improvement in model performance (measured by the AUC on the validation set) using a parsimony plot. The final model is often determined by selecting the top few variables such that a reasonable AUC is reached and the inclusion of an additional variable only results in a small increment (e.g., <1%) to the AUC. After selecting the final list of variables, the cut-off values used to categorize continuous variables may be fine-tuned to suit clinical practice and conventions, and the finalized model is evaluated on the test set.
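The parsimony step can be sketched as follows. For brevity, this illustration replaces the logistic regression that AutoScore refits at each step with a crude sum-of-features scorer, and uses a rank-based (Mann-Whitney) AUC; the dataset and variable names are made up:

```python
def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC: probability that a random case
    receives a higher score than a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def parsimony(X, y, ranked_vars, min_gain=0.01):
    """Add variables in ranked order; stop once the AUC gain drops below
    min_gain. Simplification: the model score is the sum of the selected
    (assumed standardized) features, not a refitted logistic regression."""
    selected, last_auc = [], 0.5
    for var in ranked_vars:
        trial = selected + [var]
        scores = [sum(row[v] for v in trial) for row in X]
        a = auc(scores, y)
        if selected and a - last_auc < min_gain:
            break
        selected, last_auc = trial, a
    return selected, last_auc

# Tiny made-up dataset: x1 is informative, x2 is noise
X = [{"x1": 2.0, "x2": 0.1}, {"x1": 1.5, "x2": -0.2}, {"x1": 1.0, "x2": 0.3},
     {"x1": -1.0, "x2": 0.2}, {"x1": -1.5, "x2": -0.1}, {"x1": -2.0, "x2": 0.0}]
y = [1, 1, 1, 0, 0, 0]
selected, final_auc = parsimony(X, y, ["x1", "x2"])
print(selected, final_auc)  # ['x1'] 1.0
```

The real framework also plots the AUC at each step so that the final cut point can be chosen by inspection rather than a fixed `min_gain`.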

Shapley variable importance cloud (ShapleyVIC)
In practical prediction tasks, researchers are often willing to select a simpler and interpretable model even when its performance is marginally lower than a more complex model. Hence, conventional variable importance assessments based on a single best model can be limiting. ShapleyVIC [18] is suitable for such purposes and assesses the overall contribution of variables by studying a group of models with nearly optimal performance. In the case of developing scoring models for binary outcomes, nearly optimal models can be characterized by vectors of regression coefficients corresponding to logistic loss that exceeds the minimum logistic loss by no more than a selected threshold, for which 5% is a reasonable choice but may be adjusted based on practical needs and considerations.
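In symbols, writing L(β) for the logistic loss of a coefficient vector β, the set of nearly optimal models at tolerance ε (here ε = 0.05) can be expressed as:

```latex
\mathcal{R}(\epsilon) = \left\{ \beta \,:\, L(\beta) \le (1+\epsilon)\, L(\hat{\beta}) \right\},
\qquad \hat{\beta} = \operatorname*{arg\,min}_{\beta} L(\beta), \quad \epsilon = 0.05 .
```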
ShapleyVIC consists of three general steps. First, a suitable number (e.g., 350) of nearly optimal models (defined using the selected threshold for model loss) are sampled from a multivariate normal distribution centered at the regression coefficients of the optimal logistic regression model. Specifically, due to the difficulty of mathematically defining the boundary of this multivariate normal distribution corresponding to the desirable range of model loss, we proposed a rejection sampling approach to empirically explore the distribution, with two tunable parameters to control the range explored. The sampling steps and instructions on parameter tuning are described in detail in the original paper [18] and implemented in the ShapleyVIC package [24]. Next, ShapleyVIC measures the reliance of each nearly optimal model on variables using the Shapley additive global importance (SAGE) method [26], which measures variable impact on a model using Shapley values and is closely related to the commonly used SHAP method. As explained in the original paper on ShapleyVIC, we use absolute SAGE values instead of the unadjusted values to measure model reliance for variables involved in collinearity, identified by VIF>2 from the optimal logistic regression model. Finally, the ShapleyVIC values are pooled across all nearly optimal models using a random-effects meta-analysis approach, which quantifies the overall importance of variables and explicitly evaluates the variability of these measures across well-performing models (in terms of a 95% PI of variable importance for a new model) to support formal statistical inference. The average ShapleyVIC values and 95% PIs are visualized using a bar plot, and the variability of ShapleyVIC values across models is highlighted using a colored violin plot for additional insights.
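The first step (rejection sampling of nearly optimal models) can be sketched as follows. The synthetic data, the plain gradient-descent fit, and the fixed candidate covariance (0.2·I) are simplifying assumptions for illustration; the ShapleyVIC package tunes the sampling range as described in the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(beta, X, y):
    """Mean logistic (cross-entropy) loss of a coefficient vector."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Synthetic stand-in for the training set (intercept + 2 covariates)
n, d = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([-0.5, 1.0, -1.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

# Fit the optimal model by plain gradient descent
beta_hat = np.zeros(d)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    beta_hat -= 0.5 * X.T @ (p - y) / n
min_loss = logistic_loss(beta_hat, X, y)

# Rejection sampling: draw candidates around beta_hat and keep those whose
# loss is within 5% of the minimum ("nearly optimal")
accepted, draws = [], 0
while len(accepted) < 50 and draws < 100_000:
    draws += 1
    cand = rng.multivariate_normal(beta_hat, 0.2 * np.eye(d))
    if logistic_loss(cand, X, y) <= 1.05 * min_loss:
        accepted.append(cand)

print(len(accepted))
```

Each accepted coefficient vector is one "nearly optimal" model; in ShapleyVIC these are then passed to SAGE and the meta-analysis step.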

Ensemble variable ranking from ShapleyVIC
To apply ShapleyVIC in the practical development of scoring models based on logistic regression, we use an ensemble variable ranking approach from ShapleyVIC that is tailored to score-generating models and filter out variables that are not likely to make significant contributions. First, we describe our proposed ensemble variable ranking for d variables for a single nearly optimal model f. Let m̂r_j^s(f) denote the ShapleyVIC value of the j-th variable (j = 1, ..., d) for model f, which, as described above, is either the unadjusted SAGE value or the absolute SAGE value depending on the VIF of the variable. The variability of m̂r_j^s(f) can therefore be measured using the standard error of the SAGE value, denoted by σ_j(f), that is readily available from the SAGE algorithm. Assuming independent normal distributions for the SAGE values [18,26] of two variables (e.g., X_j and X_k) to model f, the difference between their ShapleyVIC values is also normally distributed: m̂r_j^s(f) − m̂r_k^s(f) ~ N(mr_j^s(f) − mr_k^s(f), σ_j^2(f) + σ_k^2(f)). We compare all possible pairs of variables, and subsequently rank the d variables to model f based on the number of times each variable has a significantly larger ShapleyVIC value than the other d−1 variables. Assigning rank 1 to the most important variable and using the same smallest integer value available for tied ranks, we rank the d variables for each model, and use the average rank of each variable across all nearly optimal models to generate an ensemble ranking.
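The pairwise-comparison ranking described above can be sketched as follows; the ShapleyVIC values and standard errors below are made-up numbers for two hypothetical models and three variables:

```python
import math

Z = 1.96  # critical value for a two-sided 5% significance level

def model_ranks(values, ses):
    """Ranks for one nearly optimal model: variable j 'beats' k when its
    ShapleyVIC value is significantly larger; ties share the smallest
    available integer rank (1 = most important)."""
    d = len(values)
    wins = [sum(1 for k in range(d) if k != j and
                (values[j] - values[k]) / math.sqrt(ses[j] ** 2 + ses[k] ** 2) > Z)
            for j in range(d)]
    return [1 + sum(1 for w in wins if w > wins[j]) for j in range(d)]

def ensemble_ranking(values_per_model, ses_per_model):
    """Average each variable's rank across all nearly optimal models."""
    ranks = [model_ranks(v, s) for v, s in zip(values_per_model, ses_per_model)]
    d = len(ranks[0])
    return [sum(r[j] for r in ranks) / len(ranks) for j in range(d)]

# Two hypothetical models, three variables (all numbers are made up)
values = [[0.30, 0.10, 0.02], [0.25, 0.12, 0.01]]
ses = [[0.02, 0.02, 0.02], [0.02, 0.02, 0.02]]
print(ensemble_ranking(values, ses))  # [1.0, 2.0, 3.0]
```

Variables can then be ordered by their average rank, with the filtering rule described next applied before model building.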
Instead of considering all d candidate variables in subsequent model building steps, we propose to filter out variables that are unlikely to be important contributors to accurate predictions based on their overall importance, i.e., variables for which the 95% PI of the average ShapleyVIC value contains or lies entirely below zero. In this work, we applied the proposed ensemble variable ranking within the AutoScore framework, using it in place of the RF when ranking variables.

Study design and data preparation
Our empirical experiment was based on a retrospective cohort study of patients who visited the ED of Singapore General Hospital from 2009 to 2017 and were subsequently admitted to the hospital. The outcome of interest was readmission or death within 30 days after discharge from the hospital. Since data were collected from the ED, all readmissions in the cohort were unplanned readmissions. Information on patient demographics, ED administration, inpatient admission, healthcare utilization, clinical tests, vital signs, and comorbidities was extracted from the hospital electronic health record system. A full list of variables is provided in Table 1. ED triage was evaluated using the Patient Acuity Category Scale [21], the current triage system used in Singapore. Admission type refers to the four types of wards in the hospital with different costs [32], which partially reflects the socio-economic status of patients. We excluded patients who were not Singapore residents, died in hospital, were aged below 21 years, or had missing information on ED boarding time or consultation waiting time. The final cohort was split into training (70%), validation (10%) and test (20%) sets, and missing values for vital signs or clinical tests were imputed using the median value in the training set. To compute the LACE index, we calculated the CCI as described in a previous work [33].

Ethics
This study was approved by Singapore Health Services' Centralized Institutional Review Board (CIRB 2021/2122), and a waiver of consent was granted for electronic health record data collection.