Abstract
We developed an inherently interpretable multilevel Bayesian framework for representing variation in regression coefficients that mimics the piecewise linearity of ReLU-activated deep neural networks. We used the framework to formulate a survival model that uses medical claims to predict hospital readmission and death, focusing on discharge placement and adjusting for confounding in estimating causal local average treatment effects. We trained the model on a 5% sample of Medicare beneficiaries from 2008 to 2011, based on their 2009–2011 inpatient episodes (approximately 1.2 million), and then tested the model on 2012 episodes (approximately 400 thousand). The model scored an out-of-sample AUROC of approximately 0.75 on predicting all-cause readmissions—defined using official Centers for Medicare and Medicaid Services (CMS) methodology—or death within 30 days of discharge, performing competitively against XGBoost and a Bayesian deep neural network and demonstrating that one need not sacrifice interpretability for accuracy. Crucially, as a regression model, it provides what blackboxes cannot—its exact gold-standard global interpretation, explicitly defining how the model performs its internal “reasoning” when mapping the input data features to predictions. In doing so, we identify relative risk factors and quantify the effect of discharge placement. We also show that the posthoc explainer SHAP provides explanations that are inconsistent with the ground-truth model reasoning that our model readily admits.
Citation: Chang TL, Xia H, Mahajan S, Mahajan R, Maisog J, Vattikuti S, et al. (2024) Interpretable (not just posthoc-explainable) medical claims modeling for discharge placement to reduce preventable all-cause readmissions or death. PLoS ONE 19(5): e0302871. https://doi.org/10.1371/journal.pone.0302871
Editor: Robert Jeenchen Chen, Stanford University School of Medicine, UNITED STATES
Received: October 6, 2023; Accepted: April 15, 2024; Published: May 9, 2024
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: This paper uses the CMS LDS, which the authors do not have permission to share. As per CMS: "The Centers for Medicare & Medicaid Services (CMS) makes Limited Data Set (LDS) files available to researchers as allowed by federal laws and regulations as well as CMS policy. LDS files contain beneficiary-level health information and are considered identifiable files, but they do not contain specific direct identifiers as defined in the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Questions about LDS files or the process for requesting LDS files can be sent to datauseagreement@cms.hhs.gov."
Funding: CCC is supported by the Intramural Research Program of the NIH, NIDDK. JCC is partially supported by the Intramural Research Program of the NIH, Clinical Center. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [58], which is supported by National Science Foundation grant number ACI-1548562 through allocation TG-DMS190042.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Preventable readmission after hospital discharge is costly. In 2011, adult 30-day all-cause hospital readmissions in the United States cost about $41.3 billion [1]. To improve outcomes, Medicare, through its Hospital Readmissions Reduction Program [2], penalizes providers for readmissions that occur within 30 days after discharge; these penalties have spurred interest in interventions surrounding transitions of care, including discharge planning services such as hand-offs to less-intensive healthcare institutions. Population-scale individual-level medical claims data provides rich longitudinal health context behind each hospital stay, making it possible to assess the efficacy of these interventions retroactively. This manuscript focuses on the problem of deciding discharge placement for individuals in order to prevent readmission or death.
Readmission models
A recent review [3] surveyed properties of readmission models in the literature. By and large, it found no model type that consistently predicts more accurately than others. Some studies have reported marginal improvements using either XGBoost or neural networks over interpretable methods [4–6], though not consistently [7–11], as seen in other problems [12]. Generally, the literature has focused on 30-day readmissions, though nuances in how readmission is defined complicate direct performance comparisons. Models based on medical claims data typically achieved an area under the receiver operating characteristic curve (AUROC) of approximately 0.7 for predicting their version of all-cause 30-day readmission.
Another factor that complicates the direct comparison of modeling efforts is differences in datasets—and hence the underlying patient populations and predictors. We are aware of two readmission studies performed on datasets identical to ours. MacKay et al. [13] developed XGBoost models for predicting a set of adverse events, reporting an AUROC of 0.73 for all-cause readmission prediction. Lahlou et al. [14] created an attention-based neural network for predicting admissions within 30 days of discharge and reported an AUROC of 0.81; however, they did not distinguish between transfers, planned admissions, and acute admissions in their outcome label, so they solve a different problem that is of less practical utility.
Yet, having a high AUROC is insufficient for making a model useful. Prerequisites for utility include the ability to understand predictions, assess validity, and derive actions. Model interpretability is a means to these ends. Most studies surveyed were aware of the importance of model interpretability, regardless of whether they produced interpretable models. Studies that claim interpretability for their blackbox solutions only offer “posthoc explainability,” a catch-all phrase for narratives generated in order to promote a sense that a model is interpretable when it is not.
Blackbox models
Methods such as Deep Learning (DL) and ensemble boosted trees (XGBoost, LightGBM, others) can model nonlinearities. When copious training data is available, these methods yield models that are more expressive than traditional generalized linear models. Most generally, blackbox models like DL and ensemble trees are nonlinear kernel machines (function interpolations) [15]. The convoluted nature of their interpolations makes these models uninterpretable. Massive investment exists in these models because of their predictive performance and low effort requirement. This existing investment, the challenge of creating truly interpretable models, and a myth that blackboxes perform better than interpretable models [16] incentivize the marketing of posthoc-xAI as an alternative to interpretable modeling. In finance, a similarly high-stakes domain, there has been wide resistance to blackbox modeling, formalized recently in model risk management guidelines published by The Office of the Comptroller of the Currency (OCC) [17]. We should also be wary of the use of these models in healthcare, where the risk to patients requires truly trustworthy solutions.
Interpretability
The goal of interpretable modeling is to produce predictions that an end-user can understand [16, 18], which is a prerequisite for making a prediction actionable. One necessary yet insufficient aspect of intrinsic model interpretability is feature attribution. Blackbox models do not admit feature attributions without the use of unreliable approximations. Conversely, feature attribution is exact in regression models, where each model coefficient has the unequivocal interpretation as the conditional expected change of the response corresponding to a given unit of change in the predictor, while fixing the other predictors. For this reason, even ignoring attributes beyond feature attribution, a significant disconnect already separates blackbox models from inherently interpretable models. Fig 1 is a representation of the spectrum of interpretability focusing on structured data problems in healthcare.
Models without intrinsic interpretability rely on unreliable approximation techniques for crafting explanations. More-interpretable models are more trustworthy and insightful.
While the definition of interpretability varies according to problem domain, all notions of interpretability require a basic ability to parse the computations behind a model’s predictions in terms of the input data features. We refer to this fundamental aspect of interpretability as “computational interpretability.” Computational interpretability is a necessary yet insufficient attribute for prediction comprehensibility. ReLU-activated neural networks, matrix composition methods like principal component analysis (PCA), and large multiple regression models are computationally interpretable, whereas Deep Learning (DL) models in general and ensemble tree methods like XGBoost are not.
However, knowing how a prediction is computed from individual features does not automatically make the prediction comprehensible—it may still be difficult to understand how a model behaves, as there is a limit to the amount of information that humans can process simultaneously [19]. Sudjianto et al. [20] note that additivity, sparsity, linearity, smoothness, monotonicity, and visualizability are some attributes of interpretable models that are also comprehensible. Each of these attributes can be enforced through suitable modeling constraints.
The highest bar for interpretability is for a model to be mechanistically meaningful. These models often leverage domain knowledge and are capable of providing deep and robust insights. They also often justify causal interpretations [21]. Even if one can truly understand a model, one often cannot act on it. To be directly actionable, a model also needs to adjust for biases in the data so that its prediction of the effects of interventions can be interpreted causally [22]. Yet, independent of causal validity, predictive model interpretability is still important because it allows practitioners to better understand the risks and biases of a given model.
Posthoc explainable-AI (xAI)
Posthoc xAI is a set of techniques to market uninterpretable blackbox models as interpretable (Fig 1). The most popular xAI methods (LIME [23] and SHAP [24, 25]) use approximations [26, 27] to provide narratives of feature importance within a prediction. Other methods such as attention [28] build an explanation mechanism as a module within a blackbox model in order to compute explanations more easily [29, 30]. Narratives, convincing as they might seem, are not necessarily true. In fact, researchers have shown [30–35] that these methods provide imprecise and unreliable explanations of models, and often disagree with each other. Aptly, Krishna et al. [36] coined this the “disagreement problem” with posthoc interpretability and conducted a survey of real-world data scientists, finding no consistent or principled method for handling these inconsistencies. As Rudin [16] notes, “an explanation model that is correct 90% of the time is wrong 10% of the time.” Despite marketing claims, xAI does not carry blackbox models across even a very minimal bar of requirements for interpretability. If an explanation is not true to one’s model, any sense that the model is comprehensible is based on faulty information.
Piecewise-linear modeling
Blackboxes provide clues on how to extend traditional linear models. DL is the application of artificial neural networks (ANNs) to prediction problems. ANNs consist of sequences (or more generally of graphs) of successive affine matrix arithmetic operations, sandwiched between activation functions. In general, these methods are blackboxes, with the exception of ReLU-activated neural networks (ReLU-nets for short). Examining ReLU-nets elucidates the nature of how DL captures nonlinearities. ReLU-nets use the activation function
$$\mathrm{ReLU}(x) = \max(0, x). \tag{1}$$
In these models, ReLU is independently applied to each matrix coordinate after each successive matrix operation. The output of the function is nonzero if and only if a linear combination of the elements computed by the prior layer is positive. Hence, ReLU defines an inequality over quantities within the model—applied to each coordinate within each layer, ReLU defines recursive sets of inequalities. These inequalities collectively segment the training data into disjoint regions. In sum, ReLU-nets are composed of regionally-disjoint generalized linear models—each of which is interpreted in the same manner as linear regression. Hence, ReLU-nets are computationally interpretable. The salient nonlinearity of these models is locality. To interpret a specific prediction given by these models, one needs to map the input to a particular linear submodel. Then, conditional on this mapping, a ReLU-net is locally a simple generalized multiple linear regression model. Observing this fact, Sudjianto et al. [37] provide a tool for exactly interpreting trained ReLU neural networks, by unwrapping the cascades of inequalities. In this manuscript we mimic this property of ReLU-nets within a well-controlled multilevel Bayesian regression framework in order to gain expressiveness while prioritizing interpretability.
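To make this locality concrete, the following sketch (an illustration using a hypothetical toy network, not part of our pipeline) shows that, restricted to inputs sharing a single ReLU activation pattern, a one-hidden-layer network is exactly an affine model; the activation pattern serves as the index of the local linear submodel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer ReLU network: y = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def relu_net(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_linear_model(x):
    # Conditional on which hidden units fire (the inequalities W1 @ x + b1 > 0),
    # the network is affine: y = (W2 D W1) x + (W2 D b1 + b2), with D masking dead units.
    D = np.diag((W1 @ x + b1 > 0).astype(float))
    return W2 @ D @ W1, W2 @ D @ b1 + b2

x = rng.normal(size=3)
slope, intercept = local_linear_model(x)
assert np.allclose(relu_net(x), slope @ x + intercept)  # exact local linearity
```

Interpreting a prediction then reduces to identifying the region (the activation pattern) and reading off the local slopes; this is the property our framework emulates with explicitly defined regions.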
Methods
We generalize the classic problem of readmission within 30 days of discharge to the likelihood of readmission at any arbitrary day after discharge. To this end, our objective is to characterize the statistics of the inter-inpatient wait time Tn. Additionally, we focus on identifying the effects of discharge placement, representing the choices symbolically as In, ranked in terms of health acuity: (0) discharge home, (1) discharge home with home health, (2) discharge to skilled nursing, (3) intermediate care/critical access, (4) long term care, (5) other less-acute inpatient. The issue that complicates the estimation of discharge placement effects is unobserved confounding—providers use the patient’s health status in order to decide placement. To resolve the treatment assignment bias, we model the joint outcomes
$$p(T_n, I_n \mid \mathbf{x}_n) = f(T_n \mid I_n, \mathbf{x}_n)\, g(I_n \mid \mathbf{x}_n), \tag{2}$$
where $\mathbf{x}_n$ is a covariate vector and we explicitly adjust for assignment bias. Note that we distinguish between the scalar-valued $I_n$, which corresponds to the list of interventions above, and the vector-valued $\mathbf{I}_n$, which we will explain later in this manuscript. For the sake of interpretability, we formulate f and g in Eq 2 as hierarchical multilevel Bayesian generalized linear regression models. However, to increase expressivity by introducing the type of nonlinearity seen in ReLU-nets, we allow all of the model parameters αn, βn, γn, νn, ξn to vary locally [38–40] across regions defined by xn, in ways that comport with domain knowledge.
Ethics statement
This study used the de-identified Centers for Medicare and Medicaid Services (CMS) Limited Dataset and was commissioned by the CMS for their inaugural AI competition. As a result, it is not subject to IRB approval. The CMS provided data for this study at the commencement of the competition in December 2019, and again in June 2020 for model evaluation. The provided dataset encompassed a 5% sample of adult Medicare beneficiaries in the USA.
Data preprocessing
The available dataset, the CMS Limited Dataset (CMS LDS), consists of a national 5% beneficiary sample of Medicare FFS Part A and B claims from 2008 to 2012. The 2008 claims had only quarter-level date specificity, so we used them solely to fill out the medical history for 2009 inpatient stays, assuming that each 2008 claim fell in the middle of its given quarter. We trained the readmission models on 2009–2011 admissions and evaluated the models on 2012 admissions.
After grouping claims into coherent episodes, based on date, provider, and patient overlap, we filtered for inpatient-specific episodes with certain characteristics to use as index admissions. We retained only episodes where the patient had a continuous prior year of Part A/B enrollment. We also excluded episodes from consideration as index episodes if they did not correspond to discharges to less-intensive care (excluding death and most inpatient-to-inpatient transfers). Additionally, we used the official CMS methodology for determining whether each episode is a planned admission, acute admission, or potentially planned admission [41]. For each episode we then computed the waiting time to either the next unplanned acute episode or death, or until censorship due to the end of the observation window. In the end, the training dataset consisted of approximately 1.2 million inpatient episodes, of which approximately 17% were followed by an unplanned acute inpatient episode and an additional 3% by death, within 30 days.
For each episode, we collected all billing codes, creating lists of concurrent procedure and diagnostic codes. Additionally, we collected the preceding four quarters of history for each episode, aggregating billing codes on a lagged quarterly basis.
Feature engineering.
Medical claims data consist of series of billing codes in several dialects (ICD9/10, HCPCS, RUG, HIPPS, etc.). We down-sampled diagnostic (Dx) and procedure (Tx) codes from their original dialects to multilevel Clinical Classification Software (CCS) codes [42]. CCS codes are clinically curated hierarchical categories that are more tractable for analysis and interpretation. Mapping to CCS drastically reduces redundancy in the vocabulary of the dataset and helps to separate the health-specific information in billing codes from noisy reimbursement-specific details.
We used AHRQ Healthcare Cost and Utilization Project (HCUP) databases in order to tag codes for comorbidities, chronic conditions, surgical flags, utilization flags, and procedure flags. Included within skilled nursing facility (SNF) and home health (HH) claim codes are also activities of daily living (ADL) assessments. We converted these codes to ADL scores, where higher scores correspond to lower functional ability. We also incorporated CMS’s risk adjustment methodology, hierarchical condition categories (HCC), as model predictors. The CMS LDS contains beneficiary county codes that we used to incorporate an urban-rural index and a socioeconomic scale as model features. Together with beneficiary race information and Medicaid state buy-in, these variables allowed for some measure of social determinants of health.
We encoded CCS and other code mappings into numerical vectors by counting the incidences of each permissible code. In the case of CCS, which is multilevel, we truncated codes at each of the first two levels and counted at each level. Altogether, the numerically encoded derived features constituted a vector of size p = 1072, which encompassed both concurrent episode codes and the past four quarters of history, where CCS was truncated to the first level for history.
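As a minimal sketch of this counting scheme (illustrative only: the code lists, the CCS crosswalk entries, and the helper names are hypothetical stand-ins for the actual AHRQ mapping files and our pipeline):

```python
from collections import Counter

# Hypothetical mapping from raw diagnosis codes to multilevel CCS codes
# (real mappings come from the AHRQ CCS crosswalk files).
CCS_MAP = {"41401": "7.2.3", "4280": "7.2.11", "25000": "3.2.1"}

def ccs_count_features(dx_codes, level):
    """Count occurrences of CCS codes truncated to the first `level` levels."""
    truncated = [".".join(CCS_MAP[c].split(".")[:level])
                 for c in dx_codes if c in CCS_MAP]
    return Counter(truncated)

episode_dx = ["41401", "4280", "4280", "99999"]  # unmapped codes are dropped
print(ccs_count_features(episode_dx, level=1))   # Counter({'7': 3})
print(ccs_count_features(episode_dx, level=2))   # Counter({'7.2': 3})
```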
Feature quantization.
To improve model interpretability, we made an effort to place all model parameters (log hazard ratios) on the same scale so that the magnitudes of all regression coefficients are directly comparable. In examining our derived data features, we found that they were predominantly sparse and heavy tailed. When fitting a logistic regression model to these data features, the model fit poorly to observations with large counts. These findings, and our desire to optimize model interpretability, led us to quantize all numerical variables so that the input variables into the model are entirely binary. To this end, we first computed the percentiles for each feature across the entire dataset. Then we re-coded each quantity into a series of binary variables corresponding to inequalities, where the cutoffs were determined by examining each variable at a set of quantiles and eliminating duplicate values. The usage of quantile-based coding has appeared in the literature [43, 44] as a nonlinear feature coding that has demonstrated benefits to model performance in certain problems. Generally, we retained only the quantized features in specifying the models except when otherwise specified. The total size of the feature vector, after dropping all original non-quantized numerical features and all constant features, expanded to p = 3143.
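A minimal sketch of the quantization step, assuming a pandas DataFrame of count features and an illustrative quantile grid (the specific quantiles and column names are placeholders, not our exact configuration):

```python
import numpy as np
import pandas as pd

def quantize_features(df, quantiles=(0.5, 0.75, 0.9, 0.99)):
    """Recode each numeric column into binary indicators of the form x >= cutoff,
    with cutoffs taken from the column's empirical quantiles (duplicates dropped)."""
    out = {}
    for col in df.columns:
        cutoffs = np.unique(df[col].quantile(list(quantiles)).values)
        for c in cutoffs:
            if c <= df[col].min():
                continue  # indicator would be constant; drop it
            out[f"{col}>={c:g}"] = (df[col] >= c).astype(int)
    return pd.DataFrame(out, index=df.index)

counts = pd.DataFrame({"dx_ccs_7": [0, 0, 1, 3, 12], "snf_adl": [0, 6, 6, 14, 20]})
print(quantize_features(counts))
```

Each resulting column is a binary indicator of crossing a named threshold, so every downstream regression coefficient is a log hazard ratio attached to an explicit inequality over an original feature.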
Survival modeling
For flexibly modeling the wait time distribution f, we use the piecewise exponential survival regression model (PEM) [45]. PEMs are defined by specifying the time-dependent hazard using a piecewise constant function, where the hazard changes across breakpoints that define disjoint time intervals. The probability density function for the PEM follows
$$f(t \mid \boldsymbol{\lambda}_n) = \lambda_{ni}\,\exp\left\{ -\lambda_{ni}\,(t - \tau_{i-1}) - \sum_{j<i} \lambda_{nj}\,(\tau_j - \tau_{j-1}) \right\}, \qquad \tau_{i-1} \le t < \tau_i, \tag{3}$$

where $\lambda_{ni}$ is the hazard for episode $n$ within interval $i$, the breakpoints $\tau_1 < \tau_2 < \cdots$ delimit the time intervals, and $\tau_0 = 0$.
In this manuscript we set the breakpoints between time intervals at 1 week, 4 weeks, and 9 weeks after discharge. For each episode n, we can estimate a wait time distribution by estimating the log-hazard within each time interval i,
$$\log \lambda_{ni} = \alpha_{ni} + \mathbf{x}_n^\top \boldsymbol{\beta}_{ni} + \mathbf{I}_n^\top \boldsymbol{\gamma}_{ni}, \tag{4}$$
where we allow the model parameters to vary across the data regionally, in a manner that emulates the type of nonlinearity seen in ReLU-nets. In Eq 4 we separate out the discharge placement effects (γn) from other effects (βn). Doing so makes it easier to structure the model for causally interpreting the discharge assignment effects. We incorporate domain knowledge by acuity-ordering the interventions, enforcing monotonicity of intervention effect by constraining the last five coefficients of ξn to non-positivity.
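For concreteness, a small sketch of how a piecewise-constant hazard with these breakpoints (1, 4, and 9 weeks) yields a survival probability; the per-interval log-hazards are hypothetical values, not fitted parameters:

```python
import numpy as np

# Breakpoints at 1, 4, and 9 weeks after discharge (in days),
# giving four intervals with piecewise-constant hazards.
BREAKS = np.array([0.0, 7.0, 28.0, 63.0, np.inf])

def survival(t, log_hazards):
    """S(t) = exp(-cumulative hazard) for a piecewise-constant hazard.
    `log_hazards` holds one log-hazard per interval (illustrative values)."""
    lam = np.exp(np.asarray(log_hazards))
    # time spent in each interval up to t
    exposure = np.clip(t - BREAKS[:-1], 0.0, BREAKS[1:] - BREAKS[:-1])
    return float(np.exp(-(lam * exposure).sum()))

# e.g. probability of remaining readmission-free (and alive) past 30 days
log_lam = [-4.0, -5.0, -5.5, -6.0]  # hypothetical per-interval log-hazards
print(survival(30.0, log_lam))
```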
We model the discharge placement process g using an ordinal logistic regression model, where
$$\Pr(I_n \ge k \mid \mathbf{x}_n) = \operatorname{logit}^{-1}\!\left( \mathbf{x}_n^\top \boldsymbol{\xi}_n - \nu_{nk} \right), \qquad k = 1, \ldots, 5, \tag{5}$$
where ξn are slopes corresponding to episode n and νn = [νn1, …, νn5] are intercepts under the constraints νnk < νn,k+1, ∀k, n. The predictions given by this model then feed back into the prediction of the wait time through a slope term for each element of the vector of predicted placement probabilities, $\mathbf{I}_n$. Utilizing the discharge placement probabilities as model covariates adjusts for the confounding bias caused by the selection process, in a manner analogous to incorporating the local treatment probability as a covariate [46]. Additionally, directly modeling the treatment effects within a multilevel model allows us to infer locally-varying treatment effects, partially pooled for stable inference in regions where the data is sparse [47, 48].
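The following sketch illustrates the two-stage structure of this adjustment using off-the-shelf estimators as simplified stand-ins (a multinomial logistic model in place of the Bayesian ordinal regression g, and a plain logistic outcome model in place of the piecewise exponential survival model f; the data and dimensions are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: binary covariates X, placement labels 0-5, 30-day outcome y
n, p = 5000, 20
X = rng.integers(0, 2, size=(n, p)).astype(float)
placement = rng.integers(0, 6, size=n)
y = rng.integers(0, 2, size=n)

# Stage 1: model the discharge-placement process g and extract the predicted
# placement probabilities for every episode.
g = LogisticRegression(max_iter=1000).fit(X, placement)
I_hat = g.predict_proba(X)          # shape (n, 6): the vector-valued covariate

# Stage 2: feed the placement probabilities back into the outcome model as
# additional covariates, analogous to including the local treatment probability [46].
X_aug = np.hstack([X, I_hat])
f = LogisticRegression(max_iter=1000).fit(X_aug, y)
```

In the full model, the placement probabilities instead enter the hazard alongside the other covariates through the γn terms.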
Parameter decomposition.
The piecewise linear nature of ReLU-nets, and the observation that neural networks produce learned data representations [49], suggest that an approach to mimicking their expressivity within regression models is to allow slopes (and intercepts) to vary across regions of the data. We do so by expressing each of these regionally-varying parameters using an additive decomposition.
First, for delineating regions in data space (corresponding to cohorts), we project portions of the input data to lower dimensions through unsupervised methods. In Chang et al. [50], the authors make a connection between sparse probabilistic matrix factorization and probabilistic autoencoders. We use this approach to develop a low-dimensional representation of the portions of the input covariate vector that pertain to the lagged quarterly history. Then, we compute the statistics of the learned representation in the training data and develop for each dimension a set of cut-offs to use for bucketization. This procedure puts each inpatient episode into a specific cohort, represented by a location within a multidimensional lattice, based on medical history. Specifically, we used a single cut-off (the median) for each of five dimensions (S2 Fig in S1 File), creating a set of 2⁵ = 32 groups based on history. By design, the rules governing the group assignment can be easily converted to a set of inequalities over sparse subsets of the original data features. Additionally, we included interactions between the history groups and other discrete attributes such as the major diagnostic category (MDC), the presence of a complication or comorbidity (CC) or a major complication or comorbidity (MCC), and race, to create high-dimensional discrete lattices where the cells define coarse interaction cohorts in the data. When partitioning data by a high-order interaction, a big data problem quickly becomes many small data problems—divide-and-conquer approaches can suffer from overfitting. To combat this issue, we developed a multiscale modeling approach where higher-order interactions are regularized by partially pooling their effects into related lower-order interactions. Specifically, given a multidimensional lattice that represents all cohorts for which the parameter will vary, we assign for each parameter a value within the lattice by decomposing the value into the form
$$\theta_{\boldsymbol{\kappa}} = \theta^{(0)} + \sum_{d=1}^{D} \theta^{(d)}_{\kappa_d} + \sum_{d_1 < d_2} \theta^{(d_1 d_2)}_{\kappa_{d_1} \kappa_{d_2}} + \cdots, \tag{6}$$
where κ = (κ1, κ2, …, κD) is a D dimensional multi-index. In practice, we truncate the maximum order of terms in this decomposition due to memory constraints. More details on the exact decompositions that we used for our model parameters can be found in the Supplemental Materials.
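A small numerical sketch of this additive decomposition over a lattice, assuming for illustration a D = 3 lattice and truncation at second-order interactions (the lattice dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lattice of cohorts, e.g. 32 history groups x 25 MDC classes x 3 CC/MCC levels
shape = (32, 25, 3)

# Component tensors of the truncated additive decomposition (Eq 6):
# a global term, one main-effect vector per lattice axis, and pairwise terms.
global_term = rng.normal()
main = [rng.normal(size=s) for s in shape]
pair_01 = rng.normal(size=(shape[0], shape[1]))
pair_02 = rng.normal(size=(shape[0], shape[2]))
pair_12 = rng.normal(size=(shape[1], shape[2]))

def theta(kappa):
    """Parameter value for the cohort at multi-index kappa = (k0, k1, k2)."""
    k0, k1, k2 = kappa
    return (global_term
            + main[0][k0] + main[1][k1] + main[2][k2]
            + pair_01[k0, k1] + pair_02[k0, k2] + pair_12[k1, k2])

# Broadcasting the same decomposition over the whole lattice at once:
full = (global_term
        + main[0][:, None, None] + main[1][None, :, None] + main[2][None, None, :]
        + pair_01[:, :, None] + pair_02[:, None, :] + pair_12[None, :, :])
assert np.isclose(full[4, 7, 1], theta((4, 7, 1)))
```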
Regularization.
By design, the parameter decomposition method inherently regularizes by partial pooling [47]. Additionally, we used weakly informative priors on the component tensors in these decompositions in order to encourage shrinkage at higher orders. For the regression coefficients, we utilized the horseshoe prior for local-global shrinkage [51–54]. Please see the Supplemental Materials for more details on the model specification.
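For reference, a sketch of the standard horseshoe construction written as explicit half-Cauchy scale mixtures in TensorFlow Probability (illustrative only; the exact prior specification we used, including the decomposition components it applies to, is given in the Supplemental Materials):

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# Horseshoe prior over p regression coefficients:
# beta_j ~ Normal(0, tau * lambda_j), lambda_j ~ HalfCauchy(0, 1), tau ~ HalfCauchy(0, tau0)
p, tau0 = 10, 0.1  # tau0 controls global shrinkage (illustrative value)

prior = tfd.JointDistributionNamed(dict(
    tau=tfd.HalfCauchy(loc=0., scale=tau0),
    lam=tfd.Sample(tfd.HalfCauchy(loc=0., scale=1.), sample_shape=p),
    beta=lambda tau, lam: tfd.Independent(
        tfd.Normal(loc=0., scale=tau * lam), reinterpreted_batch_ndims=1),
))

sample = prior.sample(seed=42)  # a prior draw: most beta_j near zero, occasional large values
```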
Relationship to other interpretable model types.
In the Introduction, we motivated our methodology by noting the piecewise-linear nature of ReLU-activated artificial neural networks. Those models offer a particular type of local linearity that we intended to mimic—the main improvements of our methodology are in moving beyond local linear interpretability by making the mapping from a data point to a local region easier to understand, and in explicitly exploiting the partial pooling properties of hierarchical mixed effects models.
Beyond ReLU-nets, other inherently interpretable models, where nonlinearity results from the locality of relationships, exist under the wide umbrella of varying coefficient regression models [39, 55], including hierarchical mixed effects models, tree-boosted varying coefficient regression models [56], and a broad class of ensemble-like models that can be formed by local procedures such as hierarchical stacking [57]. Additionally, inherently interpretable globally nonlinear models are also popular—many of these models are extensions to classical generalized additive models (GAMs) [58], including explainable boosting machines [59] and the ReLU-net-powered GAMI-Net [60].
Each of these methodologies offers computational interpretability (Fig 1). Through suitable constraints (human intervention), many of them can be tuned so that the resulting “reasoning” that a model performs is comprehensible and/or mechanistically meaningful.
Implementation.
We used Tensorflow Probability [61], developing a set of libraries for managing the parameter decompositions that is publicly available at github:mederrata/bayesianquilts.
We trained our model using minibatch mean-field stochastic ADVI, using batch sizes of 10⁴ and a parameter sample size of 8 for approximating the variational loss function. We utilized the Adam optimizer with a starting learning rate of 0.0015, embedded within a lookahead optimizer [62] for stability. After each epoch in which the mean batch loss did not decrease, we decayed the learning rate by 10%. Training was set to conclude if there was no improvement for 5 epochs, or if we reached 100 epochs, whichever came sooner. More information on the training is present in the Supplemental Materials. We used scikit-learn 1.1.1 for fitting two baseline logistic regression models (all features and restricted to only LACE features), and XGBoost 1.6.1 for fitting a reference blackbox model for comparison. We implemented a horseshoe Bayesian convolutional neural network with ReLU activation using TFP, where we used a single hidden layer of size one-fifth the input layer. For computing global SHAP values, we used regression-based KernelSHAP [63]. All computation was performed using the Pittsburgh Supercomputing Center’s Bridges2 resources. We utilized extreme memory (EM) nodes for preprocessing, and Bridges2-GPU-AI for training.
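A framework-agnostic sketch of the training-control logic described above (run_epoch and set_learning_rate are hypothetical hooks standing in for the actual minibatch ADVI loop):

```python
# A sketch of the training schedule: decay the learning rate by 10% on each
# stagnant epoch, stop after 5 stagnant epochs or 100 epochs total.
def train(run_epoch, set_learning_rate, lr=0.0015, max_epochs=100, patience=5):
    best_loss, stagnant = float("inf"), 0
    for epoch in range(max_epochs):
        set_learning_rate(lr)
        mean_batch_loss = run_epoch()
        if mean_batch_loss < best_loss:
            best_loss, stagnant = mean_batch_loss, 0
        else:
            stagnant += 1
            lr *= 0.9            # decay learning rate by 10% on a stagnant epoch
            if stagnant >= patience:
                break            # no improvement for 5 epochs
    return best_loss
```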
Results
Prediction accuracy
Table 1 shows the classification accuracy of our model in predicting readmissions or death within the first 30 days, benchmarked against predictions given by alternative models trained on the same dataset. Note that the LACE model is also trained using our dataset, restricted to LACE predictors [64]. The standard deviation in both the AUROC and AUPRC measures, as determined using bootstrap, was approximately 0.003. Non-linearly transforming our count features using quantization improved the accuracy of logistic regression to nearly match that of XGBoost on this dataset as measured by AUROC. Hence, we used quantization for features in both the Bayesian neural network (BNN) and piecewise exponential (PEM) models. The Bayesian neural network we developed utilizes sparsity-inducing horseshoe priors [65] on the weights and biases, which has been shown to improve model performance [52].
Quantization refers to the histogram-based bucketization of real-valued features. Area under the receiver operator curve (AUROC) and area under the precision-recall curve (AUPRC) computed on held-out 2012 inpatient episodes. Models trained on 2009–2011 episodes. Interpretability judged according to Fig 1.
Interpretation
In addition to being competitive with blackbox methods in terms of prediction accuracy, our model, as a generalized linear survival regression model, is easily interpretable. To be specific, our model is a generalized linear survival model where the coefficients vary. The value of each coefficient is the logarithm of a hazard ratio corresponding to the effect of a given feature, for a given data cohort, for a given time period. Log hazards greater than zero correspond to increased probability of event (readmission or death). Here, we provide select portions of the ground-truth global interpretation of the model, found by simply reading off the values of the regression coefficients. Please see the Supplemental Materials for a more-complete accounting of the model. This type of exposition is impossible with blackbox models without relying on unreliable approximations.
Time-dependent risk factors.
The model segments the data based on a low-dimensional representation and assigns for each predictor a cohort-level effect within each time interval. The cohorts are delineated by the recent history of medical services utilization and the properties of the present hospital admission. The effects within the model are hazard ratios, which describe the instantaneous relative risk associated with a predictor relative to a baseline. In most cases, the baseline refers to a typical or normal value of a variable. Cohort membership is itself also associated with a baseline risk—baseline log-hazards are presented in Fig 2 for the 12480 episode cohort types defined within the decomposition for the parameter vector αn in Eq 4. Larger values of the hazard imply higher probability of event (readmission or death). There exists variability in the hazards across cohorts (rows), though the most striking change is in time (columns). Generally, the hazard is greatest in the first week after discharge. This finding implies that patients are more vulnerable in the first week than afterwards—keeping a patient out of the hospital within the first week has the largest impact on the overall risk that they will die or be readmitted. For this reason, we will focus on understanding the model’s predictions of the first-week risk.
Larger log-hazards corresponds to more readmission risk. Personalized values of αn specific to each episode are found by mapping an episode into its cohort grouping.
The 40 most-impactful first-week factors are shown in Fig 3, where the parameters have been decomposed in order to control for racial biases. The most-predictive single feature was length of stay. Lengths of stay less than a full day had a relative log hazard ratio of 0.97 (95% CI: 0.96–0.98) (note that LOS < 1 day was the reference group and so is the converse of LOS ≥ 1 day shown in Fig 3). Having an acute primary diagnosis code, at least one inpatient stay in the previous quarter (lagQ0, within 90 days of admit), and discharge against medical advice were also strong predictors associated with increased risk of readmission or death. Patients who received skilled nursing care in the quarter preceding an inpatient episode and who had a Resource Utilization Group (RUG) Activities of Daily Living (ADL) score of at least 6.125 tended to have a lower risk of readmission in the first week than otherwise; however, the risk increased for quarter-lagged ADL scores of at least 13.5.
All predictors are binary and all parameters are additive log hazard ratios. Higher (red) corresponds to larger hazards and greater readmission risk.
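As a worked reading of one coefficient (our arithmetic applied to the value quoted above), a log hazard ratio of 0.97 corresponds to a hazard ratio of

$$e^{0.97} \approx 2.64,$$

that is, roughly a 2.6-fold increase in the instantaneous risk of readmission or death during the first week, relative to the reference group and holding the other predictors fixed.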
Discharge placement effects.
In Fig 4, we show the cohort-wise causally-adjusted mean local average treatment effects of discharge to each of the given care settings, as well as the local standard deviation in the effect. Focusing on the effect of discharging to skilled nursing care, the effects were greatest for episodes graded by DRG code as having either a complication or comorbidity (CC) or a major complication or comorbidity (MCC). In particular, CC/MCC episodes with a major diagnostic category (MDC) of 2 (Diseases and Disorders of the Eye), 14 (Pregnancy, Childbirth and Puerperium), and 22 (Burns) have the greatest response to discharge to skilled nursing.
First-week effects of discharge placement: Mean (left) and standard deviation (right) by cohort (row) of the five placement interventions assessed, in increasing order of implied acuity. Effect is difference in log-hazard relative to a normal discharge (home).
Posthoc-xAI (SHAP) misleads
Knowing in exact terms what the model is doing, we can now examine how posthoc-xAI claims the model works. In Fig 5 we display the most important model features as determined by the magnitude of global SHAP values in the prediction of readmission or death within the first 30 days. SHAP is computationally costly to approximate—the details of our SHAP computation are available in the Supplemental Materials. The four most-influential features according to the explainer are specific CCS classes of treatments and diagnoses in the recent quarterly history. Comparing these results to the parameter values of Fig 3, it is evident that the feature sets disagree. Nor do the values in Fig 5 align with parameter values for later weeks (see Supplemental Materials). This finding is unsurprising; SHAP has been consistently shown to fail to recover ground-truth interpretations in problems where the predictors are correlated [32, 66]. SHAP fundamentally does not answer the question of what a given model is doing in order to reach a prediction. Furthermore, feature importance is not grounded in any relevant units and also does not speak to relevant interactions that are captured in a model. We criticize SHAP because it is one of the most popular posthoc-xAI techniques; however, similar arguments hold for other techniques [16, 30, 67].
See Fig 4 and S2 Fig in S1 File, and the supplement for how the features actually are incorporated into the model. SHAP fails to identify the features a model is using whenever features are correlated.
Discussion
We presented a method for mimicking ReLU-nets within inherently interpretable multilevel Bayesian models. We applied this methodology to the prediction of hospital readmissions or death after discharge, and to the causal inference of the effects of discharge assignments.
Accuracy without blackboxes
We demonstrated that we were able to perform like blackboxes without sacrificing interpretability. We accomplished this feat through two classes of methods: First, our novel modeling framework allowed us fine-grained resolution in looking at the differential effects of the predictors in data subgroups. Additionally, it helped regularize the inference of local average treatment effects for choosing discharge placement. Second, we performed layers of feature engineering. The first layer was an extraction of medically-relevant information from the raw billing codes that gave us attributes such as chronic diseases, comorbidities, and ADL function. Then, we reduced noise in the raw coding by mapping to the clinically-relevant CCS system. These two steps were sufficient for our logistic regression model to match the performance of an XGBoost model in the literature based on the same dataset [13]. Finally, we performed feature quantization based on the per-feature statistics. Quantization led to a large performance increase in logistic regression and also in the neural network for a given model size. We took these lessons and used them in defining our interpretable survival model.
Posthoc xAI is inherently untrustworthy
Our model, being a regression model, is inherently interpretable. It admits an unequivocal ground-truth explanation that is found by simply examining its regression coefficients, all of which are log hazard ratios. Hence, it is a good test case for assessing the accuracy of posthoc explainers. We tested SHAP on our model; it failed to come close to the ground truth. This finding is consistent with other literature that has looked critically at SHAP and other xAI tools.
While posthoc-xAI does not make blackboxes interpretable, interpretability is not always necessary. Quantifying sample average treatment effects and making predictions do not require interpretable modeling [68], or even necessarily models at all [69]. Blackbox methods offer good performance with minimal thoughtfulness. For these reasons, blackbox methods remain inherently useful—so long as one does not whitewash them with false explainability.
Critical look at the applied literature
As evidenced by the explosion of applied machine learning manuscripts that claim explainability or interpretability, the research community has realized that it is important to have some understanding of what a machine learning model is doing. In many cases, the claim of interpretability is warranted, for instance in manuscripts that use methods such as logistic regression or even more-advanced methods such as explainable boosting machines [70, 71]. However, a large volume of studies (a small sampling, for example, [72–76]) are in actuality putting forward blackboxes as explainable by using SHAP or similar methodologies. Additionally, many manuscripts eschew the word explainable altogether and claim that their SHAP-endowed blackboxes are interpretable. It is our humble opinion that the machine learning field needs to set a higher bar for what should be labeled explainable, let alone interpretable. It is our hope that we have made a contribution to this particular conversation.
Limitations
Our modeling approach has downsides. Numerical stability when performing ADVI inference generally requires the use of double precision floating point—a limitation common to Bayesian inference (popular statistical packages such as Stan use double precision by default). This limitation is significant when looking to expand to larger models that encompass more data features. The methodology is based on piecewise linear modeling defined over a high-dimensional lattice of coarsened variables derived from the data. For this reason, models can become big quickly and can run into practical memory constraints, particularly in unison with the requirement for using double precision floating point.
The trend in machine learning has been to move towards more automation and the search for modeling architectures that do not require tuning beyond engineering of input predictors. Bucking this trend, our framework is designed with more intentionality in mind—it requires the practitioner to think about what types of broad interactions make sense for a given problem and what types of coarsening will give rise to a model that is understandable and useful in a real-world sense. Some may view this characteristic as a limitation.
Supporting information
S2 File. Supplementary review history from prior KDD submission.
https://doi.org/10.1371/journal.pone.0302871.s002
(PDF)
Acknowledgments
We thank the Innovation Center of the Centers for Medicare and Medicaid Services for providing access to the CMS Limited Dataset through DUA LDSS-2019-54177. We also thank Dr. Pei-Shu Ho for help in understanding Medicare billing data.
References
- 1.
Anika L. Hines, Marguerite L. Barrett, H. Joanna Jiang, and Claudia A. Steiner. Conditions With the Largest Number of Adult Hospital Readmissions by Payer, 2011. In Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. Agency for Healthcare Research and Quality (US), Rockville (MD), 2006.
- 2. McIlvennan Colleen K., Eapen Zubin J., and Allen Larry A. Hospital Readmissions Reduction Program. Circulation, 131(20):1796–1803, May 2015. ISSN 0009-7322. pmid:25986448
- 3. Huang Yinan, Talwar Ashna, Chatterjee Satabdi, and Aparasu Rajender R. Application of machine learning in predicting hospital readmissions: A scoping review of the literature. BMC Medical Research Methodology, 21(1):96, May 2021. ISSN 1471-2288. pmid:33952192
- 4. Jamei Mehdi, Nisnevich Aleksandr, Wetchler Everett, Sudat Sylvia, and Liu Eric. Predicting all-cause risk of 30-day hospital readmission using artificial neural networks. PLoS ONE, 12(7), July 2017. ISSN 1932-6203. pmid:28708848
- 5. Liu Wenshuo, Stansbury Cooper, Singh Karandeep, Ryan Andrew M., Sukul Devraj, Mahmoudi Elham, et al. Predicting 30-day hospital readmissions using artificial neural networks with medical code embedding. PLoS ONE, 15(4), April 2020. ISSN 1932-6203. pmid:32294087
- 6. Futoma Joseph, Morris Jonathan, and Lucas Joseph. A comparison of models for predicting early hospital readmissions. Journal of Biomedical Informatics, 56:229–238, August 2015. ISSN 1532-0464. pmid:26044081
- 7. Soliman Amira, Agvall Björn, Etminani Kobra, Hamed Omar, and Lingman Markus. The Price of Explainability in Machine Learning Models for 100-Day Readmission Prediction in Heart Failure: Retrospective, Comparative, Machine Learning Study. Journal of Medical Internet Research, 25(1):e46934, October 2023. pmid:37889530
- 8. Shameer Khader, Johnson Kipp W, Yahi Alexandre, Miotto Riccardo, Li Li, Ricks Doran, et al. Predictive modeling of hospital readmission rates using electronic medical record-wide machine learning: A case-study using mount sinai heart failure cohort. In Biocomputing 2017, pages 276–287. WORLD SCIENTIFIC, November 2016. ISBN 978-981-320-780-6.
- 9. Allam Ahmed, Nagy Mate, Thoma George, and Krauthammer Michael. Neural networks versus Logistic regression for 30 days all-cause readmission prediction. Scientific Reports, 9(1):9277, June 2019. ISSN 2045-2322. pmid:31243311
- 10. Min Xu, Yu Bin, and Wang Fei. Predictive Modeling of the Hospital Readmission Risk from Patients’ Claims Data Using Machine Learning: A Case Study on COPD. Scientific Reports, 9(1):2362, February 2019. ISSN 2045-2322. pmid:30787351
- 11. Larsson Anna, Berg Johanna, Gellerfors Mikael, and Wärnberg Martin Gerdin. The advanced machine learner XGBoost did not reduce prehospital trauma mistriage compared with logistic regression: A simulation study. BMC Medical Informatics and Decision Making, 21(1):192, June 2021. ISSN 1472-6947. pmid:34148560
- 12. Van Der Donckt Jeroen, Van Der Donckt Jonas, Deprost Emiel, Vandenbussche Nicolas, Rademaker Michael, Vandewiele Gilles, et al. Do not sleep on traditional machine learning: Simple and interpretable techniques are competitive to deep learning for sleep scoring. Biomedical Signal Processing and Control, 81:104429, March 2023. ISSN 1746-8094.
- 13. MacKay Emily J., Stubna Michael D., Chivers Corey, Draugelis Michael E., Hanson William J., Desai Nimesh D., et al. Application of machine learning approaches to administrative claims data to predict clinical outcomes in medical and surgical patient populations. PLOS ONE, 16(6):e0252585, June 2021. ISSN 1932-6203. pmid:34081720
- 14.
Chuhong Lahlou, Ancil Crayton, Caroline Trier, and Evan Willett. Explainable Health Risk Predictor with Transformer-based Medicare Claim Encoder. May 2021.
- 15.
Pedro Domingos. Every Model Learned by Gradient Descent Is Approximately a Kernel Machine. In arXiv:2012.00152 [Cs, Stat], November 2020.
- 16. Rudin Cynthia. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, May 2019. ISSN 2522-5839. pmid:35603010
- 17.
The Office of the Comptroller of the Currency (OCC). Comptroller’s Handbook: Model Risk Management. In Comptroller’s Handbook, Safety and Soundness. August 2021.
- 18.
Cynthia Rudin. Algorithms for interpretable machine learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’14, page 1519, New York, NY, USA, August 2014. Association for Computing Machinery. ISBN 978-1-4503-2956-9.
- 19. Miller G. A. The magical number seven plus or minus two: Some limits on our capacity for processing information. Psychological review, 1956. pmid:13310704
- 20.
Agus Sudjianto and Aijun Zhang. Designing Inherently Interpretable Machine Learning Models. arXiv, November 2021.
- 21.
Peters Jonas, Bauer Stefan, and Pfister Niklas. Causal Models for Dynamical Systems. In Probabilistic and Causal Inference: The Works of Judea Pearl, volume 36, pages 671–690. Association for Computing Machinery, New York, NY, USA, 1 edition, March 2022. ISBN 978-1-4503-9586-1.
- 22. Pearl Judea. Causal inference in statistics: An overview. Statistics Surveys, 3(none):96–146, January 2009. ISSN 1935-7516.
- 23.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16, pages 1135–1144, New York, NY, USA, August 2016. Association for Computing Machinery. ISBN 978-1-4503-4232-2.
- 24. Lipovetsky Stan and Conklin Michael. Analysis of regression in game theory approach. Applied Stochastic Models in Business and Industry, 17(4):319–330, 2001. ISSN 1526-4025.
- 25.
Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617, May 2016.
- 26.
Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- 27. Aas Kjersti, Jullum Martin, and Løland Anders. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 298:103502, September 2021. ISSN 0004-3702.
- 28. Niu Zhaoyang, Zhong Guoqiang, and Yu Hui. A review on the attention mechanism of deep learning. Neurocomputing, 452:48–62, September 2021. ISSN 0925-2312.
- 29.
Sarthak Jain and Byron C. Wallace. Attention is not Explanation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- 30.
Yilun Zhou, Serena Booth, Marco Tulio Ribeiro, and Julie Shah. Do Feature Attribution Methods Correctly Attribute Features? Proceedings of the AAAI Conference on Artificial Intelligence, 36(9):9623–9633, June 2022. ISSN 2374-3468.
- 31.
Thibault Laugel, Marie-Jeanne Lesot, Christophe Marsala, Xavier Renard, and Marcin Detyniecki. The dangers of post-hoc interpretability: Unjustified counterfactual explanations. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, pages 2801–2807, Macao, China, August 2019. AAAI Press. ISBN 978-0-9992411-4-1.
- 32.
I. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle A. Friedler. Problems with Shapley-value-based explanations as feature importance measures. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of ICML’20, pages 5491–5500. JMLR.org, July 2020.
- 33.
Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES’20, pages 180–186, New York, NY, USA, February 2020. Association for Computing Machinery. ISBN 978-1-4503-7110-0.
- 34.
David Alvarez-Melis and Tommi S. Jaakkola. On the Robustness of Interpretability Methods. arXiv:1806.08049 [cs, stat], June 2018.
- 35.
Aida Brankovic, David Cook, Jessica Rahman, Wenjie Huang, and Sankalp Khanna. Evaluation of Popular XAI Applied to Clinical Prediction Models: Can They be Trusted?, June 2023.
- 36.
Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, et al. The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective. February 2022.
- 37.
Agus Sudjianto, William Knauth, Rahul Singh, Zebin Yang, and Aijun Zhang. Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification. In arXiv:2011.04041 [Cs, Stat], November 2020.
- 38. Hastie Trevor and Tibshirani Robert. Varying-Coefficient Models. Journal of the Royal Statistical Society. Series B (Methodological), 55(4):757–796, 1993. ISSN 0035-9246.
- 39. Fan Jianqing and Zhang Wenyang. Statistical Methods with Varying Coefficient Models. Statistics and its interface, 1(1):179–195, 2008. ISSN 1938-7989. pmid:18978950
- 40. Li Feng, Li Yajie, and Feng Sanying. Estimation for Varying Coefficient Models with Hierarchical Structure. Mathematics, 9(2):132, January 2021. ISSN 2227-7390.
- 41.
CMS. 2015 Measure Information About the 30-Day All-Cause Hospital Readmission Measure, Calculated for the Value-Based Payment Modifier Program | Guidance Portal. https://www.hhs.gov/guidance/document/2015-measure-information-about-30-day-all-cause-hospital-readmission-measure-calculated, 2015.
- 42.
AHRQ. HCUP-US Tools & Software Page. https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp, 2022.
- 43.
Xinyu Hu, Tanmay Binaykiya, Eric Frank, and Olcay Cirit. DeeprETA: An ETA Post-processing System at Scale. arXiv, June 2022.
- 44.
Mohammad Saberian, Pablo Delgado, and Yves Raimond. Gradient Boosted Decision Tree Neural Network. arXiv, November 2019.
- 45. Friedman Michael. Piecewise Exponential Models for Survival Data with Covariates. The Annals of Statistics, 10(1):101–113, March 1982. ISSN 0090-5364, 2168-8966.
- 46.
Joseph Bafumi and Andrew Gelman. Fitting Multilevel Models When Predictors and Group Effects Correlate. September 2007.
- 47. Gelman Andrew. Multilevel (Hierarchical) Modeling: What It Can and Cannot Do. Technometrics, 48(3):432–435, August 2006. ISSN 0040-1706.
- 48.
Feller Avi and Gelman Andrew. Hierarchical Models for Causal Effects. In Emerging Trends in the Social and Behavioral Sciences, pages 1–16. John Wiley & Sons, Ltd, 2015. ISBN 978-1-118-90077-2.
- 49.
Goodfellow Ian, Bengio Yoshua, and Courville Aaron. Deep Learning. MIT Press, November 2016. ISBN 978-0-262-33737-3.
- 50.
Joshua C. Chang, Patrick Fletcher, Jungmin Han, Ted L. Chang, Shashaank Vattikuti, Bart Desmet, et al. Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization. In International Conference on Learning Representations, 2021.
- 51. Ghosh Soumya, Yao Jiayu, and Doshi-Velez Finale. Model Selection in Bayesian Neural Networks via Horseshoe Priors. Journal of Machine Learning Research, 20(182):1–46, 2019. ISSN 1533-7928.
- 52. Bhadra Anindya, Datta Jyotishka, Li Yunfan, and Polson Nicholas. Horseshoe Regularisation for Machine Learning in Complex and Deep Models. International Statistical Review, 88(2):302–320, 2020. ISSN 1751-5823.
- 53.
Polson Nicholas G. and Scott James G. Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction. Oxford University Press, October 2011. ISBN 978-0-19-173192-1.
- 54. van Erp Sara, Oberski Daniel L., and Mulder Joris. Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89:31–50, April 2019. ISSN 0022-2496.
- 55. Franco-Villoria Maria, Ventrucci Massimo, and Rue Håvard. A unified view on Bayesian varying coefficient models. Electronic Journal of Statistics, 13(2):5334–5359, January 2019. ISSN 1935-7524, 1935-7524.
- 56. Zhou Yichen and Hooker Giles. Decision tree boosted varying coefficient models. Data Mining and Knowledge Discovery, 36(6):2237–2271, November 2022. ISSN 1573-756X.
- 57.
Yuling Yao, Gregor Pirš, Aki Vehtari, and Andrew Gelman. Bayesian hierarchical stacking: Some models are (somewhere) useful. arXiv:2101.08954 [cs, stat], May 2021.
- 58. Hastie Trevor and Tibshirani Robert. Generalized Additive Models. Statistical Science, 1(3):297–310, August 1986. ISSN 0883-4237, 2168-8745.
- 59.
Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’13, pages 623–631, New York, NY, USA, August 2013. Association for Computing Machinery. ISBN 978-1-4503-2174-7.
- 60. Yang Zebin, Zhang Aijun, and Sudjianto Agus. GAMI-Net: An explainable neural network based on generalized additive models with structured interactions. Pattern Recognition, 120:108192, December 2021. ISSN 0031-3203.
- 61.
Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, et al. TensorFlow Distributions. arXiv:1711.10604 [cs, stat], November 2017.
- 62.
Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead Optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- 63.
Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estimation Using Linear Regression. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, pages 3457–3465. PMLR, March 2021.
- 64. Damery Sarah and Combes Gill. Evaluating the predictive strength of the LACE index in identifying patients at high risk of hospital readmission following an inpatient episode: A retrospective cohort study. BMJ Open, 7(7):e016921, July 2017. ISSN 2044-6055, 2044-6055. pmid:28710226
- 65. Carvalho Carlos M., Polson Nicholas G., and Scott James G. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, June 2010. ISSN 0006-3444.
- 66.
Sebastian Bordt, Michèle Finck, Eric Raidl, and Ulrike von Luxburg. Post-Hoc Explanations Fail to Achieve their Purpose in Adversarial Contexts. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 891–905, June 2022.
- 67. Babic Boris, Gerke Sara, Evgeniou Theodoros, and Glenn Cohen I. Beware explanations from AI in health care. Science, 373(6552):284–286, July 2021. pmid:34437144
- 68. Hill Jennifer L. Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1):217–240, January 2011. ISSN 1061-8600.
- 69. Ding Peng and Miratrix Luke W. Model-free causal inference of binary experimental data. Scandinavian Journal of Statistics, 46(1):200–214, 2019. ISSN 1467-9469.
- 70. Yagin Fatma Hilal, Yasar Seyma, Gormez Yasin, Yagin Burak, Pinar Abdulvahap, Alkhateeb Abedalrhman, et al. Explainable Artificial Intelligence Paves the Way in Precision Diagnostics and Biomarker Discovery for the Subclass of Diabetic Retinopathy in Type 2 Diabetics. Metabolites, 13(12):1204, December 2023a. ISSN 2218-1989. pmid:38132885
- 71. Yagin Fatma Hilal, Alkhateeb Abedalrhman, Raza Ali, Samee Nagwan Abdel, Mahmoud Noha F., Colak Cemil, et al. An Explainable Artificial Intelligence Model Proposed for the Prediction of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome and the Identification of Distinctive Metabolites. Diagnostics, 13(23):3495, November 2023b. ISSN 2075-4418. pmid:38066735
- 72. Alsinglawi Belal, Alshari Osama, Alorjani Mohammed, Mubin Omar, Alnajjar Fady, Novoa Mauricio, et al. An explainable machine learning framework for lung cancer hospital length of stay prediction. Scientific Reports, 12(1):607, January 2022. ISSN 2045-2322. pmid:35022512
- 73. Chan Ming-Cheng, Pai Kai-Chih, Su Shao-An, Wang Min-Shian, Wu Chieh-Liang, and Chao Wen-Cheng. Explainable machine learning to predict long-term mortality in critically ill ventilated patients: A retrospective study in central Taiwan. BMC Medical Informatics and Decision Making, 22(1):75, March 2022. ISSN 1472-6947. pmid:35337303
- 74. Chmiel F. P., Burns D. K., Azor M., Borca F., Boniface M. J., Zlatev Z. D., et al. Using explainable machine learning to identify patients at risk of reattendance at discharge from emergency departments. Scientific Reports, 11(1):21513, November 2021. ISSN 2045-2322. pmid:34728706
- 75.
Alex G. C. de Sá, Daniel Gould, Anna Fedyukova, Mitchell Nicholas, Lucy Dockrell, Calvin Fletcher, et al. Explainable Machine Learning for ICU Readmission Prediction, September 2023.
- 76. Duan Minjie, Shu Tingting, Zhao Binyi, Xiang Tianyu, Wang Jinkui, Huang Haodong, et al. Explainable machine learning models for predicting 30-day readmission in pediatric pulmonary hypertension: A multicenter, retrospective study. Frontiers in Cardiovascular Medicine, 9, 2022. ISSN 2297-055X. pmid:35958416