Step-by-step causal analysis of EHRs to ground decision-making

  • Matthieu Doutreligne ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    m.doutreligne@has-sante.fr

    Affiliations Soda Team, Inria Saclay, Palaiseau, France, Mission Data, Haute Autorité de Santé, Saint-Denis, France

  • Tristan Struja,

    Roles Methodology, Writing – original draft, Writing – review & editing

    Affiliations Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America, Medical University Clinic, Division of Endocrinology, Diabetes & Metabolism, Kantonsspital Aarau, Aarau, Switzerland

  • Judith Abecassis,

    Roles Writing – review & editing

    Affiliation Soda Team, Inria Saclay, Palaiseau, France

  • Claire Morgand,

    Roles Writing – review & editing

    Affiliation Agence Régionale de Santé Ile-de-France, Saint-Denis, France

  • Leo Anthony Celi,

    Roles Writing – review & editing

    Affiliations Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America, Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America

  • Gaël Varoquaux

    Roles Supervision, Writing – review & editing

    Affiliation Soda Team, Inria Saclay, Palaiseau, France

Abstract

Causal inference enables machine learning methods to estimate treatment effects of medical interventions from electronic health records (EHRs). The prevalence of such observational data and the difficulty for randomized controlled trials (RCTs) to cover all population/treatment relationships make these methods increasingly attractive for studying causal effects. However, researchers should be wary of many pitfalls. We propose and illustrate a framework for causal inference, estimating the effect of albumin on mortality in sepsis using an intensive care database (MIMIC-IV) and comparing various sensitivity analyses to results from RCTs taken as the gold standard. The first step is study design, using the target trial concept and the PICOT framework: Population (patients with sepsis), Intervention (combination of crystalloids and albumin for fluid resuscitation), Control (crystalloids only), Outcome (28-day mortality), Time (intervention start within 24h of admission). We show that overly long treatment-initiation windows induce immortal time bias. The second step is selection of the confounding variables based on expert knowledge. Progressively adding confounders enables recovering the RCT results from observational data. As the third step, we assess the influence of multiple models with varying assumptions, showing that a doubly robust estimator (AIPW) with random forests is the most reliable estimator. Results show that these steps are all important for valid causal estimates. A valid causal model can then be used to individualize decision making: subgroup analyses showed that treatment efficacy of albumin was better for patients older than 60 years, males, and patients with septic shock. Without causal thinking, machine learning is not enough for optimal clinical decisions at the individual patient level. Our step-by-step analytic framework helps avoid many pitfalls of applying machine learning to EHR data, building models that avoid shortcuts and extract the best decision-making evidence.

Author summary

Rich routine-care data, such as EHRs or claims, are useful to individualize decision making using machine learning; but guiding interventions requires causal inference. Unlike in an RCT, interventions in routine data do not easily enable an apples-to-apples measure of the effect of an intervention, leading to many analytical pitfalls, particularly in time-varying data. We study these in a tutorial spirit, making the code and data openly available. We give five analytical steps for data-driven individualized interventions: Step 1) Study design, where common pitfalls are selection bias, with information unequally collected across treatment and control patients, and immortal time bias, where the inclusion-defining event interacts with the intervention time. Step 2) Identification of the causal assumptions and categorization of confounders. Step 3) Estimation of the causal effect of interest by correct aggregation of confounders and selection of an appropriate statistical model. Step 4) Assessment of the analysis' robustness to assumptions, and finally Step 5) Individualization of the treatment decision, by exploring treatment heterogeneity, e.g., across subgroups. Studying the choice of fluid resuscitation in sepsis, we show that common mistakes in steps 1, 2, and 3 equally compromise causal validity.

Introduction: Data-driven decisions require causal inference

Informing a care option extends beyond merely predicting the occurrence of an event; it involves estimating the effect of the corresponding intervention. Routine-care data come naturally to mind to guide routine decisions, but they require care to estimate treatment effects as they are observational, unlike randomized controlled trials (RCTs). This context calls for causal inference statistical frameworks. But merely applying these tools to the data does not suffice to ensure the validity of the inferences; numerous considerations must be carefully addressed.

Individualized medicine and machine learning challenges

Machine learning plays a pivotal role in individualized medicine [1–5]. It has demonstrated superior performance over traditional rule-based clinical scores in predicting a patient’s readmission risk, mortality, or future comorbidities using Electronic Health Records (EHRs) [1–5]. However, mounting evidence suggests that machine-learning models can inadvertently perpetuate and exacerbate biases present in the data [6], including gender or racial biases [7,8], and the marginalization of under-served populations [9]. These biases are typically encoded by capturing shortcuts –stereotypical or distorted features in the data [10–12]. For instance, numerous machine learning algorithms rely on post-treatment information [13–16], exemplified by a diagnostic model for skin cancer that depends on surgical marks [11]. For Intensive Care Unit data, the focus of our study, such information markedly improves mortality prediction (S1 Fig), but cannot inform decisions.

The importance of causal reasoning in data-driven decision-making [17]

While conventional machine learning relies on retrospective data to generate predictions of future effects [18], truly informing decision-making needs a comparison of potential outcomes with and without the intervention. This involves estimating a causal effect, mirroring the methodology employed in RCTs [17]. However, RCTs encounter challenges such as selection biases [19,20], difficulties in recruiting diverse populations, and limited sample sizes for exploring treatment heterogeneity across subgroups. Routinely collected data present a unique opportunity to assess real-life benefit-risk trade-offs associated with a decision [21], with reduced sampling bias and sufficient data to capture heterogeneity [22]. Nevertheless, estimating causal effects from such data is challenging due to the confounding of the intervention by indication. Therefore, dedicated statistical techniques are imperative to emulate a "target trial" [23] from observational data.

Multiple perspectives on evidence-based decision making

Across different fields, the existing literature has emphasized different challenges associated with estimating treatment effects from observational data. While epidemiologic studies underscore the importance of the target trial approach [24–28], their emphasis primarily lies on biases that arise from temporal effects [23,29–33] or confounding variables [34–36], with relatively less attention to issues arising from estimator selection. Recent replications of RCTs using observational data did not explore the impact of modern machine learning methods on the robustness of the results [27,37].

In contrast, the machine learning and causal inference literature predominantly studies estimators [38–42]: propensity score matching [43], inverse probability weighting [44], outcome models [45], doubly robust methods [39], or deep learning based models [46]. This literature may be opaque for some due to intricate mathematical details and unverifiable assumptions. Guidelines seldom address time-related biases or covariate aggregation, which frequently emerge in datasets with temporal dependencies [29,31]. Recently, the machine learning community shifted its focus from EHR data to simulated data, which may not capture the complexities of real-world data [47–50].

In this work, we bring together epidemiological concepts and principles from the statistical and machine learning literature. We adopt an empirical perspective to answer the practical needs of applied researchers. A study of choices spread across the analysis –study design, consideration of confounders, and selection of estimators (see Section Step-by-step framework for robust decision-making from EHR data)– highlights their equal importance in ensuring the validity of results. To illustrate and compare biases, we investigate the impact of albumin on sepsis mortality using data from a publicly available intensive care database, MIMIC-IV [51] (section Application: evidence from MIMIC-IV on which resuscitation fluid to use).

The primary focus of the main section is on accessibility, with technical details expanded in the appendices.

Step-by-step framework for robust decision-making from EHR data

Fig 1. Step-by-step analytic framework. The complete inference pipeline confronts the analyst with many choices, some guided by domain knowledge, others by data insights. Making those choices explicit is necessary to ensure robustness and reproducibility.

https://doi.org/10.1371/journal.pdig.0000721.g001

Whether or not using machine learning, many pitfalls threaten an analysis’ value for decision-making. To avoid these pitfalls, we outline a simple step-by-step analytic framework, illustrated in Fig 1, for retrospective case-control studies. We frame the medical question as a target trial [52] to match the design to an RCT, which gives the gold-standard average effect. Then we probe for heterogeneity –predictions on sub-groups– going beyond what RCTs can achieve.

Step 1: study design – Frame the question to avoid biases

Table 1. PICO(T) components help to clearly define the medical question of interest.

https://doi.org/10.1371/journal.pone.0313772.t001

Grounding decisions on evidence requires well-framed questions, defined by their PICO(T) components: Population, Intervention, Control, and Outcome [53,54] and, in the case of EHR or claims data, an additional Time component. These components are necessary to concord with a (hypothetical) target randomized clinical trial [37,55] – Table 1. A selection flowchart such as in S5 Fig makes inclusion and exclusion choices for PICOT explicit.

Without care in defining these PICO(T) components, non-causal associations between treatment and outcomes can easily be introduced into an analysis [56]. The time-varying nature of EHRs calls for systematically checking the Population and Time components by addressing two commonly encountered types of bias.

Selection Bias.

In EHRs, outcomes and treatments are often not directly available and need to be inferred from indirect events. These signals can be missing not at random, sometimes correlated with the treatment allocation [57]. For example, billing codes can be strongly associated with case severity and cost. Consider comparing the effectiveness of fluid resuscitation with albumin to crystalloids. As albumin is more costly, treated patients are more likely to have a sepsis billing code. On the contrary, among patients treated with crystalloids, only the most severe cases will have a billing code. Naively comparing patients would overestimate the effect of albumin.
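
To make this mechanism concrete, here is a minimal simulation with made-up numbers (not derived from MIMIC-IV): albumin has no true effect on mortality, yet restricting the comparison to patients carrying a sepsis billing code makes it look beneficial.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: albumin has no true effect on mortality,
# but billing codes are driven by cost (albumin) and by case severity.
albumin = rng.binomial(1, 0.2, n)
severity = rng.uniform(0, 1, n)
mortality = rng.binomial(1, 0.1 + 0.3 * severity)            # independent of treatment
p_code = np.where(albumin == 1, 0.9, 0.3 + 0.6 * severity)   # code ~ cost and severity
has_sepsis_code = rng.binomial(1, p_code)

# Naive comparison restricted to patients with a sepsis billing code
coded = has_sepsis_code == 1
naive_diff = (mortality[coded & (albumin == 1)].mean()
              - mortality[coded & (albumin == 0)].mean())
print(f"True effect: 0.000, naive estimate among coded patients: {naive_diff:+.3f}")
```

The naive difference is negative: the coded crystalloid patients are selectively the most severe ones, so albumin spuriously appears protective.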

Immortal time bias.

Improper alignment of the inclusion-defining event and the intervention time is a major source of bias in time-varying data [23,29,32]. Immortal time bias (illustrated in S2 Fig) occurs when the follow-up period, i.e. cohort entry, starts before the intervention, e.g. the prescription of a second-line treatment. In this case, the treated group will be biased towards patients still alive at the time of assignment, thus overestimating the effect size. Other frequent temporal biases are lead time bias [30,31], right censorship [23], and attrition bias [33]. Good practices include explicitly stating the cohort inclusion event [58, Chapter 10: Defining Cohorts] and defining an appropriate grace period between the starting time and the intervention assignment [23]. At this step, a population timeline can help.

Step 2: identification – List necessary information to answer the causal question

The identification step builds a causal model to answer the research question. Indeed, the analysis must compensate for differences between treated and non-treated patients that are not due to the intervention ([59, chapter 1], [26, chapter 1]).

Causal assumptions.

Valid causal inference requires assumptions [60] –detailed in S1 Appendix. The analyst should thus review the plausibility of the following: 1) Unconfoundedness: after adjusting for the confounders as ascertained by domain expert insight, treatment allocation should be random; 2) Overlap –also called positivity– the distribution of confounding variables overlaps between the treated and controls –this is the only assumption testable from data [44]–; 3) No interference between units and consistency in the treatment, a reasonable assumption in most clinical questions.
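
Since overlap is the only assumption that can be checked empirically, a quick diagnostic is to inspect the distribution of estimated propensity scores in both groups. A minimal sketch follows; the file name and confounder columns are illustrative placeholders, not the study’s actual code.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

df = pd.read_csv("cohort.csv")                       # one row per patient (hypothetical file)
confounders = ["age", "weight", "sofa", "lactate_last"]
X, t = df[confounders], df["albumin"]

# Out-of-sample propensity scores, so that overfitting does not flatter the overlap
ps = cross_val_predict(LogisticRegression(max_iter=1000), X, t,
                       cv=5, method="predict_proba")[:, 1]

print("Propensity range, treated: ", ps[t == 1].min(), ps[t == 1].max())
print("Propensity range, controls:", ps[t == 0].min(), ps[t == 0].max())
# Scores piling up near 0 or 1, or non-overlapping ranges, flag a positivity violation.
```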

Categorizing covariates.

Potential predictors –covariates– should be categorized depending on their causal relations with the intervention and the outcome (illustrated in S4 Fig): confounders are common causes of the intervention and the outcome; colliders are caused by both the intervention and the outcome; instrumental variables are a cause of the intervention but not the outcome; mediators are caused by the intervention and are a cause of the outcome. Finally, effect modifiers interact with the treatment, and thus modulate the treatment effect in subpopulations [61].

To capture a valid causal effect, the analysis should only include confounders, and possible treatment-effect modifiers to study the resulting heterogeneity. Regressing the outcome on instrumental and post-treatment variables (colliders and mediators) will lead to biased causal estimates [35]. Drawing causal Directed Acyclic Graphs (DAGs) [34], e.g. with a web tool such as DAGitty [62], helps capture the relevant variables and define a suitable estimand or effect measure.

Unconfoundedness –inclusion of all confounders in the analysis– is a strong assumption that can be difficult to ascertain in practical applications. In these cases, sensitivity analyses for omitted variable bias make it possible to test the robustness of the results to missing confounders [63], proximal inference can leverage proxies of unobserved confounders [64], and the presence of a natural experiment or RCT might identify the desired causal effect without unconfoundedness [65, Chapters 5, 9].

The estimand is the final causal quantity estimated from the data. Depending on the question, different estimands are better suited to contrast the two potential outcomes E[Y(1)] and E[Y(0)] [66,67]. For continuous outcomes, risk difference is a natural estimand, while for binary outcomes (e.g. events) the choice of estimand depends on the scale. Whereas the risk difference is very informative at the population level, e.g. for medico-economic decision-making, the risk ratio and the hazard ratio are more informative at the level of sub-groups or individuals [67].
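
For reference, with Y(1) and Y(0) the potential outcomes under treatment and control, the usual estimands contrasting them can be written as follows (a textbook formulation, not specific to this study): risk difference, risk ratio, and odds ratio.

```latex
\tau_{\mathrm{RD}} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)],
\qquad
\tau_{\mathrm{RR}} = \frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]},
\qquad
\tau_{\mathrm{OR}} = \frac{\mathbb{E}[Y(1)] \,/\, \bigl(1 - \mathbb{E}[Y(1)]\bigr)}{\mathbb{E}[Y(0)] \,/\, \bigl(1 - \mathbb{E}[Y(0)]\bigr)}.
```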

Causal estimators.

A given estimand can be estimated through different methods. One can model the outcome with regression models, also known as the G-formula [45], and use it as a predictive counterfactual model for all possible treatments of a given patient. Alternatively, one can model the propensity of being treated, for use in matching or Inverse Propensity Weighting (IPW) [44]. Finally, doubly robust methods model both the outcome and the treatment, benefiting from the convergence of both models [65]. There is a variety of doubly robust models, reviewed in S2 Appendix.
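
The following sketch contrasts the three families on a patient-level table, with numpy arrays X, t, y holding confounders, treatment, and outcome. It is a minimal illustration with generic scikit-learn models, not the exact implementation used later in the paper, and cross-fitting is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """Inverse Propensity Weighting: model the treatment only."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    return np.mean(t * y / ps - (1 - t) * y / (1 - ps))

def gformula_ate(X, t, y):
    """G-formula (T-learner): one outcome model per treatment arm."""
    m1 = RandomForestRegressor().fit(X[t == 1], y[t == 1])
    m0 = RandomForestRegressor().fit(X[t == 0], y[t == 0])
    return np.mean(m1.predict(X) - m0.predict(X))

def aipw_ate(X, t, y):
    """Doubly robust (AIPW): combine the outcome and treatment models."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    m1 = RandomForestRegressor().fit(X[t == 1], y[t == 1]).predict(X)
    m0 = RandomForestRegressor().fit(X[t == 0], y[t == 0]).predict(X)
    pseudo = m1 - m0 + t * (y - m1) / ps - (1 - t) * (y - m0) / (1 - ps)
    return np.mean(pseudo)
```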

Step 3: Statistical estimation – Compute the causal effect of interest

Confounder aggregation.

Confounders captured via measures collected over multiple time points must be aggregated at the patient level. Simple forms of aggregation include taking the first or last value before a time point, or an aggregate such as the mean or median over time. More elaborate choices may rely on hourly aggregations of, e.g., vital signs, providing more detailed information on the disease course. These may reduce confounding bias between rapidly deteriorating and stable patients, but also increase the number of confounders, making estimation more challenging [68]. The increase in variance appears either as arbitrarily small propensity scores for treatment models or as hazardous extrapolation from one group to the other for outcome models. If multiple choices appear reasonable, one should compare them in a vibration analysis (see Step 4: Vibration analysis – Assess the robustness of the hypotheses).
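
As an illustration of the simplest aggregations, the following pandas sketch builds one row per patient from a long table of time-stamped measurements; the table layout and column names are hypothetical, not the MIMIC-IV schema.

```python
import pandas as pd

# Long table: one measurement per row (patient_id, hours_from_admission, feature, value)
events = pd.read_csv("measurements.csv")
pre = (events[events["hours_from_admission"] <= 24]     # pre-treatment window only
       .sort_values("hours_from_admission"))

# One row per patient: first, last, and mean value of each time-varying feature
agg = (pre.groupby(["patient_id", "feature"])["value"]
          .agg(["first", "last", "mean"])
          .unstack("feature"))
agg.columns = [f"{feat}_{stat}" for stat, feat in agg.columns]   # flatten to e.g. "lactate_last"
```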

Beyond tabular data, unstructured clinical text may capture confounding or prognostic information [69,70], which can be added to the causal model [28]. However, a high-dimensional confounder space, such as text, may break the positivity assumption, just as hourly aggregation choices for measurements do.

Missing covariate values might also be a source of confounding. Some statistical estimators (such as forests) can directly incorporate them as supplementary covariates. Others, such as linear models, require imputations. S3 Appendix details general sanity checks for imputation strategies when using statistical estimators.
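
As a concrete sketch of these two routes in scikit-learn (a reasonable default we assume here, not a prescription from S3 Appendix):

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tree ensembles can consume NaNs directly
forest_like = HistGradientBoostingClassifier()

# Linear models need explicit imputation; missingness indicators keep the
# "was this measured?" signal, which may itself carry information
linear = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(max_iter=1000),
)
```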

Statistical estimation models of outcome and treatment.

The causal estimators use models of the outcome or the treatment –called nuisances. There is currently no clear best practice for choosing the corresponding statistical model [48,71]. The trade-off lies between simple models, risking misspecification of the nuisance parameters, and flexible models, risking overfitting the data at small sample sizes. Stacking models of different complexity, as in a super learner, is a good way to navigate this trade-off [72,73].
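
A closely related construction is readily available in scikit-learn; the sketch below stacks a linear and a forest learner for a nuisance model. It is a simplified stand-in for the super learner of [72], which additionally constrains how the meta-learner combines the base models.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Base learners of different complexity, combined on out-of-fold predictions
nuisance_model = StackingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=300)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
```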

Step 4: Vibration analysis – Assess the robustness of the hypotheses

Some choices in the pipeline may not be clear cut. Several options should then be explored, to derive conceptual error bars going beyond a single statistical model. When quantifying the bias from unobserved confounders, this process is sometimes called sensitivity analysis [7476]. Following [77], we use the term vibration analysis to describe the sensitivity of the results to all analytic choices.

Step 5: Treatment heterogeneity – Compute treatment effects on subpopulations

Once the causal design and corresponding estimators are established, they can be used to explore the variation of treatment effects among subgroups. A causally-grounded model can be used to predict the effect of the treatment from all the covariates –confounders and effect modifiers– the Conditional Average Treatment Effect (CATE) [78]. Practically, CATEs can be estimated by regressing an individual’s predictions given by the causal estimator against the sources of heterogeneity (details in S7 Appendix).
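
A minimal sketch of this recipe is given below: a DR-learner-style approach that regresses AIPW pseudo-outcomes on an interpretable set of effect modifiers. Array layout, modifier columns, and model choices are illustrative assumptions, not the paper’s exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression, RidgeCV

def cate_by_modifiers(X, t, y, modifier_cols):
    """DR-learner-style CATE: regress AIPW pseudo-outcomes on effect modifiers."""
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    m1 = RandomForestRegressor().fit(X[t == 1], y[t == 1]).predict(X)
    m0 = RandomForestRegressor().fit(X[t == 0], y[t == 0]).predict(X)
    pseudo = m1 - m0 + t * (y - m1) / ps - (1 - t) * (y - m0) / (1 - ps)
    final = RidgeCV().fit(X[:, modifier_cols], pseudo)   # interpretable final model
    return final.predict(X[:, modifier_cols])

# Usage: cate = cate_by_modifiers(X, t, y, modifier_cols=[age_col, sex_col, shock_col])
# then compare the cate distributions across subgroups, as in Fig 4.
```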

Application: evidence from MIMIC-IV on which resuscitation fluid to use

We now use the above framework to extract evidence-based decision rules for resuscitation. Ensuring optimal organ perfusion in patients with septic shock requires resuscitation by reestablishing circulatory volume with intravenous fluids. While crystalloids are readily available, inexpensive and safe, a large fraction of the administered volume is not retained in the vasculature. Colloids offer the theoretical benefit of retaining more volume, but might be more costly and have adverse effects [79]. Meta-analyses from multiple pivotal RCTs found no effect of adding albumin to crystalloids [80,81] on 28-day and 90-day mortality. Given this previous evidence, we thus expect no average effect of albumin on mortality in sepsis patients. However, studies –RCT [82] and observational [83]– have found that septic-shock patients do benefit from albumin.

Emulated trial: Effect of albumin in combination with crystalloids compared to crystalloids alone on 28-day mortality in patients with sepsis.

Multiple published RCTs can validate the analysis pipeline before investigating sub-population effects for individualized decisions. Using MIMIC-IV [51], we compare the magnitude of biases introduced by reasonable choices in the different analytical steps recalled in Fig 2.

MIMIC-IV is a publicly available database that contains information from real ICU stays of patients admitted to one tertiary academic medical center, Beth Israel Deaconess Medical Center (BIDMC), in Boston, United States between 2008 and 2019. The data in MIMIC-IV has been previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and BIDMC (2001-P-001699/14) both approved the use of the database for research. The database contains comprehensive information from ICU stays including vital signs, laboratory measurements, medications, and mortality data up to one year after discharge.

Fig 2. Application of the step-by-step framework on which resuscitation fluid to use.

https://doi.org/10.1371/journal.pdig.0000721.g002

Step 1: Study design – effect of crystalloids on mortality in sepsis.

Population: Patients with sepsis in an ICU stay according to the sepsis-3 definition. Other inclusion criteria: sufficient follow-up of at least 24 hours, and age over 18 years. S5 Fig details the selection flowchart and S1 Table the population characteristics.

Intervention: Treatment with a combination of crystalloids and albumin during the first 24 hours of an ICU stay.

Control: Treatment with crystalloids only in the first 24 hours of an ICU stay.

Outcome: 28-day mortality.

Time: Follow-up begins after the first administration of crystalloids. Thus, we potentially introduce a small immortal time bias by allowing a time gap between the start of follow-up and the start of the albumin treatment –see the full timeline in S3 Fig. Because we are only considering the first 24 hours of an ICU stay, we hypothesize that this gap is too small to affect our results. We test this hypothesis in the vibration analysis step.

In MIMIC-IV, these inclusion criteria yield 18,121 patients, of which 3,559 were treated with a combination of crystalloids and albumin. While glycopeptide antibiotic therapy was similar between both groups (51.8% crystalloids vs 51.5% crystalloids + albumin), aminoglycosides, carbapenems, and beta-lactams were more frequent in the crystalloids-only group (2.0% vs. 0.7%, 4.3% vs. 2.6%, and 35.5% vs. 13.8%, respectively). The crystalloids-only group was more frequently admitted as an emergency (57.3% vs. 30.7%). Vasopressors (80.2% vs 41.7%) and ventilation (96.8% vs 87.0%) were more prevalent in the treated population, underscoring the overall higher severity of patients receiving albumin (mean SOFA at admission 6.9 vs. 5.7). Table 2 details patient characteristics.

Table 2. Characteristics of the trial population measured on the first 24 hours of ICU stay.

https://doi.org/10.1371/journal.pone.0313772.t002

Step 2: Identification – listing confounders

For confounder selection, we use the causal DAG shown in S6 Fig. Gray confounders are not controlled for, since they are not available in the data. However, the resulting confounding biases are captured by proxies such as comorbidity scores (SOFA or SAPS II) or other variables (e.g. race, gender, age, weight). S1 Table details confounder summary statistics for treated and controls.

Causal estimators.

We implemented multiple estimation strategies: Inverse Propensity Weighting (IPW), outcome modeling (G-formula) with a T-learner, Augmented Inverse Propensity Weighting (AIPW), and Double Machine Learning (DML). We used the Python package dowhy [41] for the IPW implementation and EconML [84] for all other estimation strategies. Confidence intervals were estimated by bootstrap (50 repetitions). S2 Appendix and S4 Appendix detail the estimators and the available Python implementations. S3 Appendix details statistical considerations that we identified as important but missing in these packages, namely the lack of cross-fitting estimators, bad practices for imputation, and the lack of closed-form confidence intervals.
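
As an illustration of the bootstrap step, a generic sketch is given below; it is not the exact code of the study, and `ate_fn` stands for any estimator taking confounder, treatment, and outcome arrays (such as the sketches in the framework section).

```python
import numpy as np

def bootstrap_ci(ate_fn, X, t, y, n_boot=50, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an ATE estimator."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                  # resample patients with replacement
        estimates.append(ate_fn(X[idx], t[idx], y[idx]))
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper
```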

Step 3: Statistical estimation

Confounder aggregation.

We tested multiple aggregations: the last value before the start of the follow-up period, the first observed value, and both the first and last values as separate features. Missing values were median-imputed for numerical features; categorical variables were one-hot encoded (thus discarding missing values).

Outcome and treatment estimators.

To model the outcome and treatment, we used two common but different estimators: random forests and ridge logistic regression implemented with scikit-learn [85]. We chose the hyperparameters with a random search procedure (S5 Appendix). While logistic regression handles predictors in a linear fashion, random forests bring the benefit of modeling non-linear relations.
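
For illustration, a random search over forest hyperparameters can be set up as follows; the search space below is a hypothetical example, not the grid reported in S5 Appendix.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random search over a few forest hyperparameters, scored on probability calibration
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={"n_estimators": randint(100, 500),
                         "max_depth": randint(3, 20),
                         "min_samples_leaf": randint(1, 50)},
    n_iter=20, cv=5, scoring="neg_brier_score", random_state=0,
)
# search.fit(X, t)   # same recipe for the outcome model, with y as the target
```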

Step 4: Vibration analysis – Comparing sources of systematic errors

Study design flaw – Illustration of immortal time bias.

To illustrate the risk of immortal time bias, we vary the eligibility period for treatment or control, using time windows shorter or longer than 24 hours. As explained in Step 1: study design – Frame the question to avoid biases, a longer eligibility period means that patients are more likely to be treated if they survived up to the intervention, and hence the study is biased towards overestimating the beneficial effect of the intervention. Fig 3a) shows that longer eligibility periods make albumin appear markedly more efficient (detailed results with causal forest and other choices of aggregation in S8 Fig).

Confounder choice flaw.

We consider other choices of confounding variables (detailed in S6 Appendix). Fig 3b) shows that a less thorough choice, neglecting the administered drugs, makes little to no difference. Major errors, such as omitting the biological measurements or using only socio-demographic variables, lead to sizeable bias. This is consistent with the literature highlighting the importance of a clinically valid DAG [34].

Estimation choices flaw – Confounder aggregation, causal and nuisance estimators.

Fig 3c) shows varying confidence intervals (CIs) depending on the method. Doubly robust methods provide the narrowest CIs, whereas the outcome-regression methods have the largest. The estimates of the forest models are closer to the consensus across prior studies (no effect) than those of the logistic regression, indicating a better fit of non-linear relationships. We only report the first-and-last pre-treatment feature aggregation strategy, since detailed analysis showed little difference for other aggregations (S7 Fig for complete results, and S9 Fig for a detailed study of aggregation choices). Both methodological studies [86] and consistency with published RCTs suggest preferring doubly robust approaches.

Step 5: Treatment heterogeneity – Which treatment for a sub-population?

With an adequate choice of study design, confounding variables, and causal estimator, the average treatment effect matches published findings well: pooling evidence from high-quality RCTs, no effect of albumin in severe sepsis was demonstrated for either 28-day mortality (odds ratio (OR) 0.93, 95% CI 0.80–1.08) or 90-day mortality (OR 0.88, 95% CI 0.76–1.01) [80]. Having validated the analytical pipeline, we can use it to inform decision-making. We explore heterogeneity along four binary patient characteristics, displayed in Fig 4. We find that albumin is beneficial for patients with septic shock, consistent with one RCT [82]. It is also beneficial for older patients (age ≥ 60) and males. S7 Appendix details the heterogeneity analysis.

Fig 4. Subgroup distributions of Individual Treatment effects: better treatment efficacy for patients older than 60 years, septic shock, and to a lower extent males. The final estimator is ridge regression. The boxes contain the 25th and 75th percentiles of the CATE distributions with the median indicated by the vertical line. The whiskers extend to 1.5 times the inter-quartile range of the distribution.

https://doi.org/10.1371/journal.pdig.0000721.g004

Discussion and conclusion

Valid decision-making evidence from EHR data requires a clear causal framework. Indeed, machine-learning algorithms have often extracted non-causal associations between the intervention and the outcome, improper for decision-making [11,13,14]. Machine learning studies in medicine often rely on implicit causal thinking, via a good understanding of the clinical setting. A clear framework helps make sure nothing falls through the cracks.

We have separated three steps important for causal validity: the choice of study design, confounders, and estimators. Regarding study design, major caveats arise from the time component, where a poor choice of inclusion time easily brings in significant bias. Regarding the choice of prediction variables, forgetting variables that explain both the treatment allocation and the outcome leads to confounding bias, which however remains small when these variables capture weak links. Regarding the choice of causal estimators, preferring flexible models such as random forests reduces the bias, in particular for doubly robust estimators. We have shown that all three steps are equally important: paying no attention to one of them leads to invalid estimates of treatment effect, yet imperfect but plausible choices lead to small biases of the same order of magnitude for all steps. For instance, despite the emphasis often put on the choice of confounders, minor deviations from the expert’s causal graph did not introduce substantial bias (Fig 3b), no larger than a too rigid choice of estimator. To assert the validity of the analysis, we argue for relating the average effect as much as possible to a reference target trial, even when the goal is to capture the heterogeneity of the effect to individualize decisions. EHRs complement RCTs: RCTs cannot address all subpopulations and local practices [19,87]. EHRs often cover many individuals, with the diversity needed to model treatment heterogeneity. The corresponding model can then inform better decision-making [17]: a sub-population analysis (as in Fig 4) can distill rules on which groups of patients should receive a treatment. Beyond a sub-group perspective, patient-specific estimates facilitate a personalized approach to clinical decision-making [88].

Since the early 1980s, researchers have investigated the use of colloid fluids in sepsis resuscitation due to their theoretical advantages. However, evidence has long been conflicting. The debate was sparked anew when new synthetic colloid solutions became available, which were later shown to have renal adverse effects [80]. As even large RCTs left unanswered questions, researchers turned to meta-analyses. Here, our analysis is in line with the latest two meta-analyses [80,81], as we found no net benefit of resuscitation with albumin in septic patients overall, but a possible slight benefit for patients with septic shock (see Fig 4). While regular meta-analyses not using patient-level data are restricted in their sensitivity analyses, our approach offers the benefit of investigating further potential effect modifiers such as age, sex, or race.

Even without considering a specific intervention, anchoring machine-learning models on causal mechanisms can make them more robust to distributional shift [89], thus safer and fairer for clinical use [18,90]. Yet it is important to keep in mind that better prediction is not per se a goal in healthcare. Establishing strong predictors might be less important than identifying moderately strong but modifiable risk factors as established in the Framingham cohort [91], or optimizing population-wide cost-effectiveness instead of individual treatment effect.

No sophisticated data-processing tool can safeguard against invalid study design or a major missing confounder, loopholes that can undermine decision-making systems. Our framework helps the investigator ensure causal validity by outlining the important steps and relating average effects to RCTs. Causal grounding of individual predictions should reduce the social disparities that such predictions otherwise reinforce [6,92,93], as these are driven by historical decisions and not biological mechanisms. At the population level, it leads to better public health decisions. For instance, going back to cardiovascular diseases, the stakes are to go beyond risk scores and also account for responder status when prescribing prevention drugs.

Supporting information

S1 Fig. Motivating example: Failure of predictive models to predict mortality from pretreatment variables.

https://doi.org/10.1371/journal.pdig.0000721.s001

(PDF)

S2 Fig. Immortal time bias illustration.

https://doi.org/10.1371/journal.pdig.0000721.s002

(PDF)

S7 Fig. Complete results for the main analysis.

https://doi.org/10.1371/journal.pdig.0000721.s007

(PDF)

S8 Fig. Complete results for the Immortal time bias.

https://doi.org/10.1371/journal.pdig.0000721.s008

(PDF)

S9 Fig. Vibration analysis for aggregation.

https://doi.org/10.1371/journal.pdig.0000721.s009

(PDF)

S1 Appendix. Assumptions: what is needed for causal inference from observational studies.

https://doi.org/10.1371/journal.pdig.0000721.s010

(PDF)

S2 Appendix. Major causal-inference methods: When to use which estimator?

https://doi.org/10.1371/journal.pdig.0000721.s011

(PDF)

S3 Appendix. Statistical considerations when implementing estimation.

https://doi.org/10.1371/journal.pdig.0000721.s012

(PDF)

S4 Appendix. Packages for causal estimation in the python ecosystem.

https://doi.org/10.1371/journal.pdig.0000721.s013

(PDF)

S5 Appendix. Hyper-parameter search for the nuisance models.

https://doi.org/10.1371/journal.pdig.0000721.s014

(PDF)

S6 Appendix. Deviating from expert ignorability – Impact of smaller confounders sets.

https://doi.org/10.1371/journal.pdig.0000721.s015

(PDF)

S7 Appendix. Details on treatment heterogeneity analysis.

https://doi.org/10.1371/journal.pdig.0000721.s016

(PDF)

S1 Table. Complete description of the confounders for the main analysis.

https://doi.org/10.1371/journal.pdig.0000721.s017

(PDF)

Acknowledgments

We thank the whole PhysioNet team for their encouragement and support, in particular: Fredrik Willumsen Haug, João Matos, Luis Nakayama, Sicheng Hao, Alistair Johnson.

References

  1. 1. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1:18. https://doi.org/10.1038/s41746-018-0029-1 pmid:31304302
  2. 2. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 2019;1(6):e271–97. pmid:33323251
  3. 3. Li Y, Rao S, Solares JRA, Hassaine A, Ramakrishnan R, Canoy D, et al. BEHRT: Transformer for electronic health records. Sci Rep 2020;10(1):7155. pmid:32346050
  4. 4. Beaulieu-Jones BK, Yuan W, Brat GA, Beam AL, Weber G, Ruffin M, et al. Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians? NPJ Digit Med. 2021;4(1):62. pmid:33785839
  5. 5. Aggarwal R, Sounderajah V, Martin G, Ting DSW, Karthikesalingam A, King D, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit Med 2021;4(1):65. pmid:33828217
  6. 6. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018;169(12):866–72. pmid:30508424
  7. 7. Singh H, Mhasawade V, Chunara R. Generalizability challenges of mortality risk prediction models: A retrospective analysis on a multi-center database. PLOS Digit Health 2022;1(4):e0000023. pmid:36812510
  8. 8. Gichoya JW, Banerjee I, Bhimireddy AR, Burns JL, Celi LA, Chen L-C, et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health 2022;4(6):e406–14. pmid:35568690
  9. 9. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27(12):2176–82. pmid:34893776
  10. 10. Geirhos R, Jacobsen J-H, Michaelis C, Zemel R, Brendel W, Bethge M, et al. Shortcut learning in deep neural networks. Nat Mach Intell 2020;2(11):665–73.
  11. 11. Winkler JK, Fink C, Toberer F, Enk A, Deinlein T, Hofmann-Wellenhof R, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol 2019;155(10):1135–41. pmid:31411641
  12. 12. DeGrave AJ, Janizek JD, Lee S-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell 2021;3(7):610–9.
  13. 13. Badgeley MA, Zech JR, Oakden-Rayner L, Glicksberg BS, Liu M, Gale W, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med. 2019;2:31. https://doi.org/10.1038/s41746-019-0105-1 pmid:31304378
  14. 14. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366(6464):447–53. pmid:31649194
  15. 15. Yuan W, Beaulieu-Jones BK, Yu K-H, Lipnick SL, Palmer N, Loscalzo J, et al. Temporal bias in case-control design: preventing reliable predictions of the future. Nat Commun 2021;12(1):1107. pmid:33597541
  16. 16. Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, DeTroyer-Cooley O, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med 2021;181(8):1065–70. pmid:34152373
  17. 17. Prosperi M, Guo Y, Sperrin M, Koopman JS, Min JS, He X, et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nat Mach Intell 2020;2(7):369–75.
  18. 18. Plecko D, Bareinboim E. Causal fairness analysis. arXiv preprint arXiv:2207.11385. 2022.
  19. 19. Travers J, Marsh S, Williams M, Weatherall M, Caldwell B, Shirtcliffe P, et al. External validity of randomised controlled trials in asthma: to whom do the results of the trials apply? Thorax. 2007;62(3):219–23. pmid:17105779
  20. 20. Averitt AJ, Weng C, Ryan P, Perotte A. Translating evidence into practice: eligibility criteria fail to eliminate clinically significant differences between real-world and study populations. NPJ Digit Med. 2020;3:67. https://doi.org/10.1038/s41746-020-0277-8 pmid:32411828
  21. 21. Desai RJ, Matheny ME, Johnson K, Marsolo K, Curtis LH, Nelson JC, et al. Broadening the reach of the FDA Sentinel system: A roadmap for integrating electronic health record data in a causal analysis framework. NPJ Digit Med 2021;4(1):170. pmid:34931012
  22. 22. Rekkas A, van Klaveren D, Ryan PB, Steyerberg EW, Kent DM, Rijnbeek PR. A standardized framework for risk-based assessment of treatment effect heterogeneity in observational healthcare databases. NPJ Digit Med 2023;6(1):58. pmid:36991144
  23. 23. Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol. 2016;79:70–5. pmid:27237061
  24. 24. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet 2007;370(9596):1453–7. pmid:18064739
  25. 25. Benchimol EI, Smeeth L, Guttmann A, Harron K, Moher D, Petersen I, et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement. PLoS Med 2015;12(10):e1001885. pmid:26440803
  26. 26. Hernán MA, Robins JM. Causal inference: What if? 2020.
  27. 27. Schneeweiss S, Patorno E. Conducting real-world evidence studies on the clinical outcomes of diabetes treatments. Endocr Rev 2021;42(5):658–90. pmid:33710268
  28. 28. Zeng J, Gensheimer MF, Rubin DL, Athey S, Shachter RD. Uncovering interpretable potential confounders in electronic medical records. Nat Commun 2022;13(1):1014. pmid:35197467
  29. 29. Suissa S. Immortal time bias in pharmaco-epidemiology. Am J Epidemiol 2008;167(4):492–9. pmid:18056625
  30. 30. Oke J, Fanshawe T, Nunan D. Lead time bias, catalogue of bias collaboration. 2021. Available from: https://catalogofbias.org/biases/lead-time-bias/.
  31. 31. Fu EL, Evans M, Carrero JJ, Putter H, Clase CM, Caskey FJ. Timing of dialysis initiation to reduce mortality and cardiovascular events in advanced chronic kidney disease: nationwide cohort study. BMJ. 2021;375.
  32. 32. Wang SV, Sreedhara SK, Bessette LG, Schneeweiss S. Understanding variation in the results of real-world evidence studies that seem to address the same question. J Clin Epidemiol. 2022;151:161–70. https://doi.org/10.1016/j.jclinepi.2022.08.012 pmid:36075314
  33. 33. Bankhead C, Nunan D, Aronson JK. Attrition bias, Catalogue of Bias Collaboration. 2017. Available from: https://catalogofbias.org/biases/attrition-bias/.
  34. 34. Greenland S, Pearl J, Robins J. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48.
  35. 35. VanderWeele TJ. Principles of confounder selection. Eur J Epidemiol 2019;34(3):211–9. pmid:30840181
  36. 36. Loh WW, Vansteelandt S. Confounder selection strategies targeting stable treatment effect estimators. Stat Med 2021;40(3):607–30. pmid:33150645
  37. 37. Wang SV, Schneeweiss S, RCT-DUPLICATE Initiative, Franklin JM, Desai RJ, Feldman W, et al. Emulation of randomized clinical trials with nonrandomized database analyses: results of 32 clinical trials. JAMA 2023;329(16):1376–85. pmid:37097356
  38. 38. Belloni A, Chernozhukov V, Hansen C. High-dimensional methods and inference on structural and treatment effects. J Econ Perspect 2014;28(2):29–50.
  39. 39. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–C68.
  40. 40. Shalit U, Sontag D. Causal Inference for Observational studies: Tutorial; 2016. Available from: https://docplayer.net/
  41. 41. Sharma A. Tutorial on causal inference and counterfactual reasoning. 2018. Available from: https://causalinference.gitlab.io/kdd-tutorial/.
  42. 42. Moraffah R, Sheth P, Karami M, Bhattacharya A, Wang Q, Tahir A, et al. Causal inference for time series analysis: problems, methods and evaluation. Knowl Inf Syst 2021;63(12):3041–85.
  43. 43. Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci 2010;25(1):1–21. pmid:20871802
  44. 44. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med 2015;34(28):3661–79. pmid:26238958
  45. 45. Robins JM, Greenland S. The role of model selection in causal inference from nonexperimental data. Am J Epidemiol 1986;123(3):392–402. pmid:3946386
  46. 46. Johansson FD, Shalit U, Kallus N, Sontag D. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. J Machine Learn Res. 2022;23(1):7489–538.
  47. 47. Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. Am J Epidemiol 2017;185(1):65–73. pmid:27941068
  48. 48. Dorie V, Hill J, Shalit U, Scott M, Cervone D. Automated versus do-it-yourself methods for causal inference. Statistical Science. 2019;34(1):43–68.
  49. 49. Alaa A, Van Der Schaar M. Validating causal inference models via influence functions. In: International Conference on Machine Learning. PMLR; 2019.
  50. 50. Curth A, Svensson D, Weatherall J, van der Schaar M. Really doing great at estimating CATE? A critical look at ML benchmarking practices in treatment effect estimation. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021.
  51. 51. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi L, Mark R. Mimic-iv. PhysioNet. 2020.
  52. 52. Hernán MA. Methods of public health research - strengthening causal inference from observational data. N Engl J Med 2021;385(15):1345–8. pmid:34596980
  53. 53. Richardson WS, Wilson MC, Nishikawa J, Hayward RS. The well-built clinical question: a key to evidence-based decisions. ACP J Club. 1995;123(3):A12-3. https://doi.org/10.7326/acpjc-1995-123-3-a12 pmid:7582737
  54. 54. Riva JJ, Malik KMP, Burnie SJ, Endicott AR, Busse JW. What is your research question? An introduction to the PICOT format for clinicians. J Can Chiropr Assoc. 2012;56(3):167–71. pmid:22997465
  55. 55. Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol 2016;183(8):758–64. pmid:26994063
  56. 56. Catalogue of Bias Collaboration; 2023. Available from: https://catalogofbias.org/biases/.
  57. 57. Weiskopf N, Dorr D, Jackson C, Lehmann H, Thompson C. Healthcare utilization is a collider: an introduction to collider bias in EHR data reuse. J Am Med Inform Assoc. 2023:ocad013.
  58. 58. The Book of OHDSI: Observational Health Data Sciences and Informatics. OHDSI. 2021.
  59. 59. Pearl J, Mackenzie D. The book of why: the new science of cause and effect. 2018.
  60. 60. Rubin DB. Causal inference using potential outcomes. J Am Stat Assoc 2005;100(469):322–31.
  61. 61. Attia J, Holliday E, Oldmeadow C. A proposal for capturing interaction and effect modification using DAGs. 2022.
  62. 62. Textor J, Hardt J, Knüppel S. DAGitty: a graphical tool for analyzing causal diagrams. Epidemiology 2011;22(5):745. pmid:21811114
  63. 63. Cinelli C, Hazlett C. Making sense of sensitivity: extending omitted variable bias. J R Stat Soc B 2019;82(1):39–67.
  64. 64. Tchetgen Tchetgen EJ, Ying A, Cui Y, Shi X, Miao W. An introduction to proximal causal inference. Stat Sci. 2024;39(3).
  65. 65. Wager S. Stats 361: Causal inference; 2020.
  66. 66. Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat 2004;86(1):4–29.
  67. 67. Colnet B, Josse J, Varoquaux G, Scornet E. Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize? arXiv preprint arXiv:2303.16008. 2023.
  68. 68. D’Amour A, Ding P, Feller A, Lei L, Sekhon J. Overlap in observational studies with high-dimensional covariates. J Econ 2021;221(2):644–54.
  69. 69. Horng S, Sontag DA, Halpern Y, Jernite Y, Shapiro NI, Nathanson LA. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PLoS One 2017;12(4):e0174708. pmid:28384212
  70. 70. Jiang L, Liu X, Nejatian N, Nasir-Moin M, Wang D, Abidin A. Health system-scale language models are all-purpose prediction engines. Nature. 2023:1–6.
  71. 71. Wendling T, Jung K, Callahan A, Schuler A, Shah NH, Gallego B. Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Stat Med 2018;37(23):3309–24. pmid:29862536
  72. 72. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article25. https://doi.org/10.2202/1544-6115.1309 pmid:17910531
  73. 73. Doutreligne M, Varoquaux G. How to select predictive models for causal inference?. arXiv preprint. 2023.
  74. 74. Schneeweiss S. Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiol Drug Saf 2006;15(5):291–303. pmid:16447304
  75. 75. Thabane L, Mbuagbaw L, Zhang S, Samaan Z, Marcucci M, Ye C, et al. A tutorial on sensitivity analyses in clinical trials: the what, why, when and how. BMC Med Res Methodol. 2013;13:92. https://doi.org/10.1186/1471-2288-13-92 pmid:23855337
  76. 76. Statistical Principles for Clinical Trials: Addendum: Estimands and Sensitivity Analysis in Clinical Trials. FDA. 2021.
  77. 77. Patel CJ, Burford B, Ioannidis JPA. Assessment of vibration of effects due to model specification can demonstrate the instability of observational associations. J Clin Epidemiol 2015;68(9):1046–58. pmid:26279400
  78. 78. Robertson SE, Leith A, Schmid CH, Dahabreh IJ. Assessing Heterogeneity of Treatment Effects in Observational Studies. Am J Epidemiol 2021;190(6):1088–100. pmid:33083822
  79. 79. Annane D, Siami S, Jaber S, Martin C, Elatrous S, Declère AD, et al. Effects of fluid resuscitation with colloids vs crystalloids on mortality in critically ill patients presenting with hypovolemic shock: the CRISTAL randomized trial. JAMA 2013;310(17):1809–17. pmid:24108515
  80. 80. Xu J-Y, Chen Q-H, Xie J-F, Pan C, Liu S-Q, Huang L-W, et al. Comparison of the effects of albumin and crystalloid on mortality in adult patients with severe sepsis and septic shock: a meta-analysis of randomized clinical trials. Crit Care 2014;18(6):702. pmid:25499187
  81. 81. Li B, Zhao H, Zhang J, Yan Q, Li T, Liu L. Resuscitation fluids in septic shock: a network meta-analysis of randomized controlled trials. Shock 2020;53(6):679–85. pmid:31693630
  82. 82. Caironi P, Tognoni G, Masson S, Fumagalli R, Pesenti A, Romero M, et al. Albumin replacement in patients with severe sepsis or septic shock. N Engl J Med 2014;370(15):1412–21. pmid:24635772
  83. 83. Zhou S, Zeng Z, Wei H, Sha T, An S. Early combination of albumin with crystalloids administration might be beneficial for the survival of septic patients: a retrospective analysis from MIMIC-IV database. Ann Intensive Care 2021;11(1):42. pmid:33689042
  84. 84. Battocchi K, Dillon E, Hei M, Lewis G, Oka P, Oprescu M. EconML: A python package for ML-based heterogeneous treatment effects estimation. 2019.
  85. 85. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O. Scikit-learn: Machine learning in python. J Machine Learn Res. 2011;12:2825–30.
  86. 86. Naimi AI, Mishler AE, Kennedy EH. Challenges in obtaining valid causal effect estimates with machine learning algorithms. Am J Epidemiol. 2023;192(9):kwab201. pmid:34268558
  87. 87. Kennedy-Martin T, Curtis S, Faries D, Robinson S, Johnston J. A literature review on the representativeness of randomized controlled trial samples and implications for the external validity of trial results. Trials. 2015;16:495. pmid:26530985
  88. 88. Kent DM, Steyerberg E, van Klaveren D. Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects. BMJ. 2018;363.
  89. 89. Scholkopf B, Locatello F, Bauer S, Ke NR, Kalchbrenner N, Goyal A, et al. Toward Causal representation learning. Proc IEEE 2021;109(5):612–34.
  90. 90. Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun 2020;11(1):3923. pmid:32782264
  91. 91. Brand RJ, Rosenman RH, Sholtz RI, Friedman M. Multivariate prediction of coronary heart disease in the Western Collaborative Group Study compared to the findings of the Framingham study. Circulation 1976;53(2):348–55. pmid:1245042
  92. 92. Mitra N, Roy J, Small D. The Future of Causal Inference. Am J Epidemiol 2022;191(10):1671–6. pmid:35762132
  93. 93. Ehrmann DE, Joshi S, Goodfellow SD, Mazwi ML, Eytan D. Making machine learning matter to clinicians: model actionability in medical decision-making. NPJ Digit Med 2023;6(1):7. pmid:36690689