External validation of clinical prediction rules for complications and mortality following Clostridioides difficile infection

Background: Several clinical prediction rules (CPRs) for complications and mortality of Clostridioides difficile infection (CDI) have been developed, but only a few have undergone external validation, and none is widely recommended in clinical practice.

Methods: CPRs were identified through a systematic review. We included studies that predicted severe or complicated CDI (cCDI) or mortality, reported at least an internal validation step, and for which data were available with minimal modifications. Data from a multicenter prospective cohort of 1380 adults with confirmed CDI were used for external validation. In this cohort, cCDI occurred in 8% of patients and 30-day all-cause mortality in 12%. The performance of each tool was assessed using individual outcomes, with the same cut-offs and standard parameters.

Results: Seven CPRs were assessed. Three predictive scores for cCDI showed low sensitivity (25–61%) and positive predictive value (PPV; 9–31%), but moderate specificity (54–90%) and negative predictive value (NPV; 82–95%). One model [using age, white blood cell count (WBC), narcotic use, antacid use, and a creatinine ratio >1.5× the normal level as covariates] showed a probability of cCDI of 25% at the optimal cut-off point, with 36% sensitivity and 84% specificity. Two scores for mortality had low sensitivity (4–55%) and PPV (25–31%), and moderate specificity (71–78%) and NPV (87–92%). One predictive model for 30-day all-cause mortality [Charlson comorbidity index, WBC, blood urea nitrogen (BUN), diagnosis in ICU, and delirium] showed an AUC-ROC of 0.74. All other CPRs showed lower AUC values (0.63–0.69). Errors in calibration ranged from 12% to 27%.

Conclusions: The included CPRs showed only moderate performance for clinical use in a large validation cohort with a majority of patients infected with ribotype 027 strains and low rates of cCDI and mortality. These data show that better CPRs need to be developed and validated.


Introduction
Clostridioides difficile infection (CDI) is a significant nosocomial infection, accounting for 10–25% of antibiotic-associated diarrhea [1]. Approximately 20% of patients with CDI will experience a complicated disease course, defined by the presence of hypotension or shock, ileus, or megacolon [2]. Mortality is twice as high in hospitalized patients with CDI as in outpatients [3]. The emergence of the hypervirulent strain NAP1/BI/027 in the early 2000s has been associated with an increase in unfavorable outcomes [4–6]. In many cases, severe and complicated CDI (cCDI) leads to extended hospital stays and surgical procedures [3], further exacerbating the burden of this disease.
Many studies have tried to identify risk factors for cCDI, including age, leukocytosis, high C-reactive protein levels, hypoalbuminemia, acute renal failure, and comorbidities such as diabetes, chronic kidney injury, and immunosuppression [2]. Many clinical prediction rules (CPRs) for severe CDI and mortality have been developed, but few have been validated in multiple clinical settings where prevalence and outcomes vary. Consequently, no CPRs have been widely accepted for clinical use. Building a clinically relevant prediction rule requires several essential steps. The first step is derivation, which consists of identifying variables with predictive power, usually by multivariable analysis in the original cohort. The second step is internal validation, which is performed on a subset of the original cohort (or with resampling techniques such as bootstrapping). This step determines the reproducibility of the rule by demonstrating the stability of the selected predictors and the quality of predictions. However, when internal validation is performed on the same cohort used for derivation, it usually overestimates the performance of scores [7]. Consequently, the third step is a broad (external) validation to evaluate the performance of the rule on a different cohort from a separate clinical setting, with different prevalence and disease outcomes. The last step, essential for translation from research to the clinic, is an impact analysis evaluating how the rule is used in real life and how it affects clinician behaviour and clinical outcomes [8]. An accurate predictive tool would be useful for early recognition of patients who are at higher risk of unfavorable outcomes, and for improved stratification of patients in clinical trials. In this context, we conducted an external validation of selected CPRs for cCDI and CDI mortality.
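The derivation and internal-validation steps above can be illustrated with a toy sketch. The Python below is our own illustration (none of the included studies published code, and all function names are ours): a cut-off for a points score is "derived" by maximizing the Youden index on the data itself, and Harrell-style bootstrap resampling then estimates how much that data-driven choice inflates apparent performance.

```python
import random

def youden(scores, outcomes, cut):
    """Youden index (sensitivity + specificity - 1) for the rule: score >= cut."""
    tp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 1)
    fn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 1)
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 0)
    tn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

def best_cut(scores, outcomes):
    """Cut-off chosen on the data itself -- the 'derivation' step."""
    return max(sorted(set(scores)), key=lambda c: youden(scores, outcomes, c))

def bootstrap_validation(scores, outcomes, n_boot=200, seed=1):
    """Optimism-corrected performance (bootstrap internal validation):
    re-derive the cut-off in each resample, compare its performance on the
    resample with its performance on the original cohort, and subtract the
    mean optimism from the apparent performance."""
    random.seed(seed)
    apparent = youden(scores, outcomes, best_cut(scores, outcomes))
    n, optimism = len(scores), 0.0
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        bs = [scores[i] for i in idx]
        by = [outcomes[i] for i in idx]
        cut = best_cut(bs, by)  # the cut-off is re-derived in each resample
        optimism += youden(bs, by, cut) - youden(scores, outcomes, cut)
    return apparent, apparent - optimism / n_boot
```

On a perfectly separable toy cohort the apparent Youden index is 1.0, while the optimism-corrected value comes out slightly lower, mirroring the overestimation described above when validation reuses the derivation data.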
Methods

Systematic review

CPRs were identified through a systematic search of databases (MEDLINE, PubMed, Embase, Web of Science, and the Cochrane Library for evidence-based medicine) and conference proceedings [10]. An updated review was performed for studies through December 2018, using the MEDLINE and PubMed databases with the same steps and the same combination of keywords as the initial review. The keywords were: "(Clostridium difficile or Clostridium difficile-associated diarrhea or Clostridium difficile-associated disease) AND (diarrhea or diarrhoea or colitis or pseudomembranous or enterocolitis or enteritis or antibiotic-associated disease) AND (sensitivity or specificity or prediction or index or score or model or factor or gradient or decision rule or decision technique or prognosis or risk index or risk score or risk model or risk scale)." We included publications on CPRs, scoring systems (point systems generated by comparing the parameter estimates of a multivariable regression model), or predictive multivariable models for cCDI and CDI mortality in general patient populations. Study inclusion criteria were as follows: i) a clear methodology for derivation; ii) an internal validation step; and iii) predictors available integrally, or with minimal adaptation, in the validation cohort. Publications studying specific populations (e.g. patients with immunosuppression, inflammatory bowel disease, or children) were excluded.

External validation of included CPRs
Data on 1380 adults with confirmed CDI from a multicenter prospective cohort with a 90-day follow-up period were used for the external validation of selected CPRs. Patients were enrolled in 10 Canadian hospitals across two provinces (2005–2008). Detailed methods are as previously described [11]. Descriptive data on the cohort are provided in the Supplementary materials (S1 Table). Data from the validation cohort were recoded to reproduce the predictors and primary outcomes of each included CPR, and modifications were made to match available data (Table 1). Our data were independent from the data used in the derivation of any of the included CPRs. Patients' characteristics in the sub-cohorts used for each CPR are also described in S1 Table. Patients with missing data for any predictor were excluded from analyses. The missingness of predictors was most likely random and was not associated with the outcome. The same point assignments and cut-off values were used for validation analyses.
A logistic regression was conducted for each CPR, and the following standard performance parameters with 95% confidence intervals (CI) were assessed [7]: sensitivity, specificity, positive and negative predictive values (PPV and NPV), positive and negative likelihood ratios (LR), and overall accuracy. Discrimination (the ability of a CPR to distinguish high-risk from low-risk patients) was assessed by the area under the receiver operating characteristic curve (AUC-ROC). Calibration (the extent of agreement between predicted probabilities by a CPR and the observed occurrence of an outcome) was assessed graphically and by Brier score or the mean absolute calibration error (MACE; %), where absolute values of deviance between observed and predicted probability were averaged [12]. Lower MACE values reflect greater precision of the predicted probability. Statistical analyses were performed using SAS 9.4 (SAS Institute Inc., Cary, North Carolina).
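As an illustration of these standard parameters, the sketch below computes the 2×2 classification metrics and a binned mean absolute calibration error. This is plain Python written for this description (the study itself used SAS), and the function and variable names are ours.

```python
def performance(y_true, y_pred):
    """2x2 metrics for a binary CPR cut-off.
    y_true: observed outcomes (1 = event); y_pred: rule-positive (1 = high risk).
    Assumes at least one event, non-event, positive, and negative (no zero cells)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sens = tp / (tp + fn)          # sensitivity
    spec = tn / (tn + fp)          # specificity
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PPV": tp / (tp + fp),     # positive predictive value
        "NPV": tn / (tn + fn),     # negative predictive value
        "LR+": sens / (1 - spec),  # positive likelihood ratio
        "LR-": (1 - sens) / spec,  # negative likelihood ratio
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def mace(y_true, p_pred, n_bins=10):
    """Mean absolute calibration error (%): group patients into bins of
    predicted probability, compare each bin's observed event rate with its
    mean predicted probability, and average the absolute deviations
    weighted by bin size."""
    pairs = sorted(zip(p_pred, y_true))
    n, err = len(pairs), 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        obs = sum(t for _, t in chunk) / len(chunk)
        pred = sum(p for p, _ in chunk) / len(chunk)
        err += abs(obs - pred) * len(chunk)
    return 100 * err / n
```

A well-calibrated rule yields a MACE near 0%, whereas a rule that predicts 25% risk in a group whose observed event rate is 50% contributes a 25-point deviation for that bin.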

Results
The systematic review flow-chart is shown in Fig 1. Overall, 32 studies deriving a CPR for unfavorable CDI outcomes were identified. Among these, ten aimed to predict cCDI and nine aimed to predict CDI-associated mortality. We excluded four studies that did not report clear estimates of internal validity [13–16], seven studies for which important variables or outcomes were not available in our cohort [17–23], and one study that reported very limited information on derivation methodology [24]. We assessed the external validity of three scores predictive of cCDI [25–27] and two of mortality [28,29], and one predictive model for each outcome [30,31]. The characteristics of the included CPRs and the relative modifications are shown in Table 1. Four studies used a prospective design for derivation [25–27,31] and one was performed in multiple centers [26]. The duration of follow-up was reported in two studies [26,31]. The derivation sample sizes were relatively small (n = 213 to 746 patients), except in the Kassam et al. [29] study, where data from an administrative database of CDI-associated hospitalizations were used (Table 1 and S2 Table). Complications of CDI were mainly defined by ICU admission, colectomy, or all-cause death; only Na et al. [25] restricted the outcome to CDI-attributable events. The distribution of outcomes varied greatly for cCDI and mortality, with rates of 6.4%–34% and 8%–21%, respectively (S1 Table). Most studies used multivariable logistic regression models to identify predictors. Complete data on predictors were reported in only two studies [26,27] (S2 Table). The C. difficile strain (ribotype) was reported in only one study, conducted during an endemic period, without being considered among potential predictors [26]. Scoring points were attributed in proportion to risk estimates in two studies [28,29] (Table 1).
The AUC-ROC was the most frequently reported performance measure for internal validation (Tables 2 and 3).

External validation
Sample sizes for the data used in the external validation of each CPR varied from 933 to 1338 patients per sub-cohort (S1 Table), without major differences from the overall (full) external validation cohort [11]. Most cases were older adults (median age 71 years), had at least one chronic underlying illness (88%), had received at least one antimicrobial agent within the two months preceding enrollment (87%), had received at least one acid-suppression medication (66%), had undergone surgery (38%), and were immunocompromised (29%). At enrollment, CDI was the initial episode in 86% of patients and was healthcare-associated in 90%. Strain ribotype was obtained for 922 patients; ribotype 027 was found in 52% of them. Metronidazole was the initial treatment in 81% of patients, vancomycin in 8%, and a combination of these drugs in 6.5%. During the 30-day follow-up period, 3% of patients were admitted to the ICU for complications of CDI, and 12% died from all causes. Only 4% of deaths were deemed attributable to CDI.

CPRs for cCDI
For all four studies assessed, the observed cCDI outcomes were less frequent in each validation sub-cohort than in the respective derivation cohorts (Table 1, S3 Table). Calibration curves are shown in S1 Fig. For a score ≥2 points in Na et al. [25], sensitivity was lower in our cohort (46%) than in both the internal assessment (62%) and the external validation cohort reported in the publication (53%). That external validation was conducted on a cohort of combined patients from two centers in two different countries (n = 345) (S3 Table). The PPV of the score was two-fold lower in our validation (23% vs. 44%). Specificity, the NPV, both LRs, and diagnostic accuracy were similar in the derivation and the reported external validation cohorts, but were higher in ours, although discrimination reached only 0.66 (AUC) and the calibration error was 14% (Table 2).
Performance parameters decreased for all cut-offs of Hensgens' score [26] in our validation, and outcomes were more frequently observed in patients with low scores (11%, n = 92/827 vs. 3%, n = 7/219 with a score ≥1 point). We obtained lower sensitivity for a score ≥4 points than both the reported internal and external validations (25% vs. 43%), but comparable specificity (90%). Both discrimination and calibration were low in the external validation cohort (AUC = 0.63 and MACE = 25%). In the study by van der Wilden et al. [27], attributable and all-cause events were not differentiated. We conducted validation using all-cause and CDI-attributable ICU admissions and death (Table 2). In both cases, all observed parameters were significantly lower than in the derivation cohort (S3 Table), where a score ≥6 points was extremely sensitive (98% vs. 48% and 61%, respectively) but less specific (88% vs. 54%). This score showed near-perfect discrimination in the derivation cohort (0.98) that dramatically decreased in external validation (0.6–0.7), with a 12% error in calibration. Diffuse abdominal tenderness, a criterion generating 6 of 16 points in this score, was 2.6-fold more frequent in our cohort (S2 Table), and 46% of patients had high scores. Despite a very low PPV (9% vs. 19%), using CDI-attributable rather than all-cause events to define cCDI led to a higher LR+ (1.35 vs. 1.05), slightly higher overall accuracy (55%), and better discrimination (0.67 vs. 0.57) and calibration (MACE 12% vs. 30%).
The optimal cut-off point (score = -1.1) for the Shivashankar et al. [30] model corresponded to a 25% probability of developing cCDI in both the derivation and external validations. However, this score had 80% sensitivity and 46% specificity in the derivation cohort (S3 Table), vs. 36% and 84%, respectively, in the validation cohort. The outcome frequency we observed was half that of the derivation cohort (17% vs. 34%), and the AUC was similar in both cohorts (0.70). Although each additional point in the score was associated with an approximately 3-fold increased risk of cCDI (crude OR = 2.9), this model showed a 27% error in calibration. Sensitivity steadily decreased, while specificity increased, with increasing scoring points. The PPV reached 45% in the external validation cohort, but was not reported in the derivation study (Table 2).
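The crude per-point odds ratios reported in this section come from univariable logistic regressions of the outcome on the score. A minimal sketch of how such an estimate is obtained (our own Newton-Raphson implementation for illustration, not the SAS procedure used in the study; all names are ours):

```python
import math

def fit_logistic(x, y, n_iter=50):
    """Univariable logistic regression fitted by Newton-Raphson.
    Returns (intercept, slope); exp(slope) is the crude odds ratio
    associated with a one-point increase in the score."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # predicted probability
            w = p * (1.0 - p)                             # IRLS weight
            g0 += yi - p                                  # gradient terms
            g1 += (yi - p) * xi
            h00 += w                                      # information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det                 # Newton step
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1
```

Given fitted coefficients, `1 / (1 + exp(-(intercept + slope * s)))` converts a score `s` into a predicted probability, which is how a cut-off can be mapped to a probability of cCDI such as the 25% reported above.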

CPRs for mortality
Calibration plots are shown in S2 Fig. Despite the much larger derivation sample size, in-hospital mortality was about two-fold higher in the external validation sub-cohort (18% vs. 8%) than in the derivation cohort used by Kassam et al. [29]. Lower scores were more frequently observed in the validation sub-cohort, with 75% of patients having ≤6 points and a maximum observed score of 14 points (Table 3). These frequencies were not reported in the derivation study (S4 Table). Only 50% of patients with 11–14 points experienced the outcome (n = 17), leading to low sensitivity and PPV (9% and 50%, respectively). Both discrimination and calibration were lower in the validation sub-cohort (AUC-ROC = 0.66; MACE = 27%), despite a 30% increased risk of mortality associated with each point increase in the score (crude OR = 1.28).
The study by Butt et al. [28] had higher all-cause mortality in the derivation cohort than in the validation sub-cohort (20% vs. 12%). Only 7% of patients had ≥2 points in the derivation cohort (n = 18/244), among whom 72% died (n = 13), whereas 14% of patients in the validation sub-cohort were assigned ≥2 points (n = 125/933), among whom 19% died. When the respiratory rate criterion was dropped, as in the original study, the sensitivity for a score ≥2 points increased to 42%, with a moderate PPV (34%) and without any significant change in the other parameters. The AUC was the only reported parameter (S4 Table) and was similar in the external validation cohort. Independent of the respiratory rate criterion, the calibration error for this score was about 20%.
For the Archbald-Pannone et al. [31] model, a one-point increase was associated with a 5% increased risk of all-cause 30-day mortality in the validation sub-cohort (crude OR = 1.05), compared to 11% in the derivation cohort (S4 Table). Levels of WBC and blood urea nitrogen (BUN) were lower in the validation sub-cohort (S2 Table). Overall, 14% of patients were assigned scores of ≥50 points in the derivation cohort, among whom 42% died (n = 22), vs. only 8% of patients, with 36% mortality (n = 35), in the validation sub-cohort. The model showed significant increases in PPV with increasing points (77% for high scores), but much lower sensitivity (7%). The AUC was 0.74, lower than in the internal validation (0.80) but comparable to the expected value of 0.77, with a 19% error in calibration.

Discussion
In this study, we validated four prediction rules for CDI complications and three for CDI-associated mortality using a cohort with a low incidence of outcomes (8% cCDI and 12% mortality), in which more than half of cases were attributed to the R027 strain. External validation is a mandatory step in taking a prediction rule from development to clinical integration, as it addresses the transportability of the score. This study aimed to evaluate the performance of published scores in a large, multicenter, prospective cohort that included patients with the demographic and clinical characteristics of typical CDI patients [11]. All predictor and outcome definitions in each included study were closely reproduced, except in a few studies for which we adapted the original definitions to fit the variables in our cohort. All included scores and models showed decreased predictive performance in our cohort, even when the results from internal validation were promising [27]. Calibration, defined here as the mean error between observed and expected outcomes, ranged from 12% to 30%. Using the same cut-offs as in the derivation studies, the CPRs for cCDI showed sensitivities ranging from 25% to 61%, specificities from 54% to 90%, PPVs from 9% to 31%, and NPVs from 82% to 95% in our validation cohort. Overall accuracy and AUC values were low, ranging from 53% to 84% and from 0.57 to 0.67, respectively. However, a decrease in performance is not unusual in the external validation of CPRs [7]. For example, the scores of Na and Hensgens [25,26] performed similarly (AUC = 0.54 and 0.68, respectively) in a second validation cohort of 148 patients during an outbreak of R027 strains [33]. The effect of strain on CDI outcome is still controversial in observational studies [34], and we did not find any significant association with cCDI in our cohort [11], despite the high frequency of R027.
Only one of the included studies reported the frequency of ribotypes, and it was not considered a potential predictor [26]. Similar performance was observed for mortality, with sensitivities ranging from 4% to 55%, specificities from 71% to 78%, PPVs from 25% to 31%, and NPVs from 87% to 92%. Overall accuracy ranged from 73% to 82% and the AUC was also moderate, ranging from 0.66 to 0.69.
In the time since most of the included CPRs were published, methodological standards [35] and quality assessment criteria [36] for predictive modelling and prognostic studies have been released. Accordingly, important methodological limitations of the included CPRs could have affected their performance in derivation as well as in external settings, as most of the CPRs in this study were derived from small sample sizes and in different settings. Consequently, it is not surprising to find heterogeneity among the variables identified as predictors. There were no common predictors among studies predicting mortality, and only a few among studies predicting cCDI: older age, increased WBC (a predictor in three studies [25,27,30]), and increased serum creatinine and hypotension (each a predictor in two studies [25–27,30]). In contrast, the CARDS score characterized by Kassam et al. [29] was derived from a medico-administrative database, which allowed for a large sample size; however, this study also reported less detailed data and used ICD-9 discharge codes for case definitions. A more recent score was derived using the same criteria as CARDS to predict 30-day mortality following complete colectomy [23]. This score was not included for external validation, due to a very low outcome occurrence in our cohort (4 deaths among 18 patients who underwent colectomy). In the study by Na et al. [25], each predictor was assigned one point despite serum creatinine carrying a four-fold, and WBC count a two-fold, higher risk estimate than age (OR = 8, 4, and 2.3, respectively). In the study by Shivashankar et al. [30], narcotics and antacids used during the seven days before CDI diagnosis and up to 30 days after diagnosis were considered predictors for cCDI. In the score of Butt et al. [28], low albumin levels were protective in the derivation study but were assigned the same number of points as the other criteria. For each point increase in their continuous score, the risk of all-cause death was two-fold higher than that of in-hospital mortality [29].
Differences in predictor frequencies between the derivation and the validation sub-cohorts might also have influenced the performance of the CRPs. In the van der Wilden study [27], diffuse abdominal tenderness was three times more frequent than in our cohort and shifted individual scores to higher levels. Also, in the Archbald-Pannone model [31], only 14% of the patients in the validation cohort presented with delirium, a predictor that was given 11 points. Using CDI-attributable unfavorable events in the van der Wilden score [27] led to better prediction than all-cause ICU admission and mortality in terms of discrimination, and decreased calibration from 30% to 12%.
Other studies have attempted external validation of indices for severe CDI course on much smaller sample sizes [33,37–39]. We identified relevant studies through a rigorous systematic review of the literature and included only studies with a clear derivation methodology and at least one internal validation assessment; among previous efforts, only van Beurden et al. [33] used a standardized selection of studies.
Recent guidelines from the Infectious Diseases Society of America (IDSA) [40] highlight two models developed from the fidaxomicin vs. vancomycin clinical trial data, which identified factors that correlated with treatment failure and cure [41,42]. These guidelines used expert opinion to define severe or fulminant CDI and underlined the need for a prospectively validated severity score. Many scores have been published over the years, but none have shown performance data sufficient for wide clinical use, which reflects the complexity of this task. In the future, other types of predictors, such as toxin levels and measures of immunity, frailty, and bacterial strain, should probably be considered and integrated into prediction models.

Conclusion
Clinical prediction rules for cCDI and CDI mortality showed moderate performance in an external validation cohort that had a low rate of measured outcomes and a high proportion of R027 strains. The methodological limitations of the original studies and the heterogeneity of the primary outcomes may have contributed to these suboptimal results. An accurate predictive tool is needed to help clinicians and researchers identify patients at risk for cCDI, and to direct the most effective therapies to these patients.