Validation of prognostic indices for short term mortality in an incident dialysis population of older adults >75

Rationale and objective: Prognosis provides critical knowledge for shared decision making between patients and clinicians. While several prognostic indices for mortality in dialysis patients have been developed, their performance among elderly patients initiating dialysis is unknown, despite great need for reliable prognostication in that context. We aimed to assess the performance of 6 previously validated prognostic indices in predicting 3- and/or 6-month mortality in a cohort of elderly incident dialysis patients.
Study design: Validation study of prognostic indices using retrospective cohort data. Indices were compared using the concordance ("c")-statistic, i.e. area under the receiver operating characteristic curve (ROC). Calibration, sensitivity, specificity, positive and negative predictive values were also calculated.
Setting & participants: Incident elderly (age ≥75 years; n = 349) dialysis patients at a tertiary referral center.
Established predictors: Variables for six validated prognostic indices for short term (3- and 6-month) mortality prediction (Foley, NCI, REIN, updated REIN, Thamer, and Wick) were extracted from the electronic medical record. The indices were individually applied as per each index's specifications to predict 3- and/or 6-month mortality.
Results: In our cohort of 349 patients, mean age was 81.5±4.4 years, 66% were male, and median survival was 351 days. The c-statistic for the risk prediction indices ranged from 0.57 to 0.73. The Wick (ROC 0.73; 0.68, 0.78) and Foley (ROC 0.67; 0.61, 0.73) indices performed best. The Foley index was weakly calibrated with poor overall model fit (p <0.01) and overestimated mortality risk, while the Wick index was relatively well-calibrated but underestimated mortality risk.
Limitations: Small sample size, use of secondary data, need for imputation, homogeneous population.
Conclusion: Most predictive indices for mortality performed moderately in our incident dialysis population. The Wick and Foley indices were the best performing, but had issues with under- and overestimation of risk. More accurate indices for predicting survival in older patients with kidney failure are needed.


Introduction
Optimal shared decision making is predicated on informed and evidence-based conversations between the patient, caregiver, and clinician. For people with end stage renal disease (ESRD), the decision about pursuing renal replacement therapy (RRT) requires a clear understanding of the differences in prognosis with initiation of dialysis, pursuit of kidney transplantation, or maintenance of conservative therapy [1,2]. This conversation is particularly important for patients for whom dialysis is a destination therapy, and whose prognosis while receiving dialysis may be poor [3][4][5]. Many nephrologists and primary care clinicians hesitate to share prognostic information with patients [6] and feel unprepared for discussions about prognosis and goals of care [6][7][8][9]. This hesitancy stems, in part, from the lack of a commonly accepted and widely used standard for predicting and communicating prognostic information to patients and caregivers. The absence of real-time prognostic guidance may contribute to the current default of pursuing more aggressive treatment options and may deprive patients of the opportunity to make informed choices about their health and healthcare [10][11][12].
The rate of incident ESRD is highest among older adults [13], with high treatment and symptom burden [14] resulting on average in 44.2% of older patients dying within the first six months of dialysis initiation [5]. Once on dialysis, upward of 50% of elderly patients choose to withdraw from treatment before death [15]. Several prognostic indices have been developed to predict mortality in dialysis patients [16][17][18][19][20][21][22][23][24][25][26]. However, there has been limited uptake of these tools into routine clinical practice and limited research on their utility and impact for shared decision making, especially in the oldest patients. The available indices have variable performance, with most having moderate to good accuracy in development cohorts that does not always hold in external validation.
Better understanding of the generalizability, performance, and advantages/disadvantages of the available prognostic mortality indicators is needed to assess their utility in real-world populations. The primary aim of the study was therefore to examine the performances of the available prognostic indices in a cohort of elderly (aged 75 years and older) patients newly initiated on RRT.

Methods
This was a prognostic index validation study, following the TRIPOD checklist for prediction model validation [27].

Study design and population
The cohort included all adults aged 75 years and older who initiated any type of RRT from January 1, 2007, through December 31, 2011 in the Mayo Clinic Dialysis Services (MCDS), which provides all RRT services in our health system and serves a general population of 385,000 patients in Southeast Minnesota, Northern Iowa, and Southwest Wisconsin through 8 community based HD facilities as well as inpatient HD. Patients were excluded if they did not provide the institution's generic research authorization, in accordance with Minnesota state law, if they initiated RRT at another institution, or if they had previously received a kidney transplant. The Mayo Clinic Institutional Review Board reviewed and approved this study. The de-identified study dataset can be made available upon request from the corresponding author.

Prognostic indices
We identified 11 indices validated for use at RRT initiation, predicting short term survival (3-6 months), through a systematic review of mortality prediction indices [16]. We had the necessary data to calculate 6 of the indices. Three (Foley, REIN and NCI) [18,22,28] had been previously validated externally, whereas for the other three (Updated REIN, Wick and Thamer) [19,24,29], this paper serves as the first external validation. The indices were developed and tested in cohorts of different size and composition (general vs. geriatric) and varied in their inclusion or exclusion of patients with acute kidney injury (AKI) (Table 1). Most had a c-statistic around 0.7-0.8 in development and internal validation but varied in their performance in previous external validation [16].

Primary outcome
Primary outcomes were index discrimination, as measured by the c-statistic or the area under the receiver operating characteristic curve (ROC); calibration, as measured by the Hosmer-Lemeshow goodness of fit statistic and calibration curves; and positive and negative predictive values for 3- and 6-month all-cause mortality.
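To make the discrimination measure concrete, the following is a minimal Python sketch of the c-statistic computed directly from its pairwise definition: the proportion of (died, survived) patient pairs in which the patient who died had the higher score, with ties counted as one half. The function name `c_statistic` and the toy data are hypothetical and are not taken from the study; the study itself used Stata and R.

```python
from itertools import product


def c_statistic(scores, died):
    """Pairwise concordance between a risk score and a binary death outcome.

    scores: list of numeric risk scores, one per patient.
    died:   list of 0/1 flags (1 = died by the time point of interest).
    Returns the fraction of (died, survived) pairs in which the patient who
    died had the higher score; tied scores contribute 0.5.
    """
    dead = [s for s, d in zip(scores, died) if d]
    alive = [s for s, d in zip(scores, died) if not d]
    if not dead or not alive:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for s_dead, s_alive in product(dead, alive):
        if s_dead > s_alive:
            concordant += 1.0
        elif s_dead == s_alive:
            concordant += 0.5
    return concordant / (len(dead) * len(alive))


# Hypothetical toy data: six patients, three of whom died.
toy_scores = [2, 5, 7, 3, 8, 5]
toy_died = [0, 1, 1, 0, 1, 0]
print(round(c_statistic(toy_scores, toy_died), 3))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of patients who died from those who survived.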

Independent variables
Data on patient demographics (sex, marital status, and living arrangement), comorbidities, context, and survival were extracted from the EHR by a college student supervised by an internist and a nephrologist (BT, LJH). Living arrangement was classified as independent, assisted living, and nursing home (NH). Comorbidities extracted manually from past medical history were supplemented with a validated electronic search of the EHR, which was then used to calculate the Charlson Comorbidity Index (CCI) [30]. Functional status for hospitalized patients was based on the Barthel index, calculated by a validated electronic search pulling information from nursing assessments [31]. For patients without a hospitalization, we used patient provided information (PPI) on functional status obtained from an annual questionnaire completed by patients as part of routine care in the outpatient setting (S1 Table). Baseline data were collected from the closest available date prior to dialysis, extending back up to 30 days for laboratory values, 1 year for outpatient functional status and 2 years for comorbidities. Laboratory results for hemoglobin, creatinine, CRP, phosphorus and albumin were pulled from the EHR. GFR was calculated using the CKD-EPI equation [32]. Mortality and death dates were identified by an EHR review through December 27, 2013 and were supplemented with online queries for publicly available death certificates and obituaries for each individual patient based on name and date of birth.

Statistical analysis
We compared descriptive statistics of the study cohort with those of each prognostic index development study, with the exception of NCI, for which we used data from a validation cohort in a study focused on elderly incident RRT patients [28]. Data for all variables used in the prognostic indices are presented as means and standard deviations for continuous variables, and counts and frequencies for categorical variables. A score for each patient for each of the six prognostic instruments was calculated based on the original model parameters specified in their respective development papers. Categorization into high and low risk groups also followed the classifications in the original papers. A separate logistic regression model was run for each of the indices to predict death at 3 and/or 6 months post RRT initiation using the prognostic score as the independent variable. Indices were compared using the concordance ("c")-statistic, corresponding to the ROC; a higher c-statistic indicates a better performing model. Sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios were also calculated. We created calibration plots to evaluate predicted probability of death vs. true observed mortality rate (true probability) in R using the "Presence-Absence" package. A Hosmer-Lemeshow goodness of fit test was performed for each index to assess whether differences between the observed and expected proportions of the outcome were significant, indicating poor model fit. The investigators and analysts were not blinded to the tools, predictors or outcome.
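As an illustration of the goodness of fit test described above, the sketch below computes a Hosmer-Lemeshow statistic in Python by sorting patients on predicted probability, splitting them into equal-sized groups, and comparing observed with expected deaths in each group. Several choices here are assumptions rather than details from the study: the use of 10 groups, the degrees of freedom (groups minus 2, the usual convention when the model was fitted on the same data), and the function name `hosmer_lemeshow`. The closed-form p-value used here is only valid for an even number of degrees of freedom.

```python
import math


def hosmer_lemeshow(pred, died, groups=10):
    """Hosmer-Lemeshow statistic and p-value (assumes df = groups - 2).

    pred: predicted probabilities of death, one per patient.
    died: 0/1 observed outcomes in the same order.
    groups must be even and >= 4 so that df is even and the chi-square
    survival function has a simple closed form.
    """
    if groups < 4 or groups % 2:
        raise ValueError("groups must be an even number >= 4")
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        n_g = len(idx)
        observed = sum(died[i] for i in idx)
        expected = sum(pred[i] for i in idx)
        mean_p = expected / n_g
        if mean_p <= 0.0 or mean_p >= 1.0:
            raise ValueError("group mean prediction must lie strictly in (0, 1)")
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    # Chi-square survival function for even df = 2k:
    # exp(-x/2) * sum_{i<k} (x/2)^i / i!
    k = (groups - 2) // 2
    half = stat / 2.0
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, p_value


# Hypothetical example: every patient is assigned a 50% risk but all die,
# so the test should flag poor fit (small p-value).
stat, p = hosmer_lemeshow([0.5] * 40, [1] * 40)
print(round(stat, 1), p < 0.01)
```

A small p-value indicates that observed and expected event counts differ more than chance would explain, which is how poor model fit was identified for the Foley index.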

Missing data
Complete data were available to implement three of the six indices (Foley, NCI and Wick). For the remaining three indices (REIN, Updated REIN, Thamer) almost 52% of patients were missing at least one variable, ranging from 3% missing the Barthel score to 38% missing the BMI. For missing BMI, albumin, and Barthel score data, we first tested the assumption that variables were missing at random (by testing for collinearity and interaction with other variables) and then imputed them by means of multiple imputation using chained equations (10 replications) in Stata. Prognostic scores were generated for each imputed data set separately and averaged over the fitted indices. Analysis was performed using Stata/MP version 15.1 (StataCorp LP) and R 3.4.2.

Results
The patient population of 349 older adults initiating RRT was, on average, 81.5±4.4 years old; 66% were male and 94.6% were non-Hispanic white (Fig 1, Table 1). This cohort was smaller and older than most of the development cohorts and most comparable to the REIN cohorts in terms of age and functional status. The overall burden of comorbidity was high, with coronary artery disease (CAD), chronic heart failure (CHF), and diabetes being the most common. Median survival was 351 days, with 132 patients dying before 90 days (37.8%) and 142 (40.6%) before 6 months. Sixty patients (17%) recovered renal function and discontinued RRT during follow-up; they were not censored, as our interest was in overall survival. Functional status was similar between patients who started as inpatients (Barthel score 83.6±22.3) and those who started as outpatients (84.5±22.2).
Because the indices of interest use different variables and predict different levels of risk, the resulting risk stratification of our cohort varied depending on which index was used (Fig 2, Table 2). The "high risk" designation was assigned to 22.6% of our cohort when the Foley index was used, compared to 0.9% of the cohort with the Thamer index. This was not necessarily consistent with the predicted mortality threshold corresponding to "high risk" of death, since "high risk" in the Foley index corresponds to 90-100% 6-month mortality, whereas it is >55% 6-month mortality for the Thamer index (Table 2).
None of the indices performed well in our cohort, with only Wick having ROC >0.7, at 0.73 (95% CI: 0.68, 0.78). A comparison of ROCs across all 6 indices indicated that they did not substantially differ in their predictive ability (Table 3, Fig 3). Four indices (REIN, NCI, Wick, and Thamer) underestimated mortality for the highest risk group, while Foley markedly overestimated it (Table 2). Table 3 shows positive and negative likelihood ratios, with the Thamer, Wick, and Foley indices performing best.
Calibration plots for each of the indices are shown in Fig 4. For the two indices predicting 3-month mortality (Updated REIN and Thamer), 3-month predictions were slightly better calibrated than their 6-month counterparts. Of the two indices with highest discrimination, the Foley index was weakly calibrated with poor overall model fit (p <0.01), while the Wick index was relatively well-calibrated.
Using the pre-specified cutoffs for "high risk" defined by each index, the PPV for mortality in the high risk group ranged from 41.9% to 62.4% (Wick performed the best), and NPV ranged from 0-100% (REIN and Thamer performed the best).
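For readers less familiar with these measures, the following Python sketch shows how PPV, NPV, sensitivity, specificity, and likelihood ratios follow from a 2x2 table of risk classification against observed death. The function name `diagnostic_metrics` and the counts are hypothetical and are not the study's data; undefined ratios (zero denominators) are returned as NaN, which is one way an empty "high risk" group could be handled.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp, fp, fn, tn):
    """Classification metrics for a 'high risk' flag versus observed death.

    tp: flagged high risk and died      fp: flagged high risk and survived
    fn: not flagged and died            tn: not flagged and survived
    """
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": _ratio(tp, tp + fp),   # share of high-risk patients who died
        "npv": _ratio(tn, tn + fn),   # share of lower-risk patients who survived
        "lr_positive": _ratio(sensitivity, 1.0 - specificity),
        "lr_negative": _ratio(1.0 - sensitivity, specificity),
    }


# Hypothetical counts for 100 patients.
for name, value in diagnostic_metrics(tp=30, fp=20, fn=10, tn=40).items():
    print(f"{name}: {value:.3f}")
```

Note that PPV and NPV depend on the mortality rate in the cohort being tested, so the same cutoff can yield different predictive values in a population with higher early mortality than the development cohort.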

PLOS ONE
To improve the performance of these indices, we identified different risk thresholds for each index that would be optimized for our patient population. The following cutoff scores yielded a specificity of >50% and >90%, respectively, in predicting mortality: Foley 7 and 10, NCI 4 and 10, REIN 4.2 and 8.2, updated REIN 12.1 and 16.4, Thamer 4.2 and 6.5, Wick 6 and 10.
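One way such cutoffs could be derived is sketched below in Python: scan the observed score values in increasing order and return the lowest cutoff at which specificity exceeds a target. This is a sketch under stated assumptions, not the procedure confirmed by the study: it assumes patients with a score at or above the cutoff are classified as high risk, and the function name `lowest_cutoff_for_specificity` and the toy data are hypothetical.

```python
def lowest_cutoff_for_specificity(scores, died, target):
    """Find the lowest score cutoff whose specificity exceeds `target`.

    Assumption: score >= cutoff means 'high risk', so specificity is the
    fraction of survivors whose score falls below the cutoff.
    Returns (cutoff, specificity), or None if no cutoff reaches the target.
    """
    survivors = [s for s, d in zip(scores, died) if not d]
    if not survivors:
        raise ValueError("need at least one survivor to compute specificity")
    for cutoff in sorted(set(scores)):
        specificity = sum(s < cutoff for s in survivors) / len(survivors)
        if specificity > target:
            return cutoff, specificity
    return None


# Hypothetical toy data: ten patients with integer scores 1 to 10.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_died = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
print(lowest_cutoff_for_specificity(toy_scores, toy_died, 0.50))
print(lowest_cutoff_for_specificity(toy_scores, toy_died, 0.90))
```

Cutoffs tuned on the same cohort in which they are evaluated will tend to look better than they would in new patients, so thresholds found this way would themselves need external validation.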

Discussion
Prognostic information is desired by patients and can facilitate and improve shared decision making [9,33]. We tested six indices predicting short-term (3- or 6-month) mortality at the start of RRT [16]. Their performance in our population-based cohort of elderly incident RRT patients was variable. The discrimination, which reflects the probability that a randomly selected patient who died had a higher risk score than a patient who did not die, was poor for all indices except for Wick, which had good discrimination (ROC 0.73). Calibration, i.e. the agreement between observed and expected (i.e. predicted) outcomes, was acceptable only for the Wick and Thamer indices. All of the indices fell short in their ability to predict death for the highest risk group. Most concerning was the low positive and high negative predictive values of all the prognostic indicators in the highest risk patient subgroups, as this may lead patients and clinicians to forego life-sustaining treatment due to underestimation of life expectancy and potential benefit. The indices performed considerably better in predicting survival for the lowest risk patients. Thus, they may be more helpful to promote optimism and treatment options such as dialysis and kidney transplant for patients with reasonably good chances of survival.
The indices that performed best in our elderly cohort included functional status and hospitalizations in the last 6 months, as well as proxy variables suggestive of unplanned dialysis start; all three are important markers of poor health or sentinel events [34][35][36][37]. The other indices variably included similar variables but not all three. While disappointing, the c-statistics for the different indices in our validation study are similar to those reported in multiple other validation studies summarized in our recent systematic review and thus can also be seen to show reasonable reproducibility of the initial studies [16]. It is not unusual for prognostic indices to perform worse in a new population than in the development cohorts, and our findings again demonstrate how difficult it is to develop completely accurate and reliable models that are generalizable to different settings of a heterogeneous patient population. When the Foley index was initially validated it did poorly; the discrimination of the REIN index ranged from 0.68-0.74 in the initial development and validation study and has varied from 0.66-0.70 in external validation studies, and that of the NCI has ranged from 0.60-0.91 in external validations [16,18,19,22,24,28,29,38]. Our c-statistics are, however, lower than those reported by Ramspek et al. [39] in a recent validation study. Their study looked at 1-year prognosis and thus included a different set of prognostic indices, with the Foley index being the only one included in both studies. We were unable to include the two best performing indices in the Ramspek study because they included variables not available at dialysis start, including dialysis adequacy and treatment modality after 3 months on dialysis [25,26]. In addition to the size and generalizability of their population-based cohort, the difference in discrimination may also relate to the fact that we looked at mortality from the day of RRT initiation, whereas they gathered baseline data and started the prediction validation at day 90 of RRT. Thus, our mortality rate was significantly higher than in the other studies, and higher than the mortality in our own population when limited to patients who survived the initial 30 days of HD [40]. While delaying the start of follow-up has long been customary in studies of ESKD to ensure that patients do in fact have ESKD as opposed to acute kidney injury, we feel this fails to help patients and their clinicians make decisions at the time of dialysis initiation and fails to account for the high early mortality [5].

Table 2. Breakdown of cohort into predicted risk categories by index, actual and predicted mortality (%) at 3 months and 6 months after dialysis initiation by risk score.

The lack of generalizability of the examined indices likely stems from the varied populations in which these indices were developed and from differences in the predictive variables chosen, and may reflect overfitting to the development populations. Our population differed by representing a narrower age range with a higher mortality rate than reported in most of the development studies. If age were appropriately factored into the models, however, then applying their weights should yield accurate results. Also, a well calibrated index should be able to perform in new populations with higher and lower mortality rates than the original development populations. We acknowledge that the small size of our cohort contributes to the poor fit of the indices, but it is representative of the difficulties likely faced by other health care organizations with a limited number of patients with incident ESRD. The event rate for our primary outcome of 6-month mortality was approximately 40%; thus, we were sufficiently powered to assess all of the indices.
Moreover, our cohort is limited by its small size. Its racial homogeneity, being mostly white, also contrasts with the general US ESKD population, but it is unclear how it compares to the populations used in most previous development and validation studies, which often did not report on race [16]. Another limitation is the use of secondary EHR data, which is only as good as the initial documentation allows. We did supplement manual extraction with validated algorithms for data extraction for important variables, as well as imputation for key missing variables necessary for the construction of the index scores. When imputing, we made sure that our data suggested that values were missing at random. We used different methods to assess functional status for inpatients (nursing assessment) vs. outpatients (patient survey); however, the average values for those two methods were similar. We did not censor patients at the time of renal recovery, the rate of which was similar to that previously reported in our practice [41]. Since we were not directly estimating survival but rather testing the tools' accuracy for predicting death at a certain time point, the effect of this should be negligible. Finally, our study was limited to a single network, and local practice patterns could have introduced some bias. Nonetheless, we closely adhered to the CHARMS recommendations for prognostic validation studies and used manual data abstraction from the narrative medical record to supplement electronic pulls of secondary data. Our study adds to the small number of studies assessing prognostic index performance in elderly dialysis patients [18,19,42] and also serves as the first external validation of three of the included indices (Wick, Thamer and updated REIN) [19,24,29].

Discussions about prognosis and goals of care are especially poignant and relevant for older dialysis patients. The uptake of prognostic indices into clinical practice has been poor, with most patients reporting having had no discussions about prognosis at dialysis start [6,43]. Even for tools that are frequently used in clinical settings (e.g. APACHE III in the ICU), concerns about the ability of prognostic tools to predict accurately for an individual patient lead to a lack of bedside discussions. In qualitative studies, clinicians have expressed skepticism regarding the reliability and accuracy of available tools [9]. Our study confirms that they are justified in their concern. Discrimination between 0.70-0.73 is not sufficient to support high stakes decisions advising on whether to initiate or defer life-sustaining dialysis treatment. Another concern is the wide variation in the gradient of risk (i.e. the percent expected mortality deemed to be "high" by each model), which hinders their interpretability and clinical utility to patients, caregivers, and clinicians. The low positive predictive value for death noted for all the indices was particularly concerning. In fact, many of the indices paradoxically had a higher negative than positive predictive value for the highest risk category, a function of the fact that the mortality risk in the highest risk groups in the development cohorts was lower than 50% for many of the indices. The utility of such predictions at the bedside to aid treatment choice is thus questionable, especially when coupled with the inability to predict patient-important outcomes for the alternative of no intervention.
Understanding if certain risk thresholds are more or less meaningful to patients and clinicians and how they influence treatment has not been well studied. Furthermore the importance of precision to clinicians and patients in this context also remains unclear.
Even if prognostic indices may not perform well enough on an individual level, they may still be acceptable for use on a population level in shaping policy. In particular, Medicare coverage in the U.S. limits patients to coverage of either dialysis or hospice, not both, as dialysis is considered a life extending treatment. Patients are eligible for the Medicare Hospice benefit if they are deemed more likely than not to die in the next 6 months. The Thamer, Wick and Foley indices predict more than 50% risk of death within the next 6 months for patients in their highest risk categories. This can support arguments for dual coverage of hospice and dialysis in this high risk group, which in turn could help high risk dialysis patients avoid the aggressive and costly treatments that they typically are subject to at the end of life [12,13].
Developing a more accurate, reliable and generalizable mortality prediction model for older adults facing the decision of whether or not to initiate dialysis may require larger multi-center studies and consideration of a wider array of risk factors including cognitive and functional status, frailty, and social determinants of health [35,36,44][45][46][47]. Additionally, advanced analytic methods such as machine learning and artificial intelligence may help identify highest risk patients and facilitate generalizable self-learning models that adapt to each population and setting [48].

Conclusion
None of the indices performed well in predicting early mortality for the highest risk group in our cohort of elderly incident dialysis patients. The Wick index performed best in terms of discrimination, with two other indices, Thamer and Foley, having acceptable performance. The future will tell if big data and artificial intelligence can develop more accurate prediction tools, but more importantly, a better understanding of the role of prognosis at the bedside is needed to promote shared decision making.