Prediction Models and Their External Validation Studies for Mortality of Patients with Acute Kidney Injury: A Systematic Review

Objectives To systematically review AKI outcome prediction models and their external validation studies, to describe the discrepancy in reported accuracy between the results of internal and external validation, and to identify variables frequently included in the prediction models. Methods We searched the MEDLINE and Web of Science electronic databases (until January 2016). Studies were eligible if they derived a model to predict mortality of AKI patients or externally validated at least one of the prediction models, and presented the area under the receiver operating characteristic curve (AUROC) to assess model discrimination. Studies were excluded if they described only results of logistic regression without reporting a scoring system, or if a prediction model was generated from a specific cohort. Results A total of 2204 potentially relevant articles were found and screened, of which 12 articles reporting original prediction models for hospital mortality in AKI patients and nine articles assessing external validation were selected. Among the 21 studies of AKI prediction models and their external validation, 12 were single-center (57%), and only three included more than 1,000 patients (14%). The definition of AKI was not uniform, and none used the recently published consensus criteria for AKI. Although good performance was reported in internal validation, most of the prediction models showed poor discrimination, with an AUROC below 0.7, in the external validation studies. Ten non-renal variables were reported in more than three prediction models: mechanical ventilation, age, gender, hypotension, liver failure, oliguria, sepsis/septic shock, low albumin, consciousness and low platelet count. Conclusions The information in this systematic review should be useful for future prediction model derivation, by providing potential candidate predictors, and for future external validation, by listing the published prediction models.

Introduction Acute kidney injury (AKI) is a common complication among critically ill patients, and its associated mortality is high [1][2][3][4]. Reliable AKI-specific scoring systems are important to predict the outcome of AKI patients and to provide severity stratification for clinical studies. However, general severity scores for critically ill patients, e.g., the Acute Physiology and Chronic Health Evaluation (APACHE) [5][6][7], the Simplified Acute Physiology Score (SAPS) [8,9], and the Mortality Probability Model [10], have shown conflicting results on the accuracy of predicting mortality in AKI patients [11][12][13], partly because those scores were derived from data that included only a few AKI patients.
Over the past three decades, multiple AKI outcome prediction models, incorporating physiologic and laboratory variables, organ dysfunction and pre-existing comorbidity, have been derived [14][15][16][17][18][19][20]. In the 21st century, five additional prediction models have been generated [12,[21][22][23][24]. Although internal validation of these prediction models has shown good accuracy, the results of external validation studies for the models have been unsatisfactory [11,25,26]. Currently, there is neither a consensus nor a guideline recommending which prediction model to apply in clinical practice.
The objectives of this study are to systematically review AKI outcome prediction models and their external validation studies, to describe the discrepancy in reported accuracy between the results of internal and external validation, and to identify variables frequently included in the prediction models, which might be useful for future prediction model derivation.

Studies eligible for review
Studies published in the medical literature were eligible if they derived a model to predict mortality of AKI patients or externally validated at least one of the prediction models, and presented the area under the receiver operating characteristic curve (AUROC) [27] or the concordance index (c-statistic) to assess model discrimination. Studies were excluded if they described only results of logistic regression without reporting a scoring system, or if a prediction model was generated from a specific cohort. Unpublished conference abstracts were also excluded. This study followed the principles of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (S1 PRISMA Checklist) [28].

Literature review and study selection
We searched the MEDLINE and Web of Science electronic databases (until January 2016). In the MEDLINE search, we used the terms "acute kidney injury" (MeSH Terms), "statistical model" (MeSH Terms), "predictive value of tests" (MeSH Terms) and "validation". In Web of Science, we used the keywords "acute kidney injury", "acute renal failure", "model", "prediction", "predictor", "validity" and "validation". The references of all selected articles were searched to identify any additional eligible studies. The search was restricted to human subjects. Each article selected by the primary reviewer (TO) was assessed by the second reviewer (SU) to confirm eligibility.

Data extraction
A standardized data abstraction form was used to collect data on study characteristics and outcomes of interest. The data collected to describe the characteristics of articles reporting original outcome prediction models were the type of study, study period, number of centers, sample size, mean age, gender, region, population, renal replacement therapy (RRT) requirement, hospital mortality, AKI definition, exclusion criteria, follow-up and variables included in the prediction models. The following information was also collected for quality assessment of the prediction models: definition of predictors, whether indications for RRT were defined, handling of missing data, use of bootstrap resampling, multivariable analysis approach, events-per-variable ratio and internal validation cohort.
The data collected to describe the characteristics of the external validation articles were the type of study, study period, number of centers, sample size, mean age, hospital mortality, number of validated models and methods of assessing discrimination and calibration. The AUROCs reported in both the original prediction models and the external validation studies were also collected.

Results
A total of 2204 potentially relevant articles were found and screened, of which 80 were retrieved for detailed evaluation (Fig 1). We excluded five articles that had no prediction models developed by multivariable regression analysis, six articles that had no discrimination results, seven articles that validated only general severity scores or had no external discrimination results, and 41 articles that assessed specific cohorts (cardiac surgery: 10, contrast-induced nephropathy: eight, others: 23). The 59 articles excluded from this study are listed in a supplement file (S1 File). Finally, 12 articles reporting original prediction models for hospital mortality in AKI patients [12,[14][15][16][17][18][19][20][21][22][23][24] and nine additional articles assessing external validation of the outcome prediction models [11,25,26,[29][30][31][32][33][34] were selected for analysis. Five of the 12 articles reporting original prediction models also externally validated other models, yielding 14 articles in total for external validation.
Characteristics of the 12 articles reporting outcome prediction models for AKI are shown in Tables 1 and 2. The study sample size ranged from 126 to 1,122 patients, and hospital mortality ranged from 36% to 75%. Only five studies (Chertow 1998, Mehta, Lins 2004, Chertow 2006, Demirjian) included more than one center; the remaining seven were single-center. The definition of AKI was not uniform among the 12 articles, and none used the recently published consensus definitions for AKI. Quality assessment of these articles is shown in Table 3. How missing data were handled was described in only four articles, all of which also used bootstrap resampling. Eight articles used multivariable logistic regression analysis, and the other four (Rasmussen, Schaefer, Liano and Lins 2000) used multivariable linear regression analysis. The events-per-variable ratio was more than 10 in all articles except the earliest (Rasmussen).
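The internal-validation practices assessed in Table 3 (bootstrap resampling alongside an apparent AUROC) can be illustrated with a minimal optimism-corrected bootstrap. The sketch below uses plain numpy on simulated data; the predictors, coefficients and sample size are invented for illustration and do not correspond to any of the reviewed AKI models:

```python
# Sketch: optimism-corrected AUROC via bootstrap resampling, the internal
# validation approach used by the better-scoring articles in this review.
# All data are simulated -- this is NOT any of the reviewed AKI scores.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                    # 5 candidate predictors
true_beta = np.array([0.8, -0.5, 0.4, 0.0, 0.0])
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)

def fit_logistic(X, y, lr=0.1, iters=300):
    """Plain gradient-ascent logistic regression (no intercept, for brevity)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ w)))
        w += lr * X.T @ (y - p) / len(y)
    return w

def auroc(y, score):
    """AUROC via the Mann-Whitney rank formulation."""
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

w = fit_logistic(X, y)
apparent = auroc(y, X @ w)                     # AUROC on the derivation data
optimism = []
for _ in range(200):                           # bootstrap replicates
    idx = rng.integers(0, n, n)                # resample patients with replacement
    w_b = fit_logistic(X[idx], y[idx])
    optimism.append(auroc(y[idx], X[idx] @ w_b) - auroc(y, X @ w_b))
corrected = apparent - np.mean(optimism)
print(f"apparent AUROC {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

The corrected value is typically slightly below the apparent one; the gap tends to widen with small samples and low events-per-variable ratios, which is one reason the single-center models reviewed here may overstate their internal performance.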
Characteristics of the 14 external validation studies are shown in Table 4. The study sample size ranged from 197 to 17,326 patients, and hospital mortality ranged from 37% to 85%. Five studies were single-center. All studies evaluated discrimination with the AUROC, and nine studies evaluated calibration with the Hosmer-Lemeshow test. The AUROCs for hospital mortality reported in the original articles (internal validation) and in the external validation studies are shown in Fig 2. Seven recently published articles on AKI outcome prediction models reported AUROCs for internal validation, all above 0.7. All prediction models were externally validated by one or more studies. The AUROCs in the external validation studies were generally low (less than 0.7 in most studies). Moreover, the seven prediction models that were validated both internally and externally invariably had lower AUROCs in external validation than in internal validation. Table 5 shows the variables included in more than one prediction model, together with their odds ratios / p values. Ten non-renal variables were reported in more than three prediction models: mechanical ventilation, age, gender, hypotension, liver failure, oliguria, sepsis/septic shock, low albumin, consciousness and low platelet count. Renal variables (low creatinine and high urea) were often used together in the same prediction models.
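The calibration measure used by nine of the validation studies, the Hosmer-Lemeshow test, compares observed and expected deaths across bins of predicted risk. The numpy sketch below computes the statistic over deciles for simulated (deliberately well-calibrated) predictions; the data are invented for illustration, not drawn from any reviewed cohort:

```python
# Sketch: Hosmer-Lemeshow calibration statistic over deciles of predicted
# risk, as used by nine of the external validation studies in this review.
# Predictions and outcomes are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
p_hat = rng.uniform(0.05, 0.95, n)          # model-predicted mortality risk
y = (rng.random(n) < p_hat).astype(float)   # outcome simulated to be well calibrated

def hosmer_lemeshow(y, p, groups=10):
    """Sum over risk bins of (observed - expected)^2 / (n_g * pbar * (1 - pbar));
    compare against a chi-square distribution with groups - 2 df."""
    order = np.argsort(p)
    stat = 0.0
    for chunk in np.array_split(order, groups):
        obs = y[chunk].sum()                # observed deaths in this bin
        exp = p[chunk].sum()                # expected deaths in this bin
        pbar = exp / len(chunk)
        stat += (obs - exp) ** 2 / (len(chunk) * pbar * (1 - pbar))
    return stat

stat = hosmer_lemeshow(y, p_hat)
# For 10 groups, compare to chi-square with 8 df (critical value 15.51 at alpha = 0.05)
print(f"Hosmer-Lemeshow statistic: {stat:.2f}")
```

With well-calibrated predictions the statistic stays near its expected value of 8 (for 10 groups); a model that discriminates well but is miscalibrated in a new case mix, a pattern reported in several of the external validations, produces a much larger value.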

Key findings
We have systematically reviewed AKI outcome prediction models and their external validation studies. We found 12 articles reporting original prediction models for hospital mortality in AKI patients and nine articles assessing external validation of the outcome prediction models. Although good performance was reported in their internal validation, most of the prediction models had poor discrimination with an AUROC below the threshold of 0.7 in their external validation studies. We also identified 10 common variables that were frequently included in the prediction models.

Relationship to previous studies
The establishment of a clinical prediction model encompasses three consecutive research phases, namely derivation, external validation and impact analysis [35]. In this study, we conducted a systematic review of the first two phases in AKI outcome prediction. Several systematic reviews of clinical prediction models and their external validation have been conducted in other medical conditions, and they consistently found methodological limitations [36][37][38][39][40]. Such limitations include case-mix heterogeneity, small sample sizes, insufficient description of study design, and lack of external validation. We found the same limitations in the AKI outcome prediction studies. For example, all prediction models examined in this study were relatively old (data collected more than 10 years ago) and were derived before consensus criteria for AKI were published [41][42][43]. Therefore, the patients included in these prediction models were heterogeneous, with varied RRT requirement and mortality. We also found that more than half of the studies of AKI prediction models and their external validation were single-center (12/21, 57%), and most of them included fewer than 1,000 patients (18/21, 86%). Furthermore, the moment of data collection differed between the prediction models and the external validation studies. Data collection can be done at admission, at AKI diagnosis, at the start of RRT, at nephrologist consultation, and so on. Demirjian's model, for instance, collected variables at the start of RRT. Considering the poor generalizability of the currently available prediction models (AUROCs lower than 0.7 in most external validation studies), a large multicenter database using consensus AKI criteria will be needed both to derive and to validate AKI outcome prediction models. Among the prediction models included in this systematic review, Liano's score [17] was the most often evaluated externally (11 studies).
The AUROCs of Liano's score in external validation ranged from 0.55 to 0.90, and four of them were above 0.7. The reason why Liano's score showed high AUROCs in some external validation studies is unclear. It might be partly explained by the fact that Liano's score contains several risk factors that are frequently used in the prediction models (mechanical ventilation, age, gender, hypotension, liver failure, oliguria, consciousness disturbance), although Dharan's model also included nine variables yet showed poor discrimination in its one external validation study (Table 5).

Significance and implications
To derive an accurate prediction model, choosing appropriate candidate predictors is of great importance. Previous studies have shown that clinical intuition may not be suitable for identifying candidate predictors [44]. A better approach is to combine a systematic literature review of prognostic factors associated with the outcome of interest with the opinions of field experts [35]. We identified 10 common variables that were frequently included in the prediction models. These variables have also often been found to be related to mortality in more recent epidemiological studies using consensus AKI criteria [45][46][47][48]. We believe that our results will be useful for future studies deriving accurate AKI outcome prediction models, by identifying these variables for data collection. Although often included in the prediction models, we think that including both low creatinine and high urea concentrations as independent variables can be problematic (Table 5). Low serum creatinine is included in general severity scores as an independent variable [5]. Serum urea has been used as a marker of the timing of starting RRT in several studies, which showed that patients with higher urea at the start of RRT had worse outcomes than patients with lower urea [49]. High urea is also included in general severity scores [5]. However, serum creatinine and urea concentrations clearly show strong collinearity: in AKI patients, urea is almost always high when creatinine is high. Even if both variables are found to be independent in multivariable analysis, it seems unlikely that including both in a prediction model will improve its predictive ability [50].
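The collinearity concern can be checked directly during model derivation. A minimal sketch (simulated creatinine and urea values, with a strong coupling chosen purely for illustration) computes their Pearson correlation and the variance inflation factor (VIF), where a VIF well above roughly 5-10 is a common signal to keep only one of the pair:

```python
# Sketch: detecting collinearity between serum creatinine and urea before
# including both in a prediction model. All values are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = 300
creatinine = rng.lognormal(mean=0.8, sigma=0.5, size=n)   # mg/dL (simulated)
urea = 10 * creatinine + rng.normal(0, 3, size=n)         # strongly coupled, as in AKI

# Pearson correlation between the two candidate predictors
r = np.corrcoef(creatinine, urea)[0, 1]

# VIF of urea given creatinine: 1 / (1 - R^2) from regressing one on the other
X = np.column_stack([np.ones(n), creatinine])
beta, *_ = np.linalg.lstsq(X, urea, rcond=None)
resid = urea - X @ beta
r2 = 1 - resid.var() / urea.var()
vif = 1 / (1 - r2)
print(f"correlation {r:.2f}, VIF {vif:.1f}")              # VIF >> 10 -> keep only one
```

With more than two candidate predictors, the same 1 / (1 - R^2) computation is repeated, regressing each predictor on all the others; strongly collinear pairs such as creatinine and urea inflate coefficient variance without adding discriminative information.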
Physicians are faced with the impractical situation of having to choose among many competing outcome prediction models for AKI. To overcome this issue, it is recommended that investigators with large data sets conduct external validation studies of multiple existing models at once, in order to determine which model is most useful [51]. We believe that our results will also be useful for future studies by providing a list of the published outcome prediction models for AKI.

Strengths and limitations
The strength of our study is that, to the best of our knowledge, this is the first systematic review of AKI outcome prediction models in the medical literature. We reviewed studies of both prediction models and their external validation, and provide potential candidate variables for future prediction models as well as a list of published prediction models for future external validation studies. However, our study also has several limitations. First, recent studies suggest that AKI biomarkers might be useful for predicting outcome and could be combined with physiological and laboratory variables to improve predictive ability [52,53]. However, prediction models should include only variables that are available at the time when the model is intended to be used, and biomarkers are not yet widely used clinically [54]. Second, we excluded six studies because discrimination results were not available [55][56][57][58][59][60]. However, these studies were generally old, small, and of poor methodological quality; we believe that including them would not change our main findings. Finally, the AKI definitions used in both the prediction models and their external validation studies are outdated, and the included studies were relatively old (the most recently published study is from 2011, with data collected between 2003 and 2007). There is an urgent need for a mortality prediction model based on current definitions of AKI, and this systematic review can be considered a first step toward accomplishing this task.

Conclusions
Multiple outcome prediction models for AKI have been derived previously. These scores showed good performance in their internal validation studies but poor performance in external validation, suggesting that no accurate model is currently available. To generate accurate AKI prediction models, several recommendations can be made: use a large multicenter database, apply consensus AKI criteria, and collect the variables frequently used in previous models (mechanical ventilation, age, gender, hypotension, liver failure, oliguria, sepsis/septic shock, low albumin, consciousness and low platelet count). The information in this systematic review should be useful both for future prediction model derivation, by providing potential candidate predictors, and for future external validation, by listing the published prediction models.