Prognostic Abilities and Quality Assessment of Models for the Prediction of 90-Day Mortality in Liver Transplant Waiting List Patients

Background
The Model for End-Stage Liver Disease (MELD) score and several of its variants are widely used for prognosis on liver transplant waiting lists.
Methods
818 consecutive patients on the liver transplant waiting list were included to calculate the MELD, MESO Index, MELD-Na, UKELD, iMELD, refitMELD, refitMELD-Na, upMELD and PELD scores. Prognostic abilities for 90-day mortality were investigated with receiver operating characteristic (ROC) curve analysis. Independent risk factors for 90-day mortality were identified with multivariable binary logistic regression modelling. Methodological quality of the underlying development studies was assessed with a systematic assessment tool.
Results
74 patients (9%) died on the liver transplant waiting list within 90 days after listing. All scores except the refitMELD-Na had acceptable prognostic performance, with areas under the ROC curve (AUROCs) >0.700. The iMELD performed best (AUROC = 0.798). In pediatric cases, the PELD score just failed to reach the acceptable threshold, with an AUROC of 0.699. The mean quality score across all scores was 72.3%. The highest quality scores were achieved by the UKELD and PELD scores. The development studies were weakest in statistical validity and model evaluation.
Conclusions
Inferior quality assessment of prognostic models does not necessarily imply inferior prognostic abilities. The iMELD might be a more reliable tool for representing urgency of transplantation than the MELD score. The PELD score is presumably not accurate enough to allow graft allocation decisions in pediatric liver transplantation.


Introduction
The Model for End-Stage Liver Disease (MELD) score was originally developed as a prognostic model to estimate 90-day mortality for patients who require a transjugular intrahepatic portosystemic shunt procedure [1]. The original MELD score is based on three laboratory values (serum creatinine, serum bilirubin and the International Normalized Ratio (INR)) and the cause of cirrhosis [1]. In 2001, Kamath et al. evaluated the MELD score as a prognostic model in patient groups with a broader range of disease severity and etiology and suggested its application in donor liver allocation policies for liver transplantation [2]. Subsequently, Wiesner et al. assessed the capability of the MELD score, without the cause of cirrhosis in the formula, to correctly rank potential liver recipients according to their severity of liver disease and mortality risk on the OPTN liver waiting list in the US. They showed that the MELD score can accurately predict 90-day mortality among patients with chronic liver disease on the liver waiting list and can be applied for allocation of donor livers (see Table 1) [3]. Today, the MELD score is applied in liver allocation policies in several countries worldwide, including the US and Germany [4,5]. Since its introduction in German allocation policies, waiting list mortality has decreased from approximately 20% to 10%, while post-transplant patient survival has declined significantly, leading to 1-year survival rates that are up to 20% lower than in the United States and the United Kingdom [6][7][8][9]. It is astonishing that the MELD score was introduced into liver allocation policies in Germany in December 2006 without prior validation of its prognostic ability in German waiting list patients, violating one of the essential quality assessment criteria for prognostic models as proposed by Jacob et al. in 2005 [10].
Since the introduction of MELD-based liver allocation in several countries, further prognostic models have been developed to predict 90-day mortality in liver transplant candidates, including the MELD-sodium Index (MESO Index), MELD-Natrium-Score (MELD Na), United Kingdom End-Stage Liver Disease Score (UKELD), integrated MELD (iMELD), Revised Model for End-Stage Liver Disease (refitMELD), Revised Model for End-Stage Liver Disease including sodium (refitMELD Na) and updated MELD Score (upMELD) (Table 1) [11][12][13][14][15][16]. For pediatric patients, the Pediatric End-Stage Liver Disease (PELD) score has been developed (Table 1) [17]. None of the above-mentioned prognostic models has so far been validated in German waiting list patients. The current study aims to validate these prognostic models in a separate cohort of waiting list patients from Germany and to assess their fulfillment of the quality assessment criteria for prognostic models as proposed by Jacob et al. [10].

Inclusion and exclusion criteria
This is a single-center retrospective study including all waiting list patients (n = 818) listed for liver transplantation at Hannover Medical School between 1 January 2007 and 31 December 2013 who were either transplanted during that time interval or delisted due to clinical improvement or death (464 males (56.7%), 354 females (43.3%), median age at listing 46.0 years, range 0.02-73.6 years). Cases that were still on the waiting list for liver transplantation after 31 December 2013 were excluded from analysis. Pediatric patients (n = 232, 28.4%) were defined as younger than 17 years due to specific donor organ allocation policies for these patients [5,18]. The distribution of relevant clinical characteristics is summarized in Table 2.

Calculation of the investigated prognostic models and scores
Table 1 summarizes the equations and handling of variables for the calculation of the investigated scores including MELD, MESO Index, MELD-Na, UKELD, iMELD, refitMELD, refitMELD-Na, upMELD and PELD, as previously described [3,[11][12][13][14][15][16][17]. These scores have been analyzed as prognostic models for the prediction of study endpoints.
Table 1 footnotes:
* If the patient is less than one year old (scores for patients listed for liver transplantation before the patient's first birthday continue to include the value assigned for age (<1 year) until the patient reaches the age of 24 months).
** A patient has growth failure (<-2 standard deviations) if either the patient's height is less than or equal to the expected sex- and age-matched low height value or the patient's weight is less than or equal to the expected sex- and age-matched low weight value. The OPTN/UNOS PELD Calculator is used for candidates who are under 12 years old.
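As an illustration of how such a score is computed, the following minimal Python sketch implements the MELD score without the etiology term in its widely cited UNOS form. Table 1 is not reproduced in this text, so the coefficients, the lower bound of 1.0, the creatinine cap of 4.0 mg/dl and the unit conversion factors are assumptions taken from that commonly published form rather than from Table 1 itself. The function names are hypothetical, and further allocation rules (e.g. rounding, dialysis handling, score caps) are deliberately omitted.

```python
import math

# Assumed conversion factors from µmol/l (as reported in this cohort) to mg/dl.
UMOL_PER_MGDL_CREATININE = 88.4
UMOL_PER_MGDL_BILIRUBIN = 17.1


def meld_score(creatinine_mgdl, bilirubin_mgdl, inr):
    """MELD without etiology term (assumed UNOS form), unrounded."""
    # Values below 1.0 are raised to 1.0 so that no logarithm becomes negative.
    creatinine = min(max(creatinine_mgdl, 1.0), 4.0)  # creatinine also capped at 4.0 mg/dl
    bilirubin = max(bilirubin_mgdl, 1.0)
    inr = max(inr, 1.0)
    return (9.57 * math.log(creatinine)
            + 3.78 * math.log(bilirubin)
            + 11.2 * math.log(inr)
            + 6.43)


def meld_from_si(creatinine_umol, bilirubin_umol, inr):
    """Convenience wrapper (hypothetical) for laboratory values given in µmol/l."""
    return meld_score(creatinine_umol / UMOL_PER_MGDL_CREATININE,
                      bilirubin_umol / UMOL_PER_MGDL_BILIRUBIN,
                      inr)
```

With all three laboratory values at or below their lower bound, the logarithmic terms vanish and the sketch returns the constant 6.43.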

Quality assessment of prognostic models
Quality assessment of the investigated prognostic models was carried out using the quality assessment tool for prognostic models in transplantation proposed by Jacob and co-workers [10]. The assessment covered the categories internal validity (quality subheadings 1-4), external validity (quality subheadings 1-2), statistical validity (quality subheadings 1-4), evaluation of the model (quality subheadings 1-4) and practicality of the model (quality subheadings 1-4) (see Table 3). Each investigated prognostic model was judged by the quality criteria described in detail under the respective quality subheadings by giving zero or one point for each subheading, depending on the fulfilment of the respective quality criterion, leading to an overall minimum of zero points and a maximum of 20 points per assessed study. To achieve an equal weighting of the individual categories, the score for the two subheadings of external validity was multiplied by a factor of two. This assessment was made independently by three authors (R.S., A.K. and M.B.); all questions and doubts regarding the quality assessment of each subheading were documented and discussed.
Furthermore, spider web diagrams were created for each of the prognostic models (Fig 1). Each corner of the pentagonal spider web represents one category of the quality assessment tool (internal validity, external validity, practicality of the model, evaluation of the model, and statistical validity). Each color represents one evaluator. An optimal assessment result would be an outer line connecting the five corner points. The nearer the line is to the center of the spider web, the poorer the assessed quality of the category.
Table 3. Quality assessment tool for prognostic models based on Jacob et al. [10].
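The scoring rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the category names and the number of subheadings follow the description in this section, while the function names and the data layout (one list of 0/1 points per category) are assumptions made for the example.

```python
# Number of quality subheadings per category, as described in this section.
CATEGORY_SUBHEADINGS = {
    "internal validity": 4,
    "external validity": 2,
    "statistical validity": 4,
    "evaluation of the model": 4,
    "practicality of the model": 4,
}
# External validity has only two subheadings and is therefore weighted by two.
WEIGHTS = {"external validity": 2}
MAX_POINTS = 20


def quality_score(ratings):
    """Total points (0-20) for one study as rated by one evaluator."""
    total = 0
    for category, n_subheadings in CATEGORY_SUBHEADINGS.items():
        points = ratings[category]
        if len(points) != n_subheadings or any(p not in (0, 1) for p in points):
            raise ValueError("invalid ratings for category: " + category)
        total += WEIGHTS.get(category, 1) * sum(points)
    return total


def mean_quality_score(evaluator_ratings):
    """Mean total score of one study across several evaluators."""
    scores = [quality_score(r) for r in evaluator_ratings]
    return sum(scores) / len(scores)


def as_percent(points):
    """Express a point total as a percentage of the 20-point maximum."""
    return 100.0 * points / MAX_POINTS
```

For example, a mean total of 14.45 points corresponds to `as_percent(14.45)`, i.e. 72.25% of the maximum.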

Study endpoints
The primary study endpoint was 90-day mortality on the transplant waiting list. The analyses aimed to compare the areas under the receiver operating characteristic curve (AUROCs) of the investigated prognostic models for the prediction of 90-day mortality, using data from the complete cohort as well as from clinically relevant sub-cohorts, including pediatric and adult waiting list candidates and waiting list candidates with specific indications for liver transplantation. The secondary study endpoint was the comparison of the investigated prognostic models using the mean quality assessment scores provided by the three evaluators. In addition, independent risk factors for 90-day mortality were identified in order to find possible explanations for differences in prognostic model performance in the investigated cohort.

Statistical analysis
This study was performed in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement to ensure the highest possible quality standards [19].
Receiver Operating Characteristic (ROC)-curve analysis was performed to calculate the sensitivity, specificity, and overall model correctness of the investigated prognostic models. AUROCs larger than 0.700 indicate a potentially clinically useful prognostic model [20][21][22][23].
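To make the AUROC criterion concrete, the following Python sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen patient who died has a higher score than a randomly chosen survivor, with ties counted as one half. The study itself used SPSS; this code is not the authors' procedure, and the function names and example values are hypothetical.

```python
def auroc(scores, died):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    cases = [s for s, d in zip(scores, died) if d]
    controls = [s for s, d in zip(scores, died) if not d]
    if not cases or not controls:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5  # ties count as half a concordant pair
    return concordant / (len(cases) * len(controls))


def sensitivity_specificity(scores, died, cutoff):
    """Sensitivity and specificity when scores >= cutoff are classified as high risk."""
    tp = sum(1 for s, d in zip(scores, died) if d and s >= cutoff)
    fn = sum(1 for s, d in zip(scores, died) if d and s < cutoff)
    tn = sum(1 for s, d in zip(scores, died) if not d and s < cutoff)
    fp = sum(1 for s, d in zip(scores, died) if not d and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def potentially_useful(auc, threshold=0.700):
    """Apply the AUROC > 0.700 criterion used in this study."""
    return auc > threshold
```

Under this criterion an AUROC of 0.699 is classified as not reaching the threshold, whereas 0.798 is.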
The relevance of variables as risk factors for the study endpoints was analyzed with binary logistic regression analysis. All statistically significant risk factors from univariable analyses were entered into multivariable risk-adjusted models after exclusion of collinearity to identify independent risk factors for 90-day mortality (forward likelihood ratio inclusion method). For all statistical tests a p-value <0.05 was defined as significant. The SPSS statistics software version 21.0 (IBM, Somers, NY, USA) was used to perform statistical analysis.

Ethical considerations
The institutional review board of the Hannover Medical School reviewed and approved this study (approval decision number 1683-2013). All patients have agreed that their data may be used for scientific purposes. All data were fully anonymized and de-identified by the primary investigator (last author) before they were accessed for this study.
Primary data cannot be published with the manuscript due to local institutional policy restrictions. However, fully anonymized and de-identified data will be made available upon request by the corresponding author.
None of the transplant donors were from a vulnerable population and all donors or next of kin provided written informed consent that was freely given.

Events during follow-up
Mean follow-up was 228 days (median: 91.5 days, standard deviation (SD): 325.4 days, range: 0-2323 days) until liver transplantation (n = 567, 69.3%), patient's death (n = 150, 18.3%) or delisting (n = 101, 12.3%). 90-day mortality on the waiting list was observed in 74 patients (9%). Of the 101 delisted patients, 31 were delisted due to clinical improvement or stable disease, three due to progression of hepatocellular carcinoma, four due to non-compliance, 17 due to transfer to another transplant center, 21 due to deterioration of their clinical condition, and three due to their own decision. For 22 patients no reasons for delisting were documented.
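The reported counts and percentages can be cross-checked with a short Python sketch. All numbers are taken from the paragraph above; the variable and function names are illustrative only.

```python
COHORT_SIZE = 818

# Events ending follow-up, as reported above.
outcomes = {"transplanted": 567, "died": 150, "delisted": 101}

# Documented reasons for delisting, as reported above.
delisting_reasons = {
    "clinical improvement or stable disease": 31,
    "progression of hepatocellular carcinoma": 3,
    "non-compliance": 4,
    "transfer to another transplant center": 17,
    "clinical deterioration": 21,
    "patient's own decision": 3,
    "no reason documented": 22,
}


def percent(count, total=COHORT_SIZE):
    """Share of the cohort in percent, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# The three outcomes account for the complete cohort,
# and the delisting reasons account for all delisted patients.
outcomes_cover_cohort = sum(outcomes.values()) == COHORT_SIZE
reasons_cover_delisted = sum(delisting_reasons.values()) == outcomes["delisted"]
```

Both consistency checks hold for the reported figures, and `percent(74)` reproduces the stated 90-day mortality of about 9%.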

Independent risk factors for 90-day-mortality
Univariable regression showed that age in years, creatinine in μmol/l, bilirubin in μmol/l, INR, sodium in mmol/l, PTT in s, albumin in g/dl, hemodialysis sessions per week (0-7), weight in kg, height in cm, acute/subacute hepatic failure, congenital biliary disease, body mass index (BMI) in kg/m², age < 16 years, and age < 12 years had a significant influence on 90-day mortality. Due to collinearity between body weight, body height and BMI, as well as between age and pediatric transplantation, only the more significant variables, BMI and age, were included in multivariable modeling. To avoid collinearity between the variables of the investigated scores and the scores themselves, two separate sets of risk-adjusted multivariable binary regression analyses were performed. The first risk-adjusted multivariable regression revealed creatinine (μmol/l), bilirubin (μmol/l), INR, sodium (mmol/l) and BMI (kg/m²) as independent risk factors for 90-day mortality (see regression analysis 1, Table 4). The second multivariable regression analysis included the MELD-score variants and revealed BMI, days on the transplant waiting list and the iMELD as independent risk factors for 90-day mortality.

Prognostic models as independent risk factors for survival
In the complete cohort the iMELD was the only independent risk factor for 90-day mortality in risk-adjusted multivariable binary logistic regression analysis of prognostic models (Table 4).
The PELD score was identified as a risk factor for survival in all age groups in univariable analysis. It displayed a significant hazard for survival in both pediatric sub-cohorts (age < 12 years and age < 16 years).

ROC-curve analysis results
The prognostic performance of the MELD variants and the PELD as prognostic models to predict 90-day mortality in the complete cohort and selected sub-cohorts is summarized in Table 5. The largest areas under the ROC curve (AUROCs) for this prediction were displayed by the iMELD and the MELD Na in the complete cohort, by the MESO-Index and the MELD Na in children, and by the iMELD and the MELD Na in adults (Table 5).
The PELD displayed a good prognostic performance in patients listed for re-transplantation and for patients listed for acute/subacute hepatic failure in adults and children. The UKELD and the iMELD showed the strongest prognostic performance for waiting list patients with cholestatic diseases. For every other sub-cohort the iMELD and MELD Na displayed the best prognostic performance.
The refitMELD Na failed in all investigated sub-cohorts to predict outcome as measured by its AUROC. For cases with cirrhosis and acute/subacute hepatic failure all prognostic models displayed AUROCs <0.700 (Table 5).

Quality assessment of prognostic models
The average quality assessment score of all investigated prognostic models was 14.45 points (72.25% of a maximum of 20 points), using the tool suggested by Jacob et al. [10].
The MESO-Index (mean 11 score points) and the iMELD (mean 14.02 score points) reached the lowest overall quality scores based on their publications. The UKELD (mean 16 score points) and PELD (mean 15.36 score points) were rated with the highest overall quality scores (Table 3).
Table notes: The grouping of the indications leading to liver transplantation was performed according to the ELTR registry (http://www.eltr.org/). Separate sets of multivariable regression were performed in order to avoid collinearity of related variables (n.a. = not applicable).
The lowest quality assessment scores were awarded for the quality subheadings statistical validity (mean: 0.49) and evaluation of the model (mean: 0.67). Fig 1 shows the results of the quality assessments of the three evaluators for the nine investigated prognostic models as spider web illustrations, in which each corner of the pentagonal spider web represents one category of the quality assessment tool and a line closer to the center indicates poorer assessed quality in that category.

Discussion
This is the first systematic evaluation of the quality and external validity of several prognostic models for 90-day mortality on the waiting list for liver transplantation. The quality of these models was assessed by three different investigators using the quality assessment tool proposed by Jacob et al. [10]. The prognostic abilities of these models were assessed with an independent large data set from a single institution.

Prognostic abilities of the investigated models
In many countries the investigated models have either already gained major clinical importance for the allocation of donor livers for transplantation or have the potential for such use. This study shows that the iMELD has a very high potential to indicate urgency of transplantation. In risk-adjusted multivariable binary logistic regression, the iMELD was the only score identified as a significant independent risk factor for 90-day mortality on the transplant waiting list. Established prognostic scores such as MELD and UKELD did not reach significance in risk-adjusted multivariable binary logistic regression in the current cohort. In line with other studies, the MELD score showed a good performance in ROC-curve analysis in all tested entities [24][25][26][27]. Hence, the MELD score is an adequate tool to predict mortality on the transplant waiting list and can therefore be recommended as an integral part of liver allocation. However, the presented data suggest that other prognostic models, e.g. the iMELD and MELD Na, might perform better on this highly relevant question. Both models showed the best performance in most of the investigated sub-cohorts.
Interestingly, the refitMELD Na did not reach a larger AUROC than the previously developed refitMELD, although the refitMELD Na was published as an improved version of the latter score. This was also the case in the investigated subgroups.
In the Eurotransplant community, allocation of pediatric liver grafts is currently based on the recipient's PELD score. It is therefore somewhat surprising that in this analysis the PELD score did not reach an AUROC larger than 0.700 in the pediatric sub-cohort and was not able to reliably predict 90-day mortality of children on the transplant waiting list. In this specific sub-cohort, the best AUROC was reached by the MESO-Index (0.781). These astonishing findings need to be evaluated further, and a subsequent alteration of allocation mechanisms might be necessary to overcome this discrepancy. The same applies to the currently used MELD-based allocation in adult liver transplantation, since the MELD score reached an AUROC of 0.736, whereas the iMELD reached an AUROC of 0.798. Thus, it is also recommended to investigate this issue systematically in a larger dataset.

Quality assessment of the prognostic models
The systematic quality assessment of the underlying publications of the investigated prognostic models indicated that there might be relevant quality issues in these studies. None of the models achieved the maximum number of points in the quality assessment tool from any of the three investigators. In particular, the statistical validity and the evaluation of the model showed room for improvement in most of the studies. Many prognostic models were evidently developed without full regard to the quality assessment criteria for prognostic models as proposed by Jacob et al. [10]. This may be due to the lack of an international consensus on the methodology that should be applied for the development of prognostic models, which was only recently suggested by the TRIPOD working party [19].

Relation between quality assessment and prognostic abilities
Nevertheless, the current study shows that inferior quality assessment of prognostic models does not necessarily imply inferior prognostic value in this study's cohort (e.g. iMELD). One reason for the differences between performance and quality could be that elementary information was shortened or omitted during the publication process. Greater transparency of the study design gives more confidence in the results and raises the quality of the publication.

Limitations of this study
When interpreting this study's results, it should be taken into account that the underlying data were captured from a single-center retrospective database; thus, there might be a center bias. This is compounded by a comparatively small sample size. Therefore, further studies are needed to confirm the presented promising results, preferably with larger cohorts, e.g. from transplant registries.

Ethical requirements
The prioritization of patients and their timely access to an organ for liver transplantation frequently amounts to a decision on life and death. We therefore believe that very high ethical standards and a thorough evaluation of the quality and validity of prognostic models that are deployed for such use are mandatory. Such an evaluation includes an assessment of the internal, external and statistical validity, as well as the evaluation of model fit and clinical practicability. Jacob and colleagues proposed an excellent methodological approach as early as 2005 [10]. The recently published TRIPOD statement is an important step forward to high-quality prognostic research in transplantation and beyond [19].
A debate on the choice of the prognostic model that is intended to be applied for liver allocation and prioritization of liver transplantation should be guided by sound scientific data and a thorough quality assessment of the respective model. This would require a demonstration of the sensitivity, specificity and overall correctness of prediction of such a model, including its model fit, when applied to the population in which it is intended to be deployed. Unfortunately, the MELD-score based liver allocation rules in Germany were adopted without prior statistical evaluation in German liver transplant waiting list patients. The presented data suggest that the iMELD score and the MELD-Na might provide a more accurate prediction of 90-day mortality on the transplant waiting list than the MELD score, although this must be confirmed in larger studies.

Author Contributions
Conceptualization: RSS HS AK.