Statistical methods versus machine learning techniques for donor-recipient matching in liver transplantation

Donor-Recipient (D-R) matching is one of the main challenges in liver transplantation nowadays. Due to the increasing number of recipients and the small number of donors, the allocation method is crucial. In this paper, to establish a fair comparison, the United Network for Organ Sharing database was used with 4 different end-points (3 months, and 1, 2 and 5 years), with a total of 39,189 D-R pairs and 28 donor and recipient variables. Modelling techniques were divided into two groups: 1) classical statistical methods, including Logistic Regression (LR) and Naïve Bayes (NB), and 2) standard machine learning techniques, including Multilayer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB) and Support Vector Machines (SVM), among others. The methods were compared with the standard scores MELD, SOFT and BAR. For the 5-year end-point, LR (AUC = 0.654) outperformed several machine learning techniques, such as MLP (AUC = 0.599), GB (AUC = 0.600), SVM (AUC = 0.624) and RF (AUC = 0.644), among others. Moreover, LR also outperformed the standard scores. The same pattern was reproduced for the other 3 end-points. Complex machine learning methods were not able to improve the performance of liver allocation, probably due to the implicit limitations associated with the collection process of the database.


Introduction
Donor-Recipient (D-R) matching is one of the most challenging topics in Liver Transplantation (LT). Considering the increasing number of candidates for LT and the scarce number of available donors, the rationale for assigning a given donor to the potential candidates on a waiting list is a matter of controversy. For this purpose, several scores have been designed, whose implementation in practice has its supporters and detractors. The Model for End-Stage Liver Disease (MELD) [1], the Survival Following Liver Transplantation score (SOFT) [2] and the Balance of Risk score (BAR) [3] are examples of the intention to match donors and recipients to obtain the best post-transplant result. However, this result is also a subject of discussion. For some of these scores, the main objective is to decrease mortality on the waiting list without affecting the result of the transplant. This is the case of MELD, the most widespread prioritisation system nowadays. On the contrary, in other scores the idea of obtaining the greatest survival benefit prevails, which means combining the lowest mortality on the list with the best possible result. Unfortunately, none of these systems has been able to combine the urgency of a given transplant candidate with the best survival benefit among possible candidates. A decrease in mortality on the waiting list leads, in many cases, to worse post-transplant survival results; and, vice versa, obtaining better results may affect the opportunity to obtain a transplant for the most critical recipient on the waiting list.
In essence, D-R matching is a classification problem, where some variables of the donor are combined with variables of the listed recipients, surgical aspects and logistics factors to, in short, obtain a survival prediction [4]: survival of the graft, survival of the recipient, or both. A common problem of the available scores is a remarkably basic statistical methodology that only considers isolated variables with single random grafts or patient survival end-points. The combination of several variables and end-points, in the setting of artificial intelligence-based decisions that avoid human-guided bias, may be the basis for D-R matching and grafts allocation in the future.
In this paper, we aim to analyse how several machine learning techniques behave on the largest liver transplant database known to the authors, the database provided by the United Network for Organ Sharing (UNOS) [5]. The opportunity to work with databases including thousands of donor-recipient pairs is crucial for establishing the worldwide applicability of machine learning techniques to the results of LT. To give an idea of the magnitude of the database, since 1988, when it was created, more than 170,000 liver transplants have been performed, representing almost 22% of all the transplants made in the USA, based on Organ Procurement and Transplantation Network (OPTN) data as of January 2021. This database has been previously considered in several studies [6][7][8][9][10]. However, although machine learning techniques have demonstrated their ability to obtain good results in several medical areas, some drawbacks are also found when they are applied to large databases [11][12][13][14].
This problem can be tackled from two different points of view: as a classification problem (by discretising the survival time using a set of end-points) or as a Survival Analysis (SA) problem (in which the outcome is the time until the occurrence of an event of interest, in this case, liver graft failure). Given the huge interest in the SA field, its use was proposed in a previous study by the authors [8], in which we approached the application of SA techniques to the UNOS database. The results achieved in [8] showed that SA techniques achieved only low performance on this dataset, confirming that the problem is complex. Thus, in the current study, we decided to tackle this problem from a completely different point of view, as a set of binary classification problems with different end-points.
The main interest of this paper is the application of state-of-the-art machine learning techniques to the largest liver transplant database, the UNOS database. The importance of this study lies in the interest of obtaining efficient and accurate approaches that are applicable in most situations. In this sense, UNOS collects information not only about donors and recipients, but also about the pre- and post-transplant stages, from all the health care centres of the USA.

Materials and methods
All the information and procedures conducted in this study were in accordance with the ethical standards of both the local Human Research Ethics Committee and the 1975 Declaration of Helsinki. The data was originally requested from the UNOS/OPTN website (https://optn.transplant.hrsa.gov/data/request-data/) on 14th October 2015, and a STARFILE dataset (Standard Transplant Analysis and Research) was received, with de-identified patient-level information for transplant recipients and waiting list candidates. Therefore, the data used have been supplied by the United Network for Organ Sharing (UNOS) as the contractor for the Organ Procurement and Transplantation Network (OPTN). Note that patient data was obtained in a fully anonymised and de-identified manner, and that none of the authors of this paper had access to identifying patient information.

UNOS database: Patient selection and exclusion criteria
For this study we have considered the liver transplantation dataset from the United Network for Organ Sharing (UNOS) database [5]. Although the database was founded in March 1984, we have considered only those transplants performed from November 2004 onwards, the date when the last change of variable formats was made. Partial and split liver transplants, living donor liver transplants and combined liver transplants were excluded from the study. All these transplants, which represent less than 5% of the total, are considered particular cases, and their donor-recipient matching is usually done following specific criteria. All recipients older than 18 years were included, and all patients were followed from the date of transplant until graft loss, up to five years after transplantation.
In order to make a complete analysis of the liver transplant, we have considered four different end-points (periods of time for controlling graft loss): three months (3M), one year (1Y), two years (2Y) and five years (5Y). The choice of these end-points has been made by experts [15][16][17]. Note that graft survival has been defined as the period from transplantation to the time a re-transplant is required or the recipient dies. Deaths not related to the transplantation procedure have been excluded. The initial number of transplants was 39,189, which decreases for every end-point considered, because some of the transplants are censored, i.e. there is no available information about the outcome for that specific end-point. Table 1 shows the total number of transplants considered for each end-point, as well as their class distribution. An important difference between the end-points is that the degree of imbalance of the dataset decreases as later end-points are considered, given that the number of non-survival cases increases. For example, in the case of 5Y, 20,456 transplants are considered, of which 8,886 belong to the non-survival class and 11,570 to the graft-survival class. A notable degree of imbalance is shown for all datasets, which can lead to trivial classifiers (i.e. classifiers predicting survival for all D-R pairs).
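As an illustration, the discretisation of graft survival into binary end-point labels, with censored cases excluded, can be sketched as follows. This is a minimal example with hypothetical column names, not the actual STARFILE variable codes:

```python
import pandas as pd

# Toy follow-up data; column names are hypothetical (the STARFILE
# uses its own variable codes).
df = pd.DataFrame({
    "graft_days": [40, 500, 900, 2000, 300],  # follow-up time in days
    "graft_fail": [1, 0, 1, 0, 0],            # 1 = graft loss observed
})

def endpoint_label(row, horizon_days):
    """Binary label for one end-point: 0 = non-survival (graft lost
    before the horizon), 1 = graft survival (followed past the horizon).
    Cases censored before the horizon get None and are dropped."""
    if row["graft_fail"] == 1 and row["graft_days"] <= horizon_days:
        return 0    # event observed before the end-point
    if row["graft_days"] > horizon_days:
        return 1    # survived past the end-point
    return None     # censored: outcome unknown at this end-point

labels_3m = df.apply(endpoint_label, axis=1, args=(90,)).dropna()
labels_5y = df.apply(endpoint_label, axis=1, args=(5 * 365,)).dropna()
# Later end-points keep fewer pairs (more censoring) but, as in
# Table 1, accumulate more non-survival cases and are less imbalanced.
```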

Variable selection
Although the UNOS database includes more than 350 variables, a huge number of them are redundant (most likely due to format changes or important differences in the acquisition of the values), others are trivial (e.g. indexes and dates), and a vast quantity of them have a high percentage of missing values (we have kept only those variables with a percentage of missing values lower than 10%). Moreover, other variables were discarded, such as the patient address or other information not relevant to medical decision making. As previously stated, the collection of data for large datasets (where different procedures or diverse protocols for obtaining scores or corporal indices are joined in a common database) causes controversy. This can happen not only due to the different mechanisms considered in every health care unit, but also because of the distinct ways of categorising the same patient, causing incongruities. Taking all of this into account, a final set of 28 variables was considered, which is shown in Table 2.
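The missing-value filter described above can be sketched with a toy example (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw UNOS export (hypothetical names).
raw = pd.DataFrame({
    "donor_age":  [45, 60, 33, 52, 38, 41, 70, 55, 49, 63],
    "recip_meld": [23, np.nan, np.nan, 30, np.nan, 18, 25, np.nan, 22, 27],
    "recip_bmi":  [28, 31, 24, 22, 26, 29, 33, 27, 25, 30],
})

# Keep only variables whose fraction of missing values is below 10%.
missing_frac = raw.isna().mean()
kept = raw.loc[:, missing_frac < 0.10]
# "recip_meld" (40% missing) is discarded; the other two are kept.
```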
Variables unknown at transplant time, such as the cold ischemia time, were not considered, because they would have to be estimated without any option to establish a real comparison. Moreover, the inclusion of the cold ischemia time would be a handicap, especially for countries with a huge territorial expanse such as the USA, in which its estimation would be highly inaccurate. In addition, for this kind of post-transplant variable, the prioritisation system in the USA is too complex to allow their estimation.
To impute the missing values, the average value in the training set is considered for continuous and quantitative variables, whereas the mode is considered for binary and qualitative variables.
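A minimal sketch of this imputation scheme, assuming a simple train/test split and hypothetical variable names, could be:

```python
import pandas as pd

# Hypothetical variables: mean imputation for the continuous one,
# mode imputation for the categorical one, both computed on the
# training set only (never on the test set).
train = pd.DataFrame({"age": [40.0, 50.0, None, 60.0],
                      "sex": ["M", "M", None, "F"]})
test = pd.DataFrame({"age": [None, 55.0], "sex": [None, "F"]})

age_mean = train["age"].mean()     # training-set average
sex_mode = train["sex"].mode()[0]  # training-set mode

for frame in (train, test):
    frame["age"] = frame["age"].fillna(age_mean)
    frame["sex"] = frame["sex"].fillna(sex_mode)
```

Computing the statistics on the training set only avoids leaking information from the test folds into the fitted models.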

Experimental settings
To tackle the D-R allocation problem, we have considered different survival prediction methods, which can be divided into two groups: 1) classical statistical methods, such as Logistic Regression (LR) and Naïve Bayes (NB), and 2) standard machine learning techniques, such as Multilayer Perceptron (MLP), Random Forest (RF), Support Vector Machines (SVM), Decision Trees (C4.5), k-nearest neighbours (kNN) and Gradient Boosting (GB). These techniques cover a vast range of methods from the state of the art, including the best-performing classifiers. All these methods have been run using the scikit-learn framework [18]. Moreover, we compare their results against the standard scores used in the literature (MELD, DMELD, BAR, DRI, SOFT and PSOFT) [19].
Regarding the evaluation process, the Confusion Matrix (CM), the Accuracy (Acc), the Minimum Sensitivity (MS), i.e. the minimum of the sensitivities over the classes, and the Area Under the ROC curve (AUC) have been used. The last three metrics vary between 0 and 1, where higher values indicate better performance.
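MS, which is not a built-in scikit-learn metric, can be computed from the confusion matrix as in the following sketch:

```python
from sklearn.metrics import confusion_matrix

def minimum_sensitivity(y_true, y_pred):
    """Minimum of the per-class sensitivities (recalls), taken from
    the rows of the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    per_class = cm.diagonal() / cm.sum(axis=1)
    return per_class.min()

# Example: class 0 recall = 1/3, class 1 recall = 4/5, so MS = 1/3.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]
ms = minimum_sensitivity(y_true, y_pred)
```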
To evaluate the results, a 10-fold cross-validation technique has been used. The results are then computed using the mean and standard deviation from the 10 models obtained, and the confusion matrix is computed as the sum of the test confusion matrices obtained for each model.
Furthermore, to establish a robust comparison of the methods, the parameters of each technique have been chosen using another independent 10-fold cross-validation over the training set. The best parameter configuration is the one that obtains the maximum MS in the nested 10-fold cross-validation, because one of the main objectives is to increase the classification rate of the minority class (in this case, non-survival). Note that the test sets are not used for model selection. The range of parameters used during model selection is specified in Table 3, and the whole procedure is illustrated in Fig 1.
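A sketch of this nested cross-validation procedure, using synthetic data in place of the D-R pairs and LR as an example classifier, could look as follows:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic imbalanced data standing in for the D-R pairs.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.65], random_state=0)

def minimum_sensitivity(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    return (cm.diagonal() / cm.sum(axis=1)).min()

ms_scorer = make_scorer(minimum_sensitivity)

# Inner 10-fold CV selects the hyper-parameters that maximise MS on
# the training folds; the outer 10-fold CV estimates generalisation.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                     scoring=ms_scorer, cv=10)
outer_scores = cross_val_score(inner, X, y, scoring=ms_scorer, cv=10)
```

The parameter grid here is only illustrative; the actual ranges used per technique are those of Table 3.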

Rule-based system
Once the best model is obtained, the D-R matching can be done following a rule-based system, trying to achieve a balance between graft survival and MELD. The procedure of the proposed system is the following: the model allocates a D-R pair only if its probability of graft survival is significantly higher than that of the rest of the possible matchings. A probability of graft survival is considered significantly higher than another if the difference between them exceeds the standard deviation of all the probabilities obtained by the model. If the model is not able to find significant differences between the possible matches, the recipient with the highest MELD is chosen. Moreover, if there are two or more recipients with the same MELD, i.e. a second tie, the donor is allocated to the recipient with the highest number of days on the waiting list.
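A possible implementation sketch of these allocation rules (the exact tie handling in practice may differ in detail) is:

```python
import numpy as np

def allocate(surv_probs, melds, wait_days, sd):
    """Sketch of the rule-based allocation for one offered donor.
    surv_probs: predicted graft-survival probability per candidate;
    sd: standard deviation of all model probabilities, used as the
    significance threshold.  Returns the chosen recipient's index."""
    best = surv_probs.max()
    # Candidates whose probability is not significantly below the best.
    group = [i for i, p in enumerate(surv_probs) if best - p <= sd]
    if len(group) == 1:
        return group[0]  # significant difference: the model decides
    # No significant difference: fall back to the highest MELD.
    top_meld = max(melds[i] for i in group)
    group = [i for i in group if melds[i] == top_meld]
    if len(group) == 1:
        return group[0]
    # Second tie: longest time on the waiting list.
    return max(group, key=lambda i: wait_days[i])
```

For instance, with probabilities (0.60, 0.45, 0.58), MELDs (25, 34, 25) and sd = 0.139, the second candidate is significantly worse, the remaining two tie in MELD, and the one with more days on the waiting list is chosen.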

Results
The results obtained are shown in Table 4, where the mean and standard deviation (SD) of the results of the external 10-fold for every classifier and dataset considered (depending on the split-point) are shown.
As can be seen in Table 4, all the methods obtained similar results for all the datasets, but LR stands out as the one with the best performance in terms of MS and AUC, which are the metrics to which we pay more attention, given that Acc is not a good measure for imbalanced datasets [20]. In particular, focusing on the 5-year end-point dataset, it can be seen that LR achieved the best performance for all metrics, using the second lowest computational time. Finally, according to the standard deviations, the models seem to be stable throughout the cross-validation stage.
From the results shown in Table 4, it can be concluded that complex machine learning techniques such as MLP, RF, SVM, C4.5 and GB obtained slightly worse results than LR, which is a standard statistical method. Focusing on the 5-year end-point, most AUCs are over 0.600, the best result being 0.654 on average for LR. As we want to measure the longest graft-survival time, from this point on we focus on the 5-year end-point dataset.
In Fig 2, the ROC curves for the best models on the 5-year end-point dataset are shown, considering the complete dataset (i.e. the union of the test predictions for the 10 folds). As can be seen, all models perform similarly, and we can differentiate three levels of performance: the best model is LR; then RF, SVM and NB perform similarly, followed by C4.5, GB and MLP, which perform equally; and finally, kNN obtained the worst value.
A comparison against the state-of-the-art scores MELD, DMELD, BAR, DRI, SOFT and PSOFT [19] is included in Fig 3 to show their predictive capability. It can be seen that there is a large difference between the AUC obtained by the best LR model and those obtained by the other scores, which are close to 0.5 (the performance of a random classifier). Note that there is an improvement of more than 14% between the best AUC obtained by the state-of-the-art scores (DMELD, 0.572) and that of the LR model (0.654).

Interpretation of the best model obtained
In this section, we analyse the best model obtained by LR; specifically, the importance of each variable for the prediction of graft survival after transplantation (5-year end-point dataset) is studied in detail. The linear weights obtained are included in Table 5. In this table there are 49 variables, due to the decomposition of ordinal and nominal variables (see Table 2) into binary ones. We have considered the absolute value of the weights to rank the variables and, depending on the sign, we have concluded a positive or negative influence on graft survival after transplantation (note that the positive class is survival at 5 years).
As can be seen in Table 5, the LR model has 49 coefficients. These coefficients are associated with as many characteristics or variables of the donor-recipient pairs, plus a coefficient that is the independent term of the model. The dependent variable is binary with two events: S if the graft survives more than 5 years (positive class) or NS otherwise (negative class), following a Bernoulli distribution of parameter p. Although the vast majority of the important variables belong to the recipient, there are some donor variables with a great impact on the model, such as the age or whether the donor is non-heart-beating. The variables that contribute most to the models are the number of days on the waiting list (recipient), the MELD (recipient), the number of days in ICU (recipient), the age (both donor and recipient) and whether the donor is non-heart-beating, among others. As can be seen, MELD has a great influence on our model, but the number of days on the waiting list or in ICU has a similar or greater impact. These findings are consistent with the state of the art, where the age and the number of days in ICU and on the waiting list are considered important factors for graft survival after transplantation.
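The ranking of variables by the absolute value of their LR weights can be reproduced in outline as follows. The data are synthetic and the variable names are hypothetical; the actual model uses the 49 binary-encoded variables of Table 5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical feature names standing in for the 49 binary-encoded
# donor/recipient variables of Table 5.
names = ["wait_days", "meld", "icu_days", "donor_age", "recip_age"]
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = model.coef_[0]

# Rank variables by absolute weight; the sign indicates a positive or
# negative influence on 5-year graft survival (positive class).
order = np.argsort(-np.abs(coefs))
ranking = [(names[i], float(coefs[i])) for i in order]
```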

Simulation of the rule-based system
In this section, a simulation of the application of the rule-based system described previously is presented. First of all, the LR model chooses one D-R pair over the others only if the difference between their probabilities of graft survival is higher than 14%, since the standard deviation of all the probabilities obtained by the LR model is 0.139. For this simulation, 5 recipients and 10 donors are randomly selected, and we consider three scenarios: 1) selecting 5 recipients with a MELD in the range [23, 27] (Table 6), 2) selecting those with MELD in [34, 40] (Table 7), and 3) instead of applying restrictions to MELD, considering Extended Criteria Donors (ECDs), that is, donors over 70 years of age or with a BMI over 40, among others (Table 8). For visualisation purposes, those recipients which do not show significant differences with respect to the maximum graft-survival probability are shadowed. Bold face indicates the selected values among all the statistically significant ones. Finally, the chosen recipient identifier can be seen in the last row. Table 6 shows the first simulation. For this case, 5 recipients with MELD 23-27 and 10 donors are randomly selected. It can be seen that, when donor D1 is offered, following a MELD-driven approach, recipient R1 would be chosen. However, considering the best LR model proposed, recipient R3 would be considered for allocation, because the difference between both probabilities of graft survival is significant. A similar situation happens for donors D3, D5-D8 and D10. In the case of D2, the MELD-driven approach and the LR model lead to the same allocation, since R1 belongs to the group with the best probabilities of graft survival and has the highest MELD (this pattern is repeated for D4 and D9). Note that the rule-based system rests on the basis of MELD allocation but, when the probabilities show significant differences, a lower-MELD recipient can be chosen.
Therefore, a D-R pair is chosen only in cases of real biological (not merely mathematical) differences. Table 7 shows the simulation with recipients with MELD 34-40. The same strategy is applied to high-MELD recipients. Focusing on D5 and D9, it is interesting to remark that, when two probabilities of graft survival do not show significant differences and there is a tie in MELD, the chosen recipient will be the one with the longest time on the waiting list, in this case, R5. As in the previous simulation, the decision is only taken by the mathematical model when there are significant differences (D2-D6, D8 and D9). Otherwise, the decision is made following the MELD score (D1, D7 and D10). Finally, a last simulation including Extended Criteria Donors (ECDs) is shown in Table 8. Transplants with ECDs have been performed successfully for a number of years. In this simulation, the first 5 donors are older than 70, whereas the remaining 5 have a BMI higher than 40, representing 4.38% and 3.43%, respectively, of all the donors included in this study. The same analysis done for the previous simulations can also be applied to this situation.

Discussion
To our knowledge, this is the first work that addresses Donor-Recipient (D-R) matching in Liver Transplantation (LT) using the UNOS dataset. D-R matching has become one of the most challenging topics in LT in recent years. Unfortunately, standard scores, such as MELD, SOFT or BAR, fail to consider both waiting list mortality and survival benefit. These two objectives are difficult to meet, since these metrics pose them as conflicting objectives. A decrease in mortality on the waiting list leads in many cases to worse post-transplant survival results; and, vice versa, obtaining better results may affect the opportunity to be transplanted for the sickest patient on the waiting list.
D-R matching is considered a classification problem in which variables of the donor, variables of the listed recipients and surgical and logistical aspects are considered to assess the best possible matching [4], which can be based on the survival of the graft, the survival of the recipient, or both. A common problem of the available scores is a remarkably basic statistical methodology that only considers isolated variables with single random graft or patient survival end-points. The combination of several variables and end-points in the setting of artificial intelligence-based decisions that avoid human-guided bias may be the basis for D-R matching and graft allocation in the future. In 2014, we tested Artificial Neural Networks (ANNs) in the complex scenario of D-R matching with D-R pairs from 11 Spanish transplant units [15]. This study demonstrated that ANNs are a valuable tool for organ allocation to obtain the best survival benefit. In the current scenario of graft scarcity and waiting list deaths, the absence of a definitive and objective system for liver-donor assignment is unacceptable. After that, we validated the ANN methodology for D-R matching in a different health care system (data from King's College Hospital, KCH), showing that it could be a powerful tool for D-R matching in comparison to other current models [21]. This methodology has recently been validated using gradient boosting and random forest classifiers [22] on data from 272 different centres, showing that outstanding results can be obtained independently of the population location.
The main goal of this paper was to analyse the behaviour of machine learning techniques applied to the largest liver transplant database, provided by the UNOS [5]. Working with large databases is a great opportunity to achieve a worldwide application of machine learning techniques in the results of LT.
Machine learning methods lead in general to excellent results when combined with a huge amount of information. As an example, Electronic Health Records (EHRs) have been developed to speed up clinician decision making, based on the information extracted from these records [23]. However, it has been demonstrated that, for large databases, machine learning algorithms are not always capable of reaching notable results, which can be due to several reasons [13,24,25]: 1) missing values and the imputation techniques used, in combination with the need for clear guidelines on how to cope with attributes and patterns with different percentages of missing data; 2) the increasing number of different categories for some attributes, which makes the classifier lose accuracy; 3) the increasing number of Non Specified (NS) cases in these attributes, where some specific information is discarded because no category matches the particular situation; 4) by contrast, attributes with several categories but a small number of cases per category, which make a null contribution; and finally, 5) the vast amount of subjective attributes manually introduced, which may cause incongruities between different expert opinions.
The results obtained in the present study reflect many of the problems described above. Indeed, most AUCs are over 0.600, the best being 0.654 on average for LR. These results contrast with those obtained in previous models from the Spanish dataset and its subsequent validation on the King's College Hospital dataset. However, in a similar study made with the UNOS dataset in heart transplantation, Miller et al. [13] found a lack of improvement of advanced analytic algorithms, as we have described in the present study, concluding that the prognostic abilities of machine learning techniques may be limited by the quality of the clinical dataset. More recent studies [25,26] have also found no evidence of a performance benefit of machine learning methods over logistic regression.
In the dataset considered in the present study, a vast amount of missing data was found. Not all the regional centres give the same importance to the data collection step for the database. Hence, the curation of the database is a tedious procedure due to the large percentage of missing values. The imputation of data makes the database lose veracity and robustness, leading to worse performance because of data granularity and quality. Furthermore, the entries of the database are collected by 11 regional centres. This process lacks consistency for diverse reasons: 1) the administrative centres provide their original data to a global database, possibly giving the variables formats different from those of the other centres; 2) a given variable could be obtained following different procedures or index measures; and 3) the database may include incongruities because a given situation could be categorised contradictorily. The inclusion of the cold ischemia time has raised much controversy in the literature [27,28]. Although it has been considered an important variable, it is a post-transplant variable, for which no a priori information is known. Estimating the cold ischemia time to predict donor-recipient matching is a challenging task, because this estimation has to be done with no prior information but the cities of origin of both patients (which in the case of the USA is arduous, given the large distances between hospitals of different states) and the strategy for allocating organs between all the regional centres.
Finally, the trajectory of the different scores considered over the years should be discussed [29,30]. D-R matching has been performed following the guidelines proposed by scores with different goals published in the literature. With the exception of the LR model, the results obtained in the present study did not differ in essence from those obtained with the classic scores (MELD, BAR, SOFT, etc.) based on conventional biostatistics. This does not mean a lack of usefulness of artificial intelligence for the problem of D-R matching in liver transplantation; rather, it emphasises the need for well-designed and well-constructed databases, filled, of course, with diligence and professionalism.
The main goal of machine learning is to provide the medical community with a tool bridging the gap between the medical decision (subjectivity) and strict mathematical scores (objectivity). For this purpose, a rule-based system is proposed for the management of the liver transplant waiting list. This system is objective (it does not include human subjectivity in the selection of the recipient), optimal (it is able to increase the post-transplant survival rates) and, finally, fair, because, if the model does not find a significant difference between two recipients, the organ is allocated to the recipient with the most advanced disease (highest MELD). A deep analysis has been carried out to maximise the understanding of the mathematical model and its consistency with the medical findings so far.

Conclusion
In this paper, a deep analysis of the UNOS liver transplant database is presented. The best results are obtained by the Logistic Regression (LR) method, while machine learning techniques do not live up to expectations. One of the reasons behind this is that the data lack the accuracy and robustness needed for machine learning methods to capture essential hidden features of the problem tackled, although they have obtained outstanding results in other medical fields. Although UNOS is considered a robust clinical registry, it is based on administrative data, making the database suffer from the problem of subjectivity, i.e., given two similar situations, two different choices are made. The extensive analysis carried out confirmed these limitations, since all the methods applied had almost the same predictive ability.
However, as we believe that the techniques have done their best for D-R matching in liver transplantation, an interpretation of the best LR model obtained is provided, together with several simulations of organ allocation in different scenarios: medium MELDs, high MELDs and extended criteria donors. To our knowledge, this is the first study comparing different predictive methods in patients undergoing liver transplantation. Future work is required to improve the quality of the available data, especially in large databases such as UNOS, in which the amount of information is huge and the curation step must be rigorous. Moreover, reformulating the survival prediction problem as a set of correlated binary classification problems, as in [31], will also be considered as future work.