Prediction of Outcome in Acute Lower Gastrointestinal Bleeding Using Gradient Boosting

Background There are no widely used models in clinical care to predict outcome in acute lower gastrointestinal bleeding (ALGIB). If available, these could help triage patients at presentation to appropriate levels of care/intervention and improve medical resource utilisation. We aimed to apply a state-of-the-art machine learning classifier, gradient boosting (GB), to predict outcome in ALGIB using non-endoscopic measurements as predictors. Methods Non-endoscopic variables from patients with ALGIB attending the emergency departments of two teaching hospitals were analysed retrospectively for training/internal validation (n=170) and external validation (n=130) of the GB model. The performance of the GB algorithm in predicting recurrent bleeding, clinical intervention and severe bleeding was compared to a multiple logistic regression (MLR) model and two published MLR-based prediction algorithms (BLEED and Strate prediction rule). Results The GB algorithm had the best negative predictive values for the chosen outcomes (>88%). On internal validation the accuracy of the GB algorithm for predicting recurrent bleeding, therapeutic intervention and severe bleeding was 88%, 88% and 78% respectively, superior to the BLEED classification (64%, 68% and 63%), Strate prediction rule (78%, 78% and 67%) and conventional MLR (74%, 74% and 62%). On external validation the accuracy was similar to conventional MLR for recurrent bleeding (88% vs. 83%) and therapeutic intervention (91% vs. 87%) but superior for severe bleeding (83% vs. 71%). Conclusion The gradient boosting algorithm accurately predicts outcome in patients with acute lower gastrointestinal bleeding and outperforms multiple logistic regression based models. Such models may be useful for risk stratification of patients on presentation to the emergency department.


Introduction
Acute lower gastrointestinal bleeding (ALGIB) is a common emergency that increases in incidence with age [1] and may be more common than acute upper gastrointestinal bleeding in the elderly [2]. The causes and severity are heterogeneous, e.g. large volume bleeding from diverticulosis or minor bleeding from colitis, and the aetiology and outcome are often unclear to the clinician at presentation.
Due to concern about severe or recurrent bleeding or the need for intervention, routine clinical practice for the vast majority of patients with ALGIB who present to the emergency department is admission to hospital for in-patient observation for a variable number of days, with a proportion undergoing endoscopy or radiological investigation. This strategy is invasive and expensive [8] and exposes patients to hospital-acquired complications. If a reliable predictive model were available at the point of presentation to hospital, low-risk patients could be identified and triaged to outpatient management or a shorter inpatient stay. Resources could then be freed up for high-risk patients to be appropriately transferred to higher levels of care and undergo more aggressive investigation. Although multiple logistic regression (MLR) based scores have been developed for ALGIB [9,10], none have been recommended for routine clinical practice, unlike the Glasgow Blatchford and Rockall scores for acute upper gastrointestinal bleeding [11]. Possible reasons include the lack of validation in diverse settings and modest accuracies in comparison with these scores. MLR based models may be limited in predicting outcome in ALGIB as they assume that a linear combination of the observed features determines the probability of each outcome, ignoring any interaction between variables that may be key to accurate prediction.
In order to mitigate this potential limitation of MLR based scores and to accurately predict ALGIB outcome, we chose to implement and assess the performance of a non-parametric algorithm for classification, Gradient Boosting (GB) [12]. GB is a supervised machine learning algorithm, and is able to approximate the unknown functional mapping between the inputs, i.e. the non-endoscopic measurements, and the outputs, i.e. the ALGIB outcomes. Supervised learning algorithms are commonly trained on historical data consisting of examples of input-output pairs.
The GB algorithm embraces the notion of "ensemble learning", whereby multiple simple learning algorithms are used jointly in order to obtain better predictive performance than could be achieved from any of the constituent learning algorithms [13]. Of clinical relevance are several reports demonstrating that ensemble learning classification models are accurate in predicting outcome in a variety of clinical settings [14][15][16].
In particular, the GB algorithm relies on decision trees as constituent or "base" predictive algorithms. Decision trees are statistical models that recursively partition the input space in order to find rules that are predictive of the output. The classical CART (Classification and Regression Tree) algorithm was popularized in the 1980s by Breiman et al. [17]. Compared to other machine learning methods, GB possesses several strengths: 1) it is less prone to overfitting [13], 2) it is robust to noise [18], 3) it has an internal mechanism to estimate error rates, 4) it provides indices of variable importance and 5) it can be used when the predictors are both continuous and categorical.
The aim of this study was to test whether the GB algorithm was able to accurately predict clinical outcomes in patients presenting to emergency departments with ALGIB using non-endoscopic variables available to clinicians at that time. We also set out to compare the performance of the GB approach with conventional MLR and two previously published multivariate logistic regression models [9,10]. We show in this study that the gradient boosting algorithm accurately predicts rebleeding, severe bleeding and clinical intervention in patients with acute lower gastrointestinal bleeding and outperforms multiple logistic regression based models.
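The idea of partitioning the input space can be illustrated with the simplest possible tree, a single split on one variable. The following is a minimal sketch, not the CART implementation used in the study: the function name `best_split` and the toy heart-rate values are hypothetical and chosen only for illustration.

```python
def best_split(xs, ys):
    """Find the threshold on one feature that minimises the squared error
    of predicting ys by the mean of each side of the split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate two equal feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


# Hypothetical toy data (not from the study): heart rate and a binary outcome.
threshold, low_value, high_value = best_split([70, 80, 110, 120], [0, 0, 1, 1])
print(threshold, low_value, high_value)
```

A full decision tree applies the same search recursively to each side of the split and over all input variables.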

Ethics statement
This study was performed by analysing an existing anonymised database of patients presenting with ALGIB to the emergency departments of Charing Cross and Hammersmith hospitals, London, UK (1st January 2007 to 31st December 2011), collated for the purpose of audit. The study was approved by the Joint Research Compliance office at Imperial College Healthcare NHS Trust (ref 125HH25060). The office confirmed that no formal ethical review or informed consent was required as the study involved existing anonymised routinely collected data, no new data were being collected and there was no clinical intervention.

Study design
The database had been generated retrospectively by identifying consecutive patients from electronic records who were aged 18 or over and presented to the emergency department with a primary diagnosis of ALGIB defined as PR bleeding (bright red or maroon coloured blood passed per rectum) within the previous three days. For this study we excluded patients with: 1) presentation most indicative/final diagnosis of an upper GI bleed (haematemesis, melaena, upper GI bleeding source detected at endoscopy or angiography), 2) inpatient ALGIB bleed, 3) ALGIB as a secondary admission symptom, 4) patients transferred from other hospitals and 5) incomplete patient data. Patients with haemodynamic instability and PR bleeding were required to have an upper GI endoscopy before inclusion into the study unless a definitive colonic source was identified.
Patients admitted to Charing Cross Hospital with ALGIB were analysed for training and internal validation of the GB algorithm. The external validation cohort consisted of patients admitted to Hammersmith Hospital over the same time period. Charing Cross and Hammersmith hospitals are large, busy general teaching hospitals with separate emergency departments and surgical teams.

Definitions of variables and final diagnoses
Thirty-nine variables previously reported to be associated with the need for intervention or adverse outcome in ALGIB (Table 1) were identified from the literature, and the corresponding data were collected from prospectively generated databases of presenting diagnoses, laboratory results, endoscopy, discharge and coding. The variables were defined as follows: unstable co-morbidity: any organ system abnormality that usually requires ICU admission; erratic mental status: clouding of consciousness due to any cause or presence of syncope, confusion or coma; cardiovascular disease: history of angina, myocardial infarction, cardiomyopathy or heart failure; respiratory disease: current or past history of COPD; liver disease: history or presence of jaundice, cirrhosis or portal hypertension; and renal failure: creatinine >125 micromol/l.
Definite colonic bleeding was defined as any signs of active or recent bleeding on endoscopy or angiography (stigmata of recent haemorrhage: active bleeding, a non-bleeding visible vessel, or an adherent clot). Presumptive colonic bleeding was defined as haematochezia or blood per rectum and no suspicion of upper GI bleeding, with one or more potential bleeding sources below the ligament of Treitz.

Gradient boosting
We deployed the GB algorithm, as originally proposed by Friedman [13], which has been successfully applied in a number of clinical applications [19][20][21][22][23]. GB is a non-parametric algorithm for supervised machine learning. It approximates the unknown functional mapping from input explanatory variables to corresponding output variables. The non-parametric nature of the GB algorithm enables the estimation of a functional mapping from non-endoscopic measurements to ALGIB outcomes without the need to decide a priori the parametric form of this function. By contrast, parametric models like logistic regression assume that the log odds depend linearly on the covariates, and this linearity may be insufficient to capture the complexity of the relationship between inputs and outputs [24].
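The linearity assumption of logistic regression mentioned above can be made concrete with a short sketch. The weights, intercept and inputs below are hypothetical and are not coefficients from any model in this study; the point is only that the log odds recovered from the predicted probability equal a weighted sum of the covariates.

```python
import math


def logistic_probability(weights, bias, x):
    """Probability from a logistic model whose log odds are linear in x."""
    log_odds = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients and covariates, for illustration only.
p = logistic_probability([0.5, -1.0], 0.2, [2.0, 1.0])
recovered_log_odds = math.log(p / (1.0 - p))
print(recovered_log_odds)
```

Because each covariate contributes only through its own weighted term, an interaction between two covariates cannot be represented unless a product term is added by hand.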
For the training of the GB algorithm we consider m patients with non-endoscopic input measurements and their corresponding ALGIB outcome in the form of pairs $(x_1, y_1), \ldots, (x_m, y_m)$, where each element of $\{x_i\}_{i=1}^{m}$ is a vector containing the non-endoscopic measurements for patient i. We seek to approximate the unknown function $y = F(x^*)$ so that predictions can be made for a new patient for whom we have observed the non-endoscopic input measurements $x^*$.
The GB algorithm relies on an iterative model fitting procedure that makes use of many simple predictive algorithms or "base" learners [13], and combines them to form more complex decision rules. The unknown function F is estimated by minimizing a loss function L defined over the training set:
$$\hat{F} = \arg\min_{F} \sum_{i=1}^{m} L(y_i, F(x_i)).$$
GB constructs an approximation $F_N$ of F as a sum of N+1 "base" learners constructed through N boosting iterations,
$$F_N(x) = F_0(x) + \sum_{n=1}^{N} \gamma_n h(x; a_n),$$
with the trees $h(x; a_n)$ and step lengths $\gamma_n$ defined below. In our implementation, the "base" learners are regression trees [12], which are particularly useful in clinical applications as they provide easy to interpret decision rules. GB starts with an initial base learner $F_0$ and then applies a steepest descent step for the minimization of the loss function with respect to $F_0$. These two steps are repeated sequentially and each time a new learner is constructed to follow the direction along which the loss of the previous learner is minimized. The steepest descent method takes steps proportional to the negative gradient of the loss function in order to find the local minimum. More explicitly, the gradient of the loss function L for each training point $x_i$ at iteration step n is given by
$$g_n(x_i) = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{n-1}}.$$
The gradient is defined only at the data points $\{x_i\}_{i=1}^{m}$ and cannot be generalized to other x values. One way to enable generalization is to choose a regression tree $h(x; a_n)$ that produces $h_n = \{h(x_i; a_n)\}_{i=1}^{m}$ most parallel to $-g_n \in \mathbb{R}^m$. This regression tree can be obtained from the solution
$$a_n = \arg\min_{a, \beta} \sum_{i=1}^{m} \left[ -g_n(x_i) - \beta h(x_i; a) \right]^2,$$
where $a_n$ are the parameters of the regression tree $h(x; a_n)$ and $\beta$ is the learning rate, which determines the contribution of each tree to the approximation.
Having estimated the regression tree that is most highly correlated with $-g_n(x)$ over the data distribution, the next update of the approximation $F_N$ is given by
$$F_n(x) = F_{n-1}(x) + \gamma_n h(x; a_n), \tag{4}$$
which uses the optimal step length,
$$\gamma_n = \arg\min_{\gamma} \sum_{i=1}^{m} L(y_i, F_{n-1}(x_i) + \gamma h(x_i; a_n)).$$
The GB algorithm is summarized in the following pseudo-code:
Gradient Boosting Algorithm
1 Initialise $F_0(x) = p$, where p is the response of the regression tree to the training data.
2 for n = 1 → N do
3   $g_n = \nabla_{F_{n-1}(x)} L(y, F_{n-1}(x))$, the gradient at the training data x.
4   $a_n$ = parameters of the regression tree $h(x; a_n)$ most parallel to $-g_n$.
5   $\gamma_n$ = optimal step length along $h(x; a_n)$.
6   $F_n(x) = F_{n-1}(x) + \gamma_n h(x; a_n)$.
7 end for
After training, the parameters of the learned regression trees encode rules capturing the (possibly non-linear) relationship between non-endoscopic variables and the ALGIB outcome. A new patient with non-endoscopic measurements $x^*$ is then assigned to a specific outcome class y simply by following the decision rules associated with that class.
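The iterative procedure described in this section can be sketched in a few lines of code. This is an illustrative simplification, not the implementation used in the study. It rests on several assumptions not taken from the text: a squared-error loss (so that the negative gradient is simply the residual), single-split trees ("stumps") as base learners, and a fixed learning rate in place of the optimal step length. All function names and the toy data are hypothetical.

```python
def fit_stump(X, targets):
    """Least-squares single-split tree over all features.
    Returns (feature, threshold, left_value, right_value), or None if no split exists."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, targets) if row[j] <= thr]
            right = [t for row, t in zip(X, targets) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return None if best is None else best[1:]


def stump_value(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def gradient_boost(X, y, n_rounds=20, learning_rate=0.5):
    f0 = sum(y) / len(y)                      # constant initial learner F_0
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the assumed loss 0.5 * (y - F)^2 at the training points.
        neg_gradient = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, neg_gradient)    # tree most parallel to the negative gradient
        if stump is None:
            break
        stumps.append(stump)
        scores = [s + learning_rate * stump_value(stump, row)
                  for s, row in zip(scores, X)]
    return f0, learning_rate, stumps


def predict_score(model, row):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_value(s, row) for s in stumps)


# Hypothetical toy data (not from the study): two measurements, binary outcome.
X = [[70, 1.0], [80, 1.2], [110, 0.9], [120, 1.1]]
y = [0, 0, 1, 1]
model = gradient_boost(X, y)
print(predict_score(model, [75, 1.0]), predict_score(model, [115, 1.0]))
```

In this sketch a new patient would be assigned to the positive class when the score exceeds 0.5; a classification loss such as the binomial deviance would be the more usual choice in practice.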
To enable an optimal training of the GB algorithm, first we randomly divided our cohort of 170 patients collected at Charing Cross Hospital into training and validation datasets. Specifically, 70% of patients were assigned to the training set and the remaining 30% were utilized for internal validation. Internal validation was carried out with the objective of optimally tuning the GB hyperparameters (i.e. the number of boosting iterations, the depth of the regression trees and the learning rate) before assessing the performance of the algorithm externally on a cohort of patients admitted to Hammersmith Hospital (completely independent dataset).
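The split and tuning set-up described above might look as follows. This is a sketch under stated assumptions: the random seed, the helper name `split_70_30` and every candidate value in the hyperparameter grid are hypothetical, since the text does not report the values that were searched.

```python
import random
from itertools import product


def split_70_30(patient_ids, seed=0):
    """Randomly assign 70% of patients to training and 30% to internal validation."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = round(0.7 * len(ids))
    return ids[:cut], ids[cut:]


# 170 patients in the training/internal validation cohort, as stated in the text.
train_ids, valid_ids = split_70_30(range(170))

# Hypothetical candidate values for the three tuned hyperparameters:
# number of boosting iterations, tree depth and learning rate.
grid = list(product([50, 100, 200], [1, 2, 3], [0.01, 0.1]))
print(len(train_ids), len(valid_ids), len(grid))
```

Each grid entry would be fitted on the training set and scored on the internal validation set, with the external cohort held back until the configuration is fixed.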
For each of the clinical outcomes in our study we ranked the covariates using the internal GB mechanism for variable ranking, and selected the 10 most predictive variables for each outcome. We then investigated whether re-fitting the GB algorithm using only this reduced set of covariates would yield comparable performance.

MLR and published MLR based models for ALGIB
Conventional multiple logistic regression was applied to the Charing Cross and Hammersmith cohorts using the same 39 non-endoscopic variables as in GB. Moreover, two published MLR based models for ALGIB, the BLEED score [10] (persistent bleeding, low systolic blood pressure, elevated prothrombin time, erratic mental status and unstable comorbid disease) and Strate prediction rule [9] (heart rate ≥100, systolic blood pressure ≤115, syncope, non-tender abdominal examination, rectal bleeding within the first 4 hours of evaluation, aspirin use and >2 comorbid conditions) were calculated for both cohorts.
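As an illustration of how a rule of this kind is evaluated, the sketch below counts how many of the seven Strate criteria listed above are present for one patient. It is not the published scoring procedure: the function and argument names are hypothetical, the comparison directions for heart rate (at least 100) and systolic blood pressure (at most 115) are assumptions, and no risk categories are assigned because the text does not state them.

```python
def strate_criteria_count(heart_rate, systolic_bp, syncope, nontender_abdomen,
                          bleeding_within_4h, aspirin_use, n_comorbidities):
    """Number of Strate prediction rule criteria met (0 to 7)."""
    criteria = [
        heart_rate >= 100,        # assumed direction of the threshold
        systolic_bp <= 115,       # assumed direction of the threshold
        bool(syncope),
        bool(nontender_abdomen),
        bool(bleeding_within_4h),
        bool(aspirin_use),
        n_comorbidities > 2,
    ]
    return sum(criteria)


# Hypothetical patient, for illustration only.
count = strate_criteria_count(heart_rate=110, systolic_bp=100, syncope=False,
                              nontender_abdomen=True, bleeding_within_4h=False,
                              aspirin_use=True, n_comorbidities=3)
print(count)
```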

Outcomes
The outcomes measured were therapeutic intervention (endoscopic, angiographic, surgical), severe bleeding (defined as ongoing or recurrent bleeding) and recurrent bleeding. These outcomes were chosen as they indicate the need for inpatient care. Therapeutic intervention to stop the source of a bleed was included as this suggested the presence of an ongoing bleed that was not resolving spontaneously. Definitions for the three outcomes were taken from the published literature [5,9,25,26]. Severe bleeding was defined as the following: continued bleeding in the first 24 hours of hospitalisation (defined as a RBC transfusion of ≥2 units, and/or a haematocrit decrease of ≥20%), or recurrent bleeding after 24 hours of stability (defined as more than one transfusion of RBCs, a further haematocrit decrease of ≥20%, or readmission for ALGIB within 1 week of discharge). Recurrent bleeding was defined as recurrent haematochezia after 24 hours of stabilisation during which no active bleeding was observed, associated with any of the following as a new finding: decrease in haemoglobin of ≥2g/dl, decrease in haematocrit of ≥5%, haemodynamic instability, or having an additional RBC transfusion (≥2 units received in total).

Statistical Analysis
The following measures of performance in predicting severe bleeding, recurrent bleeding and therapeutic intervention were derived for all models in the internal and external validation cohorts: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy (number of correct predictions divided by the total number of predictions), using HDS Epimax, 2004 and GraphPad Prism software. Comparison of continuous and categorical data (between the internal and external validation cohorts) was carried out using Mann-Whitney U and Fisher exact tests respectively. A two-tailed significance level of 5% was used in all comparisons.
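The five performance measures follow directly from the counts of true and false positives and negatives. The sketch below shows the arithmetic; the function name is hypothetical and the example counts are invented for illustration, not taken from the study tables.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard performance measures from a 2x2 table of predictions vs. outcomes."""
    def ratio(numerator, denominator):
        # Undefined when the denominator is zero (e.g. no positive predictions).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positives among all with the outcome
        "specificity": ratio(tn, tn + fp),   # true negatives among all without the outcome
        "ppv": ratio(tp, tp + fp),           # outcome present among positive predictions
        "npv": ratio(tn, tn + fn),           # outcome absent among negative predictions
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for 50 patients, for illustration only.
metrics = diagnostic_metrics(tp=8, fp=4, tn=36, fn=2)
print(metrics)
```

With a low event rate, as in this invented example, NPV can be high while PPV stays modest, which is why both are reported alongside accuracy.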

Characteristics and clinical outcomes of cohorts
For the Charing Cross cohort (CXC), following the initial search through emergency and endoscopy databases at Charing Cross hospital, 174 patients were identified as having had a history of ALGIB and presentation to the emergency department. Four patients were excluded due to incomplete information for risk scoring. The 170 remaining patients made up the training/internal validation cohort. The same search process was used to compile the Hammersmith cohort (HC). A total of 133 patients attended the emergency department of Hammersmith hospital with a primary diagnosis of ALGIB. Of these patients, three were excluded due to incomplete data for risk scoring. The remaining 130 patients made up the external validation cohort. The accuracy of data collection was checked by random re-assessment of 10% of notes by another author (LA).
The demographic characteristics, clinical features and final diagnoses of the internal and external validation cohorts are shown in Tables 2 and 3. The patients in the internal Charing Cross cohort were similar to the Hammersmith cohort with regard to sex ratio, median age and length of stay. Nearly all patients (98-100%) in both cohorts were admitted to hospital for management of ALGIB, with admission lasting a median of 4 days. Upper GI endoscopy to rule out an upper GI bleed was carried out in 20% of cases in the internal cohort and 14% of cases in the external cohort. Patients in both cohorts were similar in terms of undergoing a lower endoscopy procedure (74% vs. 81%) but in-patient colonoscopy was more common in the Charing Cross cohort (76% vs. 46%). This consisted of colonoscopy (56% Charing Cross cohort, 45% Hammersmith cohort) with the remainder having flexible sigmoidoscopy, rigid sigmoidoscopy or proctoscopy. A CT or mesenteric angiogram was carried out in 8% of patients in the Charing Cross cohort and 5% in the Hammersmith cohort.
The three most common diagnoses were diverticulosis, colitis and anorectal disorders such as haemorrhoids. Final diagnoses were similar in both cohorts apart from a presumptive diagnosis of a diverticular bleed, which was more common in the Charing Cross cohort, and no diagnosis made, which was more common in the Hammersmith cohort. Therapeutic intervention for bleeding (endoscopic therapy, angiographic embolisation or surgery) was more common in the Charing Cross cohort. Endoscopic intervention in both cohorts consisted of clipping, APC and banding. Angiographic embolisation of colonic vessels was carried out in nine patients in the internal cohort and four in the external cohort. Blood transfusion was significantly more common in the Charing Cross cohort (mean 1.6 units vs. 0.9 units), as was severe bleeding. There was no significant difference between the cohorts in rebleeding or death. All patients who died were ≥65 years of age and the causes of death were cancer (n = 2), cardiac failure (n = 1), pneumonia (n = 1), colonic ischemia (n = 1), pulmonary hypertension (n = 1) and unknown (n = 1). No patient died because of uncontrolled bleeding.

Predictive performance of gradient boosting and multiple logistic regression
The best GB algorithms using the 39 variables had predictive accuracies of 88%, 91% and 83% (Table 4) for recurrent bleeding, therapeutic intervention and severe bleeding respectively. The accuracies were similar in the Charing Cross and Hammersmith cohorts. The positive predictive values of all GB algorithms were not high in either cohort, although, importantly for clinical decision making, the negative predictive values were high (88-98%) for the three outcomes. The top ten contributing predictors used in the GB algorithms for each outcome are listed in Table 5 in descending order. Four variables (heart rate, diastolic blood pressure, creatinine and APTT) were in the top ten most frequently used variables for all outcomes. On internal validation the accuracy of the GB models for predicting recurrent bleeding, therapeutic intervention and severe bleeding was 88%, 88% and 78% respectively, superior to the BLEED classification (64%, 68% and 63%) (Table 4), Strate prediction rule (78%, 78% and 67%) and conventional MLR (74%, 74% and 62%). On external validation the accuracy of the GB algorithm was similar to conventional MLR for recurrent bleeding (88% vs. 83%) and therapeutic intervention (91% vs. 87%) but superior for severe bleeding (83% vs. 71%). GB models using just the top ten predictive variables were less accurate in predicting rebleeding, severe bleeding and therapeutic intervention than those with the full set of 39 variables, by 8-10% on average.

Discussion
There is a need for non-endoscopic risk scores to help risk stratify patients with ALGIB for early discharge/outpatient management or higher levels of care and thereby utilise resources efficiently. Current clinical practice includes colonoscopy-based triage, which is invasive, requires adequate bowel preparation that is difficult to achieve, and infrequently finds treatable stigmata of haemorrhage [1]. This study has shown that a GB algorithm based on clinical and laboratory variables was accurate (>80%) in predicting the clinical outcomes of recurrent bleeding, therapeutic intervention and severe bleeding. GB had high negative predictive values (88-98%) in both the Charing Cross and Hammersmith cohorts. This suggests that these models may be useful to triage patients into a low-risk group who could be managed as outpatients or with an abbreviated stay in hospital, avoiding high levels of care. The median and mean inpatient stay in the Charing Cross and Hammersmith cohorts was four and seven days respectively, and therefore a reduction in this would allow for significant cost savings and decrease the exposure of patients to hospital-associated hazards such as infections.
A particular strength of this study is the validation and good performance in an external cohort with a lower incidence of severe bleeding indicating the algorithm can maintain accuracy in a different setting. Our GB algorithm used only non-endoscopic variables available to the clinician in the emergency department and therefore has clinical applicability for decision-making. We would however emphasise that such a model is not aimed at replacing experienced decision making but rather aiding the process. This is the case with recommended risk scores for acute upper gastrointestinal bleeding such as the Rockall and Glasgow Blatchford scores [11].
To our knowledge, one study has examined an ensemble machine learning model in acute gastrointestinal bleeding, developed to identify the bleeding source, the need for resuscitation and those who require urgent endoscopy [27]. This study differs substantially from ours in that a random forests algorithm was used, which builds classification trees without regard to the error of the previous tree in the sequence [28]. Also, a mixed population of patients with acute upper, middle and lower gastrointestinal bleeding was studied (no numbers given for ALGIB), there was no comparison with previously published scores and the model was not tested in an external cohort. Nevertheless, accuracies of >75-80% were found for the studied outcomes, providing evidence of the utility of ensemble machine learning techniques.
The performance of the GB algorithm in our study was superior to MLR and two published MLR based models. Ensemble machine learning models have been shown to be more accurate than conventional logistic regression in classifying disease or predicting outcome in a variety of clinical settings [29][30][31]. A theoretical reason for this is that logistic regression predicts outcomes based on linear combinations of independent variables by fitting a single model that best explains the relationship between observed values and outcome. By contrast, the GB algorithm fits many simple models whose predictions are then combined, which can produce a good fit of the predicted outcome values to the observed values even if the nature of the relationship between the predictor variables and the corresponding outcome is complex (e.g. non-linear, involving interactions, or noisy with outliers). Also, unlike multiple logistic regression, the GB method can handle a large number of input variables and generate an internal unbiased estimate of the generalization error as the simple classification tree estimation progresses. Finally, the stage-wise model fitting procedure of the GB algorithm allows automatic assessment of the influence of each non-endoscopic variable in the construction of a robust classification rule [12].
Other explanations are that the BLEED score [10] was designed to predict a composite endpoint of in-hospital complication (recurrent haemorrhage, surgery to control haemorrhage and hospital mortality) rather than the end-points we examined. The Strate prediction rule [9] was designed to predict severe bleeding which we studied but in our cohort performed least well for this outcome and better for recurrent bleeding and need for clinical intervention. Artificial neural networks, another machine learning classifier, have also been shown to be accurate in predicting re-bleeding and clinical intervention in ALGIB [5]. However for our cohort, the classification performance of neural networks was found inferior to GB in ALGIB (data not shown).
Our study has a number of potential limitations. First, the database of patients was collated retrospectively and relied on the inherent accuracy of patient records. The majority of data collected was however quantitative and also available from prospectively generated electronic laboratory, endoscopic and patient records. Second, the GB algorithm requires the input of many more variables than the BLEED or Strate prediction rule (39 vs. 5 vs. 8), which increases complexity. Reduction in the number of variables used in the GB model to 10 led to a decrease in accuracy of 8-10% for the studied outcomes, which would compromise the clinical utility of the GB algorithm. Importantly, however, in our experience input of data for the 39 variables into the programme takes less than 5 minutes and would therefore be suitable for use in an emergency department or ward, particularly given the proliferation of smartphone apps which allow for quick data entry with drop-down menus where only positive inputs are required. This would be similar to endoscopy reports, which are generated electronically, typically require >50 pieces of data and are used in routine clinical practice. Third, the decision to give blood transfusion and apply endoscopic therapy was not protocol based, which could have led to bias. In mitigation, our analysis showed that endoscopic therapy was consistently applied in this study according to consensus guidelines (data not shown), which limits this as a potential source of bias. Finally, death was not examined as an outcome due to its infrequent nature in our cohort. Death is rare in ALGIB, occurring in <4% in large series, and generally occurs in those with co-morbid conditions [6] and after an in-patient bleed [32], the latter of which was an exclusion criterion for our study. Future work to examine the GB model in a larger cohort could be undertaken to examine its utility in predicting death.
To determine the impact of the model in predicting outcomes in ALGIB a randomised study of the GB model plus routine clinical decision-making versus routine clinical decision-making could be performed.
In summary, gradient boosting accurately predicts outcome in patients with acute lower gastrointestinal bleeding. This machine learning approach has the potential to aid in the risk stratification of patients with ALGIB on presentation to the emergency department.