Developing and validating subjective and objective risk-assessment measures for predicting mortality after major surgery: An international prospective cohort study

Background Preoperative risk prediction is important for guiding clinical decision-making and resource allocation. Clinicians frequently rely solely on their own clinical judgement for risk prediction rather than objective measures. We aimed to compare the accuracy of freely available objective surgical risk tools with subjective clinical assessment in predicting 30-day mortality. Methods and findings We conducted a prospective observational study in 274 hospitals in the United Kingdom (UK), Australia, and New Zealand. For 1 week in 2017, prospective risk, surgical, and outcome data were collected on all adults aged 18 years and over undergoing surgery requiring at least a 1-night stay in hospital. Recruitment bias was avoided through an ethical waiver to patient consent; a mixture of rural, urban, district, and university hospitals participated. We compared subjective assessment with 3 previously published, open-access objective risk tools for predicting 30-day mortality: the Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality (P-POSSUM), Surgical Risk Scale (SRS), and Surgical Outcome Risk Tool (SORT). We then developed a logistic regression model combining subjective assessment and the best objective tool and compared its performance to each constituent method alone. We included 22,631 patients in the study: 52.8% were female, median age was 62 years (interquartile range [IQR] 46 to 73 years), median postoperative length of stay was 3 days (IQR 1 to 6), and inpatient 30-day mortality was 1.4%. Clinicians used subjective assessment alone in 88.7% of cases. All methods overpredicted risk, but visual inspection of plots showed the SORT to have the best calibration. The SORT demonstrated the best discrimination of the objective tools (SORT Area Under Receiver Operating Characteristic curve [AUROC] = 0.90, 95% confidence interval [CI]: 0.88–0.92; P-POSSUM = 0.89, 95% CI 0.88–0.91; SRS = 0.85, 95% CI 0.82–0.87). 
Subjective assessment demonstrated good discrimination (AUROC = 0.89, 95% CI: 0.86–0.91) that was not different from the SORT (p = 0.309). Combining subjective assessment and the SORT improved discrimination (bootstrap optimism-corrected AUROC = 0.92, 95% CI: 0.90–0.94) and demonstrated continuous Net Reclassification Improvement (NRI = 0.13, 95% CI: 0.06–0.20, p < 0.001) compared with subjective assessment alone. Decision-curve analysis (DCA) confirmed the superiority of the SORT over other previously published models, and the SORT–clinical judgement model again performed best overall. Our study is limited by the low mortality rate, by the lack of blinding in the ‘subjective’ risk assessments, and because we only compared the performance of clinical risk scores as opposed to other prediction tools such as exercise testing or frailty assessment. Conclusions In this study, we observed that the combination of subjective assessment with a parsimonious risk model improved perioperative risk estimation. This may be of value in helping clinicians allocate finite resources such as critical care and to support patient involvement in clinical decision-making.


Author summary
Why was this study done?
• Over 3 million postoperative deaths occur worldwide per year.
• Some of these may be avoidable through risk-assessment-based modification of treatment pathways, such as postoperative critical care admission.
• There are multiple methods for predicting which patients are at high risk of death or complications from surgery, but these are not widely used, with clinicians instead usually relying on their subjective clinical judgement alone.
• Before this study, there was little information about whether clinical judgement was of better, worse, or equivalent accuracy to objective risk scores.
What did the researchers do and find?
• We conducted a 1-week cohort study in 274 hospitals in the UK, Australia, and New Zealand, during which we collected data on risk and surgical outcome on every patient who had an operation requiring an overnight stay in hospital.
• The clinical team (surgeons, anaesthetists) looking after the patient were asked to provide a subjective assessment of risk. We compared these assessments with the results of 3 freely available objective risk-assessment tools.
• We included data from 22,631 patients in our analyses and found that subjective assessment was as accurate as the best of the objective risk tools (the Surgical Outcome Risk Tool or SORT) for predicting death in hospital within 30 days of surgery.
• However, combining subjective and objective measurement using the SORT provided an even more accurate estimate.

Introduction
The provision of safe surgery is an international healthcare priority [1]. Guidelines recommend that preoperative risk estimation should guide treatment decisions and facilitate shared decision-making [2,3]. Furthermore, there is an ethical imperative (and in the United Kingdom [UK], a legal requirement) to provide an individualised assessment of a patient's risk of adverse outcomes [4]. Increasing evidence suggests that postoperative mortality in both high and low/middle-income settings is due less to what happens in the operating theatre and more to our 'failure to rescue' patients who develop postoperative complications [5,6]. These observations also point towards opportunity: once a patient has been identified as high risk, mitigation strategies such as pre-emptive admission to critical care or enhanced postoperative surveillance may prevent adverse outcomes [2]. However, critical care is a finite resource, with competition for beds between surgical and emergency medical admissions. To that end, the requirement for a postoperative critical care bed is itself a risk factor for last-minute cancellation, with consequent potential for disruption and harm for both patients and healthcare providers [7]. Thus, there is a need to accurately stratify patient risk so as to make the most of limited resources and improve perioperative outcomes. This is especially true given the scale of demand; more than 300 million operations take place annually worldwide [8]. With a major postoperative morbidity rate of around 15% [9,10], a short-term mortality rate between 1 and 3% [11], and a reproducible association between short-term morbidity and long-term survival [9,12,13], the impact of surgical complications on individual patients, healthcare resources, and society at large is clearly evident. Furthermore, if resources permitted, substantially larger numbers of patients would be considered for surgical intervention [1]. 
There are numerous methods available to help clinicians estimate perioperative risk, including frailty indices [14], functional capacity assessments such as cardiopulmonary exercise testing (CPET) [15], and dozens of risk prediction scores and models, many of which are open-source, are easily applied, and have been validated in multiple heterogeneous surgical cohorts [16]. Despite this myriad of choices, data from national Quality Improvement (QI) programmes indicate that clinicians do not routinely document an individualised risk assessment before surgery [10,17]. In part, this may relate to the availability of complex investigations and equipoise over which method is most accurate, particularly when the accuracy of objective methods compared with subjective assessment alone is disputed [15]. We therefore performed a prospective cohort study with the following objectives: to describe how clinicians assess risk in routine practice, to externally validate and compare the performance of 3 open-access risk models with subjective assessment, and to investigate whether objective risk tools add value to subjective assessment.

Methods
This is a planned analysis of the Second Sprint National Anaesthesia Project: EPIdemiology of Critical Care provision after Surgery (SNAP-2: EPICCS) study, a prospective observational cohort study conducted in 274 hospitals from the UK, Australia, and New Zealand [18]. We report our findings in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE; S1 Text) and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD; S2 Text) statements [19,20]. National research networks, including trainee-led networks, were used to maximise recruitment from public hospitals in all countries. All adult (≥18 years) patients undergoing inpatient surgery and meeting our criteria (see 'Data set', below) during a 1-week period were included in our analyses for this paper. Patients were recruited between 21-27 March 2017 in the UK, 21-27 June 2017 in Australia, and 6-13 September 2017 in New Zealand.

Data set
All data (S3 Text) were collected prospectively. In this study, we defined objective risk assessment as the use of a risk calculation model or equation or tool that supplies a prediction of risk on a probability scale. Before surgery, perioperative teams answered the following question for each patient: 'What is the estimate of the perioperative team of the risk of death within 30 days?', with 6 categorical response options (<1%, 1%-2.5%, 2.6%-5%, 5.1%-10%, 10.1%-50%, and >50%). These thresholds were decided by expert consensus within the study steering group and study authors. Teams were then asked to record how they arrived at this estimate (for example, clinical judgement and/or an objective risk tool). The patient data for this study were collected from a wide range of participating publicly funded hospitals in the UK (n = 245), Australia (n = 21), and New Zealand (n = 8). These were a heterogeneous mix of secondary (42%) and tertiary care (58%) institutions and likely reflective of the general composition of hospitals in these countries. We have previously described the hospitals and their available facilities for providing perioperative care [21].
Patients included in the study were adults (≥18 years) undergoing surgery or other interventions that required the presence of an anaesthetist and who were expected to require overnight stay in hospital. We included all procedures taking place in an operating theatre, radiology suite, endoscopy suite, or catheter laboratory for which inpatient (overnight) stay was planned, including both planned and emergency/urgent surgery of all types, endoscopy, and interventional radiology procedures.
Patients were excluded if they indicated they did not want to participate in the study. We also excluded ambulatory surgery, obstetric procedures (for example, cesarean sections and surgery for complications of childbirth), procedures on ASA-PS (American Society of Anesthesiologists Physical Status score) grade VI patients, noninterventional diagnostic imaging (for example, CT or MRI scanning without interventions), and emergency department or critical care interventions requiring anaesthesia or sedation but no interventional procedure.

Statistical analysis
The protocol for SNAP-2: EPICCS was previously published with aims, objectives, and research questions outlined [18]. Our primary outcome for the study described in this paper was inpatient 30-day mortality, recorded prospectively by local collaborators. We conducted 3 inferential analyses, the first using the entire patient data set and the second and third omitting the patients for whom an objective tool was used to predict perioperative risk (Fig 1). For the first analysis, we evaluated performance of the Portsmouth-Physiology and Operative Severity Score for the enUmeration of Mortality (P-POSSUM), Surgical Risk Scale (SRS), and Surgical Outcome Risk Tool (SORT) [16,22-24]. The calibration and discrimination of all models were assessed in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) recommendations [20]. Calibration was assessed by graphical inspection of observed versus expected mortality and by the Hosmer-Lemeshow goodness-of-fit test [25]. Discrimination was assessed by calculating the Area Under Receiver Operating Characteristic curve (AUROC) [26]. AUROCs were compared using DeLong's test for 2 correlated ROC curves [27]. ROC curves can be constructed for both continuous predictions (for example, P-POSSUM, SRS, and SORT) and ordinal categorical predictions (for example, ASA-PS or the 6-category subjective predictions that clinicians were asked to make): in the former, sensitivities and specificities are calculated for every value in the probability range of 0 to 1, and then each point is plotted to obtain a smooth curve; in the latter, sensitivities and specificities are computed for each category, and the points form a polygon on the ROC plot.
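As an illustration of how one AUROC definition covers both continuous and ordinal predictions, the following minimal Python sketch computes the AUROC as the proportion of death-survivor pairs in which the death received the higher score, counting ties as one half. For ordinal categories, this equals the area under the polygon described above. This is not the study's code (the analyses were performed in R); the function name and toy data are hypothetical, and DeLong's test is not implemented here.

```python
from typing import Sequence


def auroc(scores: Sequence[float], died: Sequence[int]) -> float:
    """AUROC as the probability that a randomly chosen death is scored
    higher than a randomly chosen survivor, counting ties as one half."""
    deaths = [s for s, d in zip(scores, died) if d == 1]
    survivors = [s for s, d in zip(scores, died) if d == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for p in deaths:
        for n in survivors:
            if p > n:
                wins += 1.0   # concordant pair
            elif p == n:
                wins += 0.5   # tied pair (common with ordinal categories)
    return wins / (len(deaths) * len(survivors))


# Toy data (made up): ordinal risk categories 1..6 for six patients.
categories = [1, 1, 2, 3, 5, 6]
outcome = [0, 0, 0, 1, 0, 1]
print(auroc(categories, outcome))  # 7 of 8 pairs concordant -> 0.875
```

The pairwise loop is quadratic in cohort size, so a rank-based implementation would be preferable for a data set of the size reported here.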
The second analysis compared the performance of subjective assessment (defined as the use of clinical judgement, ASA-PS, or both) against the best-performing risk tool. For this, we included only patients for whom subjective assessment alone was used to predict the risk of 30-day mortality. Subjective assessment was then evaluated on calibration and discrimination. Point estimates of risk prediction were taken as the midpoint of the predicted risk intervals provided by clinicians (i.e., 0.5% for the interval <1%, 1.75% for the interval 1%-2.5%, and so on), and the proportion of observed mortality in each of these risk categories was calculated. Calibration was then assessed by plotting the observed mortality proportions against the midpoints of clinician-predicted risk intervals. We then compared the performance of subjective assessment against the best-performing risk model, using AUROC and the continuous Net Reclassification Improvement (NRI) statistic [25]. The NRI quantifies the proportion of individuals whose predictions improve in accuracy (positive reclassification) minus the proportion whose predictions worsen in accuracy (negative reclassification) when using one prediction model versus another [28]. An NRI >0 indicates an overall improvement, <0 an overall deterioration, and zero no difference in prediction accuracy.
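The continuous NRI described above can be sketched in a few lines of Python. The sketch assumes the usual category-free definition: among patients who died, the proportion whose predicted risk moved up minus the proportion whose risk moved down, plus, among survivors, the proportion whose risk moved down minus the proportion whose risk moved up. The function name and toy data are hypothetical, and confidence intervals (which the study reports) are not computed.

```python
from typing import Sequence


def continuous_nri(old: Sequence[float], new: Sequence[float],
                   died: Sequence[int]) -> float:
    """Category-free NRI of `new` predictions relative to `old` ones."""
    ev_up = ev_down = ne_up = ne_down = 0
    n_events = n_nonevents = 0
    for p_old, p_new, d in zip(old, new, died):
        moved_up, moved_down = p_new > p_old, p_new < p_old
        if d == 1:
            n_events += 1
            ev_up += moved_up        # correct direction for a death
            ev_down += moved_down    # wrong direction for a death
        else:
            n_nonevents += 1
            ne_down += moved_down    # correct direction for a survivor
            ne_up += moved_up        # wrong direction for a survivor
    if n_events == 0 or n_nonevents == 0:
        raise ValueError("need at least one death and one survivor")
    return (ev_up - ev_down) / n_events + (ne_down - ne_up) / n_nonevents


# Toy data (made up): two deaths followed by four survivors.
died = [1, 1, 0, 0, 0, 0]
old = [0.10, 0.10, 0.05, 0.05, 0.05, 0.05]
new = [0.20, 0.05, 0.01, 0.01, 0.01, 0.10]
print(continuous_nri(old, new, died))  # events: 0.0, non-events: 0.5 -> 0.5
```

In this toy example the whole improvement comes from correctly lowering the predicted risk of survivors, which is the "downgrading" pattern a positive NRI can reflect.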
The third analysis evaluated the added value of combining subjective assessment with the best-performing risk tool by creating a logistic regression model with variables from both sources.
PLOS MEDICINE

For this, we fitted a logistic regression model with 2 variables: the subjective assessment of risk and the mortality prediction from the best objective risk tool, according to the following logit formula:

$$\ln\left(\frac{R}{1 - R}\right) = \beta_0 + \beta_1 X_{\text{subjective}} + \beta_2 X_{\text{objective}}$$

where $R$ is the probability of 30-day mortality; $\beta_0$, $\beta_1$, and $\beta_2$ are the model coefficients; $X_{\text{subjective}}$ is the subjective clinical assessment (6 ordered categories, as above); and $X_{\text{objective}}$ is the risk of mortality as predicted using the most accurate risk model. An optimism-corrected performance estimate of the combined model was obtained using bootstrapped internal validation with 1,000 repetitions; this was then compared with subjective assessment and the most accurate risk model alone.
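To show how the logit formula above turns into a predicted probability, the following minimal Python sketch inverts the logit for a given set of coefficients. The coefficients are hypothetical placeholders, not the fitted values from this study, and the numeric coding of the subjective category (1 for <1% up to 6 for >50%) is an assumption made only for illustration. The bootstrap optimism correction is not shown.

```python
import math


def combined_risk(b0: float, b1: float, b2: float,
                  x_subjective: float, x_objective: float) -> float:
    """Invert the logit: R = 1 / (1 + exp(-(b0 + b1*Xs + b2*Xo)))."""
    linear_predictor = b0 + b1 * x_subjective + b2 * x_objective
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients, chosen only to make the example run.
B0, B1, B2 = -7.0, 0.8, 10.0

# Assumed coding: subjective category 1 (<1%) to 6 (>50%);
# objective risk expressed as a probability between 0 and 1.
low = combined_risk(B0, B1, B2, x_subjective=1, x_objective=0.01)
high = combined_risk(B0, B1, B2, x_subjective=5, x_objective=0.20)
print(f"low-risk example: {low:.4f}, high-risk example: {high:.4f}")
```

With positive coefficients for both predictors, a higher subjective category or a higher objective risk each raises the combined estimate, which is how the two sources of information can correct one another.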
We used decision-curve analysis (DCA) to describe and compare the clinical implications of using each risk model. In DCA, a model is considered to have clinical value if it has the highest net benefit across the whole range of thresholds for which a patient would be labelled as 'high risk'. The net benefit is defined as the difference between the proportion of true positives (labelled as high risk and then going on to die within 30 days of surgery) and the proportion of false positives (labelled as high risk but not going on to die within 30 days) weighted by the odds of the selected threshold for the high-risk label. At any given threshold, the model with the higher net benefit is the preferred model [20,25,29].
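The net benefit calculation described above can be sketched as follows in Python. At a threshold probability pt, patients with predicted risk at or above pt are labelled high risk, and the net benefit is the true-positive proportion minus the false-positive proportion multiplied by the odds pt/(1 - pt). The function name and toy data are hypothetical; a full decision curve would repeat this over a range of thresholds and compare against "treat all" and "treat none" strategies.

```python
from typing import Sequence


def net_benefit(pred: Sequence[float], died: Sequence[int],
                threshold: float) -> float:
    """Net benefit of labelling patients with pred >= threshold as high risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(died)
    true_pos = sum(1 for p, d in zip(pred, died) if p >= threshold and d == 1)
    false_pos = sum(1 for p, d in zip(pred, died) if p >= threshold and d == 0)
    odds = threshold / (1.0 - threshold)
    return true_pos / n - (false_pos / n) * odds


# Toy data (made up): four patients, one death.
pred = [0.6, 0.3, 0.1, 0.05]
died = [1, 0, 0, 0]

model_nb = net_benefit(pred, died, threshold=0.2)
# "Treat all" is equivalent to predicting risk 1.0 for everyone.
treat_all_nb = net_benefit([1.0] * len(died), died, threshold=0.2)
print(model_nb, treat_all_nb)
```

A negative net benefit at a given threshold means that labelling patients with the model does more harm through false positives than good through true positives at that threshold.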

Missing data
The P-POSSUM requires biochemical and haematological data for calculation; however, fit patients may not have preoperative blood tests [30], and in other cases, there may be no time for blood analysis before surgery. Therefore, in cases for which these data were missing, normal physiological ranges were imputed because this most closely follows what clinicians might reasonably do in practice when tests are not indicated or not feasible or results are missing. Following imputation, we performed a complete case analysis because we considered the proportion of cases with missing data in the remaining variables to be low (1.08%) [31].
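The normal-value imputation step described above can be sketched as a simple fill-in of missing laboratory values. The variable names and reference values below are hypothetical and chosen only for illustration; they are not the values used in the study, and only a subset of blood-test inputs is shown.

```python
from typing import Dict, Optional

# Hypothetical "normal" values, for illustration only (not the study's values).
ASSUMED_NORMALS: Dict[str, float] = {
    "sodium_mmol_l": 140.0,
    "potassium_mmol_l": 4.0,
    "haemoglobin_g_l": 140.0,
}


def impute_normals(record: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Return the laboratory inputs with missing entries set to assumed normals."""
    completed = {}
    for name, normal in ASSUMED_NORMALS.items():
        value = record.get(name)
        # Keep a measured value; fall back to the assumed normal when absent.
        completed[name] = normal if value is None else value
    return completed


# Toy record (made up): potassium measured, the other tests not done.
patient = {"sodium_mmol_l": None, "potassium_mmol_l": 5.1}
print(impute_normals(patient))
```

Because this approach treats an untested patient as physiologically normal, it is worth checking, as the fifth sensitivity analysis does, that results hold in patients with complete data.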

Sensitivity analyses
We conducted a number of sensitivity analyses to examine the potential effects of differences in population characteristics on our main study findings. First, we repeated our analyses in a full cohort of patients, including those undergoing obstetric procedures. Second, we repeated the analysis in a subgroup of high-risk patients, defined according to previously published criteria based on age, type of surgery, and comorbidities [15,32]. Third, we evaluated the impact on the accuracy of subjective assessment of using objective tools by comparing discrimination and calibration of subjective assessment in the subgroup of patients whose risk estimates were not solely informed by clinical judgement. Fourth, we repeated our analyses separately in the UK and Australian/New Zealand cohorts to investigate the potential for geographical influences on our findings. Fifth, we examined the potential impact of normal value imputation on missing P-POSSUM values by repeating the analysis on only cases in which no missing P-POSSUM variables were present. Finally, we conducted analyses on surgical specialty subgroups to evaluate the accuracy of the new model created on different subcohorts.
Analyses were performed using R version 3.5.2; p < 0.05 was considered statistically significant. Statistical code is available on request.

Results
Patient data were collected on 26,502 surgical episodes in 274 hospitals across the UK, Australia, and New Zealand (Table 1). A total of 3,871 cases were excluded from all analyses: 3,660 obstetric cases in which there were no deaths, plus a further 286 cases for missing values. This left 22,631 cases with adequate data for external validation of the P-POSSUM, SRS, and SORT models, the first part of our analyses (Fig 1). For the second and third analyses, in which we compared subjective assessment against the best-performing objective risk tool and combined

External validation of existing risk prediction models
The SORT was the best calibrated of the pre-existing models; however, all overpredicted risk (Fig 2A-2C; Hosmer-Lemeshow p-values all <0.001 for the SORT, P-POSSUM, and SRS). All models exhibited good-to-excellent discrimination (

Subjective assessment
There were 188 deaths (1.05%) within 30 days of surgery in the subset of 17,845 patients who had mortality estimates based on clinical judgement and/or ASA-PS alone. Subjective assessment overpredicted risk (Fig 3A, Hosmer-Lemeshow test p < 0.001) but demonstrated good discrimination (Fig 3B and Table 3, AUROC = 0.89, 95% CI: 0.86-0.91), which was not significantly different from the SORT (p = 0.309). Continuous NRI analysis did not show improvement in classification when using the SORT compared with subjective assessment (Table 3 and S4 Text). The 30-day mortality outcomes at each level of clinician risk prediction were cross-tabulated, showing that clinician predictions correlated well with actual mortality outcomes (S2 Table).

Combining subjective and objective risk assessment
Bootstrapped internal validation yielded an optimism-corrected AUROC of 0.92 for a combined model using both subjective assessment and SORT predictions as independent variables (Table 3); this was better than subjective assessment alone (p < 0.001) and SORT alone (p = 0.021) (Table 4). The model also significantly (p < 0.001) improved reclassification compared with subjective assessment alone in continuous NRI analysis (S4 Text). The improved NRI was largely attributable to the correct downgrading of patient risks; that is, a large proportion of patients were correctly reclassified as lower risk using the combined model compared with subjective assessment. The DCA also favoured SORT over the other previously published models, but the combined clinician judgement-SORT model again performed best (Fig 4). The effect of combining information from subjective assessment and the SORT is further demonstrated by computing the conditional probabilities of 30-day mortality using the combined model over a full range of predictor values (Fig 5). When assessing the decision curves across all risk thresholds, the combined model outperformed P-POSSUM and SRS, and beyond approximately the 10% risk threshold, P-POSSUM and SRS demonstrated negative net benefits when they were used. The decision curve for our combined model incorporating both subjective assessment and SORT showed increased net benefit across almost the entire range of risk thresholds versus SORT alone.

Sensitivity analyses
A summary of the different sensitivity analyses is provided in S5 Text. In the first sensitivity analysis (S6 Text), we repeated the main study analyses using the full cohort of patients available from SNAP-2: EPICCS, including those undergoing obstetric procedures, and found minimal differences from our main study findings. The SORT was again the best calibrated of the pre-existing models in this larger cohort, and all objective risk tools again overpredicted risk (S1 Fig; Hosmer
For the second sensitivity analysis (S7 Text), we used previously defined, more restrictive inclusion criteria to identify high-risk patients [15,32]. This yielded a subgroup of 12,985
The third sensitivity analysis (S8 Text) used the subgroup whose mortality estimate was based on clinical judgement in conjunction with any objective risk tool (n = 4,751, S4 Fig).
The AUROC for subjective assessment in this subgroup was 0.88, which was not significantly different from the AUROC in the main cohort (p = 0.769). The calibration of subjective assessment in this subgroup was similar to that in the main cohort, again with a tendency to overpredict risk.
In the fourth sensitivity analysis (S9 Text), we looked for differences in performance of subjective clinical assessment and objective risk tools between the UK and the Australia/New Zealand cohorts (S5 Fig and S3 Table). The 30-day mortality in the Australia/New Zealand cohort (1.09%) was comparable to that of the UK (1.45%, p = 0.127). Visual inspection of calibration plots showed SORT to be worse calibrated in Australasia than the UK. AUROCs for the objective tools in the Australasian subset (P-POSSUM = 0.90, SRS = 0.81, SORT = 0.87) were not significantly different from the AUROCs in the UK subset (P-POSSUM = 0.89, SRS = 0.85, Table 3
For the fifth sensitivity analysis (S10 Text), we used the subgroup of patients who had no missing P-POSSUM variables (n = 18,362; see S1 Table for patient characteristics). Patients with complete P-POSSUM variables appeared to be older, have higher ASA-PS grades, and undergo higher-severity surgery in comparison with those with missing P-POSSUM variables. The AUROC for clinical assessments in the subgroup with full P-POSSUM variables was 0.90, which was not significantly different from the AUROC obtained for clinical assessments in the main study analysis (p = 0.587), and the predictions were similarly calibrated to clinical assessments in the main study analysis, again with a tendency to overpredict risk (S7 Fig). When comparing the performance of P-POSSUM (AUROC = 0.89), SRS (AUROC = 0.84), and SORT (AUROC = 0.90) in this subgroup, the performance was again similar to that of the main study cohort (p > 0.05 for all comparisons).

In the sixth and final sensitivity analysis (S11 Text), we evaluated the AUROC and calibration of the SORT-clinical judgement model in subgroups of patients according to surgical specialty (S4 Table). We found that the AUROC remained high within these subgroups (ranging from 0.87, 95% CI 0.75-0.98 in 1,033 cardiothoracic surgical patients through to 0.95, 95% CI 0.90-0.99 in 4,309 gynaecology and urology patients). Calibration was also good across different specialties, with the exception of vascular surgery (674 patients, AUROC 0.88, 95% CI 0.82-0.94; Hosmer-Lemeshow p-value = 0.009).

Discussion
We present data from an international cohort of patients undergoing inpatient surgery with a low risk of recruitment bias. Despite a plethora of options for objective risk assessment, in over 80% of patients, subjective assessment alone was used to predict 30-day mortality risk. All previously published risk models were poorly calibrated for this cohort of patients, reflecting the common problem of calibration drift over time. However, the combination of subjective clinical assessment with the parsimonious SORT model provides an accurate prediction of 30-day mortality, which is significantly better than any of the methods we evaluated used on their own. These findings should give confidence to clinicians that the combined SORT-clinical judgement model can be used to support the appropriate allocation of finite resources and to inform discussions with patients about the risks of surgery. The combined model accurately downgraded predicted risk compared with other methods; therefore, application of this approach may result in fewer low-risk patients inappropriately admitted to critical care (thus easing system pressures) and may result in fewer patients having their surgery cancelled for the lack of a critical care bed [7]. Finally, application of the SORT-clinical judgement model may assist hospital managers and policy makers in determining the likely demand for postoperative critical care, thus supporting best practice at the hospital, regional, or national level. This new model will now be incorporated into an open-access risk-assessment system (http://www.sortsurgery.com/), enabling clinicians to combine their clinical estimation of risk and the SORT model to evaluate patient risk from major surgery.
To our knowledge, this is the first study comparing subjective and objective assessment for predicting perioperative mortality risk in a large multicentre international cohort. The highest-quality previous studies in this field have been challenged by recruitment bias because of the predominant participation of research active centres and the need for patient consent. For example, the METS (Measurement of Exercise Tolerance before Surgery) study [15], which compared clinical assessment of functional capacity with exercise testing, self-assessment, and a serum biomarker in 1,401 patients, and the VISION study (Vascular Events in non-cardiac surgery cohort study) [32], which evaluated postoperative biomarkers in 15,133 patients, had 27% and 68% screening to recruitment rates, respectively. One way of overcoming such biases would be to study the accuracy of prognostic models using routinely collected or administrative data; however, this is unlikely to enable the evaluation of subjective assessments in multiple centres. Our study avoided these issues through prospective data collection in an unselected cohort with an ethical waiver for patient consent. The mortality in our sample closely matches that recorded in UK administrative data of patients undergoing major or complex surgery [11], therefore supporting our assertion that our cohort was representative of the 'real-world' perioperative population.
Our observation that the majority of risk assessments conducted for perioperative patients do not involve objective measures is also noteworthy because subjective assessment is currently almost never incorporated into risk prediction tools for surgery. One exception is the American College of Surgeons National Surgical Quality Improvement Program Surgical Risk Calculator [33], which incorporates a 3-point scale of clinically assessed surgical risk (normal, high, or very high) to supplement a calculated prediction of mortality and various short-term outcomes. However, this system is proprietary, has rarely been evaluated outside the US, and is substantially more complex than the SORT-clinical judgement model, with 21 input variables compared with 8. Furthermore, their methodology for developing this 'uplift' was quite different from ours, using a panel of 80 surgeons to evaluate 10 case scenarios and grade them in retrospect.
We recognise some limitations to our study. First, models predicting rare events may appear optimistically accurate, as a model that identifies every patient as being at low risk of mortality in a group in which the probability of death approaches 0% would almost always appear to be correct. For this reason, we undertook several sensitivity analyses, including one that evaluated the performance of the various risk-assessment methods in a subgroup of patients who have been defined as high risk in previous studies of prognostic indicators and in whom the mortality rate was higher. We found that the performance of the SORT and subjective assessment remained good and compared favourably with previous evaluations of more complex risk-assessment methods [15,32]. Second, whilst we assumed that subjective assessments were truly clinically based judgements, because this was a pragmatic unblinded study, it was possible that information from other sources may have subconsciously influenced these assessments. For this reason, we undertook the third sensitivity analysis, which refuted this possible risk. Third, the very act of estimating mortality risk may lead clinicians to take actions that improve that risk, therefore biasing the outcome of the assessments made and in particular affecting the calibration of subjective risk estimates. The only way to avoid this risk would be to have used subjective assessments made by clinicians independent of the clinical management of individual patients, and this may be an interesting opportunity for future research. Fourth, since we undertook this study, other promising risk-assessment methods have been developed, including the Combined Assessment of Risk Encountered in Surgery (CARES) system, which was developed using electronic health records; unfortunately, we were unable to externally validate this system because we did not collect all the required variables [34].
We also did not evaluate the accuracy of other risk prediction methods such as frailty assessment or cardiopulmonary exercise testing. However, this was not an a priori objective of our study [18]; furthermore, our observation of the lack of 'real-world' use of these types of predictors is in itself an important finding, particularly given the substantial interest in such measures (some of which carry considerable cost) in the research literature [15,35]. Fifth, the UK cohort was substantially larger than the Australasian cohort; however, we found no significant differences in mortality or accuracy of the various risk-assessment methods between the 2 geographical groups. Finally, the study was conducted entirely in high-income countries; therefore, our findings should now be tested in low- and middle-income nations in order to evaluate global generalisability.
Our finding that the combination of subjective and SORT-based assessment is the best approach is important because it is likely to have face validity with clinicians, thereby improving the likelihood that our new model will be incorporated into clinical practice. There is a sound rationale for this finding, as it is likely that clinicians consider otherwise unmeasured factors that they recognise as important, such as severity of comorbid diseases, frailty, socioeconomic status, patient motivation, and anticipated technical challenges. Modern approaches to risk assessment using machine learning [36] show promise for automating risk prediction and for incorporating data and calculations that clinicians may subconsciously consider when making subjective decisions; however, even these methods do not substantially outperform our simpler approach and are currently limited by recruitment biases and lack of availability. Future research could evaluate the benefits of incorporating clinical judgement into risk-assessment methods in medicine more generally.
Implementation of a widely available, parsimonious, and free-to-use risk-assessment tool to guide clinical decision-making about critical care allocations and other aspects of perioperative care may now be considered particularly important in view of the likely prevalence of endemic COVID-19 leading to an increased demand for critical care facilities. Therefore, now more than ever, risk-based allocation of these resources is important for the benefit of individual patients and the hospitalised population as a whole. Further to this, application of either the SORT or the SORT-clinical judgement model to perioperative population data may assist healthcare policy makers and managers in modelling the likely demand for postoperative critical care, thus improving system level planning and resource utilisation. Based on the results of this large generalisable cohort study, the focus of the perioperative academic community could now shift from evaluation of which risk prediction method might be best to testing the impact of SORT-clinical judgement-based decision-making on perioperative outcomes.
In conclusion, the combination of subjective and objective risk assessment using the SORT calculator provides a more accurate estimate of 30-day postoperative mortality than subjective assessment alone. Implementation of the SORT-clinical judgement model should lead to better clinical decision-making and improved allocation of resources such as critical care beds to patients who are most likely to benefit.

Table. AUROCs of the objective risk tools and subjective assessment, compared between the UK and Australian/New Zealand data subsets. We found no significant difference in discrimination using any of the risk prediction tools or using subjective assessment when comparing their performance in the UK and Australian/New Zealand data sets.