Performance And Agreement Of Risk Stratification Instruments For Postoperative Delirium In Persons Aged 50 Years Or Older

Several risk stratification instruments for postoperative delirium in older people have been developed because early interventions may prevent delirium. We investigated the performance and agreement of nine commonly used risk stratification instruments in an independent validation cohort of consecutive elective and emergency surgical patients aged ≥50 years with ≥1 risk factor for postoperative delirium. Data were collected prospectively. Delirium was diagnosed according to DSM-IV-TR criteria. The observed incidence of postoperative delirium was calculated per risk score per risk stratification instrument. In addition, the risk stratification instruments were compared in terms of area under the receiver operating characteristic (ROC) curve (AUC), and positive and negative predictive value. Finally, the positive agreement between the risk stratification instruments was calculated. When data required for an exact implementation of the original risk stratification instruments were not available, we used alternative data that were comparable. The study population included 292 patients: 60% men; mean age (SD), 66 (8) years; 90% elective surgery. The incidence of postoperative delirium was 9%. The maximum observed incidence per risk score was 50% (95%CI, 15–85%); for eight risk stratification instruments, the maximum observed incidence per risk score was ≤25%. The AUC (95%CI) for the risk stratification instruments varied between 0.50 (0.36–0.64) and 0.66 (0.48–0.83). No AUC differed significantly from 0.50 (p≥0.11). Positive predictive values of the risk stratification instruments ranged from 0% to 25%, negative predictive values from 89% to 95%. Positive agreement ranged from 0% to 66%. No risk stratification instrument showed clearly superior performance. In conclusion, in this independent validation cohort, the performance and agreement of commonly used risk stratification instruments for postoperative delirium were poor. Although some caution is needed because the risk stratification instruments were not implemented exactly as described in the original studies, we think that their usefulness in clinical practice can be questioned.


Introduction
As a result of changing population demographics, an increasing number of older patients are undergoing surgery. It is estimated that in 2020, the number of surgical procedures performed in persons aged 65 years or older in the United States will be 14% to 47% higher (dependent on specialty) than in 2001 [1]. Importantly, more than 40% of the patients in this age group experience a major postoperative complication [2], of which postoperative delirium is among the most common [3]. Postoperative delirium is associated with worse outcomes in older patients but can be prevented by tailored interventions that address a number of modifiable risk factors [4]. Therefore, current guidelines recommend routine preoperative assessment of delirium risk in this age group [3].
For adequate assessment of postoperative delirium risk, reliable risk stratification instruments are essential. Ideally, a risk stratification instrument correctly identifies older surgical patients who are at increased risk of postoperative delirium and are likely to benefit from preoperative and postoperative interventions to prevent delirium [4]. Several risk stratification instruments for delirium have been developed since the early 1990s [5][6][7][8][9][10][11][12][13][14][15]. Most of them are based on well-known risk factors for delirium, such as high age, cognitive impairment and alcohol abuse, and were found to have a very good to excellent performance. For some risk stratification instruments, the positive predictive value for incident delirium was 83 percent or higher [8,11,13]. Nevertheless, the generalizability of many of these risk stratification instruments can be questioned because their performance has only been investigated in highly specific patient populations, such as patients undergoing cardiac surgery [13], patients with elective hip or knee arthroplasty [9], or patients with hip fracture [10]. Furthermore, the validity of several risk stratification instruments has been tested in only one or two independent validation samples since their development [9][10][11][12]. Therefore, the performance and relevance of current risk stratification instruments for delirium are still unclear.
The aim of this study was to investigate the performance of commonly used risk stratification instruments for postoperative delirium in older patients in an independent validation sample. The study sample included a total of 292 persons aged 50 years or older who underwent elective or emergency surgery.

Study population
The study was performed at the University Medical Center Groningen, a 1,300-bed university hospital in the northern Netherlands. The study population included all consecutive elective and emergency surgical patients aged 50 years or older who were admitted between 1 October 2011 and 1 June 2012 and met at least one of the following inclusion criteria: memory problems; dependency in activities of daily living (ADL) during the last 24 hours; history of confusion during previous illness or hospitalization; alcohol abuse; thoracic or abdominal surgery; age >70 years (for emergency admission patients); planned ICU admission (for elective admission patients). The first three criteria (memory problems, dependency in ADL, and history of confusion) are part of the standard Hospital Patient Safety Program in the Netherlands [16]. Exclusion criteria were: delirium at admission; laparoscopic cholecystectomy or appendectomy; expected length of stay <48 hours. Patients with hip fracture were not included because they took part in another study that interfered with the aims of this study.

Ethics statement
The study was approved by the Medical Ethical Committee (METc) of the University Medical Center Groningen, Groningen, the Netherlands, and was conducted in accordance with the guidelines of the Declaration of Helsinki. In accordance with the Dutch Medical Research (Human Subjects) Act, we did not seek written informed consent from the participants as all data were collected as part of standard patient care. This procedure was approved by the Medical Ethical Committee of the University Medical Center Groningen, Groningen, the Netherlands. The authors BLL and GJI were involved with the collection of the data and had access to identifying information. The data were anonymized prior to analysis.

Data collection
All data were collected prospectively by trained research nurses. On hospital admission, medical records were studied for inclusion and exclusion criteria, reason for admission, illness severity (clinical impression), medical history and current laboratory data. In addition, the Acute Physiology and Chronic Health Evaluation (APACHE) II score was calculated [17]. Patients were interviewed within two days of admission to collect data on physical, cognitive and psychological function before admission. This interview included questions contained in the Groningen Frailty Indicator (GFI) [18]. Type of surgery was ascertained from the patient's medical record.

Delirium assessment and definition
The incidence of postoperative delirium was determined prospectively. The Delirium Observation Screening (DOS) scale was used to screen for delirium [19,20]. The DOS scale was developed to assess symptoms of delirium based on observations during regular nursing care and can be used as a screening tool as well as a measure of severity of delirium [21]. It is part of the standard Hospital Patient Safety Program in the Netherlands [16]. The DOS scale includes 13 items and was administered by regular ward nurses once per shift (day, evening, and night). The lowest score is 0 points (normal behavior), the highest score is 13 points (strongly altered behavior). The cut-off point is usually set at 3 points, with a score ≥3 points indicating delirium (negative predictive value, 99-100%; positive predictive value, 47-89%) [20,22]. In this study, patients with a score ≥3 points were visited on the same day by a geriatrician for further investigation. The geriatrician evaluated the presence or absence of delirium according to the criteria of the Diagnostic and Statistical Manual of Mental Disorders, fourth edition, text revision (DSM-IV-TR): 1. disturbance of consciousness with reduced ability to focus, sustain, or shift attention; 2. a change in cognition or the development of a perceptual disturbance that is not better accounted for by a preexisting, established, or evolving dementia; 3. the disturbance develops over a short period of time (usually hours to days) and tends to fluctuate during the course of the day; 4. there is evidence from the history, physical examination, or laboratory findings that the disturbance is caused by a medical condition, substance intoxication, or medication side effect [23].

Risk stratification instruments
A literature search was performed to identify relevant risk stratification instruments for delirium in adult hospital patients (Figure 1). First, MEDLINE/PubMed (1966 to July 2014) was searched with the key concepts "delirium" AND "risk factor". From this search, we retrieved all potentially relevant original articles published in English since 1966 if the abstract suggested the development of a risk stratification instrument for delirium, and all systematic and nonsystematic reviews published in English since January 2000 if the abstract included a description of risk factors for delirium. The reviews were used to identify additional potentially relevant original articles by careful scanning of the texts and reference lists by one of the authors (CJJ). This yielded a total of 60 original articles. Of these, seven articles were excluded after examination of the full-text version because they did not include the description of a risk stratification instrument. The remaining 53 articles were carefully read and included in a cited reference search using Web of Science (Thomson Reuters, New York, NY), which yielded another five original studies. Thus, in total, we found 58 original studies about risk stratification instruments for delirium. For the present analysis, we included studies if they described a risk stratification instrument for delirium that was developed for practicing clinicians and based on patient characteristics that are commonly identified at hospital admission, and if the risk stratification instrument was validated in at least one independent cohort. Studies were excluded if the risk stratification instrument was highly specific for one type of patient, such as patients in Intensive Care or Stroke units, or if (alternative) data on risk factors were not available.
Eventually, nine risk stratification instruments were included [5][6][7][8][9][10][11][12][13][14][15]: four were developed in medical patients [5,8,14,15], two in noncardiac surgery patients [6,11], one in medical and noncardiac surgery patients [7], one in cardiac surgery patients [13], and one in patients with elective arthroplasty or hip fracture [9,10].

Application of the risk stratification instruments
The risk stratification instruments were applied retrospectively to the study population. Although the risk stratification instruments included in this study are based on common risk factors for delirium, the definition and assessment of the risk factors vary widely between the risk stratification instruments (Table S1). For example, cognitive impairment is defined as Mini-Mental State Examination (MMSE) score <24 points by Inouye et al. [5], and as Blessed Dementia Rating Scale (BDRS) score >4 points by O'Keeffe et al. [8]. Similarly, alcohol abuse is defined as Short Michigan Alcoholism Screening Test (SMAST) score >1 point by Pompei et al. [7], and as alcohol >3 times per week by Freter et al. [9,10]. As a result, some data required for an exact implementation of the original risk stratification instruments were not available. Therefore, some definitions of risk factors were substituted with alternative definitions involving data that were available (Table S1). [...] (2009), we defined the cut-off point for high risk as >1 point, >2 points, >2 points and >1 point, respectively. For these risk stratification instruments, the authors did not propose cut-off points. However, the cut-off points that we defined identified patients in whom the risk of postoperative delirium was at least 25% in the original studies [8][9][10][11][13]. This was comparable to the risk of postoperative delirium in the high risk groups that were identified by the other risk stratification instruments.

Statistical analyses
Normally distributed data are presented as mean and standard deviation (SD). Nonnormally distributed data are presented as median and interquartile range (IQR). The incidence rate of postoperative delirium was calculated per risk score per risk stratification instrument. The 95% confidence intervals (CI) of the incidence rates were calculated as advised by Newcombe and Altman because the absolute number of incident cases was low [24]. Then, we calculated sensitivity and specificity and used receiver operating characteristic (ROC) curves to evaluate the predictive validity of each risk stratification instrument. In ROC curves, an area under the curve (AUC) between 0.50 and 1.00 indicates that the risk stratification instrument performs better than chance. In addition, we calculated the positive and negative predictive value in our study population. Positive agreement (the percentage of patients identified as being at high risk by two different risk stratification instruments) was calculated as advised by Cicchetti and Feinstein [25]; the 95% confidence intervals of positive agreement were calculated as advised by Mckinnon [26]. The level of statistical significance was set
Figure 2. Observed incidence rate of postoperative delirium by risk stratification instrument (first author, year of publication) and risk score. Bars represent 95% confidence intervals. Dashed lines correspond to an incidence rate of 25%. * 95% confidence interval omitted because category included only one person. ** For the risk stratification instruments of Inouye (1993), Marcantonio (1994), Pompei (1994), Martinez (2012), and Kobayashi (2013) the cut-off point was defined by the authors of the original study. For the definition of the cut-off points of the other risk stratification instruments, see text. For the number of persons per risk score, see Table S2.

Sensitivity analyses
Because the study population was relatively young, we repeated the analyses in a subsample of older patients (>60 years). We also repeated the analyses where possible with other definitions of the risk factors to investigate whether the results were dependent on the (possibly arbitrary) definition of the risk factors. This was done because some of the risk factors could not be implemented exactly as described in the original studies. Sensitivity analyses could be done for comorbidity, dependency in activities of daily living (ADL), and impairment in executive function.

Study population
The study population included a total of 292 patients, of whom 60% were men (Table 1). Their mean age (SD) was 66 (8) years; 75% were aged >60 years and 31% were aged >70 years. Most patients (90%) underwent an elective surgical procedure, for either an oncological or a benign diagnosis. Seventy-two percent of the participants had two or more comorbidities and 51% used four or more medications (Table 1). The incidence of postoperative delirium was 9% (95%CI, 6-13%).
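The confidence intervals for incidence rates in this study follow the advice of Newcombe and Altman [24]. The sketch below assumes that this refers to the Wilson score method, which behaves better than the simple normal approximation when the number of events is small. The function name and the example counts are hypothetical and are not taken from the study data.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= events <= n:
        raise ValueError("events must lie between 0 and n")
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical example: 2 delirium cases in a risk-score category of 4 patients.
low, high = wilson_interval(2, 4)
print(f"observed 50%, 95% CI {low:.0%} to {high:.0%}")
```

With so few patients in a category, the interval is very wide, which is why point estimates for sparsely populated risk scores should be read with caution.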

Content of risk stratification instruments
The nine risk stratification instruments comprised many different risk factors (Table 2). The number of risk factors per risk stratification instrument varied between two and six. Many risk factors were included in several risk stratification instruments. The most common risk factors were cognitive impairment (in seven risk stratification instruments), high age (in four risk stratification instruments), and alcohol abuse and dependency in activities of daily living (in three risk stratification instruments). In our study population, there were large differences between the risk stratification instruments as well as within the risk stratification instruments in the prevalence of the risk factors (Table 2). For example, the risk stratification instrument of Greene (2009) comprised two risk factors with a prevalence rate of 30% whereas the three risk factors included by the risk
Table notes: a The definition of some risk factors differed from their definition in the original studies (see Table S1). b For some variables, N<292 due to missing data. c Letter fluency was measured in a subset of patients.

Predictive performance
The highest observed incidence of postoperative delirium for any risk stratification instrument and risk score was 50% (95%CI, 15-85%) which was found for patients with two points according to the risk stratification instrument of Martinez (2012) [14]. However, for eight risk stratification instruments, the highest observed incidence rate of postoperative delirium per risk score was equal to or less than 25% (Figure 2). In addition, some risk stratification instruments did not show a clear association between observed incidence of postoperative delirium and risk score (Figure 2). ROC curve analysis showed that the risk stratification instruments did not predict postoperative delirium better than chance (Figure 3). For all risk stratification instruments, the AUC was not statistically different from 0.50 (Table 3). If the outcomes of the risk stratification instruments were dichotomized into being at low vs. high risk of postoperative delirium, the positive predictive values of the risk stratification instruments were between 0% and 25% and the negative predictive values between 89% and 95% (Table 3).
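The test characteristics reported above can be illustrated with a minimal sketch. It computes the AUC as the probability that a randomly chosen patient with delirium has a higher risk score than a randomly chosen patient without delirium (ties count as one half), and the positive and negative predictive value after dichotomizing the score. The function names, the rule that a score at or above the cut-off means high risk, and the example data are assumptions for illustration only; they are not the study's data or software.

```python
def auc_from_scores(scores, outcomes):
    """AUC via pairwise comparison of cases (outcome 1) and non-cases (outcome 0)."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # tied scores carry no discriminating information
    return wins / (len(cases) * len(controls))


def predictive_values(scores, outcomes, cutoff):
    """Return (PPV, NPV), treating score >= cutoff as high risk (an assumption)."""
    tp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, outcomes) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, outcomes) if s < cutoff and y == 0)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical risk scores and delirium outcomes for eight patients.
scores = [0, 1, 1, 2, 3, 0, 2, 1]
outcomes = [0, 0, 1, 0, 1, 0, 0, 0]
print(auc_from_scores(scores, outcomes))
print(predictive_values(scores, outcomes, cutoff=2))
```

An AUC of 0.50 in this formulation means that a patient with delirium is no more likely to have the higher score than a patient without delirium, which is the sense in which an instrument performs no better than chance.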

Agreement
The positive agreement between the risk stratification instruments varied between 0 and 57% (95%CI, 26-88%). On average, the risk stratification instruments of Inouye (1993) and Pompei (1994) showed the lowest positive agreement with other risk stratification instruments (Table S3).
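Positive agreement between two instruments can be sketched as follows, assuming the index of specific positive agreement commonly attributed to Cicchetti and Feinstein: twice the number of patients classified as high risk by both instruments, divided by the total number of high-risk classifications made by the two instruments. The function name and the example classifications are hypothetical.

```python
def positive_agreement(high_risk_a, high_risk_b):
    """Specific positive agreement: 2a / (2a + b + c).

    a = high risk according to both instruments,
    b = high risk according to instrument A only,
    c = high risk according to instrument B only.
    """
    if len(high_risk_a) != len(high_risk_b):
        raise ValueError("both instruments must classify the same patients")
    both = sum(1 for a, b in zip(high_risk_a, high_risk_b) if a and b)
    only_a = sum(1 for a, b in zip(high_risk_a, high_risk_b) if a and not b)
    only_b = sum(1 for a, b in zip(high_risk_a, high_risk_b) if b and not a)
    denom = 2 * both + only_a + only_b
    # Undefined when neither instrument flags any patient as high risk.
    return 2 * both / denom if denom else float("nan")


# Hypothetical high-risk classifications of six patients by two instruments.
instrument_a = [True, True, False, False, True, False]
instrument_b = [True, False, False, True, False, False]
print(positive_agreement(instrument_a, instrument_b))
```

Patients classified as low risk by both instruments do not enter the calculation, so a large low-risk majority cannot inflate the index.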

Sensitivity analyses
The analyses yielded essentially similar results when they were repeated in patients aged >60 years (mean age, 69; SD, 7 years). The incidence of postoperative delirium in this age group was 10 percent (95%CI, 7-15%). For all risk stratification instruments, the test characteristics in persons aged >60 years were comparable to the test characteristics in persons aged ≥50 years (Text S2, Table A). The performance of the risk stratification instruments was also essentially similar for different definitions of comorbidity (risk stratification instrument of Pompei, 1994

Discussion
Reliable prediction of postoperative delirium is essential for the planning of good perioperative care in older persons. If it is recognized early that an older surgical patient is at increased risk of postoperative delirium, it is possible to select and tailor interventions that may prevent delirium [4], and to inform a patient properly about the risks of surgery. However, in this study, we found that commonly used risk stratification instruments performed no better than chance in distinguishing between patients at low or high risk of postoperative delirium. Accordingly, the positive predictive value of the risk stratification instruments was poor. Also, the agreement between the risk stratification instruments in identifying patients at high risk of postoperative delirium (positive agreement) was low. Therefore, the generalizability of these commonly used risk stratification instruments is probably limited.
All risk stratification instruments that were investigated in this study were previously evaluated in at least one independent validation sample. Most risk stratification instruments were developed and evaluated in studies that included a development and independent validation sample from the same target population [5-8, 13, 14]. Other risk stratification instruments were developed and evaluated in separate studies that included different categories of patients [9][10][11][12], such as, for example, patients undergoing elective hip or knee arthroplasty, or patients with hip fracture [9,10]. Nonetheless, most risk stratification instruments performed far better in the original studies than in this study. Whereas several original studies reported positive predictive values between 40% and 100% [6-8, 10, 11, 13, 14], this study found positive predictive values that were only between 0% and 25%. Thus, the risk stratification instruments yielded highly divergent results in different patient populations.
The large differences in performance of the risk stratification instruments could be ascribed to several factors. First, there was a difference between our study and the original studies in the definition and assessment of a number of risk factors included in the risk stratification instruments. This was due to the unavailability of some data required for the exact implementation of the risk stratification instruments. Although this could have influenced some of the results, the effect is likely to be small if it is assumed that the risk stratification instruments are robust. Second, the incidence rate of delirium differed between the original development and validation studies and the current study. In most original studies, the incidence of delirium was between 15% and 52% [7,13]. Thus, compared to these incidence rates, the incidence of delirium in the current study (9%) was relatively low. Third, some of the risk stratification instruments were developed in medical patients [5,8,14,15], whereas the current study involved surgical patients. Fourth, several risk stratification instruments were developed in patient populations that were considerably older than the patient population of the current study [5, 7-10, 14, 15]. On the other hand, all risk stratification instruments were based on the same conceptual model that is widely accepted among experts in the field. In this conceptual model, the onset of delirium is not caused by one single factor but is the outcome of a complex interaction of various risk factors [4]. Many of these risk factors have been identified and are included in the risk stratification instruments that were investigated in this study. Consequently, it is not likely that the performance of these risk stratification instruments is strongly dependent on the characteristics of a specific study population.
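The second factor can be made concrete with Bayes' rule: for fixed sensitivity and specificity, the positive predictive value falls as the incidence of delirium falls. The sketch below uses hypothetical sensitivity and specificity values that are not estimates from this study or from the original studies; only the two incidence levels (9% and 30%) are chosen to resemble the range discussed above.

```python
def ppv_from_incidence(sensitivity, specificity, incidence):
    """Positive predictive value implied by Bayes' rule."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("incidence", incidence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    true_pos = sensitivity * incidence
    false_pos = (1.0 - specificity) * (1.0 - incidence)
    if true_pos + false_pos == 0.0:
        return float("nan")  # nobody is classified as high risk
    return true_pos / (true_pos + false_pos)


# Hypothetical instrument: sensitivity 0.70 and specificity 0.80 (assumed values).
for incidence in (0.09, 0.30):
    ppv = ppv_from_incidence(0.70, 0.80, incidence)
    print(f"incidence {incidence:.0%}: PPV {ppv:.0%}")
```

Under these assumed values the same instrument would have a positive predictive value of about 26% at an incidence of 9% and 60% at an incidence of 30%. A lower incidence therefore lowers the positive predictive value by itself, but it does not explain an AUC close to 0.50, because the AUC does not depend on incidence.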
To our knowledge, this is the first study that included data on agreement between risk stratification instruments for (postoperative) delirium. Interestingly, it was found that for most risk stratification instruments, positive agreement was very low. This implies that the various risk stratification instruments identified very different patients as being at high risk for postoperative delirium. This low positive agreement was somewhat surprising as the risk stratification instruments shared various risk factors such as, for example, older age, cognitive impairment, alcohol abuse and visual or hearing impairment, that are established risk factors for delirium [4]. It is unlikely that the low positive agreement is due to a different definition and assessment of these risk factors in the distinct risk stratification instruments as in this study, their definition and assessment was very similar. Therefore, the low positive agreement might be due to differences between the risk stratification instruments in the combination of risk factors although in our opinion, this would point to a certain lack of robustness of the commonly accepted risk factors for delirium. A more likely explanation is that the etiology of (postoperative) delirium is far more complex than currently understood and that probably, important risk factors have yet to be discovered. Although the concept of predisposing and precipitating risk factors is widely accepted [4], the common risk stratification instruments are mainly based on predisposing risk factors only. Possibly, predictive performance and agreement of the risk stratification instruments could be improved by adding clearly defined and quantifiable precipitating risk factors that are part of anesthetic and surgical procedures.
The positive predictive value is probably the most important test characteristic of a risk stratification instrument for postoperative delirium as the incidence rate of postoperative delirium may be relatively low. In this study, the differences in test characteristics between most risk stratification instruments were small but the best positive predictive value was found for the risk stratification instruments of [...] and Marcantonio (1994). In our opinion, these risk stratification instruments are equally easy to use in clinical practice.
Some limitations of our study have to be discussed. First, as discussed above, a number of risk factors were defined differently compared to the original studies. This was most clear for cognitive impairment, which was defined by the performance on a formal screening test in some studies [5,6], and by the positive answer to only one question in our study. However, in our opinion, this is not a sufficient explanation for the low performance of the risk stratification instruments because there are also differences between the original studies in the definition of risk factors. For example, cognitive impairment was defined as MMSE score <24 points in the study by Inouye et al. [5], as cognitive status interfering with social functioning in the study by O'Keeffe et al. [8], and as MMSE score <24 points or previous postoperative delirium in the studies by Freter et al. [9,10]. Second, the observed incidence of postoperative delirium was relatively low. Although some cases of delirium could have been missed, this is unlikely as the DOS scale was used and this scale has a high negative predictive value for delirium [20,22]. Moreover, the incidence of postoperative delirium in this study was comparable to that in some of the original studies [6,9,11]. Third, the risk stratification instruments were applied retrospectively. Although this could have caused some errors in the risk stratification of individual patients, we think that this effect is small because all data used for the application of the risk stratification instruments were collected prospectively. Fourth, some risk stratification instruments were not developed in surgical patients but in medical patients. However, it is not feasible for clinicians to use different risk stratification instruments for different types of patients. Therefore, most clinicians use the risk stratification instrument of their choice for every kind of patient.
Our study also has several strengths. First, the study sample included consecutive patients from diverse surgical specialties. Second, all data was collected prospectively. Third, all patients were routinely screened for delirium with the DOS scale which has a high negative predictive value, and if the screening was positive, patients were further investigated by an expert geriatrician. Fourth, and most importantly, our study comprised a study population that was wholly independent from the development and validation samples of the original studies.
In conclusion, in this independent validation cohort, the performance and agreement of commonly used risk stratification instruments for (postoperative) delirium were poor. However, the translation of these findings into clinical practice requires some caution because the implementation of the risk stratification instruments in this study was not identical to the implementation in the original studies. Nevertheless, we think that the usefulness of the current risk stratification instruments for delirium can be questioned and that these instruments need more rigorous evaluation in well-designed prospective studies that include different clinical settings and patient populations.