The Reliability and Sensitivity of the National Institutes of Health Stroke Scale for Spontaneous Intracerebral Hemorrhage in an Uncontrolled Setting

Background and Purpose The National Institutes of Health Stroke Scale (NIHSS) is commonly used to measure neurologic function and guide treatment after spontaneous intracerebral hemorrhage (ICH) in routine stroke clinics. We evaluated its reliability and sensitivity to detect change with consecutive and unique rater combinations in a real-world setting. Methods Conservative measures of interrater reliability (unweighted Kappa (κ) and Intraclass Correlation Coefficient (ICC1,1)) and sensitivity to detect change (Minimal Detectable Difference (MDD)) were estimated. Sixty-one pairs of repeated ratings were completed within 1 week after ICH by physicians and nurses with no investigator intervention. Results Reliability (consistency) of the NIHSS total score was good for both physicians vs. nurses and nurses vs. nurses (ICC=0.78, 95%CI: 0.58-0.89 and ICC=0.75, 95%CI: 0.55-0.87 respectively) in this scenario. Reliability (agreement) of items 1C and 9 was excellent (κ>=0.61) for both rater comparisons; however, reliability was poor to fair on most remaining items (κ:0.01-0.60), with item 11 being completely unreliable in this scenario (κ<0.01). The MDD95 of the total NIHSS score was ±10 and ±11 points for physician vs. nurse and nurse vs. nurse comparisons respectively. Conclusions The reliability of the NIHSS is good overall for ICH even in an uncontrolled setting. However, on repeated measurements, changes in total NIHSS score of at least 10 points need to be observed for clinicians to be confident that real changes have occurred within 1 week after ICH.


Introduction
The National Institutes of Health Stroke Scale (NIHSS) is a well-known scale, originally designed to assess stroke severity in controlled clinical studies of ischemic stroke [1]. Despite this, it is now commonly used to measure neurologic function and guide treatment after spontaneous intracerebral hemorrhage (ICH) in day-to-day clinical settings as well [2]. Currently, however, the sensitivity of the NIHSS for detecting changes after treatment is unclear, and reliability estimates from previous studies using distinct, controlled raters overestimate reliability in routine settings, where raters are often transient and interchangeable. Without knowing the reliability or sensitivity to detect change in uncontrolled settings with typical raters, it is impossible to appropriately quantify clinically meaningful neurologic changes after treatment using this scale [3]. We evaluated the reliability and sensitivity to detect change of the NIHSS for ICH patients in a typical, routine clinical setting with a realistic set of consecutive raters.

Methods
The study protocol was approved by the University of Calgary Conjoint Health Research Ethics Board. We obtained a waiver of written patient consent to conduct this study. A consecutive series of 48 patients with ICH was followed prospectively in a stroke unit at a university hospital. Patients were included if they were adults (>=18 years) and had an imaging-confirmed ICH. Patients were excluded only if they had an illness that interfered with neurological assessments, or if paired measurements were taken more than four hours apart.
Raters of the NIHSS were physicians and nurses trained in stroke who were blinded to the study protocol. There was no specific, defined set of raters chosen for this study. Rather, raters were enrolled consecutively into the study and represented typical raters who would normally evaluate patients in routine settings; they were not excluded based on their level of professional training or experience. Two raters completed NIHSS measurements within the first week after ICH. No formal training was provided for this study, although it is a policy at our centre to ensure that all clinicians are NIHSS-certified prior to assessing stroke patients.
Interrater reliability of the total NIHSS score was quantified using an Intraclass Correlation Coefficient (ICC) model (1,1) [4]. This model was appropriate since all ratings were performed by a different set of raters [4], which would be expected in routine settings since clinical rotations are often highly variable. Thus, an interrater ICC (1,1) can be considered a realistic estimate of reliability for this scenario, in contrast to a model 2 ICC, which is used in the majority of reliability studies when a specific group of raters is defined a priori [5]. Interrater reliability of individual item scores was quantified using a conservative unweighted Kappa coefficient.
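The two reliability statistics named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used in this study: it assumes the standard one-way random-effects formulation of ICC(1,1), computed from between-patient and within-patient mean squares, and the standard unweighted Cohen's kappa. The function names and input layout (one tuple of ratings per pair) are hypothetical.

```python
def icc_1_1(ratings):
    """ICC(1,1) from a list of per-subject rating tuples (one-way random effects).

    Assumes every subject has the same number of ratings (k >= 2).
    """
    n = len(ratings)          # number of rated subjects (pairs of ratings)
    k = len(ratings[0])       # ratings per subject
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-subject mean square
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-subject mean square (rater effects are not separated out in model 1)
    msw = sum((v - m) ** 2
              for row, m in zip(ratings, row_means)
              for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def unweighted_kappa(first, second):
    """Unweighted Cohen's kappa for two equally long lists of item scores."""
    n = len(first)
    categories = set(first) | set(second)
    # Observed proportion of exact agreement
    p_obs = sum(1 for a, b in zip(first, second) if a == b) / n
    # Agreement expected by chance from the marginal proportions
    p_exp = sum((first.count(c) / n) * (second.count(c) / n) for c in categories)
    # Undefined (division by zero) when both raters use a single identical category
    return (p_obs - p_exp) / (1 - p_exp)
```

Because model 1 does not separate a rater effect from residual error, any systematic difference between raters lowers the coefficient, which is one reason it is regarded as conservative for settings with interchangeable raters.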
Sensitivity to detect change of the total NIHSS score was estimated at different levels of confidence using the Minimal Detectable Difference (MDD) [5,6]. The MDD is a statistical measure that accounts for normal variability in clinician measurements over a large group of patients and identifies the smallest amount of change that is required to detect any improvement or decline in the natural units of a scale [5]. The MDD does not describe clinically meaningful changes in scores; rather, it quantifies a level of statistical uncertainty surrounding specific NIHSS scores so clinicians can assess how likely it is that they have captured 'true' improvement or worsening. Factors associated with absolute disagreement on individual scale items and magnitude of disagreement on the total NIHSS score between raters were investigated using logistic and linear regression respectively. The required sample size for this study was estimated to be at least 22 paired-ratings per rater comparison [7].
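For readers unfamiliar with the MDD, the following Python sketch shows one commonly used distribution-based formulation. It is an assumption that this matches the calculation used in this study, since the exact formula is not stated here: the standard error of measurement is taken as SEM = SD × sqrt(1 − ICC), and the MDD at a given two-sided confidence level as z × sqrt(2) × SEM. The function names and the standard deviation in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), in the natural units of the scale."""
    return sd * sqrt(1.0 - icc)


def minimal_detectable_difference(sd, icc, confidence=0.95):
    """MDD at a two-sided confidence level (assumed normal-based formulation)."""
    # Two-sided z value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # sqrt(2) reflects that a change score combines the error of two measurements
    return z * sqrt(2.0) * standard_error_of_measurement(sd, icc)


# Usage: sd=8.0 is a made-up value for illustration; icc=0.78 is the
# physician vs. nurse estimate reported in the abstract.
example_mdd95 = minimal_detectable_difference(sd=8.0, icc=0.78)
```

Under this formulation the MDD shrinks as reliability rises and grows with the spread of scores in the sample, so a scale can show good reliability and still have a wide MDD.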

Results
Sixty-one pairs of ratings were completed across 38 patients. Ten patients were excluded because repeated measurements were taken more than four hours apart. All 61 pairs of ratings were performed by 61 independent and unique combinations of physician and nurse raters. The characteristics of the patients included in each rater comparison are described in Table 1. Reliability of the NIHSS total score was good for both physicians vs. nurses and nurses vs. nurses in this scenario. The full results of reliability and sensitivity to detect change analyses are presented in Table 2.

Discussion
Neurologic outcome scales such as the NIHSS are commonly used to assess neurologic function and determine how patients with stroke respond to treatment in day-to-day clinical settings. To our knowledge, this was the first study to evaluate the reliability and sensitivity to detect change of the NIHSS for ICH specifically and the first study to examine these properties for the NIHSS using a heterogeneous group of consecutive raters in an uncontrolled setting. Assessing the reliability of the NIHSS in an uncontrolled environment establishes a benchmark of what would be expected in daily practice, in the naturalistic setting of a tertiary care stroke program, and therefore the ICC estimated for the total NIHSS score in this study could be viewed as a conservative estimate of reliability [5]. We are confident that estimates presented in this study are generalizable to other routine stroke clinics but stress that they are not generalizable to settings where control of raters is implied such as in a randomized clinical trial of therapy.
This study suggests that the reliability of the total NIHSS score was good in an uncontrolled setting but, as expected, lower than in previous investigations with pre-defined raters [8], and that it may be affected by patient age, sex, and ICH location [9]. NIHSS measurements are never error-free in any scenario. The MDD is a statistical measure that accounts for this error and quantifies the smallest amount of change the NIHSS can accurately measure [5]. This study demonstrates that, in an uncontrolled clinical setting, an observed change (worsening or improvement) of 3 points in the total NIHSS score can be considered real with 50% certainty at best over a large group of patients, because of natural errors in measurement, although such a change may be clinically meaningful for some individual patients. The degree of error that affects individual NIHSS measurements is therefore fairly substantial, despite good observed reliability overall.
Clinicians should define clinical improvement outside the range of the natural statistical error of NIHSS scores. Specifically, to conclude with substantial certainty (95%) that observed measurements reflect real neurologic changes, a change must be at least 10 points from the baseline or previous score when nurses and physicians are making the measurements.
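The relationship between the 95% and 50% thresholds discussed above can be illustrated with a short Python sketch. It assumes, as a simplification not stated in this study, that the MDD is proportional to the two-sided normal z value, so an MDD known at one confidence level can be rescaled to another. Under that assumption, an MDD95 of 10 points corresponds to roughly 3.4 points at 50% confidence, in line with the 3-point figure above. The function names are hypothetical.

```python
from statistics import NormalDist


def two_sided_z(confidence):
    """z value for a two-sided confidence level (about 1.96 for 0.95)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def rescale_mdd(mdd_known, conf_known, conf_target):
    """Rescale an MDD between confidence levels, assuming MDD is proportional to z."""
    return mdd_known * two_sided_z(conf_target) / two_sided_z(conf_known)


def exceeds_mdd(baseline, follow_up, mdd_value):
    """True when the absolute change in total score reaches the MDD threshold."""
    return abs(follow_up - baseline) >= mdd_value


# MDD95 of 10 points (physician vs. nurse) rescaled to 50% confidence
mdd50 = rescale_mdd(10.0, 0.95, 0.50)

# A 3-point change does not reach the 95% threshold; a 10-point change does
small_change_is_real = exceeds_mdd(baseline=12, follow_up=9, mdd_value=10.0)
large_change_is_real = exceeds_mdd(baseline=12, follow_up=2, mdd_value=10.0)
```

Such a check addresses statistical detectability only; it says nothing about whether a change of that size is clinically important.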
As with many previous studies of reliability, we assessed consistency and agreement between raters while taking multiple measurements within the same set of patients. Reliability studies attempt to quantify and describe the interaction between raters and patients in different scenarios; thus, the unit of analysis in reliability studies is 'ratings' rather than 'patients', which is atypical for most clinical studies. Specifically, reliability coefficients are measures that describe rater-patient interactions, and therefore can only be valid if the combinations of raters, patients, and times of assessment are independent and mutually exclusive across each pair of ratings, as they were in our study. Further, we reiterate that this study did not assess clinically meaningful changes on the NIHSS. Rather, this study evaluated the errors associated with rating the NIHSS using a statistical distribution-based method. Clearly, further studies are still needed to identify what magnitude of change is necessary on the NIHSS to observe clinically important changes. Assumptions cannot be made regarding clinically important changes on a scale if it is unknown what strength of signal is required to overcome the natural error of a scale and register a change to begin with. Thus, this study provides evidence for these future investigations.

Conclusion
The NIHSS total score is reliable for ICH even in an uncontrolled setting; however, good reliability does not imply good sensitivity for detecting true changes in neurologic function. Thus, clinicians need to be aware of important patient characteristics that may be associated with increased variability among repeated measurements.