Validation of a Polish version of the National Institutes of Health Stroke Scale: Do moderate psychometric properties affect its clinical utility?

Background: The National Institutes of Health Stroke Scale (NIHSS) is a validated tool for assessing the severity of stroke. It has been adapted into several languages; however, a Polish version with large-scale psychometric validation, including repeatability and separate assessments of anterior and posterior stroke, has not been developed. We aimed to adapt and validate a Polish version of the NIHSS (PL-NIHSS) while focusing on the psychometric properties and site of stroke.
Methods: The study included 225 patients with ischemic stroke (102 anterior and 123 posterior circulation stroke). Four NIHSS-certified researchers estimated stroke severity using the most appropriate scales to assess the psychometric properties (including internal consistency, homogeneity, scalability, and discriminatory power of individual items) and ultimately determine the reliability, repeatability, and validity of the PL-NIHSS.
Results: The PL-NIHSS achieved a Cronbach’s alpha coefficient of 0.6885, which indicates moderate internal consistency and homogeneity. Slightly more than half of the individual items provided sufficient discriminatory power (r > 0.3). A favorable coefficient of repeatability (0.6267; 95% confidence interval: 0.5737–0.6904), narrow limits of inter-rater agreement, and excellent intraclass correlation coefficients or weighted kappa values (> 0.90) demonstrated the high reliability of the PL-NIHSS. Highly significant correlations with other tools confirmed the validity and predictive value of the PL-NIHSS. In posterior stroke, the PL-NIHSS achieved the required Cronbach’s alpha coefficient (0.7107). Additionally, stroke location did not affect other psychometric features or instrument reliability and validity.
Conclusions: We developed a valid and reliable tool for assessing stroke severity in Polish-speaking participants. Its moderate psychometric properties did not limit its clinical applicability.


Introduction
Clinometric scales are used to objectively evaluate the severity of stroke. Undoubtedly, the National Institutes of Health Stroke Scale (NIHSS) has played the most important role in stroke assessment for several years [1]. It is widely accepted due to its simplicity, high reproducibility, and ease of performance [2] and was designed to be used not only by neurologists, but also by other thoroughly trained members of the stroke team [3]. Furthermore, apart from delivering an objective and reliable estimation of stroke severity, numerous studies have stressed the usefulness of the NIHSS in assessing the clinical prognoses, outcomes, and risks for large intracranial vessel occlusions, thus emphasizing its predictive value [4,5]. Several researchers from various countries have adapted and validated the NIHSS after demonstrating its high reproducibility and highlighting its clinical utility [6][7][8][9][10][11][12][13][14][15]. However, less attention has been focused on the psychometric properties, such as internal consistency or the discriminatory power of individual items, because these factors have only been analyzed in individual reports [16]. Notably, the NIHSS psychometric parameters that determine homogeneity, stability, and individual component discriminatory power are as important as its overall utility and clinical validity. Obtaining the appropriate values for all these components will determine the overall quality of the diagnostic tool, and it is of utmost importance that these features are independent of the language, country, region, and culture. In light of this observation, the lack of a reliable and in-depth analysis of the NIHSS is a shortcoming; therefore, a comprehensive assessment of the NIHSS is essential to better define its structural features as well as overall clinical and practical relevance.
The language barrier and lack of a standardized stroke evaluation tool in Poland have resulted in a clinical need for a reliable and valid instrument that can enable members of the stroke team to evaluate Polish-speaking patients. The aim of the current study was to develop and validate a Polish version of the NIHSS (PL-NIHSS) and to assess its psychometric properties, including internal consistency, homogeneity, and scalability in relation to its overall reliability and clinical accuracy.

Study design and participants
This prospective, observational, single-center study was conducted between December 2019 and August 2020 in the Stroke Unit of the Department of Neurology at the University Hospital No. 1, Bydgoszcz, Poland. We enrolled 225 patients with ischemic stroke, including 102 patients with anterior and 123 patients with posterior circulation stroke. All participants met the requirements of the updated definition of stroke proposed by the American Heart Association/American Stroke Association [17].
The clinical and functional parameters were assessed within 24 hours of stroke onset using the PL-NIHSS and Glasgow Coma Scale (GCS). The questionnaires were completed by four investigators, including two stroke physicians, a stroke research nurse, and a physiotherapist, all of whom were NIHSS-certified and had several years of experience in the intensive stroke unit.
Inter-rater reliability of the PL-NIHSS was estimated from evaluations by three randomly selected researchers. The interval between assessments did not exceed 2 hours. Repeatability was assessed by analyzing the total PL-NIHSS scores assigned by two randomly selected examiners. Three hours later, one of the initial three researchers, selected at random, re-assessed the patient (test-retest) using the PL-NIHSS to estimate intra-rater reliability. Subsequently, a researcher randomly selected from the whole group evaluated the patient within the first 24 hours of stroke onset using the GCS to assess the construct validity of the PL-NIHSS, and again at 3 months using the Barthel Index and modified Rankin Scale (mRS) to assess its predictive validity.
The following exclusion criteria were used: (1) significant speech impairment or disturbances of consciousness that prevented a patient from providing informed consent to participate in the study, and (2) treatment with specific stroke therapy (intravenous thrombolysis and/or endovascular treatment), which can significantly contribute to discernible fluctuations in the clinical condition. The baseline characteristics of the participants are summarized in Table 1.

PL-NIHSS
Adaptation of the English version of the NIHSS into Polish was performed in accordance with standards proposed by the International Quality of Life Assessment Project [18]. Two forward translations were used to create an intermediate version that was translated back for comparison with the original version. After analyzing for any contradictions or misinterpretations and obtaining agreement on the consistency and equivalence, the scale was reviewed by Polish-speaking neurologists who estimated how well it was comprehended and rated its overall acceptance. Each item received the required minimum of three points (out of a total of four points) in the content validity rating [19], and after considering minor corrections and suggestions, a preliminary version of the PL-NIHSS was established (S1 Table). Subsequently, the items that assessed speech disorders, inattention, or visual extinction (Fig 1) were modified and adapted to the cultural aspects that would be better recognized and understood by the Polish population. The word complexity, knowledge of phrases, and commonness of idioms were considered while maintaining the content and meaning of the original items. The researchers completed the PL-NIHSS training based on repeated clinical examinations of all the items. The same rules were also adapted for the assessment of individual components included in the original NIHSS [20].

Ethical statement
The study protocol was approved by the Bioethics Committee of the Nicolaus Copernicus University in Torun at Collegium Medicum of Ludwik Rydygier in Bydgoszcz (KB number 732/2019). All participants read and understood the study protocol and provided informed written consent to participate in the study.

Statistical evaluation methods
STATISTICA v13.1 (Dell Technologies, Round Rock, TX, USA) was used for the statistical analyses. The following tests were performed: Spearman's rank correlation (estimation of construct and predictive validity), intraclass correlation coefficient (evaluation of inter-rater and intra-rater agreement), and weighted Cohen's kappa (intra-rater agreement). Cronbach's alpha coefficient and Bland-Altman analysis were used to assess the psychometric properties of the PL-NIHSS [21,22]. A p-value < 0.05 was considered statistically significant.
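Two of the statistics listed above can be illustrated with a minimal Python sketch. It is not the STATISTICA procedure used in the study: the function names are hypothetical, and because the weighting scheme for kappa is not stated in the text, quadratic weights are assumed as the default (linear weights are available through `power=1`).

```python
from collections import Counter
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are tied
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def weighted_kappa(first, second, categories, power=2):
    """Weighted Cohen's kappa for two sets of ordinal ratings.

    power=2 gives quadratic weights (an assumption here), power=1 linear.
    """
    n = len(first)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter((index[a], index[b]) for a, b in zip(first, second))
    margin_1 = Counter(index[a] for a in first)
    margin_2 = Counter(index[b] for b in second)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight
            num += w * observed[(i, j)] / n
            den += w * (margin_1[i] / n) * (margin_2[j] / n)
    return 1.0 - num / den
```

A negative `spearman` value is expected when a severity score such as the NIHSS is compared with the GCS, because higher GCS scores indicate a better clinical state.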

Results
A Cronbach's alpha coefficient of 0.6885 was achieved in all patients with stroke with individual values of 0.6387 and 0.7107 for anterior and posterior stroke, respectively. The characteristics of individual items are summarized in Table 2. In the group that included both types of stroke (irrespective of location), only 8/15 (53.3%) items achieved a satisfactory and required discriminant level (r > 0.3) [23]. Of those, only three, including items for facial palsy, dysarthria, and extinction or inattention, achieved a high correlation with the others (r > 0.5). Limb ataxia was the least correlated with the other components. However, when limb ataxia and right arm motor function were excluded, the overall alpha coefficient increased. In the patients with anterior stroke, eight items met the minimum requirements for discriminatory power; of those, only items for visual field, best gaze, and extinction or inattention achieved high values. Notably, the motor function of the right arm and limb ataxia were distinguished from the other items by negative correlation values. Removing four items (motor function of right arm, motor function of right leg, limb ataxia, and best language) improved the overall accuracy of the PL-NIHSS. In the patients with posterior stroke, eight items achieved a satisfactory discriminant level, and half of them, including items for facial palsy, motor function of the left arm, motor function of the left leg, and dysarthria, were highly correlated with the others. Only one item (sensory) was negatively correlated with the others; however, removing four items (sensory, limb ataxia, level of consciousness commands, and visual field) increased the overall alpha coefficient. The median inter-item correlation for the entire stroke group was 0.1834, while the values were 0.1807 and 0.1737 for anterior and posterior stroke, respectively.
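The item analysis reported above (alpha, the discriminatory power of each item, and alpha after removing an item) can be sketched as follows. This is an illustrative sketch, not the study's code: the function names are hypothetical, discriminatory power is assumed to be the correlation of each item with the sum of the remaining items, and the toy scores in the usage note are invented for demonstration only.

```python
from math import sqrt

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per patient."""
    k = len(rows[0])
    item_variances = sum(variance(col) for col in zip(*rows))
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)

def item_analysis(rows):
    """Per item: (item-rest correlation, alpha if that item is deleted)."""
    results = []
    for j in range(len(rows[0])):
        item = [r[j] for r in rows]
        rest_rows = [list(r[:j]) + list(r[j + 1:]) for r in rows]
        rest_totals = [sum(r) for r in rest_rows]
        results.append((pearson(item, rest_totals), cronbach_alpha(rest_rows)))
    return results

# Invented toy data: 6 patients x 4 items; the last item runs against the rest.
toy_scores = [
    [0, 0, 0, 2],
    [1, 0, 1, 1],
    [1, 1, 1, 2],
    [2, 1, 2, 0],
    [2, 2, 2, 1],
    [3, 2, 3, 0],
]
overall_alpha = cronbach_alpha(toy_scores)
per_item = item_analysis(toy_scores)
```

In the toy data the fourth item has a negative item-rest correlation and alpha rises when it is deleted, which is the same pattern described above for items such as limb ataxia.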
The results of the inter-rater and intra-rater agreements are summarized in Table 3. Excellent weighted kappa values (κ > 0.9) and intraclass correlation coefficients (ICC > 0.9) among all the items indicated high reproducibility of the PL-NIHSS. A favorable coefficient of repeatability (CR = 0.6267; 95% confidence interval [CI] = 0.5737-0.6904) and narrow limits of agreement (lower: -0.6408, 95% CI = -0.7128 to -0.5689; upper: 0.6142, 95% CI = 0.5422-0.6862) were observed in Bland-Altman analyses (Fig 2), thus emphasizing the accuracy of the PL-NIHSS. A vast majority of paired total scores (n = 211; 93.8%) were identical and fell within the limits of agreement, whereas the maximum difference in the total score between the examiners was two points, which was observed in only three cases.
We observed a moderate but significant correlation between the PL-NIHSS score and the initial GCS score (r = -0.4460, p < 0.0001), which indicated satisfactory construct validity (Fig 3A). On the 90th day after the onset of stroke, we also observed a high correlation between the PL-NIHSS, Barthel Index (r = -0.8648, p < 0.0001), and mRS (r = 0.8310, p < 0.0001), which reflected the predictive validity of the instrument (Fig 3B and 3C). We found no significant differences in the assessment of the reliability (ICC, kappa, CR, limits of agreement) or validity (correlation coefficient) between the patient groups with anterior and posterior stroke as well as in comparison of each subgroup with the overall group.
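The Bland-Altman quantities reported above can be sketched in a few lines of Python. The sketch makes several assumptions: the function name is hypothetical, the limits of agreement are taken as the mean difference ± 1.96 standard deviations, the coefficient of repeatability is taken as 1.96 times the root mean squared difference, and the confidence intervals are not computed. The set of differences is hypothetical: it respects the counts given in the text (211 identical pairs and three two-point differences), but the eleven one-point differences and all signs are assumed for illustration.

```python
from math import sqrt

def bland_altman(rater_a, rater_b, z=1.96):
    """Agreement statistics for paired total scores from two raters."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    n = len(diffs)
    bias = sum(diffs) / n  # mean difference between raters
    sd = sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    lower, upper = bias - z * sd, bias + z * sd  # limits of agreement
    # Coefficient of repeatability: z * root mean squared difference.
    cr = z * sqrt(sum(d * d for d in diffs) / n)
    within = sum(1 for d in diffs if lower <= d <= upper) / n
    return {"bias": bias, "lower": lower, "upper": upper,
            "cr": cr, "within": within}

# Hypothetical differences (signs and one-point cases are assumptions).
diffs = [0] * 211 + [1] * 5 + [-1] * 6 + [2] * 1 + [-2] * 2
rater_b = [5] * len(diffs)  # arbitrary baseline total score
rater_a = [b + d for b, d in zip(rater_b, diffs)]
result = bland_altman(rater_a, rater_b)
```

With this assumed distribution the sketch returns a coefficient of repeatability of about 0.627 and limits of agreement of about -0.641 and 0.614, in line with the values reported above, and only the identical pairs (211 of 225) fall within the limits.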

Discussion
To our knowledge, this study describes the first adaptation and validation of a Polish version of the NIHSS (PL-NIHSS). In this novel report, we highlighted its moderate psychometric properties, assessed its repeatability using Bland-Altman statistics, and analyzed its internal consistency, reliability, and validity based on the stroke location (anterior or posterior).
An ideally constructed stroke scale should be characterized by appropriate psychometric parameters, which demonstrate the correct structure of the tool. Particularly, it should be characterized by scalability (internal consistency and homogeneity) by confirming that each component of the instrument is equally important and measures the same attribute [24]. According to Nunnally's principle, the Cronbach's alpha coefficient used for this assessment should reach a minimum of 0.7 [21]. Each item on the scale should also significantly correlate with the others (discriminatory power), and its removal should not increase the overall reliability of the scale. We observed a sufficient alpha coefficient only in posterior stroke (slightly exceeding the limit), whereas the required value was not achieved in the anterior stroke subgroup or in the overall stroke group. Only slightly more than half of the assessed items had appropriate discriminatory power in the overall stroke group as well as in the anterior and posterior stroke subgroups. Additionally, some items did not correlate with the others at all, thus contributing to a reduction in the quality of the entire tool. The median correlation coefficients were far below those expected. Our findings cast doubt on the homogeneity of the adapted version of the NIHSS and are inconsistent with the data reported by Sun et al. [16], who demonstrated a Cronbach's alpha coefficient of 0.92 and a mean inter-item correlation of 0.44. However, they analyzed only 48 patients with stroke, and the small sample size may have significantly affected their overall study reliability [25]. The moderate psychometric properties observed in our study indicated a lack of homogeneity and internal consistency and, therefore, suggested a structural disadvantage of the NIHSS. Accordingly, further research to improve the existing NIHSS version should be supported in order to develop a scalable tool in accordance with the current international guidelines.
Irrespective of the design imperfections, the significant clinical utility of the validated version of the NIHSS should be emphasized; it was particularly manifested in the high reliability and validity observed in our study. Our findings are consistent with those of other studies on this topic; however, we noted higher individual item agreement values than those reported by most other investigators. Only one report by Jurjans et al. [26] found that all the items of a Latvian validated tool achieved excellent ICC (> 0.95) in both inter-rater and intra-rater assessments. The authors of validation reports of other scales found moderate, and sometimes even poor, agreement between the selected items [9][10][11][12][13][14][15]. Notably, the sample sizes in the present and Latvian studies were larger than those in the other studies, thus emphasizing the significance of the results in this study as well as highlighting the high reproducibility of the PL-NIHSS. Simultaneously, our research supports the wide use and assessment of the NIHSS by qualified, trained, and certified members of the stroke team and not just neurologists. A clear advantage of our study over others is the assessment of repeatability based on the agreement achieved between raters regarding the total score and not just individual items. To our knowledge, this is the first study to emphasize a satisfactory coefficient of repeatability and narrow limits of agreement using Bland-Altman statistics, thus confirming the stability and reliability of the validated tool. The high construct and predictive validity of the PL-NIHSS was reflected in the significant, high, and moderate correlations with other instruments used in similar situations in other studies.
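The link between the alpha coefficient and the inter-item correlations discussed above can be made explicit with the standard formulas. The worked line is only an illustration under stated assumptions: it uses the standardized form of alpha, which may differ from the coefficient actually computed in either study, and it assumes a scale of k = 15 items together with the mean inter-item correlation of 0.44 quoted from Sun et al.

```latex
% Cronbach's alpha for k items with item variances \sigma_i^2
% and total-score variance \sigma_X^2
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right)

% Standardized alpha expressed through the mean inter-item correlation \bar{r}
\alpha_{\mathrm{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

% Illustration (assumed: k = 15, \bar{r} = 0.44)
\alpha_{\mathrm{std}} = \frac{15 \times 0.44}{1 + 14 \times 0.44}
                      = \frac{6.60}{7.16} \approx 0.92
```

Under these assumptions the two values quoted from Sun et al. are mutually consistent, and the second formula shows why low inter-item correlations pull alpha down for a fixed number of items.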
Another strength of our study is the assessment of the psychometric parameters, reliability, and validity depending on the stroke location. Many reports have demonstrated that the NIHSS is more accurate when used to assess the severity of anterior stroke whereas the clinical condition of posterior stroke is often underestimated [27]. Therefore, unlike previous studies, we attempted to validate the PL-NIHSS with both types of stroke and found that specifying the type of stroke did not negatively affect the parameters, thus, confirming the reproducibility, repeatability, and validity of the tool. This result verified the high accuracy of the validated instrument, irrespective of the area of brain vascularization. Surprisingly, better psychometric properties, such as internal consistency or homogeneity, were noted in the patients with posterior stroke. These differences between the compared groups confirmed that better scalability of the PL-NIHSS did not translate into a more accurate assessment of stroke severity or increase its validity and reliability. Furthermore, we hypothesized that the psychometric properties of the validated instrument did not affect or limit its clinical utility. Nevertheless, we believe that the optimal situation occurs when the commonly used scale is characterized by high psychometric values as well as high reliability and validity.
The current study has some limitations. The study sample size was moderate, although it was larger than those in other studies. Our study was a single-center study; therefore, verification of our postulates, particularly regarding the psychometric aspects, is required in multicenter studies, preferably with international cooperation. Due to the requirement for obtaining informed written consent, some patients with stroke were procedurally excluded, and therefore, the data did not cover the entire stroke profile (especially of patients with severe strokes).

Conclusions
We developed a valid and reliable Polish version of the NIHSS suitable for use in everyday practice by trained and certified staff of the Polish-speaking stroke unit. The moderate psychometric properties emphasized in the PL-NIHSS did not affect its clinical usefulness. However, considering the international requirements for commonly used diagnostic tools, further research should be pursued to improve the design and structural quality of the current NIHSS.
Supporting information S1