Absolute and Relative Reliability of the Timed ‘Up & Go’ Test and ‘30second Chair-Stand’ Test in Hospitalised Patients with Stroke

Objective The timed ‘Up & Go’ test and ‘30second Chair-Stand’ test are simple clinical outcome measures widely used to assess functional performance. The reliability of both tests in hospitalised stroke patients is unknown. The purpose was to investigate the relative and absolute reliability of both tests in patients admitted to an acute stroke unit. Methods Sixty-two patients (men, n = 41) attended two test sessions separated by a one-hour rest. Intraclass correlation coefficients (ICC2,1) were calculated to assess relative reliability. Absolute reliability was expressed as Standard Error of Measurement (with 95% certainty, SEM95) and Smallest Real Difference (SRD), and as a percentage of their respective means if heteroscedasticity was observed in Bland Altman plots (SEM95% and SRD%). Results ICC values for interrater reliability were 0.97 and 0.99 for the timed ‘Up & Go’ test and 0.88 and 0.94 for the ‘30second Chair-Stand’ test, respectively. ICC values for intrarater reliability were 0.95 and 0.96 for the timed ‘Up & Go’ test and 0.87 and 0.91 for the ‘30second Chair-Stand’ test, respectively. Heteroscedasticity was observed in the timed ‘Up & Go’ test. Interrater SEM95% ranged from 9.8% to 14.2% with corresponding SRD% of 13.9–20.1%. Intrarater SEM95% ranged from 15.8% to 18.7% with corresponding SRD% of 22.3–26.5%. For the ‘30second Chair-Stand’ test, interrater SEM95 ranged between 1.5 and 1.9 repetitions with corresponding SRD of 2 and 3, and intrarater SEM95 ranged between 1.8 and 2.0 repetitions with corresponding SRD values of 3. Conclusion Excellent reliability was observed for the timed ‘Up & Go’ test and the ‘30second Chair-Stand’ test in hospitalised stroke patients. The thresholds to detect a real change in performance were 18.7% for the timed ‘Up & Go’ test and 2.0 repetitions for the ‘30second Chair-Stand’ test in groups of patients, and 26.5% and 3 repetitions in individual patients, respectively.


Introduction
Stroke is one of the most disabling conditions, leading to loss of mobility and independence [1]. It is important that functional performance is evaluated with valid and reliable clinical outcome measures. Knowing both the absolute and relative reliability of an outcome measure enables clinicians and researchers to evaluate results on a scientific basis and to be 95% confident that a change in the outcome score represents an actual change in performance, and not just measurement error or simple test-retest variation.
Walking, standing and sitting on a chair are among the most affected activities for stroke patients and are considered important for independence in everyday life [2][3][4]. The timed 'Up & Go' test (TUG) and '30second Chair-Stand' test (30s-CST) are both outcome measures widely used in different groups of patients to assess functional performance such as walking, turning and the ability to perform sit-to-stand tasks [4][5][6][7][8][9][10]. TUG and 30s-CST are easy to administer compared with other performance measures and can easily be implemented in clinical practice [2,4,11,12].
TUG has been used in several studies to assess functional performance and risk of falling in stroke patients [8,9,13,14]. These are among the most important aspects of stroke rehabilitation, since critical decisions regarding discharge destination, need for further rehabilitation and assistance after discharge depend on assessment of functional performance and risk of falling. However, the reliability of TUG in patients with acute stroke (≤3 months) has never been reported. In chronic stroke patients (≥6 months) high relative test-retest reliability with Intraclass Correlation Coefficients (ICC) ≥0.9 has been observed for TUG [3,11,13,15]. Three of the studies also assessed absolute reliability for TUG and found that the standard error of measurement (SEM) was 1.14 seconds [11], 1.34 seconds [3] and 2.83 seconds [15], respectively. Different outcome measures are used to assess sit-to-stand tasks [2,12,16,17]. For severely disabled patients it may be too difficult to perform 5 or 10 sit-to-stand repetitions [2,12]. Instead of measuring the time it takes to complete a number of repetitions, the 30s-CST measures the number of chair-stands a person can perform in 30 seconds, making it possible to assess a wider range of functional performance levels, with possible scores ranging between 0 and ≥20 repetitions depending on the functional capacity of the subjects [12]. The reliability of 30s-CST in patients with acute (≤3 months) or chronic stroke (≥6 months) has not been reported. In patients with dementia [6], patients with osteoarthritis (OA) [18][19][20] and older adults [21] the test-retest reliability ranged between 0.81 and 0.98 for 30s-CST. The absolute reliability for a group of patients was 1.26 in patients with dementia [6] and ranged from 0.7 to 1.27 in patients with OA [19,20]; for individual patients it was 3.49 in patients with dementia [6] and ranged from 1.64 to 2.6 in patients with OA [19,20].
The purpose was therefore to investigate the relative and absolute reliability of the timed 'Up & Go' test (TUG) and '30second Chair-Stand' test (30s-CST) in stroke patients, admitted to an acute stroke unit.

Materials and Methods
The study was designed as an intra- and interrater intraday reliability study. The reporting of the study follows the Guidelines for Reporting Reliability and Agreement Studies [22].

Procedures
Test sessions. The patients attended two test sessions separated by a one hour rest on the same day during hospitalisation. Within each test session both TUG and 30s-CST were performed. The interval between test and re-test was chosen to avoid a change in the functional level of the patients while minimising potential fatigue.
In each test session the patients performed four trials of TUG including a test trial and one trial of 30s-CST with each rater. The patients had a five to ten minute rest before being tested by the second rater. The test leader was present in all the test sessions whereas the raters were not present at each other's test sessions. The same standardised verbal instructions were given by the raters and the patients were asked if they understood the instruction. If not, the instructions were given again. If the patients did not understand the instructions after three repetitions, the patients were excluded from the study. No talking or encouragement was allowed during testing. To prevent muscular fatigue all patients were brought to the test area in a wheelchair regardless of functional capacity. During the one hour rest between the two test sessions, the patients were instructed to rest either in a chair, wheelchair or in bed and were offered something to drink and eat.
The test sessions took place either in the morning after breakfast or in the afternoon after lunch.
Raters. Three experienced physiotherapists participated; two as raters and one as test leader. The raters had two years and three years of experience in the stroke unit, respectively, whereas the test leader had more than 20 years of experience with stroke patients. The test leader collected demographic and clinical data regarding functional capacity prior to the first test session, and recorded the results of each trial of TUG and 30s-CST during the test sessions on a standardised test sheet.
The raters were blinded to each other's results for the 30s-CST and to all results of TUG. The raters started and stopped the stopwatch in TUG and the test leader read and recorded the time from the stopwatch, while the raters were kept blinded. To minimise bias the raters changed their measurement sequence between the first and the second test session. The patients performed the tests in the same order in all the test sessions, starting with TUG each time.
Both raters received extensive training in the protocol, and a pilot study was conducted prior to the study including 20 hospitalised patients with an acute stroke to calibrate the raters.
The New Mobility Score (NMS). NMS [23] was used to assess pre-stroke functional capacity as an indication of the patients' independence in everyday life before they had a stroke. This has an impact on the expected functional capacity after rehabilitation, including how the patients can be expected to perform in the physical performance measures.
NMS measures the ability to walk indoors, outdoors and during shopping. A score between 0 and 3 (0: not at all, 1: with help from another person, 2: with an aid, 3: no difficulty) is given for each function, resulting in a total score ranging from 0 (no walking ability) to 9 (fully independent) [23].
The test leader obtained the score by interviewing all the patients.
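The NMS arithmetic described above can be sketched as follows. This is only an illustration of the scoring rule; the function name and the answer strings are hypothetical and are not part of the published instrument or of the study's procedures.

```python
# Hypothetical helper illustrating the New Mobility Score (NMS) arithmetic.
# The answer labels follow the four levels listed in the text.
NMS_LEVELS = {
    "not at all": 0,
    "with help from another person": 1,
    "with an aid": 2,
    "no difficulty": 3,
}


def new_mobility_score(indoors, outdoors, shopping):
    """Sum the three walking items (each scored 0-3) into a total of 0-9."""
    return sum(NMS_LEVELS[answer] for answer in (indoors, outdoors, shopping))


# Illustrative answers only, not study data.
total = new_mobility_score(
    "no difficulty", "with an aid", "with help from another person"
)
print(total)  # 3 + 2 + 1 = 6
```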

Physical performance measures
The timed 'Up & Go' test. TUG measures the time in seconds it takes an individual to rise from a standard arm chair, walk 3 meters to a line drawn on the floor (at least one foot must touch the line), turn and walk back to the chair to a seated position. No physical assistance is allowed and the use of a walking aid is recorded. The patients wear regular footwear [5].
To achieve a stable performance in TUG, each patient was given a practice trial followed by three timed trials, as recommended by Kristensen et al. in patients with hip fracture [24]. Only the best trial was included in the analysis. The following instructions were given to all patients: "On the command "ready, get set, go" please rise from the chair and walk as fast and safely as possible to the line drawn on the floor, turn around and walk back to the chair and sit down. You have to touch the line with at least one foot and you decide which side to turn to. You may use the armrest for arm support to stand up or sit down, if you like". The rater started the stopwatch on "go" and stopped it as soon as the patient was seated again, i.e. when the buttocks reached the seat [24]. The chair had a seat height of 46 cm, an armrest height of 68 cm, a seat depth of 45 cm and a backrest height of 83 cm.
'30second Chair-Stand' test. 30s-CST measures the number of sit-to-stand repetitions, with arms crossed over the chest, an individual can do in 30 seconds from a chair with a seat height of 43 cm. The patients were instructed to sit on the chair without touching the backrest, with feet approximately shoulder-width apart and with the knees flexed to 90 degrees [12].
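As a minimal sketch of the scoring rule above (one practice trial, three timed trials, fastest timed trial retained), the selection could look like this. The function name and the example times are assumptions made for illustration; they are not study data.

```python
def best_tug_time(trial_times_s):
    """Return the fastest timed TUG trial in seconds.

    The first entry is treated as the practice trial and ignored;
    the remaining three entries are the timed trials.
    """
    practice, *timed = trial_times_s
    if len(timed) != 3:
        raise ValueError("expected one practice trial followed by three timed trials")
    return min(timed)


# Made-up times in seconds: practice trial first, then three timed trials.
print(best_tug_time([21.4, 19.8, 18.9, 19.2]))  # 18.9
```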
The following instructions were given to all patients: "On the command "ready, get set, go" you have to complete as many sit-to-stands as possible within 30 seconds, and to stand with extended knees, and be fully seated between each stand." All patients were tested according to the original manual [12]. If a patient was not able to rise from the chair a modified version of the test was used, allowing the patient to use the armrest to rise to a standing position from a chair with a seat height of 46 cm [25]. The modified version was developed by the researchers behind the original 30s-CST and is described in the Senior Fitness Test [25]. When modifying 30s-CST the chair from the TUG was used.
Each patient was given a practice trial with one repetition of sit-to-stand followed by one timed trial. To prevent the chair from moving backwards during testing, it was placed with the backrest against a wall during the 30s-CST.
Both tests were performed according to standardised guidelines [5,12,24,25] and were performed in a long corridor next to the entrance of the Neurological Department at Herlev Hospital.

Ethics statement
All patients were informed and gave written consent in accordance with the Declaration of Helsinki prior to inclusion in the study. The Research Ethics Committees in the Capital Region of Denmark (j.nr. H-3-2013-FSP10) reviewed the protocol and approved the protocol but found that a formal approval was not required. Danish Data Protection Agency approved the study (j.nr. 2013-41-1601).

Statistical analysis
It is recommended to have a sample size of at least 50 participants to ensure adequate precision for the estimates of measurement error [11,26,27]. The sample size in this study was estimated from this recommendation, and with an expected drop-out of 20%, we planned to include no less than 60 patients during the time of recruitment.
Descriptive statistics and tests for normality (Shapiro-Wilk) were performed for all variables. Results are expressed as mean ± SD, and as median and range if the data were not normally distributed. This was, however, only the case for the time from admission to testing and for the NMS data.
Paired t-tests were used to assess whether significant systematic changes were present between test trials and between raters (significance level 0.05).
Intraclass Correlation Coefficients (ICC2,1) with corresponding 95% confidence intervals were used to assess relative reliability. The minimum acceptable ICC2,1 was set at >0.8 [28].
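For readers unfamiliar with ICC2,1, the sketch below computes the point estimate from the mean squares of a two-way ANOVA, using the standard two-way random effects, absolute agreement, single measurement formula. It is not the software used in the study, it does not compute the confidence intervals, and the function name is an assumption. The example matrix is the widely reproduced six-subject, four-rater textbook dataset from Shrout and Fleiss, not study data.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores: list of rows, one row per patient, one column per rater or session.
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of raters or sessions
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-patients mean square
    jms = ss_cols / (k - 1)                 # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual (error) mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Textbook example (six subjects rated by four raters), not study data.
textbook = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(textbook), 2))  # 0.29
```

In the present design each comparison would involve two columns, either two raters (interrater) or two sessions of the same rater (intrarater).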
Absolute reliability was calculated to establish the variability of repeated measurements in the actual units of measurement. The standard error of measurement (SEM) was calculated from the error component of a two-way random-effects ANOVA. SEM95 (SEM × 1.96) and the corresponding smallest real difference (SRD = SEM × 1.96 × √2) were calculated to express the variation with 95% certainty for groups of subjects and for individual subjects, respectively.
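The calculation can be sketched as follows, assuming that the SEM is taken as the square root of the residual (error) mean square of the two-way ANOVA. That reading of "error components" is an assumption, as are the function name and the made-up example scores.

```python
import math


def absolute_reliability(scores):
    """Return (SEM, SEM95, SRD) in the units of the measurement.

    scores: list of rows, one row per patient, one column per rater or session.
    Assumption: SEM = sqrt(residual mean square of the two-way ANOVA).
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    sem = math.sqrt(ems)
    sem95 = 1.96 * sem                 # threshold for a group of patients
    srd = 1.96 * math.sqrt(2) * sem    # threshold for an individual patient
    return sem, sem95, srd


# Made-up scores for four patients in two sessions (not study data).
example = [[10, 12], [20, 18], [30, 32], [40, 38]]
sem, sem95, srd = absolute_reliability(example)
print(f"SEM={sem:.2f}, SEM95={sem95:.2f}, SRD={srd:.2f}")
```

The factor √2 in the SRD reflects that a change score is the difference between two measurements, each carrying its own measurement error.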
Bland Altman plots were used to visualise potential systematic variations around the zero line as well as heteroscedasticity. If heteroscedasticity was present, SEM% and SRD% were calculated, as these are independent of the units of measurement: SEM% = (SEM/mean) × 100, SEM95% = (SEM95/mean) × 100 and SRD% = (SRD/mean) × 100, where mean is the mean of all the TUG or 30s-CST scores, respectively, from both raters.
Bland Altman plots were also used to identify outliers. An outlier was considered to be present when the difference between the two test sessions was outside 2 standard deviations [11].
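The quantities behind a Bland Altman plot and the 2 SD outlier rule can be sketched as below. The text does not state whether the 2 SD band was centred on zero or on the mean difference; this sketch assumes the mean difference. The function name and the example times are made up for illustration.

```python
import statistics


def bland_altman(session1, session2, n_sd=2.0):
    """Differences, pairwise means, bias, limits and outliers for paired scores.

    Assumption: outliers are differences outside bias ± n_sd * SD of the differences.
    """
    diffs = [b - a for a, b in zip(session1, session2)]
    means = [(a + b) / 2 for a, b in zip(session1, session2)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    lower, upper = bias - n_sd * sd, bias + n_sd * sd
    outliers = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return {
        "means": means,      # x-axis of the plot
        "diffs": diffs,      # y-axis of the plot
        "bias": bias,
        "limits": (lower, upper),
        "outliers": outliers,
    }


# Made-up TUG times in seconds for ten patients (not study data);
# the last patient improves markedly in the second session.
s1 = [12.0, 15.5, 20.1, 9.8, 30.2, 18.4, 25.0, 14.2, 11.1, 40.0]
s2 = [11.6, 15.9, 19.5, 10.1, 29.0, 18.0, 24.1, 14.6, 10.8, 28.0]
result = bland_altman(s1, s2)
print(result["outliers"])  # [9]
```

Plotting `diffs` against `means` gives the usual Bland Altman view; a spread of differences that widens with increasing means would suggest heteroscedasticity.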
All data were analysed using

Results
Seventy patients were included during the time of recruitment. Seven were excluded because they felt unwell, withdrew consent or were unable to participate in the re-test session. The remaining 63 patients were included in the data analysis (Fig 1). One patient was excluded from the analysis because the patient was considered an extreme outlier, with an improvement of more than 2 SD (18.8 sec) in TUG during the second test session. Moreover, due to an extended length of stay in the department, the patient was tested 65 days post stroke, which was substantially later than the remaining patients (range: 2-38 days). In the final analysis, 62 patients were included in the study, of whom 61 completed TUG and 62 completed 30s-CST.
Sixty-two patients aged 71.6 ± 13.6 years (mean ± SD), range 40-91 years, were included in the final analysis, of whom 66% were male (n = 41). The time from admission until participation in the test sessions ranged between 2 and 38 days, with a median of 5 days post stroke. Pre-stroke NMS ranged between 3 and 9 with a median score of 9. The majority of the patients were thus independent in everyday life prior to admission, even though more dependent patients were also included. Clinical characteristics are presented in Table 1.
Seventy-one percent of the patients performed 30s-CST in accordance with the original test, and 29% performed the test with the standardised modification.
The means and standard deviations for TUG and 30s-CST from the two test sessions are presented in Table 2.
A small but significant learning effect was seen in TUG and 30s-CST (Table 3) when the test results were analysed for the four trials in the order they were actually performed by the patients regardless of the raters, who as previously described changed their testing sequence between test session 1 and test session 2.

Reliability analysis
The results of intrarater and interrater reliability are presented in Table 4, showing high agreement with ICC2,1 values ranging from 0.95 to 0.99 for TUG and from 0.87 to 0.94 for 30s-CST. The smallest measurable differences for a group of patients (SEM95) and for individual patients (SRD) for TUG and 30s-CST are also shown in Table 4. With 95% certainty, changes of more than 3.0 seconds for a group of patients and 4.2 seconds for individual patients indicated a real improvement for TUG. More than 2.0 repetitions for a group of patients and 3 repetitions for an individual patient indicated a real improvement for 30s-CST with 95% certainty.
Heteroscedasticity was observed in the TUG data for both intrarater and interrater comparisons, with indications of larger variability for higher test values, as illustrated in the Bland Altman plot (Fig 2). Consequently, the SEM%, SEM95% and SRD% were calculated. Intrarater SEM% ranged from 8.1 to 9.6 and SEM95% ranged from 15.8 to 18.7, with corresponding SRD% of 22.3-26.5. Interrater SEM% ranged from 5.0 to 7.3 and SEM95% ranged from 9.8 to 14.2, with corresponding SRD% of 13.9-20.1.
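Because SRD = SEM95 × √2, the same factor links SRD% to SEM95%. The short check below applies it to the percentages reported above; the dictionary labels are ours, and any small differences between derived and reported values presumably reflect rounding of the underlying SEM.

```python
import math

# (reported SEM95%, reported SRD%) pairs for TUG, taken from the text above.
reported = {
    "intrarater, lower end of range": (15.8, 22.3),
    "intrarater, upper end of range": (18.7, 26.5),
    "interrater, lower end of range": (9.8, 13.9),
    "interrater, upper end of range": (14.2, 20.1),
}


def srd_percent_from_sem95_percent(sem95_pct):
    """SRD% implied by a given SEM95%, using SRD = SEM95 * sqrt(2)."""
    return sem95_pct * math.sqrt(2)


for label, (sem95_pct, srd_pct) in reported.items():
    derived = srd_percent_from_sem95_percent(sem95_pct)
    print(f"{label}: derived SRD% {derived:.1f}, reported SRD% {srd_pct}")
```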

Discussion
To the best of our knowledge, this is the first study to investigate the relative and absolute reliability of TUG and 30s-CST in patients admitted to an acute stroke unit. The main findings in the present study were 1) excellent relative reliability for both TUG and 30s-CST. 2) The measurement error for TUG and 30s-CST was relatively low, indicating that both outcome measures can be used to detect a real change in functional performance.
To evaluate physical performance after a stroke and changes following an intervention we need reliable outcome measures. It is recommended, that the assessment should include the analysis of agreement between measurements, systematic changes in the mean and measurement error [11,26]. The statistics applied in the present study are the most commonly used, and the results can be applied in daily clinical practice using the SRD-score for both outcome measures.
The most commonly used method to evaluate reliability is the Intraclass Correlation Coefficient (ICC). There is no clear definition of an acceptable ICC value and often values above 0.70 are used as minimum standards for reporting good reliability [22,29]. Interpretation of the ICC value should also include considerations of the clinical relevance of the results [22]. The ICC values in the present study are considered high and were well above our a priori defined minimum acceptable level of 0.8. Moreover, with one exception the lower limits of the 95% confidence intervals for both TUG and 30s-CST were also above 0.8. Thus the relative reliability observed in the present study seems not only good but also clinically relevant.

Relative reliability
Few studies have examined the relative reliability of TUG in chronic stroke patients [3,11,13,15] and our results in patients with acute stroke correspond well with these findings. Three studies found ICC values between 0.95 and 0.97 [11,13,15], whereas one study [3] found ICC values ranging between 0.75 and 0.84 for intrarater reliability and between 0.91 and 0.96 for interrater reliability. To our knowledge no studies have investigated the relative reliability of 30s-CST in stroke patients. The high relative reliability in our study was in accordance with that found in patients with mild to moderate dementia (ICC2,1 0.84) [6], in elderly people with cognitive impairments (ICC3,1 0.94) [21], as well as in patients with OA (ICC2,1 0.81) [19] and (ICC1,1 0.95, 0.97, 0.98) [20].
No studies were found investigating the absolute reliability of 30s-CST in stroke patients. Hesseberg et al. [21] found SEM values of 0.86 repetitions with corresponding minimal detectable change (MDC95) values of 2.4 repetitions in patients with cognitive impairments, whereas Blakewood et al. [6] found SEM values of 1.26 repetitions and MDC95 values of 3.49 in patients with dementia. These results correspond well with our findings for 30s-CST, and similar results are also seen in patients with hip and knee OA, with SEM values of 0.7 repetitions and corresponding MDC90 values of 1.64 repetitions [20], and SEM values of 1.27 repetitions with corresponding MCII values of 2.6 repetitions [19].

Strengths and limitations
There is a large variation among patients admitted to an acute stroke unit. The varying severity of their hemiparesis and cognitive status makes the group of patients heterogeneous. Out of the 62 patients included in the analysis, 69.4% used an assistive walking device in the test sessions, and the results (Table 2) indicated a large variation in the physical function of the participants. The outlier excluded in this study is an example of the variation in physical and cognitive performance among stroke patients seen in clinical practice. The reason for the extreme improvement in the TUG score during the last test session is unknown.
Since the functional capacity of the patients post stroke was not assessed using standardized measures, the loss of motor function, balance and walking ability as a result of hemiparesis is not transparent. Even though a definition of "clinically observed problems with balance" is given, a more precise description of the functional capacity of the participants in this study would have made comparison to other patient populations more straight forward for clinicians as well as researchers.
To complete TUG the patients must be able to walk a short distance independently with or without an assistive device. 30s-CST requires only the ability to stand up and maintain standing balance. Since most of the patients completed both TUG and 30s-CST the results regarding the reliability of 30s-CST should be interpreted with some caution, when applied to stroke patients without walking ability.
In studies of test-retest reliability it has been recommended, that the sample size should be at least 30, preferably 50 [11,26,27]. The larger the sample size, the more reliable the estimates of measurement error. The relatively large sample size of the present study increased the generalisability of the results to stroke patients on other acute stroke units.
In the present study the following was done to minimise bias potentially affecting the test protocol. The test manuals were followed carefully, including standardised verbal instructions and extensive calibration and familiarisation of the two raters. All patients had the same time interval between tests and were tested in the same environment, which was the same location as the physiotherapists normally use for testing.
There is a lack of consensus on how many trials stroke patients have to perform to ensure stable TUG scores. The number of TUG trials performed by stroke patients varies from one to three [3,11,13,15], and in the original article participants performed one trial after a familiarisation trial [5]. Faria et al. [3] investigated 16 subjects 1-12.9 years post stroke, and found that the measurement error was very similar between scores of the first trial, the means of two and three trials, and the best and worst values for the three trials.
We chose to use the best of three trials, which has been recommended to ensure stable TUG scores in other hospitalised patients [24]. A small but significant learning effect was, however, still found in the present study: the best of the first three trials was on average approximately one second slower than the best of the second, third and fourth sets of three trials, respectively. A small learning effect in TUG was also seen in chronic patients using the mean of two trials [11]. To minimise the learning effect it could be suggested to perform three trials of TUG followed by a 10-minute pause and then three additional trials of TUG. However, this would be much more time consuming, making TUG less suitable in clinical practice.
A small but significant learning effect was also seen for 30s-CST between the first trial and the second, third and fourth trial, respectively (Table 3). Since the patients in this study exhibited a stable test score in the second to fourth trials, using the best of two trials of 30s-CST could be considered as an alternative to the familiarisation procedure with just one repetition of sit-to-stand. On the other hand, there is a risk of fatigue depending on the physical condition of the patients.
It is therefore important to consider how to analyse the scores of TUG as well as 30s-CST in future research and in clinical practice, to ensure that the scores reflect a true estimate of the physical performance.

Implications
TUG and 30s-CST reflect aspects of important and common everyday activities, which make the outcome measures meaningful for the patients. Both outcome measures are suitable to implement in clinical practice, since they are not time consuming, are easy to administer and require no special equipment.
Unlike TUG, only few studies have investigated the reliability of 30s-CST for various groups of patients, although it is used to detect improvement of functional performance in many studies. One of the major advantages of using the 30s-CST instead of the 5 or 10 times sit-to-stand test is the possibility of recording a test result even when the patients are not able to perform a single sit-to-stand repetition, making it possible to test all hospitalised stroke patients on admission and discharge. By applying the results from this study to clinical practice, the physiotherapist can be 95% confident that an improvement in the test score equal to or more than 3 repetitions indicates a real improvement in functional performance. Moreover, 3 repetitions might also reflect a clinically important difference for the patients. An improvement of 2.6 repetitions was found to be clinically relevant in patients with OA, who were comparable to the present stroke patients regarding 30s-CST performance (approximately 8 repetitions at baseline) [19]. Training sit-to-stand tasks is an essential part of the rehabilitation of stroke patients, and implementation of a sit-to-stand test to evaluate functional performance therefore seems very relevant in clinical practice.
TUG covers a wide range of functional performances since the test requires the strength to stand up from a chair and maintain balance while walking and turning. TUG is found to be a responsive test to detect improvement in mobility during the first three months after stroke, with a five-second improvement in the test score (from 17 to 12 seconds) [31], indicating that TUG can be used in clinical practice to detect a real change in mobility within the first three months post stroke.
A disadvantage with TUG is the ceiling effect seen in the group of patients with relatively good walking ability [4]. In the study originally describing TUG, no assistance besides the participant's walking device was permitted [5]. In clinical practice TUG is commonly performed with either minor physical or verbal support from the physiotherapist, enabling the physiotherapist to assess functional performance also in the group of patients with lower functional ability. When allowing physical or verbal support, it is important to note that this is a deviation from the original test, and therefore the results of the present study, which adhered strictly to the original protocol, may not apply.
Even though both outcome measures are easy to perform, two of the 63 participants included in the study still had difficulty understanding the verbal instructions. This could indicate that hospitalised patients with stroke also need a visual instruction. When implementing the outcome measures, it is important to calibrate the raters to ensure that they give the same verbal instructions with regard to the actual wording, the tone and the gesticulation.

Conclusion
The timed 'Up & Go' test and '30second Chair-Stand' Test showed excellent reliability in hospitalised patients with a first event of stroke. Based on the present results we recommend that the threshold used to detect a real change for a group of patients is 18.7% for TUG, and 2.0 repetitions for 30s-CST in research settings. For individual patients in the clinical setting a change of 26.5% for TUG and 3 repetitions for 30s-CST are recommended to be interpreted as a real change in performance.
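As a hedged illustration of how the individual-patient thresholds might be applied, the sketch below checks a change in score against 26.5% for TUG and 3 repetitions for 30s-CST. Expressing the TUG change as a percentage of the baseline time is an assumption of this sketch (the SRD% in the study is expressed relative to the mean of all scores), as are the function names and the example values.

```python
TUG_SRD_PERCENT = 26.5   # individual-patient threshold for TUG
CST_SRD_REPS = 3         # individual-patient threshold for 30s-CST


def tug_real_change(baseline_s, followup_s):
    """True if the TUG time changed by at least 26.5% of the baseline time.

    Assumption: the percentage is taken relative to the baseline score.
    """
    change_pct = abs(followup_s - baseline_s) / baseline_s * 100
    return change_pct >= TUG_SRD_PERCENT


def cst_real_change(baseline_reps, followup_reps):
    """True if the 30s-CST score changed by at least 3 repetitions."""
    return abs(followup_reps - baseline_reps) >= CST_SRD_REPS


# Made-up examples, not study data.
print(tug_real_change(20.0, 14.0))  # 30% faster -> True
print(tug_real_change(20.0, 16.0))  # 20% faster -> False
print(cst_real_change(6, 9))        # +3 repetitions -> True
print(cst_real_change(6, 8))        # +2 repetitions -> False
```

For group-level comparisons in research settings, the same pattern would apply with the SEM95 thresholds of 18.7% and 2.0 repetitions.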