Test-retest reliability and minimal detectable change scores for the short physical performance battery, one-legged standing test and timed up and go test in patients undergoing hemodialysis

Functional tests are commonly used for chronic kidney disease (CKD) patients undergoing hemodialysis (HD). However, the relative and absolute reliability of such physical performance-outcome assessments must first be determined in specific patient cohorts. The aims of this study were to assess the relative and absolute reliability of the Short Physical Performance Battery (SPPB), One-Legged Standing Test (OLST), and Timed Up and Go (TUG) test, as well as the minimal detectable change (MDC) scores for these tests, in CKD patients receiving HD. Seventy-one end-stage CKD patients receiving HD therapy, aged between 21 and 90 years, participated in the study. The patients completed two testing sessions one to two weeks apart, both conducted by the same examiner, comprising the following tests: the SPPB (n = 65), OLST (n = 62), and TUG test (n = 66). High intraclass correlation coefficients (≥0.90) were found for all the tests, suggesting that their relative reliability is excellent. The MDC scores at the 90% confidence level were as follows: 1.7 points for the SPPB, 11.3 seconds for the OLST, and 2.9 seconds for the TUG test. The reliability of the SPPB, OLST, and TUG test in this sample was considered acceptable. The MDC data generated by these tests can be used to monitor meaningful changes in the functional capacity of the daily living-related activity of CKD patients on HD.

Introduction
Renal failure is a common problem: more than two million people worldwide are treated by dialysis because of chronic kidney disease (CKD) [1]. According to the EPIRCE (Epidemiology in Chronic Renal Failure in Spain) study, 10% of the Spanish adult population suffers from some form of renal failure, with 6.8% presenting stage 3-5 CKD; in 2010, this meant that approximately 4 million people in Spain suffered from CKD requiring renal replacement treatment [2]. Hemodialysis (HD) is the most common renal replacement treatment, but other possibilities include peritoneal dialysis or kidney transplantation. The latter is especially desirable as a definitive treatment, given that patients on long-term HD have high levels of comorbidity (mainly cardiovascular problems) and physical function problems [3].
The benefits of exercise for CKD are well described in the literature, and so, since the early 80s, these patients have been prescribed exercise programs as part of their treatment. Physical function tests are commonly used to assess the effectiveness of exercise and other interventions, and these should be chosen based on their specific reliability in the CKD patient population. A previous study investigated the relative and absolute reliability and the minimal detectable change (MDC) of several physical functional tests, including the sit to stand 10 and 60, one heel rise test, handgrip test, and 6-minute walking test [4], but there are no studies regarding the reliability of other commonly used tests such as the Short Physical Performance Battery (SPPB), One-Legged Standing Test (OLST), or Timed Up and Go (TUG) test. Various authors have reported the functional properties of these tests for several sample groups, especially in elderly populations, but these tests remain insufficiently studied in CKD groups [5][6][7][8][9][10][11][12][13][14][15][16][17][18].
The SPPB is a simple test that measures lower extremity function using tasks that mimic daily activities; it is particularly useful for predicting outcomes such as falls, institutionalization, and death in elderly populations [5]. Although this test has been applied to CKD patients [6,7], neither its relative and absolute reliability nor its MDC has previously been calculated.
The OLST, also known as the one-leg stand [8,9], one-legged stance [10], single leg stance time [11,12], or unipedal balance test [13], measures the time, in seconds, that a person can stand on one leg, and is also a good predictor of falls [14]. To the best of our knowledge, no previous studies have used this test in CKD populations.
Finally, the TUG test is a simple and valid method for assessing patients' levels of functional mobility [15]; it measures the time taken for an individual to stand up from a chair, walk three meters, turn, walk back, and sit back down. The TUG test has been used in different chronic diseases such as Alzheimer disease, chronic heart failure, and chronic obstructive pulmonary disease [16][17][18]. It has also been used in CKD patients undergoing HD [19][20][21][22], but neither its relative and absolute reliability nor its MDC has been previously calculated.

Setting and participants
The participants were recruited from two HD units in Valencia and one unit in Barcelona (Spain) between 2013 and 2015. The protocol and the procedures to be used were explained to all the participants, who gave written informed consent prior to participation. This study was approved by the Ethics Committee at the Hospital Universitario Doctor Peset and is registered at ClinicalTrials.gov (reference number NCT02830490). The attending nephrologist reviewed and authorized each patient's potential inclusion before the patient was approached to solicit their interest. Patients were included in the study if they had been receiving maintenance HD for at least 3 months and did not have any acute or chronic medical conditions that would preclude the collection of the test data; they were excluded if they had recently had a myocardial infarction (within 6 weeks), unstable angina, malignant arrhythmias, or any disorder that was exacerbated by activity. The following demographic and clinical data were collected from the patients' medical histories: age, sex, body mass index, time on HD, creatinine, albumin, and hemoglobin levels, cause of kidney disease, and the Charlson Comorbidity Index score.

Procedure
Participants performed the SPPB, OLST, and TUG tests twice, with an interval of one to two weeks between the testing sessions (test-retest evaluation research format), always immediately before the first HD session of the week. Every effort was made to maintain consistency between the testing sessions, including control of factors such as the day of the week, time of day, testing area, and the person conducting the assessment, although not all the subjects could be assessed in both sessions. At the two HD units in Valencia, two different physical therapists (researchers 1 and 2) with 11 and 8 years' experience in physical function evaluation, respectively, performed and assessed the tests; a renal nurse with 5 years' experience in evaluating physical function assessed the participants at the third HD unit in Barcelona.
Short physical performance battery. The SPPB objectively measures lower extremity function, including performance-based balance, endurance, and strength. Each component is scored from 0 to 4 and summed to yield scores between 0 (poor) and 12 (best) performance [5] (Table 1).
To test standing balance, the participants were asked to maintain their feet in the side-by-side, semi-tandem (heel of one foot beside the big toe of the other foot), and tandem (heel of one foot directly in front of the other foot) positions for 10 seconds each. To test endurance, we asked the subjects to walk four meters at their normal pace. Participants were allowed to use their usual walking aid, although they were encouraged not to use it, and were scored according to the quartiles for the length of time required. Lower limb strength was tested by asking the subjects to fold their arms across their chests while standing up and sitting down five times (STS-5) as quickly as they could. The chair used for the test had no armrests and was backed up against a wall to minimize the risk of falling. A stopwatch recorded the time taken until the peak of the fifth rise [23,24].
One-legged standing test. The OLST is a good predictor of falls [14]; in elderly cohorts when the maximum standing time is 30 s with open eyes, the ICC ranges from 0.60 [8] to 0.86 [11], and the MDC 95 is 24.1 s [11]; for individuals with a hip fracture in the affected leg the ICC is 0.75 and the MDC 95 is 10.7 s, while in the non-affected leg the ICC is 0.83 and the MDC 95 is 5.5 s [12]; in patients with lower limb amputation the ICC is 0.87 with open eyes and using a maximum time of 60 s, and the MDC 95 is 2.74 s [9].
To perform the OLST, patients had to maintain a one-legged stance for as long as they could with their eyes open; they were allowed to move their arms freely. All subjects wore shoes and were allowed to choose their preferred leg; if they experienced pain or other symptoms in the first leg they were permitted to use the other leg. The participants were given three trials to try to achieve 45 seconds, and they were verbally encouraged to maintain the one-legged standing position for as long as possible during each trial; the longest balance time from the three recorded trials was used for the data analysis. The test concluded if the participant used their arms to touch the wall, if the raised foot touched the ground, if the subject moved the standing foot, or when 45 seconds had been achieved [13].
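The scoring rule described above (longest of three trials, with a 45-second ceiling) can be sketched in a few lines. This is an illustrative sketch only; the function name `olst_score` and its input checks are assumptions, not part of the study protocol.

```python
from typing import Sequence

MAX_STANCE_S = 45.0  # ceiling used in this protocol


def olst_score(trial_times_s: Sequence[float],
               ceiling_s: float = MAX_STANCE_S) -> float:
    """Return the longest stance time among up to three trials, capped at the ceiling."""
    if not 1 <= len(trial_times_s) <= 3:
        raise ValueError("expected between one and three trials")
    if any(t < 0 for t in trial_times_s):
        raise ValueError("stance times must be non-negative")
    # A trial stops once the ceiling is reached, so longer readings are clipped.
    return min(max(trial_times_s), ceiling_s)


print(olst_score([12.4, 20.1, 17.8]))  # 20.1
```

Capping at the ceiling mirrors the stop rule: a participant who reaches 45 seconds is not timed any further.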
Timed up and go test. The TUG test has shown excellent test-retest reliability in older adults (ICC > 0.98) [15,25], chronic heart failure patients (ICC = 0.93) [16], and those with Parkinson disease (ICC = 0.80) [26] or Alzheimer disease (ICC = 0.985-0.988; MDC 90 = 4.09 s) [27]. In this study, subjects were given verbal instructions to stand up from a standard armchair (using the arms if necessary), walk three meters as quickly and safely as possible, turn at a cone set out by the researchers, walk back, and sit down in the chair. The participants were allowed to wear their regular footwear and to use a walking aid if needed. A stopwatch was started on the word "go" and stopped when the subject was fully seated with their back against the backrest. The time to complete the test was recorded in three consecutive trials, using the first one to familiarize the subjects with the test. The best time from the three trials was analyzed [25,28,29].
Human activity profile. To evaluate the physical activity level, the participants were asked to complete the Human Activity Profile (HAP), which has been validated in the population with renal disease [30]. The HAP questionnaire consists of a list of 94 items describing activities ranked in ascending order of energy requirement. The participants had three possible answers for each item: (1) still doing this activity, (2) have stopped doing this activity, or (3) never did this activity. The HAP yields the maximal activity score (MAS) (the highest level of activity) and the adjusted activity score (AAS). The MAS is calculated as the activity with the highest oxygen consumption requirement that the subject still performs, while the AAS = MAS
Table 1. Short physical performance battery scoring.

| Test | Description | Scoring (points) |
| --- | --- | --- |
| Balance test: side by side | The subject is asked to stand with both feet side by side and the time they can maintain the posture is measured. | 0: unable or 0-9 s; 1: 10 s |
| Balance test: semi-tandem | The subject is asked to stand with one foot slightly in front of the other and the time they can maintain the posture is measured. | 0: unable or 0-9 s; 1: 10 s |
| Balance test: tandem | The subject is asked to stand with one foot in front of the other and the time they can maintain the posture is measured. | 0: unable or 0-2 s; 1: 3-9 s; 2: 10 s |
| 4-m gait speed | The time taken for the subject to walk 4 m at their normal pace is measured twice; the best score from the two trials is used. Use of a walking aid in the test was recorded. | |
| STS-5 | The time taken for the subject to rise 5 times, as fast as possible, from sitting in a chair is measured. The test is completed with the patient's arms crossed across their chest and they are not allowed to use any tools to help them to stand. The chair is armless and is situated against a wall in order to help maintain its stability and to avoid participant falls. | |
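The balance rules listed in Table 1 can be expressed as a small scoring function. The sketch below is illustrative: the function names are hypothetical, times are assumed to be recorded in seconds with `None` meaning the participant was unable to attempt the stance, and the gait and STS-5 component scores are taken as already-scored inputs because their cut-offs are not listed here.

```python
from typing import Optional


def balance_points(side_by_side_s: Optional[float],
                   semi_tandem_s: Optional[float],
                   tandem_s: Optional[float]) -> int:
    """Score the SPPB balance component (0-4) from the three stance times."""

    def held(t: Optional[float], threshold_s: float) -> bool:
        # None means the stance could not be attempted.
        return t is not None and t >= threshold_s

    points = 0
    points += 1 if held(side_by_side_s, 10) else 0  # 10 s scores 1, otherwise 0
    points += 1 if held(semi_tandem_s, 10) else 0   # same rule as side by side
    if held(tandem_s, 10):
        points += 2                                 # full 10 s
    elif held(tandem_s, 3):
        points += 1                                 # 3-9 s
    return points


def sppb_total(balance: int, gait: int, chair_stand: int) -> int:
    """Sum the three component scores (each 0-4) into the 0-12 SPPB total."""
    for part in (balance, gait, chair_stand):
        if not 0 <= part <= 4:
            raise ValueError("each component must be between 0 and 4")
    return balance + gait + chair_stand


print(balance_points(10, 10, 5))  # 3
```

Times between the listed integer bands (for example 2.5 s in the tandem stance) are treated here as falling in the lower band, which is an assumption.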

Statistics
Normally distributed descriptive data are reported as the mean and standard deviation (SD); otherwise, the median and range are reported. The Kolmogorov-Smirnov test was used to assess the normality of the data. We also performed paired comparisons with the paired t-test or the Wilcoxon signed rank test to assess any systematic bias between the trials. The ICC (model alpha) and a two-way random-effects model were used to assess the test-retest reliability of the data for all the repeated tests; we considered an ICC above 0.75 to demonstrate good reliability, although for clinical measurements it has been suggested that the ICC should exceed 0.90 [32]. The standard error of measurement (SEM) was used to determine the absolute reliability of the tests and represents the extent to which the outcome can vary in the measurement process. It was calculated with the following formula:
$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - r}$$
where r is the ICC for the participant groups. The MDC is defined as the amount of change in a measurement required to conclude that the difference is not attributable to error; it is the smallest change that falls outside the expected range of error. Thus, any change exceeding the MDC 90 is considered genuine and indicates confidence in the test's predictive abilities [4,27,33,34]. The MDC 90 was computed from the SEM with the following formula:
$$\mathrm{MDC}_{90} = \mathrm{SEM} \times 1.65 \times \sqrt{2}$$
A Bland-Altman plot of each participant's mean score (SPPB, OLST, TUG) plotted against their difference score (trial 1 minus trial 2) was constructed to display the spread of difference scores about the mean difference score. The Bland-Altman plots also display the 95% limits of agreement (95% LOA), which represent the expected range of difference scores across trials of the tests. The 95% LOA were calculated as the mean difference between trials ± 1.96 × SD, where SD is the standard deviation of the difference scores.
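The absolute-reliability quantities described above can be sketched as follows. The sketch assumes the conventional forms SEM = SD × √(1 − ICC) and MDC 90 = 1.65 × √2 × SEM; the function names are hypothetical, and because the text does not state which SD (first trial, pooled) enters the SEM, `sd` is simply passed in by the caller.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def sem(sd: float, icc: float) -> float:
    """Standard error of measurement from an SD and an ICC."""
    return sd * math.sqrt(1.0 - icc)


def mdc90(sem_value: float) -> float:
    """Minimal detectable change at the 90% confidence level (z = 1.65)."""
    return 1.65 * math.sqrt(2.0) * sem_value


def limits_of_agreement(trial1: Sequence[float],
                        trial2: Sequence[float]) -> Tuple[float, float]:
    """Bland-Altman 95% limits: mean difference +/- 1.96 x SD of the differences."""
    diffs = [a - b for a, b in zip(trial1, trial2)]  # trial 1 minus trial 2
    centre = mean(diffs)
    spread = 1.96 * stdev(diffs)                     # sample SD of the differences
    return centre - spread, centre + spread


print(round(mdc90(0.72), 2))  # 1.68
```

With the SEM of 0.72 points reported for the SPPB, `mdc90` returns about 1.68, in line with the 1.7-point threshold given in the abstract.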
Correlations between the three tests and hemoglobin, albumin, and creatinine levels were explored using the Spearman correlation coefficient.
We set the significance level at 0.05 for all our statistical analyses. The data were managed and analyzed using the Statistical Package for Social Sciences (SPSS) version 20.0 for Windows.
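For readers without SPSS, a two-way random-effects ICC can be computed from the ANOVA mean squares. The sketch below implements the single-measure, absolute-agreement form, often written ICC(2,1); whether this is the exact form reported by the software in this study is an assumption, and the scores are made-up illustrative data.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC.

    scores holds one row per participant and one column per session.
    """
    n = len(scores)      # participants
    k = len(scores[0])   # sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-participant mean square
    msc = ss_cols / (k - 1)                # between-session mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical test-retest scores for five participants (trial 1, trial 2).
example = [[9, 10], [7, 7], [12, 11], [5, 6], [10, 10]]
print(round(icc_2_1(example), 2))  # 0.95
```

The between-session term in the denominator penalizes a systematic shift between trials, which is why this form is stricter than a consistency-type ICC.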

Results
Data were collected from 71 participants (29 women and 42 men) with end-stage CKD receiving HD treatment at three different HD units; the mean age was 61.7±16.4 years. Some demographic details were unavailable (e.g., no height for one participant); descriptive statistics for all the participants are shown in Table 2. The activity level of the sample according to the human activity profile adjusted activity score was low, with a mean score below 53. Fig 1 shows the number of subjects who performed each test; there were 6, 9, and 5 dropouts for the SPPB, OLST, and TUG test, respectively, and the reasons for these withdrawals are shown in Fig 1. No adverse events occurred during testing.
The results of the repeated tests are shown in Table 3 (see S1 Table Original data from SPPB, S2 Table Original data from OLST, S3 Table Original data from TUG). For the SPPB, the mean ± SD in trial 1 and trial 2 were 9.6±3 and 10±2.9 points, respectively (p = 0.94); for the OLST they were 13.5±14.9 s for trial 1 and 15.1±15 s for trial 2 (p = 0.89); and for the TUG test they were 11.2±6.3 s and 10.7±5.8 s for trial 1 and 2, respectively (p = 0.96). The ICCs were high for all of the outcome measurements: 0.94 (95% confidence interval [CI] = 0.91-0.97) for the SPPB, 0.90 (95% CI = 0.83-0.94) for the OLST, and 0.96 (95% CI = 0.94-0.98) for the TUG test. The paired comparisons showed no significant differences between trial 1 and trial 2 for any of the three tests. Table 4 shows the MDC 90 values for the SPPB, OLST, and TUG test (1.7 points, 11.3 s, and 2.9 s, respectively).
Bland-Altman plots indicated no systematic bias as scores were distributed above and below the mean difference (Fig 2, Fig 3 and Fig 4).
The Spearman correlation coefficient showed a significant correlation only between the TUG test and the inverse creatinine value (r = 0.375; p = 0.004).

Discussion
The SPPB, OLST, and TUG test are widely used performance tests, probably owing to their simplicity and low cost. Our findings demonstrated that the test-retest relative reliability (ICC) for the use of these clinical tests for CKD patients was excellent: all three values reached or exceeded 0.90 [33], meaning that the two successive assessments we performed one to two weeks apart were very reproducible.
Hemoglobin levels (g/dL), mean (SD): 11.00 (1.37)

Test-retest reliability
The SPPB examines three areas of lower-extremity function (static balance, gait speed, and getting in and out of a chair) that are representative of essential tasks for independent living among CKD patients on HD. The SPPB is useful for predicting outcomes such as falls, institutionalization, and death in elderly populations [5], and although it has previously been applied Test-retest reliability for some physical functioning test in hemodialysis patients [37]. Similar to our study, Studenski et al. [36] performed the test-retest after one week, although they used a different testing site between trials: first during an outpatient clinic visit and then as part of a comprehensive home visit. In our case, we acquired all the measurements for both trials at the same location and within one or two weeks. In our study the ICC for the SPPB was high, suggesting that it is a good physical performance test for identifying loss of mobility in CKD patients undergoing HD. Future longitudinal studies should clarify whether the SPPB can predict difficulties in the activities of daily living in HD patients, as it can in elderly and older hospitalized patient populations [37,38].
No previous studies have reported the relative reliability for the OLST in patients undergoing HD, although the OLST ICC values reported in other populations are generally lower than our results (ICC ≥ 0.90). In elderly populations the ICC ranges from 0.60 [8] to 0.86 [11], following hip fracture it was 0.75 and 0.83 in the affected and non-affected leg, respectively [12], and it was 0.87 for patients with a lower-limb amputation [9]. In contrast, an ICC of 0.994 was reported for a subgroup of 50 healthy military health-care beneficiaries aged 18 and older.
There is a wide variety of published protocols for performing the OLST, but surprisingly little consensus regarding how it should be conducted. For example, some studies use a maximum time of 10 seconds [39,40], and others 30 seconds [8,12,41], 45 seconds [13,39], or 60 seconds [9,11,42]. We chose to use 45 seconds as the maximum time because Briggs et al. [10] posit that a limit of 45 seconds results in normally distributed data [10,13]. Another variable is the number of attempts the patient is allowed to achieve the maximum time: while some studies do not report this factor [8,12], in other trials it ranges between three [39,41,42] and five [9,11]. Additionally, some authors use the average of the trials for their statistical analyses [11,39] while others use the single longest time achieved [9,10,42]. Following the procedure published by Hurvitz et al. [13], itself based on Briggs et al. [10], we performed three trials and used the longest time achieved for our data analysis. This strategy appears to provide a good indication of balance capabilities because the best trial results were almost always obtained among the first three test trial results [10,13].
The details of how the OLST studies are executed also often differ: as in other studies we allowed our participants to keep their eyes open [9,39,41,42], wear shoes, choose the leg they preferred for the test, and to move their arms to help maintain their balance [13]. Moreover, our sample size was larger than that of previous studies (n = 62) and the ages included ranged from 21 to 90 years (mean 61.4±16.4 years), making ours a relatively young sample compared to other studies (see Table 5). Future studies should aim to assess if the OLST is useful for predicting falls in CKD patients.
The TUG test is a validated and commonly used method for assessing functional mobility; its relative reliability values have been reported in different populations including elderly (ICC = 0.98-0.99) [15,25], chronic heart failure (ICC = 0.93) [16], Parkinson disease (ICC = 0.80) [26], and Alzheimer disease (ICC = 0.985-0.988) [27] cohorts. Our results showed that the relative reliability of this test for patients undergoing HD is excellent (ICC = 0.96), therefore suggesting that this is an appropriate test for assessing this aspect of physical function in CKD patient groups. Additionally, this was the only test that correlated with the inverse creatinine values of the sample.
Taken together, our findings demonstrate that test-retest reliability for the SPPB, OLST, and TUG clinical tests was excellent. Factors that might explain these good results, and that should therefore be considered in the application of these tests in clinical environments, include performing these tests (i) before an HD session, (ii) on the same day of the week, and (iii) after adequate research training and standardization of the assessors' instructions. However, it is surprising that the relative reliability (ICC) in a sample with such high comorbidity (CKD patients on maintenance HD) was higher than in other cohorts with, presumably, lower health status variability (e.g. elderly populations with no chronic disease). This could mean that young people receiving renal replacement treatment are usually in a better physical condition than elderly populations receiving HD, leading to the increased consistency seen in the former in this present study.
Another reason could be the uniformity of our protocol, which we designed to ensure standardization, both of the procedures and between the researchers performing the tests. Our testing instructions were the result of a consensus between the different research teams at each center undertaking the study. Surprisingly, our review of previously published studies regarding functional testing revealed inconsistencies between the testing protocols used across a variety of tests, including the OLST. These factors might lead to inappropriate results being reported and may hinder meaningful comparison between the outcomes of different studies. Thus, we believe it is very important that both researchers and clinicians assess physical functioning in future studies using the same tools and by implementing standardized instructions.

Minimal detectable change
Despite the excellent test-retest reliability results for our patient cohort, the performance of individual participants between sessions still varied substantially, producing high MDC values (Table 4). The MDC 90 is the threshold of change that a measurement must reach in order to exceed the anticipated measurement error and variability, and is a conservative estimate of clinically meaningful score changes. In this case, the magnitude of clinically meaningful change in these physical performance tests can help clinicians and researchers to identify important functional changes in CKD patients undergoing HD [4]. The MDC for the SPPB, OLST, and TUG test has been previously studied in other populations including the elderly [5,8,11,43], people recovering from a hip fracture [12] or lower-limb amputation [9], and groups with Alzheimer disease [27]. Nevertheless, to our knowledge, this is the first study to calculate the MDC of these tests in patients with CKD undergoing HD.
Our results produced an MDC 90 of 1.7 points for the SPPB, whereas in an elderly population, a change of one point was representative of a meaningful difference in the risk of future mortality and the incidence of disability [5]. Another large study of older adults (n = 482; mean age 74.1±5.7 years) reported a SEM of 1.42 points [44], compared to the SEM of 0.72 points we obtained in this study. In this case, the time frame of the test-retest assessment was longer than in our study: the subjects were evaluated at the participant's house every three months for the first year and every 6 months for the second year. In our study we strictly replicated all the measurement conditions, but even so, the physiological and clinical status of patients undergoing HD can widely vary, potentially leading to heterogeneity in the results.
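As a consistency check, and assuming the conventional relation between the SEM and the MDC at the 90% confidence level (z = 1.65, with the factor √2 accounting for measurement error in both sessions), the SPPB SEM of 0.72 points reproduces the reported threshold:

$$\mathrm{MDC}_{90} = 1.65 \times \sqrt{2} \times \mathrm{SEM} = 1.65 \times 1.414 \times 0.72 \approx 1.68 \approx 1.7 \text{ points}$$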
Our OLST results gave an MDC 90 of 11.3 s, whereas in a community-dwelling population, the MDC 95 was 24.1 s [11]. This perhaps surprising difference can be explained by the high SD in the latter study sample (20.4 s) [45]. In patients with a lower-limb amputation the MDC 95 was 2.74 s [9], and this difference can also be related to the evaluation procedure: while we performed three trials with a maximum time of 45 seconds, other studies performed five trials with a maximum time of 60 seconds [9,11]. We chose three rather than five trials to try to achieve the longest time possible (in the knowledge that the best score is usually obtained in the first three trials), while also aiming to reduce variability and to avoid muscle fatigue [10].
The MDC 90 for the TUG test in this present study was 2.9 s. In comparison, the MDC 95 in a cohort with Parkinson disease was 3.5 s [26] (similar to our results if we calculate the MDC 90) and in another sample with Alzheimer disease, the MDC 90 was 4.09 s [27]. The high MDC found in the Alzheimer disease study can be explained by its high SD (19.95 ± 9.81 s in mild-moderate disease and 28.01 ± 17.49 s in moderate-severe to severe disease); patients with a higher level of dementia produce more variable results and need more time to perform the test compared to less demented subjects, thus generating higher MDC scores. Another important difference is the number of trials performed: while we carried out three trials, Ries et al. [27] performed two trials in patients with Alzheimer disease and Huang et al. [26] only measured the TUG once, so as to avoid fatigue (although they concluded that more trials would increase the stability of the measurement and would reduce its MDC). Hence, performing more than one trial increases the stability of the test, and as a result, decreases the MDC.
In summary, the MDC 90 results that we obtained in this study (1.7 points for the SPPB, 11.3 s for the OLST, and 2.9 s for the TUG test) represent the threshold-change values required to be 90% certain that any changes noted in the test results for any given individual patient are not due to internal variability. In the clinical field, researchers and clinicians should use these MDC values to determine whether differences in the test results obtained between follow-up trials in their CKD patients on maintenance HD represent true changes which may be associated with poor prognosis.
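The comparison with the Parkinson disease cohort can be made explicit. Assuming both thresholds are derived from the same SEM, with z = 1.96 for the MDC 95 and z = 1.65 for the MDC 90 (conventional multipliers, taken here as an assumption about how [26] computed its value), the conversion is:

$$\mathrm{MDC}_{90} = \mathrm{MDC}_{95} \times \frac{1.65}{1.96} = 3.5 \times 0.842 \approx 2.9 \text{ s}$$

Under that assumption the converted Parkinson disease threshold coincides with the 2.9 s obtained in this study.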
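In practice, applying these thresholds amounts to comparing the absolute change between two assessments with the MDC 90 for the relevant test. The sketch below uses the three values reported in this study; the dictionary, the function name, and the example scores are hypothetical.

```python
# MDC 90 thresholds reported in this study (SPPB in points, OLST and TUG in seconds).
MDC90 = {"SPPB": 1.7, "OLST": 11.3, "TUG": 2.9}


def exceeds_mdc90(test: str, baseline: float, follow_up: float) -> bool:
    """Return True when the change between assessments is larger than the MDC 90."""
    if test not in MDC90:
        raise KeyError(f"no MDC 90 available for test {test!r}")
    return abs(follow_up - baseline) > MDC90[test]


# A TUG time rising from 11.2 s to 14.5 s is a 3.3 s change, above the 2.9 s threshold.
print(exceeds_mdc90("TUG", 11.2, 14.5))  # True
# A one-point SPPB change is within the expected measurement error.
print(exceeds_mdc90("SPPB", 10, 9))      # False
```

The check says nothing about the direction or clinical importance of the change; it only flags differences larger than the expected measurement error.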

Study limitations
The main limitation of this study was the variability of our cohort in terms of its broad sample age range which may have introduced error related to the probable increased presence of comorbidities in older patients. It is also worth noting that the patient participation rate was low. Additionally, we did not register interdialytic weight gain between the first and the second evaluation day, though we tried to keep all other factors stable (HD session of the week, time, assessor). Moreover, only 30 minutes were available to perform these assessment tests before the HD session started which may have led us to rush in some cases. However, despite this time constraint, we tried to limit extrinsic variation by following a strict methodology. Another potential limitation to inter-study comparisons is the lack of academic consensus on the exact OLST testing procedure.

Conclusions
In conclusion, our results demonstrate excellent test-retest reliability for the SPPB, the OLST, and the TUG test in CKD patients undergoing HD. The MDC 90 values for each test provide clinicians with useful threshold values for identifying true changes beyond those that can be expected from individual variability. This information will help caregivers to monitor changes in the performance of their patients over time and to assess the effectiveness of interventions to maintain or improve the physical performance of patients receiving HD treatment.
Supporting information S1