Current assessment tools for sport-related concussion are limited by a reliance on subjective interpretation and patient symptom reporting. Robotic assessments may provide more objective and precise measures of neurological function than traditional clinical tests.
To determine the reliability of assessments of sensory, motor and cognitive function conducted with the KINARM end-point robotic device in young adult elite athletes.
Sixty-four randomly selected healthy, young adult elite athletes participated. Twenty-five individuals (25 M, mean age±SD, 20.2±2.1 years) participated in a within-season study, where three assessments were conducted within a single season (assessments labeled by session: S1, S2, S3). An additional 39 individuals (28M; 22.8±6.0 years) participated in a year-to-year study, where annual pre-season assessments were conducted for three consecutive seasons (assessments labeled by year: Y1, Y2, Y3). Forty-four parameters from five robotic tasks (Visually Guided Reaching, Position Matching, Object Hit, Object Hit and Avoid, and Trail Making B) and overall Task Scores describing performance on each task were quantified.
Test-retest reliability was determined by intra-class correlation coefficients (ICCs) between the first and second, and second and third assessments. In the within-season study, ICCs were ≥0.50 for 68% of parameters between S1 and S2, 80% of parameters between S2 and S3, and for three of the five Task Scores both between S1 and S2, and S2 and S3. In the year-to-year study, ICCs were ≥0.50 for 64% of parameters between Y1 and Y2, 82% of parameters between Y2 and Y3, and for four of the five Task Scores both between Y1 and Y2, and Y2 and Y3.
Overall, the results suggest moderate-to-good test-retest reliability for the majority of parameters measured by the KINARM robot in healthy young adult elite athletes. Future work will consider the potential use of this information for clinical assessment of concussion-related neurological deficits.
Citation: Mang CS, Whitten TA, Cosh MS, Scott SH, Wiley JP, Debert CT, et al. (2018) Test-retest reliability of the KINARM end-point robot for assessment of sensory, motor and neurocognitive function in young adult athletes. PLoS ONE 13(4): e0196205. https://doi.org/10.1371/journal.pone.0196205
Editor: Radouil Tzekov, Roskamp Institute, UNITED STATES
Received: July 10, 2017; Accepted: April 9, 2018; Published: April 24, 2018
Copyright: © 2018 Mang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Funding: This work was funded by Own the Podium Canada, the Canadian Academy of Sport and Exercise Medicine Research Fund, the Canadian Sport Institute Calgary and Jim Smith of Calgary, Alberta (BWB). CSM was funded by post-doctoral fellowships from the Canadian Institutes of Health Research and Alberta Innovates Health Solutions. TAW was funded by a Mitacs fellowship in partnership with Own the Podium Canada.
Competing interests: We have read the journal’s policy and the authors of this manuscript have the following competing interests: SHS is cofounder and chief scientific officer of BKIN Technologies Ltd., the company that commercializes the KINARM device used in this study. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Accurate clinical assessment and appropriate management of sport-related concussion (SRC) remain a challenge in sport medicine [1,2]. Importantly, subtle deficits, if missed, may have significant consequences in a sport environment. A key source of difficulty for the field relates to the wide range of symptoms, which cross multiple neurological domains, and the variable symptom severity that can be experienced post-SRC [2–4]. Adding to the challenge of assessing such a diffuse injury, many traditional clinical assessments of neurologic dysfunction rely on patient symptom reporting and subjective interpretation that may be influenced by examiners’ disciplines and prior clinical experience [5–7].
Developing assessments with good test-retest reliability poses another challenge to the field that is critical to overcome if such assessments are to be used to establish baseline levels of function and/or track individuals’ recovery post-SRC. For example, a large study of the Immediate Post-concussion Assessment Tool and Cognitive Test (ImPACT), one of the most commonly used SRC assessment tools in clinical and research settings, recently reported lower test-retest reliability than was previously thought (intraclass correlation coefficients [ICCs] for the four composite scores ranging from 0.36 to 0.90 across 1-, 2- and 3-year time intervals). Thus, there remains a need to develop valid clinical tools that are able to objectively, efficiently and reliably assess functional impairment across multiple domains.
Robotic technology offers a promising avenue for development of improved assessments of neurologic deficits post-SRC. Robotic tools can be designed to provide rapid and comprehensive assessments of neurologic function with a level of precision and accuracy not found in standard clinical assessments [5,10]. The Kinesiological Instrument for Normal and Altered Reaching Movements (KINARM, BKIN Technologies Ltd, Kingston, Canada) is a robotic device on which a set of standard assessments of sensory, motor and neurocognitive function has been developed [11–13]. Subsets of the KINARM Standard Tests (KST™) have been found to be reliable for neurologic assessment in adults with acute stroke [14–17], and sufficiently sensitive to detect impairments related to stroke [14–18] and moderate/severe brain injury. Recently, a subset of the KINARM Standard Tests was also found to have generally moderate-to-good test-retest reliability in healthy pediatric ice hockey players when tested twice in a single day and once a week later. The same KINARM Standard Tests were utilized in a study of patients with mild traumatic brain injury (mTBI) presenting to the emergency department. In this study, acute performance (<24 hrs post-injury) on KINARM assessments was predictive of persistent symptoms three weeks later, suggesting potential clinical utility of the KINARM in the mTBI population, a group that likely overlaps to a considerable degree with individuals with acute SRC. Nevertheless, the test-retest reliability of these KINARM assessments specifically in young adult athletes, and at long test intervals, is yet to be determined.
The primary objective of the current study was to determine test-retest reliability of KINARM robotic assessments of sensory, motor and neurocognitive function in healthy, non-concussed, young adult elite athletes. As a SRC could occur with variable timeframes relative to a healthy baseline assessment, we evaluated test-retest reliability of three assessments conducted within a single season and across three consecutive seasons in two separate groups of athletes. Determining test-retest reliability of KINARM assessments in athletic populations is a crucial first step towards future application of robotics to this field.
Materials and methods
Participants were recruited from a larger sample participating in an ongoing prospective longitudinal study of SRC. For the prospective study, participants are recruited from varsity sports teams (football, hockey, wrestling, basketball, rugby, volleyball), junior hockey teams, national sports teams (alpine skiing, freestyle skiing, Nordic skiing, luge, skeleton, bobsled, speed skating, track and field, water polo, wrestling), a youth sport school, and national youth developmental programs for the sports listed above. Inclusion criteria for the prospective study are that individuals are actively participating in organized sport and are aged 10 years or older. Participants are excluded if they are medically unstable (e.g., active cardiac disease, progressive neurological disorder), or have a peripheral or central nervous system disorder or ongoing musculoskeletal compromise of the upper extremity. For the current test-retest reliability studies, athletes between the ages of 18 and 40 years old who were participating in the larger prospective sample were recruited if they were healthy and without concussion within six months of study onset. Twenty-five athletes completed three assessments to evaluate test-retest reliability within a single season (within-season study). A separate group of 39 athletes completed assessments annually over three years (year-to-year study). All participants remained healthy and did not sustain any concussions over the course of the study periods. Visual acuity (Snellen chart, 20 feet) was determined in all 25 of the participants in the within-season study and 37 of the 39 individuals in the year-to-year study. Participants provided written, informed consent prior to assessments. Procedures were approved by the University of Calgary Conjoint Health Research Ethics Board (Ethics ID: 23963) and conducted in accordance with its guidelines.
Robotic assessments were conducted with the KINARM end-point device. Seated in front of the KINARM and grasping the end-point handles, participants moved their arms in the horizontal plane to interact with a virtual reality system displayed in the same plane. Robot height was adjusted so participants’ heads rested in the centre of the visual field (Fig 1A). Participants completed three ~20 minute robotic assessments. Within-season study participants completed their first assessment (S1, i.e. session 1) during the pre-season, then a second assessment (S2) 1–3 months later, and their third assessment (S3) 9–12 months following S2. Year-to-year study participants completed Y1 (i.e. year 1), Y2 and Y3 assessments on an approximately annual basis in three consecutive athletic pre-seasons. No concussions were sustained between any of the assessments. Each assessment included the following tasks: Visually Guided Reaching [15,19,21], Position Matching [17,19,21,22], Object Hit, Object Hit and Avoid, and Trail Making B [23,24].
The KINARM end-point robotic device is shown in panel A, and screen shots of a representative participant performing the tasks employed are shown in the other panels (B, 4-target Visually Guided Reaching; C, 4-target Position Matching; D, Object Hit; E, Object Hit and Avoid; F, Trail Making B). In panel B, the red lines depict representative hand paths when moving to each target location. In panel C, the robot moved the dominant (passive) arm and the participant mirror-matched with the non-dominant (active) arm. Shapes represent each target location and the corresponding mirror-matched position. Ellipses around the targets represent variability in position matching across trials. In panels D and E, red and purple lines depict dominant and non-dominant hand movements throughout the task. Green rectangles represent paddles that participants use to hit objects. In panel F, the red line depicts the participant’s hand path.
The Visually Guided Reaching (VGR) (Fig 1B) task evaluated upper-extremity visuomotor capability. The robot handle was displayed as a white circle (0.5 cm radius) and targets shown as red circles (1.0 cm radius). Peripheral targets were presented discretely in a random order. Participants reached from the central target to one of four peripheral targets located 10 cm away and then back. Participants performed 20 out and 20 back reaching movements with the dominant arm as quickly and accurately as possible.
The Position Matching (PM) (Fig 1C) task assessed upper-extremity proprioception (position sense). The robot moved the dominant arm (passive arm) to one of four possible targets located at the corners of a 20 cm × 20 cm square grid. When the passive arm movement was complete, participants moved the non-dominant arm (active arm) to mirror-match the passive arm position, notifying the operator when they had matched to advance to the next trial. Vision of the arms and targets was occluded. Twenty-four trials were completed with target locations presented in a pseudo-random order.
The Object Hit (OH) (Fig 1D) task examined rapid, bimanual upper-extremity motor ability and visuospatial attention. Robot handles were represented as 2 cm wide green paddles. Objects (red circles with 2 cm radius) were “dropped” from 10 evenly spaced bins across the top of the screen. Participants used both hands to hit as many objects away as possible, receiving haptic feedback when an object was hit. Over 105 seconds, 30 objects were dropped at random from each bin (300 objects total). As the task progressed, object speed and frequency increased. The paddles were smaller and the objects faster in this version of the task than in the original task developed for the KINARM, because in pilot work the elite athletes demonstrated ceiling effects in performance on the original task.
The Object Hit and Avoid (OHA) (Fig 1E) task evaluated attributes similar to those of the OH task, with added emphasis on attention, rapid motor selection and inhibition. The task proceeded similarly to OH, except that at the beginning participants were shown two of a possible ten object shapes that would be “dropped” from the bins at the top of the screen. Participants were instructed to hit those two “target” shapes, while avoiding the other eight “distractor” shapes. As with the OH task, participants received haptic feedback when contacting a target, but distractors passed through the participants’ paddles. Two hundred targets and 100 distractors were dropped over 105 seconds. Again, the paddles were smaller and the objects faster in this version of the task than in the original task developed for the KINARM.
The Trail Making B (TMB) (Fig 1F) task is a robot-based version of the standard paper and pencil task designed to evaluate visual attention and task-switching [23,24]. The robot handle was represented as a white circle (0.5 cm radius) and an array of 25 white circular targets labeled with letters (A–L) and numbers (1–13) was presented. Participants traced through the targets in an alternating numeric-alphabetic sequence (i.e., 1-A-2-B-3-C…12-L-13). When a correct target was entered, the target turned green. If an incorrect target was entered, the target turned red and the participant was required to return to the prior correct target before continuing. Participants were familiarized with the task by completing a five-target version prior to the full task. Participants were randomly presented with one of eight possible target patterns.
Robotic outcome measures
Forty-four parameters quantified performance on spatial and temporal aspects of the five tasks (Table 1). Performance on each parameter was quantified in each participant as a z-score relative to an age-corrected normative model of single-session baseline performance in athletes aged 18 to 40 years who participated in the larger prospective study of SRC (VGR, n = 522, 351M/171F; PM, n = 520, 349M/171F; OH, n = 528, 355M/173F; OHA, n = 528, 356M/172F; TMB, n = 528, 356M/172F; BKIN Technologies Ltd.). Four athletes in the within-season test-retest reliability study were 17 years old at the time of their first assessment and were assigned z-scores as though they were 18 years old. All other participants were 18 years of age or older. Only parameters with a normally distributed dataset (or a dataset that could be transformed to a normal distribution) are reported. Overall ‘Task Scores’ were also derived from a root-mean-square (RMS) distance of parameter z-scores for a given task (BKIN Technologies Ltd.), allowing consideration of overall performance on each task by combining performance on all task parameters into a single metric [25–27]. Task Scores near zero indicated superior performance and higher values indicated worse performance. A Task Score of one indicated that performance was one standard deviation worse than the norm.
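The z-score and RMS steps described above can be sketched as follows. This is a simplified illustration only: the function names and example values are hypothetical, the normative mean and SD are passed in directly rather than taken from an age-corrected model, and any further normalization applied by the vendor's software is not reproduced.

```python
import math


def z_score(value, norm_mean, norm_sd):
    """Express a raw parameter value relative to a normative mean and SD.

    Assumption: a plain (value - mean) / SD z-score; the age correction
    used in the study's normative model is not modelled here.
    """
    if norm_sd <= 0:
        raise ValueError("normative SD must be positive")
    return (value - norm_mean) / norm_sd


def task_score(z_scores):
    """Root-mean-square of one task's parameter z-scores.

    Assumption: an unweighted RMS over all parameters of the task.
    Values near zero mean performance close to the normative mean.
    """
    if not z_scores:
        raise ValueError("need at least one parameter z-score")
    return math.sqrt(sum(z * z for z in z_scores) / len(z_scores))


# Hypothetical z-scores for three parameters of one task.
example_score = task_score([1.0, -1.0, 1.0])
```

Because the z-scores are squared, deviations in either direction raise the score in this sketch, which is why a score near zero corresponds to performance near the norm on every parameter.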
Analyses were conducted separately for the within-season and year-to-year studies. ICCs evaluated test-retest reliability of the KINARM tasks between the first and second, and second and third assessments. The ICC model (2, 2) used a two-way repeated-measures, random effects analysis of variance model with type consistency. Assessment number was used as the random sample to compute the ICCs. ICCs of 0.75 and above, between 0.50 and 0.74, and below 0.50 were taken to indicate good, moderate, and poor reliability, respectively [12,28].
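As a rough illustration of a consistency-type, two-way ICC computed from a pair of assessments, the sketch below derives the mean squares from a participants-by-assessments table. It is not the MATLAB function used in the study; the function names are hypothetical, and because the text does not spell out whether the single-measure or average-measure form was reported, both are returned. The labels follow the thresholds stated above, with values from 0.50 up to (but not including) 0.75 treated as moderate.

```python
def icc_consistency(scores):
    """Two-way consistency ICCs from rows = participants, columns = assessments.

    Returns (single_measure, average_measure). Systematic shifts between
    assessments (column effects) are removed, as in a consistency ICC.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two participants and two assessments")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    if ms_rows == 0:
        raise ValueError("no between-participant variance")

    single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    average = (ms_rows - ms_error) / ms_rows
    return single, average


def reliability_label(icc):
    """Map an ICC to the good / moderate / poor bands used in the text."""
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


# Hypothetical data: every participant improves by exactly 2 units.
single, average = icc_consistency([[1, 3], [2, 4], [3, 5], [4, 6]])
```

In the hypothetical data every participant shifts by the same amount, so a consistency ICC is unaffected by the shift; this is why practice effects are examined separately with Bland-Altman plots and effect sizes.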
Bland-Altman (B-A) plots were inspected to identify systematic differences (i.e., practice effects) in measures between the first and second, and second and third assessments for each parameter. B-A plots describe the agreement between two quantitative measurements, rather than the relationship between them, as is achieved through correlational analyses. Creating a B-A plot involves plotting the mean of two measurements (e.g., assessment 1 and 2) against the difference between the same two measurements. The 95% limits of agreement, which describe the range of differences between the two measurements (mean±2SD), are commonly noted. When inspecting a B-A plot, the mean difference between the measurements is examined for potential bias or lack of agreement, and the relationship between the difference and the true value (taken to be the mean of the two measurements) is considered. The bias can be considered significant if the confidence interval of the mean difference does not include the line of equality (i.e., zero). Thus, practice effects were considered present when the 95% confidence interval (CI) around the mean difference value did not cross zero.
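The quantities read off a B-A plot can be sketched numerically as below. This is an illustrative sketch with hypothetical names, not the study's code. Two assumptions are made: differences are taken as the earlier assessment minus the later one (the direction implied by the reported results), and the 95% CI of the mean difference uses a normal quantile, whereas a t-quantile would give a slightly wider interval for samples of this size.

```python
import math
from statistics import NormalDist, mean, stdev


def bland_altman(first, second, sd_multiplier=2.0, confidence=0.95):
    """Summarize agreement between two paired sets of measurements.

    Returns the plotted points (pairwise means and differences), the bias
    (mean difference), the limits of agreement (bias +/- 2 SD by default),
    the CI of the bias, and whether that CI excludes zero.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally sized samples of at least 2")

    diffs = [a - b for a, b in zip(first, second)]  # earlier minus later
    means = [(a + b) / 2 for a, b in zip(first, second)]

    bias = mean(diffs)
    sd = stdev(diffs)
    limits = (bias - sd_multiplier * sd, bias + sd_multiplier * sd)

    # Normal approximation to the CI of the mean difference (assumption).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(len(diffs))
    bias_ci = (bias - half_width, bias + half_width)

    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "limits_of_agreement": limits,
        "bias_ci": bias_ci,
        "systematic_difference": not (bias_ci[0] <= 0.0 <= bias_ci[1]),
    }


# Hypothetical scores that rise by 2 units at the second assessment.
result = bland_altman([10, 12, 11, 13], [12, 14, 13, 15])
```

With the hypothetical data, every difference is negative, so the CI of the bias lies entirely below zero and the sketch flags a systematic difference, which corresponds to the criterion for a practice effect described above.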
Effect sizes described the magnitude of change, or practice effects, between assessments (S1 and S2, S2 and S3; Y1 and Y2, Y2 and Y3) for each measure [12,30]. Effect sizes were calculated as follows: δ = (m1 − m2) / SD1, where:
δ = effect size.
m1 = group mean at baseline.
m2 = group mean at follow-up.
SD1 = group standard deviation at baseline.
For effect size calculations between the second and third assessment, the second and third assessments were considered baseline and follow-up, respectively. Effect sizes with absolute values of 0.80 and above, between 0.50 and 0.79, and below 0.50 were considered large, medium and small, respectively. Statistical analyses were performed in MATLAB 2016a (Mathworks, Natick, MA, USA). ICCs were calculated using the Intraclass Correlation Coefficient function.
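A minimal sketch of the effect size calculation and its size bands follows. It assumes the difference is taken as baseline mean minus follow-up mean, scaled by the baseline SD; this sign convention is inferred from the reported values (for example, an increase in Total Hits is reported with a negative δ) rather than stated explicitly. The function names are hypothetical, and the sample (n − 1) standard deviation is also an assumption.

```python
from statistics import mean, stdev


def practice_effect_size(baseline, follow_up):
    """Effect size: (baseline mean - follow-up mean) / baseline SD.

    Assumptions: sign convention inferred from the reported results;
    sample standard deviation (n - 1 denominator) of the baseline group.
    """
    sd1 = stdev(baseline)
    if sd1 == 0:
        raise ValueError("baseline SD is zero")
    return (mean(baseline) - mean(follow_up)) / sd1


def effect_label(delta):
    """Classify an effect size by its absolute value."""
    magnitude = abs(delta)
    if magnitude >= 0.80:
        return "large"
    if magnitude >= 0.50:
        return "medium"
    return "small"


# Hypothetical group scores at baseline and follow-up.
delta = practice_effect_size([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
label = effect_label(delta)
```

Scaling by the baseline SD alone, rather than a pooled SD, keeps the yardstick fixed to the first assessment of each pair, so a change in variability at follow-up does not alter the scale.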
Athletes in the within-season study were all male, 20.2±2.1 (mean±SD) years of age, and had visual acuity of 20/50 uncorrected or greater. The year-to-year study participants were an average of 22.8±6.0 years of age (28M, 11F) and had visual acuity of 20/40 uncorrected or greater. Participants who typically wore corrective lenses also wore them during the robotic testing on both initial and repeat assessments. Participant characteristics are highlighted in Table 2.
Within-season test-retest reliability
Fig 2 shows data from representative parameters in the VGR (A, C) and OH tasks (B, D). There was a strong correlation between measurements taken across assessments in VGR Reaction Time (ICC, S1 to S2 = 0.83 [95% CI: 0.61–0.93]; ICC, S2 to S3 = 0.91 [95% CI: 0.81–0.96]). Even distribution of data points around the unity line suggested no practice effects for this parameter. Inspection of the B-A plot (Fig 2C) also suggested no practice effects between assessments (S1 to S2, S2 to S3), as Reaction Time difference values were evenly and randomly distributed around zero (δ, S1 to S2 = 0.28; δ, S2 to S3 = -0.10). OH Total Hits (Fig 2B) also had a strong correlation between measurements taken across assessments (ICC, S1 to S2 = 0.83 [95% CI: 0.61–0.93]; ICC, S2 to S3 = 0.88 [95% CI: 0.76–0.95]); however, data points showing S1 and S2 measurements (filled markers) were all distributed above the unity line, suggesting an S1-to-S2 practice effect. A practice effect was not observed between S2 and S3 (open markers). These observations were corroborated by the B-A plot (Fig 2D), which shows that data points depicting the difference in Total Hits between S1 and S2 (filled markers) were all below the line of equality, while differences in Total Hits between S2 and S3 (open markers) were evenly and randomly distributed around zero (δ, S1 to S2 = -1.35; δ, S2 to S3 = 0.03). While the majority of parameters did not show large practice effects, those that did (e.g., Total Hits from the OH task) only changed markedly between S1 and S2, with performance leveling off between S2 and S3 (Table 3).
On scatter plots (A and B), filled markers represent measures taken at S1 (x-axis) and S2 (y-axis). Open markers represent measures taken at S2 (x-axis) and S3 (y-axis). The black line depicts the unity line. On B-A plots, the values depicted on the x- and y-axis were derived from measures taken at S1 and S2 (filled markers), and S2 and S3 (open markers). Horizontal lines show 95% limits of agreement (S1 to S2, solid line; S2 and S3, dashed line). The thick black horizontal lines denote no difference (zero, line of equality) in performance between assessments. Markers in the dashed line box represent mean differences in performance between assessments.
Results from all analyses conducted for the Within-season study are presented in Table 3. Considering performance in S1 and S2, ICCs were below 0.50 for 32% of parameters, between 0.50 and 0.74 for 38% of parameters, and 0.75 or greater for 30% of parameters. Between S2 and S3, ICCs were below 0.50 for 20% of parameters, between 0.50 and 0.74 for 48% of parameters, and 0.75 or greater for 32% of parameters. B-A plots suggested agreement between assessments in most parameters, with practice effects apparent in 45% of parameters between S1 and S2 and 18% of parameters between S2 and S3. Examination of δ values suggested only small practice effects for the majority of parameters. Between S1 and S2, δ values were below 0.50 for 80% of parameters, between 0.50 and 0.79 for 9% of parameters and 0.80 or greater for 11% of parameters. The only parameters for which large δ values ≥0.80 were found between S1 and S2 were Total Hits, Hits with Dominant Hand and Median Error in the OH task, and Test Time and Dwell Time in the TMB task. Between S2 and S3, practice effects dissipated and δ values were small (below 0.50) for all parameters.
Overall Task Scores yielded ICCs greater than 0.50 for VGR, PM, and OHA between S1 and S2, and for VGR, PM and TMB between S2 and S3. Only the TMB Task Score had a δ of 0.80 or higher and only between S1 and S2.
Year-to-year test-retest reliability
Fig 3 shows data from representative parameters in the VGR (A, C) and OH tasks (B, D). Similar to the within-season study, there was a strong correlation between VGR Reaction Time measurements taken between assessments (Fig 3A; ICC, Y1 to Y2 = 0.86 [95% CI: 0.74–0.93]; ICC, Y2 to Y3 = 0.89 [95% CI: 0.79–0.94]), good agreement (Fig 3C) and small practice effect sizes (δ, Y1 to Y2 = 0.11; δ, Y2 to Y3 = -0.09). OH Total Hits also showed a strong correlation between measurements taken across assessments (Fig 3B; ICC, Y1 to Y2 = 0.83 [95% CI: 0.61–0.93]; ICC, Y2 to Y3 = 0.88 [95% CI: 0.76–0.95]); however, most data points showing Y1 and Y2 measurements (filled markers) fell above the line of unity, and difference values between Y1 and Y2 depicted on the B-A plot (Fig 3D) were mostly below the line of equality. This evidence of a practice effect dissipated between Y2 and Y3 (open markers, Fig 3B and 3D), as was also reflected in the effect size values (δ, Y1 to Y2 = -1.10; δ, Y2 to Y3 = 0.21). Similar to the within-season data, Total Hits from the OH task is one of only a few parameters (three total) that showed a substantial practice effect (≥0.80) between Y1 and Y2, and performance then plateaued between Y2 and Y3.
On scatter plots (A and B), filled markers represent measures taken at Y1 (x-axis) and Y2 (y-axis). Open markers represent measures taken at Y2 (x-axis) and Y3 (y-axis). The black line depicts the unity line. On B-A plots, the values depicted on the x- and y- axis were derived from measures taken at Y1 and Y2 (filled markers), and Y2 and Y3 (open markers). Horizontal lines show 95% limits of agreement (Y1 to Y2, solid line; Y2 and Y3, dashed line). The thick black horizontal lines denote no difference (zero, line of equality) in performance between assessments. Markers in the dashed line box represent mean differences in performance between assessments.
Year-to-year study results are presented in Table 4. Considering performance in Y1 and Y2, ICCs were below 0.50 for 36% of parameters, between 0.50 and 0.74 for 43% of parameters, and 0.75 or above for 21% of parameters. Between Y2 and Y3, ICCs were below 0.50 for 18% of parameters, between 0.50 and 0.74 for 34% of parameters, and 0.75 or above for 48% of parameters. B-A plots suggested agreement between assessments in most parameters, with practice effects apparent in 34% of parameters between Y1 and Y2, and 16% of parameters between Y2 and Y3. Examination of practice effect sizes supported these observations. Between Y1 and Y2, practice effect sizes were below 0.50 for 86% of parameters, between 0.50 and 0.79 for 7% of parameters and 0.80 or above for 7% of parameters. The only parameters for which large δ values (≥0.80) were found between Y1 and Y2 were Total Hits, Hits with Non-dominant Hand and Median Error in the OH task. Between Y2 and Y3, performance plateaued, with δ values below 0.50 for all parameters.
Overall Task Scores yielded ICCs greater than 0.50 for all but the OHA task between both sets of time points (Y1 to Y2, and Y2 to Y3), and practice effects were minimal for all tasks between each time point (all δ’s <0.50).
A summary of all ICC and effect size results for both the within-season and year-to-year studies is presented in Fig 4.
Panel A depicts the number of parameters (/44) with ICCs <0.50 (poor, black bar), between 0.50–0.74 (moderate, dark grey bar) and ≥0.75 (good, light grey bar). Panel B depicts the number of parameters (/44) with δ values <0.50 (small, black bar), between 0.50–0.79 (moderate, dark grey bar) and ≥0.80 (large, light grey bar).
The main finding of the current work was that there was moderate-to-good test-retest reliability for most performance parameters derived from the KINARM. Medium-to-large practice effects (δ ≥0.50) were apparent between the first and second assessment for a minority of parameters (20% in within-season, 14% in year-to-year), with no considerable performance changes between the second and third assessment for any parameter. These findings were comparable whether assessments were conducted within a single season or in the pre-season period for three consecutive seasons.
A study of pediatric ice-hockey players employed similar analyses of the same KINARM assessments as those included in the present work, but conducted two assessments in immediate succession on the same day, and a third assessment one week later. In the pediatric study and the current within-season and year-to-year studies conducted in adults, ICCs were 0.50 or above for the majority of the parameters across all assessments. Importantly, our current results suggest that if an individual were assessed at a baseline time point and then re-assessed after sustaining a SRC, the reliability of the majority of the KINARM parameters would be expected to be similar whether the injury occurred early (i.e. within-season study results) or late in the season (i.e. year-to-year results). Nevertheless, the proportion of parameters with moderate-to-good reliability between the first and second assessment appeared higher in the pediatric study (75%) compared to the present work (within-season study, 68%; year-to-year study, 64%). Also, in the pediatric study there was little difference in the proportion of parameters with ICCs above 0.50 between assessments (75% Assessment 1 to 2, 73% Assessment 2 to 3). In contrast, for both the within-season and year-to-year studies there were more parameters with ICCs above 0.50 between the second and third, relative to the first and second assessments (within-season, 68% S1 to S2, 80% S2 to S3; year-to-year, 64% Y1 to Y2, 82% Y2 to Y3). These slightly disparate findings likely relate to the short interval between tests in the pediatric study relative to the studies reported here. Regardless, moderate-to-good reliability was found for the majority of KINARM task performance parameters previously in pediatric athletes, and now here in young adult athletes across two timeframes of testing.
The test-retest reliability of the subset of KINARM Standard Tests studied here appears comparable to that of other assessment tools currently being used to evaluate SRC in clinical and research settings. For example, investigation of test-retest reliability of computerized neurocognitive tests, such as the ImPACT, in athletes has yielded results suggesting a range from very poor to very good reliability [9,33–35]. A study of young healthy individuals completing the Sensory Organization Test of Computerized Dynamic Posturography (SOT), a test commonly used to evaluate postural stability in SRC research [36,37], determined an ICC of 0.64 for its composite score and ICCs ranging from 0.43 to 0.79 for the scores derived for each of the six SOT conditions. Also, a test-retest reliability study of the post-concussion symptom scale in the Sport Concussion Assessment Tool (version 3) reported an ICC of 0.43 at an approximately 6-month test interval. Additionally, the Test Time parameter from the robot-based version of the Trails B test used here yielded an ICC of 0.30 in the within-season study and 0.70 in the year-to-year study when considering the first and second assessments. Prior work with the pencil and paper version of this task has reported ICCs ranging from 0.39 to 0.85 [33,40,41]. The higher ICC in the year-to-year study relative to the within-season study here likely relates to an almost two-fold smaller practice effect in the year-to-year study for this particular parameter (within-season δ = 1.22; year-to-year δ = 0.65).
When evaluating practice effects in this study, only a small proportion of parameters demonstrated a large change in performance between the first and second assessments (within-season, 11%; year-to-year, 7%). In the within-season study, these large practice effects were exclusively observed for parameters derived from the OH and TMB tasks, and for the year-to-year study only for parameters from the OH task. For the TMB task in the within-season study, the largest practice effect was found for the Test Time parameter, a notable finding given its common clinical use [23,24]. Similar practice effects in the pencil and paper version of the TMB task have been reported previously. Also of interest, the proportion of total parameters with δ values of 0.50 or greater from the first to second assessment was similar whether the assessments were separated by one to two months (within-season study) or one year (year-to-year study). This finding suggests that the practice effects associated with completing a baseline assessment with these tasks do not necessarily diminish with time for most parameters. As with our prior work with the KINARM, performance plateaued between the second and third assessment. This performance plateau after the second administration of the KINARM tasks is consistent with changes in performance observed in athletes undergoing computerized and pencil and paper neurocognitive tests. When baseline testing and re-assessing individuals post-injury, such practice effects could obscure the detection of impairments. Potential strategies to mitigate such an impact for the KINARM assessments include baseline testing individuals twice prior to the athletic season, or accounting for a known normative practice effect when evaluating post-injury performance.
We also studied Task Scores that measure global performance on each robotic assessment. In the within-season study, reliability of Task Scores varied between tasks and time points, with only the VGR and PM Task Scores yielding moderate-to-good reliability across both sets of time points (S1 to S2, and S2 to S3). In contrast, for the year-to-year study, Task Scores had moderate-to-good reliability for all tasks except OHA across all assessments (Y1 to Y2, and Y2 to Y3). In both studies, practice effects on Task Scores were minimal. Derivation of a weighted Task Score, with greater emphasis on parameters with moderate-to-good test-retest reliability, may provide a means to further improve reliability of such composite scores.
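One possible form of such a weighted composite is sketched below in Python. This is not the KINARM Task Score algorithm. It assumes each parameter has already been converted to a z-score, uses each parameter's ICC as its weight, and drops parameters with an ICC below 0.50. The function name, the weighting rule, and the numbers are all hypothetical.

```python
def reliability_weighted_score(z_scores, iccs, min_icc=0.50):
    """Weighted mean of parameter z-scores, weighted by test-retest ICC.

    Parameters with ICC below min_icc receive zero weight. This weighting
    rule is an illustrative assumption, not an established method.
    """
    if len(z_scores) != len(iccs):
        raise ValueError("z_scores and iccs must have the same length")
    weights = [icc if icc >= min_icc else 0.0 for icc in iccs]
    total = sum(weights)
    if total == 0:
        raise ValueError("no parameter meets the reliability threshold")
    return sum(w * z for w, z in zip(weights, z_scores)) / total


# Hypothetical z-scores and ICCs for three parameters from one task
z_scores = [1.0, 2.0, -1.0]
iccs = [0.8, 0.3, 0.6]

composite = reliability_weighted_score(z_scores, iccs)
```

Here the second parameter (ICC of 0.3) is excluded, so the composite depends only on the two more reliable parameters. Whether such a rule actually improves the reliability of a composite score would need to be tested empirically.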
As mentioned above, the reliability of the KINARM tasks utilized here was similar to that of other assessment tools currently being used to assess SRC-related deficits, such as ImPACT neurocognitive testing [9,33–35], SOT posturography, and symptom reporting. Given that SRC is a diffuse injury with many potential overlapping pathologies, it is not always clear what to look for (e.g., many different signs/symptoms, cognitive impairment, sensorimotor deficits, balance impairment, vestibular features, oculomotor deficits, cervicogenic features, autonomic nervous system dysfunction). There may be subtle deficits which, if missed, may have significant consequences in a high-risk sporting context. There is also the challenge of athletes concealing symptoms, as they often want to return to sport quickly. Recent evidence demonstrates that the window for physiological recovery may outlast symptom and clinical recovery [43,44]. SRC is a brain injury, but the acute clinical symptoms largely reflect an impairment in neurological function rather than structural brain injury. The KINARM potentially adds to previously developed SRC assessment approaches by examining additional elements of neurologic function from multiple systems and brain processes simultaneously (e.g., upper-extremity motor and sensory function) that could plausibly be impacted by SRC. A further benefit of the KINARM is that it can be custom programmed, and has the potential to be integrated with force plate and eye tracking technology. Thus, it is possible that new tasks designed to assess elements of neurologic function probed by ImPACT, SOT or oculomotor testing could be added to the repertoire of KINARM tasks used to assess SRC, or that more complex and demanding dual-tasks could be developed to measure subtle SRC-related deficits that are clinically significant when returning to high-risk sport participation.
Lastly, the efficiency of the KINARM assessment (<20 minutes for all tasks used here) and its provision of immediate access to objective quantitative data and normative models for the KINARM Standard Tests suggest that it holds promise for overcoming many of the challenges associated with improving SRC assessment.
Although the test-retest reliability and practice effects appear similar across the within-season and year-to-year studies for most of the KINARM parameters, differences in the compositions of the study samples make it difficult to draw comparisons. For example, the within-season study included only male athletes from varsity team sports, while the year-to-year study included both male and female athletes from varsity team sports and national level individual sports. There is currently limited information available regarding sex differences in KINARM standard task performance. One study examined sex differences in performance of a version of a KINARM standard task. Using the KINARM exoskeleton robot and a 9-target version of the PM task (versus the KINARM end-point robot and 4-target PM task used here), a significant sex difference was reported for the Absolute Error XY parameter, but no others. Other work demonstrated sex differences in neuropsychological testing and aspects of motor behaviour, such as visual reaction time. Taken together, these studies suggest that sex could also potentially contribute to variability in KINARM task performance and reliability in our current study. The potential influence of type of athlete (i.e., sport) on KINARM task performance has not been studied previously, but it stands to reason that the varying skillsets required of athletes participating in different sports and at different levels of competition might be reflected in KINARM task performance. We are not currently powered to examine these potential effects further, but it is important to note that differences in sex and sport distribution could have plausibly impacted variability and reliability across the two studies reported here.
Another important consideration when interpreting the current work is that we included only healthy participants, which can lead to low inter-subject variability in performance and negatively impact measures of test-retest reliability. Consistent with this postulation, the ICCs for the VGR and PM tasks were slightly lower than what was found in individuals with acute stroke [15,17], a finding also reported in the pediatric test-retest reliability study that was conducted in healthy athletes. Future work may consider evaluating the reliability of these tests in individuals with SRC, as test-retest reliability of these measures may be higher when determined in a sample with greater inter-subject variability. Additionally, we pooled participants with and without a history of concussion in our study samples. Although past work in pediatric athletes has indicated no effect of a distant history of concussion on performance of the KINARM tasks used here, only further work with larger, balanced groups can address whether performance or test-retest reliability is moderated by a self-reported history of physician-diagnosed concussion and associated recovery time in young adult athletes. Lastly, it should also be noted that we cannot account for intrinsic variability in performance related to potential visual system deficits or individual changes that occurred between assessments. Visual acuity for participants is reported in the Results section; however, subtle problems that might be detected through examination by an optometrist or ophthalmologist (e.g., suboptimal refractive correction, slight ocular misalignment, subtle oculomotor problems, etc.) could potentially impact performance on the majority of the KINARM tasks employed. Nevertheless, the within-subjects nature of the test-retest reliability study design should at least partly mitigate variability introduced by participants with potentially sub-clinical visual system problems.
Given the end goal of utilizing this technology for assessment of acute SRC and recovery, the feasibility of using the KINARM to this end should also be addressed. Prior work examined the feasibility of integrating the KINARM into an emergency department setting for neurologic assessment of mTBI. The primary challenges faced in this emergency department study related to technological maintenance and the implementation of the large, relatively immobile KINARM device into a busy clinical environment. These same challenges are certainly applicable to the integration of the KINARM device into a sports medicine clinic for assessment of SRC. We overcame technological challenges through continued collaboration with the support staff of the makers of the KINARM (BKin Technologies Ltd.) and involvement of researchers with skills in computer science. Rather than adapt a mobile KINARM, as done in prior work, we have a dedicated room in the sports medicine clinic for all KINARM testing, and a staff member whose primary tasks are to schedule, collect and compile the information obtained from the robotic assessments. Conducting truly acute assessments of SRC (i.e., sideline assessments or <24 hours) with this device is unlikely given its size (>800 lbs) and electrical requirements, but not necessarily impossible. Rather, the most realistic and clinically meaningful use of the KINARM for SRC may be in reliably, efficiently, and objectively tracking recovery relative to a previously conducted baseline test. Nevertheless, further work is needed to gain a better understanding of the full potential real-world applicability of robotic technology for SRC assessment.
We found moderate-to-good test-retest reliability and minimal practice effects in healthy, young adult athletes for most of the KINARM task parameters evaluated. While differences in study sample composition precluded direct comparison between the within-season and year-to-year studies, the reliability and magnitude of practice effects appeared similar across the two timeframes of testing. Future work with the KINARM may consider focusing on parameters known to have high test-retest reliability, refining methods of deriving composite scores to increase their reliability, and developing additional tasks that are more demanding and further involve cognitive, postural and oculomotor function. Overall, our findings support continued study of the feasibility and effectiveness of the KINARM for prospectively assessing the effects of SRC on neurologic function.
S1 File. Data for test-retest reliability studies.
We would like to acknowledge the participating athletes, coaches and team therapists.
- 1. Powell JM, Ferraro JV, Dikmen SS, Temkin NR, Bell KR. Accuracy of mild traumatic brain injury diagnosis. Arch Phys Med Rehabil. 2008;89: 1550–1555.
- 2. McCrea MA, Nelson LD, Guskiewicz K. Diagnosis and Management of Acute Concussion. Phys Med Rehabil Clin N Am. 2017;28: 271–286. pmid:28390513
- 3. McCrory P, Meeuwisse WH, Aubry M, Cantu B, Dvorak J, Echemendia RJ, et al. Consensus statement on concussion in sport: the 4th International Conference on Concussion in Sport held in Zurich, November 2012. Br J Sports Med. 2013;47: 250–258. pmid:23479479
- 4. McCrea M, Iverson GL, Echemendia RJ, Makdissi M, Raftery M. Day of injury assessment of sport-related concussion. Br J Sports Med. 2013;47: 272–284. pmid:23479484
- 5. Scott SH, Dukelow SP. Potential of robots as next-generation technology for clinical assessment of neurological disorders and upper-limb therapy. J Rehabil Res Dev. 2011;48: 335–353. pmid:21674387
- 6. McLeod TC, Leach C. Psychometric properties of self-report concussion scales and checklists. J Athl Train. 2012;47: 221–223. pmid:22488289
- 7. Alla S, Sullivan SJ, Hale L, McCrory P. Self-report scales/checklists for the measurement of concussion symptoms: a systematic review. Br J Sports Med. 2009;43 Suppl 1: i3–12.
- 8. Covassin T, Elbin RJ 3rd, Stiller-Ostrowski JL, Kontos AP. Immediate post-concussion assessment and cognitive testing (ImPACT) practices of sports medicine professionals. J Athl Train. 2009;44: 639–644. pmid:19911091
- 9. Brett BL, Smyk N, Solomon G, Baughman BC, Schatz P. Long-term Stability and Reliability of Baseline Cognitive Assessments in High School Athletes Using ImPACT at 1-, 2-, and 3-year Test-Retest Intervals. Arch Clin Neuropsychol. 2016.
- 10. Dukelow SP. The potential power of robotics for upper extremity stroke rehabilitation. Int J Stroke. 2017;12: 7–8.
- 11. Scott SH. Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching. J Neurosci Methods. 1999;89: 119–127. pmid:10491942
- 12. Little CE, Emery C, Black A, Scott SH, Meeuwisse W, Nettel-Aguirre A, et al. Test-retest reliability of KINARM robot sensorimotor and cognitive assessment: in pediatric ice hockey players. J Neuroeng Rehabil. 2015;12: 78-015-0070-0.
- 13. Little CE, Emery C, Scott SH, Meeuwisse W, Palacios-Derflingher L, Dukelow SP. Do children and adolescent ice hockey players with and without a history of concussion differ in robotic testing of sensory, motor and cognitive function? J Neuroeng Rehabil. 2016;13: 89. pmid:27729040
- 14. Bourke TC, Lowrey CR, Dukelow SP, Bagg SD, Norman KE, Scott SH. A robot-based behavioural task to quantify impairments in rapid motor decisions and actions after stroke. J Neuroeng Rehabil. 2016;13: 91. pmid:27724945
- 15. Coderre AM, Zeid AA, Dukelow SP, Demmer MJ, Moore KD, Demers MJ, et al. Assessment of upper-limb sensorimotor function of subacute stroke patients using visually guided reaching. Neurorehabil Neural Repair. 2010;24: 528–541. pmid:20233965
- 16. Tyryshkin K, Coderre AM, Glasgow JI, Herter TM, Bagg SD, Dukelow SP, et al. A robotic object hitting task to quantify sensorimotor impairments in participants with stroke. J Neuroeng Rehabil. 2014;11: 47-0003-11-47.
- 17. Dukelow SP, Herter TM, Moore KD, Demers MJ, Glasgow JI, Bagg SD, et al. Quantitative assessment of limb position sense following stroke. Neurorehabil Neural Repair. 2010;24: 178–187. pmid:19794134
- 18. Otaka E, Otaka Y, Kasuga S, Nishimoto A, Yamazaki K, Kawakami M, et al. Clinical usefulness and validity of robotic measures of reaching movement in hemiparetic stroke patients. J Neuroeng Rehabil. 2015;12: 66-015-0059-8.
- 19. Debert CT, Herter TM, Scott SH, Dukelow S. Robotic assessment of sensorimotor deficits after traumatic brain injury. J Neurol Phys Ther. 2012;36: 58–67.
- 20. Subbian V, Ratcliff JJ, Korfhagen JJ, Hart KW, Meunier JM, Shaw GJ, et al. A Novel Tool for Evaluation of Mild Traumatic Brain Injury Patients in the Emergency Department: Does Robotic Assessment of Neuromotor Performance Following Injury Predict the Presence of Postconcussion Symptoms at Follow-up? Acad Emerg Med. 2016;23: 382–392. pmid:26806406
- 21. Dukelow SP, Herter TM, Bagg SD, Scott SH. The independence of deficits in position sense and visually guided reaching following stroke. J Neuroeng Rehabil. 2012;9: 72-0003-9-72.
- 22. Herter TM, Scott SH, Dukelow SP. Systematic changes in position sense accompany normal aging across adulthood. J Neuroeng Rehabil. 2014;11: 43-0003-11-43.
- 23. Arbuthnott K, Frank J. Trail making test, part B as a measure of executive control: validation using a set-switching paradigm. J Clin Exp Neuropsychol. 2000;22: 518–528. pmid:10923061
- 24. Tombaugh TN. Trail Making Test A and B: normative data stratified by age and education. Arch Clin Neuropsychol. 2004;19: 203–214. pmid:15010086
- 25. McLean D, Brown IE. Dexterit-E 3.6 User Guide. 2016.
- 26. Kenzie JM, Semrau JA, Hill MD, Scott SH, Dukelow SP. A composite robotic-based measure of upper limb proprioception. J Neuroeng Rehabil. 2017;14: 114-017-0329-8.
- 27. Simmatis L, Krett J, Scott SH, Jin AY. Robotic exoskeleton assessment of transient ischemic attack. PLoS One. 2017;12: e0188786. pmid:29272289
- 28. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. New Jersey: Prentice Hall; 2009.
- 29. Giavarina D. Understanding Bland Altman analysis. Biochem Med (Zagreb). 2015;25: 141–151.
- 30. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care. 1989;27: S178–89. pmid:2646488
- 31. Cohen J. Statistical power analysis for the behavioral sciences, 2nd Edition. New Jersey: Lawrence Erlbaum Associates; 1988.
- 32. Salarian A. Intraclass Correlation Coefficient. MATLAB Central File Exchange. 2008.
- 33. Register-Mihalik JK, Kontos DL, Guskiewicz KM, Mihalik JP, Conder R, Shields EW. Age-related differences and reliability on computerized and paper-and-pencil neurocognitive assessment batteries. J Athl Train. 2012;47: 297–305. pmid:22892411
- 34. Broglio SP, Ferrara MS, Macciocchi SN, Baumgartner TA, Elliott R. Test-retest reliability of computerized concussion assessment programs. J Athl Train. 2007;42: 509–514. pmid:18174939
- 35. Iverson GL, Lovell MR, Collins MW. Validity of ImPACT for measuring processing speed following sports-related concussion. J Clin Exp Neuropsychol. 2005;27: 683–689. pmid:16019644
- 36. Murray N, Salvatore A, Powell D, Reed-Jones R. Reliability and validity evidence of multiple balance assessments in athletes with a concussion. J Athl Train. 2014;49: 540–549. pmid:24933431
- 37. Resch JE, Brown CN, Schmidt J, Macciocchi SN, Blueitt D, Cullum CM, et al. The sensitivity and specificity of clinical measures of sport concussion: three tests are better than one. BMJ Open Sport Exerc Med. 2016;2: e000012. pmid:27900145
- 38. Wrisley DM, Stephens MJ, Mosley S, Wojnowski A, Duffy J, Burkard R. Learning effects of repetitive administrations of the sensory organization test in healthy young adults. Arch Phys Med Rehabil. 2007;88: 1049–1054. pmid:17678669
- 39. Chin EY, Nelson LD, Barr WB, McCrory P, McCrea MA. Reliability and Validity of the Sport Concussion Assessment Tool-3 (SCAT3) in High School and Collegiate Athletes. Am J Sports Med. 2016;44: 2276–2285. pmid:27281276
- 40. Giovagnoli AR, Del Pesce M, Mascheroni S, Simoncelli M, Laiacona M, Capitani E. Trail making test: normative values from 287 normal adult controls. Ital J Neurol Sci. 1996;17: 305–309. pmid:8915764
- 41. Valovich McLeod TC, Barr WB, McCrea M, Guskiewicz KM. Psychometric and measurement properties of concussion assessment tools in youth sports. J Athl Train. 2006;41: 399–408. pmid:17273465
- 42. Kane M, Case SM. The reliability and validity of weighted composite scores. Applied Measurement in Education. 2004;17: 221–240.
- 43. Kamins J, Bigler E, Covassin T, Henry L, Kemp S, Leddy JJ, et al. What is the physiological time to recovery after concussion? A systematic review. Br J Sports Med. 2017;51: 935–940. pmid:28455363
- 44. McCrory P, Meeuwisse W, Dvorak J, Aubry M, Bailes J, Broglio S, et al. Consensus statement on concussion in sport-the 5th international conference on concussion in sport held in Berlin, October 2016. Br J Sports Med. 2017.
- 45. Covassin T, Swanik CB, Sachs M, Kendrick Z, Schatz P, Zillmer E, et al. Sex differences in baseline neuropsychological function and concussion symptoms of collegiate athletes. Br J Sports Med. 2006;40: 923–7; discussion 927. pmid:16990442
- 46. Bleecker ML, Bolla-Wilson K, Agnew J, Meyers DA. Simple visual reaction time: sex and age differences. Dev Neuropsychol. 1987;3: 165–172.
- 47. Subbian V, Ratcliff JJ, Meunier JM, Korfhagen JJ, Beyette FR Jr, Shaw GJ. Integration of New Technology for Research in the Emergency Department: Feasibility of Deploying a Robotic Assessment Tool for Mild Traumatic Brain Injury Evaluation. IEEE J Transl Eng Health Med. 2015;3: 3200109. pmid:27170908