Test-retest reliability of the KINARM end-point robot for assessment of sensory, motor and neurocognitive function in young adult athletes

Background Current assessment tools for sport-related concussion are limited by a reliance on subjective interpretation and patient symptom reporting. Robotic assessments may provide more objective and precise measures of neurological function than traditional clinical tests. Objective To determine the reliability of assessments of sensory, motor and cognitive function conducted with the KINARM end-point robotic device in young adult elite athletes. Methods Sixty-four randomly selected healthy, young adult elite athletes participated. Twenty-five individuals (25 M, mean age±SD, 20.2±2.1 years) participated in a within-season study, where three assessments were conducted within a single season (assessments labeled by session: S1, S2, S3). An additional 39 individuals (28M; 22.8±6.0 years) participated in a year-to-year study, where annual pre-season assessments were conducted for three consecutive seasons (assessments labeled by year: Y1, Y2, Y3). Forty-four parameters from five robotic tasks (Visually Guided Reaching, Position Matching, Object Hit, Object Hit and Avoid, and Trail Making B) and overall Task Scores describing performance on each task were quantified. Results Test-retest reliability was determined by intra-class correlation coefficients (ICCs) between the first and second, and second and third assessments. In the within-season study, ICCs were ≥0.50 for 68% of parameters between S1 and S2, 80% of parameters between S2 and S3, and for three of the five Task Scores both between S1 and S2, and S2 and S3. In the year-to-year study, ICCs were ≥0.50 for 64% of parameters between Y1 and Y2, 82% of parameters between Y2 and Y3, and for four of the five Task Scores both between Y1 and Y2, and Y2 and Y3. Conclusions Overall, the results suggest moderate-to-good test-retest reliability for the majority of parameters measured by the KINARM robot in healthy young adult elite athletes. 
Future work will consider the potential use of this information for clinical assessment of concussion-related neurological deficits.



Introduction
Accurate clinical assessment and appropriate management of sport-related concussion (SRC) remain a challenge in sport medicine [1,2]. Importantly, subtle deficits, if missed, may have significant consequences in a sport environment. A key source of difficulty for the field is the wide range of symptoms, which cross multiple neurological domains, and the variable symptom severity that can be experienced post-SRC [2][3][4]. Adding to the challenge of assessing such a diffuse injury, many traditional clinical assessments of neurologic dysfunction rely on patient symptom reporting and subjective interpretation that may be influenced by examiners' disciplines and prior clinical experience [5][6][7].
Developing assessments with good test-retest reliability poses another challenge to the field, one that is critical to overcome if such assessments are to be used to establish baseline levels of function and/or track individuals' recovery post-SRC. For example, a large study of the Immediate Post-Concussion Assessment and Cognitive Testing (ImPACT), one of the most commonly used SRC assessment tools in clinical and research settings [8], recently reported lower test-retest reliability than was previously thought (intraclass correlation coefficients [ICCs] for the four composite scores ranging from 0.36 to 0.90 across 1-, 2- and 3-year time intervals) [9]. Thus, there remains a need to develop valid clinical tools that are able to objectively, efficiently and reliably assess functional impairment across multiple domains.
Robotic technology offers a promising avenue for development of improved assessments of neurologic deficits post-SRC. Robotic tools can be designed to provide rapid and comprehensive assessments of neurologic function with a level of precision and accuracy not found in standard clinical assessments [5,10]. The Kinesiological Instrument for Normal and Altered Reaching Movements (KINARM, BKIN Technologies Ltd, Kingston, Canada) is a robotic device on which a set of standard assessments of sensory, motor and neurocognitive function has been developed [11][12][13]. Subsets of the KINARM Standard Tests (KST™) have been found to be reliable for neurologic assessment in adults with acute stroke [14][15][16][17], and sufficiently sensitive to detect impairments related to stroke [14][15][16][17][18] and moderate/severe brain injury [19]. Recently, a subset of the KINARM Standard Tests was also found to have generally moderate-to-good test-retest reliability in healthy pediatric ice hockey players when tested twice in a single day and once a week later [12]. The same KINARM Standard Tests were utilized in a study of patients with mild traumatic brain injury (mTBI) presenting to the emergency department [20]. In that study, acute performance (<24 hrs post-injury) on KINARM assessments was predictive of persistent symptoms three weeks later [20], suggesting potential clinical utility of the KINARM in the mTBI population, a group that likely overlaps to a considerable degree with individuals with acute SRC. Nevertheless, the test-retest reliability of these KINARM assessments specifically in young adult athletes, and at long test intervals, is yet to be determined.
The primary objective of the current study was to determine test-retest reliability of KINARM robotic assessments of sensory, motor and neurocognitive function in healthy, nonconcussed, young adult elite athletes. As a SRC could occur with variable timeframes relative to a healthy baseline assessment, we evaluated test-retest reliability of three assessments conducted within a single season and across three consecutive seasons in two separate groups of athletes. Determining test-retest reliability of KINARM assessments in athletic populations is a crucial first step towards future application of robotics to this field.

Participants
Participants were recruited from a larger sample participating in an ongoing prospective longitudinal study of SRC. For the prospective study, participants are recruited from varsity sports teams (football, hockey, wrestling, basketball, rugby, volleyball), junior hockey teams, national sports teams (alpine skiing, freestyle skiing, Nordic skiing, luge, skeleton, bobsled, speed skating, track and field, water polo, wrestling), a youth sport school, and national youth developmental programs for the sports listed above. Inclusion criteria for the prospective study are that individuals are actively participating in organized sport and are aged 10 years or older. Participants are excluded if they are medically unstable (e.g., active cardiac disease, progressive neurological disorder), or have a peripheral or central nervous system disorder or ongoing musculoskeletal compromise of the upper extremity. For the current test-retest reliability studies, athletes between the ages of 18 and 40 years old who were participating in the larger prospective sample were recruited if they were healthy and without concussion within six months of study onset. Twenty-five athletes completed three assessments to evaluate test-retest reliability within a single season (within-season study). A separate group of 39 athletes completed assessments annually over three years (year-to-year study). All participants remained healthy and did not sustain any concussions over the course of the study periods. Visual acuity (Snellen chart, 20 feet) was determined in all 25 of the participants in the within-season study and 37 of the 39 individuals in the year-to-year study. Participants provided written, informed consent prior to assessments. Procedures were approved by the University of Calgary Conjoint Health Research Ethics Board (Ethics ID: 23963) and conducted in accordance with its guidelines.

Robotic assessment
Robotic assessments were conducted with the KINARM end-point device. Seated in front of the KINARM and grasping the end-point handles, participants moved their arms in the horizontal plane to interact with a virtual reality system displayed in the same plane. Robot height was adjusted so participants' heads rested in the centre of the visual field (Fig 1A). Participants completed three ~20-minute robotic assessments. Within-season study participants completed their first assessment (S1, i.e. session 1) during the pre-season, then a second assessment (S2) 1-3 months later, and their third assessment (S3) 9-12 months following S2. Year-to-year study participants completed Y1 (i.e. year 1), Y2 and Y3 assessments on an approximately annual basis in three consecutive athletic pre-seasons. No concussions were sustained between any of the assessments. Each assessment included the following tasks: Visually Guided Reaching [15,19,21], Position Matching [17,19,21,22], Object Hit [16], Object Hit and Avoid [14], and Trail Making B [23,24].
The Visually Guided Reaching (VGR) (Fig 1B) task evaluated upper-extremity visuomotor capability. The robot handle was displayed as a white circle (0.5 cm radius) and targets shown as red circles (1.0 cm radius). Peripheral targets were presented discretely in a random order. Participants reached from the central target to one of four peripheral targets located 10 cm away and then back. Participants performed 20 out and 20 back reaching movements with the dominant arm as quickly and accurately as possible.

Fig 1. [...] In panel B, the red lines depict representative hand paths when moving to each target location. In panel C, the robot moved the dominant (passive) arm and the participant mirror-matched with the non-dominant (active) arm. Shapes represent each target location and the corresponding mirror matched position. Ellipses around the targets represent variability in position matching across trials. In panels D and E, red and purple lines depict dominant and non-dominant hand movements throughout the task. Green rectangles represent paddles that participants use to hit objects. In panel F, the red line depicts the participant's hand path.

The Position Matching (PM) (Fig 1C) task assessed upper-extremity proprioception (position sense). The robot moved the dominant arm (passive arm) to one of four possible targets located at the corners of a 20 cm × 20 cm square grid. When the passive arm movement was complete, participants moved the non-dominant arm (active arm) to mirror-match the passive arm position, notifying the operator when they had matched to advance to the next trial. Vision of the arms and targets was occluded. Twenty-four trials were completed with target locations presented in a pseudo-random order.
The Object Hit (OH) (Fig 1D) task examined rapid, bimanual upper-extremity motor ability and visuospatial attention. Robot handles were represented as 2 cm wide green paddles. Objects (red circles with 2 cm radius) were "dropped" from 10 evenly spaced bins across the top of the screen. Participants used both hands to hit as many objects away as possible, receiving haptic feedback when an object was hit. Over 105 seconds, 30 objects were dropped at random from each bin (300 objects total). As the task progressed, object speed and frequency increased. The paddles were smaller and the objects faster in this version of the task than in the original task developed for the KINARM [16], because in pilot work the elite athletes demonstrated ceiling effects in performance on the original task.
The Object Hit and Avoid (OHA) (Fig 1E) task evaluated similar attributes to the OH task, with added emphasis on attention, rapid motor selection and inhibition. The task proceeded similarly to OH, except that at the beginning participants were shown two of a possible ten object shapes that would be "dropped" from the bins at the top of the screen. Participants were instructed to hit those two "target" shapes, while avoiding the other eight "distractor" shapes. As with the OH task, participants received haptic feedback when contacting a target, but distractors passed through the participants' paddles. Two hundred targets and 100 distractors were dropped over 105 seconds. Again, the paddles were smaller and the objects faster in this version of the task than in the original task developed for the KINARM [14].
The Trail Making B (TMB) (Fig 1F) task is a robot-based version of the standard paper and pencil task designed to evaluate visual attention and task-switching [23,24]. The robot handle was represented as a white circle (0.5 cm radius) and an array of 25 white circular targets labeled with letters (A-L) and numbers (1-13) was presented. Participants traced through the targets in an alpha-numeric sequence (i.e., 1-A-2-B-3-C. . .13-L). When a correct target was entered, the target turned green. If an incorrect target was entered, the target turned red and the participant was required to return to the prior correct target before continuing. Participants were familiarized with the task by completing a five-target version prior to the full task. Each participant was presented with one of eight possible target patterns, selected at random.

Robotic outcome measures
Forty-four parameters quantified performance on spatial and temporal aspects of the five tasks (Table 1). Measures of performance on each parameter were quantified in each participant as a z-score relative to an age-corrected normative model of single session baseline performance in athletes between 18-40 years old who participated in the larger prospective study of SRC (VGR, n = 522, 351M/171F; PM, n = 520, 349M/171F; OH, n = 528, 355M/173F; OHA, n = 528, 356M/172F; TMB, n = 528, 356M/172F; BKIN Technologies Ltd.) [25]. Four athletes in the within-season test-retest reliability study were 17 years old at the time of their first assessment and were assigned z-scores as though they were 18 years old. All other participants were 18 years of age or older. Only parameters with a normally distributed dataset (or dataset that could be transformed to a normal distribution) are reported. Overall 'Task Scores' were also derived from a root-mean-square (RMS) distance of parameter z-scores for a given task (BKIN Technologies Ltd.), allowing consideration of overall performance on each task by combining performance on all task parameters into a single metric [25][26][27]. Task Scores near zero indicated superior performance and higher values indicated worse performance. A Task Score of one indicated performance was one standard deviation worse than the norm.

Table 1 (rows recovered from the extracted text). Parameter, abbreviation | Description | Attribute measured
[parameter name missing in source] | Angular deviation between (i) a straight line from hand position at movement onset to destination target, (ii) a straight line from hand position at movement onset to hand position after initial phase of movement. | Feed-forward control: initial phase of movement.
Initial distance ratio, IDR | Ratio of (i) distance hand travelled during initial phase of movement to (ii) distance hand travelled between movement onset and movement offset (or the end of the trial if the destination target is not reached). | Feed-forward control: initial phase of movement.
Initial speed ratio, ISR | Ratio of (i) the maximum hand speed during initial phase of movement to (ii) global hand speed maximum of the trial. | Feed-forward control: initial phase of movement.
Hand bias hits, HBH | Value between -1 and 1 that describes the bias in number of balls hit by dominant and non-dominant hands. | Motor performance.
Miss bias, MB | Value between -1 and 1 that describes the bias in number of balls missed in dominant and non-dominant sides of workspace. | Spatial performance.
Hand transition, HT | Indicates where preference for using the dominant over non-dominant hand switches in workspace. | Spatial performance.
Hand selection overlap, HSO | Captures use of both hands and how often their use overlaps within workspace (i.e., balls hit with both dominant and non-dominant hands in same area of workspace). | Motor performance.
Median error, ME | Point in the task (% complete) when the participant made half of their errors. | Spatial and temporal performance.
Hand speed dominant (m/s), HSD | Mean speed of dominant hand throughout the entire task. | Motor performance.
Hand speed non-dominant (m/s), HSND | Mean speed of non-dominant hand throughout the entire task. | Motor performance.
Hand speed bias, HSB | Value between -1 and 1 that describes the bias in mean hand speed between the dominant and non-dominant hands. | Motor performance.
Movement area dominant (m²), MAD | Area of space the dominant hand entered during the entire task. | Motor performance.
Movement area non-dominant (m²), MAND | Area of space the non-dominant hand entered during the entire task. | Motor performance.
Movement area bias, MAB | Value between -1 and 1 that describes the bias in movement area between the dominant and non-dominant hands. | Motor performance.
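The RMS step behind the Task Scores can be illustrated with a minimal Python sketch. It assumes "RMS distance" means the square root of the mean of the squared parameter z-scores for one participant on one task. The function name `rms_distance` and the example values are hypothetical, and any further scaling applied by the vendor to place Task Scores in standard deviation units is not described in the text and is not reproduced here.

```python
from math import sqrt


def rms_distance(z_scores):
    """Root-mean-square of one participant's parameter z-scores for one task.

    Assumption: every parameter contributes equally and no further
    normalization is applied (the full Task Score computation is not
    specified in the text).
    """
    if not z_scores:
        raise ValueError("at least one parameter z-score is required")
    return sqrt(sum(z * z for z in z_scores) / len(z_scores))


# Hypothetical z-scores for the parameters of a single task.
example_z_scores = [0.4, -1.2, 0.1, 0.9]
print(round(rms_distance(example_z_scores), 3))
```

Because the z-scores are squared, deviations in either direction from the normative mean increase the value, so the sketch by itself does not distinguish better-than-normal from worse-than-normal performance.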

Statistical analyses
Analyses were conducted separately for the within-season and year-to-year studies. ICCs evaluated test-retest reliability of the KINARM tasks between the first and second, and second and third assessments. The ICC model (2, 2) used a two-way repeated-measures, random effects analysis of variance model with type consistency. Assessment number was used as the random sample to compute the ICCs. ICCs of 0.75 and above, between 0.50 and 0.74, and below 0.50 were taken to indicate good, moderate, and poor reliability, respectively [12,28]. Bland-Altman (B-A) plots were inspected to identify systematic differences (i.e., practice effects) in measures between the first and second, and second and third assessments for each parameter. Creating a B-A plot involves plotting the mean of two measurements (e.g., assessment 1 and 2) against the difference between the same two measurements. When inspecting the B-A plots, the 95% limits of agreement, which describe the range of differences between the two measurements (mean±2SD), are commonly noted [28]. Specifically, B-A plots are used to describe the agreement between two quantitative measurements, rather than the relationship between them, as is achieved through correlational analyses [29]. When inspecting a B-A plot, the mean difference between the measurements is observed to determine potential bias or lack of agreement, and the relationship between the difference between the measurements and the true value (taken to be the mean of the two measurements) is considered. The bias can be considered significant if the confidence interval of the mean difference does not include the line of equality (i.e., zero) [29]. Thus, practice effects were considered present when the 95% confidence interval (CI) around the mean difference value did not cross zero.
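The reliability and agreement checks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MATLAB function used in the study: the consistency-type ICCs are computed from two-way ANOVA mean squares and both the single-measure and average-measure forms are returned, because the text does not make clear which form was reported. The B-A difference is taken as first minus second assessment, and the 95% CI uses a normal critical value of 1.96 by default (a t quantile would be more appropriate for small samples). All function names and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def icc_consistency(scores):
    """Consistency-type ICCs from two-way ANOVA mean squares.

    scores: one tuple per participant, one value per assessment.
    Returns (single_measure_icc, average_measure_icc).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    single = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    average = (ms_rows - ms_error) / ms_rows
    return single, average


def classify_icc(icc):
    """Reliability bands stated in the text: good, moderate, poor."""
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


def bland_altman(first, second, crit=1.96):
    """Bias, limits of agreement (mean±2SD) and CI of the mean difference."""
    diffs = [a - b for a, b in zip(first, second)]  # assumed: first minus second
    bias = mean(diffs)
    sd = stdev(diffs)
    half_ci = crit * sd / sqrt(len(diffs))
    ci = (bias - half_ci, bias + half_ci)
    return {
        "bias": bias,
        "limits_of_agreement": (bias - 2 * sd, bias + 2 * sd),
        "ci": ci,
        # A practice effect is flagged when the CI does not include zero.
        "practice_effect": not (ci[0] <= 0.0 <= ci[1]),
    }


# Hypothetical z-scores for five participants at two assessments.
s1 = [-0.5, 0.2, 1.1, -1.3, 0.4]
s2 = [-0.1, 0.6, 1.0, -0.9, 0.9]
single_icc, average_icc = icc_consistency(list(zip(s1, s2)))
print(classify_icc(single_icc), bland_altman(s1, s2)["practice_effect"])
```

Note that a constant shift between assessments leaves a consistency-type ICC unchanged, which is why the B-A check is needed alongside it to detect practice effects.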
Effect sizes described the magnitude of change, or practice effects, between assessments (S1 and S2, S2 and S3; Y1 and Y2, Y2 and Y3) for each measure [12,30]. Effect sizes were calculated as follows:
δ = (m1 - m2) / SD1
where δ = effect size, m1 = group mean at baseline, m2 = group mean at follow-up, and SD1 = group standard deviation at baseline. For effect size calculations between the second and third assessment, the second and third assessments were considered baseline and follow-up, respectively. Effect sizes with absolute values of 0.80 and above, between 0.50 and 0.79, and below 0.50 were considered large, medium and small, respectively [31]. Statistical analyses were performed in MATLAB 2016a (MathWorks, Natick, MA, USA). ICCs were calculated using the Intraclass Correlation Coefficient function [32].
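As an illustration of the effect size calculation, the following Python sketch assumes that δ is the baseline group mean minus the follow-up group mean, divided by the baseline group standard deviation. The sign convention is an inference from the reported values (δ is negative where scores rose between assessments), and the function names and data are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Assumed form: (baseline mean - follow-up mean) / baseline SD."""
    return (mean(baseline) - mean(follow_up)) / stdev(baseline)


def classify_effect(delta):
    """Bands stated in the text, applied to the absolute value of delta."""
    size = abs(delta)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    return "small"


# Hypothetical z-scores for one parameter at baseline and follow-up.
baseline = [-0.8, -0.2, 0.1, 0.5, 0.9]
follow_up = [-0.1, 0.4, 0.6, 1.2, 1.4]
delta = effect_size(baseline, follow_up)
print(round(delta, 2), classify_effect(delta))
```

With this convention a negative δ indicates that the group mean increased from baseline to follow-up.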

Participants
Athletes in the within-season study were all male, 20.2±2.1 (mean±SD) years of age, and had visual acuity of 20/50 uncorrected or greater. The year-to-year study participants were an average of 22.8±6.0 years of age (28M, 11F) and had visual acuity of 20/40 uncorrected or greater. Participants who typically wore corrective lenses also wore them during the robotic testing on both initial and repeat assessments. Participant characteristics are highlighted in Table 2.

Within-season test-retest reliability
[...] 2C) also suggested no practice effects between assessments (S1 to S2, S2 to S3), as Reaction Time difference values were evenly and randomly distributed around zero (δ S1toS2 = 0.28; δ S2toS3 = -0.10). OH Total Hits (Fig 2B) also had strong correlation between measurements taken across assessments (ICC S1-S2 = 0.83 [95% CI: 0.61-0.93]; [...]); however, data points showing S1 and S2 measurements (filled markers) were all distributed above the unity line, suggesting a S1-to-S2 practice effect. A practice effect was not observed between S2 and S3 (open markers). These observations were corroborated by the B-A plot (Fig 2D), which shows that data points depicting the difference in Total Hits between S1 and S2 (filled markers) were all below the line of equality, while differences in Total Hits between S2 and S3 (open markers) were evenly and randomly distributed around zero (δ S1toS2 = -1.35; δ S2toS3 = 0.03). While the majority of parameters did not show large practice effects, those that did (e.g., Total Hits from the OH task) only changed markedly between S1 and S2, with performance leveling between S2 and S3 (Table 3).

Fig 2. Scatter plots (top) and B-A plots (bottom) for representative parameters in the VGR (left) and OH (right) tasks from the within-season test-retest reliability study.
On scatter plots (A and B), filled markers represent measures taken at S1 (x-axis) and S2 (y-axis). Open markers represent measures taken at S2 (x-axis) and S3 (y-axis). [...]

Results from all analyses conducted for the within-season study are presented in Table 3. Considering performance in S1 and S2, ICCs were below 0.50 for 32% of parameters, between 0.50 and 0.74 for 38% of parameters, and 0.75 or greater for 30% of parameters. Between S2 and S3, ICCs were below 0.50 for 20% of parameters, between 0.50 and 0.74 for 48% of parameters, and 0.75 or greater for 32% of parameters. B-A plots suggested agreement between assessments in most parameters, with practice effects apparent in 45% of parameters between S1 and S2 and 18% of parameters between S2 and S3. Examination of δ values suggested only small practice effects for the majority of parameters. Between S1 and S2, δ values were below 0.50 for 80% of parameters, between 0.50 and 0.79 for 9% of parameters and 0.80 or greater for 11% of parameters. The only parameters for which large δ values (≥0.80) were found between S1 and S2 were Total Hits, Hits with Dominant Hand and Median Error in the OH task, and Test Time and Dwell Time in the TMB task. Between S2 and S3, practice effects dissipated and δ values were small (below 0.50) for all parameters.
Overall Task Scores yielded ICCs greater than 0.50 for VGR, PM, and OHA between S1 and S2, and for VGR, PM and TMB between S2 and S3. Only the TMB Task Score had a δ of 0.80 or higher and only between S1 and S2.

Year-to-year test-retest reliability
[...] (Fig 3D) were mostly below the line of equality. This evidence of a practice effect dissipated between Y2 and Y3 (open markers, Fig 3B and 3D), as was also reflected in the effect size values (δ Y1toY2 = -1.10; δ Y2toY3 = 0.21). Similar to the within-season data, Total Hits from the OH task is one of only a few parameters (three total) that showed a substantial practice effect (≥0.80) between Y1 and Y2, and performance then plateaued between Y2 and Y3.
Year-to-year study results are presented in Table 4. Considering performance in Y1 and Y2, ICCs were below 0.50 for 36% of parameters, between 0.50 and 0.74 for 43% of parameters, and 0.75 or above for 21% of parameters. Between Y2 and Y3, ICCs were below 0.50 for 18% of parameters, between 0.50 and 0.74 for 34% of parameters, and 0.75 or above for 48% of parameters. B-A plots suggested agreement between assessments in most parameters, with practice effects apparent in 34% of parameters between Y1 and Y2, and 16% of parameters between Y2 and Y3. Examination of practice effect sizes supported these observations. Between [...]
Overall Task Scores yielded ICCs greater than 0.50 for all but the OHA task between both sets of time points (Y1 to Y2, and Y2 to Y3), and practice effects were minimal for all tasks between each time point (all δ's <0.50).
A summary of all ICC and effect size results for both the within-season and year-to-year studies is presented in Fig 4.

Discussion
The main finding of the current work was that there was moderate-to-good test-retest reliability for most performance parameters derived from the KINARM. Medium or large practice effects (δ of 0.50 or greater) were apparent between the first and second assessment for a minority of parameters (20% in within-season, 14% in year-to-year), with no considerable performance changes between the second and third assessment for any parameter. These findings were comparable whether assessments were conducted within a single season or in the pre-season period for three consecutive seasons.
A study of pediatric ice-hockey players employed similar analyses of the same KINARM assessments as those included in the present work, but conducted two assessments in immediate succession on the same day, and a third assessment one week later [12]. In the pediatric study [12] and the current within-season and year-to-year studies conducted in adults, ICCs were 0.50 or above for the majority of the parameters across all assessments. Importantly, our current results suggest that if an individual were assessed at a baseline time point and then reassessed after sustaining a SRC, the reliability of the majority of the KINARM parameters would be expected to be similar whether the injury occurred early (i.e. within-season study results) or late in the season (i.e. year-to-year results). Nevertheless, the proportion of parameters with moderate-to-good reliability between the first and second assessment appeared higher in the pediatric study (75%) [12] compared to the present work (within-season study, 68%; year-to-year study, 64%). Also, in the pediatric study there was little difference in the proportion of parameters with ICCs above 0.50 between assessments (75% Assessment 1 to 2, 73% Assessment 2 to 3) [12]. In contrast, for both the within-season and year-to-year studies there were more parameters with ICCs above 0.50 between the second and third, relative to the first and second assessments (within-season, 68% S1 to S2, 80% S2 to S3; year-to-year, 64% Y1 to Y2, 82% Y2 to Y3). These slightly disparate findings likely relate to the short interval between tests in the pediatric study [12] relative to the studies reported here. Regardless, moderate-to-good reliability was found for the majority of KINARM task performance parameters previously in pediatric athletes [12], and now here in young adult athletes across two timeframes of testing.

Note to Tables 3 and 4 (fragment): [...] [7]; OHA: Distractor hits total, dominant and non-dominant [6]; TMB: Error count [11,12]), but data from these parameters could not be transformed to fit a normal distribution and thus, a normative model could not be computed to determine z-scores. As a result, these parameters are not reported here. https://doi.org/10.1371/journal.pone.0196205.t003 https://doi.org/10.1371/journal.pone.0196205.t004

Fig 3. Scatter plots (top) and B-A plots (bottom) for representative parameters in the VGR (left) and OH (right) tasks from the year-to-year study.

The test-retest reliability of the subset of KINARM Standard Tests studied presently appears comparable to other assessment tools currently being used to evaluate SRC in clinical and research settings. For example, investigation of test-retest reliability of computerized neurocognitive tests, such as the ImPACT, in athletes has yielded results suggesting a range from very poor to very good reliability [9,[33][34][35]. A study of young healthy individuals completing the Sensory Organization Test of Computerized Dynamic Posturography (SOT), a test commonly used to evaluate postural stability in SRC research [36,37], determined an ICC of 0.64 for its composite score and ICCs ranging from 0.43 to 0.79 for the scores derived for each of the six SOT conditions [38]. Also, a test-retest reliability study of the post-concussion symptom scale in the Sport Concussion Assessment Tool (version 3) reported an ICC of 0.43 at an approximately 6-month test interval [39]. Additionally, the Test Time parameter from the robot-based version of the Trails B test used currently yielded an ICC of 0.30 in the within-season study and 0.70 in the year-to-year study when considering the first and second assessments. Prior work with the pencil and paper version of this task has reported ICCs ranging from 0.39 to 0.85 [33,40,41]. The higher ICC in the year-to-year study relative to the within-season study here likely relates to an almost two-fold smaller practice effect in the year-to-year study for this particular parameter (within-season δ = 1.22; year-to-year δ = 0.65).
When evaluating practice effects in this study, only a small proportion of parameters demonstrated a large change in performance between the first and second assessments (within-season, 11%; year-to-year, 7%). In the within-season study, these large practice effects were exclusively observed for parameters derived from the OH and TMB tasks, and for the year-to-year study only for parameters from the OH task. For the TMB task in the within-season study, the largest practice effect was found for the Test Time parameter, a notable finding given its common clinical use [23,24]. Similar practice effects in the pencil and paper version of the TMB task have been reported previously [33]. Also of interest, the proportion of total parameters with δ values of 0.50 or greater from the first to second assessment was similar whether the assessments were separated by one to two months (within-season study) or one year (year-to-year study). This finding suggests that the practice effects associated with completing a baseline assessment with these tasks do not necessarily diminish with time for most parameters. As with our prior work with the KINARM [12], performance plateaued between the second and third assessment. This performance plateau after the second administration of the KINARM tasks is consistent with changes in performance observed in athletes undergoing computerized and pencil and paper neurocognitive tests [33]. When baseline testing and re-assessing individuals post-injury, such practice effects could obscure the detection of impairments. Potential strategies to mitigate such an impact for the KINARM assessments include baseline testing individuals twice prior to the athletic season, or accounting for a known normative practice effect when evaluating post-injury performance.
We also studied Task Scores that measure global performance on each robotic assessment [25]. In the within-season study, reliability of Task Scores varied between tasks and time points, with only the VGR and PM Task Scores yielding moderate-to-good reliability across both sets of time points (S1 to S2, and S2 to S3). In contrast, for the year-to-year study, Task Scores had moderate-to-good reliability for all tasks except OHA across all assessments (Y1 to Y2, and Y2 to Y3). In both studies, practice effects on Task Scores were minimal. Derivation of a weighted Task Score, with greater emphasis on parameters with moderate-to-good test-retest reliability, may provide a means to further improve reliability of such composite scores [42].
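The idea of a reliability-weighted composite can be illustrated with a short sketch. This is not the derivation of the KINARM Task Score, which is described elsewhere [25]; it simply assumes that each parameter has already been expressed as a z-score, weights each parameter by its ICC, and gives zero weight to parameters below a chosen threshold. The function name, parameter names and values are hypothetical.

```python
def reliability_weighted_score(z_scores, iccs, min_icc=0.50):
    """Weighted mean of parameter z-scores, weighted by test-retest ICC.

    z_scores -- dict of parameter name -> z-score (hypothetical inputs)
    iccs     -- dict of parameter name -> ICC for that parameter
    min_icc  -- parameters with an ICC below this value get zero weight
    """
    weights = {name: (icc if icc >= min_icc else 0.0)
               for name, icc in iccs.items()}
    total_weight = sum(weights.values())
    if total_weight == 0.0:
        raise ValueError("no parameter meets the minimum ICC")
    weighted_sum = sum(weights[name] * z_scores[name] for name in weights)
    return weighted_sum / total_weight


# Illustrative values only (not study data).
z = {"param_a": 1.0, "param_b": -0.5, "param_c": 2.0}
icc = {"param_a": 0.80, "param_b": 0.60, "param_c": 0.30}
score = reliability_weighted_score(z, icc)
print(round(score, 3))
```

In this example the third parameter, with an ICC below 0.50, does not contribute to the composite, so its large z-score does not add unreliable variance to the score. Whether such weighting improves reliability in practice would need to be tested empirically.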
As mentioned above, the reliability of the KINARM tasks utilized here was similar to other assessment tools currently being used to assess SRC-related deficits, such as ImPACT neurocognitive testing [9,33-35], SOT posturography [38], and symptom reporting [39]. Given that SRC is a diffuse injury with many potential overlapping pathologies, it is not always clear what to look for (e.g., many different signs/symptoms, cognitive impairment, sensorimotor deficits, balance impairment, vestibular features, oculomotor deficits, cervicogenic features, autonomic nervous system dysfunction). There may be subtle deficits that, if missed, may have significant consequences in a high-risk sporting context. There is also the challenge of athletes concealing symptoms, as they often want to return to sport quickly. Recent evidence demonstrates that the window for physiological recovery may outlast symptom and clinical recovery [43,44]. SRC is a brain injury but the acute clinical symptoms largely reflect an impairment in neurological function rather than structural brain injury [44]. The KINARM potentially adds to previously developed SRC assessment approaches by examining additional elements of neurologic function from multiple systems and brain processes simultaneously (e.g., upper-extremity motor and sensory function) that could plausibly be impacted by SRC. A further benefit of the KINARM is that it can be custom programmed, and has the potential to be integrated with force plate and eye tracking technology. Thus, it is possible that new tasks designed to assess elements of neurologic function probed by ImPACT, SOT or oculomotor testing could be added to the repertoire of KINARM tasks used to assess SRC, or that more complex and demanding dual-tasks could be developed to measure subtle SRC-related deficits that are clinically significant when returning to high-risk sport participation.
Lastly, the efficiency of the KINARM assessment (<20 minutes for all tasks used here) and its provision of immediate access to objective quantitative data and normative models for the KINARM Standard Tests suggest that it holds promise for overcoming many of the challenges associated with improving SRC assessment.
Although the test-retest reliability and practice effects appear similar across the within-season and year-to-year studies for most of the KINARM parameters, differences in the compositions of the study samples make it difficult to draw comparisons. For example, the within-season study included only male athletes from varsity team sports, while the year-to-year study included both male and female athletes from varsity team sports and national level individual sports. There is currently limited information available regarding sex differences in KINARM standard task performance. One study examined sex differences in performance of a version of a KINARM standard task [22]. Using the KINARM exoskeleton robot and a 9-target version of the PM task (rather than the KINARM end-point robot and 4-target PM task used here), a significant sex difference was reported for the Absolute Error XY parameter, but no others [22]. Other work demonstrated sex differences in neuropsychological testing [45] and aspects of motor behaviour, such as visual reaction time [46]. Taken together, these studies suggest that sex could also potentially contribute to variability in KINARM task performance and reliability in our current study. The potential influence of type of athlete (i.e., sport) on KINARM task performance has not been studied previously, but it stands to reason that the varying skillsets required of athletes participating in different sports and at different levels of competition might be reflected in KINARM task performance. We are not currently powered to examine these potential effects further, but it is important to note that differences in sex and sport distribution could have plausibly impacted variability and reliability across the two studies reported here.
Another important consideration when interpreting the current work is that we included only healthy participants, which can lead to low inter-subject variability in performance and negatively impact measures of test-retest reliability. Consistent with this postulation, the ICCs for the VGR and PM tasks were slightly lower than what was found in individuals with acute stroke [15,17], a finding also reported in the pediatric test-retest reliability study conducted in healthy athletes [12]. Future work may consider evaluating the reliability of these tests in individuals with SRC, as test-retest reliability of these measures may be higher when determined in a sample with greater inter-subject variability. Additionally, we pooled participants with and without a history of concussion in our study samples. Although past work in pediatric athletes has indicated no effect of a distant history of concussion on performance of the KINARM tasks used here [13], only further work with larger, balanced groups can address whether performance or test-retest reliability is moderated by a self-reported history of physician-diagnosed concussion and associated recovery time in young adult athletes. Lastly, it should also be noted that we cannot account for intrinsic variability in performance related to potential visual system deficits or individual changes that occurred between assessments. Visual acuity for participants is reported in the Results section; however, subtle problems that might be detected through examination by an optometrist or ophthalmologist (e.g., suboptimal refractive correction, slight ocular misalignment, subtle oculomotor problems) could potentially impact performance on the majority of the KINARM tasks employed. Nevertheless, the within-subjects nature of the test-retest reliability study design should at least partly mitigate variability introduced by participants with potentially sub-clinical visual system problems.
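The point above that a homogeneous healthy sample can depress ICCs follows from the general structure of the coefficient. In its simplest conceptual form (a simplification; the specific ICC model used in a given study may include additional variance terms), the ICC is the share of total variance attributable to true differences between subjects, where the symbols below denote between-subject variance and within-subject error variance. The numeric values are purely illustrative.

```latex
\[
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}
\]
\[
\text{With } \sigma_e^2 = 1:\qquad
\sigma_b^2 = 4 \;\Rightarrow\; \mathrm{ICC} = \frac{4}{5} = 0.80,
\qquad
\sigma_b^2 = 1 \;\Rightarrow\; \mathrm{ICC} = \frac{1}{2} = 0.50
\]
```

With identical measurement error, the sample with the narrower spread of true scores yields the lower ICC, which is consistent with the expectation that reliability estimates may be higher in samples with greater inter-subject variability.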
Given the end goal of utilizing this technology for assessment of acute SRC and recovery, the feasibility of using the KINARM to this end should also be addressed. Prior work examined the feasibility of integrating the KINARM into an emergency department setting for neurologic assessment of mTBI [47]. The primary challenges faced in this emergency department study related to technological maintenance and the implementation of the large, relatively immobile KINARM device into a busy clinical environment [47]. These same challenges are certainly applicable to the integration of the KINARM device into a sport medicine clinic for assessment of SRC. We overcame technological challenges through continued collaboration with the support staff of the makers of the KINARM (BKin Technologies Ltd.) and involvement of researchers with skills in computer science. Rather than adapt a mobile KINARM, as done in prior work [47], we have a dedicated room in the sports medicine clinic for all KINARM testing, and a staff member whose primary tasks are to schedule, collect and compile the information obtained from the robotic assessments. Conducting truly acute assessments of SRC (i.e., sideline assessments or <24 hours) with this device is unlikely given its size (>800 lbs) and electrical requirements, but not necessarily impossible. Rather, the most realistic and clinically meaningful use of the KINARM for SRC may be in reliably, efficiently, and objectively tracking recovery relative to a previously conducted baseline test. Nevertheless, further work is needed to gain a better understanding of the full potential real-world applicability of robotic technology for SRC assessment.

Conclusions
We found moderate-to-good test-retest reliability and minimal practice effects in healthy, young adult athletes for most of the KINARM task parameters evaluated. While differences in study sample composition precluded direct comparison between the within-season and year-to-year studies, the reliability and magnitude of practice effects appeared similar across the two timeframes of testing. Future work with the KINARM may consider focusing on parameters known to have high test-retest reliability, refining methods of deriving composite scores to increase their reliability, and developing additional tasks that are more demanding and further involve cognitive, postural and oculomotor function. Overall, our findings support continued study of the feasibility and effectiveness of the KINARM for prospectively assessing the effects of SRC on neurologic function.