The Intra- and Inter-Rater Reliability of an Instrumented Spasticity Assessment in Children with Cerebral Palsy

Aim Despite the impact of spasticity, there is a lack of objective, clinically reliable and valid tools for its assessment. This study aims to evaluate the reliability of various performance- and spasticity-related parameters collected with a manually controlled instrumented spasticity assessment in four lower limb muscles in children with cerebral palsy (CP). Method The lateral gastrocnemius, medial hamstrings, rectus femoris and hip adductors of 12 children with spastic CP (12.8 years, ±4.13 years, bilateral/unilateral involvement n=7/5) were passively stretched in the sagittal plane at incremental velocities. Muscle activity, joint motion, and torque were synchronously recorded using electromyography, inertial sensors, and a force/torque load-cell. Reliability was assessed on three levels: (1) intra- and (2) inter-rater within session, and (3) intra-rater between session. Results Parameters were found to be reliable in all three analyses, with 90% containing intra-class correlation coefficients >0.6, and 70% of standard error of measurement values <20% of the mean values. The most reliable analysis was intra-rater within session, followed by intra-rater between session, and then inter-rater within session. The Adds evaluation had a slightly lower level of reliability than that of the other muscles. Conclusions Limited intrinsic/extrinsic errors were introduced by repeated stretch repetitions. The parameters were more reliable when the same rater, rather than different raters performed the evaluation. Standardisation and training should be further improved to reduce extrinsic error when different raters perform the measurement. Errors were also muscle specific, or related to the measurement set-up. They need to be accounted for, in particular when assessing pre-post interventions or longitudinal follow-up. 
The parameters of the instrumented spasticity assessment demonstrate a wide range of applications for both research and clinical environments in the quantification of spasticity.

Introduction
in different patient pathologies and age groups. For example, motor-driven isokinetic devices measure limb resistance to passive movement with high reliability [13,14,17,18], but are often bulky and difficult to apply to children for high-velocity stretches [11]. Furthermore, a stretch reflex may be easier to excite by a transient acceleration, which is more difficult to apply robotically [19]. Therefore, a manually controlled instrumented displacement method offers a more attractive and clinically relevant alternative [20][21][22]. However, since spasticity is considered to be force- and velocity-dependent, the interaction between patient and examiner may affect the measurement. A manually controlled displacement method must therefore follow a standardized protocol, and its psychometric properties should be well defined before it is used in clinical practice [11].
Reliability is considered as the basic psychometric criterion for assessment tools. Without it, the consistency of a measurement cannot be evaluated [23], and consequently, the effect of intervention cannot be determined. Some variations arise from methodological errors, and can be considered as indications for improving the quality of the measurement (extrinsic errors), whilst other errors occur naturally, and can only be measured and accounted for (intrinsic errors) [24]. In a spasticity assessment, the variability of sequential stretch repetitions is a measure of the inherent intrinsic error. Preparation of the skin for sEMG placement, participant and limb positioning, time of day and activity prior to measurement are examples of extrinsic errors.
A manually controlled Instrumented Spasticity Assessment (ISA) was recently developed and validated to identify the severity of spasticity in the muscles of children with CP, and distinguish them from the muscle behaviour in typically developing (TD) children [25]. ISA has also been used to evaluate intervention responsiveness to botulinum toxin type-A (BTX) injections in the medial hamstrings [26]. However, a comprehensive reliability study of both the intra- and inter-rater assessments, with an exploration of the influence of various sources of intrinsic and extrinsic error, has not yet been carried out. The current study aims to evaluate the intra-rater within session, the inter-rater within session, and the intra-rater between session reliability of various performance- and spasticity-related parameters collected with ISA in children with CP. It was hypothesised that a) the parameters assessed with ISA are overall reliable, and b) the data selection procedure does not contribute significantly as a source of extrinsic error.

Methodology

Participants
Twelve participants were recruited from the Clinical Motion Analysis Laboratory, University Hospital of Pellenberg. The inclusion criteria were: (1) diagnosis of spastic CP; (2) 5-18 years of age; and (3) the ability to understand and perform the test procedure. Children were excluded if they had received BTX injections six months prior to the assessment; an intrathecal baclofen pump; selective dorsal rhizotomy; or lower limb orthopaedic surgery. The Ethical Committee of the University Hospitals of Leuven approved the experimental protocol (s50808) and written informed consent for participation was acquired from all parents.

Data acquisition
ISA has previously been reported and described [25]. The device has three components (Fig 1): (1) joint angle characteristics are measured using three inertial measurement units (IMUs: Analog Devices, ADIS16354) at a sample rate of 200 Hz; (2) reactive resistance is measured using a six degrees of freedom force/torque load-cell (ATI mini45: Industrial Automation) at a sample rate of 200 Hz; (3) sEMG activity of agonist and corresponding antagonist muscle is evaluated with a telemetric Zerowire system (Cometa, Milan, IT) at a sample rate of 2000 Hz. Labview (version 8.5, National Instruments) was used for data acquisition.

Measurement
The four muscles evaluated with ISA were: the lateral belly of the gastrocnemius (LatGas), medial hamstrings (MedHam), rectus femoris (RecFem) and the hip adductors (Adds). These muscles were selected as they are frequently treated for spasticity [8], and are also superficial, which is necessary for acquisition with sEMG. Prior to ISA, all participants underwent a lower limb clinical assessment, including evaluation of passive range of motion (ROM), muscle strength, and muscle selectivity [25]. The MAS and MTS were performed to provide a clinical indication of spasticity. The MAS was performed for all four muscle groups, and in addition, the MTS was performed for the gastrocnemius and hamstrings in cases where a MAS 1+ was given. In children with unilateral involvement, the affected side was tested. In children with bilateral involvement, the most affected side (highest average MAS-score, or, in case of symmetrical MAS-scores, the most severe MTS score) was tested. Body-weight, height and length of lower limb segments (full leg, from superior iliac spine to medial malleolus; lower-leg, from the tibiofemoral joint space to the medial malleolus; foot, from lateral malleolus to the head of metatarsal two) were recorded.

Preparation
Preparation prior to data collection consisted of shaving and cleansing the skin, and application of the sEMG electrodes [25]. One IMU was placed on each segment (thigh, shank, and foot) in positions not interfering with the placement of the sEMG electrodes. IMU placement was arbitrary as calibration trials were carried out during the measurement (S1 Fig [25]). The force/torque load-cell was calibrated and attached to the appropriate limb segment with an orthosis. Measurements of LatGas, MedHam, and RecFem were carried out with the participant in supine lying. Measurement of the Adds was carried out in side lying. For the latter measurement, the force/torque sensor was omitted, as the leg was deemed too heavy to balance on the sensor.

Protocol
Data collection began with three repetitions of a maximum voluntary isometric contraction (MVIC) for each muscle. IMU calibrations for the ankle, knee and hip were performed, and moment arms were measured with a tape measure. Four repetitions of a manually applied passive muscle stretch at three incremental velocities were performed for each muscle. Low velocity (LV) corresponded to moving the hip, knee or ankle over the available ROM during five seconds, the medium velocity was an intermediate stretch of approximately one second (not included in the current data analysis) and the third, a high velocity (HV) stretch, was performed as fast as possible. The interval between stretch repetitions was seven seconds, to avoid the effects of decreased post activation depression in spastic muscles [27]. This interval was chosen with reference to the five seconds [28] and 10-15 seconds [29] proposed by other groups in literature. An overview of the measurement protocol per muscle can be found in

Research design
Three aspects of reliability were assessed in this study (Fig 3). Sets of stretch repetitions were performed consecutively by two trained raters in a randomised order (coin flipping), which allowed for evaluation of the inter-rater within session (inter-rater WS) reliability. During this analysis, the participant stayed in the evaluation room, and the sensors were not removed. Comparison between the first three good quality stretch repetitions carried out during this session by the first rater provided the data for the evaluation of the intra-rater within session (intra-rater WS) reliability. Upon completion, all sensors were removed and the participant was given a two-hour resting period to allow for washout, during which the participant was in the hospital cafeteria. Following the break, the first rater reapplied all the sensors, and measured the participant for a second time for the evaluation of the intra-rater between session (intra-rater BS) reliability. The consistency of data selection was also evaluated (see data selection section).

Data analysis
The data from the acquired LV and HV stretches were processed in MATLAB (version 8.1.0.604 R2013a: MathWorks). The raw sEMG signal was filtered with a 6th-order zero-phase Butterworth bandpass filter from 20 to 500 Hz. The root mean square (rms) envelope of the sEMG signal (rms-EMG) was extracted by applying a low-pass 30 Hz 6th-order zero-phase Butterworth filter on the squared signal. EMG onset was defined on the rms-EMG signal as the time of the first muscle activity according to the method of Staude and Wolf [31]. In cases where this method failed (i.e. no onset or constant activation), a threshold method was applied (onset = rms-EMG activity more than 2 SD above baseline during a 0.05 s interval). To estimate joint angles, a Kalman smoother [32] was applied to the data from the IMUs. Muscle lengths were estimated based on the joint angles and anthropometric data using OpenSim software [33]. The torque signals were processed with a low-pass filter with a cut-off frequency of 40 Hz [21]. The net internal joint torque was calculated from the segment lengths, moment arms, exerted forces and moments, and the external forces caused by gravity and inertia [34] (see S1
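The fallback threshold rule for EMG onset can be illustrated with a minimal Python sketch. This is not the authors' MATLAB implementation: the function name `threshold_onset`, the length of the baseline window (`baseline_s`) and the choice to take the baseline from the start of the recording are assumptions made for illustration only. The 2 SD criterion and the 0.05 s interval are taken from the text above.

```python
from statistics import mean, stdev


def threshold_onset(rms_emg, fs_hz=2000, baseline_s=0.5, hold_s=0.05, n_sd=2.0):
    """Return the onset time in seconds, or None if no onset is found.

    rms_emg    : rms-EMG envelope samples (list of floats)
    fs_hz      : sample rate (2000 Hz for the sEMG system in the text)
    baseline_s : ASSUMED length of the quiet baseline at the start of the trial
    hold_s     : time the signal must stay above threshold (0.05 s in the text)
    n_sd       : number of baseline SDs above the baseline mean (2 in the text)
    """
    n_base = round(baseline_s * fs_hz)
    n_hold = round(hold_s * fs_hz)
    baseline = rms_emg[:n_base]
    threshold = mean(baseline) + n_sd * stdev(baseline)

    run = 0  # number of consecutive samples above threshold
    for i in range(n_base, len(rms_emg)):
        if rms_emg[i] > threshold:
            run += 1
            if run == n_hold:
                # Onset is the first sample of the supra-threshold run.
                return (i - n_hold + 1) / fs_hz
        else:
            run = 0
    return None
```

A trial for which the function returns `None` corresponds to the case in which neither automatic method succeeds and the onset has to be determined by visual inspection.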

Data selection
For the data acquired from the three analyses, a blinded, independent third rater performed the data selection. In addition, to assess the reliability of the selection procedure, the first rater also selected the data from the inter-rater WS analysis (Fig 3). Data selection was performed by visualising the raw and processed data signals in MATLAB. Any questionable performance of a stretch repetition annotated during the acquisition was taken into account during data selection.
Stretch repetitions were excluded because of poor performance or poor quality data. Performance-related reasons for data exclusion included poor handling of the force/torque sensor (mentioned during the acquisition), inconsistent maximum stretch velocities within one trial (for LV, stretch repetitions that were >7°/s from the average of all the repetitions; for HV, stretch repetitions that were >40°/s from the average of all the repetitions, derived from previously collected data [26]), or stretches that were performed outside the desired plane of motion (forces and torques registered in directions other than the sagittal plane). Poor quality EMG included clear artefacts in the EMG signal, loss of the EMG signal, a highly inconsistent EMG pattern in comparison with the other stretch repetitions, low signal-to-noise ratio or active assistance of the participant during the passive stretches (activation of agonist and/or antagonist prior to stretch onset or at inconsistent moments during stretch). The automatic definition of EMG onset was visually inspected. In those cases when neither automatic EMG onset detection method was successful, the third rater manually determined the EMG onset based on visual inspection.
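The velocity-consistency criterion is the only exclusion rule above that is fully numeric, and it can be sketched in a few lines of Python. The function and constant names are hypothetical, and the sketch assumes that the average is taken once over all repetitions of a trial; whether the average was recomputed after an exclusion is not stated in the text.

```python
from statistics import mean

# Maximum allowed deviation from the trial average, in degrees per second,
# as stated in the text for low-velocity (LV) and high-velocity (HV) stretches.
LIMITS_DEG_PER_S = {"LV": 7.0, "HV": 40.0}


def flag_inconsistent_velocity(vmax_per_repetition, condition):
    """Return one boolean per repetition: True means 'candidate for exclusion'.

    vmax_per_repetition : maximum angular velocity of each repetition (deg/s)
    condition           : "LV" or "HV"
    """
    limit = LIMITS_DEG_PER_S[condition]
    average = mean(vmax_per_repetition)
    return [abs(v - average) > limit for v in vmax_per_repetition]
```

For example, four HV repetitions with maximum velocities of 300, 310, 290 and 380°/s have an average of 320°/s, so only the last repetition (60°/s from the average) would be flagged. The remaining exclusion criteria relied on visual inspection and are not captured by this sketch.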

Outcome parameters
Twelve parameters based on previous ISA literature [24,34,35] were selected and categorised as either performance-related (five parameters) or spasticity-related (seven parameters).
Performance-related. Performance-related parameters were used to evaluate the quality of the performance of the stretch repetitions. They included the ROM covered during LV and HV stretches (ROM LV and ROM HV, respectively), the maximum velocity reached during LV and HV stretches (V MAX LV and V MAX HV, respectively), and the single largest value of the rms-EMG amplitude acquired from the three MVIC repetitions (peak MVIC).
Spasticity-related. Spasticity-related parameters were extracted from rms-EMG and from the computed net internal joint torque. A 'zone of maximum velocity' (V max zone) was demarcated in order to emphasise the velocity-dependent character of spasticity. The V max zone was defined as starting 200 ms prior to V MAX and ending at 90% of the full ROM of the stretch. Average rms-EMG was calculated by dividing the area under the rms-EMG time curve by the duration of the V max zone (rms-EMG, expressed in mV). This parameter was also expressed as a normalised percentage to the peak MVIC (rms-EMG, expressed as %). Torque (expressed in Nm) was analysed at 70° knee flexion for the MedHam and RecFem, and at 10° plantar flexion for the LatGas. These angles corresponded to a common mid-ROM angle amongst all participants. Work (expressed in J) was defined as the integral of torque with respect to the position between V MAX and 90% of the ROM. The muscle-lengthening threshold was defined as the muscle length at the time of EMG onset during a LV stretch. EMG onset during LV stretches was not often present in the LatGas and RecFem [25]. Therefore, this parameter was only calculated for the MedHam and Adds. In all four muscles, muscle-lengthening velocity threshold was defined as the muscle-lengthening velocity at the time of EMG onset during a HV stretch. All muscle lengths and muscle lengthening velocity thresholds were expressed as a percentage of the muscle length in the anatomical zero position (ML and MLV, expressed as % and %/s, respectively).
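The average rms-EMG and Work definitions above both reduce to a numerical integral, which the following Python sketch illustrates. It is not the authors' code: the function names are hypothetical, trapezoidal integration is an assumed choice of numerical method, the input samples are assumed to be already cropped to the relevant zone, and joint angle is assumed to be in radians so that torque in Nm integrates to work in J.

```python
def trapezoid(y, x):
    """Trapezoidal integral of y with respect to x (equal-length sequences)."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2.0
               for i in range(len(x) - 1))


def average_rms_emg(time_s, rms_emg_mv):
    """Area under the rms-EMG time curve divided by the zone duration (mV)."""
    duration = time_s[-1] - time_s[0]
    return trapezoid(rms_emg_mv, time_s) / duration


def work_joules(angle_rad, torque_nm):
    """Integral of torque with respect to joint position (J)."""
    return trapezoid(torque_nm, angle_rad)


def percent_of_mvic(rms_emg_mv, peak_mvic_mv):
    """Express an rms-EMG value as a percentage of the peak MVIC."""
    return 100.0 * rms_emg_mv / peak_mvic_mv
```

With these helpers, a constant torque of 2 Nm applied over 1 rad gives 2 J of work, and an rms-EMG value of 0.2 mV against a peak MVIC of 0.8 mV corresponds to 25%.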
The angle of catch (AOC) was defined as the angle that corresponded to the time of the first local minimum power after the time that maximum power was reached [36], and was expressed as a percentage of the ROM. To provide a measure of the severity of spasticity, the absolute change between the average of 3-4 repetitions from HV and LV stretch repetitions (HV-LV) was calculated for rms-EMG, Torque and Work.
For the intra-rater WS analysis, only ROM, V MAX, ML and MLV were calculated. For the inter-rater WS and intra-rater BS analyses, ROM, V MAX, rms-EMG HV-LV, Torque HV-LV, Work HV-LV, ML and MLV were calculated by taking the average of 3-4 good stretch repetitions per velocity. AOC was calculated from the first well performed HV stretch, and its reliability was only evaluated for the inter-rater WS and intra-rater BS analyses. The reliability of MVIC was only evaluated for the intra-rater BS analysis.

Statistical analysis
Group descriptive statistics of all parameters were calculated per muscle and measurement session. Bland-Altman plots portraying limits of agreement were created and independently reviewed by two raters to determine any systematic bias. Relative and absolute reliability were evaluated using the intra-class correlation coefficients (ICC 2,1 for intra-rater WS and ICC 2,k for inter-rater WS and intra-rater BS) with 95% confidence intervals [37] and the standard error of measurement (SEM), respectively. The reliability of the data selection procedure was determined by calculating the ICC (ICC 2,k) and SEM on the data curated by raters one and three. The ICC was investigated for absolute agreement to detect any relevant systematic error between raters. The SEM was calculated from the square root of the mean square error from one-way ANOVA, and expressed as a percentage of the mean of the test and re-test values [23]. SEM% values <20% were considered acceptable based upon the average change in previously reported ISA parameters following treatment with BTX in the MedHam [25,26]. ICCs >0.80 indicated high relative reliability, 0.60-0.79 indicated moderately-high relative reliability, 0.40-0.59 indicated moderate relative reliability and <0.40 indicated low relative reliability [38]. To identify the most responsive spasticity-related parameters, the minimal detectable change (MDC) was calculated (MDC = SEM × 1.645 × √2) [39], and expressed as a percentage of the mean of the test and re-test values. Statistical analysis was performed using MATLAB 7.6.0 R2013a (MathWorks), SPSS Statistics (version 22, IBM), and MedCalc (version 12.7).
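The SEM and MDC calculations described above can be sketched as follows. This is an illustrative Python version, not the authors' MATLAB/SPSS workflow: the function names and the data layout (one list of repeated values per participant) are assumptions. The SEM is taken as the square root of the within-subject mean square of a one-way ANOVA, and the factor 1.645 in the MDC corresponds to a 90% confidence level.

```python
from math import sqrt
from statistics import mean


def sem_from_one_way_anova(repeats_per_participant):
    """Square root of the within-subject mean square error (one-way ANOVA).

    repeats_per_participant : list of lists, e.g. [[test, retest], ...]
    """
    ss_within = 0.0
    df_within = 0
    for repeats in repeats_per_participant:
        participant_mean = mean(repeats)
        ss_within += sum((x - participant_mean) ** 2 for x in repeats)
        df_within += len(repeats) - 1
    return sqrt(ss_within / df_within)


def minimal_detectable_change(sem):
    """MDC = SEM x 1.645 x sqrt(2), as given in the text."""
    return sem * 1.645 * sqrt(2)


def percent_of_grand_mean(value, repeats_per_participant):
    """Express SEM or MDC as a percentage of the mean of all test/re-test values."""
    grand_mean = mean(x for repeats in repeats_per_participant for x in repeats)
    return 100.0 * value / grand_mean
```

As a worked example with made-up values, the test/re-test pairs (10, 12), (20, 22) and (30, 32) give a within-subject mean square of 2, so SEM = √2 ≈ 1.41, SEM% ≈ 6.7% of the grand mean of 21, and MDC = 2 × 1.645 = 3.29.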

Results
Twelve children participated in the study (Table 1). One child participated only in the inter-rater WS analysis, and two children participated only in the intra-rater WS&BS analysis. This yielded a total of 11 children for the intra-rater WS&BS analyses, and 10 children for the inter-rater WS analysis. Data of two RecFem and one Adds were excluded due to time restrictions at the time of data collection, or due to poor quality EMG. The ML parameter was not calculated for two MedHam and five Adds in the intra-rater WS&BS analyses, and for one MedHam and four Adds in the inter-rater WS analysis, due to a lack of EMG onset at LV. Similarly, due to a lack of EMG onset at HV, the MLV parameter was not calculated for two MedHam and two Adds in the intra-rater WS&BS analyses, and for one MedHam and one Adds in the inter-rater WS analysis.

Data selection
Following the selection of the 1249 stretch repetitions from the inter-rater WS and intra-rater BS analyses, 139 (11%) were excluded. From the session curated by raters one and three (total 570 stretch repetitions), rater one excluded 131 stretch repetitions (23%) and rater three excluded 76 stretch repetitions (13%). Table 2 reports the subsequent ICC and SEM% values of the data curated by the two raters. Of all the 39 ICC values, two (MLV in the LatGas and AOC in the RecFem) were <0.6. The ICC of the ML for the Adds was not computable. This happens when the between-subject variation is relatively small compared to the within-subject variation.
SEM% values <20% were found in all but one of the 16 performance-related parameters, the exception being V max LV for the Adds. For the spasticity-related parameters, SEM% values <20% were found in all but five of the 23 parameters (MLV in the LatGas and Adds, Torque of MedHam, and rms-EMG and rms-EMG % of the Adds).

The intra-rater WS, inter-rater WS, and intra-rater BS analyses
Results from the reliability analyses for the LatGas and MedHam can be found in Table 3, and those for the RecFem and Adds in Table 4. Parameters computed using HV-LV tended to have higher SD values. This was especially the case for the rms-EMG HV-LV parameters. There was no evidence of systematic bias or heteroscedasticity. Of all the ICC values, 76% were >0.8 and 14% were between 0.6 and 0.8 (Table 5). Of the 11 ICC values <0.6, four were in the intra-rater BS analysis, and seven in the inter-rater WS analysis. There were three V max LV; two V max HV; two rms-EMG HV-LV (%); one ROM LV; one Torque HV-LV; one AOC and one MLV. Four were found in the LatGas, three in the MedHam, and two in both the RecFem and Adds. ICC values with their corresponding confidence intervals for inter-rater WS and intra-rater BS are displayed in Fig 4. In the LatGas and MedHam, overall wider CIs of the ICC values were seen for the inter-rater WS than for the intra-rater BS, except for the rms-EMG HV-LV (%), which was wide in both analyses. With the exception of V max LV and AOC, the opposite trend was seen for the RecFem. CIs of both Adds analyses were similar, but generally wider than those in the other muscles.

Standard error of measurement (SEM)
For the SEM values of all four muscles, expressed as a percentage of the mean of the test and re-test values, 37% were below 10% error, 33% were between 11-20% error, 17% were between 21-30% error and 13% were >30% error (Table 5). Of those 32 SEM values >20%, 17 were found in the intra-rater BS analysis, 14 were found in the inter-rater WS analysis and one in the intra-rater WS analysis. The higher SEM values were seven rms-EMG HV-LV (%); five rms-EMG HV-LV (mV); four V max LV; four Work HV-LV; four MVIC; four MLV; three Torque HV-LV; and one ROM LV, and were more often found in the RecFem and Adds than in the LatGas and MedHam.

Reliability
Intra-rater WS analysis. The intra-rater WS analysis compared the first three good quality stretch repetitions in the same measurement session. This assessed any error inherent to the investigated parameters. Such error may be caused by intrinsic factors such as spasticity, post activation depression, thixotropy, or by an extrinsic factor such as the waiting time between stretch repetitions. In this analysis, most parameters showed an ICC >0.8 and SEM% values <20%. SEM% values were comparable to, if not smaller than, the values from the two other reliability analyses. This finding confirms a limited contribution of error due to three repeated stretch repetitions, and implies that a seven second waiting period is satisfactory, allowing for the influence of any hyper-excitability or post activation depression of a muscle stretch to subside [25].
Intra-rater BS analysis. After the intra-rater WS analysis, the second most reliable analysis was the intra-rater BS, where extrinsic errors introduced between sessions were analysed. Reapplication of the IMU sensors in different sessions requires a new calibration procedure, possibly influencing the joint motion parameters. A similar justification can also be made for the re-application of the sEMG electrodes and orthoses, which may influence the spasticity-related parameters and the handling of a stretch. Additionally, the participant and the limb on the support frame need to be repositioned. Nonetheless, the intra-rater BS analysis still demonstrated a satisfactory level of reliability. In order to further improve a between session analysis, the sources of extrinsic error should be accounted for and reduced. Bar-On et al. have previously evaluated the reliability for the intra-rater WS&BS analyses for several parameters of the LatGas and MedHam [25]. In comparison with the current study, they showed lower ICC and generally higher SEM values for all performance- and some spasticity-related parameters. This finding was expected as their study included only six participants, which may not have been a representative sample. Furthermore, in contrast to the two-hour interval between measurement sessions of the current study, Bar-On et al. reported an average interval of 13 days [25]. Too short an interval may interfere with the participants' concentration, whilst too long an interval makes it challenging to control what happens during the interim period. The appropriate time interval for a between session reliability analysis should be further investigated.
Table 5. The number of parameters in all three analyses categorised according to their intra-class correlation coefficient (ICC) and standard error of measurement (SEM) and expressed as a percentage of the mean test and re-test values for all four muscles.
Inter-rater WS analysis. The reliability of ISA was generally higher when comparing within and between sessions performed by the same rater, than between two different raters. Inter-rater reliability is important if ISA is to be used in clinical practice, as the same rater is not always available to perform a follow-up assessment. Furthermore, considering that the current inter-rater analysis was carried out within the same session, additional extrinsic errors are also anticipated between sessions. Standardisation and training should be further improved to increase the reliability when different raters perform the measurement. This could be achieved by ensuring that different raters practice together when learning how to grasp the load-cell and where to stand when performing each measurement, by adding a metronome beep to suggest and support specific stretch velocities, and by the use of training videos.
Investigated muscles. When comparing the four muscles, the performance-related parameters had a tendency to be most reliable in the MedHam, followed by LatGas and RecFem, and then Adds. For the spasticity-related parameters, the RecFem had the highest reliability, followed by MedHam and LatGas, and then Adds. It is not so surprising that the Adds were the least reliable of the investigated muscles, as they are also the most difficult stretch to perform. It requires movement of the entire limb, as opposed to just a single segment, which may allow a larger introduction of errors. Furthermore, identifying only one of the adductor muscles is challenging in children with CP, and crosstalk between muscles may have occurred. Additionally, the nature of spasticity in the Adds may have a higher intrinsic error than the other three muscles. This could not be confirmed by the current study, as indications of spasticity severity (HV-LV) were not computable in the intra-rater WS analysis, and comparisons between different muscles with spasticity have not been reported in literature.

The implications of data selection
Since ISA is a manually performed test, the selection procedure is essential in ensuring that only well performed stretch repetitions are included for analysis. However, as the selection procedure was not automated, it has to be considered as a possible source of extrinsic error. Two raters independently curated the same set of data, following the same rules of data exclusion. The final number of included stretch repetitions varied between the two raters (excluded: rater one = 23%; rater three = 13%). Despite these differences, small SEM% values were found in all but five of the 23 spasticity-related parameters. The first exception was the MLV parameter in the LatGas and Adds. This parameter was calculated by defining the timing of EMG onset. In those cases when neither automatic EMG onset detection method was successful, the EMG onset was manually determined, which may explain some of the discrepancy between raters. Another exception was the Torque parameter of MedHam. Stretch repetitions were seldom excluded due to artefacts in the torque signal. Therefore, exclusion of stretch repetitions based on other criteria was the likely cause of a high SEM% for the torque parameter. Lastly, low selection agreement between raters also influenced the two rms-EMG parameters of the Adds. This may have been caused by the high EMG baseline often seen in the Adds. Overall though, the investigation of the data selection procedure confirmed the hypothesis that little extrinsic error is introduced, as long as three well-performed stretch repetitions are available, and both raters adhere to the well-defined selection criteria. In the future, the addition of a live feedback system informing the clinician in real time about each stretch repetition could avoid the need to capture excess data in order to obtain at least three well-performed stretch repetitions.

ISA compared to other literature
To the best of the authors' knowledge, only six other groups have evaluated the reliability of a manually controlled device that combines multidimensional signals for the assessment of spasticity (Table 6).
Overall, the parameters that could be compared to previous studies were shown to be of either similar, or higher reliability in ISA. Although all the studies in Table 6 assessed spasticity with multidimensional signals, only two studies investigated the reliability of both the biomechanical and electrophysiological parameters, and that was in the pathology of stroke [41,42]. Furthermore, no study assessed the reliability of a manually controlled device in CP. For the studies that assessed an intra-rater WS analysis, waiting time between stretch repetitions varied from one second to 15 seconds, suggesting that the seven second time interval selected for ISA is a fair compromise. Intervals for between session analyses ranged from 10 minutes to one day, illustrating the lack of consensus on what interval is sufficient. Finally, the extent of statistical analyses for assessing reliability varied between studies, and it can be viewed as a limitation that only one study investigated a measure of absolute reliability.

Implications of findings
Reliability is considered to be the basic psychometric criterion for assessment tools, and without it, validity and responsiveness cannot be determined. For the SEM, the smaller the value, the fewer the errors (random and systematic), and in turn the greater the reliability [43]. An SEM% value may also be referenced in terms of the responsiveness to treatment. If an SEM value is able to yield an MDC value small enough to detect change post treatment, it can be statistically interpreted as reliable. Based on the results of the current study, we can attempt to assess the clinical feasibility of ISA in its current state. As previously identified, all four investigated muscles had EMG onsets at high velocity, suggesting some component of velocity-dependent spasticity. In addition, the MedHam and Adds also had an EMG onset at low velocity, suggesting a component of position-dependent spasticity. This already suggests a possible distinction for evaluating various types of spastic behaviour. Certain ISA parameters have been deemed sensitive enough to differentiate between pre and post treatment intervention with BTX in the MedHam [26]. In order to validate this finding, the corresponding MDC values of the same spasticity-related parameters from the current study can be compared to the average treatment induced change values reported in literature (Table 7). The MDC value of the rms-EMG HV-LV (mV) parameter was small enough to detect a response in the MedHam to treatment with BTX. This is expected because the rms-EMG parameter most closely reflects the definition of spasticity [4]. However, the effect of BTX treatment on the MedHam did not exceed the reported MDC values for the torque and work parameters. These parameters not only reflect spasticity, but also non-neural tissue changes such as increased passive muscle stiffness and viscosity. These non-neural components could account for the parameters' limited response in detecting a change post BTX [44].
Another consideration is that these parameters are highly dependent on the way the stretch is performed (grasp of the force/torque load-cell). Further research is required to study the effect of tone reduction treatment for all lower limb muscles, using the MDC values of the spasticity related parameters reported by the current study. Additionally, progress is also required to decompose the biomechanical parameters into their neural and non-neural components.
For a device like ISA, the MDC alone is not enough; it is also important to acknowledge the minimally important change (MIC). The MIC can be established by evaluating the effect of decreasing spasticity on the development of secondary muscle deformities. In future, changes in function assessed by means of 3D motion analysis, as well as patient/clinician feedback, could also be used.

Study limitations
Several study limitations need to be acknowledged. The number of participants was small, especially for a reliability study applying parametric statistics. A sample of twelve participants is comparable to the sizes recruited in other studies [21,28,29,40-42], but is still limited taking into account the power analysis estimated by Walter et al. [45].
Table 7. MDC for the spasticity-related parameters of the medial hamstrings (MedHam), and the average difference of those parameters between pre and post treatment with Botulinum Toxin-A (BTX) as previously reported [26].
The medium velocity stretch repetitions were excluded from this investigation, as manually acquiring them with ISA is more challenging and time consuming than with a motorized system.
In those cases where a low ICC value was combined with a relatively low SEM% value, it can be argued that the ICC may not have been a suitable statistic. The ICC is indicative of relative reliability, so if the sample group is homogeneous, ICC values will be small even if the test-retest variability is small, and vice versa [23]. This limitation necessitated the inclusion of a measure of absolute reliability. If an SEM is high, consideration of the various sources of error can help to determine whether it can be reduced [24]. A high ICC value combined with a high SEM may indicate systematic error. One way to estimate the presence of systematic error over random error is to compare various ICC calculation models [23].
Parameters involving HV-LV calculations often showed poorer reliability. As these parameters were not assessed in the intra-rater WS analysis, further investigation is required to determine where the error originates and whether it can be reduced.
The MVIC may be difficult to collect in children with CP [46]; therefore, it was decided that both normalised and non-normalised rms-EMG parameters would be investigated.
Overall, the non-normalised rms-EMG parameter appeared to be more reliable, suggesting that normalisation to the MVIC introduced error. This should be considered in future studies when attempting to detect the severity of spasticity or responsiveness to an intervention.
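The dependence of the ICC on sample heterogeneity, noted above as the reason for also reporting an absolute measure of reliability, can be illustrated with a small worked example. The sketch below assumes a one-way random-effects ICC(1,1) and an SEM taken as the square root of the within-subject mean square; the ICC model used in this study may differ, and the data are invented for illustration.

```python
from statistics import mean


def icc_oneway(data):
    """Return (ICC(1,1), SEM) for a list of per-subject repeated measurements.

    Assumes every subject has the same number of repetitions k.
    SEM is taken here as sqrt(within-subject mean square).
    """
    n = len(data)       # number of subjects
    k = len(data[0])    # repetitions per subject
    grand = mean([x for row in data for x in row])
    subj_means = [mean(row) for row in data]
    ss_between = k * sum((m - grand) ** 2 for m in subj_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(data, subj_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    return icc, ms_within ** 0.5


# Invented data: both groups have the same test-retest difference
# (each subject is measured at true value - 1 and true value + 1).
heterogeneous = [[t - 1.0, t + 1.0] for t in (10, 20, 30, 40, 50)]
homogeneous = [[t - 1.0, t + 1.0] for t in (28, 29, 30, 31, 32)]

icc_het, sem_het = icc_oneway(heterogeneous)
icc_hom, sem_hom = icc_oneway(homogeneous)

print(f"heterogeneous: ICC = {icc_het:.2f}, SEM = {sem_het:.2f}")  # ICC 0.99
print(f"homogeneous:   ICC = {icc_hom:.2f}, SEM = {sem_hom:.2f}")  # ICC 0.43
```

In this example the within-subject error, and therefore the SEM, is identical in both groups, yet the ICC falls from about 0.99 to about 0.43 purely because the subjects in the second group resemble one another more closely.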
For reasons of feasibility, this study was unable to evaluate the reliability of an inter-rater BS analysis. Based on the findings of the intra-rater BS and inter-rater WS analyses, it is assumed that there will be some degree of error within the parameters of an inter-rater BS analysis. Consequently, without this analysis, if two different raters perform the pre and post measurements of an intervention, it is unknown if the investigated parameters will be sensitive enough to detect a change. This gap remains a limitation in ascertaining the true reliability of ISA in the clinical setting.
As angles were only calculated in the sagittal plane, it was assumed that calibration and stretch trials were only performed within this plane and, in addition, that only one joint was moved during stretch. A previous study reported limited measurement error when small out-of-plane movements, or movement of the proximal joint, occur [25]. Nevertheless, in the current study, participants lacking neutral joint-alignment were excluded, and out-of-plane movements were minimized by means of standardised reporting on the performance of each stretch.
Lastly, inertial influences on torque were estimated with anthropometric approximations, whereby the foot and lower leg were considered as one segment (see appendix 1) [34]. Fortunately, a previous study has shown that the error introduced by assuming the ankle as fixed during knee movements only has a limited effect on the resulting knee-joint torque [25].

Conclusion
Based on the outcomes of this reliability study, together with the previously published literature, ISA has been demonstrated to possess a wide range of applications in both the research and clinical environment. The sources of error identified within this study seem to be small, and not to have a large impact on the parameters. The intra-rater WS was the most reliable of the three analyses, followed by the intra-rater BS, and then the inter-rater WS. The time interval between sessions, re-application of sensors and repositioning of the participant are likely sources of error. When two different raters perform the measurement, standardisation and training should be improved to minimise the extrinsic error as much as possible. Errors were also muscle specific, or related to the measurement set-up. This variation needs to be accounted for, especially when assessing pre-post interventions or longitudinal follow-up.