Manual muscle testing and hand-held dynamometry in people with inflammatory myopathy: An intra- and interrater reliability and validity study

Manual muscle testing (MMT) and hand-held dynamometry (HHD) are commonly used in people with inflammatory myopathy (IM), but their clinimetric properties have not yet been sufficiently studied. To evaluate the reliability and validity of MMT and HHD, maximum isometric strength was measured in eight muscle groups across three measurement events. To evaluate reliability of HHD, intra-class correlation coefficients (ICC), standard errors of measurement (SEM) and smallest detectable changes (SDC) were calculated. To measure reliability of MMT, linear weighted Cohen's Kappa was computed for single muscle groups and the ICC for the total score. Additionally, correlations between MMT8 and HHD were evaluated with Spearman correlation coefficients. Fifty people with myositis (56±14 years, 76% female) were included in the study. Intra- and interrater reliability of HHD yielded excellent ICCs (0.75–0.97) for all muscle groups, except for interrater reliability of ankle extension (0.61). The corresponding SEMs% ranged from 8 to 28% and the SDCs% from 23 to 65%. The MMT8 total score revealed excellent intra- and interrater reliability (ICC>0.9). Intrarater reliability of single muscle groups was substantial for shoulder and hip abduction, elbow and neck flexion, and hip extension (0.64–0.69); moderate for wrist (0.53) and knee extension (0.49); and fair for ankle extension (0.35). Interrater reliability was moderate for neck flexion (0.54) and hip abduction (0.44); fair for shoulder abduction, elbow flexion, wrist and ankle extension (0.20–0.33); and slight for knee extension (0.08). Correlations between the two tests were low for wrist, knee, ankle, and hip extension; moderate for elbow flexion, neck flexion and hip abduction; and good for shoulder abduction. In conclusion, the MMT8 total score is a reliable assessment of general muscle weakness in people with myositis, but the MMT8 is not reliable for single muscle groups. In contrast, our results confirm that HHD can be recommended to evaluate strength of single muscle groups.

Introduction
between 0.28 and 0.85 were reported for the MMT8 whilst reliability values of the HHD ranged from 0.88 to 0.98 [4,17]. Absolute agreement parameters have, to the best of our knowledge, not been reported. Although both measures (MMT8 and HHD) are used to assess maximal voluntary isometric muscle contraction, it has not been investigated whether the results of MMT8 and HHD in people with myositis are comparable.
The first aim of the present study was, therefore, to evaluate intra- and interrater reliability of the MMT8 and HHD in adults with myositis. Secondly, this study aimed to determine concordance between MMT8 and HHD. It was hypothesised that HHD would demonstrate excellent reliability (ICC>0.75), that MMT8 would demonstrate substantial reliability (Kappa values between 0.61 and 0.8) and that the concordance between HHD and MMT8 would be good (Spearman correlation between 0.7 and 0.9) for all tested muscle groups.

Participants
A convenience sample of 50 people with myositis was recruited from the Department of Rheumatology of the University Hospital Zurich, Switzerland, between August 2014 and May 2016. All patients presenting for evaluation of myositis were asked by their physician whether they would be interested in participating in this study. Interested patients were then contacted by one of the researchers and checked against the inclusion and exclusion criteria. Inclusion criteria were a diagnosis of polymyositis, dermatomyositis or a myositis-associated disorder (scleroderma, systemic lupus erythematosus, Sjögren's syndrome), age over 18, and ability to read and understand German. Exclusion criteria were diagnoses of inclusion body myositis, pulmonary hypertension, osteoporosis, severe cardiovascular and/or pulmonary disease, pain syndrome, and paresis. The participants gave their signed informed consent to participate, and the study was approved by the local ethics committee (registration no. 2014-0022 of the Cantonal Ethics Committee Zurich, Switzerland). The individual in this manuscript demonstrating a measurement set-up has given written informed consent (as outlined in PLOS consent form) to publish these case details. This study is registered at ClinicalTrials.gov (registration number: NCT03059394).
Out of 76 people with myositis who met all inclusion criteria, 50 agreed to participate. Four dropped out after the first measurements. Therefore, reliability was analyzed with data from 46 participants. Due to pain or incapacity to perform certain test positions, some muscle groups could not be tested in all participants. The detailed sample selection process is shown in Fig 1.

Testers
The measurements were performed by two senior physiotherapists experienced in the treatment and measurement of people with rheumatologic diseases. The female physiotherapists were 35 and 47 years old, had a body height of 162 cm and 175 cm and weighed 49 kg and 60 kg, respectively. The two testers were instructed and trained in the use of the MMT8 and the HHD before the start of the study.
hour break, the MMT8 and HHD were conducted by tester 2 for interrater reliability (Measurement 3, Fig 2). Measures were performed in the same order, in the same test room and if possible at the same time of the day, to optimize the standardisation of the test procedure.
Manual muscle testing (MMT8). The dominant side of the following eight muscle groups was tested in a standardised order: shoulder abduction, elbow flexion, ankle extension, hip abduction, hip extension, knee extension, wrist extension and neck flexion. The dominant side was based on the self-declared hand preference. Detailed descriptions of the participant's and therapist's positions and the precise test instructions for each muscle group are given in the "manual muscle testing procedure for MMT8 Testing". Each muscle group was scored according to the Kendall 10-point Scale (Table 1) [18]. Scores of 0-3, 4-6, and 7-9 indicate severe, moderate and mild weakness, respectively, and a score of 10 means that there is no detectable weakness [19]. The single scores were summed to obtain a total score ranging from 0 to 80 (0 = no muscle contraction, 80 = normal strength).
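The scoring rule described above can be illustrated with a minimal sketch. The function names and the example participant are hypothetical and not part of the study protocol; only the category limits (0-3, 4-6, 7-9, 10) and the summation into a 0-80 total are taken from the text.

```python
def grade_weakness(score: int) -> str:
    """Map a single Kendall 0-10 score to the weakness category named in the text."""
    if not 0 <= score <= 10:
        raise ValueError("Kendall score must lie between 0 and 10")
    if score <= 3:
        return "severe"
    if score <= 6:
        return "moderate"
    if score <= 9:
        return "mild"
    return "none"


def mmt8_total(scores: list[int]) -> int:
    """Sum eight single-muscle scores into the MMT8 total score (0-80)."""
    if len(scores) != 8:
        raise ValueError("MMT8 requires exactly eight muscle-group scores")
    for s in scores:
        grade_weakness(s)  # reuses the 0-10 range check
    return sum(scores)


# Hypothetical participant, scores listed in the standardised test order
example = [9, 10, 10, 8, 6, 10, 9, 8]
print(mmt8_total(example))                   # 70
print([grade_weakness(s) for s in example])
```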
Hand-held dynamometry. Muscle strength of the same muscle groups that were included in the MMT8 was assessed using the MicroFET2 hand-held dynamometer. The MicroFET2 is a battery-operated hand-held device which measures peak force in Newtons (N), up to a value of 890 N (Force Evaluating and Testing, Hoggan Health Industries Inc. West Draper, UT, USA). Each muscle action was measured in a gravity-neutralized position. Testing procedure and test position followed standardised protocols [20-22]. After at least one familiarization trial, each muscle group was assessed twice. Isometric "make" tests were used [20]. Peak force values were recorded for each trial. Participant position, placement of the dynamometer, verbal instruction and location of stabilisation provided for each tested muscle group are described in the "Manual Quantitative Muscle Testing".

Data analysis
Demographic data (gender, age, BMI, diagnosis, disease stage, time since diagnosis) were summarized using descriptive statistics. Normality of the data was evaluated using the Shapiro-Wilk test. The level of significance was set to α = 0.05 (with Bonferroni correction for multiple comparisons). No imputation was performed. A case was excluded from a particular analysis when a required variable was missing; however, the case was included in all analyses for which the required variables were present. Due to this pairwise deletion, the total N was not consistent across all analyses. SPSS version 22.0 (SPSS Inc, Chicago, Illinois) was used for data analysis.
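As a worked illustration of the correction, and assuming the standard Bonferroni rule of dividing α by the number of comparisons m (the value m = 16 is the number of comparisons stated with the t-test results), the corrected threshold is:

```latex
\alpha_{\text{corrected}} = \frac{\alpha}{m} = \frac{0.05}{16} \approx 0.003
```

This agrees with the threshold of 0.003 quoted in the Results.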
Hand-held dynamometry. Data of each muscle group and total score were summarized by mean and standard deviation. To compute total scores, the values of each muscle group were added and this sum was divided by eight. The peak force of the best trial (peak force) and the averaged force of the two performed trials (averaged peak force) were used for data analysis.
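The two summary values and the total score described above can be sketched as follows. The force values are hypothetical and the function names are illustrative only; the rules (best of two trials, mean of two trials, sum over the eight muscle groups divided by eight) follow the text.

```python
def peak_force(trials):
    """Best (highest) trial of one muscle group, in Newtons."""
    return max(trials)


def averaged_peak_force(trials):
    """Mean of the performed trials of one muscle group, in Newtons."""
    return sum(trials) / len(trials)


def total_score(values_per_muscle):
    """Sum of the eight muscle-group values divided by eight."""
    if len(values_per_muscle) != 8:
        raise ValueError("expected values for exactly eight muscle groups")
    return sum(values_per_muscle) / 8


# Hypothetical two-trial recordings (N) for one participant
trials = {
    "shoulder abduction": (100.0, 110.0),
    "elbow flexion": (150.0, 140.0),
    "ankle extension": (200.0, 180.0),
    "hip abduction": (120.0, 130.0),
    "hip extension": (160.0, 150.0),
    "knee extension": (220.0, 210.0),
    "wrist extension": (50.0, 60.0),
    "neck flexion": (80.0, 70.0),
}

peaks = [peak_force(t) for t in trials.values()]
averages = [averaged_peak_force(t) for t in trials.values()]
print(total_score(peaks))      # 138.75
print(total_score(averages))   # 133.125
```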
Relative reliability, which expresses how well participants can be distinguished from each other despite the presence of measurement error, was determined by calculating the intraclass correlation coefficient (ICC) [23]. The ICC2(A,1) formula for reliability of the highest score and the ICC2(A,k) formula for reliability of the average score were used [23,24]. For interpretation of ICC values, the following classification was considered: >0.75: excellent reliability, 0.40-0.75: fair to good reliability, and <0.40: poor reliability [25].
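For readers who want to reproduce this kind of analysis outside SPSS, the following is a minimal sketch of a two-way, absolute-agreement ICC computed from ANOVA mean squares. It assumes the commonly used single-measure and average-measure agreement formulas; the function names and data are hypothetical, and this is not the authors' SPSS procedure.

```python
def two_way_mean_squares(data):
    """data: one row per participant, one column per measurement occasion or rater."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-participant mean square
    msc = ss_cols / (k - 1)                 # between-measurement mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return msr, msc, mse, n, k


def icc_agreement(data):
    """Return (single-measure ICC, average-measure ICC) for absolute agreement."""
    msr, msc, mse, n, k = two_way_mean_squares(data)
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


def interpret_icc(icc):
    """Classification used in the text: >0.75, 0.40-0.75, <0.40."""
    if icc > 0.75:
        return "excellent"
    if icc >= 0.40:
        return "fair to good"
    return "poor"


# Hypothetical peak forces (N) of five participants at two measurements
forces = [[210.0, 205.0], [150.0, 160.0], [95.0, 90.0], [180.0, 170.0], [120.0, 131.0]]
single, average = icc_agreement(forces)
print(round(single, 2), interpret_icc(single))
```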
To evaluate changes over time, variability between participants and, therefore, relative reliability is not particularly informative. In this case, parameters of absolute measurement error, also called agreement parameters, are indicated [26]. Therefore, the standard error of measurement (SEM) and the smallest detectable change (SDC) were calculated. The SEM represents the standard deviation of repeated measures of one individual and is calculated with the formula $\mathrm{SEM}_{\mathrm{agreement}} = \sqrt{\sigma_{pt}^{2} + \sigma_{\mathrm{residual}}^{2}}$ [26]. The SDC represents the minimal change that must be exceeded to ensure a real change and is calculated with the formula $\mathrm{SDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$ [26]. To evaluate systematic error between strength measures, Bland and Altman plots were drawn with the free MedCalc statistical software (MedCalc Software, Ostend, Belgium) [27].
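The two agreement parameters can be sketched in a few lines. The function names are hypothetical; the variance component for systematic differences between measurements and the residual variance are assumed to come from a two-way ANOVA such as the one underlying the ICC. The SEM of 1.8 used in the example is the value reported in the Results for the raw MMT8 total score.

```python
import math


def sem_agreement(var_pt, var_residual):
    """SEM as the square root of the summed error variance components."""
    return math.sqrt(var_pt + var_residual)


def sdc(sem):
    """Smallest detectable change: 1.96 x sqrt(2) x SEM."""
    return 1.96 * math.sqrt(2) * sem


def as_percent(value, mean):
    """Express an SEM or SDC relative to a mean score (assumed denominator)."""
    return 100 * value / mean


print(sem_agreement(9.0, 16.0))   # 5.0
print(round(sdc(1.8), 2))         # 4.99
```

The factor 1.96 × √2 is roughly 2.77, so an SDC is always a little less than three times the SEM.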
Manual muscle testing (MMT8). Raw MMT scores (0-10) as well as graded MMT scores (0-3: severe weakness, 4-6: moderate weakness, 7-9: mild weakness, 10: no weakness) [19] are ordinal scales and were, therefore, summarized by medians and interquartile ranges for single muscle groups and for the graded total score. Floor and ceiling effects were determined by calculating the proportion of individuals obtaining the lowest or highest scores, respectively, where a limit of 15% should not be exceeded [28].
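The floor and ceiling check can be illustrated as follows. This is a sketch with hypothetical scores and a hypothetical function name; only the 15% limit and the 0-10 score range are taken from the text.

```python
def floor_ceiling(scores, lowest=0, highest=10, limit=0.15):
    """Share of participants at the lowest and highest score, flagged against the limit."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > limit,
        "ceiling_effect": ceiling_share > limit,
    }


# Hypothetical raw scores of ten participants for one muscle group
scores = [10, 10, 9, 8, 10, 7, 10, 9, 10, 6]
result = floor_ceiling(scores)
print(result["ceiling_share"], result["ceiling_effect"])   # 0.5 True
print(result["floor_share"], result["floor_effect"])       # 0.0 False
```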
To measure reliability of single muscle groups and of the graded total score, weighted Cohen's Kappa was computed using the GraphPad software (http://faculty.vassar.edu/lowry/kappa.html). Because misclassifications between adjacent categories are less serious than those between more distant categories, we used a linear Kappa [23]. To interpret kappa values, we applied the Landis and Koch benchmarks (>0.8: almost perfect, 0.61-0.8: substantial, 0.41-0.6: moderate, 0.21-0.4: fair, <0.2: slight) [29]. For ordinal data there are no parameters that quantify the measurement error in units of measurement [23].
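A linear weighted kappa can be sketched in plain Python as below. This is an illustrative implementation with hypothetical names and data, not the online calculator used in the study. It uses disagreement weights |i − j| / (c − 1), so neighbouring categories are penalised less than distant ones. How values falling exactly on a benchmark boundary (for example 0.20) are labelled is an assumption made here.

```python
def linear_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with linear disagreement weights |i - j| / (c - 1)."""
    index = {cat: i for i, cat in enumerate(categories)}
    c, n = len(categories), len(rater_a)
    observed = [[0.0] * c for _ in range(c)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(c)]
    col = [sum(observed[i][j] for i in range(c)) for j in range(c)]
    disagree_obs = disagree_exp = 0.0
    for i in range(c):
        for j in range(c):
            weight = abs(i - j) / (c - 1)
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]
    if disagree_exp == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1 - disagree_obs / disagree_exp


def landis_koch(kappa):
    """Benchmarks from the text; the handling of boundary values is an assumption."""
    if kappa > 0.8:
        return "almost perfect"
    if kappa > 0.6:
        return "substantial"
    if kappa > 0.4:
        return "moderate"
    if kappa >= 0.2:
        return "fair"
    return "slight"


# Hypothetical graded scores of four participants from two testers
tester_1 = [0, 1, 2, 0]
tester_2 = [0, 2, 2, 1]
kappa = linear_weighted_kappa(tester_1, tester_2, categories=[0, 1, 2])
print(kappa, landis_koch(kappa))   # 0.5 moderate
```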
Raw MMT total scores were summarized as means and standard deviations, and parametric statistics were used because these scores approximated interval data. Reliability of the raw total score was determined by calculating the intraclass correlation coefficient (ICC2(A,1)), SEM and SDC.
Concordance between HHD and MMT8. Correlations between HHD and MMT8 were calculated with Spearman's rho. When scoring the MMT, raters might consider participant's body weight. Therefore, absolute force as well as normalized force (absolute force divided by body weight) of HHD were correlated with the MMT. A Spearman correlation coefficient greater than 0.9 was considered 'excellent', a coefficient between 0.7 and 0.9 'good' and one between 0.5 and 0.7 'moderate' [30]. Additionally, the associations between the two muscle strength assessments are depicted with boxplots with strength values of each muscle group displayed for the MMT grades.
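The concordance analysis can be sketched as a Spearman correlation computed as the Pearson correlation of average ranks, which handles the many tied MMT scores. All names and data below are hypothetical. The label "low" for coefficients below 0.5 follows the wording used in the Results and is not part of the cited classification; the unit of body weight is not specified in the text and kilograms are assumed.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def interpret_rho(rho):
    if rho > 0.9:
        return "excellent"
    if rho >= 0.7:
        return "good"
    if rho >= 0.5:
        return "moderate"
    return "low"


# Hypothetical data: MMT scores, HHD forces (N) and body weights (kg, assumed unit)
mmt = [10, 10, 8, 9, 7, 10]
force = [180.0, 150.0, 120.0, 160.0, 90.0, 140.0]
weight = [80.0, 60.0, 70.0, 75.0, 55.0, 65.0]
normalised = [f / w for f, w in zip(force, weight)]

print(interpret_rho(spearman_rho(mmt, force)))
print(interpret_rho(spearman_rho(mmt, normalised)))
```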

Results
The demographic and health related data of the 50 participants are summarized in Table 2.

Hand-held dynamometry
The muscle strength values (M1, M2, and M3) and the reliability parameters (ICC, SEM and SDC) for peak force are presented in Table 3. The mean peak forces ranged from 55 N (wrist extension) to 219 N (knee extension) and the standard deviations ranged from 25 N (neck flexion) to 92 N (knee extension). All strength measurement data were normally distributed and there was no significant difference between measurements 1 and 2 or between measurements 2 and 3 (t-test, p≥0.003; corrected for 16 comparisons). For all muscle groups, except for elbow flexion and knee extension, intrarater reliability of peak force (ICCs between 0.71 and 0.86) was higher than interrater reliability (ICCs between 0.45 and 0.9). For elbow flexion and knee extension, the ICCs for intrarater reliability were lower than those for interrater reliability (0.83 versus 0.9, and 0.82 versus 0.87, respectively). Six out of eight measured muscle groups showed excellent intrarater reliability. Hip abduction and neck flexion had fair to good intrarater reliability. Interrater reliability was excellent for three muscle groups (shoulder abduction, elbow flexion, and knee extension) and fair to good for the other five muscle groups (ankle extension, hip abduction, hip extension, wrist extension and neck flexion). Intra- and interrater reliability was excellent for the total score (0.92 and 0.94). The corresponding SEMs for single scores varied between 12 and 37 N and the SDCs% ranged from 40 to 70% for intrarater and from 33 to 78% for interrater reliability. The SEM for the total score was 12 N and the SDC 27% for intrarater reliability, and 10 N and 23% for interrater reliability.
The results and reliability parameters of averaged peak force are shown in Table 4. Intrarater and interrater ICCs for single muscle groups and for the total score were excellent (0.75-0.97), except for interrater reliability of ankle extension (0.61), which was fair to good. All SEMs (8-30 N) and SDCs% (23-65%) for single muscle groups and for the total score (SEM: 7-8 N, SDC%: 16-19) were smaller for averaged peak force than for peak force values. Bland-Altman plots between M1 and M2 (intrarater) and between M2 and M3 (interrater) are shown for peak force (Fig 3 and Fig 4). For all comparisons, most of the data were within two standard deviations in the Bland-Altman plots. The plots illustrated small, but non-systematic errors between test and retest. Limits of agreement were always greater for intrarater than for interrater reliability, and visual inspection showed no tendency towards heteroscedasticity.
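The quantities shown in a Bland-Altman plot can be computed with a short sketch like the one below. The data and function name are hypothetical. The conventional multiplier of 1.96 standard deviations is assumed for the limits of agreement, which corresponds to the "two standard deviations" mentioned above.

```python
import statistics


def bland_altman(first, second):
    """Return bias (mean difference) and lower/upper limits of agreement."""
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)   # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical peak forces (N) from two measurements of the same participants
m1 = [210.0, 150.0, 95.0, 180.0, 120.0]
m2 = [205.0, 160.0, 90.0, 170.0, 131.0]
bias, lower, upper = bland_altman(m1, m2)
print(round(bias, 1), round(lower, 1), round(upper, 1))
```

A bias close to zero with differences scattered evenly around it would correspond to the small, non-systematic errors described above.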

Manual muscle testing
The results of the raw MMT8 score (M1, M2, M3, intrarater Kappa, and interrater Kappa) are presented in Table 5 (single muscle groups) and Table 6 (total score) and those from the graded score in Table 7.
MMT scores were between 1 and 10 for the weakest muscle group (hip extension) and between 7 and 10 for the strongest muscle groups (knee extension and ankle extension). No differences between the measurements over time (M1, M2, M3) were seen (Wilcoxon, p≥0.003). All but one muscle group (hip extension) showed ceiling effects of 22 to 82% (Fig 5), with medians of the raw scores ranging from 8 to 10 points. The total raw score had no ceiling effect and varied from 46 to 80 with a mean of 70 points. The three muscle groups with the lowest scores were neck flexion, hip abduction, and hip extension, with moderate to severe weakness in 18%, 20%, and 26% of participants, respectively. Most of the participants had mild weakness (total graded score). Intrarater reliability of the single muscle groups (raw as well as graded scores) was substantial for shoulder abduction, elbow flexion, neck flexion, hip abduction, and hip extension (linear weighted Kappa varying from 0.61 to 0.69); moderate for wrist extension and knee extension (linear weighted Kappa varying from 0.49 to 0.53); and fair for ankle extension (linear weighted Kappa varying between 0.35 and 0.37). Interrater reliability (raw and graded scores) was moderate for neck flexion and hip abduction (linear weighted Kappa from 0.44 to 0.58); fair for shoulder abduction, wrist extension, and ankle extension (linear weighted Kappa varying from 0.20 to 0.35); and slight for knee extension (linear weighted Kappa of 0.08 and 0.18). Graded scores showed better interrater reliability than raw scores for elbow flexion (0.43 versus 0.3) and for hip extension (0.65 versus 0.59).
Intrarater and interrater reliability of the total weakness score were substantial (0.88) and moderate (0.42), respectively. Intrarater and interrater reliability of the raw total score were excellent (ICC > 0.9 for both measures), and SEM and SDC% were 1.8 points and 6.9% for intrarater reliability and 2.2 points and 8.6% for interrater reliability.

Concordance between MMT and HHD
Comparison of the two assessment methods showed low correlations for four muscle groups (wrist, knee, ankle and hip extension), moderate correlations for three muscle groups (hip abduction and elbow and neck flexion) and a good correlation for shoulder abduction between results obtained by the MMT8 and HHD, for both absolute force and force normalized to body weight (Table 8). Fig 6 illustrates that there was no consistent association between results from the MMT8 and the HHD across the different muscle groups. In elbow flexion, knee extension and neck flexion, the median strength value was higher for a higher MMT score. However, the distribution of strength values for each muscle group showed a large range with considerable overlaps in the interquartile ranges. For the other five muscle groups (shoulder abduction, ankle extension, hip abduction, hip extension, wrist extension), the median strength value did not progressively increase between consecutive MMT score categories. Notably, the median strength value was higher for grade seven than for grades eight and nine in hip abduction, hip extension and wrist extension.

Discussion
This study evaluated the intra- and interrater reliability of the MMT8 and HHD, and the concordance between these two measures, in a consecutively recruited convenience sample (n = 50) of people with myositis. In our sample, 76% of the participants were female. This gender distribution reflects the known higher prevalence of IM in females compared to males [31,32]. The results of this study revealed excellent (ICC>0.75) intra- and interrater reliability of the averaged peak force, except for the interrater reliability of ankle extension (ICC = 0.61). For peak force measurement, excellent ICCs were found for intrarater reliability for all muscle groups and the total score. Conversely, only three single muscle groups and the total score yielded excellent peak force interrater reliability scores. The SEMs and SDCs varied widely between single muscle groups. The SEMs% of the individual muscle groups ranged from 8 to 25% and the SDCs% from 23 to 78%. The SEMs% for the total score varied between 6 and 10% and the SDCs% between 16 and 27%. For the MMT8, the total score showed excellent intra- and interrater reliability (ICC>0.9), whereas the single muscle groups revealed Kappa values of 0.35-0.69 for intrarater reliability and of 0.08-0.58 for interrater reliability. In addition, considerable ceiling effects (22-82%) were found.

Hand-held dynamometry
Our findings are in accordance with the findings from Stoll et al., who also reported excellent intra- and interrater reliability (ICCs intrarater: 0.88-0.98, ICCs interrater: 0.81-0.98) in seven people with myositis [17]. These results are only partially comparable, because different muscle groups were assessed. Neck flexion, shoulder elevation, elbow flexion and extension, hip flexion, and knee flexion and extension were evaluated by Stoll et al., while the muscle groups in our study were the same as those measured in the MMT8. Furthermore, no data about absolute reliability (measurement error) were reported by Stoll et al. [17]. Thus, it is not yet possible to compare the measurement errors of both studies and we cannot conclusively determine which measurement protocol is best suited to measure change in a patient's strength values. Whether or not a measurement error is acceptable depends on the amount of improvement or deterioration that one wants to detect [33]. The observed change in muscle strength must, therefore, be larger than the SDC to ensure a real change in muscle strength. As an improvement in muscle strength of ≥15% is defined as clinically relevant [18], an estimated SDC of 15% may be acceptable. The observed SDCs in our study were considerably higher (between 29 and 65%) than the recommended 15%. However, improvements in muscle strength varying between 38 and 62% are common [34,35]; therefore, dynamometry is capable of capturing these improvements. These considerable improvements may be explained by the training principle of initial values, i.e. people with the lowest level of fitness have the greatest room for improvement [36].
As intrarater reliability is superior to interrater reliability, we recommend that measurements be performed by the same tester, a recommendation of particular importance when considering measurement error. Furthermore, reliability might be improved by using the average value of multiple measurements at each time point, instead of the peak force values [37]. We could confirm that ICCs and measurement errors were better for the averaged value of two performed measurements than for the maximum value. In clinical practice and research trials, even three to four measurements have been performed [9,38]. A well-known problem of hand-held dynamometry is that the testers are often too weak to provide counterbalance when testing certain lower extremity muscles [39]. Stone et al. hypothesized that reliability was compromised by inadequate tester strength even in frail populations [40]. We tried to overcome this limitation by using a belt to stabilize the dynamometer or the examiner where this seemed necessary. When measuring knee extension, the dynamometer was always fixed with a belt (Fig 7). When measuring hip abduction and extension in strong participants, the examiners stabilized themselves with a belt (Fig 8). Although measurement of knee extension could not be limited by the strength of the examiner, the reliability parameters were not superior for these measures compared to the other muscle groups. If the examiners' strength were too low to assess actual strength, we would anticipate detection of a ceiling effect. As our data did not show any ceiling effect, we concluded that the force of the examiners was not a limiting factor.

Manual muscle testing
Compared to other IM trials, our participants showed relatively little muscle weakness. The median score of MMT8 in our sample was 10 to 50% higher than the score reported by Harris-Love et al. [19], and our total score exceeded the score reported by Rider et al. (87.5% versus 76.5%) [4]. We could confirm or even exceed known ceiling effects [5]. In seven out of eight muscle groups, more than 20% of the included participants achieved the highest score, which theoretically implies that these participants had no muscle weakness at the time of measurement. Conversely, Anderson et al. demonstrated that a substantial number of participants (28-41%) classified with 'normal' MMT values had muscle weakness following evaluation with isokinetic dynamometry. Therefore, the MMT cannot differentiate mild muscle weakness from normal muscle strength [41]. This finding was confirmed by Bohannon et al., who examined participants from four different studies with a manual muscle grade of 5 (grade 5 of the Medical Research Council Scale equals 10 on the Kendall scale) and revealed that the highest grade encompassed a broad range of forces between 85 N and 650 N. They concluded that MMT may lack the sensitivity to properly assess relatively strong muscle groups [42]. Whereas intrarater reliability of five single muscle groups was substantial, interrater reliability was only slight to moderate. One study that evaluated reliability in adult people with myositis reported higher interrater reliability. The authors identified excellent interrater reliability for shoulder abduction, elbow flexion, knee extension and hip abduction, fair to good interrater reliability for hip extension, neck flexion and wrist extension, and poor interrater reliability for ankle extension [43]. This study included seven participants and used ICCs to calculate reliability, although MMT scores are ordinally scaled. Therefore, these results should be interpreted with caution. The results of our study were partially in line with one report in which juvenile people with myositis (n = 10) were tested for intra- and interrater reliability. The intrarater reliability was also higher (Spearman's rank correlation coefficient: 0.8) than interrater reliability (Kendall's W: 0.72). In contrast to our study, the study of Rider et al. revealed acceptable interrater reliability [4]. Despite a detailed test protocol, standardised test environment, defined test order, and experienced and trained examiners, we could not reach satisfactory interrater reliability for single muscle groups.
Nevertheless, intrarater as well as interrater reliability of the total score was excellent. These findings were supported by one report evaluating reliability in children with juvenile DM. The authors emphasized that it is important to use MMT summary scores, because the interrater reliability varies between individual muscle groups [44].
Absolute reliability could only be calculated for the total score. SDC and SDC per cent were lower for intrarater reliability (4.9 points and 7%, respectively) than for interrater reliability (6.2 points and 9%, respectively). A consortium of rheumatologists and neurologists has reached consensus that the MMT8 should improve by ≥15% to classify adult people with PM/DM as improved [18]. According to our calculations, the MMT8 total score is capable of capturing such improvements.
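As a worked check of these percentages, and assuming that the SDC per cent is expressed relative to the mean raw total score of 70 points reported in the Results (the text does not state the denominator explicitly), the calculation is:

```latex
\mathrm{SDC}\%_{\text{intrarater}} = \frac{4.9}{70} \times 100 \approx 7\%, \qquad
\mathrm{SDC}\%_{\text{interrater}} = \frac{6.2}{70} \times 100 \approx 9\%
```

Both values lie below the 15% improvement threshold, which is the basis for the statement that the total score can capture such improvements.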

Concordance between HHD and MMT8
Although the HHD and the MMT8 were both supposed to measure maximum isometric muscle strength, the correlations for the majority of single muscle groups and the total score were only moderate or even lower. Additionally, graphical presentation of the data showed a variable relationship between MMT and HHD. If MMT and HHD measured the same construct of isometric muscle strength, we would expect an increase in MMT scores to correspond with an increase in the median peak force of HHD, and the interquartile ranges of adjacent MMT scores not to overlap. In our data, only three muscle groups showed a consistent increase in peak force with MMT scores, and interquartile ranges overlapped in all muscle groups. We could therefore confirm the variable relationship found by Noreau et al. for upper extremities [15]. In contrast to our results, previous studies reported good correlations (>0.7) between manual muscle testing and HHD for knee extension [13,14]. There are several possible explanations for these low correlations. First, high ceiling effects could be responsible for the low correlations. With the MMT8, no differences were seen for a considerable number of participants (22-82%), whereas HHD gives different values for these participants. It seems to be difficult to detect and grade mild symmetrical muscle weakness with the MMT, partly because the examiner must consider the normal variation in strength in relation to age, weight, height, and gender [41]. Second, variations in the weight of the participant's extremities, the force applied by the examiner, and the strength of the examiner could affect the subjective scoring of the MMT8. Third, the participant's test position is different for MMT8 and HHD. While a gravity-neutralized position is needed for the HHD, the MMT8 test position varies depending on the degree of weakness (from movement in the horizontal plane to an antigravity position). For grades 5 and higher, participants have to hold the extremity against gravity and then the tester has to add pressure. The force needed to hold the extremity against gravity is not considered in scoring the MMT. Taken together, our results indicate that MMT does not measure the same parameter as HHD. Previous studies revealed that HHD is an appropriate method to assess isometric muscle strength compared with the gold standard, isokinetic testing. Therefore, we conclude that MMT8 is an inadequate method to assess isometric muscle strength of individual muscle groups.

Limitations and future research
This study had several limitations. First, a heterogeneous sample of people with myositis was included. Our participants suffered from different kinds of myositis in different disease stages (acute, sub-acute and chronic). Because the sample size was inadequate for a reliable subgroup analysis, we could not evaluate more homogeneous subsamples. Second, we did not record the medications of our participants. Third, the measurements of this study were performed by two female examiners with several years of clinical experience and training in muscle strength assessment. Including more examiners in the reliability study would improve the external validity of the results. Since strength assessment is exhausting for people with myositis, we decided not to include more than two examiners in our study. Fourth, as no generally valid test protocol for HHD exists, we developed our own measurement protocol, which hampers comparison with other study results. Fifth, whilst MMT8 scores can be interpreted (severe, moderate, mild, no weakness), this is not possible with the HHD. However, individual strength values could be compared with normal reference values. Different authors published such reference values for shoulder abduction [20-22,45-47], elbow flexion [20-22,45-48], ankle extension [20-22,45-48], hip abduction [20-22,45,47,48], hip extension [47], knee extension [20-22,45-48], wrist extension [20-22,45-47], and neck flexion [22,46,47]. Because the published reference values were captured with different devices, in different test positions and with different placement of the devices, a direct comparison may not be adequate. None of these previous studies used the same device as we did and, to the best of our knowledge, no reference values exist for this device. Bohannon et al. emphasized that dynamometers should not be used interchangeably, because the magnitude of the force measured with two different devices differed significantly although they demonstrated good to high reliability and correlations [49]. Therefore, it is not possible to determine conclusively whether a muscle group is weakened or not. Last, we did not include a gold standard for strength measurement.
To overcome these limitations, future research should compile gender- and age-specific reference values for key muscles in people with myositis. For this purpose, the use of a generally accepted standardised protocol is important. These reference values may help to judge strength values of people with myositis. Furthermore, the validity of these muscle tests needs further investigation.

Conclusion
The fact that the correlation between HHD and MMT8 is not satisfactory raises doubt as to whether the MMT8 measures the same construct (isometric strength) as HHD. The MMT8 total score is a reliable and time-efficient assessment of general muscle weakness in people with myositis. However, since only the total score of the MMT8 showed good reliability parameters, the MMT8 should not be used to evaluate changes (either improvement or deterioration) in single muscle groups of people with myositis. In contrast, HHD can be recommended to evaluate isometric muscle strength of single muscle groups in people with myositis if the following important aspects are considered: examiners are experienced and trained in muscle testing, a standardised protocol is followed, a belt is used to stabilize the examiner or the device, and the average of at least two measures is used.