Quantifying skeletal muscle volume and shape in humans using MRI: A systematic review of validity and reliability

Aims The aim of this study was to report the metrological qualities of techniques currently used to quantify skeletal muscle volume and 3D shape in healthy and pathological muscles. Methods A systematic review was conducted (Prospero CRD42018082708). PubMed, Web of Science, Cochrane and Scopus databases were searched using relevant keywords and inclusion/exclusion criteria. The quality of the articles was evaluated using a customized scale. Results Thirty articles were included, 6 of which included pathological muscles. Most evaluated lower limb muscles. Partially or completely automatic and manual techniques were assessed in 10 and 24 articles, respectively. Manual slice-by-slice segmentation reliability was good to excellent (n = 8 articles) and validity against dissection was moderate to good (n = 1). Manual slice-by-slice segmentation was used as a gold-standard method in the other articles. Reduction of the number of manually segmented slices (n = 6) provided good to excellent validity if a sufficient number of appropriate slices was chosen. Segmentation on one slice (n = 11) increased volume errors. The Deformation of a Parametric Specific Object (DPSO) method (n = 5) decreased the number of manually segmented slices required for any chosen level of error. Other automatic techniques combined with different statistical shape or atlas/image-based methods (n = 4) had good validity. Some particularities were highlighted for specific muscles. Except for manual slice-by-slice segmentation, reliability has rarely been reported. Conclusions The results of this systematic review inform the choice of appropriate segmentation techniques, according to the purpose of the measurement. In healthy populations, techniques that greatly simplified the process of manual segmentation yielded greater errors in volume and shape estimations. Reduction of the number of manually segmented slices was possible with appropriately chosen segmented slices or with DPSO. Other automatic techniques showed promise, but data were insufficient for their validation. More data on the metrological quality of techniques used in the cases of muscle pathology are required.



Introduction
The volume and shape of a muscle are strongly related to its function [1][2][3][4]. Structural differences between muscles, which result from different muscle fibre architecture, are good predictors of force generation capacity [1]. Physiological cross-sectional area is the major determinant of joint torque [1]. Muscle volume, which is closely related to physiological cross sectional area, was shown to be strongly connected with joint torque in both healthy and pathological populations [2][3][4][5]. Changes in muscle volumes and shapes may be normal, such as hypertrophy after a strengthening program, or atrophy associated with ageing [6,7]. Changes can also be pathological due to neuromuscular disease or injury [5,8,9].
Assessment of muscle volume and shape is essential for both clinical practice and research. Measurement of muscle volume facilitates surveillance of neuromuscular disease progression [10,11] and the effects of treatments [12,13], as well as being useful for diagnostic purposes [14,15]. Muscle shapes can be used to distinguish between pathologies [16,17], and modelling individual muscles can be useful when planning surgery [18], evaluating changes over time [6,19] and improving the understanding of particular symptoms or diseases [16,17,20-22].
Magnetic resonance imaging (MRI) is the gold-standard technique for the evaluation of muscle volumes and three-dimensional (3D) shapes, and is used as a reference to validate other imaging techniques for this purpose [23,24]. Many manual and automatic segmentation techniques have been developed for the estimation of muscle volumes and 3D shapes from MRI data [25][26][27][28][29]. However, despite the widespread use of these measurements in both clinical practice and research, to date neither their metrological qualities nor their feasibility for use in routine practice have been specifically reviewed.
Knowledge of the validity and reliability of measurement methods is essential when choosing a technique in order to ensure an accurate interpretation of the results [30,31]. Validity is the degree to which a technique measures what it is intended to measure, and the extent to which the values obtained are similar to the true values. Reliability is the extent to which a technique yields the same results over repeated trials in stable study subjects [31,32]. Techniques that are easy to use may lack validity or reliability whereas techniques that are valid and reliable are not always feasible for use in a research or clinical setting if they are too time-consuming. It may thus be necessary to compromise between (I) the metrological accuracy required and (II) practical considerations of usage.
The main aim of this systematic review was to report the validity and reliability of techniques used to estimate skeletal muscle volumes and 3D muscle shapes based on MRI data in healthy and pathological muscles in humans. The secondary aims were to determine the feasibility of those techniques and to provide recommendations for future research.

Quality assessment of selected studies
Since no standardized tools exist to determine the quality of articles in the field of radiology, a customized quality assessment scale was developed from other scales in the literature [36,37]. The aim of the scale was to assess both the intrinsic quality of each article (maximum score 30) and the metrological qualities of the method evaluated (maximum score 11). The total score was named the Q score and was out of 100. The first (quality) part of the scale was based on previously published quality checklists for systematic reviews as well as scales for the assessment of the quality of studies included in systematic reviews. Those scales included questions relating to study design and the quality of the reporting of methodologies and results [38][39][40], for example "were the aims clearly stated?" or "was the description of patient recruitment clear?" (S2 Table). The second (metrological) part of the scale was based on published scales that were specifically designed for the evaluation of metrological studies in fields other than radiology [31,36,37,41,42]. It included questions such as "was concurrent validity evaluated?" or "was the gold standard measure described?". The grades for the questions ranged from 0 to 2. This scale was only used for the purposes of the present study. The quality rating was carried out independently by two examiners (CP and BB) and disagreements were resolved by consensus.

Data extraction and analysis
Information regarding the samples included, muscles evaluated, magnetic field strengths and MRI protocols used was collected from each article. The technique evaluated, the reference technique used, operators and outcome measures (validity, reliability and feasibility) were also recorded (Table 1 and S3 Table). In this paper, validity refers to the concept of concurrent validity [31] and reliability refers to the correlations between different measurements within the same stable subject, as well as the measurement error [30,43]. To assess the validity and reliability of the results reported in each article, the following limits were applied. For the standard error of the estimate (SEE) and root mean square error (RMSE), values > 10% were rated poor, 5-10% moderate, 1-5% good and < 1% excellent; the same limits were used for the coefficient of variation. For mean differences, results > 5% were rated poor, 2-5% moderate, 1-2% good and < 1% excellent. For mean distances, results > 6 mm were rated poor, 3-6 mm moderate, 1-3 mm good and < 1 mm excellent. For intraclass correlation coefficients (ICC) and r² values, 0-0.49 was rated poor, 0.5-0.69 moderate, 0.7-0.89 good and > 0.9 excellent [44]; the same limits were used for the Dice similarity index (DSI). The DSI is twice the size of the overlap of the two segmentations divided by the sum of the sizes of the two objects. If different statistical analyses were available in the same study, the worst results were primarily used for the classification. Although we acknowledge that there is no reference or reported recommendation for this categorization, it was used to provide clarity and to standardize the hierarchy of the results reported in the selected articles. The results for validity and reliability were also extracted as they were reported in each original article (S4 Table).
When similar evaluations were carried out, for example a bilateral psoas evaluation in a healthy subject using the same technique for each side [45], only the poorest values of validity or reliability were reported. Technique feasibility was determined as the time required for manual segmentation to be carried out or from the time needed to run automatic techniques.
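The categorization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by the authors: all function and variable names are hypothetical, values that fall exactly on a limit (for example 5%) are assigned to the worse category, an ICC, r² or DSI of exactly 0.9 is treated as excellent, and segmentations are represented as sets of voxel indices.

```python
# Illustrative sketch only: names are hypothetical and the handling of
# boundary values is an assumption, since the stated ranges share limits.

# Limits for "lower is better" metrics: (good_from, moderate_from, poor_above)
ERROR_LIMITS = {
    "see_rmse_cv_percent": (1.0, 5.0, 10.0),      # SEE, RMSE, coefficient of variation
    "mean_difference_percent": (1.0, 2.0, 5.0),   # mean differences
    "mean_distance_mm": (1.0, 3.0, 6.0),          # mean distances
}

GRADE_ORDER = ["poor", "moderate", "good", "excellent"]


def grade_error(value, good_from, moderate_from, poor_above):
    """Grade an error-type metric; the sign is ignored, smaller is better."""
    magnitude = abs(value)
    if magnitude > poor_above:
        return "poor"
    if magnitude >= moderate_from:   # assumption: a shared limit goes to the worse grade
        return "moderate"
    if magnitude >= good_from:
        return "good"
    return "excellent"


def grade_agreement(value):
    """Grade ICC, r squared or DSI; larger is better."""
    if value < 0.5:
        return "poor"
    if value < 0.7:
        return "moderate"
    if value < 0.9:
        return "good"
    return "excellent"               # assumption: exactly 0.9 counts as excellent


def dice_similarity(voxels_a, voxels_b):
    """DSI = 2 * |A and B| / (|A| + |B|) for two sets of voxel indices."""
    total = len(voxels_a) + len(voxels_b)
    if total == 0:
        return 1.0                   # assumption: two empty segmentations agree perfectly
    return 2.0 * len(voxels_a & voxels_b) / total


def worst_grade(grades):
    """Return the poorest grade, mirroring the 'worst result' rule."""
    return min(grades, key=GRADE_ORDER.index)


if __name__ == "__main__":
    rmse_grade = grade_error(7.2, *ERROR_LIMITS["see_rmse_cv_percent"])
    icc_grade = grade_agreement(0.93)
    print(rmse_grade, icc_grade, worst_grade([rmse_grade, icc_grade]))
```

When a study reports several statistics for the same comparison, `worst_grade` reproduces the rule that the poorest result determines the classification.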

Selection process
The literature search identified 2160 citations in PubMed, 324 citations in the Cochrane Library, 3911 citations in Scopus and 2302 citations in Web of Science. After removing duplicates, 4631 remained. After screening titles and abstracts, 86 articles were found to be potentially eligible. Finally, 30 met the inclusion criteria and were included (Fig 1).
CSA segmentation on a single slice without muscle length. Estimation of muscle volume using CSA segmentation on a single slice without muscle length was evaluated in three articles (Range Q score: 68-72; mean Q score: 69.3 [9,26,56]). Specific slices were chosen, either the one with the largest CSA [26] or those taken at specific locations [9,56]. Validity, evaluated against manual slice-by-slice segmentation as the reference, was poor to moderate (n = 3 [9,26,56]). Poor results were found for the supraspinatus and subscapularis muscles [56]. Intra- and inter-rater reliability were good (n = 1 [56]).

Automatic segmentation techniques (Tables 1 and 2, S3, S4 and S5 Tables)
Deformation of a parametric specific object method with manual segmentation. Estimation of muscle volume and/or 3D shape using the deformation of a parametric specific object (DPSO) method with manual segmentation was evaluated in five articles (Range Q score: 46-71; mean Q score: 62.2 [27,53,59,64,68]). This technique involves manual contouring on a reduced set of images, followed by a parametric shape-based interpolation combined with a kriging technique in order to obtain a surface model without using the intermediate slices [68,69]. Validity was moderate to good compared to slice-by-slice manual segmentation (n = 4 [27,59,64,68]). Reducing the number of slices increased the error (n = 1 [53]). Reliability was poor to good depending on the muscle (n = 2 [59,64]). The number of manually segmented slices required to obtain a pre-determined error was specific to each muscle. A larger number of slices was necessary for gluteus minimus, gluteus medius, obliquus and iliacus.
Other automatic segmentation techniques. Four other methods to estimate 3D muscle shapes were evaluated: semi-automated and automated atlas-based segmentation [48], image-based and shape-based segmentation [29], atlas-based and statistical shape-based segmentation [67], and interactive segmentation using shape priors and statistical shape modelling [65] (Range Q scores: 49-73; mean Q score: 55.5). Andrews et al. used a probabilistic shape representation called generalized log-ratio representation that included adjacency information along with a rotationally invariant random forest boundary detector to automatically segment thigh muscles [65]. Kim et al. used an active contour segmentation method with a level sets approach to automatically extract the supraspinatus muscle from an MR image [29]. Engstrom et al. used a statistical shape model (SSM) to automatically segment quadratus lumborum [67]. During the fitting process, the deformable SSM was constrained using probabilistic MR atlases. Le Trotter et al. used a multi-atlas based automatic segmentation method to quantify the volume of the quadriceps femoris muscle group [48]. Validity against slice-by-slice manual segmentation was moderate to excellent and most of the results showed good validity. No studies of reliability were found.
The duration of segmentation was evaluated in eight studies [25,26,46,56,59,[64][65][66]]. Use of a reduced number of slices to obtain muscle volume divided segmentation time by 4, whereas use of only one or two slices divided segmentation time by 26 and 15, respectively [56]. Use of the DPSO method to evaluate 3D shape halved the time taken in one article [59] and divided it by 12 in another [27]. Using automatic segmentation methods, one article reported that the time-to-run, without human interaction, was about 50 minutes per image [65]. No other studies evaluated feasibility.

Discussion
This review included 30 articles which primarily focused on segmentation techniques. It has reported currently available evidence for the metrological qualities of manual and automatic segmentation techniques that estimate muscle volume and shape, and the feasibility of their use in a clinical or research setting. The majority of studies reviewed included healthy subjects, evaluated lower limb muscles and used slice-by-slice manual segmentation as the gold-standard reference. Greater errors in volume and shape estimation were found to be produced by methods that simplified and shortened the manual segmentation process. Sufficient evidence was available to support the validity of the DPSO technique. A lack of robust studies meant that other automatic segmentation techniques could not be validated but the evidence currently available was considered to be encouraging and further work on these methods is indicated. Some particularities for specific muscles and segmentation techniques were highlighted.

Metrological qualities of manual and automatic techniques
Manual segmentation techniques. Slice-by-slice manual segmentation was the most evaluated technique but its validity was only evaluated in one study (on rotator cuff muscles). As slice-by-slice manual segmentation is widely used as a reference method, further studies are warranted to confirm its validity. With regard to reliability, results varied among muscles. The use of different volume calculation methods did not seem to change the errors, indicating that errors found between measurements were likely related to segmentation. The quality of the results was lower for deep muscles such as gluteus minimus and for muscles whose boundaries are unclear, such as the individual muscles of the quadriceps. Identifying their external borders appears challenging. To limit these segmentation errors, we believe that it is essential that standardized procedures using clear anatomical landmarks per muscle are developed and implemented [46]. Although few studies evaluated image acquisition methods, they appear to be key for the limitation of segmentation errors [70]. Regarding the studies that compared data from subjects with healthy or pathological muscles, the weaker reliability for pathological muscles could be attributed to shape changes and boundaries that are more difficult to identify [65]. Slice-by-slice manual segmentation is also time-consuming, hence it cannot easily be used in clinical practice.
Techniques based on the manual segmentation of a reduced number of slices reached good to excellent validity when a sufficient number of slices was segmented. The appropriate number of slices varied among muscles. For most, fewer than half of the total number of slices need to be manually segmented, with slice thicknesses of 10 mm and interslice distances of 5 mm, allowing shorter processing time whilst maintaining an almost equivalent level of performance compared to slice-by-slice segmentation. Results can be further improved by the choice of appropriate slices to segment [55]. Errors in volume estimation can, however, occur when the number of segmented slices is reduced [26,27,47,49,55]. We were unable to determine any general rules based on muscle shape or size; thus, further studies are required to assess these methods in muscles that were not evaluated in this systematic review, especially upper limb and trunk muscles. Lastly, important differences between volume calculation methods were also highlighted. For example, the cone method was inappropriate for fusiform muscles [27,47].
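To make the trade-off concrete, the sketch below estimates a volume from a stack of cross-sectional areas with two common geometric formulas and shows how a reduced set of slices can be compared with the full set. It is a minimal illustration under stated assumptions: the function names are hypothetical, slices are assumed to be equally spaced, and the truncated-cone formula used here is a standard geometric formula that may not match the exact "cone method" implemented in the cited studies.

```python
import math

# Illustrative sketch only: names are hypothetical, slices are assumed to be
# equally spaced, and CSAs are assumed to be in cm^2 with spacing in cm.


def volume_slab_sum(csas_cm2, spacing_cm):
    """Sum of CSA x slice spacing (each slice treated as a slab)."""
    return sum(csas_cm2) * spacing_cm


def volume_truncated_cone(csas_cm2, spacing_cm):
    """Truncated-cone formula applied between each pair of adjacent slices:
    V = h / 3 * (A1 + sqrt(A1 * A2) + A2)."""
    total = 0.0
    for a1, a2 in zip(csas_cm2, csas_cm2[1:]):
        total += spacing_cm / 3.0 * (a1 + math.sqrt(a1 * a2) + a2)
    return total


def volume_from_reduced_slices(csas_cm2, spacing_cm, step):
    """Keep every `step`-th slice and widen the spacing accordingly.
    Slices after the last kept one are ignored, which is one source of error."""
    kept = csas_cm2[::step]
    return volume_truncated_cone(kept, spacing_cm * step)


def percent_error(estimate, reference):
    """Signed error of an estimate relative to a reference volume, in %."""
    return 100.0 * (estimate - reference) / reference


if __name__ == "__main__":
    # Made-up CSAs for a roughly fusiform muscle, one slice per cm.
    csas = [0.0, 2.0, 5.0, 8.0, 9.0, 8.0, 5.0, 2.0, 0.0]
    reference = volume_truncated_cone(csas, 1.0)
    for step in (2, 4):
        reduced = volume_from_reduced_slices(csas, 1.0, step)
        print(step, round(percent_error(reduced, reference), 1))
```

The example data are invented for illustration; with real data, the reference volume would come from slice-by-slice manual segmentation.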
Use of even faster techniques, such as the segmentation of a single slice with or without muscle length, could be associated with a loss of precision. Because they are quick to perform, these techniques can be used in clinical practice if the aim is, for example, to estimate the degree of muscle loss in diseases that cause severe atrophy, where differences of more than 10% in volume would normally be expected. Special attention must, however, be paid when using these methods for non-fusiform muscles. Although the guidelines used for the choice of each slice were detailed for each technique, there were few reliability evaluations. It has been previously reported that the optimal location of measurements can be difficult to both define and reproduce [61]; thus, there is a potential for errors to occur from manual CSA segmentation. Further studies are warranted to evaluate reliability.
Automatic segmentation techniques. The DPSO method, which involves automatic segmentation of intermediate slices, had good validity if enough slices were manually segmented. For non-fusiform and small muscles, a greater number of slices has to be manually segmented to maintain good accuracy. If this method is found to be reliable, it could be used in association with manual techniques to reduce the number of manually segmented slices and help save time. Further studies are warranted to determine which approach is faster and more accurate: manual segmentation of a reduced number of slices with different volume estimation methods, or manual segmentation with DPSO [27]. The results could differ depending on the muscles, because of their specific shapes and locations.
The validity of the other four partially or completely automatic techniques analysed (semi-automated and automated atlas-based segmentation, image-based and shape-based segmentation, atlas-based and statistical shape-based segmentation) could not be confirmed in this review due to the small number of low-quality studies currently available. However, it is important to note that the results were encouraging and that these techniques appeared to be promising in terms of validity. Additional high-quality metrological studies are thus needed to validate them. Each technique had its own characteristics: segmentation using the generalized log-ratio representation transformation can impose soft constraints, whereas deformable statistical shape models and atlas-based segmentations use hard constraints. However, unlike the other techniques, the generalized log-ratio representation method cannot effectively delineate pose variability and thus requires image pre-processing as an additional step. Thus, some techniques may be more appropriate than others depending on the muscles and their properties and on the characteristics of the population (children, persons with muscle pathology, etc.). Other findings indicated that techniques such as random-walk segmentation [71,72], wavelet-based segmentation [73] or deep learning-based segmentation [74] should be investigated further to determine if they could provide rapid, accurate, valid and reliable measurements of muscle volume and shape for use in routine clinical practice.

Pathological muscles
Methods to estimate skeletal muscle volumes and/or 3D muscle shapes using MRI data are used clinically for diagnosis [14], to evaluate the effects of treatment [12], and as an aid to preoperative planning [18]. In the case of muscle pathologies, changes in muscle shape and signal occur because of muscle degeneration, and fatty and fibrous infiltration can render identification of muscle boundaries on MRI difficult [16,17]. Modification of the anatomical landmarks used for CSA segmentation, of techniques that are based on shape factors, and of volume estimation methods may therefore be required. This is, however, currently unknown due to a lack of studies that have evaluated pathological muscles. This finding suggests that specific metrological studies are required depending on the pathology being investigated in order to avoid measurement errors, and that caution must be applied when extrapolating the results of techniques used in healthy muscles to those with pathologies.

Image acquisition
The MRI protocol used to acquire images can have a large impact on segmentation outcomes [70]. The studies included in this review mostly used T1-weighted sequences, suggesting that these anatomical sequences are appropriate for segmentation because of their ability to provide good quality images of the muscles, to distinguish the margins between them and to contrast bone with muscle [9,27,29,64]. However, other sequences could also be used, and differences in metrological properties between sequences were shown in one article [59]. No other included studies compared different sequences. Thus, further data on the validity of the different sequences are needed [59]. Regarding the issue of 2D or 3D acquisition, of the seven articles which used 3D sequences, none showed that 3D sequences yielded better results than 2D sequences. Most of them evaluated manual segmentation techniques. Since 3D sequences take longer to acquire, have lower contrast and are more sensitive to susceptibility and B0 inhomogeneities [75], there was no evidence to recommend 3D acquisition for manual segmentation. Continuous slice acquisition, allowing muscle tracking, might be an interesting method [55]. The size of the muscle should be considered when determining the resolution, in order to avoid partial volume artefacts [49,66]. A greater resolution is needed for small muscles. We suggest the use of a T1 sequence, 2D acquisition with continuous slices between 1 and 10 mm thick, oriented orthogonally to the long axis of the muscles, with a resolution that avoids partial volume effects. However, the paucity of data in the articles included in the systematic review does not allow strong recommendations to be made.
Lastly, no data are currently available to show the effect of MRI scanner and coil type on data acquisition and the quality of metrological parameters, despite the fact that all of these elements could impact on the accuracy and reliability of the muscle volume and shape [54,65,76,77]. Further studies are therefore warranted to clarify these issues.
The feasibility of MRI can be limited by the availability of MRI scanners and the cost of MRI devices and assessments. Thus, other techniques, such as ultrasonography, could be of interest for estimating skeletal muscle volumes and 3D muscle shapes [78].

Improving future metrological study methodology
We believe future work should include evaluation of test-retest reliability since we found only two articles that assessed this [48,62]. Test-retest reliability refers to the extent to which the rating of one sample of individuals by one observer on two or more separate occasions using the same test yields similar results, with all test conditions remaining as constant as possible [31]. This is of high importance because factors such as patient positioning could impact on the accuracy and reliability of the muscle volume and shape as determined using MRI [54,65,76,77]. The second evaluation of great importance in future work is responsiveness. Responsiveness refers to the ability of a measure to detect changes [32], and is also a very important quality for the evaluation of neuromuscular disease progression [10,11] and the effects of treatment [12,50]. We were unable to report on the responsiveness of techniques in the present review as it was only evaluated in two articles.
Furthermore, precise reporting of the statistical analysis method employed is essential for metrological studies. As a result of the work undertaken in this review, we recommend that the following evaluations are included as standard in future studies of measurement techniques, in addition to the usual analyses of correlation, to improve internal validity [30,79]. The first evaluation we recommend is measurement error. In order to demonstrate the reliability of a technique, the standard error of measurement, including limits of agreement or smallest detectable change [30], should be known, as they indicate whether the observed difference is due to a true change in muscle volume or size, or if it is simply a measurement error.
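The measurement-error statistics mentioned above can be computed from repeated measurements with a few standard formulas. The sketch below is a hedged illustration rather than a prescribed analysis: the function names are hypothetical, and it assumes two repeated volume measurements per subject, a 95% level (multiplier 1.96), SEM = SD × √(1 − ICC) or SD of the differences / √2, and SDC = 1.96 × √2 × SEM.

```python
import math
import statistics

# Illustrative sketch only: names are hypothetical; formulas are the commonly
# used definitions at a 95% level and assume two measurements per subject.


def limits_of_agreement(first, second):
    """Bland-Altman 95% limits: mean difference +/- 1.96 x SD of differences."""
    diffs = [b - a for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


def sem_from_differences(first, second):
    """Standard error of measurement from test-retest differences."""
    diffs = [b - a for a, b in zip(first, second)]
    return statistics.stdev(diffs) / math.sqrt(2)


def sem_from_icc(sd, icc):
    """Standard error of measurement from a sample SD and an ICC."""
    return sd * math.sqrt(1.0 - icc)


def smallest_detectable_change(sem):
    """Change that exceeds measurement error with 95% confidence."""
    return 1.96 * math.sqrt(2) * sem


def is_real_change(observed_change, sdc):
    """True if the observed change is larger than the SDC."""
    return abs(observed_change) > sdc


if __name__ == "__main__":
    # Made-up test and retest muscle volumes (cm^3) for four subjects.
    test = [100.0, 110.0, 120.0, 130.0]
    retest = [102.0, 109.0, 123.0, 130.0]
    sem = sem_from_differences(test, retest)
    sdc = smallest_detectable_change(sem)
    print(limits_of_agreement(test, retest), sem, sdc, is_real_change(8.0, sdc))
```

Reporting the SDC alongside correlation coefficients lets a reader judge whether a change observed after treatment is larger than the measurement error of the segmentation technique.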

Limitations
When considering our findings and recommendations, it is important to note that the strength of any conclusions depends on the quality of the original articles [43]. The articles were rated as moderate to good quality; however, only two included statistical power calculations, limiting the conclusions that can be drawn from the results. This aspect of study design should be included in all future studies on this topic. A second limitation of this study is the heterogeneity of the source material included. In particular, the different MRI parameters used in the studies and the different muscles evaluated prevented pooled analysis from being carried out and complicated the synthesis of the results. Regarding MRI, even when the same sequences were used, the parameters remained heterogeneous since they were device-dependent. Regarding muscles, some muscles have been the focus of many studies, whilst others have been neglected. Clinicians and researchers should bear this in mind when using a technique that has not been previously evaluated for the muscle in question. The results of this study are therefore only relevant for the methods of estimation of muscle volume and shape evaluated by the studies included, and must be generalised with caution to other methods and other muscles. Finally, the statistical methods employed by the different studies also varied considerably, which, in turn, prevented more definite conclusions from being drawn in this review. The different statistical methods used to report concurrent validity (including r², ICC, Dice similarity index, Tanimoto coefficient, mean differences, SD, SEE, RMSE and point-to-surface distance) and reliability (such as ICC, mean differences, RMSE, coefficient of variation and standard deviation) limited the synthesis of the data with a quantitative pooled analysis.
Future work should aim to overcome such diversity as far as possible, both to strengthen results and to improve the generalisability of findings across different methods.

Conclusion
The results of this systematic review provide a rationale for the choice of appropriate segmentation techniques depending on the muscle, the need for precision and the available time. Such uses could include diagnosis of a disease, evaluation of a treatment response, monitoring of disease progression or measurement for research purposes. Further research is required to confirm the validity of manual slice-by-slice segmentation and automatic techniques, except for DPSO for which there is sufficiently strong supporting evidence. The reliability of most techniques in current use also needs to be confirmed, except for manual slice-by-slice segmentation, which has been shown to be sufficiently reliable (if time consuming). Studies to evaluate different MRI protocols are warranted. Specific studies in pathological muscles are also needed to enable the proper application of such techniques in routine clinical practice.
Supporting information S1 Table. Prisma