Assessing Lumbar Paraspinal Muscle Cross-sectional Area and Fat Composition with T1 versus T2-Weighted Magnetic Resonance Imaging: Reliability and Concurrent Validity.

Background: Studies using magnetic resonance imaging to assess lumbar multifidus cross-sectional area frequently utilize T1 or T2-weighted sequences, but seldom provide the rationale for their sequence choice. However, technical differences between their acquisition protocols could affect the ability to assess lumbar multifidus anatomy or its fat/muscle distinction. Our objectives were to examine the concurrent validity of lumbar multifidus morphology measures of T2 compared to T1-weighted sequences, and to assess the reliability of repeated lumbar multifidus measures.

Methods: The lumbar multifidus total cross-sectional area from 360 muscles was measured bilaterally at L4 and L5, with histogram analysis determining the muscle/fat threshold values per muscle. Images were later re-randomized and reassessed for intra-rater reliability. Matched images were visually rated for consistency of outlining between both image sequences. Bland-Altman bias, limits of agreement, and plots were calculated for differences in total cross-sectional area and percentage fat between and within sequences, and intra-rater reliability analysed.

Results: T1-weighted total cross-sectional area measures were systematically larger than T2 (0.2 cm²), with limits of agreement <±10% at both spinal levels. For percentage fat, no systematic bias occurred, but limits of agreement approached ±15%. Visually, muscle outlining was consistent between sequences, with substantial mismatches occurring in <5% of cases. Intra-rater reliability was excellent (ICC: 0.981 – 0.998), with bias and limits of agreement less than 1% and ±5%, respectively.

Conclusion: Total cross-sectional area measures and outlining of muscle boundaries were consistent between sequences, and intra-rater reliability for total cross-sectional area and percentage fat was high, indicating that either MRI sequence could be used interchangeably for this purpose. However, further studies comparing the accuracy of various methods for distinguishing fat from muscle are recommended.

BACKGROUND
Over the last three decades, there has been a rapid increase in research interest regarding the role paraspinal muscles, and in particular the lumbopelvic stabilizing lumbar multifidi (LM), may play in relation to low back and leg pain. Most of this research has utilized diagnostic and functional imaging methods, including diagnostic ultrasound [2-4], computed tomography (CT) [5-7], and magnetic resonance imaging (MRI) [8-10]. A comprehensive literature search including articles from 1980-2017, focusing on imaging studies evaluating paraspinal muscles for various clinical, surgical, pathological, or anatomical reasons, identified an exponentially increasing use of advanced imaging to evaluate the paraspinal muscles (Fig. 1). The majority of this research utilized MRI, particularly when looking at static evaluation of muscle.
Various MRI methods have been employed to assess the LM, ranging from the standard T1-weighted and T2-weighted spin echo imaging sequences, to more sophisticated approaches such as functional [11,12], opposed-phase [13], or chemical-shift MR [14], and MR spectroscopy [15]. However, spin echo (including fast/turbo spin echo) sequences are the most common methods used for MR imaging of the spine [16]. Studies incorporating spin echo imaging have utilised either T1-weighted [9,17-20] or T2-weighted [8,10,21-24] sequences, but seldom provided the rationale for their sequence choice. There are, however, important technical differences between T1 and T2-weighted sequences that could affect their ability to assess the anatomy or fat/muscle distinction within the LM.
Traditionally, T1-weighted sequences are described as providing greater anatomical detail than T2-weighted sequences (as T2-weighted sequences are more susceptible to motion artifact and lower signal-to-noise ratios) and better distinction between fat and fluid signal, due to the different T1 relaxation times of these two tissue types. Conversely, due to the longer T2 relaxation times for fat and fluid, the signals for these two tissues can both be high on T2-weighted sequences. Since muscle signal tends to be comparatively lower on T2 than on T1-weighted sequences, the signal difference between fat and muscle may be naturally greater on T2-weighted sequences [16,25].
The question is whether these inherent differences are sufficient to negate our ability to apply these two sequences interchangeably, or even to compare outcomes between the two sequences. Suh et al. [26] compared the reliability of histographic analysis for T1 and T2-weighted sequences and found equivalent intra- and inter-rater reliability. However, this study did not include comparisons of muscle outlines, cross-sectional area (CSA) measures, or histographic outcomes between sequences.
While early evidence suggests that T1 and T2-weighted sequences are interchangeable for measuring paraspinal muscle morphology and atrophy, based on our literature search the validity of this assumption has not been previously tested. Therefore, the primary study objective was to examine the concurrent validity of LM morphology measures acquired from T2-weighted MR sequences compared to matched T1-weighted sequences. The secondary objective was to assess the intra-rater reliability of repeated LM measures.
Manual image angulation through the abnormal disc was made on the T1-weighted sequence, then copied for T2 image acquisition. To ensure muscle anatomy was directly matched across selected sequences, the location of axial slices was matched between the T1 and T2-weighted sequences by cross-referencing with the sagittal slices and confirming that the table and image slice location protocols were identical. Sequences that could not be precisely matched were excluded from analysis.

Image Selection
Based on sample size calculations for agreement studies (α = 0.05; β = 0.90; k = 0) [27], 45 cases were randomly selected from a pool of 100 MRI cases, using a random number generator [28]. Cases were included if they were of sufficient quality and scope to demonstrate the LM at L4/5 (L4) and L5/S1 (L5) bilaterally. Six cases were excluded because they did not demonstrate the required anatomical landmarks (n = 3), had quality issues significant enough to affect measurement accuracy (e.g., abnormal alignment, severe pathology) (n = 2), or could not be aligned at the same slice level between the two sequences (n = 1). The 45 cases included were sub-divided by spinal level (L4 and L5) and imaging sequence (T1 and T2) into 180 images and assessed bilaterally.
Image slice selection for each spinal level was based on the image that best demonstrated the following anatomical landmarks bilaterally: facet joints / articular processes, laminae, spinous process, and the lateral border of the LM on the same slice, as determined by the lead examiner (JC). Although this approach resulted in some between-case variation in slice levels, it did ensure the slices with the clearest LM boundaries were included. Once selected, each image was randomly assigned and encoded with a sequential image number by an assistant not associated with the project. This ensured the examiners were blinded to the randomization process and that the individual cases, spinal levels, and imaging sequences were assessed randomly. To undertake a second round of measurements, all images were re-randomized as above, with only the new image code included on the images.
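The blinded coding step described above can be illustrated with a minimal sketch. It assumes hypothetical image identifiers of the form `caseNN_level_sequence` and a seeded pseudo-random shuffle; the actual tool used by the assistant is not described in this study, so the function names and seeds here are illustrative only.

```python
import random
from itertools import product


def build_image_ids(n_cases=45, levels=("L4", "L5"), sequences=("T1", "T2")):
    """List one hypothetical identifier per case, spinal level and sequence."""
    return [
        f"case{case:02d}_{level}_{seq}"
        for case, level, seq in product(range(1, n_cases + 1), levels, sequences)
    ]


def assign_blinded_codes(image_ids, seed):
    """Shuffle the images and give each a sequential code (1, 2, 3, ...).

    The returned key (code -> original identifier) would be held by the
    assistant, so the examiner only ever sees the code on the image.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    return {code: image_id for code, image_id in enumerate(shuffled, start=1)}


image_ids = build_image_ids()                          # 45 x 2 x 2 = 180 images
round_one = assign_blinded_codes(image_ids, seed=1)    # initial measurements
round_two = assign_blinded_codes(image_ids, seed=2)    # re-randomized second round
```

Using a separate seed for the second round gives a new code for each image, which mirrors the re-randomization used before the repeat measurements.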

Muscle morphology measurement procedures
The images selected were used to quantify muscle area versus fatty infiltrated tissue of the LM on T1 and T2-weighted sequences. Measures of LM morphology included: total CSA; total muscle (i.e., fat-free) CSA; and, total fat CSA. Measurement procedures were undertaken in a three-step process (described below) by an examiner with over 30 years of experience in MRI interpretation, as well as previous experience using sliceOmatic software.
To perform the measurements, we utilized sliceOmatic v5.0.7d [TomoVision, Magog, Canada]; it compared favourably in comparison studies with several of the above programs for adipose tissue assessment on MRI [18], and has been used extensively in adipose tissue and muscle quantification analysis research throughout the body [29] and specifically for evaluating cross-sectional LM morphology on MRI [19,30].
This system allows for outlining muscle CSA, with specific quantification of the fat and muscle tissue.
Determining the muscle/fat transition value: to compare different spin echo sequences, we needed to account for variations in signal intensity and image acquisition size between sequences, as well as variations in image intensity from superficial to deeper structures, or from side to side, that can be present within an image. Identifying a threshold value between muscle and fat also requires accounting for the fact that muscle may degrade towards fat gradually rather than fully, such that a broad grey-scale transition is present on the image. In an attempt to account for each of these variables, a protocol was developed using a histographic threshold analysis procedure, as this was considered to be the most efficient and consistent method to apply.
To determine the muscle/fat threshold value to apply bilaterally across the full depth of each muscle, the lead examiner acquired an initial histogram for each image by first outlining both multifidus muscles (connected via the subcutaneous fat but excluding any vertebral structures; see Figs. 2A and 2B). The threshold was then determined by identifying the point at which the histogram curve intersected the X-axis, rounded to the nearest value of ten. After this value was recorded, the initial outline was deleted, and a second outline acquired as described above, with the new intersecting value recorded. The average of these two values was entered into a spreadsheet as the muscle/fat segmentation threshold.
For 25 images, there was an insufficient amount of lean muscle mass on one or both sides of the image to acquire a valid histogram reading. In those instances, a visual estimation of the transition value between muscle and fat was determined by moving the cursor over pixels of muscles on each side and selecting a grey level image value that the examiner felt best represented the transition threshold.
Although this process introduced a subjective element into the procedure, this scenario reflects clinical practice and represents a pragmatic solution to the interpretation of challenging images.
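The histogram-based threshold procedure can be sketched as follows. This is an illustration under stated assumptions, not the sliceOmatic implementation: it assumes the modal grey level of the outlined region belongs to the muscle peak, and it interprets "intersecting the X-axis" as the first grey level above that peak with no pixels. The function names are hypothetical.

```python
import math
from collections import Counter


def upper_intersection(intensities, baseline=0):
    """First grey level above the modal (assumed muscle) peak whose pixel
    count falls to the baseline, i.e. where the curve meets the X-axis."""
    counts = Counter(int(value) for value in intensities)
    if not counts:
        raise ValueError("the outlined region contains no pixels")
    level = max(counts, key=counts.get)   # modal grey level (assumed muscle)
    while counts[level] > baseline:       # walk up the right side of the peak
        level += 1
    return level


def round_to_nearest_ten(value):
    """Round half up to the nearest multiple of ten."""
    return int(math.floor(value / 10 + 0.5)) * 10


def muscle_fat_threshold(first_outline, second_outline):
    """Average of the rounded readings from two independent outlines."""
    readings = [
        round_to_nearest_ten(upper_intersection(pixels))
        for pixels in (first_outline, second_outline)
    ]
    return sum(readings) / len(readings)
```

When too little lean muscle is present, the modal peak may no longer represent muscle, which is consistent with the need for visual estimation in the 25 images noted above.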
Total CSA outlining procedure: left and right multifidus outlines were individually traced with a computer mouse to create regions of interest corresponding to the cross-section of the muscle at that spinal level. For each measurement, the entire muscle boundary was manually outlined up to, but excluding, the cortical margins of the vertebral arch and supraspinous ligament medially and anteriorly, and posteriorly up to, but excluding, the superficial fascia. The clearest evidence of a fat/fascial boundary between the LM and erector spinae was used for the lateral margin. All muscle and fat within these boundaries were included. These measurement parameters accord with recent recommendations for assessing paraspinal muscle morphology [14,31]. A detailed description of the outlining parameters applied, including methods for addressing variations from the "normal" boundary appearances, can be found in Additional file 1.
To assess qualitatively the similarity of anatomical outlines between imaging sequences, a snapshot of the initial outlined image was saved for later comparison of the muscle outlines between imaging sequences. Once all CSA measures were completed, the matching images between each sequence were overlaid (by making one image partially transparent), and the muscle boundaries divided into approximate quadrants. Screen magnification was set at 200%, and each quadrant's outline between images rated as 0 = perfect/near perfect; 1 = mild mismatch; or, 2 = significant mismatch. Each muscle quadrant was rated individually and separately by two different examiners (JC, ADZ) for consistency of anatomical outlining between sequences (see Fig. 2C). The protocols used to determine CSA outlining consistency were tested on five cases by each examiner, revised for clarity by consensus, and then performed on all cases. The final protocols used, including the criteria for qualitative agreement, can be found in Additional file 2. Once all ratings were initially completed, a follow-up consensus meeting was held and any discrepancies between examiner ratings discussed to reach final agreement on each rating.
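The quadrant rating and consensus steps can be expressed as a short sketch. It assumes each rating is stored under a hypothetical key of (case, level, side, margin); the data structure and function names are illustrative and are not taken from the study protocols.

```python
from collections import Counter

# Rating scale used for each muscle quadrant
RATING_LABELS = {
    0: "perfect/near perfect",
    1: "mild mismatch",
    2: "significant mismatch",
}


def flag_for_consensus(ratings_a, ratings_b):
    """Return the quadrant keys on which the two examiners disagree."""
    return sorted(key for key in ratings_a if ratings_a[key] != ratings_b[key])


def summarise_ratings(final_ratings):
    """Count and percentage of quadrants at each rating level."""
    counts = Counter(final_ratings.values())
    total = len(final_ratings)
    return {
        rating: (counts[rating], 100.0 * counts[rating] / total)
        for rating in RATING_LABELS
    }


# Hypothetical ratings for one muscle from two examiners
examiner_a = {
    (1, "L4", "R", "anterior"): 0,
    (1, "L4", "R", "lateral"): 2,
    (1, "L4", "R", "medial"): 0,
    (1, "L4", "R", "posterior"): 1,
}
examiner_b = dict(examiner_a)
examiner_b[(1, "L4", "R", "lateral")] = 1

to_discuss = flag_for_consensus(examiner_a, examiner_b)
summary = summarise_ratings(examiner_a)
```

Only the flagged quadrants would need to go to the consensus meeting; the summary would be run on the final agreed ratings.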
CSA measurements: fat and muscle tissues were color-tagged by side and by imaging sequence for assigning measurement values (e.g., T2 right muscle = red; T2 left muscle = purple). The right and left-sided muscle outlines were then filled in with their corresponding color tag, creating total area and tissue-specific cross-sectional measurements that were exported to a spreadsheet for later analysis (see Fig. 2D). SliceOmatic has the capacity for multiple images to be opened simultaneously, which allowed for assessment of images in groups of five. Once all images in a group were measured and the data exported, the outlines for each image were deleted. Those five images were then randomly reassessed and the measurement data exported. The means of these two measurements were used to analyse the CSA.
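As an illustration of how tissue-specific areas follow from a threshold, the sketch below counts pixels inside one outlined muscle. It assumes that pixels at or above the threshold are tagged as fat (fat being brighter than muscle on both sequences) and that the in-plane pixel area is known; the pixel area and intensity values are hypothetical, and this is not the sliceOmatic algorithm.

```python
def tissue_csa(roi_intensities, threshold, pixel_area_mm2):
    """Total, fat and muscle CSA (cm^2) and percentage fat for one muscle."""
    n_total = len(roi_intensities)
    if n_total == 0:
        raise ValueError("the region of interest contains no pixels")
    n_fat = sum(1 for value in roi_intensities if value >= threshold)
    cm2_per_pixel = pixel_area_mm2 / 100.0      # 100 mm^2 = 1 cm^2
    total = n_total * cm2_per_pixel
    fat = n_fat * cm2_per_pixel
    return {
        "total_cm2": total,
        "fat_cm2": fat,
        "muscle_cm2": total - fat,
        "percent_fat": 100.0 * fat / total,
    }


def mean_of_two(first, second):
    """Average two repeated measurements of the same muscle, key by key."""
    return {key: (first[key] + second[key]) / 2 for key in first}


# Hypothetical muscle: 750 muscle pixels and 250 fat pixels of 1 mm^2 each
roi = [50] * 750 + [200] * 250
measure_one = tissue_csa(roi, threshold=120, pixel_area_mm2=1.0)
measure_two = tissue_csa(roi[:990], threshold=120, pixel_area_mm2=1.0)
averaged = mean_of_two(measure_one, measure_two)
```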
Intra-rater analysis: To assess intra-rater reliability of the CSA measures, all 180 images were re-randomized and recoded, histographic analysis repeated, and muscle measures performed bilaterally by JC at L4 and L5. To provide a washout period, this phase started three weeks after all initial measurements had been completed.

Statistical Analyses
Cross-sectional area measurements were recorded by level, side, and imaging sequence. Data were checked for non-plausible entries. Bland-Altman (BA) analysis (bias and limits of agreement (with 95% CIs)) and plots were calculated to compare T1 with T2-weighted outcomes for total CSA and percentage fat CSA. As percentage muscle CSA was merely the inverse result of percentage fat CSA, this measure was not reported.
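For readers unfamiliar with Bland-Altman analysis, a minimal sketch of the calculation is given here. The bias is the mean of the paired differences and the limits of agreement are the bias ± 1.96 SD of those differences. The confidence intervals use a normal approximation and the standard error approximation sqrt(3 SD²/n) for each limit; the statistical software and exact interval method used in this study are not stated, so these are assumptions. The example values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def bland_altman(first, second, confidence=0.95):
    """Bias, limits of agreement and approximate CIs for paired measures.

    Differences are taken as first minus second (e.g., T1 minus T2).
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length series with at least two pairs")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    bias = mean(diffs)
    sd = stdev(diffs)                                  # sample SD of differences
    loa_lower, loa_upper = bias - 1.96 * sd, bias + 1.96 * sd
    z = NormalDist().inv_cdf(0.5 + confidence / 2)     # normal approximation
    se_bias = sd / math.sqrt(n)
    se_loa = sd * math.sqrt(3.0 / n)                   # approximate SE of a limit
    return {
        "bias": bias,
        "sd": sd,
        "loa_lower": loa_lower,
        "loa_upper": loa_upper,
        "bias_ci": (bias - z * se_bias, bias + z * se_bias),
        "loa_lower_ci": (loa_lower - z * se_loa, loa_lower + z * se_loa),
        "loa_upper_ci": (loa_upper - z * se_loa, loa_upper + z * se_loa),
    }


# Hypothetical total CSA values (cm^2) for four muscles
t1_csa = [10.2, 9.8, 11.1, 10.5]
t2_csa = [10.0, 9.7, 10.8, 10.3]
result = bland_altman(t1_csa, t2_csa)
```

With small samples a t-distribution quantile would be more appropriate than the normal approximation used here.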
While understanding the potentially arbitrary nature of establishing limits of agreement (LOA) for this study, an a priori range for acceptable variations in LOA of ± 10% was set, based on previous studies on differences in multifidus CSA between symptomatic and normal/asymptomatic low back pain subjects [19,32-36]. To apply this threshold to CSA values, the overall means for total CSA at L4 and L5 were calculated (based on the average of the means between the first and second measures of both sequences), with a mean total CSA at L4 of 10.0 cm², and at L5, 10.8 cm². For consistency of interpretation, the LOA 10% variability threshold for total CSA was set at ± 1.0 cm² for both spinal levels.
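The arithmetic behind the a priori threshold can be shown in a few lines. The mean CSA values are those reported above; the limits of agreement passed to the check are hypothetical and serve only to show how the comparison works.

```python
def relative_threshold(mean_csa_cm2, fraction=0.10):
    """Absolute LOA threshold (cm^2) as a fraction of the mean total CSA."""
    return mean_csa_cm2 * fraction


def loa_within_threshold(loa_lower, loa_upper, threshold):
    """True when both limits of agreement lie inside +/- threshold."""
    return -threshold <= loa_lower and loa_upper <= threshold


threshold_l4 = relative_threshold(10.0)   # about 1.0 cm^2
threshold_l5 = relative_threshold(10.8)   # about 1.08 cm^2

# The study applied a single +/- 1.0 cm^2 threshold at both levels.
applied_threshold = 1.0
acceptable = loa_within_threshold(-0.7, 0.9, applied_threshold)     # hypothetical LOA
unacceptable = loa_within_threshold(-1.2, 0.9, applied_threshold)   # hypothetical LOA
```

Applying ± 1.0 cm² at L5 is slightly stricter than 10% of the L5 mean, so the common threshold is conservative at that level.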
For the second round of measures, CSA was recorded by level and side, then the muscle CSA, fat CSA, total CSA, and percentage fat CSA quantitatively analysed against the initial measurements using two-way mixed effects, absolute agreement, single-rater intraclass correlation coefficients (ICC (3,1)); intra-rater ICC values greater than 0.90 were considered excellent [37]. Standard error of measurement (SEM) and minimal detectable difference (1.96 × √2 × SEM) were determined.
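The reliability statistics can be sketched as below. The ICC function uses the two-way, absolute-agreement, single-measurement formula based on ANOVA mean squares, (MSR − MSE) / (MSR + (k − 1)MSE + k(MSC − MSE)/n). The SEM formula, SD × sqrt(1 − ICC), is a commonly used definition and is an assumption here, since the study does not state how SEM was derived; the minimal detectable difference follows the formula given above.

```python
import math
from statistics import mean


def icc_absolute_agreement(first, second):
    """Two-way, absolute-agreement, single-measurement ICC for two rounds."""
    n, k = len(first), 2
    rows = list(zip(first, second))
    grand = mean(list(first) + list(second))
    row_means = [mean(row) for row in rows]
    col_means = [mean(first), mean(second)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                    # between-muscle mean square
    ms_cols = ss_cols / (k - 1)                    # between-round mean square
    ms_error = ss_error / ((n - 1) * (k - 1))      # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC); an assumed, commonly used definition."""
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_difference(sem):
    """MDD = 1.96 * sqrt(2) * SEM, as stated in the text."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical repeated total CSA measures (cm^2) with a constant 1.0 offset
round_one = [9.0, 10.0, 11.0, 12.0]
round_two = [10.0, 11.0, 12.0, 13.0]
icc = icc_absolute_agreement(round_one, round_two)
sem = standard_error_of_measurement(sd=1.0, icc=0.99)
mdd = minimal_detectable_difference(sem)
```

The constant offset in the example lowers the absolute-agreement ICC even though the two rounds rank the muscles identically, which is why this form is suited to detecting systematic differences between measurement rounds.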
To look more precisely at the distribution of any measurement variability relating to total CSA, percentage fat CSA and percentage muscle CSA, BA bias and 95% LOA statistics and plots were calculated for T1 and T2-weighted intra-rater measures.
The a priori LOA threshold of ± 10% (see above) was applied.
As there was a minimal difference in outcomes between sides for all analyses, right and left-sided outcomes were assessed together; agreement and reliability outcomes were reported bilaterally.

RESULTS
We included data from 45 participants (age and sex data excluded from cases), totalling 360 individual muscles analysed. The mean (± SD) total CSA at L4 was

Levels of agreement between imaging sequences
The statistical outcomes and BA plots are provided in Table 1 and Fig. 3, respectively. For total CSA measurements at L4 and L5, T1-weighted sequences systematically measured 0.2 cm² larger than T2, although this would be an unimportant difference during practical application of the measurement process.
Allowing for a small number of values outside the LOA range, the distribution of differences appears relatively consistent across all measurement averages, and the differences fall within ± 10% of the mean total CSA for the LM at both L4 and L5.
However, analysis of fat as a percentage of total CSA was less consistent. Although no systematic bias was noted between the two imaging sequences, the LOA for percentage fat approached ± 15% overall.
CI: confidence interval; LOA: limits of agreement. *All % muscle results were inversely identical to % fat results so were not included.

Muscle outlining consistency
Visual analysis of muscle outlining demonstrated perfect or nearly perfect consistency between sequences, at each level and bilaterally, in 83% of cases (Table 2). Conversely, significant outlining mismatches only occurred bilaterally along 4.8% of the muscle boundaries, being twice as common at L4, and much more likely to involve the anterior or lateral margins (80%).
Table 2
Anterior     3       9        33        45    |  3       8        34   45
Medial       1       3        41        45    |  0       4        41   45
Posterior    1       6        38        45    |  3       4        38   45
Lateral      5       6        34        45    |  7       2        36   45
Totals (%)   10 (6)  24 (13)  146 (81)  180   |  13 (7)  18 (10)
Regarding the distribution of cases requiring consensus for agreement of ratings (Table 3), the spinal levels and sides were relatively equal; however, the anterior and lateral margins were more than twice as likely to require discussion to reach consensus. This corresponds with the higher levels of outlining variations at the anterior and lateral boundaries between imaging sequences noted in the visual analysis.
Table 5 and Fig. 4 provide summaries of the descriptive outcomes and BA plots, respectively, for the total CSA. The initial measures were slightly larger (0.1 cm²) than the second, but the distribution was generally consistent across the range of measurements. Any larger variations tended to occur in muscles with a smaller total CSA.

DISCUSSION
When considering the interchangeability of T1 and T2-weighted sequences to measure the lower lumbar multifidus muscles, the total CSA would appear to be consistent between sequences, but not necessarily when measuring the CSA of different tissue types (e.g., fat) within the muscle boundaries. Although no systematic bias was present between the two sequences when assessing the percentage of fat within the total CSA, the differences between sequences became more variable when less muscle was present.
Contributing factors for the increased variability in distinguishing muscle from fat between sequences may include: 1) for a small percentage of cases, muscle outlining was substantially different between sequences; however, this affected cases with ample healthy muscle tissue as well as reduced muscle tissue, so would seem to be a small contributor; 2) the ability of the software's histogram tool to identify muscle and fat peaks when there were limited amounts of muscle was problematic, requiring visual estimation of the threshold values, which introduced potential for threshold value error between sequences; 3) T1-weighted sequences may inherently have higher fat signal than T2-weighted sequences, which could have accentuated the differences between T1 and T2-weighted tissue signal as the fat percentage increased.
Neither the spinal level nor body side being measured had a notable impact on any outcomes. Additionally, as ~95% of muscle outlines showed minimal to no difference between sequences, any agreement that was found in the total CSA measurements was based on direct matching of muscle outlines, not fortuitously similar cross-sectional measures of incorrectly outlined muscles. This confirms that outlining of muscles can also be performed consistently on either MR sequence, although the following limitations should be considered.
When outlining the muscle boundaries, adequate visualization of landmarks is crucial for consistency. Two key factors came into play in this regard: 1) the variability of anatomy between patients; 2) the variability of landmarks between MR sequences in the same patient. When considering "between patient" variability, the medial and posterior boundaries had relatively consistent margins to follow, with the spinous process and posterior fascial boundaries generally fully visible on every image. These two boundaries were the least likely to show a significant mismatch between sequences, or to require a consensus discussion to confirm an outline rating. For the anterior and lateral margins, this was not the case.
A protocol has been suggested to alleviate variations in outlining these margins [14], and we developed an additional protocol (see Additional file 1), which improved consistency; however, these accommodations are unable to address all potential variations in slice plane anatomy. Anteriorly, the laminar cortex may or may not be visible across the full margin, and the facet joint / articular process anatomy may be fully, or only partially, present; the presence of facet joint hypertrophy adds another layer of complexity.
Laterally, the margins between the multifidus and erector spinae muscles are often indistinct, particularly when the patient has less body fat to enhance the fascial boundaries. The upper and lower aspects of this margin are at times effectively invisible, with no adjacent reference points to assist. Each of these issues is likely to require the examiner to "estimate" the true boundaries.
When comparing the ability of T1 and T2-weighted sequences to assess the LM anatomy in the same patient, subtle variations in brightness or darkness of muscle boundary anatomy, and slight variations in slice location due to patient movement or breathing differences between slice acquisitions, may ultimately determine whether the muscle boundaries will be visible. This effect was most apparent at the anterior and lateral margins in a small number of cases in our study, due to the inherent challenges previously discussed. Figure 6 exemplifies these issues.
Alternative methods of distinguishing functional multifidus muscle from nonfunctional tissue (e.g., Beneck, Fortin [19,22]) could be tested to see if this issue can be overcome.
Finally, the establishment of a clinically relevant range for LOA needs to account for the inherent errors that occur with any manual measurement system. No comparable studies comparing the use of two spin echo sequences to measure multifidus muscle morphology were available to establish this range in an a priori manner, although it was deemed important to pre-determine this range. The potentially arbitrary nature of the value we established is acknowledged.

CONCLUSIONS
In this study, total CSA measures and the outlining of LM muscle boundaries were consistent between sequences, indicating there are no important concerns with using T1 or T2-weighted sequences interchangeably for this purpose. Intra-rater reliability in measuring total CSA and the percentage of fat or muscle within the total CSA was also high, confirming either MRI sequence could be used reliably by the same assessor. However, we found inconsistent identification of the functional muscle and/or fat area within the total muscle CSA, with a reduction in consistency of tissue-specific measurements as the fat percentage increased, particularly at L5.

Consent for publication
Not applicable

Availability of data and material
The imaging datasets analysed during the current study are not publicly available as they are patient files which require permission from the database manager to access; however, the datasets generated from this study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
This project was partially funded through a PhD scholarship from the Chiropractic and Osteopathic College of Australasia (COCA Research Ltd.). The funding body made no other contributions to any aspects of this study.