Quantitative assessment of the lumbar intervertebral disc via T2 shows excellent long-term reliability

Methodologies for the quantitative assessment of the spine tissues, in particular the intervertebral disc (IVD), have not been well established in terms of long-term reliability. This is required for designing prospective studies. 1H water T2 in the IVD (“T2”) has attained wider use in assessment of the lumbar intervertebral discs via magnetic resonance imaging. The reliability of IVD T2 measurements are yet to be established. IVD T2 was assessed nine times at regular intervals over 368 days on six anatomical slices centred at the lumbar spine using a spin-echo multi-echo sequence in 12 men. To assess repeatability, intra-class correlation co-efficients (ICCs), standard error of the measurement, minimal detectable difference and co-efficients of variation (CVs) were calculated along with their 95% confidence intervals. Bland-Altman analysis was also performed. ICCs were above 0.93, with the exception of nuclear T2 at L5/S1, where the ICC was 0.88. CVs of the central-slice nucleus sub-region ranged from 4.3% (average of all levels) to 10.1% for L5/S1 and between 2.2% to 3.2% for whole IVD T2 (1.8% for the average of all levels). Averaging between vertebral levels improved reliability. Reliability of measurements was least at L5/S1. ICCs of degenerated IVDs were lower. Test-retest reliability was excellent for whole IVD and good to excellent for IVD subregions. The findings help to establish the long-term repeatability of lumbar IVD T2 for the implementation of prospective studies and determination of significant changes within individuals.


Introduction
With a long-term view to automated radiographic analysis of spinal structures on magnetic resonance imaging (MRI), establishing methodologies for quantification of spinal structures is a key step. For example, intervertebral disc (IVD) degeneration is currently assessed by semiquantitative grading scales, such as the Pfirrmann grading scheme [1]. These scales are insensitive to subtle changes in the IVD [2] and in the long-term, similar to current assessment of bone mineral density in the spine, a population based reference to which to compare the IVDs of an individual to determine the degree of IVD degeneration would be preferable. The expertise in the field for quantitative MRI of the IVD is improving: approaches for automated identification and segmentation of the lumbar IVD [3] and methods to quantify the IVD on MRI [4] are being established. The assessment of IVD T 2 relaxation time is, arguably, currently the most widely studied approach, with works on comparison of T 2 to the composition and structure [4][5][6][7] of the IVD and to IVD degeneration [8][9][10][11]. In an IVD the T 2 value of water 1 H is predominantly determined by the mobility of water, which depends on the rapid exchange between free mobile water and the immobile water of the hydration shell of macromolecules like glycosaminoglycan and collagen. In so far T 2 is a qualitative marker for the relative water and protein content in the IVD. If, for example the volume of an IVD increases with a parallel increase in T 2 , this volume increase would predominantly result from water uptake. A volume increase at constant T 2 would indicate a proportional water uptake and protein synthesis.
To accept a methodology for wider use, it is important to understand its reliability; that is, how much measurement error exists when scanning the same person one time-point to the next. This then drives the responses to key questions of, for example, sample size estimation in study design [12] or whether the change seen in an individual over time is 'more than measurement error' (i.e. minimum detectable change). The reliability of IVD T 2 measurements has received limited attention. In our search of the literature, we were able to identify only one study [13] that examined the reliability of T 2 measurements of the IVD in (living) human participants where the sequence included four echos in 25 and 100ms; thus, leading to a less precise estimation of T 2 than currently possible. This study examined reliability by performing scanning 15 minutes after the first scan in 10 participants. These authors found that the measurement variability was higher at the lower lumbar spine in comparison to the upper lumbar spine. Given that the adaptation of the IVD is a slow process [14,15], and long-term reliability is typically worse than the short-term reliability, it is important to know what the long-term reliability of IVD T 2 measurements are. In this work, we examine the long-term reliability of IVD T2 over 1.5 years.

Study design and participants
Twelve males aged 20-44 years participated in the study which was conducted as part of a wider project examining the spine in repeated bouts of bed-rest [16]. Participant mean (SD) age, height and weight were 34.3(8.3)yr, 176.2(5.9)cm and 69.8(8.0)kg respectively. Nine scanning sessions were performed over 1yr: baseline, 34d, 60d, 154d, 188d, 214d, 308d, 342d and 368d post-baseline. For assessing reliability using nine repeated measurements, it is recommended that at least four participants are measured [17]; thus, our design ensured an adequate assessment of measurement reliability. The participants were tested at a university hospital. Exclusion criteria included a history of chronic low back pain, current episode of low back pain or any known genetic muscle and bone diseases. None of the participants had a history of spinal surgery. This study was carried out in accordance with the Declaration of Helsinki guidelines for research on human participants (1989). Ethical oversight and approval was provided by the Comite de Protection des Personnes Sud-Ouest et Outre-Mer I (CPP SOOM I; Toulouse, France. Id number B120707-48). All participants gave their informed written consent prior to involvement in the study.

Magnetic resonance imaging-protocol
To minimise the impact of normal diurnal variation [18] on the study, all scanning was performed after 20:00 hours to control for the influence of body-fluid shifts [19] participants were required to remain in lying for 2hr prior to the start of the scanning procedure. Participants were transferred in lying to the scanning table and positioned in supine lying in the scanner with a standard cushion behind their knees and a standard pillow behind their head. A spinecho multi-echo sequences on a 1.5T Siemens Avanto scanner (Erlangen, Germany) was used with the body coil to collect images at 16  300x300mm; image resolution: 0.781mm/pixel) centred at the lumbar vertebrae. To enable Pfirrmann grading of IVD degeneration at the first scanning session, a T2-weighted sequence (thickness: 3mm; interslice distance: 0.3mm; repetition time: 8740ms; echo time: 104mm; field of view: 380x380mm; image resolution: 1.188mm/pixel) encompassing the entire lumbar spine with 29 sagittal anatomical slices was also collected. Images were exported for further offline processing. Pfirrmann grading was performed on IVDs from baseline scans by a qualified radiologist. To ensure blinding of the examiner to prior measurements, each testing session was assigned a random number (obtained from www.random.org).

Segmentation of IVDs and the semi-automated finding of five IVD subregions
MATLAB (version R2018b; MathWorks; Massachusetts, USA) was used for all semi-automated image processing. For all anatomical slices, the images were automatically partitioned to include only the vertebral bodies and IVDs. The second echo from each anatomical slice was used for starting IVD segmentation as this showed the highest contrast. The third image (typically at the centre of the spine) was used first for localising the position of the IVDs. The image was first median-filtered (one pixel width, 15 rows; 'median-image'). Then, for each pixel-column (i.e. at each x-axis coordinate, where the y-axis of the image is of the pixels in the cephalocaudal direction and the x-axis pixels in the anteroposterior direction) the difference in signal intensity units between the original image and the median-filtered image was calculated. All pixels 100 units or more than the median-filtered image were then labelled white and the remainder black ( Fig 1A).
To localise the centre of the spine, the pixel-column which had the highest number of pixels with a signal intensity difference �100 units was taken. The signal intensity values of this pixel-column ( Fig 1B) were then considered. The local minima with a signal intensity value of less than 150 units were taken as starting points. This identified, anatomically, the vertebral end-plate. Contour tracking (seeking a minimum of three neighbouring pixels with a signal intensity value <150 units) was performed anteriorly and posteriorly along the detected starting points (vertebral end-plates; Fig 1C).
At the next step, the anterior and posterior boundaries of the spine were delineated. First, for each pixel on each echo at a given anatomical slice the signal intensities are averaged ('averaged-image'). Then the regions between each identified intervertebral disc (identified at previous step; 'vertebral bodies') were examined. The operator marked approximately eight support points per disc region in the middle layer. These were then connected optimally by means of graph search (Dijkstra algorithm). The deviation of the gradients from the gradient direction on the support points and the grey values were used as costs in the optimality function. The algorithm always provided the path with the lowest costs (i.e. the shortest possible connection along the object edges). Subsequently, the support points were propagated to the adjacent layers. For this purpose, a pixel with a similar gradient and grey value was searched in the environment of the current position in the neighbouring layer. Then the graph search algorithm is used again to find the optimal path between the new points. If necessary, the operator had the possibility to move the node manually to the best position. The whole process was repeated until all layers of a data set were segmented. The x-coordinate used as the starting point for contour tracking is again taken as the starting point. From this starting point in each pixel-row in each vertebral body, the algorithm searched left and right for the first instance of a signal intensity difference 80 units between the average-image and the median-image. This was then taken as the anterior and posterior border, respectively, of the spine ( Fig 1D). Finally, the detected contours were filled and extracted, giving the locations of the IVDs (Fig 1E). The process was then repeated for the other anatomical slices.
To subdivide the respective disc into five equally sized segments anterior to posteriorly, the object orientation and the centroid were determined first by fitting an ellipse to the IVD, the angle between the major axis of the ellipse and the x-axis was noted and the IVD was then rotated to the x-axis. The width of the rotated object was measured and then divided into five segments of equidistant width ( Fig 1F) and the mean grey values of the individual segments were determined. The coordinates of the region of interest for each IVD was applied to each echo and the signal intensity data were extracted for the whole IVD and subregions in each echo.

Calculation of T2 values for each IVD subregion
In each IVD and for each of the five predefined subregions the corresponding T 2 was calculated using custom code written in the R statistical environment (version 3.4.2, www.r-project. org) from the decay of signal intensity measured at 16 different echo times (Fig 2). The primary analysis focussed on whole IVD T2 (T2 values of the whole IVD averaged across all anatomical slices) and T 2 in the nucleus (T 2 in the central subregion of the five IVD subregions in the third anatomical slice). Secondary analysis considered the reliability in the five IVD subregions from the anterior to posterior annulus, averaged across all images.

Statistical analysis for the repeatability of T2 measurements
Following recommendations for studies of reliability [20], for each parameter the intra-class correlation coefficient (Type 1,1 according to Shrout and Fleiss [21] and Type 1 according to McGraw and Wong [22]), the standard error of the measurement (SEM = [SD of sample scores] � [square-root of {1-ICC}]) and the minimum detectable difference (MD = SEM � [square root of 2] � 1.96) were calculated [23]. The minimum detectable difference is also termed the minimum detectable change (MDC) by other authors. Coefficients of variation (CVs) and their 95% confidence intervals (CIs) were also calculated as described elsewhere [17]. To test for agreement between pairs of testing days, Bland-Altman plots [24] were generated, the "limits of agreement" were calculated (as 1.96 times the standard deviation of the differences between testing days), and bias (e.g. overestimation of smaller values and underestimation of larger values) was assessed formally for the linear case via linear regression and the plots were also inspected visually for non-linear forms of bias. Earlier authors [25] have classified ICCs as poor, moderate, good or excellent when the ICC estimate was, respectively, less than 0.40, 0.40-0.59, 0.60-0.74 or greater than 0.75. To enable evaluation of the stability of T2 over differing spans of time, we also calculated the Pearson's correlation coefficient of the T2 in the whole IVD and in the nucleus for all testing days. An alpha-level of 0.05 was assumed for statistical significance. All analyses were conducted in the "R" statistical environment (version 3.4.2, www.r-project.org) and the "irr" package was used for calculation of ICCs.

Results
All twelve participants completed the first four testing time-points (baseline, 34d, 60d and 154d post-baseline), after this eleven were tested at 188d post-baseline, 10 at 214d and 308d post-baseline and eight at 342d and 368d post-baseline.

Reliability
For the whole IVDs and for the IVDs' nucleus of the entire lumbar spine ICCs of T 2 measured at different examination days (Fig 3) were above 0.93 ('excellent'; Table 1) with the exception of nuclear T 2 at L5/S1 where the ICC was 0.88 ('excellent'). CVs were larger for the nucleus (central-slice only, subregion 3; 4.3% for the average of all levels to 10.1% for L5/S1) as compared to the T 2 of the whole IVD (1.8% for the average of all levels and 2.2% [L2/3 and L3/4] to 3.2% [L5/S1]). Averaging between vertebral levels improved reliability, with significant improvements of CVs only (Table 1). Results from Bland-Altman analysis (S1 Table) showed agreement between all testing days for these parameters. In 7 of 96 cases (7.3%) there was some evidence of linear bias and none of these remained significant after controlling for falsepositives by the false discovery rate method [26]. The reliability data for individual IVD subregions from the anterior annulus (subregion 1) to posterior annulus (subregion 5) are presented in Table 2.
The correlation of T 2 between different scanning time-points is presented in Table 3. Shorter duration correlations, such as between baseline and 34d post-baseline were typically higher (0.97 for the nucleus and 0.99 for the whole IVD) than between baseline and the final testing date 368d post-baseline (0.93 for the nucleus and 0.98 for the whole IVD).

Discussion
This study was the first to consider the long-term repeatability of quantitative MRI of the lumbar IVDs. We found lumbar IVD T 2 to be stable over one year, with long-term ICCs greater than 0.90 for almost all regions of interest. An exception was the L5/S1 nucleus with an ICC of 0.88. Long-term CVs were also acceptable, typically being <5%.
The current study expands the knowledge base on techniques for quantifying the IVD and facilitates long-term studies using T 2 as a key outcome measure. To date, the only reliability data available for IVD T 2 was on ten participants measured approximately 10 to 15 minutes apart [13]. They found, similar to our work, that reliability of T 2 quantification to be lowest at L5/S1 as compared to the remainder of the lumbar spine. Other studies of IVD measurement reliability have been performed on T2-weighed images, such as for morphology and signal intensity related parameters [27] and IVD height [28]. In both these cases, however, reliability was assessed by measuring the same images. A major component of measurement error comes  Table) showed agreement between baseline and follow-up testing time-points. The subtle downward slope over time in both parameters was not significant (p>0.76) on further linear regression analysis. https://doi.org/10.1371/journal.pone.0249855.g003

PLOS ONE
Long-term stability of intervertebral disc T2 from testing the same person on different days (such as with repositioning error of the patient in the MRI and of the slices in sequence setup).
Quantifying IVD signal properties as they relate to IVD composition, rather than IVD morphology, is likely more sensitive to detecting subtle changes. For example, water content losses occur progressively with age [29], whereas losses of disc height are less pronounced [30]. In the Pfirrmann [1] grading scheme, losses of height become pronounced in severe degeneration only, whereas stepwise reductions of signal intensity can be seen with each Pfirrmann grade [31], and IVD height is unlikely [32] an appropriate early indicator of IVD degeneration.
The findings of this study can have application immediately in research and potentially in the future in clinical practice. The findings here can be used in research when the T2 measurements are used at outcome measures: for sample size estimation in prospective studies and/or retrospective sensitivity analysis for existing data. Ultimately, in clinical practice, similar to current practice in the interpretation of bone mineral density changes over time [33], once developments in automated image analysis have progressed, the test-retest repeatability data here can inform whether a patient's results over time exceed measurement error and thus potentially inform clinical management of that patients.
The strengths of the current study include its prospective design over a long period of time (12 months) and it is unique amongst method development studies for quantifying the lumbar IVDs, such as via T 2 . The main limitation of the current study is the loss to followup of participants beyond 154d post-baseline, which indicates that estimates of reliability beyond this time-point may be less robust. Another limitation relates to the use of a convenience sample of male participants in a wider project. Whilst we speculate reliability would be similar in clinical populations, such as in people with low back pain, we cannot be certain of this. Similarly, we have not examined individuals older than 45 years of age and did not examine women.  For results of Bland-Altman testing, see S1 Table. https://doi.org/10.1371/journal.pone.0249855.t001

Conclusions
In conclusion, T 2 MRI was shown to produce reliable estimates of whole and regional lumbar IVD structure. This method was more reliable when averaging between vertebral levels, yet less reliable when IVDs had greater degeneration. These findings establish long-term repeatability of this method for future prospective studies with regards to design (i.e. sample size calculations) and implementation (i.e. quantification of changes over time).   Table 4).
https://doi.org/10.1371/journal.pone.0249855.g004 Supporting information S1 Table. Results from Bland-Altman testing. � : 0.008<p<0.026 and indicates raw p-value on regression testing for linear bias. None of these P-values remain significant after adjustment for multiple comparisons via the false discovery rate method. Note that all lower limits remain below zero and all upper limits above zero, indicating no systematic mean bias. (PDF)