Test-retest reliability of myelin imaging in the human spinal cord: Measurement errors versus region- and aging-induced variations

Purpose To implement a statistical framework for assessing the precision of several quantitative MRI metrics sensitive to myelin in the human spinal cord: T1, Magnetization Transfer Ratio (MTR), saturation imposed by an off-resonance pulse (MTsat) and Macromolecular Tissue Volume (MTV). Methods Thirty-three healthy subjects within two age groups (young, elderly) were scanned at 3T. Among them, 16 underwent the protocol twice to assess repeatability. Statistical reliability indexes such as the Minimal Detectable Change (MDC) were compared across metrics quantified within different cervical levels and white matter (WM) sub-regions. The differences between pathways and age groups were quantified and interpreted in context of the test-retest repeatability of the measurements. Results The MDC was respectively 105.7ms, 2.77%, 0.37% and 4.08% for T1, MTR, MTsat and MTV when quantified over all WM, while the standard-deviation across subjects was 70.5ms, 1.34%, 0.20% and 2.44%. Even though particular WM regions did exhibit significant differences, these differences were on the same order as test-retest errors. No significant difference was found between age groups for all metrics. Conclusion While T1-based metrics (T1 and MTV) exhibited better reliability than MT-based measurements (MTR and MTsat), the observed differences between subjects or WM regions were comparable to (and often smaller than) the MDC. This makes it difficult to determine if observed changes are due to variations in myelin content, or simply due to measurement error. Measurement error remains a challenge in spinal cord myelin imaging, but this study provides statistical guidelines to standardize the field and make it possible to conduct large-scale multi-center studies.


The time constant of the transverse relaxation due to spin-spin interactions and local field inhomogeneities (T2*) has also exhibited sensitivity to myelin [32][33][34]. However, T2* includes important contributions from other factors, such as iron content [4,35], fiber orientation [36], blood vessels [37] and blood oxygen level [38].
Inhomogeneous Magnetization Transfer (ihMT) ratio is another recent metric [39] that is thought to be particularly sensitive and specific to myelin [40,41]. However, measuring this metric requires non-product sequences, which are currently not available on clinical scanners.

Terminology
The above-mentioned metrics have their own advantages and limitations in quantifying myelin content in the CNS. To compare them, the relevant criteria for a myelin biomarker need to be defined properly. Sensitivity and specificity are often the outstanding criteria. Here, sensitivity refers to the ability of the metric to monitor the variations in myelin content, while specificity describes its exclusivity to myelin variations, i.e. to what extent the variations in the metric values are due to variations in the myelin content only. However, before tackling the sensitivity and specificity of a metric, it is essential to assess its repeatability. Indeed, sensitivity and specificity cannot be determined precisely if the metric values change dramatically between scan sessions. Repeatability refers to the agreement (measurement precision) between two or more measurements made at different time points under the same conditions (e.g., same protocol, same scanner, same subjects, etc.) [42]. Repeatability must not be confused with reproducibility, which refers to the agreement between two or more measurements made at different time points under changing conditions. In both repeatability and reproducibility studies, reliability is a relevant aspect to assess. Reliability compares the variability of scores due to measurement errors to the variability in the "true", error-free scores, i.e. to the variability induced by true variations of the measured feature (e.g., true variations in myelin content).

Review of past studies on qMRI metrics repeatability
The question of repeatability is even more relevant for spinal cord studies, where noise, motion and susceptibility artifacts make it difficult to acquire high-quality images [43]. Previous studies investigated the repeatability of quantitative MRI metrics. Taso et al. [44] reported the repeatability of MTR, ihMTR and DTI (Diffusion Tensor Imaging) indexes within 3 healthy subjects at 3 time points by means of coefficients of variation (CV), defined as the ratio of the between-scan standard-deviation over the mean across scans. However, this index does not allow a proper comparison between different metrics, as the means can differ drastically across metrics or even for a single metric across different studies (e.g., MTR [45]), yielding lower CVs for metrics with higher mean values. Smith et al. [46] also reported the test-retest repeatability of DTI and MT metrics within 9 healthy subjects at 2 time points using the normalized Bland-Altman difference (i.e. mean difference between scans divided by the mean across scans), which likewise makes it harder to compare the repeatability between metrics with different means. Grussu et al. [47] reported the test-retest repeatability of NODDI (Neurite Orientation Dispersion and Density Imaging) indexes within 5 healthy subjects. The test-retest reliability was quantified by means of Intra-Class Correlation (ICC) coefficients, defined as the ratio of the inter-subject variance over the total variance (i.e. the sum of the within- and between-subject variances). Smith et al. [48] assessed the repeatability of MTR and F (fraction of exchanging protons bound to macromolecules) from quantitative magnetization transfer (qMT) imaging by means of the 95% confidence interval for the test-retest difference. However, this estimate of the measurement error was compared neither between metrics nor in the context of the differences observed between (expected) different myelin contents.
The test-retest repeatability has been studied extensively in research fields other than qMRI, notably in rehabilitation research [49][50][51][52][53], which provides useful statistical indexes to quantify repeatability. First, the existence of a systematic bias between test and retest measurements can be examined by the confidence interval for the test-retest difference (CI_d), as used in Smith et al. [48]. Then, the reliability can be assessed by the intra-class correlation coefficient based on a two-way mixed effects model of analysis of variance. Finally, groups can be compared taking measurement errors into account (which is not done with usual statistical tests) using CI_d, showing whether the difference between groups is distinguishable from measurement errors or not. In the same vein, one can compute the Minimal Detectable Change (MDC) to quantify the minimum difference between two single metric values that is necessary to report a "true" error-free change, again taking the measurement errors into account. The MDC is particularly appropriate and intuitive for clinicians who would like to assess whether a treatment affects their patient or not.
In this work, we propose a statistical framework to quantify the test-retest reliability of qMRI metrics. We (i) quantify the repeatability of T1, MTR, MTsat and MTV in the spinal cord using a clinically-compatible protocol and (ii) evaluate the sensitivity of these metrics to myelin content across spinal pathways and age groups, in the context of the test-retest measurement errors.

Data acquisition
Thirty-three right-handed healthy subjects including 19 young (aged 24.9 ± 3.9, from 21 to 33 y.o.; 9 women, 10 men) and 14 elderly (aged 67.4 ± 4.0, from 61 to 73 y.o.; 6 women, 8 men) were recruited. A written consent form was obtained from each participant as supervised by the ethical review board of the Research Center of Montreal University Geriatric Institute (Comité mixte d'éthique de la recherche du RNQ, approval number CMER-RNQ_14-15-010).
To assess the metrics' repeatability, 8 young (aged 24.0 ± 3.9, from 21 to 31 y.o., 2 women, 6 men) and 8 elderly (aged 67 ± 4.5, from 61 to 72 y.o., 2 women, 6 men) subjects from the previously described cohort underwent two scanning sessions: 12 subjects were scanned twice within a 10-month interval, and 4 within the same session (with a 5-minute break out of the scanner between scan and rescan). All data were acquired on a 3T Siemens TIM TRIO scanner with a standard 12-channel head coil and a standard 4-channel neck coil.

Data processing
Analysis was performed using the Spinal Cord Toolbox (SCT) version 2.2.3 [54]. The four datasets were first co-registered, then metrics were calculated. For extracting metrics within specific pathways in the white matter (dorsal column, DC; lateral funiculi, LF; ventral funiculi, VF), data were registered to the MNI-Poly-AMU template [55], which includes an atlas of WM tracts [56]. For the sake of clarity, details about the processing pipeline are included in the supplementary material (see S1 File in section 8. Supporting information).

Repeatability. Systematic change between test and retest
The mean of the difference between test and retest ($\bar{d}$) across subjects was computed along with a 95% confidence interval for the true test-retest difference (CI_d), derived according to:

$$\mathrm{CI}_d = \bar{d} \pm t_{n-1} \cdot SE, \qquad SE = \frac{SD_d}{\sqrt{n}}$$

where SE is the Standard Error, SD_d is the standard-deviation (SD) of the difference between test and retest across the subjects, n is the number of subjects and t_{n-1} is the t statistic with n − 1 degrees of freedom and a type I error of 5% [57]. In our case, t_{n-1} = 2.131.
If zero is not included in CI_d, we can consider that a systematic change between test and retest has occurred [50]. In addition to assessing the systematic bias between test and retest, the CI_d gives the minimum difference between two groups of subjects that is distinguishable from measurement errors.

Absolute test-retest difference
The absolute difference between test and retest, termed |d|, and its mean across subjects ($\overline{|d|}$) were computed to give the reader a basic and direct measure of the magnitude of the measurement errors.
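As an illustration of the indexes above, the following minimal Python sketch computes the mean test-retest difference, its 95% confidence interval and the mean absolute difference for paired scores. The function name, the sign convention (retest minus test) and the sample values are assumptions made for this example only; the critical t value is supplied by the caller (2.131 in this study, where n = 16).

```python
import math
import statistics


def summarize_test_retest(test, retest, t_crit):
    """Summarize paired test-retest scores.

    t_crit is the two-sided 5% critical value of Student's t with
    n - 1 degrees of freedom (e.g. 2.131 for n = 16).
    """
    # Signed differences; "retest minus test" is an assumed convention.
    diffs = [r - t for t, r in zip(test, retest)]
    n = len(diffs)
    d_mean = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample SD (n - 1 denominator)
    se = sd_d / math.sqrt(n)            # standard error of the mean difference
    ci_d = (d_mean - t_crit * se, d_mean + t_crit * se)
    abs_mean = statistics.mean([abs(d) for d in diffs])
    # A systematic change is suspected when zero lies outside CI_d.
    systematic_bias = not (ci_d[0] <= 0.0 <= ci_d[1])
    return {"d_mean": d_mean, "ci_d": ci_d, "abs_mean": abs_mean,
            "systematic_bias": systematic_bias}


# Hypothetical T1 values (ms) for 4 subjects; 3.182 is the critical
# t value for 3 degrees of freedom.
summary = summarize_test_retest([1000.0, 1010.0, 990.0, 1005.0],
                                [1004.0, 1002.0, 996.0, 1001.0],
                                t_crit=3.182)
print(summary)
```

Passing the critical value explicitly keeps the sketch free of third-party dependencies; for other sample sizes it can be looked up in a t table or obtained from a statistics library.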

Reliability
The Intra-Class Correlation (ICC) coefficient is an appropriate coefficient to assess the test-retest reliability [58]. It measures the proportion of variance that is attributable to the "true" error-free scores of subjects (inter-subject variance) compared to the total variance ("true" variance + variance due to measurement errors). The ICC is calculated from a 2-way mixed effects model of repeated-measures analysis of variance, which fits any kind of test-retest experiment design: the total variance is partitioned between within- and between-object (subject) variances. A commonly used index to report repeatability is Pearson's correlation coefficient. The ICC value is often close to Pearson's correlation value. However, the ICC includes a penalization for a systematic error between measurements (in this case, the ICC would be lower than Pearson's) and it can also assess the reliability of a measure based on more than two measurements per subject (thanks to the model of analysis of variance used for computation). Moreover, Pearson's coefficient normalizes each measurement by its own mean and SD, whereas the ICC normalizes the variables by the pooled mean and SD of both measurements. So if the variables do not have a common unit and variance, Pearson's coefficient is more appropriate. But for test-retest measurements having the same units, the ICC is a better index [59].
The higher the ICC, the higher the reliability; the threshold above which the ICC reflects a good reliability remains subjective and depends on the application, but we can still refer to the scale proposed by Shrout and Fleiss [58], Fleiss [60] and Cicchetti [61]: poor < 0.4 < fair < 0.6 < good < 0.75 < excellent ≤ 1. Chinn [62] suggests that a measure needs to have an ICC of at least 0.6 to be useful. Contrary to the other repeatability indexes of this section, the ICC is a dimensionless index.
In this study, the ICC coefficient was computed according to the Matlab implementation of McGraw and Wong [59] (case 3A).
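To make the variance partition concrete, here is a minimal pure-Python sketch of a single-measure, absolute-agreement ICC computed from two-way ANOVA mean squares. It assumes that case 3A of McGraw and Wong corresponds to the form often written ICC(A,1); it is not the Matlab implementation used in the study, and the function names, the boundary handling of the reliability scale and the toy data are illustrative assumptions.

```python
def icc_a1(scores):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    scores is a list of rows, one row per subject and one column per session.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of sessions (2 for test-retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for subjects (rows), sessions (columns) and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # The session term (ms_cols) penalizes a systematic test-retest offset.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error))


def reliability_label(icc):
    """Map an ICC onto the poor/fair/good/excellent scale quoted above.

    Which side each boundary value falls on is an assumption of this sketch.
    """
    if icc < 0.4:
        return "poor"
    if icc < 0.6:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Toy data: identical ranking in both sessions, but a constant offset of 1.
offset_scores = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(icc_a1(offset_scores), reliability_label(icc_a1(offset_scores)))
```

In the toy data the two sessions are perfectly correlated, so Pearson's coefficient would equal 1, whereas the absolute-agreement ICC is lower because of the constant offset between sessions.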

Minimal Detectable Change
Another useful index is the Minimal Detectable Change (MDC). It estimates the minimal difference between two scores that would reflect a "true" difference (i.e., not completely due to measurement error). It can be derived according to:

$$\mathrm{MDC} = 1.96 \cdot \sqrt{2} \cdot SEM, \qquad SEM = SD_{pooled} \cdot \sqrt{1 - \mathrm{ICC}}$$

where SEM is the Standard Error of Measurement and SD_pooled is the standard-deviation across all measurements [49,63]. The MDC can also be interpreted as an interval for repeated measures. If x is the score of a subject for a single measurement, there is a 95% chance that the score of a repeated measurement lies within x ± MDC, assuming that the measurement errors are normally distributed. Any difference within ± MDC between two metric values can be considered as usual variation (due to measurement error); such a difference is not exceptional enough to be considered as a real change in the microstructure.
The MDC and the CI_d are based on the same idea of estimating the magnitude of the difference in metric values that can be due to measurement errors only. However, the MDC applies to two single metric values, whereas the CI_d, which takes into account the sign of the difference between test and retest, applies to group comparisons, where negative measurement errors compensate for positive ones.
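The MDC definition can be sketched in a few lines of Python. The function names and the numerical inputs below are hypothetical and only serve to show how the Standard Error of Measurement and the MDC follow from the SD across all measurements and the ICC.

```python
import math


def standard_error_of_measurement(sd_pooled, icc):
    """SEM = SD across all measurements times sqrt(1 - ICC)."""
    return sd_pooled * math.sqrt(1.0 - icc)


def minimal_detectable_change(sd_pooled, icc):
    """MDC at the 95% level: 1.96 * sqrt(2) * SEM.

    The sqrt(2) factor accounts for measurement error being present
    in both of the two scores that are compared.
    """
    return 1.96 * math.sqrt(2.0) * standard_error_of_measurement(sd_pooled, icc)


# Hypothetical metric with SD_pooled = 10 (metric units) and ICC = 0.75.
mdc_example = minimal_detectable_change(10.0, 0.75)
print(round(mdc_example, 2))
```

With these inputs the SEM is half the pooled SD, so the MDC is about 1.4 pooled SDs; lowering the ICC increases the MDC for the same dispersion.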

Comparison of indexes with different units across studies
To allow the comparison between techniques having different measuring units, one can express the repeatability indexes as a percentage of the mean across all measures, similar to the calculation of the coefficient of variation (CV = 100 · SD/mean). This method works fine when the mean is similar between techniques; otherwise the comparison is biased by the mean. For example, it has been shown that MTR could lead to drastically different mean values when acquired with different offset saturation pulse parameters, e.g. from 9 to 51% in the healthy WM [45]. Hence, normalizing by the mean would yield lower indexes for techniques with higher mean value, whereas these techniques could have the same test-retest repeatability as other techniques with lower mean values. To avoid this while still being able to compare between techniques side by side, we expressed these reliability indexes as a percentage of the SD across subjects of the first MRI session values only (SD_subjects), i.e.:

$$\mathrm{Index}_{\%\ \mathrm{of}\ SD_{subjects}} = 100 \cdot \frac{\mathrm{Index}}{SD_{subjects}}$$

where Index represents any reliability index expressed in the metric unit, such as the MDC. This manipulation enables us to compare metrics side by side while accounting for the property we are looking for. Here, we are looking for a metric that has low test-retest variability relative to the inter-subject variability, i.e. relative to the dispersion of the sample this metric can offer. The SD across subjects is the most basic measure of the sample dispersion. In this way, we would like the index expressed in % of SD_subjects to be as low as possible (i.e., a low measurement error and a high SD across subjects) in order to observe differences between subjects that are higher than measurement errors.
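The normalization can be illustrated with a short Python sketch. The function names are assumptions; the T1 inputs are the values quoted for C3 elsewhere in the paper (MDC = 113.2ms, SD across subjects = 74.3ms, mean = 1007.2ms).

```python
def index_as_percent_of_sd(index, sd_subjects):
    """Express a reliability index (in metric units) in % of the inter-subject SD."""
    return 100.0 * index / sd_subjects


def index_as_percent_of_mean(index, mean):
    """Mean-based normalization (CV-like), shown only for contrast."""
    return 100.0 * index / mean


# T1 at C3: MDC relative to the inter-subject SD and relative to the mean.
t1_mdc_pct_sd = index_as_percent_of_sd(113.2, 74.3)
t1_mdc_pct_mean = index_as_percent_of_mean(113.2, 1007.2)
print(round(t1_mdc_pct_sd, 1), round(t1_mdc_pct_mean, 1))
```

The SD-based value lands close to the 152.3% reported in the paper (the small gap presumably comes from rounding of the published inputs), while the mean-based value looks much smaller simply because the mean T1 is large, which is the bias the SD normalization avoids.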

Sensitivity to myelin content variations.
To assess the metrics' sensitivity to the variations in myelin content across vertebral levels/WM regions relative to the repeatability, differences in group mean (n = 33) between levels/regions were compared along with their measurement error (assessed by the CI_d).
Moreover, a one-way repeated measures ANOVA between levels/regions was performed independently for each metric (n = 33). The assumptions of normal distribution within each group (i.e., level or WM region) and of sphericity were checked using Lilliefors's test and Mauchly's test respectively. When the assumption of sphericity was not met, a Greenhouse-Geisser correction was used to compute the ANOVA. When the ANOVA detected a significant difference, a post hoc multiple comparison test using the Tukey's honestly significant difference criterion was performed in order to find which groups were significantly different from each other.
To test the metrics' sensitivity to the demyelination with aging reported by histology in the literature [64][65][66], for each vertebral level/WM region, means across each age group were compared taking the measurement error (assessed by the CI_d from the previous analysis) into account, in order to investigate whether the difference in means could reflect a "true" difference or whether it is indistinguishable from measurement errors.
In addition, to test for significant differences, we performed independently for each metric, on the larger sample (n = 33, n_young = 19, n_elderly = 14), two-way repeated ANOVAs with the age group as between-subjects factor and, as within-subjects factor:
• vertebral levels, to determine if this effect was consistent across levels (the metric being quantified in the whole WM);
• ROIs (WM, DC, LF, VF), to determine if this effect was consistent across ROIs (the metric being quantified from C2 to C4).
Finally, to complete this study, a power analysis was performed for two-tailed t-tests between young and elderly subjects based on whole WM values of each metric.
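As a rough illustration of such a power analysis, the sketch below uses a normal approximation in place of the t distribution, assumes the same SD in both groups and targets 80% power. These choices, the function names and the example effect size are assumptions for illustration and are not necessarily the procedure used in the study.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def sample_size_per_group(delta, sd, alpha=0.05, power=0.8):
    """Approximate subjects per group needed to detect a mean difference delta."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)   # two-tailed test
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sd / abs(delta)) ** 2)


def approximate_power(delta, sd, n1, n2, alpha=0.05):
    """Approximate power of a two-tailed comparison between two groups.

    The contribution of the opposite rejection tail is neglected.
    """
    se = sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(abs(delta) / se - z_alpha)


# Hypothetical effect of one inter-subject SD, with the group sizes of
# this study (19 young, 14 elderly).
print(sample_size_per_group(delta=1.0, sd=1.0))
print(round(approximate_power(delta=1.0, sd=1.0, n1=19, n2=14), 2))
```

Because the required sample size scales with the inverse square of the difference in means, halving the expected difference roughly quadruples the number of subjects needed.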

Repeatability
Fig 1 shows test and retest multi-parametric maps by vertebral levels, for one single young and one single elderly subject, as well as for the group average (n = 33). The single-subject data look noisy; however, the average map shows a clear distinction between WM and GM. Moreover, the symmetry that can be observed on the group average maps suggests no apparent differences in myelin content between left and right cord. In all metrics, the heterogeneity of values across WM regions suggests different microstructural compositions. For example, the fasciculus cuneatus shows higher MTV than the fasciculus gracilis, suggesting higher myelin content, in agreement with previous histology studies [1,67]. Apart from MTR, all metrics show fairly stable values across vertebral levels.
A guide for reading (and understanding) figures and tables in the paper. Table 1 quantifies the metrics' repeatability over all WM at the different cervical levels, and Table 2 is its analog, quantifying the metrics' repeatability over all reliable levels within the different WM sub-regions. Let us take an example to better explain how to use these repeatability indexes: T1 at C3. Regarding only one scan, the mean T1 across the group is 1007.2ms and the SD is 74.3ms. A 95% confidence interval for the mean test-retest difference of [-38.5; 23.1]ms indicates that if we scan the same group a second time, the mean is likely to lie between 968.7 and 1030.3ms (with 95% probability). Now, if we measure T1 at C3 in a different group (e.g., a group of patients) and the resulting mean lies between 968.7 and 1030.3ms, we will not be able to tell whether the difference in T1 between the two groups is due to measurement errors or to a true difference in T1. The MDC (113.2ms in our example case) is useful, for instance, when a clinician measures the T1 in a new lesion of a patient at one time point t, obtaining T1(t) = x ms. If the measurement is repeated right after, there is a 95% probability that T1(t + 30min) lies within x ± 113.2 ms. Now, if the clinician wants to monitor the evolution of the lesion one year later and measures T1(t + 1year) still within x ± 113.2 ms, it will not be possible to say whether the change between T1(t) and T1(t + 1year) is due to an evolution of the tissue or to measurement errors.
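The worked example above can be written as two small decision rules. This is a minimal Python sketch; the function names and the patient values are hypothetical, while the group mean, the confidence interval and the MDC are the T1 values at C3 quoted in the text.

```python
def retest_interval(group_mean, ci_d):
    """Interval in which a rescan of the same group is expected to fall."""
    lower, upper = ci_d
    return group_mean + lower, group_mean + upper


def group_difference_detectable(other_mean, group_mean, ci_d):
    """True if another group's mean falls outside the retest interval."""
    low, high = retest_interval(group_mean, ci_d)
    return not (low <= other_mean <= high)


def individual_change_detectable(x_before, x_after, mdc):
    """True if the change between two single measurements exceeds the MDC."""
    return abs(x_after - x_before) > mdc


# T1 at C3 (ms): group mean, CI for the test-retest difference, and MDC.
T1_MEAN, T1_CI_D, T1_MDC = 1007.2, (-38.5, 23.1), 113.2

print(retest_interval(T1_MEAN, T1_CI_D))
print(group_difference_detectable(1020.0, T1_MEAN, T1_CI_D))  # hypothetical patient group mean
print(individual_change_detectable(1000.0, 1150.0, T1_MDC))   # hypothetical lesion follow-up
```

The group-level rule relies on the narrower interval derived from the confidence interval of the test-retest difference, whereas the individual-level rule relies on the wider MDC, since errors do not average out for a single subject.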
The ICC and the MDC (expressed as a percentage of the SD across subjects) are useful to compare repeatability across metrics (done more extensively in Fig 4). For example, if we compare T1 to MTR at C3, the ICC is much higher for T1 (0.72) than for MTR (-0.3); note that a negative ICC is interpreted in the same way as a null value (very poor reliability). This is because T1 has a lower test-retest variation ($\overline{|d|}$ at C3 = 47.1ms in Fig 2) compared to the variation between subjects (SD_subjects = 74.3ms in Table 1), whereas MTR has a high test-retest variation ($\overline{|d|}$ at C3 = 1.43% in Fig 2) compared to the variation between subjects (SD_subjects = 1.38% in Table 1). This is also reflected in the MDC ($\mathrm{MDC} = 1.96 \cdot \sqrt{2} \cdot SD_{total} \cdot \sqrt{1 - \mathrm{ICC}}$, with SD_total the SD across all measurements). For T1 at C3, MDC = 113.2ms, which is 152.3% of SD_subjects (Table 1), whereas for MTR at C2, MDC = 3.76%, which is 271.6% of SD_subjects. This result shows that measurement errors in MTR cover almost 3 times the standard variations between subjects, making it difficult to observe true differences in MTR.
The mean absolute test-retest difference ($\overline{|d|}$, displayed in gray at the top left of each graph) is higher at C5 (Fig 2); however, one-way repeated ANOVAs testing the effect of vertebral levels on the absolute test-retest difference did not report significant results (p-values were 0.183, 0.195, 0.389 and 0.579 for T1, MTR, MTsat and MTV respectively). No clear test-retest difference between young and elderly subjects is observed on this graph.
For all metrics and all levels, no significant systematic bias between test and retest is detected (all CI_d include 0, see Table 1). When compared to other metrics, mean MTsat shows minimal variations across vertebral levels (p-values of the repeated ANOVAs between levels were <<0.0001, <<0.0001, 0.02 and <0.0001 for T1, MTR, MTsat and MTV respectively). The ICC coefficient highlights a poor test-retest reliability, barely exceeding 0.5, especially for MTR and MTsat. This point is supported by the MDC, which is generally around 2 times the SD across subjects.

Fig 1. Test and retest maps in a young and an elderly subject at each vertebral level (mean across levels) along with the mean maps across the 33 subjects. All these maps are in the template space. Note that the color bar scale has been adjusted to the mean maps contrast. On a single-subject basis, one can observe a somewhat poor test-retest repeatability, within and across slices. However, despite this poor repeatability, the average maps (here, n = 33) are more consistent in terms of symmetry and tract-specific variations. For example, we can clearly distinguish higher MTV in the fasciculus cuneatus versus the gracilis (dorsal column), which is in agreement with previous histology work [1,67].
Fig 2. Subjects' distribution with test-retest differences quantified over all WM according to vertebral levels. The top and bottom of the orange boxes respectively represent the max and min among test and retest, while the black line in the middle of the box represents the mean. Note that the y-axis does not start from zero for the sake of clarity. The mean absolute difference between test and retest (mean height of orange boxes, $\overline{|d|}$) is displayed in the top left-hand corner of each graph. This figure gives a comprehensive view of the repeatability compared to between-subject differences.
https://doi.org/10.1371/journal.pone.0189944.g002

These observations were confirmed (except for MTsat, which shows large test-retest differences in the DC) by one-way repeated ANOVAs performed between ROIs on the absolute test-retest difference (p-values <0.01, 0.01, 0.08, <0.01 for T1, MTR, MTsat, MTV respectively). In addition, similar repeatability is found when the metrics are estimated over all WM or within the DC or the LF. Table 2 quantifies the metrics' repeatability within sub-ROIs of the WM from C2 to C4. Interestingly, MTsat performs very differently depending on the ROI, yielding the worst repeatability result in the DC (ICC = 0.1, MDC ≈ 3 inter-subject SDs) and the best one in the LF (ICC = 0.82, MDC ≈ 1.2 inter-subject SDs). Note however that estimating the metric at several levels (here, C2 to C4) is not favorable to MTsat given that its ICC in WM at C4 is half its ICC at C3 (Table 1). Overall, T1 and MTV yield the best results. MTV regularly shows a fair repeatability whatever the ROI is, with an MDC about 1.5 to 2 times the inter-subject SD (which is equivalent to 87-95% of the sample distribution). In the level-wise analysis, MTV performs slightly better than T1. We suspect that these results reflect the clearer delineation between the cord sub-regions and the more homogeneous values in those sub-regions that could be observed in MTV maps when compared to T1 or even MTsat maps (Fig 1). Furthermore, as expected, MTR regularly performs worst, in part because of the low contrast between subjects it exhibits, whatever the ROI is.

Fig 4 compares three main repeatability indexes (absolute test-retest difference, ICC and MDC) between the different metrics. While no particular metric stands out from this comparison, MTR seems to be the least reliable at every level. For most of the vertebral levels, $\overline{|d|}$ of MTR is on the same order as the inter-subject SD (which is equivalent to 68% of the population if we assume a Normal distribution for the sample), the ICC is below 0.4 at every level and the MDC exceeds 2.5 inter-subject SDs (equivalent to 98.8% of the population) at 2 levels out of 4. When considering the effect of vertebral level, C5 seems to be the least reliable (ICC < 0.5 for all metrics). Regarding the effect of WM regions (Fig 4B), some differences are observed. For instance, MTsat yields the best ICC score in the LF (0.82) and the worst in the DC (0.1).
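The conversions quoted in this section between multiples of the inter-subject SD and percentages of the population follow from the normal distribution. A minimal Python sketch, assuming normally distributed values and using an illustrative function name:

```python
from statistics import NormalDist


def coverage_within_k_sd(k):
    """Fraction of a normal population lying within +/- k SDs of its mean."""
    return 2.0 * NormalDist().cdf(k) - 1.0


# Multiples of the inter-subject SD mentioned in the text.
for k in (1.0, 1.5, 2.0, 2.5):
    print(k, round(100.0 * coverage_within_k_sd(k), 1))
```

Read this way, an MDC of 2.5 inter-subject SDs means that nearly all pairs of healthy subjects differ by less than what the measurement can resolve.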

Sensitivity to myelin content
This section deals with the larger sample (n = 33 subjects). Fig 5 plots the group mean along with the measurement error magnitude (CI_d) in order to allow the reader to assess whether differences between vertebral levels or WM regions can be distinguished from measurement errors or not. Individual subjects' data are also plotted to show whether differences between subjects can be detected despite the measurement error. However, for individual comparison, measurement errors are assessed by the MDC, which is much larger than the CI_d (as negative and positive errors do not compensate for each other). Only T1 and MTV seem to allow the comparison between some healthy subjects.

From left to right, the columns correspond to the mean ± SD across subjects (n = 16) based on values from the first scan session only, the 95% confidence interval for the true test-retest difference, the ICC coefficient and the MDC. All numbers are in the metric unit except those in square brackets, which are expressed as a percentage of the SD across subjects to quantify the repeatability relative to the inter-subject difference, i.e. the reliability.

Effects of vertebral levels and WM regions.
The differences that are distinguishable from measurement errors are summarized in Table 3, along with the results of the one-way repeated ANOVAs. One can observe that some cases show significant differences, but those differences are too small to be distinguished from measurement errors. This is the case for MTR, which is significantly different between every vertebral level, although only the difference between C2 and C5 is large enough to be attributed to something other than measurement errors. Also, significant differences between WM regions are found with MTR and T1, but none of them is larger than measurement errors.

Effect of age.
Fig 6 compares the differences between young and elderly to the measurement errors assessed by the CI_d. With all metrics within every spinal cord region (vertebral level or WM region), the difference between young and elderly can always be explained by measurement errors only. Moreover, the repeated ANOVAs did not report any significant effect of age for any metric, either level-wise or ROI-wise. However, we can still notice some general trends: T1, MTR and MTV generally support the demyelination with aging histologically reported in the literature, whereas MTsat constantly shows the reverse trend.

To complete this study, Table 4 reports the statistical power analysis. From this analysis, one can compare the difference that can be detected given the metrics' test-retest errors (length of the CI_d, 2nd column) to the minimum difference in the true metric values required to detect a significant difference (1st column) between young and elderly (with a fair test power). We can notice for example that, given the measurement errors of MTR (1.36%), even if the difference in means were large enough (≥1.27%) to yield a significant result, the imprecision of measurement is too large to detect such a difference. This is not the case with the other metrics. Moreover, the observed differences in means (3rd column) are very low compared to the difference needed to obtain significant results (1st column), yielding very low statistical power for those tests (4th column). Finally, given the large sample size required to obtain a significant difference (5th column), T1 and MTV do not seem sensitive to age groups (based on their mean WM values in this study).

Discussion
This study proposes a statistical framework for comparing clinically feasible myelin imaging techniques (T1, MTR, MTsat and MTV) in the cervical spinal cord.

Myelin-sensitive metrics values in the spinal cord
The resulting mean values across subjects are in agreement with previous studies. Stikov et al. [68] observed a T1 around 1000ms in the brain, which is comparable to the T1 in the spinal cord WM in-vivo at 3T [69,70]. The same holds for our MTV measurements, which are in agreement with reported PD values [12,18-23,69,71]. There is no gold-standard for clinically feasible MT-based protocols due to their dependence on pulse sequence parameters. However,

Repeatability
Even for the most reliable metrics (T1 and MTV, see Fig 4), the ICC is moderate (around 0.5) and the MDC is on the order of two inter-subject SDs. Given the test-retest variations, the minimal difference between individual healthy subjects that can be detected with these metrics (MDC) is much larger than the usual variations we observed (see Fig 5). Looking at groups of subjects, significant differences between spinal cord regions stand out, but they are still not large enough to be distinguished from measurement errors (quantified by the CI_d in this case, as shown in Fig 5).
In comparison with the brain, repeatability in the spinal cord is hampered by multiple sources of artifacts (motion, susceptibility) and low SNR [43]. Better repeatability might be achieved with coarser resolution and/or more averaging, though at the cost of longer acquisition times, which could be associated with more subject motion.
Taso et al. [44] reported results for myelin-related metrics in the spinal cord WM: a CV of 5.3% for MTR and 2.9% for ihMT ratio. However, this study reported the repeatability in terms of CVs, which are misleading when comparing metrics with different units and/or dynamic ranges (as mentioned in section 2.3.1. Repeatability). Smith et al. [48] reported a CI_d of [−3%, +5%] for MTR over all WM from C2 to C5 within 10 young healthy subjects. Even if the repeatability of the metrics reported in our study is not good enough to differentiate between WM regions or age groups, it is still much better (CI_d of [−0.99%, +0.54%] for MTR). This may suggest that significant differences not accounting for the precision of measurements might have been reported in the literature, whereas they could be explained by measurement errors alone.
Looking at the metrics individually, T 1 -based metrics (MTV and T 1 ) generally show the best reliability (Fig 4). Regarding sensitivity to myelin, MTV shows clearer delineation of the GM and smooth variations in the WM (Fig 1), but no difference between WM regions stood out when compared to the measurement error. When looking at individual maps, T 1 seems particularly affected by cord movements and compressions occurring during respiratory and cardiac cycles (Fig 1), which produces statistically significant differences (see Table 3), but those differences are not larger than measurement errors. The same applies for MTR, which emerges as the less reliable metric due to its very small variation between subjects (Fig 4). However, MTR is the only metric exhibiting a significant effect that accounts for measurement error (difference between vertebral levels C2 and C5 in Table 3). This decrease in MTR towards lower levels could reflect a true decrease in myelin content, but could also be due to B 1 + inhomogeneity. MTR variations due to B 1 errors have already been reported in the brain [76] and correcting for them should be further investigated in the spinal cord. MT sat minimizes the T 1 contribution included in MTR, and is thereby less variable across vertebral levels.

Fig 5. Comparison across vertebral levels and WM regions along with the measurement errors for the group mean (n = 33) and individual subjects.
The red envelope represents the 95% confidence interval for the test-retest difference (CId), which assesses the measurement error magnitude of the group mean (in black). The orange envelope represents the Minimal Detectable Change (MDC), i.e. the difference required to distinguish individual subjects (faded gray lines). Note that the group mean approaching the edges of the CId (red envelope) reflects an asymmetric confidence interval due to a non-null offset between test and retest (non-null mean test-retest difference, d). However, no offset was large enough to report a significant systematic bias between test and retest (see section 3.1. Repeatability, Table 1 and Table 2).
https://doi.org/10.1371/journal.pone.0189944.g005

Table 3. Comparison of significantly different vertebral levels (A) or WM regions (B) with differences larger than measurement errors.
(A) Analysis by vertebral levels. (B) Analysis by WM regions.
For each analysis (A, B), the left column reports the results of the one-way repeated-measures ANOVAs, whereas the right column reports the vertebral levels/WM regions showing differences larger than measurement errors (see also Fig 5).
https://doi.org/10.1371/journal.pone.0189944.t003

Fig 6. Comparison between young (n = 19) and elderly (n = 14) subjects along with measurement errors.
For each case, the corresponding 95% confidence interval for the mean test-retest difference (CId), estimated from the test-retest analysis (see section 3.1. Repeatability), was centered at the mean of each group, in order to assess whether the difference between young and elderly is larger than the test-retest errors or not. For all metrics and within every spinal cord region (vertebral level or WM region), the difference in means between young and elderly was indistinguishable from measurement errors.
https://doi.org/10.1371/journal.pone.0189944.g006

Sensitivity and specificity to myelin with MRI
The assessment of the sensitivity of metrics to myelin content remains difficult, due to the lack of a ground truth. A loss of myelinated fibers with aging (mainly the small caliber ones) was observed histologically in the brain [77] and cervical spinal cord [64][65][66], but it remains unclear whether these variations can currently be detected by clinical MRI. Age effects have been reported in the brain with MTR [78] and DTI [79][80][81][82]. In the spinal cord, most age effects are reported with DTI [83][84][85]. One study investigated MTR evolution in the spinal cord during aging, but no significant effect was reported [44]. The same study reported a decrease in ihMT ratio between subjects aged 35 to 50 and subjects aged over 50, although without accounting for measurement errors. Our study did not observe any difference between age groups, with or without accounting for measurement error (Fig 6). This lack of sensitivity to aging could be due to the choice of acquisition parameters, the small effect/sample size, or simply due to a lack of true differences in myelination.
As noted in the introduction, some of the myelin-sensitive techniques are also hampered by confounding factors. For example, T2* is affected by iron content, fiber orientation, blood vessels and blood oxygen level. MTR is affected by T1 and B1 field, and more generally, magnetization transfer and MTV are sensitive to macromolecules (i.e., not only myelin). For each of these techniques, there are ways to mitigate those confounds. For example, quantitative susceptibility maps could inform T2* maps, or T1 and B1+ fields could be acquired to correct MTR maps [76]. All these strategies come at the cost of additional scan time, and possibly larger output variance (due to the introduction of yet other noisy measures).
While DTI has some intrinsic limitations, other techniques also based on diffusion-weighted imaging might offer more sensitivity to myelin. It is important to note, however, that because water protons trapped between myelin sheaths have a short T2 (around 10 ms at 3T, which could be quantified using myelin water fraction techniques) and because protons from bound molecules have an even shorter T2 (order of μs, which could be quantified with ultrashort TE imaging or magnetization transfer techniques), diffusion-weighted protocols typically use a TE (> 60ms) too long to be sensitive to signal coming from the myelin (and from water trapped in it). Some advanced diffusion-weighted techniques include NODDI [47,86], which can notably estimate the intra-cellular volume fraction, and CHARMED/AxCaliber [87][88][89], which can notably estimate the hindered (extra-cellular) and restricted (intra-cellular) water fraction. All these metrics are thus indirectly related to the myelin volume fraction, although additional information would be required to be able to quantify absolute myelin content.
To improve specificity to myelin, combining several metrics, using for example independent component analysis, or acquiring maps of confounding factors for a posteriori corrections, might be advisable [90]. Future work will be undertaken in this direction [91].

Perspective of repeatability assessment
Repeatability assessment is crucial for the development of qMRI biomarkers. Our results show that significant differences between groups can be reported with standard statistical tests, yet these differences are comparable to (or even smaller than) test-retest measurement errors. Controlling for both aspects (statistical significance and measurement errors) is necessary for qMRI studies.
The indexes reported in this work (95% confidence interval for the test-retest difference (CId), ICC and MDC) are useful for quantifying repeatability and allowing comparisons across studies. As mentioned before, the coefficient of variation depends on the magnitude of the metric, and should not be the primary index for assessing repeatability, especially if metrics have different means or units. The CId first makes it possible to check for a potential systematic bias between measurements (i.e. scan sessions). In addition, it gives an estimation of the measurement error for group averages. In the same vein, the MDC provides the minimum difference between two individual measurements that is required to report a true difference, taking into account the measurement errors. For example, the CId would be useful for researchers comparing different populations, whereas the MDC would be useful for a clinician needing to assess the evolution of a WM lesion within a single patient. Furthermore, the ICC has the advantage of being dimensionless, and can thus be easily compared to assess reliability across metrics, studies, vendors or sites. Aside from providing a robust quantification of the repeatability with two measurements (test-retest studies), the ICC (and consequently, the MDC) can also be consistently used with more than two measurements. Those reliability indexes have already been extensively used in test-retest studies from other research fields, such as rehabilitation, where the precision of tests is crucial [49][50][51][52][53]. In this work, the absolute test-retest difference (|d|) was reported to provide the reader with a direct and basic measure of measurement errors; however, this index is not sufficient to estimate the repeatability and compare it across studies.
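To make these indexes concrete, the following minimal sketch computes them from paired test-retest values. It is not the analysis code shared with this work: the exact definitions are those of section 2.3.1, whereas the sketch assumes commonly used forms, namely a two-way, absolute-agreement, single-measurement ICC, SEM = SD × sqrt(1 − ICC), MDC = z × sqrt(2) × SEM, and a normal (rather than Student t) quantile for the confidence interval. The function name and the example MTR values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def reliability_indexes(test, retest, level=0.95):
    """Return mean difference, CI of the difference, ICC, SEM and MDC.

    Assumed forms (not taken from the paper's methods section):
    two-way absolute-agreement single-measurement ICC, normal quantile.
    """
    test, retest = list(test), list(retest)
    if len(test) != len(retest) or len(test) < 3:
        raise ValueError("need paired values from at least 3 subjects")
    n, k = len(test), 2                       # subjects, sessions
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%

    # Confidence interval of the mean test-retest difference (group level).
    d = [b - a for a, b in zip(test, retest)]
    d_mean = mean(d)
    half_width = z * stdev(d) / math.sqrt(n)
    ci_d = (d_mean - half_width, d_mean + half_width)

    # Two-way ANOVA mean squares: subjects (rows), sessions (columns), residual.
    grand = mean(test + retest)
    subj_means = [(a + b) / 2 for a, b in zip(test, retest)]
    sess_means = [mean(test), mean(retest)]
    ms_subj = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    ms_sess = n * sum((m - grand) ** 2 for m in sess_means) / (k - 1)
    ss_err = sum(
        (x - s - c + grand) ** 2
        for row, s in zip(zip(test, retest), subj_means)
        for x, c in zip(row, sess_means)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    icc = (ms_subj - ms_err) / (
        ms_subj + (k - 1) * ms_err + k * (ms_sess - ms_err) / n
    )

    # Standard error of measurement and minimal detectable change (individual level).
    sd_all = stdev(test + retest)
    sem = sd_all * math.sqrt(max(0.0, 1.0 - icc))  # guard against rounding below 0
    mdc = z * math.sqrt(2) * sem
    return {"d_mean": d_mean, "ci_d": ci_d, "icc": icc, "sem": sem, "mdc": mdc}


# Hypothetical MTR values (%) for six subjects scanned twice.
mtr_test = [50.1, 51.3, 49.8, 52.0, 50.7, 51.1]
mtr_retest = [50.6, 50.9, 50.3, 51.2, 51.0, 50.5]
indexes = reliability_indexes(mtr_test, mtr_retest)
print(indexes)
```

In this form the CI of the difference shrinks with the number of subjects, which is why it suits group comparisons, whereas the MDC does not, which is why it is the relevant threshold for a single patient.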
Finally, the assessment of the repeatability needs to be adapted to the study goals. Indeed, the ICC depends on the sample homogeneity. Therefore, if the goal is to differentiate between the microstructure of healthy subjects, including patients in the sample will artificially increase the between-subjects variability and overestimate the ICC. In this study, we can confidently assert that the ICC is lower (and the MDC is higher) than it would have been for a sample that includes patients and controls. Therefore, if the goal is to distinguish between pathological cases, we recommend including the different types of tissue (healthy and pathological tissues, with different stages of the disease) in the cohort. This way, the MDC and ICC would integrate the associated between-subjects variability.
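The dependence of the ICC on sample homogeneity can be illustrated with a deliberately simplified variance-components view, in which the ICC is the fraction of total variance that comes from true between-subject differences. This simplified model and the numbers below are assumptions chosen for illustration only; they are not estimates from this cohort.

```python
def icc_from_variances(sd_between, sd_error):
    """ICC as the share of total variance due to between-subject differences."""
    return sd_between ** 2 / (sd_between ** 2 + sd_error ** 2)


sd_error = 1.0  # hypothetical test-retest error, identical in both scenarios

# Homogeneous sample (healthy subjects only): small between-subject spread.
icc_healthy_only = icc_from_variances(sd_between=1.0, sd_error=sd_error)

# Heterogeneous sample (patients and controls): larger between-subject spread.
icc_mixed = icc_from_variances(sd_between=3.0, sd_error=sd_error)

print(f"healthy only: ICC = {icc_healthy_only:.2f}")
print(f"patients and controls: ICC = {icc_mixed:.2f}")
```

With the same measurement error, the ICC rises from 0.5 to 0.9 simply because the sample is more heterogeneous, which is why the cohort used to assess repeatability should match the population targeted by the study.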

Data sharing
Due to IRB restrictions, not all data used here could be publicly shared. However, we obtained specific consent for sharing MRI data from four young volunteers. Three of them were part of the tested and retested group. Along with those datasets, we provide the batch scripts used to produce the myelin-sensitive metric maps and to register them to the spinal cord template and white matter atlas. Also available is a Microsoft Excel spreadsheet gathering all results of the metric estimations within each region of interest for every scan session and every volunteer of the cohort. The first tab of the sheet corresponds to the tested and retested cohort only (n = 16), and the second tab corresponds to the whole cohort (n = 33). Finally, also shared are the scripts to extract these metric values, to compute the statistical indices for reliability assessment and to produce the figures presented in this work. All these data and code are available at: https://osf.io/ezmrj/.

Conclusion
In this study, we assessed the repeatability and distribution of myelin-sensitive metrics (T1, MTR, MTsat and MTV) in the spinal cord. T1 and MTV (1 − proton density) showed the best reliability relative to the inter-subject variations, but the measurement error remains too large to detect differences between healthy individuals. T1, MTR and MTV showed trends consistent with the hypothesis of demyelination with aging, but again the differences were not large enough to be distinguishable from measurement errors, or to be significant.
This study used a range of statistical tools to explore the differences between myelin-sensitive metrics. We show that even though statistically significant differences can be reported using standard statistical tests, an important proportion of these differences can be attributed to measurement error. In particular, the coefficient of variation is a misleading index when comparing metrics with different units, and we recommend using the MDC when comparing individual measurements, and the 95% confidence interval of the test-retest difference when comparing groups. The indexes explored in this study allow for a fair comparison of qMRI metrics across studies, MRI vendors and sites, leading toward standardizing the field of myelin imaging and increasing its clinical relevance.

Supporting information
S1 File. Data processing pipeline. This section describes the data processing steps performed to estimate MTR, MTsat, T1 and MTV maps and to register those maps to the MNI-Poly-AMU template [55] and WM atlas [56].