The authors have declared that no competing interests exist.

Current address: Centre d'Exploration Métabolique par Résonance Magnétique (CEMEREM), AP-HM, Hôpital de la Timone, Pôle d'imagerie médicale, Marseille, France

To implement a statistical framework for assessing the precision of several quantitative MRI metrics sensitive to myelin in the human spinal cord: T_{1}, Magnetization Transfer Ratio (MTR), saturation imposed by an off-resonance pulse (MT_{sat}) and Macromolecular Tissue Volume (MTV).

Thirty-three healthy subjects within two age groups (young, elderly) were scanned at 3T. Among them, 16 underwent the protocol twice to assess repeatability. Statistical reliability indexes such as the Minimal Detectable Change (MDC) were compared across metrics quantified within different cervical levels and white matter (WM) sub-regions. The differences between pathways and age groups were quantified and interpreted in the context of the test-retest repeatability of the measurements.

The MDC was 105.7ms, 2.77%, 0.37% and 4.08% for T_{1}, MTR, MT_{sat} and MTV, respectively, when quantified over all WM, while the standard-deviation across subjects was 70.5ms, 1.34%, 0.20% and 2.44%. Even though particular WM regions did exhibit significant differences, these differences were of the same order as test-retest errors. No significant difference was found between age groups for any metric.

While T_{1}-based metrics (T_{1} and MTV) exhibited better reliability than MT-based measurements (MTR and MT_{sat}), the observed differences between subjects or WM regions were comparable to (and often smaller than) the MDC. This makes it difficult to determine if observed changes are due to variations in myelin content, or simply due to measurement error. Measurement error remains a challenge in spinal cord myelin imaging, but this study provides statistical guidelines to standardize the field and make it possible to conduct large-scale multi-center studies.

Precise techniques are needed to monitor microstructural degeneration of the nervous tissue in clinics, especially for longitudinal follow up of white matter (WM) lesions in neurodegenerative pathologies, such as demyelination in multiple sclerosis. Rather than using MRI as a technique for simply viewing the anatomy, quantitative MRI (qMRI) aims to provide quantitative

The longitudinal relaxation time T_{1} has shown high correlation with the myelin volume quantified by histology [_{1} is also affected by iron concentration [_{1} and fraction F of exchanging protons bound to macromolecules) [_{sat}) has been proposed to minimize T_{1} effects and increase the specificity to myelin [

Proton density (PD) is also a promising metric, as it measures the density of MRI-visible protons–i.e. protons with sufficiently long transversal relaxation time (T_{2})–which are water (or liquid) protons. In the Central Nervous System (CNS), the complement of PD yields an estimate of the density of non-free protons, which are mostly bound to lipids and other macromolecules. Since myelin consists of 70 to 80% lipids and some macromolecules [

Myelin Water Imaging (MWI) using multi-echo T_{2} [

The time constant of the transverse relaxation due to spin-spin interactions and local field inhomogeneities (T_{2}*) has also exhibited sensitivity to myelin [_{2}* includes important contributions from other factors, such as iron content [

Inhomogeneous Magnetization Transfer (ihMT) ratio is another recent metric [

The above-mentioned metrics have their own advantages and limitations in quantifying myelin content in the CNS. To compare them, the relevant criteria for a myelin biomarker need to be defined properly.

The question of repeatability is even more relevant for spinal cord studies, where noise, motion and susceptibility artifacts make it difficult to acquire high quality images [

The test-retest repeatability has been studied extensively in research fields other than qMRI, notably in rehabilitation research [_{d}), as used in Smith et al. [_{d}, showing whether the difference between groups is distinguishable from measurement errors or not. In the same vein, one can compute the Minimum Detectable Change (MDC) to quantify the minimum difference between two single metric values that is necessary to report a “true” error-free change, again taking the measurement errors into account. The MDC is particularly appropriate and intuitive for clinicians who would like to assess whether a treatment affects their patient or not.

In this work, we propose a statistical framework to quantify the test-retest reliability of qMRI metrics. We _{1}, MTR, MT_{sat} and MTV in the spinal cord using a clinically-compatible protocol and

Thirty-three right-handed healthy subjects including 19 young (aged 24.9 ± 3.9, from 21 to 33 y.o.; 9 women, 10 men) and 14 elderly (aged 67.4 ± 4.0, from 61 to 73 y.o.; 6 women, 8 men) were recruited. A written consent form was obtained from each participant as supervised by the ethical review board of the Research Center of Montreal University Geriatric Institute (Comité mixte d’éthique de la recherche du RNQ, approval number CMER-RNQ_14-15-010).

To assess the repeatability of the metrics, 8 young (aged 24.0 ± 3.9, from 21 to 31 y.o., 2 women, 6 men) and 8 elderly (aged 67 ± 4.5, from 61 to 72 y.o., 2 women, 6 men) subjects from the previously described cohort underwent two scanning sessions: 12 subjects were scanned twice within a 10-month interval, and 4 within the same session (with a 5-minute break out of the scanner between scan and rescan). All data were acquired on a 3T Siemens TIM TRIO scanner using a standard 12-channel head coil and a standard 4-channel neck coil.

The protocol consisted of:

One sagittal turbo-spin-echo 3D SPACE T_{2}-weighted anatomic image (TR = 1500 ms; TE = 119 ms; flip angle = 120°; BW = 723 Hz/voxel; matrix = 384x384x52; resolution = 1x1x1 mm; FOV = 384x384x52 mm) with a high contrast between cord and cerebrospinal fluid (CSF) to further take the curvature of the cord into account in the data processing;

Four 3D FLASH acquisitions (TR = 35 ms; TE = 5.92 ms; BW = 260 Hz/voxel; matrix = 192x192x22; resolution = 0.9x0.9x5 mm; gap = 1 mm; FOV = 174x174x110 mm; R = 2 acceleration; phase encoding direction = right-left). The four FLASH scans consisted of:

One with a prior RF saturation pulse (Gaussian-shaped, duration = 9984 μs, offset frequency = 1.2 kHz) and an excitation flip angle of 10°;

Three without a saturation pulse and flip angles of 4°, 10°, and 20°;

Two axial 2D segmented spin-echo EPI acquisitions (TR = 3000 ms; TE = 19 ms; BW = 1905 Hz/voxel; matrix = 64x64, 17 slices; resolution = 3.0x3.0x5.5 mm; FOV = 192x192 mm) with flip angles of 60° and 120°, respectively (for B_{1}^{+} estimation purposes);

All images spanned at least C2 to C5 vertebral bodies. The duration of the protocol was 18 minutes.
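As an illustration of how the FLASH acquisitions listed above can yield MTR and T_{1}, here is a minimal Python sketch. It assumes the standard MTR definition and a linearised variable-flip-angle fit of the ideal spoiled gradient-echo signal, which is not necessarily the fitting procedure used in this study. The function names are hypothetical, the signals are simulated, and B_{1}^{+} correction of the nominal flip angles is deliberately left out.

```python
import math

def mtr_percent(s_ref, s_mt):
    """MTR (%) from the signal without (s_ref) and with (s_mt) the saturation pulse."""
    return 100.0 * (s_ref - s_mt) / s_ref

def spgr_signal(m0, t1_ms, tr_ms, flip_deg):
    """Ideal spoiled gradient-echo (FLASH) signal, ignoring T2* decay and B1+ errors."""
    e1 = math.exp(-tr_ms / t1_ms)
    a = math.radians(flip_deg)
    return m0 * math.sin(a) * (1.0 - e1) / (1.0 - e1 * math.cos(a))

def vfa_t1(signals, flip_degs, tr_ms):
    """T1 (ms) from variable flip angle data.

    Uses the linear form S/sin(a) = E1 * S/tan(a) + M0 * (1 - E1):
    the least-squares slope estimates E1 = exp(-TR/T1).
    """
    xs = [s / math.tan(math.radians(a)) for s, a in zip(signals, flip_degs)]
    ys = [s / math.sin(math.radians(a)) for s, a in zip(signals, flip_degs)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -tr_ms / math.log(slope)

# Simulated, noise-free signals for the protocol's TR and flip angles
# (M0 = 1000 and T1 = 1000 ms are hypothetical values, not study data).
flips = [4.0, 10.0, 20.0]
sim = [spgr_signal(1000.0, 1000.0, 35.0, a) for a in flips]
t1_est = vfa_t1(sim, flips, 35.0)
```

On noise-free simulated data the fit recovers the simulated T_{1}; with real data, noise and flip angle errors propagate into the estimate, which is one reason test-retest precision has to be measured rather than assumed.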

Analysis was performed using the

Statistical analyses were performed using MATLAB R2014a (The MathWorks, Inc., Natick, Massachusetts, USA) and SPSS (IBM SPSS Statistics–Release 24.0.0.0) at the 0.05 significance level unless otherwise stated.

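The mean of the difference between test and retest (\bar{d}) was computed, along with its 95% confidence interval (CI_{d}) derived according to:

$$CI_{d} = \bar{d} \pm t_{n-1} \cdot \frac{SD_{d}}{\sqrt{n}}$$

where SD_{d} is the standard-deviation (SD) of the difference between test and retest across the subjects, n is the number of subjects, and t_{n−1} is the t statistic with n − 1 degrees of freedom; here, n = 16 so t_{n−1} = 2.131.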

If zero is not included in _{d}, we can consider that a systematic change between test and retest has occurred [_{d} gives the minimum difference between two subjects groups that is distinguishable from measurement errors.
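The confidence interval for the mean test-retest difference can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sign convention (retest minus test) is an assumption, and the critical value 2.131 is the one quoted above for 16 subjects.

```python
import math
import statistics

T_CRIT_N16 = 2.131  # t statistic for n - 1 = 15 degrees of freedom (n = 16)

def ci_d(test, retest, t_crit):
    """95% confidence interval for the mean test-retest difference.

    test, retest: per-subject metric values from the two sessions.
    Differences are taken as retest - test (an assumed convention).
    """
    diffs = [r - t for t, r in zip(test, retest)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    half_width = t_crit * statistics.stdev(diffs) / math.sqrt(n)
    return mean_d - half_width, mean_d + half_width

def systematic_change(ci):
    """True when zero lies outside the interval, i.e. a test-retest offset is likely."""
    lower, upper = ci
    return not (lower <= 0.0 <= upper)
```

For a different number of subjects, `t_crit` must be replaced by the t statistic for the corresponding degrees of freedom.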

The absolute difference between test and retest, termed |

The Intra-Class Correlation (ICC) coefficient is an appropriate coefficient to assess the test-retest reliability [

The higher the ICC, the higher the reliability; the upper threshold above which the ICC would reflect a good reliability remains subjective and depends on the application but we can still refer to the scale proposed by Shrout and Fleiss [

In this study, the ICC coefficient was computed according to the Matlab implementation of McGraw and Wong [

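Another useful index is the Minimal Detectable Change (MDC). It estimates the minimal difference between two scores that would reflect a “true” difference (i.e., not completely due to measurement error). It can be derived according to:

$$MDC = 1.96 \cdot \sqrt{2} \cdot SEM = 1.96 \cdot \sqrt{2} \cdot SD_{pooled} \cdot \sqrt{1 - ICC}$$

where SEM is the standard error of measurement and SD_{pooled} is the standard-deviation across all measurements [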

The MDC and the CI_{d} are based on the same idea of estimating the magnitude of the difference in metric values that can be due to measurement errors alone. However, the MDC applies to two single metric values, whereas the CI_{d}, which takes into account the sign of the difference between test and retest, applies to group comparisons where negative measurement errors compensate for positive ones.
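A minimal Python sketch of the MDC computation follows. It assumes the usual definition based on the standard error of measurement, MDC = 1.96 × √2 × SD_{pooled} × √(1 − ICC), and the function names are hypothetical. The last helper expresses an index as a percentage of the SD across subjects, as done for the bracketed values in the repeatability tables.

```python
import math

def sem(sd_pooled, icc):
    """Standard error of measurement from the pooled SD and the ICC."""
    return sd_pooled * math.sqrt(1.0 - icc)

def mdc95(sd_pooled, icc):
    """Minimal detectable change at the 95% level for two single measurements."""
    return 1.96 * math.sqrt(2.0) * sem(sd_pooled, icc)

def as_percent_of_sd(value, sd_subjects):
    """Express a repeatability index relative to the SD across subjects."""
    return 100.0 * value / sd_subjects
```

For instance, `as_percent_of_sd(113.2, 74.3)` gives about 152%, in line with the value reported for T_{1} at C3 (the small discrepancy comes from rounding of the tabulated inputs).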

To allow the comparison between techniques having different measuring units, one can express the repeatability indexes as a percentage of the mean across all measures, similar to calculation of the coefficient of variation (_{subjects}), i.e.:

To assess the metrics sensitivity to the variations in myelin content across vertebral levels/WM regions relative to the repeatability, differences in group mean (_{d}).

Moreover, a one-way repeated measures ANOVA between levels/regions was performed independently for each metric (

To test the metrics sensitivity to the demyelination with aging reported by histology in the literature [_{d} from the previous analysis) into account in order to investigate whether the difference in means could reflect a “true” difference or whether it is indistinguishable from measurement errors.

In addition, to test for significant differences, we performed independently for each metric, on the larger sample (n_{young} = 19, n_{elderly} = 14), two-way repeated ANOVAs with the age group as between-subjects factor and, as within-subjects factor:

vertebral levels to determine if this effect was consistent across levels (the metric being quantified in the whole WM);

ROIs (WM, DC, LF, VF) to determine if this effect was consistent across ROIs (the metric being quantified from C2 to C4).

Finally, to complete this study, a power analysis was performed for two-tailed

All these maps are in the template space. Note that the color bar scale has been adjusted to the mean maps contrast. On a single-subject basis, one can observe a somewhat poor test-retest repeatability, within and across slices. However, despite this poor repeatability, the average maps (here, n = 33) are more consistent in terms of symmetry and tract-specific variations. For example, we can clearly distinguish higher MTV in the fasciculus cuneatus than in the fasciculus gracilis (dorsal column), which is in agreement with previous histology work [

_{1} at C3. Regarding only one scan, the mean T_{1} across the group is 1007.2ms and the SD is 74.3ms. A 95% confidence interval for the mean test-retest difference of [-38.5; 23.1]ms indicates that if we rescan the same group a second time, the mean is likely to lie between 968.7 and 1030.3ms (with 95% probability). Now, if we measure T_{1} at C3 in a different group (e.g., a group of patients) and the resulting mean lies between 968.7 and 1030.3ms, we will not be able to report whether the difference in T_{1} between the two groups is due to measurement errors or to a true difference in T_{1}. The MDC (113.2ms in our example case) will be useful for instance in a case where a clinician measures the T_{1} in a new lesion of his patient at one time point _{1}(_{1}(_{1}(_{1}(_{1}(

The top and bottom of the orange boxes respectively represent the max and min among test and retest, while the black line in the middle of the box represents the mean. Note that the y-axis does not start from zero for the sake of clarity. The mean absolute difference between test and retest (mean height of orange boxes,

The top and bottom of the orange boxes are respectively the max and min among test and retest, while the black line in the middle of the box is the mean. The mean absolute test-retest difference (mean height of orange boxes,

Metric | Vertebral level | Mean ± SD_{subjects} | CI_{d} | ICC | MDC [% of SD_{subjects}] |
---|---|---|---|---|---|
T_{1} (ms) | C2 | 964.9 ± 70.7 | -17.2 to 66.0 | 0.46 | ± 158.4 [224.1] |
T_{1} (ms) | C3 | 1007.2 ± 74.3 | -38.5 to 23.1 | 0.72 | ± 113.2 [152.3] |
T_{1} (ms) | C4 | 1060.0 ± 69.5 | -63.6 to 4.8 | 0.53 | ± 135.5 [195.0] |
T_{1} (ms) | C5 | 1083.6 ± 95.2 | -68.0 to 33.4 | 0.43 | ± 189.1 [198.7] |
MTR (%) | C2 | 46.83 ± 1.52 | -0.99 to 0.54 | 0.43 | ± 2.85 [186.7] |
MTR (%) | C3 | 45.78 ± 1.38 | -1.54 to 0.42 | -0.3 | ± 3.76 [271.6] |
MTR (%) | C4 | 44.87 ± 1.55 | -1.14 to 0.75 | 0.16 | ± 3.53 [228.1] |
MTR (%) | C5 | 44.02 ± 1.9 | -2.0 to 0.66 | 0.05 | ± 5.06 [265.9] |
MT_{sat} (%) | C2 | 3.579 ± 0.194 | -0.12 to 0.113 | 0.5 | ± 0.429 [220.6] |
MT_{sat} (%) | C3 | 3.492 ± 0.184 | -0.058 to 0.189 | 0.51 | ± 0.466 [253.6] |
MT_{sat} (%) | C4 | 3.49 ± 0.21 | -0.003 to 0.247 | 0.27 | ± 0.501 [238.1] |
MT_{sat} (%) | C5 | 3.562 ± 0.266 | -0.132 to 0.162 | 0.33 | ± 0.544 [204.9] |
MTV (%) | C2 | 37.36 ± 2.38 | -1.58 to 0.81 | 0.48 | ± 4.46 [187.4] |
MTV (%) | C3 | 36.84 ± 2.55 | -0.94 to 1.53 | 0.52 | ± 4.57 [178.8] |
MTV (%) | C4 | 36.25 ± 2.44 | -0.64 to 1.57 | 0.6 | ± 4.15 [169.8] |
MTV (%) | C5 | 35.92 ± 2.5 | -0.98 to 1.73 | 0.33 | ± 5.05 [202.2] |

From left to right, the columns correspond to the mean ± SD across subjects (n = 16) based on values from the first scan session only, the 95% confidence interval for the true test-retest difference, the ICC coefficient, the MDC. All numbers are in the metric unit except those in square brackets, which are expressed as a percentage of the SD across subjects to quantify the repeatability relative to the inter-subject difference, i.e. the reliability.

Metric | WM region | Mean ± SD_{subjects} | CI_{d} | ICC | MDC [% of SD_{subjects}] |
---|---|---|---|---|---|
T_{1} (ms) | WM | 1011.2 ± 60.8 | -29.4 to 19.3 | 0.74 | ± 89.3 [146.8] |
T_{1} (ms) | DC | 1068.3 ± 63.7 | -69.0 to 5.4 | 0.41 | ± 146.8 [230.5] |
T_{1} (ms) | LF | 971.6 ± 64.5 | -22.9 to 39.6 | 0.52 | ± 115.6 [179.2] |
T_{1} (ms) | VF | 1006.8 ± 168.0 | -60.0 to 71.5 | 0.68 | ± 240.8 [143.4] |
MTR (%) | WM | 45.82 ± 1.3 | -0.99 to 0.37 | 0.28 | ± 2.55 [196.2] |
MTR (%) | DC | 46.08 ± 0.87 | -1.02 to 0.07 | 0.29 | ± 2.15 [247.3] |
MTR (%) | LF | 46.09 ± 1.54 | -0.94 to 0.53 | 0.38 | ± 2.73 [178.1] |
MTR (%) | VF | 44.64 ± 2.48 | -1.65 to 0.93 | 0.35 | ± 4.8 [193.1] |
MT_{sat} (%) | WM | 3.517 ± 0.177 | -0.022 to 0.152 | 0.6 | ± 0.34 [192.4] |
MT_{sat} (%) | DC | 3.452 ± 0.181 | -0.029 to 0.25 | 0.1 | ± 0.543 [299.5] |
MT_{sat} (%) | LF | 3.59 ± 0.202 | -0.009 to 0.118 | 0.82 | ± 0.252 [124.9] |
MT_{sat} (%) | VF | 3.438 ± 0.283 | -0.124 to 0.172 | 0.54 | ± 0.546 [192.6] |
MTV (%) | WM | 36.79 ± 2.3 | -0.96 to 1.13 | 0.6 | ± 3.83 [166.4] |
MTV (%) | DC | 36.46 ± 2.24 | -0.65 to 1.34 | 0.6 | ± 3.7 [164.7] |
MTV (%) | LF | 36.88 ± 2.41 | -1.21 to 0.89 | 0.65 | ± 3.87 [160.6] |
MTV (%) | VF | 37.17 ± 2.95 | -1.2 to 1.81 | 0.42 | ± 5.58 [189.1] |

From left to right, the columns correspond to the mean ± SD across subjects (n = 16) based on values from the first scan session only, the 95% confidence interval of the true test-retest difference, the ICC coefficient, the MDC. All numbers are in the metric unit except those in square brackets, which are expressed as a percentage of the SD across subjects to quantify the repeatability with respect to the inter-subject difference, i.e. the reliability.

The ICC and the MDC (expressed in percentage of the SD across subjects) are useful to compare repeatability across metrics (more extensively done in _{1} to MTR at C3, the ICC is much higher for T_{1} (0.72) than MTR (-0.3)–note here that the interpretation of a negative value for the ICC is the same as for a null value (very poor reliability). This is because T_{1} has a lower test-retest variation (_{subjects} = 74.3ms in _{subjects} = 1.38% in _{1} at C3, MDC = 113.2ms, which is 152.3% of _{subjects} (_{subjects}. This result shows that measurement errors in MTR cover almost 3 times the standard variations between subjects, making it difficult to observe true differences in MTR.

This section deals with the larger sample (n = 33 subjects).

_{d}) in order to allow the reader to assess whether differences between vertebral levels or WM regions can be distinguished from measurement errors or not. Individual subjects data are also plotted to see if differences between subjects can be carried out despite the measurement error. However, for individual comparison, measurement errors are assessed by the MDC, which is much larger than the _{d} (as negative and positive errors do not compensate for each other). Only T_{1} and MTV seem to allow the comparison between some healthy subjects.

The red envelope represents the 95% confidence interval for the test-retest difference (CI_{d}), which assesses the measurement error magnitude of the group mean (in black). The orange envelope represents the MDC (Minimum Detectable Change), difference required to compare individual subjects (faded gray lines). Note that the group mean approaching the edges of the CI_{d} (red envelope) reflects an asymmetric confidence interval due to a non-null offset between test and retest (non-null mean test-retest difference,

The differences that are distinguishable from measurement errors are summarized in _{1} but none of them are larger than measurement errors.

Metric | (A) Vertebral levels: one-way repeated ANOVA | (A) Vertebral levels: differences larger than measurement errors | (B) WM regions: one-way repeated ANOVA | (B) WM regions: differences larger than measurement errors |
---|---|---|---|---|
T_{1} | p = 0.041; C2 vs. C5 | None. | p < 0.01; DC vs. LF | None. |
MTR | p << 10^{−4}; all levels are significantly different from each other | C2 vs. C5 | p << 10^{−4}; DC vs. VF | None. |
MT_{sat} | p = 0.189 | None. | p = 0.076 | None. |
MTV | p = 0.081 | None. | p = 0.085 | None. |

For each analysis (A, B), the left column is the results of the one-way repeated ANOVAs whereas the right column reports the vertebral levels/WM regions showing differences larger than measurement errors (see also

_{d}. With all metrics within every spinal cord region (vertebral level or WM region), the difference between young and elderly can always be explained by measurement errors only. Moreover, the repeated ANOVAs did not report a significant effect of age for any metric, either level-wise or ROI-wise. However, we can still notice some general trends: T_{1}, MTR and MTV generally support the demyelination with aging histologically reported in the literature, whereas MT_{sat} consistently shows the reverse trend.

For each case, the corresponding 95% confidence interval for the mean test-retest difference (CI_{d}), estimated from the test-retest analysis (see section 3.1. Repeatability), was centered at the mean of each group, in order to assess whether the difference between young and elderly is larger than the test-retest errors or not. With all metrics within every spinal cord region (vertebral level or WM region), the difference in means between young and elderly was indistinguishable from measurement errors.

To complete this study, _{d}, 2^{nd} column) to the minimum difference in the true metric values required to detect a significant difference (1^{st} column) between young and elderly (with a fair test power). We can notice for example that, given the measurement errors of MTR (1.36%), even if the difference in means were large enough (≥1.27%) to yield a significant result, the imprecision of measurement is too large to detect such a difference. It is not the case with the other metrics. Moreover, we can notice that the observed differences in means (3^{rd} column) are very low compared to the difference needed to obtain significant results (1^{st} column), yielding very low statistical power for those tests (4^{th} column). Finally, given the large sample size required to obtain a significant difference (5^{th} column), T_{1} and MTV do not seem sensitive to age groups (based on their mean WM values in this study).

Metric | Minimum difference required to detect a significant difference with such a sample and 80% probability (effect size) | Length of CI_{d} | Observed difference in means | Power (probability to detect a significant difference with such a sample) | Sample size required to detect a significant difference with such means and 80% probability |
---|---|---|---|---|---|
T_{1} (ms) | 93.8 | 48.7 | -3.6 | 5.1% | 10394 |
MTR (%) | 1.27 | 1.36 | 0.70 | 33.5% | 52 |
MT_{sat} (%) | 0.203 | 0.174 | -0.092 | 24.5% | 75 |
MTV (%) | 2.67 | 2.09 | -0.01 | 5.0% | 1690133 |
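To make the logic of this power analysis concrete, the Python sketch below estimates the power of a two-tailed comparison of two group means and the sample size needed to reach 80% power. It swaps in a normal approximation in place of an exact t-based calculation, so its outputs will only approximate values such as those tabulated above. The function names are hypothetical, and the pooled SD has to be supplied by the user.

```python
import math
from statistics import NormalDist

def approx_power(diff, sd, n1, n2, alpha=0.05):
    """Approximate power of a two-tailed two-sample test (normal approximation).

    diff: true difference in means; sd: common within-group SD;
    n1, n2: group sizes (e.g. 19 young and 14 elderly subjects).
    """
    z = NormalDist()
    ncp = abs(diff) / (sd * math.sqrt(1.0 / n1 + 1.0 / n2))
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)

def approx_n_per_group(diff, sd, power=0.80, alpha=0.05):
    """Approximate per-group sample size for equal groups (diff must be non-zero)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) * sd / abs(diff)) ** 2)
```

When the true difference is zero, the power reduces to the significance level (5%), which explains why power values near 5% indicate an observed difference that is negligible relative to the between-subject spread.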

This study proposes a statistical framework for comparing clinically feasible myelin imaging techniques (T_{1}, MTR, MT_{sat} and MTV) in the cervical spinal cord.

The resulting mean values across subjects are in agreement with previous studies. Stikov et al. [_{1} around 1000ms in the brain, which is comparable to the T_{1} in the spinal cord WM _{sat} we observed are also in agreement with literature [

Even for the most reliable metrics (T_{1} and MTV, see _{d} in this case, as shown in

In comparison with the brain, repeatability in the spinal cord is hampered by multiple sources of artifacts (motion, susceptibility) and low SNR [

Taso et al. [_{d} of [− 3%, +5%] for MTR over all WM from C2 to C5 within 10 young healthy subjects. Even if the repeatability of the metrics reported in our study is not good enough to differentiate between WM regions or age groups, it is still much better (_{d} of [− 0.99%, +0.54%] for MTR). This may suggest that significant differences not accounting for precision of measurements might have been reported in the literature, whereas they could be only explained by measurement errors.

Looking at the metrics individually, T_{1}-based metrics (MTV and T_{1}) generally show the best reliability (_{1} seems particularly affected by cord movements and compressions occurring during respiratory and cardiac cycles (_{1}^{+} inhomogeneity. MTR variations due to B_{1} errors have already been reported in the brain [_{sat} minimizes the T_{1} contribution included in MTR, and is thereby less variable across vertebral levels.

The assessment of the sensitivity of metrics to myelin content remains difficult, due to the lack of a ground truth. A loss of myelinated fibers with aging (mainly the small caliber ones) was observed histologically in the brain [

As noted in the introduction, some of the myelin-sensitive techniques are also hampered by confounding factors. For example, T_{2}* is affected by iron content, fiber orientation, blood vessels and blood oxygen level. MTR is affected by T_{1} and B_{1} field, and more generally, magnetization transfer and MTV are sensitive to macromolecules (i.e., not only myelin). For each of these techniques, there are ways to mitigate those confounds. For example, quantitative susceptibility maps could inform T_{2}* maps, or T_{1} and B_{1}^{+} fields could be acquired to correct MTR maps [

While DTI has some intrinsic limitations, other techniques also based on diffusion-weighted imaging might offer more sensitivity to myelin. It is important to note, however, that because water protons trapped between myelin sheaths have a short T_{2} (around 10 ms at 3T, which could be quantified using myelin water fraction techniques) and that protons from bound molecules have an even shorter T_{2} (order of μs, which could be quantified with ultra-short TE imaging or magnetization transfer techniques), diffusion-weighted protocols typically use a TE (> 60ms) too long to be sensitive to signal coming from the myelin (and from water trapped in it). Some advanced diffusion-weighted techniques include NODDI [

To improve specificity to myelin, combining several metrics, using for example independent component analysis, or acquiring maps of confounding factors for a posteriori corrections, might be advisable [

Repeatability assessment is crucial for the development of qMRI biomarkers. Our results show that significant differences between groups can be reported with standard statistical tests, yet these differences are comparable to (or even smaller than) test-retest measurement errors. Controlling for both aspects (statistical significance and measurement errors) is necessary for qMRI studies.

The indexes reported in this work (95% confidence interval for the test-retest difference (CI_{d}), ICC and MDC) are useful for quantifying repeatability and allowing comparisons across studies. As mentioned before, the coefficient of variation depends on the magnitude of the metric, and should not be the primary index for assessing repeatability, especially if metrics have different means or units. The CI_{d} first makes it possible to check for a potential systematic bias between measurements (i.e. scan sessions). In addition, it gives an estimate of the measurement error for group averages. In the same vein, the MDC provides a measure of the minimum difference between two individual measurements needed to report a true difference, taking the measurement errors into account. For example, the CI_{d} would be useful for researchers comparing different populations, whereas the MDC would be useful for a clinician needing to assess the evolution of a WM lesion within a single patient. Furthermore, the ICC has the advantage of being dimensionless, and can thus be easily compared to assess reliability across metrics, studies, vendors or sites. Aside from providing a robust quantification of the repeatability with two measurements (test-retest studies), the ICC (and consequently, the MDC) can also be consistently used with more than two measurements. Those reliability indexes have already been extensively used in test-retest studies from other research fields, such as rehabilitation, where the precision of tests is crucial [

Finally, the assessment of the repeatability needs to be adapted to the study goals. Indeed, the ICC depends on the sample homogeneity. Therefore, if the goal is to differentiate between the microstructure of healthy subjects, including patients in the sample will artificially increase the between-subjects variability and overestimate the ICC. In this study, we can confidently assert that the ICC is lower (and the MDC is higher) than it would have been for a sample that includes patients and controls. Therefore, if the goal is to distinguish between pathological cases, we recommend including the different types of tissue (healthy and pathological tissues, with different stages of the disease) in the cohort. This way, the MDC and ICC would integrate the associated between-subjects variability.

Due to IRB restrictions, all data used here could not be publicly shared. However, we obtained specific consent for sharing MRI data from four young volunteers. Three of them were part of the tested and retested group. Along with those datasets, we provide the batch scripts used to produce the myelin-sensitive metric maps and to register them to spinal cord template and white matter atlas. Also available is a Microsoft Excel spreadsheet gathering all results of the metric estimations within each region of interest for every scan session and every volunteer of the cohort. The 1^{st} tab of the sheet corresponds to the tested and retested cohort only (n = 16), and the 2^{nd} tab corresponds to the whole cohort (n = 33). Finally, also shared are the scripts to extract these metrics values, to compute the statistical indices for reliability assessment and to produce the figures presented in this work. All these data and code are available at:

In this study, we assessed the repeatability and distribution of myelin-sensitive metrics (T_{1}, MTR, MT_{sat} and MTV) in the spinal cord. T_{1} and MTV (1 – _{1}, MTR and MTV showed trends consistent with the hypothesis of demyelination with aging, but again the differences were not large enough to be distinguishable from measurement errors, or to be significant.

This study used a range of statistical tools to explore the differences between myelin-sensitive metrics. We show that even though statistically significant differences can be reported using standard statistical tests, an important proportion of these differences can be attributed to measurement error. In particular, the coefficient of variation is a misleading index when comparing metrics with different units, and we recommend using the MDC when comparing individual measurements, and the 95% confidence interval of the test-retest difference when comparing groups. The indexes explored in this study allow for a fair comparison of qMRI metrics across studies, MRI vendors and sites, leading toward standardizing the field of myelin imaging and increasing its clinical relevance.

This section describes the data processing steps performed to estimate MTR, MT_{sat}, T_{1} and MTV maps and to register those maps to the MNI-Poly-AMU template [

(PDF)

The authors would like to sincerely thank Robert Brown for the helpful discussions.